How much does Octo Base cost via Railwail?

Octo Base is listed as free per generation, with no monthly minimum and no subscription. Start with €5 in free credits.

What is the context window of Octo Base?

The context window of Octo Base is not specified.

How fast is Octo Base?

Latency depends on prompt length and load, and is typically 200 ms to 2 s for short prompts. We measure p50/p95 latency in real time on /rankings.
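As a rough illustration of what p50 and p95 mean, the sketch below computes nearest-rank percentiles over a list of latency samples. The `percentile` helper and the sample values are hypothetical and are not taken from /rankings; the page does not say which percentile method Railwail uses.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it. (Assumed method, for illustration.)
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based rank
  return sorted[rank - 1];
}

// Hypothetical latencies in milliseconds, for illustration only.
const latenciesMs = [210, 250, 300, 320, 400, 450, 600, 900, 1500, 2000];
const p50 = percentile(latenciesMs, 50);
const p95 = percentile(latenciesMs, 95);
console.log({ p50, p95 });
```

With only ten samples, p95 falls on the slowest request, which is why tail percentiles are usually read over much larger windows.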

Is Octo Base better than Gemini Robotics (2025)?

It depends on your use case. Octo Base (UC Berkeley) and Gemini Robotics (2025) (Google DeepMind) are both strong choices in VLA / robotics. Compare them side by side at /compare/octo-base-vs-gemini-robotics-2025.

Does Octo Base support image input (vision)?

Yes. Octo Base accepts image inputs in addition to text. On the OpenAI-compatible API, images are sent as `image_url` content blocks in the standard `messages` array; note that Octo Base itself is not exposed via the Railwail API yet. Supported input modalities: image, text.
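For reference, here is a minimal sketch of what a user message with `image_url` content blocks looks like in the OpenAI-compatible convention. The `buildVisionMessage` helper, the type names, and the example URLs are hypothetical. The snippet only builds the payload and sends nothing, since Octo Base is not exposed via the Railwail API yet.

```typescript
// Content-part shapes following the OpenAI-compatible chat convention.
type TextPart = { type: "text"; text: string };
type ImagePart = { type: "image_url"; image_url: { url: string } };
type UserMessage = { role: "user"; content: Array<TextPart | ImagePart> };

// Hypothetical helper: one text instruction followed by one block per image.
function buildVisionMessage(instruction: string, imageUrls: string[]): UserMessage {
  const images: ImagePart[] = imageUrls.map((url) => ({
    type: "image_url",
    image_url: { url },
  }));
  return { role: "user", content: [{ type: "text", text: instruction }, ...images] };
}

// Placeholder URLs, for illustration only.
const message = buildVisionMessage("Pick up the red block.", [
  "https://example.com/primary-camera.png",
  "https://example.com/wrist-camera.png",
]);
console.log(JSON.stringify(message, null, 2));
```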

Octo Base

UC Berkeley

VLA / Robotics

A 93M-parameter transformer diffusion policy from Berkeley and Stanford, pretrained on 800k Open-X-Embodiment episodes.

Research-only model

Octo Base runs on physical robot hardware and is not exposed via the Railwail API yet.

TL;DR · Last updated June 24, 2026

Octo Base is a VLA / robotics AI model from UC Berkeley, listed at €0.000 per 1M input tokens. Its context window is not specified.

Try Octo Base

Direct API access coming soon

Pricing

Price per Generation

Per generation: Free

API Integration

Once direct API access is available, you can use our OpenAI-compatible API to integrate Octo Base into your application. The examples below show the general client pattern.

Install

npm install railwail

JavaScript / TypeScript

import railwail from "railwail";

const rw = railwail("YOUR_API_KEY");

// Simple — just pass a string
const reply = await rw.run("octo-base", "Hello! What can you do?");
console.log(reply);

// With message history
const reply2 = await rw.run("octo-base", [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Explain quantum computing simply." },
]);
console.log(reply2);

// Full response with usage info
const res = await rw.chat("octo-base", [
  { role: "user", content: "Hello!" },
], { temperature: 0.7, max_tokens: 500 });
console.log(res.choices[0].message.content);
console.log(res.usage);

Specifications

Developer: UC Berkeley

Deep dive — UC Berkeley / Stanford (Octo Model Team)'s Octo Base

About UC Berkeley / Stanford (Octo Model Team)

Founded 2023 · Berkeley & Stanford, California, USA

The Octo project is a collaboration of academic labs led by Sergey Levine (UC Berkeley BAIR) and Chelsea Finn (Stanford IRIS), with contributions from CMU, Google DeepMind, and Toyota Research Institute. Octo was first released in May 2024 alongside the Open-X-Embodiment dataset effort, with the goal of producing a generalist, fully open-source robot policy that any researcher can fine-tune on a new robot in hours. Octo introduced the recipe of a transformer policy with a diffusion action head trained on 800k cross-embodiment demonstrations, and it has become a de-facto baseline in academic VLA / generalist-policy research. The team released both Octo-Small (27M) and Octo-Base (93M) under Apache-2.0, alongside code, checkpoints and a fine-tuning toolkit.

Architecture

Transformer policy with diffusion action head (Vision-Language-Action)

Octo-Base is a transformer-based generalist robot policy. Inputs are tokenised RGB views and a natural-language instruction (encoded with a T5-base text encoder), interleaved with learnable readout tokens. The transformer trunk consumes this sequence and emits action latents that are decoded by a diffusion head producing continuous action chunks (default 4-step lookahead, 7-DoF end-effector deltas). The model was pretrained on roughly 800k demonstrations from 25 datasets in the Open-X-Embodiment collection, covering 9 robots, both single-arm and bimanual setups. Octo is intentionally embodiment-agnostic: action and proprioception spaces are encoded via shared adapters so the same backbone can be fine-tuned to new robots with as little as a few hundred demos. The diffusion head gives smooth, multimodal trajectories that outperform discrete-token VLAs on dexterous tasks at this scale.
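To make the action-chunk idea concrete, the sketch below applies a 4-step chunk of 7-DoF end-effector deltas to a starting pose. It is a toy illustration, not Octo's code: the `Action` layout (three position deltas, three rotation deltas, one gripper value), the `Pose` type, and the `applyChunk` helper are all assumptions, and adding Euler-angle deltas directly is only a small-angle approximation.

```typescript
// Assumed layout: [dx, dy, dz, droll, dpitch, dyaw, gripper].
type Action = [number, number, number, number, number, number, number];
type Pose = {
  position: [number, number, number];
  rpy: [number, number, number];
  gripper: number; // treated here as an absolute command, not a delta
};

// Hypothetical helper: integrate each delta in turn and record the pose
// reached after every step of the chunk.
function applyChunk(start: Pose, chunk: Action[]): Pose[] {
  const poses: Pose[] = [];
  let current = start;
  for (const [dx, dy, dz, dRoll, dPitch, dYaw, grip] of chunk) {
    current = {
      position: [current.position[0] + dx, current.position[1] + dy, current.position[2] + dz],
      rpy: [current.rpy[0] + dRoll, current.rpy[1] + dPitch, current.rpy[2] + dYaw],
      gripper: grip,
    };
    poses.push(current);
  }
  return poses;
}

// A made-up 4-step chunk: move along x in equal steps, then close the gripper.
const startPose: Pose = { position: [0, 0, 0], rpy: [0, 0, 0], gripper: 1 };
const chunk: Action[] = [
  [0.25, 0, 0, 0, 0, 0, 1],
  [0.25, 0, 0, 0, 0, 0, 1],
  [0.25, 0, 0, 0, 0, 0, 1],
  [0.25, 0, 0, 0, 0, 0, 0],
];
const trajectory = applyChunk(startPose, chunk);
console.log(trajectory[trajectory.length - 1]);
```

In practice a controller may execute only part of a chunk before requesting a fresh prediction; how many steps to execute is a design choice this sketch does not model.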

Parameters: 93M
Context: unknown

What it can do

Generalist VLA policy across many robot embodiments
Trained on ~800k demos from Open-X-Embodiment
Diffusion action head produces smooth continuous actions
Natural-language instruction conditioning (T5 encoder)
Multi-view image inputs (primary + wrist cameras)
Designed for fast fine-tuning on new robots and tasks
Apache-2.0 open weights, code and recipes
Strong academic baseline for VLA papers
Best for: research, fine-tuning to new embodiments, generalist-policy benchmarks.

Training & License

~800,000 robot trajectories drawn from 25 Open-X-Embodiment-compatible datasets across 9 robot embodiments (Franka, WidowX, Bridge, RT-1 Everyday Robots, Berkeley UR5, etc.). Trained on TPU v4 / v5 hardware.

License: Apache-2.0 - fully open weights, code, and dataset references. Research-friendly; commercial use permitted under the licence.

Known limitations

Modest 93M scale - underperforms 7B+ VLAs on hard generalisation
Optimised for 7-DoF end-effector control - bimanual humanoid action spaces need adapters
Limited language reasoning relative to LLM-backed VLAs
Image resolution capped (256x256)
Trained mostly on Western lab data - geographic bias
Long-horizon planning requires external prompt decomposition
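The last limitation can be pictured as an outer loop that feeds one short instruction at a time to the policy. The sketch below is a minimal illustration under stated assumptions: `runPlan`, `PolicyStep`, and the stub policy are hypothetical stand-ins, and a real setup would call the robot policy and a task-success check instead.

```typescript
type Observation = { step: number };
// Hypothetical interface: one control step that reports whether the subtask is done.
type PolicyStep = (instruction: string, obs: Observation) => { done: boolean };

// Run subtasks in order, giving each a fixed step budget, and stop the
// plan as soon as one subtask fails to finish within its budget.
function runPlan(subtasks: string[], policy: PolicyStep, maxStepsPerSubtask: number): string[] {
  const log: string[] = [];
  let step = 0;
  for (const subtask of subtasks) {
    let finished = false;
    for (let i = 0; i < maxStepsPerSubtask && !finished; i++) {
      finished = policy(subtask, { step }).done;
      step++;
    }
    log.push(`${subtask}: ${finished ? "done" : "timed out"}`);
    if (!finished) break;
  }
  return log;
}

// Stub policy for illustration: reports success on every third call.
let calls = 0;
const stubPolicy: PolicyStep = () => ({ done: ++calls % 3 === 0 });

const planLog = runPlan(
  ["open the drawer", "pick up the spoon", "place the spoon in the drawer"],
  stubPolicy,
  5,
);
console.log(planLog);
```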