Octo Base
Berkeley/Stanford 93M transformer diffusion policy. Pretrained on 800k Open-X-Embodiment episodes.
Octo Base is a VLA / robotics AI model from UC Berkeley, priced at €0.000 per 1M input tokens with an unknown context window.
Pricing
API Integration
Use our OpenAI-compatible API to integrate Octo Base into your application.
npm install railwail
import railwail from "railwail";
const rw = railwail("YOUR_API_KEY");
// Simple: just pass a string
const reply = await rw.run("octo-base", "Hello! What can you do?");
console.log(reply);
// With message history
const reply2 = await rw.run("octo-base", [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "Explain quantum computing simply." },
]);
console.log(reply2);
// Full response with usage info
const res = await rw.chat("octo-base", [
{ role: "user", content: "Hello!" },
], { temperature: 0.7, max_tokens: 500 });
console.log(res.choices[0].message.content);
console.log(res.usage);
Deep dive: UC Berkeley / Stanford (Octo Model Team)'s Octo Base
The Octo project is a collaboration of academic labs led by Sergey Levine (UC Berkeley BAIR) and Chelsea Finn (Stanford IRIS), with contributions from CMU, Google DeepMind, and Toyota Research Institute. The Octo paper appeared in May 2024 and builds on the Open-X-Embodiment dataset effort, with the goal of producing a generalist, fully open-source robot policy that any researcher can fine-tune on a new robot in hours. Octo introduced the recipe of a transformer policy with a diffusion action head trained on 800k cross-embodiment demonstrations, and it has become a de facto baseline in academic VLA / generalist-policy research. The team released both Octo-Small (27M) and Octo-Base (93M) under Apache-2.0, alongside code, checkpoints and a fine-tuning toolkit.
Octo-Base is a transformer-based generalist robot policy. Its inputs are tokenised RGB views and a natural-language instruction (encoded with a T5-base text encoder), interleaved with learnable readout tokens. The transformer trunk consumes this sequence and emits action latents, which a diffusion head decodes into continuous action chunks (by default a 4-step lookahead of 7-DoF end-effector deltas). The model was pretrained on roughly 800k demonstrations from 25 datasets in the Open-X-Embodiment collection, covering 9 robots in both single-arm and bimanual setups. Octo is intentionally embodiment-agnostic: action and proprioception spaces are encoded via shared adapters, so the same backbone can be fine-tuned to a new robot with as few as a few hundred demos. The diffusion head gives smooth, multimodal trajectories that outperform discrete-token VLAs on dexterous tasks at this scale.
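To make the idea of a diffusion head decoding an action chunk more concrete, here is a toy DDPM-style reverse sampler. It is a sketch under stated assumptions, not Octo's implementation: the linear noise schedule, the 20-step count, and the function names (`makeSchedule`, `sampleActionChunk`, `makeOraclePredictor`) are all hypothetical. The "oracle" noise predictor stands in for the learned network, which in a real policy would be conditioned on the transformer's readout embedding.

```javascript
// Toy DDPM-style reverse sampler for a 4 x 7 action chunk.
// Schedule, step count and predictor are illustrative assumptions.

const CHUNK_LEN = 4;  // lookahead steps per chunk (from the text above)
const ACTION_DIM = 7; // 7-DoF end-effector delta (from the text above)

// Standard normal sample via the Box-Muller transform.
function gaussian(rng = Math.random) {
  const u = 1 - rng(); // in (0, 1], avoids log(0)
  const v = rng();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

// Linear beta schedule; returns per-step alpha and cumulative alpha-bar.
function makeSchedule(steps, betaStart = 1e-4, betaEnd = 0.5) {
  const alphas = [];
  const alphaBars = [];
  let prod = 1;
  for (let t = 0; t < steps; t++) {
    const beta = betaStart + ((betaEnd - betaStart) * t) / Math.max(steps - 1, 1);
    alphas.push(1 - beta);
    prod *= 1 - beta;
    alphaBars.push(prod);
  }
  return { alphas, alphaBars };
}

// Reverse process: start from Gaussian noise and denoise step by step.
// predictNoise(x, t, alphaBar) must return an array the same length as x.
function sampleActionChunk(predictNoise, steps = 20, rng = Math.random) {
  const { alphas, alphaBars } = makeSchedule(steps);
  const n = CHUNK_LEN * ACTION_DIM;
  let x = Array.from({ length: n }, () => gaussian(rng));
  for (let t = steps - 1; t >= 0; t--) {
    const eps = predictNoise(x, t, alphaBars[t]);
    const coef = (1 - alphas[t]) / Math.sqrt(1 - alphaBars[t]);
    const sigma = t > 0 ? Math.sqrt(1 - alphas[t]) : 0; // no noise on the final step
    x = x.map((xi, i) => (xi - coef * eps[i]) / Math.sqrt(alphas[t]) + sigma * gaussian(rng));
  }
  // Reshape the flat vector into CHUNK_LEN rows of ACTION_DIM values.
  return Array.from({ length: CHUNK_LEN }, (_, k) =>
    x.slice(k * ACTION_DIM, (k + 1) * ACTION_DIM)
  );
}

// Oracle predictor for demonstration only: it "knows" the clean target chunk
// and returns the noise that would explain the current sample.
function makeOraclePredictor(target) {
  return (x, _t, alphaBar) =>
    x.map((xi, i) => (xi - Math.sqrt(alphaBar) * target[i]) / Math.sqrt(1 - alphaBar));
}
```

Calling `sampleActionChunk(makeOraclePredictor(target))` with a flat 28-element `target` returns four rows of seven values that recover the target, because the oracle's noise estimate is exact. A learned predictor would only approximate it, which is where the multimodality of the sampled trajectories comes from.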
- Parameters: 93M
- Context: unknown
- Generalist VLA policy across many robot embodiments
- Trained on ~800k demos from Open-X-Embodiment
- Diffusion action head produces smooth continuous actions
- Natural-language instruction conditioning (T5 encoder)
- Multi-view image inputs (primary + wrist cameras)
- Designed for fast fine-tuning on new robots and tasks
- Apache-2.0 open weights, code and recipes
- Strong academic baseline for VLA papers
- Best for: research, fine-tuning to new embodiments, generalist-policy benchmarks.
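One common way to consume chunked continuous actions like those listed above is receding-horizon control: query the policy, execute only the first few actions of the chunk, then re-plan. The sketch below assumes a pose layout of `[x, y, z, roll, pitch, yaw, gripper]`, treats the gripper value as an absolute command, and adds rotation deltas directly as Euler angles, which is a simplification. `runRecedingHorizon`, `applyDelta` and `stubPolicy` are hypothetical names, not part of any Octo or railwail API.

```javascript
// Receding-horizon execution of action chunks (illustrative sketch).

// Apply one 7-DoF delta action to a pose.
function applyDelta(pose, action) {
  const next = pose.slice(0, 6).map((p, i) => p + action[i]); // add the six pose deltas
  next.push(action[6]); // gripper treated as an absolute command (assumption)
  return next;
}

// Query the policy, execute the first `execSteps` actions of each chunk,
// then re-plan from the resulting pose until `totalSteps` actions have run.
function runRecedingHorizon(policy, startPose, totalSteps, execSteps = 1) {
  let pose = startPose;
  const trajectory = [pose];
  let done = 0;
  while (done < totalSteps) {
    const chunk = policy(pose);
    const n = Math.min(execSteps, chunk.length, totalSteps - done);
    if (n <= 0) throw new Error("nothing to execute: empty chunk or execSteps < 1");
    for (let k = 0; k < n; k++) {
      pose = applyDelta(pose, chunk[k]);
      trajectory.push(pose);
    }
    done += n;
  }
  return trajectory;
}

// Stub policy standing in for the model: a 4-step chunk that moves
// 1 cm along +x per step with the gripper open.
const stubPolicy = () =>
  Array.from({ length: 4 }, () => [0.01, 0, 0, 0, 0, 0, 1]);

const trajectory = runRecedingHorizon(stubPolicy, [0, 0, 0, 0, 0, 0, 0], 6, 2);
console.log(trajectory.length, trajectory[trajectory.length - 1]);
```

Executing fewer actions per chunk makes the loop more reactive at the cost of more policy calls. Executing the whole chunk is cheaper but ignores new observations until the chunk ends.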
~800,000 robot trajectories drawn from 25 Open-X-Embodiment-compatible datasets across 9 robot embodiments, including Franka, WidowX (the Bridge dataset), the Everyday Robots platform (RT-1), and Berkeley UR5. Trained on TPU v4 / v5 hardware.
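Training on a mixture of datasets of very different sizes usually means drawing each example from a dataset chosen with some per-dataset weight. The page does not state the weights used for Octo, so the sketch below uses made-up dataset names and weights purely to illustrate the sampling mechanism. `makeMixtureSampler` is a hypothetical helper.

```javascript
// Weighted sampling over a mixture of datasets (illustrative sketch).
// Dataset names and weights are invented for the example.

function makeMixtureSampler(weights, rng = Math.random) {
  const names = Object.keys(weights);
  const total = names.reduce((sum, name) => sum + weights[name], 0);
  if (names.length === 0 || !(total > 0)) {
    throw new Error("need at least one dataset with positive total weight");
  }
  // Cumulative distribution over datasets, normalised to end at 1.
  const cdf = [];
  let acc = 0;
  for (const name of names) {
    acc += weights[name] / total;
    cdf.push(acc);
  }
  return function sample() {
    const u = rng(); // uniform in [0, 1)
    const idx = cdf.findIndex((c) => u < c);
    // Guard against rounding leaving the last cdf entry just below 1.
    return names[idx === -1 ? names.length - 1 : idx];
  };
}

const sampleDataset = makeMixtureSampler({ dataset_a: 5, dataset_b: 3, dataset_c: 2 });
console.log(sampleDataset());
```

Passing a custom `rng` makes the sampler reproducible, which helps when comparing mixture weights across training runs.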
License: Apache-2.0 - fully open weights, code, and dataset references. Research-friendly; commercial use permitted under the licence.
Known limitations
- Modest 93M scale - underperforms 7B+ VLAs on hard generalisation
- Optimised for 7-DoF end-effector control - bimanual humanoid action spaces need adapters
- Limited language reasoning relative to LLM-backed VLAs
- Image resolution capped (256x256)
- Trained mostly on Western lab data - geographic bias
- Long-horizon planning requires external prompt decomposition
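Because of the 256x256 resolution cap noted above, camera frames generally have to be cropped and downscaled before they reach the policy. The sketch below only plans the geometry of a centre crop followed by a uniform resize; it is one reasonable preprocessing choice, not the documented Octo pipeline, and `planResize` and `mapPixel` are hypothetical helpers.

```javascript
// Plan a centre crop to a square followed by a resize to 256 x 256.
const TARGET_SIZE = 256; // resolution cap from the list above

function planResize(width, height, target = TARGET_SIZE) {
  if (!(width > 0 && height > 0)) throw new Error("invalid frame size");
  const side = Math.min(width, height); // largest centred square
  return {
    cropX: Math.floor((width - side) / 2),
    cropY: Math.floor((height - side) / 2),
    cropSize: side,
    target,
    scale: target / side,
  };
}

// Map a pixel from the original frame into resized coordinates.
// Returns null when the pixel falls outside the crop.
function mapPixel(plan, x, y) {
  const dx = x - plan.cropX;
  const dy = y - plan.cropY;
  if (dx < 0 || dy < 0 || dx >= plan.cropSize || dy >= plan.cropSize) return null;
  return {
    x: (dx * plan.target) / plan.cropSize,
    y: (dy * plan.target) / plan.cropSize,
  };
}

const plan = planResize(640, 480);
console.log(plan, mapPixel(plan, 320, 240));
```

For a 640x480 frame this crops 80 pixels from each side before scaling, so anything near the left or right edge of the original image is invisible to the policy. That is worth checking when placing cameras.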
Frequently asked questions
Related Models
Gemini Robotics (2025)
Google DeepMind's vision-language-action model based on Gemini 2.0. Generalist robot policy with strong dexterity.
Gemini Robotics-ER
Embodied-reasoning variant of Gemini Robotics. Enhanced 3D spatial reasoning and trajectory planning.
Google RT-2-X
Google's VLA from the RT-X collaboration. Trained on Open-X-Embodiment (22 robots, 527 skills) and shows positive transfer across embodiments.
LeRobot SmolVLA
Hugging Face's 450M VLA pretrained on 487 community LeRobot datasets. Runs on consumer GPUs.
Start using Octo Base today
Get started with free credits. No credit card required. Access Octo Base and 100+ other models through a single API.