OpenVLA-7B
Stanford/Berkeley open VLA trained on 970k Open-X-Embodiment episodes. Supports LoRA fine-tuning.
OpenVLA-7B is a VLA / robotics AI model from OpenVLA, priced at €0.000 per 1M input tokens, with an unknown context window.
Pricing
API Integration
Use our OpenAI-compatible API to integrate OpenVLA-7B into your application.
npm install railwail
import railwail from "railwail";
const rw = railwail("YOUR_API_KEY");
// Simple — just pass a string
const reply = await rw.run("openvla-7b", "Hello! What can you do?");
console.log(reply);
// With message history
const reply2 = await rw.run("openvla-7b", [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "Explain quantum computing simply." },
]);
console.log(reply2);
// Full response with usage info
const res = await rw.chat("openvla-7b", [
{ role: "user", content: "Hello!" },
], { temperature: 0.7, max_tokens: 500 });
console.log(res.choices[0].message.content);
console.log(res.usage);
Deep dive: Stanford / UC Berkeley / Toyota Research Institute's OpenVLA-7B
OpenVLA is the result of an academic-industry consortium led by Moo Jin Kim and colleagues at Stanford, UC Berkeley, and Toyota Research Institute (with contributors from MIT, Google DeepMind and Physical Intelligence). Released in June 2024, it was the first fully open-weights 7-billion-parameter Vision-Language-Action model trained on the Open-X-Embodiment dataset. OpenVLA was designed as a direct, reproducible, and parameter-efficient alternative to Google's closed RT-2 / RT-2-X, with the explicit goal of letting any lab fine-tune a 7B-class VLA on a single A100 / H100. The model, code, training recipe and fine-tuning toolkits (including LoRA) are all released under MIT-style permissive licences. OpenVLA quickly became a standard baseline in academic VLA research and the starting point for many downstream policies (CogACT, π-0-FAST baselines, embodied agent demos).
OpenVLA combines a Llama-2-7B language backbone with a dual visual encoder that concatenates DINOv2 and SigLIP features, fused into the LLM via a Prismatic VLM-style projector. The model treats actions as discretised tokens: each continuous robot action dimension is binned into 256 bins, and the resulting action tokens overwrite the 256 least-used tokens in the Llama tokenizer's vocabulary rather than extending it. Training uses a single autoregressive next-token objective, conditioned on image observations and a natural-language instruction, with the cross-entropy loss evaluated on the action tokens. OpenVLA was trained on ~970k demonstration episodes from the Open-X-Embodiment dataset spanning 22+ robot embodiments. It starts from a pretrained Llama-2-7B backbone, which the authors credit with better language grounding than VLAs trained from scratch. Parameter-efficient fine-tuning with LoRA is officially supported, making OpenVLA the de-facto open VLA workhorse.
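The 256-bin action discretisation described above can be sketched in a few lines. This is a minimal illustration, not the OpenVLA implementation: the function names, the `[-1, 1]` bounds, and the example 7-dimensional action are assumptions chosen for the demo, and a real pipeline would take its per-dimension bounds from training-data statistics.

```javascript
// Sketch only: uniform 256-bin discretisation of one action dimension.
// `low` / `high` are hypothetical per-dimension bounds.
const NUM_BINS = 256;

function discretize(value, low, high) {
  // Clamp so out-of-range values land in the first or last bin.
  const clamped = Math.min(Math.max(value, low), high);
  const width = (high - low) / NUM_BINS;
  // The upper bound itself would index bin 256, so cap at 255.
  return Math.min(Math.floor((clamped - low) / width), NUM_BINS - 1);
}

function undiscretize(bin, low, high) {
  const width = (high - low) / NUM_BINS;
  // Decode to the bin centre; worst-case error is half a bin width.
  return low + (bin + 0.5) * width;
}

// Hypothetical 7-dimensional action (values are made up for illustration).
const action = [0.02, -0.5, 1.0, 0.0, 0.25, -1.0, 1.0];
const bins = action.map((a) => discretize(a, -1, 1));
const recovered = bins.map((b) => undiscretize(b, -1, 1));

console.log(bins);
console.log(recovered);
```

Decoding to bin centres bounds the round-trip error at half a bin width per dimension, which is the quantisation step that the smoothness limitation of discrete action tokens refers to.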
- Parameters: 7B
- Context: unknown
- Fully open-weights 7B Vision-Language-Action model
- Llama-2-7B backbone with DINOv2 + SigLIP vision
- Discrete action-token decoding (256 bins per DoF)
- Trained on ~970k Open-X-Embodiment episodes
- LoRA fine-tuning officially supported
- Strong language-grounded manipulation across robots
- Fits on a single A100 / H100 with quantisation
- MIT-style permissive licence on weights and code
- Best for: research, reproducible VLA baselines, fine-tuning on new robots.
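To put the single-GPU claim in context, here is a back-of-the-envelope estimate of weight memory at different precisions. It is a rough sketch under stated assumptions: it uses the nominal 7B parameter count from the list above, counts weights only (no activations, KV cache, optimizer state, or framework overhead), and the `weightGiB` helper is a name made up for this example.

```javascript
// Rough estimate of weight memory only; real usage will be higher.
const PARAMS = 7e9; // nominal "7B" parameter count
const BYTES_PER_PARAM = { fp32: 4, bf16: 2, int8: 1, int4: 0.5 };

function weightGiB(params, bytesPerParam) {
  return (params * bytesPerParam) / 2 ** 30;
}

for (const [dtype, bytes] of Object.entries(BYTES_PER_PARAM)) {
  console.log(`${dtype}: ${weightGiB(PARAMS, bytes).toFixed(1)} GiB`);
}
```

Under these assumptions the weights alone take roughly 13 GiB in bf16 and a quarter of that at 4-bit, which is why quantisation and LoRA leave headroom on a 40 GB or 80 GB card.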
~970,000 robot demonstration episodes from the Open-X-Embodiment dataset (RT-X collection), spanning 22+ robot embodiments and a wide range of manipulation tasks. Llama-2-7B and the DINOv2 + SigLIP vision encoders provide web-scale pretraining priors.
License: MIT-style permissive licence on code and weights; Llama-2 components subject to Meta's Llama-2 Community Licence. Considered research-friendly open-weights.
Known limitations
- Discrete action tokens can limit smoothness vs diffusion policies
- Inference latency on 7B is non-trivial for high-frequency control
- Coverage skewed to Open-X tasks; novel embodiments need fine-tuning
- Single-image / few-camera setup by default
- English-only language conditioning
- Llama-2 licence restrictions still apply to derived weights
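The latency limitation can be made concrete with a simple budget. Because each action dimension is decoded as one token, the achievable control rate is bounded by the image-plus-prompt prefill time and the per-token decode time. The sketch below uses entirely hypothetical timings, and `maxControlHz` is an illustrative helper, not part of any OpenVLA or Railwail API; measure on your own hardware before drawing conclusions.

```javascript
// Hypothetical timings throughout; these are not measured OpenVLA figures.
function maxControlHz(prefillMs, perTokenMs, actionTokens) {
  // One forward pass over image + instruction, then one decode step per action token.
  const latencyMs = prefillMs + perTokenMs * actionTokens;
  return 1000 / latencyMs;
}

// Assumed example: 60 ms prefill, 15 ms per token, 7 action tokens per step.
const hz = maxControlHz(60, 15, 7);
console.log(`Upper bound on control rate: ${hz.toFixed(2)} Hz`);
```

If the resulting bound falls below the rate a controller needs, the usual options are quantisation, a smaller policy, or executing several low-level steps per predicted action.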
Related Models
Gemini Robotics (2025)
Google DeepMind's vision-language-action model based on Gemini 2.0. Generalist robot policy with strong dexterity.
Gemini Robotics-ER
Embodied-reasoning variant of Gemini Robotics. Enhanced 3D spatial reasoning and trajectory planning.
Google RT-2-X
Google's VLA from the RT-X collaboration. Trained on Open-X-Embodiment (22 robots, 527 skills), with reported positive transfer across robots.
LeRobot SmolVLA
Hugging Face's 450M VLA pretrained on 487 community LeRobot datasets. Runs on consumer GPUs.
Start using OpenVLA-7B today
Get started with free credits. No credit card required. Access OpenVLA-7B and 100+ other models through a single API.