OpenVLA-7B

OpenVLA
VLA / Robotics

Stanford/Berkeley open VLA trained on 970k Open-X-Embodiment episodes. Supports LoRA fine-tuning.

Research-only model
OpenVLA-7B runs on physical robot hardware and is not exposed via the Railwail API yet.
TL;DR · Last updated May 16, 2026

OpenVLA-7B is a VLA / robotics AI model from OpenVLA, priced at €0.000 per 1M input tokens, with an unknown context window.

Try OpenVLA-7B

Direct API access coming soon

Pricing

Price per Generation
Per generation: Free

API Integration

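OpenVLA-7B is not yet exposed via the Railwail API. Once direct access is available, our OpenAI-compatible API can be used to integrate OpenVLA-7B into your application, along the lines of the client usage below.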

Install
npm install railwail
JavaScript / TypeScript
import railwail from "railwail";

const rw = railwail("YOUR_API_KEY");

// Simple — just pass a string
const reply = await rw.run("openvla-7b", "Hello! What can you do?");
console.log(reply);

// With message history
const reply2 = await rw.run("openvla-7b", [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Explain quantum computing simply." },
]);
console.log(reply2);

// Full response with usage info
const res = await rw.chat("openvla-7b", [
  { role: "user", content: "Hello!" },
], { temperature: 0.7, max_tokens: 500 });
console.log(res.choices[0].message.content);
console.log(res.usage);
Specifications
Developer: OpenVLA
Category: VLA / Robotics
Supported Formats: image, text
Tags: stanford, berkeley, vla, robotics, research-only, open-weights

Deep dive — Stanford / UC Berkeley / Toyota Research Institute's OpenVLA-7B

About Stanford / UC Berkeley / Toyota Research Institute
Founded 2024 · Stanford & Berkeley, California, USA

OpenVLA is the result of an academic-industry consortium led by Moo Jin Kim and colleagues at Stanford, UC Berkeley, and Toyota Research Institute (with contributors from MIT, Google DeepMind and Physical Intelligence). Released in June 2024, it was the first fully open-weights 7-billion-parameter Vision-Language-Action model trained on the Open-X-Embodiment dataset. OpenVLA was designed as a direct, reproducible, and parameter-efficient alternative to Google's closed RT-2 / RT-2-X, with the explicit goal of letting any lab fine-tune a 7B-class VLA on a single A100 / H100. The model, code, training recipe and fine-tuning toolkits (including LoRA) are all released under MIT-style permissive licences. OpenVLA quickly became a standard baseline in academic VLA research and the starting point for many downstream policies (CogACT, π-0-FAST baselines, embodied agent demos).

Architecture
Vision-Language-Action (autoregressive discrete-token VLA)

OpenVLA combines a Llama-2-7B language backbone with a dual visual encoder that concatenates DINOv2 and SigLIP features, which a Prismatic-VLM-style projector maps into the LLM's input space. The model treats actions as discrete tokens: each continuous robot action dimension is binned into 256 bins, and the bins are mapped onto 256 rarely used tokens in the Llama tokenizer's existing vocabulary, which are overwritten rather than appended. Training uses a standard autoregressive next-token objective, with the loss computed on the action tokens, conditioned on image observations and a natural-language instruction. OpenVLA was trained on ~970k demonstration episodes from the Open-X-Embodiment dataset spanning 22+ robot embodiments. It starts from a pretrained Llama-2-7B backbone, which the authors found markedly improves language grounding compared to scratch-trained VLAs. Parameter-efficient fine-tuning with LoRA is officially supported, which has helped make OpenVLA a widely used open VLA baseline.
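The 256-bin action discretisation described above can be sketched as follows. This is a minimal illustration, not the OpenVLA implementation: the function names, the per-dimension bounds, and the choice to place action tokens on the last 256 ids of the vocabulary are all assumptions made for the example.

```javascript
// Illustrative sketch of 256-bin action tokenisation (not the OpenVLA code).
const NUM_BINS = 256;

// Map a continuous value to a bin index in [0, NUM_BINS - 1].
// Values outside [low, high] are clipped to the nearest edge bin.
function actionToBin(value, low, high) {
  const clipped = Math.min(Math.max(value, low), high);
  const scaled = (clipped - low) / (high - low); // in [0, 1]
  return Math.min(Math.floor(scaled * NUM_BINS), NUM_BINS - 1);
}

// De-tokenise: return the centre of the bin as the continuous value.
function binToAction(bin, low, high) {
  const width = (high - low) / NUM_BINS;
  return low + (bin + 0.5) * width;
}

// Assumption: action bins occupy the last NUM_BINS ids of the vocabulary.
function binToTokenId(bin, vocabSize) {
  return vocabSize - NUM_BINS + bin;
}

// Tokenise a whole action vector; bounds[i] = { low, high } for dimension i.
function tokenizeAction(action, bounds, vocabSize) {
  return action.map((v, i) =>
    binToTokenId(actionToBin(v, bounds[i].low, bounds[i].high), vocabSize)
  );
}
```

For example, with hypothetical bounds of [-1, 1] on every dimension and a vocabulary of 32,000 ids, tokenizeAction([0, 1], bounds, 32000) yields one token id per action dimension. The bin width sets the quantisation error: at most half a bin, or (high - low) / 512 per dimension.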

Parameters: 7B
Context: unknown
What it can do
  • Fully open-weights 7B Vision-Language-Action model
  • Llama-2-7B backbone with DINOv2 + SigLIP vision
  • Discrete action-token decoding (256 bins per DoF)
  • Trained on ~970k Open-X-Embodiment episodes
  • LoRA fine-tuning officially supported
  • Strong language-grounded manipulation across robots
  • Fits on a single A100 / H100 with quantisation
  • MIT-style permissive licence on weights and code
  • Best for: research, reproducible VLA baselines, fine-tuning on new robots.
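As a rough guide to the single-GPU and LoRA points above, the arithmetic can be sketched like this. It counts weights only (no activations, optimizer state, or KV cache), uses the nominal 7B parameter count from this page, and the helper names and the rank of 32 in the usage note are illustrative assumptions rather than a recommended configuration.

```javascript
// Back-of-the-envelope sizing helpers (illustrative; weights only).
const BYTES_PER_PARAM = { fp32: 4, bf16: 2, int8: 1, int4: 0.5 };

// Approximate weight memory in GB (1 GB = 1e9 bytes) at a given precision.
function weightMemoryGB(numParams, precision) {
  const bytes = BYTES_PER_PARAM[precision];
  if (bytes === undefined) {
    throw new Error("unknown precision: " + precision);
  }
  return (numParams * bytes) / 1e9;
}

// Trainable parameters added by one LoRA adapter of the given rank on a
// dIn x dOut weight matrix: an (dIn x rank) factor plus a (rank x dOut) factor.
function loraParams(dIn, dOut, rank) {
  return rank * (dIn + dOut);
}
```

With these helpers, weightMemoryGB(7e9, "bf16") gives 14 GB and weightMemoryGB(7e9, "int4") gives 3.5 GB, which is why quantisation leaves headroom on a single accelerator. A rank-32 adapter on one 4096 x 4096 projection adds loraParams(4096, 4096, 32) = 262,144 trainable parameters, against roughly 16.8 million in the full matrix.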
Training & License

Training data: ~970,000 robot demonstration episodes from the Open-X-Embodiment dataset (RT-X collection), spanning 22+ robot embodiments and a wide range of manipulation tasks. Llama-2-7B and the DINOv2 + SigLIP vision encoders provide web-scale pretraining priors.

License: MIT-style permissive licence on code and weights; Llama-2 components remain subject to Meta's Llama-2 Community Licence. Generally regarded as research-friendly open weights.

Known limitations
  • Discrete action tokens can limit smoothness vs diffusion policies
  • Inference latency on 7B is non-trivial for high-frequency control
  • Coverage skewed to Open-X tasks - novel embodiments need fine-tuning
  • Single-image / few-camera setup by default
  • English-only language conditioning
  • Llama-2 licence restrictions still apply to derived weights
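The latency limitation above follows from autoregressive decoding: each action dimension is one generated token, so a step costs roughly one decode per dimension plus fixed overhead. A small sketch of that arithmetic, where the function name and every timing figure are hypothetical placeholders rather than measured OpenVLA numbers:

```javascript
// Estimate the achievable control rate for a token-per-dimension policy.
// msPerToken: decode time per generated token (hypothetical input).
// actionDims: number of action dimensions, i.e. tokens decoded per step.
// overheadMs: fixed per-step cost such as image encoding and prompt prefill.
function controlRateHz(msPerToken, actionDims, overheadMs = 0) {
  const stepMs = msPerToken * actionDims + overheadMs;
  return 1000 / stepMs;
}
```

For instance, a 7-dimensional action at an assumed 20 ms per token with 60 ms of overhead takes 200 ms per step, so controlRateHz(20, 7, 60) returns 5 Hz. Controllers that need tens of hertz would therefore require faster decoding, fewer tokens per step, or a lower-level controller that interpolates between predicted actions.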
