Robotics / VLA
Vision-Language-Action models for robotics and embodied AI
Vision-language-action (VLA) models bridge perception, language, and motor control. A VLA takes camera frames plus a natural-language instruction ('pick up the red mug') and outputs low-level robot actions: joint angles, gripper commands, end-effector poses. Most are research artifacts from labs like Physical Intelligence, Google DeepMind, Stanford, and Berkeley.
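The input/output contract described above can be sketched as a small interface. This is only an illustration: the names Observation, Action, Policy, and HoldStillPolicy are hypothetical and do not correspond to the API of any model listed on this page, and the placeholder policy performs no real inference.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class Observation:
    frames: List[bytes]   # encoded camera images, one per camera (assumed format)
    instruction: str      # natural-language command


@dataclass
class Action:
    joint_angles: List[float]  # radians, one value per joint
    gripper: float             # 0.0 = open, 1.0 = closed (assumed convention)


class Policy(Protocol):
    """Anything that maps an observation to a low-level action."""

    def act(self, obs: Observation) -> Action: ...


class HoldStillPolicy:
    """Placeholder policy: ignores its input and returns a fixed pose."""

    def __init__(self, num_joints: int) -> None:
        self.num_joints = num_joints

    def act(self, obs: Observation) -> Action:
        return Action(joint_angles=[0.0] * self.num_joints, gripper=0.0)


policy: Policy = HoldStillPolicy(num_joints=7)
action = policy.act(Observation(frames=[b""], instruction="pick up the red mug"))
```

A real VLA would replace HoldStillPolicy with a model call; the surrounding control loop would stay the same shape.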
12 models available
Gemini Robotics (2025)
Google DeepMind's vision-language-action model based on Gemini 2.0. Generalist robot policy with strong dexterity.
Gemini Robotics-ER
Embodied-reasoning variant of Gemini Robotics. Enhanced 3D spatial reasoning and trajectory planning.
Google RT-2-X
Google's VLA from the RT-X collaboration. Trained on Open-X-Embodiment (22 robots, 527 skills) and shows positive transfer across robots.
LeRobot SmolVLA
HuggingFace's 450M VLA pretrained on 487 community LeRobot datasets. Runs on consumer GPUs.
NVIDIA Cosmos-Predict-1
NVIDIA's world foundation model for physical AI. Diffusion-based video prediction for robotics simulation.
Octo Base
Berkeley/Stanford 93M transformer diffusion policy. Pretrained on 800k Open-X-Embodiment episodes.
Octo Small
Compact 27M variant of Octo. Faster inference on consumer GPUs, designed for low-latency control.
OpenVLA-7B
Stanford/Berkeley open VLA trained on 970k Open-X-Embodiment episodes. Supports LoRA fine-tuning.
Physical Intelligence Pi-0-FAST
Autoregressive π-0 variant using FAST action tokenizer. Faster inference at competitive task success.
Physical Intelligence π-0
Physical Intelligence's flagship VLA flow-matching policy. Generalist robot control, pretrained on 10k+ hrs robot data.
Physical Intelligence π-0.5
Upgraded π-0 with open-world generalization via knowledge insulation. Weights and fine-tuning open-sourced.
RDT-1B
Tsinghua's 1B diffusion-transformer bimanual manipulation policy. Predicts next 64 actions per inference.
Top robotics / VLA picks
Hand-picked across four common criteria, resolved against the live catalog so the picks track price and performance changes.
Google DeepMind's vision-language-action model based on Gemini 2.0. Generalist robot policy with strong dexterity.
Pricing in this category is not yet standardized. Most of the models on this page run on dedicated GPU infrastructure (Vast.ai, Replicate, or self-hosted), and you pay per second of inference compute rather than per call or per token. Plan around €0.00001-€0.0001 per inference step (one camera frame plus one instruction) on H100-class hardware. A continuous policy running at 10 Hz executes 36,000 steps per hour, so it costs roughly €0.36-€3.60 per hour of robot operation, before energy and supervision costs.
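The link between an hourly compute price and a per-step price is plain arithmetic: a policy running at f Hz executes f × 3600 inference steps per hour. A minimal sketch follows; the function names and the €1.80 per hour rental price are illustrative assumptions, not catalog data.

```python
SECONDS_PER_HOUR = 3600


def steps_per_hour(control_hz: float) -> float:
    """Number of inference steps a continuous policy runs in one hour."""
    return control_hz * SECONDS_PER_HOUR


def hourly_cost(price_per_step: float, control_hz: float) -> float:
    """Hourly cost given a price per inference step."""
    return price_per_step * steps_per_hour(control_hz)


def price_per_step(hourly_gpu_price: float, control_hz: float) -> float:
    """Effective price per inference step given an hourly compute price."""
    return hourly_gpu_price / steps_per_hour(control_hz)


# Hypothetical example: a GPU rented at EUR 1.80 per hour, policy at 10 Hz.
example_hz = 10.0
example_step_price = price_per_step(1.80, example_hz)
example_hourly = hourly_cost(example_step_price, example_hz)
```

Energy and supervision costs are deliberately left out, as in the estimate above.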
The trade-off triangle is generalization, latency, and physical scope. Larger VLAs (RT-2-X, OpenVLA-7B) generalize to novel objects and instructions but run inference at only 1-3 Hz, which is too slow for closed-loop dexterous control. Smaller distilled models (Octo, π-0-FAST, RDT-1B) reach 30-50 Hz but generalize only within their training distribution. For tabletop manipulation in a controlled cell, the small, fast model is usually the right choice. For research that needs language and visual generalization, the larger model is.
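One way to make the latency side of this trade-off concrete is to time a policy's inference call and compare the achieved rate with the control rate the task needs. The sketch below is an assumption-laden illustration: fake_inference stands in for a real model call, and the 30 Hz requirement is an example value, not a threshold taken from any of the listed models.

```python
import time
from typing import Callable


def measure_rate_hz(step: Callable[[], object], trials: int = 5) -> float:
    """Run `step` several times and return the achieved calls per second."""
    start = time.perf_counter()
    for _ in range(trials):
        step()
    elapsed = time.perf_counter() - start
    return trials / elapsed


def fast_enough(inference_hz: float, required_hz: float) -> bool:
    """True if the policy can keep up with the required control rate."""
    return inference_hz >= required_hz


def fake_inference() -> None:
    """Stand-in for a model forward pass (hypothetical 5 ms latency)."""
    time.sleep(0.005)


measured_hz = measure_rate_hz(fake_inference)
usable_for_30hz_control = fast_enough(measured_hz, required_hz=30.0)
```

Measuring on the target hardware matters here, since the same model can land on either side of the threshold depending on the GPU and batch settings.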
Watch out for the sim-to-real gap: most VLA training data is collected in simulation or on specific robot embodiments. Deploying on a different arm, gripper, or camera geometry typically requires fine-tuning on a few hundred to a few thousand new demonstrations. Also watch out for safety: these models occasionally output unsafe joint trajectories, so always run a low-level safety filter (joint limits, force limits, workspace bounds) between the policy and the hardware.
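A low-level safety filter of the kind mentioned above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names, limit values, and workspace box are hypothetical, force limits are omitted because they need sensor feedback, and a production filter would also be validated against the specific robot's controller.

```python
from typing import List, Sequence


def clamp_joints(
    commanded: Sequence[float],
    joint_min: Sequence[float],
    joint_max: Sequence[float],
) -> List[float]:
    """Clamp each commanded joint angle into its allowed range."""
    if not (len(commanded) == len(joint_min) == len(joint_max)):
        raise ValueError("command and limit vectors must have equal length")
    return [max(lo, min(hi, q)) for q, lo, hi in zip(commanded, joint_min, joint_max)]


def inside_workspace(
    position: Sequence[float],
    box_min: Sequence[float],
    box_max: Sequence[float],
) -> bool:
    """True if an end-effector position lies inside an axis-aligned box."""
    return all(lo <= p <= hi for p, lo, hi in zip(position, box_min, box_max))


# Hypothetical limits for a 3-joint arm (radians) and a workspace box (meters).
safe_command = clamp_joints([0.5, 3.0, -2.5], [-1.0, -2.0, -2.0], [1.0, 2.0, 2.0])
target_ok = inside_workspace((0.3, 0.0, 0.2), (0.0, -0.5, 0.0), (0.8, 0.5, 0.6))
target_bad = inside_workspace((1.2, 0.0, 0.2), (0.0, -0.5, 0.0), (0.8, 0.5, 0.6))
```

The filter sits between the policy output and the hardware driver: clamped joint commands are forwarded, and a target outside the workspace box would be rejected rather than executed.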
Top picks above cover the most generalizable research flagship, the cheapest run-on-shared-GPU option, the largest open-weights model, and the fastest realtime control policy. Commercial managed-API offerings will be added as providers launch them.
Popular use cases
Common patterns built with robotics / VLA on Railwail.
Related comparisons
Side-by-side reviews of the most-compared models in this category.
Frequently asked questions