Robotics / VLA

Vision-Language-Action models for robotics and embodied AI

Vision-language-action (VLA) models bridge perception, language, and motor control. A VLA takes camera frames plus a natural-language instruction ('pick up the red mug') and outputs low-level robot actions: joint angles, gripper commands, end-effector poses. Most are research artifacts from labs like Physical Intelligence, Google DeepMind, Stanford, and Berkeley.
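The input/output contract described above can be sketched as a small typed interface. This is a minimal illustration, not the API of any model on this page: `RobotAction`, `VLAPolicy`, `HoldStillPolicy`, and `run_step` are hypothetical names, and the action layout (joint angles plus a gripper value) is an assumption chosen to match the examples in the paragraph.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol, Sequence

# An RGB camera frame as nested sequences: height x width x 3 channel values.
Frame = Sequence[Sequence[Sequence[int]]]


@dataclass(frozen=True)
class RobotAction:
    """One low-level command for a single control step (hypothetical layout)."""

    joint_angles_rad: tuple[float, ...]
    gripper_open: float  # 0.0 = closed, 1.0 = fully open


class VLAPolicy(Protocol):
    """Anything that maps camera frames plus an instruction to an action."""

    def predict(self, frames: Sequence[Frame], instruction: str) -> RobotAction:
        ...


class HoldStillPolicy:
    """Stand-in policy: ignores its inputs and holds a fixed pose."""

    def __init__(self, num_joints: int) -> None:
        self.num_joints = num_joints

    def predict(self, frames: Sequence[Frame], instruction: str) -> RobotAction:
        return RobotAction(
            joint_angles_rad=(0.0,) * self.num_joints,
            gripper_open=1.0,
        )


def run_step(policy: VLAPolicy, frames: Sequence[Frame], instruction: str) -> RobotAction:
    """Run one perception-to-action step with whichever policy is plugged in."""
    return policy.predict(frames, instruction)
```

A real policy would replace `HoldStillPolicy` with a wrapper around the model's own inference call; the rest of the control code only depends on the `predict` signature.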

12 models available

Gemini Robotics (2025)

Robotics · Google DeepMind

Google DeepMind's vision-language-action model based on Gemini 2.0. Generalist robot policy with strong dexterity.

Free
google, deepmind, gemini

Gemini Robotics-ER

Robotics · Google DeepMind

Embodied-reasoning variant of Gemini Robotics. Enhanced 3D spatial reasoning and trajectory planning.

Free
google, deepmind, gemini

Google RT-2-X

Robotics · Google DeepMind

Google's VLA from the RT-X collaboration. Trained on Open-X-Embodiment (22 robots, 527 skills) and shows positive transfer across embodiments.

Free
google, vla, robotics

LeRobot SmolVLA

Robotics · Custom

Hugging Face's 450M-parameter VLA pretrained on 487 community LeRobot datasets. Runs on consumer GPUs.

Free
huggingface, lerobot, vla

NVIDIA Cosmos-Predict-1

Robotics · Custom

NVIDIA's world foundation model for physical AI. Diffusion-based video prediction for robotics simulation.

Free
nvidia, cosmos, vla

Octo Base

Robotics · UC Berkeley

Berkeley/Stanford 93M transformer diffusion policy. Pretrained on 800k Open-X-Embodiment episodes.

Free
berkeley, stanford, vla

Octo Small

Robotics · UC Berkeley

Compact 27M variant of Octo. Faster inference on consumer GPUs, designed for low-latency control.

Free
berkeley, vla, robotics

OpenVLA-7B

Robotics · OpenVLA

Stanford/Berkeley open VLA trained on 970k Open-X-Embodiment episodes. Supports LoRA fine-tuning.

Free
stanford, berkeley, vla

Physical Intelligence π-0-FAST

Robotics · Physical Intelligence

Autoregressive π-0 variant using the FAST action tokenizer. Faster inference at competitive task success.

Free
physical-intelligence, vla, robotics

Physical Intelligence π-0

Robotics · Physical Intelligence

Physical Intelligence's flagship VLA flow-matching policy. Generalist robot control, pretrained on 10k+ hrs robot data.

Free
physical-intelligence, vla, robotics

Physical Intelligence π-0.5

Robotics · Physical Intelligence

Upgraded π-0 with open-world generalization via knowledge insulation. Weights and fine-tuning code open-sourced.

Free
physical-intelligence, vla, robotics

RDT-1B

Robotics · Custom

Tsinghua's 1B diffusion-transformer bimanual manipulation policy. Predicts next 64 actions per inference.

Free
tsinghua, vla, robotics

Top Robotics / VLA picks

Hand-picked across four common criteria and resolved against the live catalog, so the picks track price and performance changes.

Best overall: Gemini Robotics (2025)
Cheapest: Gemini Robotics (2025)
Largest open weights: Gemini Robotics (2025)
Fastest: Gemini Robotics (2025)

Pricing in this category is not yet standardized. Most of the models on this page run on dedicated GPU infrastructure (Vast.ai, Replicate, or self-hosted), and you pay per second of inference compute rather than per call or per token. Plan around €0.00001-€0.0001 per inference step (one camera frame plus one instruction) on H100-class hardware. A continuous policy running at 10 Hz executes 36,000 steps per hour, so it costs roughly €0.36-€3.60 per hour of robot operation, before energy and supervision costs.
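The arithmetic behind an hourly estimate is simple enough to sketch. The helper names below are hypothetical, and the per-step cost, GPU rate, and step duration in the usage example are placeholder assumptions, not quoted prices; substitute the figures from your own provider.

```python
SECONDS_PER_HOUR = 3600.0


def steps_per_hour(control_hz: float) -> float:
    """Number of inference steps a continuous policy executes in one hour."""
    return control_hz * SECONDS_PER_HOUR


def hourly_inference_cost(cost_per_step: float, control_hz: float) -> float:
    """Hourly compute cost, excluding energy and supervision."""
    return cost_per_step * steps_per_hour(control_hz)


def cost_per_step_from_gpu_rate(gpu_cost_per_hour: float, seconds_per_step: float) -> float:
    """Per-step cost when billing is per second of GPU time."""
    return gpu_cost_per_hour / SECONDS_PER_HOUR * seconds_per_step


if __name__ == "__main__":
    # Placeholder inputs: 0.0001 currency units per step at a 10 Hz control rate.
    print(f"{steps_per_hour(10):.0f} steps per hour")
    print(f"{hourly_inference_cost(0.0001, 10):.2f} per hour")
    # Placeholder inputs: a GPU billed at 3.00 per hour, 0.05 s of compute per step.
    print(f"{cost_per_step_from_gpu_rate(3.00, 0.05):.6f} per step")
```

Because cost scales linearly with control rate, halving the rate or sharing a GPU across several robots changes the hourly figure proportionally.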

The trade-off triangle is generalization, latency, and physical scope. Larger VLAs (RT-2-X, OpenVLA-7B) generalize to novel objects and instructions but run inference at only 1-3 Hz, which is too slow for closed-loop dexterous control. Smaller distilled models (Octo, π-0-FAST, RDT-1B) hit 30-50 Hz but only generalize within their training distribution. For tabletop manipulation in a controlled cell, the small, fast model is usually the right choice. For research that needs language and visual generalization, the larger model is the better fit.
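To make the latency side of the triangle concrete, the sketch below converts an inference rate into the time between consecutive actions and checks it against a target control rate. The function names are hypothetical, and the 20 Hz target is an assumption chosen for illustration; the required rate depends on the task and hardware.

```python
def control_period_ms(rate_hz: float) -> float:
    """Time between consecutive actions, in milliseconds."""
    if rate_hz <= 0:
        raise ValueError("rate_hz must be positive")
    return 1000.0 / rate_hz


def keeps_up(inference_hz: float, required_control_hz: float) -> bool:
    """True if the policy produces actions at least as fast as the loop needs them."""
    return inference_hz >= required_control_hz


if __name__ == "__main__":
    required_hz = 20.0  # assumed target rate for a closed-loop task
    for label, hz in [("large VLA, low end", 1.0), ("large VLA, high end", 3.0),
                      ("small model, low end", 30.0), ("small model, high end", 50.0)]:
        verdict = "ok" if keeps_up(hz, required_hz) else "too slow"
        print(f"{label}: {control_period_ms(hz):.0f} ms per action, {verdict}")
```

At 1-3 Hz the robot acts on observations that are hundreds of milliseconds old, while at 30-50 Hz the gap shrinks to a few tens of milliseconds.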

Watch out for the sim-to-real gap: most VLA training data is collected in simulation or on specific robot embodiments. Deploying on a different arm, gripper, or camera geometry typically requires fine-tuning on a few hundred to a few thousand new demonstrations. Also watch out for safety: these models occasionally output unsafe joint trajectories, so always run a low-level safety filter (joint limits, force limits, workspace bounds) between the policy and the hardware.
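A low-level safety filter of the kind described above might look like the following sketch. All names (`SafetyLimits`, `filter_joint_command`, `in_workspace`) are hypothetical, the per-step rate limit is an added assumption, and force limits are left out because they depend on hardware-specific sensing. Treat it as an illustration of the idea, not a certified safety layer.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SafetyLimits:
    joint_min_rad: Sequence[float]
    joint_max_rad: Sequence[float]
    max_joint_step_rad: float  # assumed cap on joint change per control step
    workspace_min_m: Sequence[float]  # (x, y, z) lower corner of the allowed box
    workspace_max_m: Sequence[float]  # (x, y, z) upper corner of the allowed box


def clamp(value: float, low: float, high: float) -> float:
    return max(low, min(high, value))


def filter_joint_command(current: Sequence[float], commanded: Sequence[float],
                         limits: SafetyLimits) -> list[float]:
    """Rate-limit and clamp a commanded joint vector before it reaches hardware."""
    if not (len(current) == len(commanded) == len(limits.joint_min_rad)
            == len(limits.joint_max_rad)):
        raise ValueError("joint vectors and limits must have the same length")
    safe = []
    for q_now, q_cmd, low, high in zip(current, commanded,
                                       limits.joint_min_rad, limits.joint_max_rad):
        if not math.isfinite(q_cmd):
            q_cmd = q_now  # hold position if the policy emits NaN or infinity
        step = clamp(q_cmd - q_now, -limits.max_joint_step_rad, limits.max_joint_step_rad)
        safe.append(clamp(q_now + step, low, high))
    return safe


def in_workspace(position_m: Sequence[float], limits: SafetyLimits) -> bool:
    """True if an end-effector position lies inside the allowed box."""
    return all(low <= p <= high for p, low, high in
               zip(position_m, limits.workspace_min_m, limits.workspace_max_m))
```

In a control loop, the policy output would pass through `filter_joint_command` on every step, and a command whose predicted end-effector position fails `in_workspace` would be rejected or replaced with a hold.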

Top picks above cover the most generalizable research flagship, the cheapest run-on-shared-GPU option, the largest open-weights model, and the fastest realtime control policy. Commercial managed-API offerings will be added as providers launch them.
