LeRobot SmolVLA
HuggingFace's 450M VLA pretrained on 487 community LeRobot datasets. Runs on consumer GPUs.
LeRobot SmolVLA is a VLA / robotics AI model from Custom, priced at €0.000 per 1M input tokens, with an unknown context window.
Pricing
API Integration
Use our OpenAI-compatible API to integrate LeRobot SmolVLA into your application.
npm install railwail
import railwail from "railwail";
const rw = railwail("YOUR_API_KEY");
// Simple — just pass a string
const reply = await rw.run("smolvla", "Hello! What can you do?");
console.log(reply);
// With message history
const reply2 = await rw.run("smolvla", [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "Explain quantum computing simply." },
]);
console.log(reply2);
// Full response with usage info
const res = await rw.chat("smolvla", [
{ role: "user", content: "Hello!" },
], { temperature: 0.7, max_tokens: 500 });
console.log(res.choices[0].message.content);
console.log(res.usage);
Deep dive: Hugging Face (LeRobot team)'s LeRobot SmolVLA
SmolVLA is the flagship Vision-Language-Action model of Hugging Face's LeRobot project, an open-source robotics framework that brings the Transformers / Datasets philosophy to physical-AI research. It was released in mid-2025 as a deliberately compact 450M-parameter VLA, designed to be trainable and runnable on consumer hardware while still benefiting from community-scale pretraining. It is trained on 487 publicly contributed LeRobot community datasets (teleoperation episodes uploaded by hobbyists, university labs and small robotics companies), making it the first community-data-driven open VLA. The release includes pretraining and fine-tuning code, model checkpoints under Apache-2.0, and tight integration with the LeRobot framework, datasets hosted on the Hugging Face Hub, and the SO-100 / SO-ARM-100 low-cost robot arms.
SmolVLA is a 450M-parameter transformer that combines a SmolVLM-style vision-language encoder with an action expert that regresses continuous action chunks. The vision-language tower is initialised from the open SmolVLM family (compact VLMs released by Hugging Face) and is responsible for fusing multi-view RGB observations with the natural-language instruction; a smaller action-prediction head consumes the resulting tokens together with proprioception and outputs a short chunk of continuous joint or end-effector actions. The model is pretrained on 487 LeRobot-format community datasets, covering single-arm, dual-arm and mobile-base setups, with a strong tilt toward the popular SO-100 and Koch low-cost teleoperation arms. Post-pretraining, users fine-tune on their own LeRobot recording for a specific robot and task. The whole stack is designed to run pretraining on a few H100s and fine-tuning on a single consumer GPU.
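To make the idea of action chunking concrete, here is a minimal sketch of a control loop that queries a policy once per chunk and executes every action in the chunk before querying again. Everything in it is an assumption for illustration: `stubPolicy`, `runChunkedControl`, the chunk size of 10 and the "robot reaches its target exactly" model are hypothetical and are not part of the LeRobot or SmolVLA API.

```javascript
// Hypothetical stand-in for a VLA policy: maps an observation and an
// instruction to a chunk of `chunkSize` continuous action vectors.
// A real checkpoint would run the vision-language tower and the action
// expert here; this stub just shrinks each joint value toward zero.
function stubPolicy(observation, instruction, chunkSize) {
  const chunk = [];
  let joints = observation.joints.slice();
  for (let t = 0; t < chunkSize; t++) {
    joints = joints.map((q) => q * 0.9); // continuous targets, not discrete tokens
    chunk.push(joints.slice());
  }
  return chunk;
}

// Query the policy, execute the whole chunk, then query again from the
// newly observed state, until `totalSteps` actions have been executed.
function runChunkedControl(policy, initialJoints, instruction, { chunkSize, totalSteps }) {
  let state = { joints: initialJoints.slice() };
  const executed = [];
  let policyCalls = 0;
  while (executed.length < totalSteps) {
    const chunk = policy(state, instruction, chunkSize);
    policyCalls += 1;
    for (const action of chunk) {
      if (executed.length >= totalSteps) break;
      executed.push(action);
      state = { joints: action.slice() }; // idealised robot: reaches the target exactly
    }
  }
  return { executed, policyCalls, finalJoints: state.joints };
}

const result = runChunkedControl(stubPolicy, [1.0, -0.5], "pick up the cube", {
  chunkSize: 10,
  totalSteps: 25,
});
console.log(result.policyCalls, result.executed.length);
```

With a chunk size of 10 and 25 control steps, the stub policy is queried three times rather than 25, which is the practical appeal of predicting chunks: fewer forward passes of the model per second of robot motion.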
- Parameters: 450M
- Context: unknown
- Compact 450M open VLA pretrained on community data
- Trained on 487 LeRobot community datasets
- SmolVLM-style vision-language tower + action expert
- Continuous action-chunk regression
- Runs fine-tuning on a single consumer GPU
- Tight integration with LeRobot framework on Hugging Face
- Apache-2.0 licence on weights and code
- Strong baseline for SO-100 and Koch low-cost arms
- Best for: hobbyists, educators, low-cost robot research.
Training data: 487 publicly contributed LeRobot-format community datasets hosted on the Hugging Face Hub, dominated by teleoperation episodes from low-cost arms (SO-100, Koch) but also including dual-arm and mobile setups. The total scale is on the order of millions of frames.
License: Apache-2.0 - fully open weights, code, and datasets (where contributors used compatible licences). Designed for both research and commercial use.
Known limitations
- Modest scale - underperforms 7B VLAs on hard tasks
- Dataset skew toward SO-100 / Koch low-cost arms
- Limited language reasoning vs LLM-backed VLAs
- Sensor coverage is mostly single RGB camera setups
- Community data quality varies
- Long-horizon behaviour limited without prompt decomposition
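The last limitation mentions prompt decomposition. One way to picture it, as a rough sketch only: split a long instruction into short sub-instructions and run the policy on each in turn. The helpers `decomposeInstruction` and `runDecomposed` and the `runPolicy` callback below are hypothetical names invented for this example, and the regex-based splitting is a simplification; a real setup might rely on a language model or a hand-written task plan instead.

```javascript
// Hypothetical helper: split a long instruction into short sub-instructions
// at sequencing words ("then", "and then") and sentence punctuation.
function decomposeInstruction(instruction) {
  return instruction
    .split(/\s*(?:,\s*and then\b|,\s*then\b|\band then\b|\bthen\b|[;.])\s*/i)
    .map((part) => part.trim())
    .filter((part) => part.length > 0);
}

// Run the sub-instructions in order, stopping at the first failure.
// `runPolicy` stands in for "roll out the VLA on this prompt until done"
// and must return true on success, false on failure.
function runDecomposed(instruction, runPolicy) {
  const steps = decomposeInstruction(instruction);
  const completed = [];
  for (const step of steps) {
    const ok = runPolicy(step);
    if (!ok) return { completed, failedAt: step };
    completed.push(step);
  }
  return { completed, failedAt: null };
}

const task = "Pick up the cube, then place it in the bin and then close the lid.";
const plan = decomposeInstruction(task);
// Pretend the final skill fails so the early-exit path is exercised.
const outcome = runDecomposed(task, (step) => !step.startsWith("close"));
console.log(plan, outcome);
```

Each sub-instruction is short enough to resemble the single-task prompts the model is fine-tuned on, and reporting `failedAt` lets the caller retry or re-plan from the step that failed rather than restarting the whole task.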
Related Models
Gemini Robotics (2025)
Google DeepMind's vision-language-action model based on Gemini 2.0. Generalist robot policy with strong dexterity.
Gemini Robotics-ER
Embodied-reasoning variant of Gemini Robotics. Enhanced 3D spatial reasoning and trajectory planning.
Google RT-2-X
Google's VLA from the RT-X collaboration. Trained on Open X-Embodiment (22 robots, 527 skills) and shows positive transfer across robots.
NVIDIA Cosmos-Predict-1
NVIDIA's world foundation model for physical AI. Diffusion-based video prediction for robotics simulation.
Start using LeRobot SmolVLA today
Get started with free credits. No credit card required. Access LeRobot SmolVLA and 100+ other models through a single API.