Google RT-2-X
Google's vision-language-action (VLA) model from the RT-X collaboration. Trained on Open-X-Embodiment (22 robots, 527 skills), it shows positive transfer across robot embodiments.
Google RT-2-X is a VLA / robotics AI model from Google DeepMind, priced at €0.000 per 1M input tokens, with an unknown context window.
Pricing
API Integration
Use our OpenAI-compatible API to integrate Google RT-2-X into your application.
npm install railwail
import railwail from "railwail";
const rw = railwail("YOUR_API_KEY");
// Simple — just pass a string
const reply = await rw.run("rt-2-x", "Hello! What can you do?");
console.log(reply);
// With message history
const reply2 = await rw.run("rt-2-x", [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "Explain quantum computing simply." },
]);
console.log(reply2);
// Full response with usage info
const res = await rw.chat("rt-2-x", [
{ role: "user", content: "Hello!" },
], { temperature: 0.7, max_tokens: 500 });
console.log(res.choices[0].message.content);
console.log(res.usage);
Deep dive - Google DeepMind's Google RT-2-X
RT-2-X is Google DeepMind's Robotic Transformer 2 (RT-2) model retrained on the Open-X-Embodiment dataset - the first large-scale, multi-institution effort to assemble a unified robot-learning dataset spanning many labs and robots. Open-X-Embodiment was organised in 2023 by Google DeepMind together with 21+ academic and industry institutions (Stanford, UC Berkeley, CMU, MIT, Toyota Research Institute, etc.), producing the RT-X dataset of ~1 million trajectories across 22 robot embodiments. RT-2-X extends RT-2's Vision-Language-Action (VLA) recipe - a PaLI-X / PaLM-E style VLM backbone that emits actions as text tokens - to this cross-embodiment corpus, demonstrating positive transfer across robots and, at the time of release, state-of-the-art results on generalist manipulation. RT-2-X is research-only and not publicly callable; it remains a key academic reference and the conceptual parent of subsequent open VLAs.
RT-2-X follows the RT-2 design: a large Vision-Language Model (PaLI-X or PaLM-E) is co-fine-tuned on web-scale vision-language data and on robot demonstration data, where robot actions are tokenised as strings of natural-language-like tokens (each action dimension binned and rendered as a token). The same next-token prediction objective therefore trains the model on both internet-scale image-text data and on robot trajectories, allowing the resulting policy to inherit web knowledge (object semantics, OCR, common sense) and route it to motor commands. RT-2-X is the version of this recipe trained on the Open-X-Embodiment / RT-X dataset - ~1 million trajectories across 22 robot embodiments - rather than only on Google's internal kitchen-robot dataset. Public results report 5B and 55B variants, with the 55B model showing the strongest generalisation, especially when prompted with unseen language commands or unseen object combinations.
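The binning step described above can be sketched in a few lines. This is an illustration only, not DeepMind's code: the 256-bin count, the normalised [-1, 1] range, the space-separated token format and the function names toBin and tokeniseAction are all assumptions made for this example.

```javascript
// Hypothetical sketch of action tokenisation; not the RT-2-X implementation.
// Assumed here: 256 bins per dimension and actions normalised to [-1, 1].
const NUM_BINS = 256;

// Map one continuous value in [low, high] to an integer bin in [0, NUM_BINS - 1].
function toBin(value, low = -1, high = 1) {
  const clamped = Math.min(Math.max(value, low), high);
  const fraction = (clamped - low) / (high - low);
  // The upper edge (fraction === 1) would land in bin NUM_BINS, so cap it.
  return Math.min(NUM_BINS - 1, Math.floor(fraction * NUM_BINS));
}

// Render an action vector as a string of integer tokens, one per dimension.
function tokeniseAction(action) {
  return action.map((v) => toBin(v)).join(" ");
}

console.log(tokeniseAction([0.0, -1.0, 1.0, 0.5])); // prints "128 0 255 192"
```

Because the result is an ordinary token string, a language model can be trained to emit it with the same next-token objective it uses for text.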
- Parameters: Up to 55B (RT-2-X variants: 5B and 55B)
- Context: unknown
- Generalist VLA trained on Open-X-Embodiment (22 robots)
- Inherits web-scale knowledge from PaLI / PaLM-E backbones
- Discrete action-token decoding (text-like vocabulary)
- Positive transfer across robot embodiments
- Strong emergent semantic reasoning (e.g. 'pick up the extinct animal')
- 5B and 55B parameter variants
- Reference architecture for the modern VLA paradigm
- Co-training on internet data + robot demos
- Best for: research, citation, conceptual baseline for VLAs.
Co-trained on internet-scale vision-language data (PaLI / PaLM-E corpora) plus ~1 million robot trajectories from the Open-X-Embodiment (RT-X) dataset across 22 robot embodiments. Action targets are continuous controls discretised into tokens.
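At inference time, tokenised action targets have to be mapped back to continuous controls before a robot can execute them. The sketch below shows one plausible decoding, assuming 256 bins over a normalised [-1, 1] range and space-separated integer tokens; those values and the names fromBin and detokeniseAction are hypothetical, chosen for illustration rather than taken from RT-2-X.

```javascript
// Hypothetical sketch of action detokenisation; not the RT-2-X implementation.
// Assumed here: 256 bins per dimension and actions normalised to [-1, 1].
const NUM_BINS = 256;

// Return the centre of a bin as the decoded continuous value.
function fromBin(bin, low = -1, high = 1) {
  const width = (high - low) / NUM_BINS;
  return low + (bin + 0.5) * width;
}

// Parse a space-separated token string back into a continuous action vector.
function detokeniseAction(tokenString) {
  return tokenString.trim().split(/\s+/).map((token) => {
    const bin = Number(token);
    if (token === "" || !Number.isInteger(bin) || bin < 0 || bin >= NUM_BINS) {
      throw new RangeError(`Invalid action token: "${token}"`);
    }
    return fromBin(bin);
  });
}

console.log(detokeniseAction("128 0 255"));
// prints [ 0.00390625, -0.99609375, 0.99609375 ]
```

Validating each token matters in practice: a generative model can emit out-of-vocabulary text, and a controller should reject it rather than pass a garbage command to the motors.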
License: Research-only - Google DeepMind has not publicly released the RT-2-X weights, code or API. Some Open-X-Embodiment data and smaller RT-X reproductions are available, but the proprietary RT-2-X checkpoints are not.
Known limitations
- Closed weights - no public API or download
- Inference latency too high for very fast control loops
- Discrete tokens limit smoothness vs diffusion / flow-matching policies
- Cross-embodiment transfer still constrained by action-space differences
- Long-horizon tasks need external prompt decomposition
- Dataset skew toward kitchen / tabletop tasks
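The smoothness limitation of discrete tokens can be made concrete with a little arithmetic: if each action dimension is split into N equal bins and decoded to the bin centre, the decoded value can differ from the intended one by up to half a bin width. The sketch below checks that bound numerically. The 256-bin count, the [-1, 1] range and the helper names are assumptions for illustration, not values stated by this page.

```javascript
// Hypothetical sketch: worst-case round-trip error of uniform action binning.
// Assumed here: values normalised to [-1, 1]; the bin count is a parameter.

// Theoretical bound: half of one bin width.
function maxQuantisationError(numBins, low = -1, high = 1) {
  return (high - low) / (2 * numBins);
}

// Encode a value to its bin, then decode it to the bin centre.
function roundTrip(value, numBins, low = -1, high = 1) {
  const width = (high - low) / numBins;
  const clamped = Math.min(Math.max(value, low), high);
  const bin = Math.min(numBins - 1, Math.floor((clamped - low) / width));
  return low + (bin + 0.5) * width;
}

// Sweep the range and record the largest observed error.
function observedWorstError(numBins, samples = 10000) {
  let worst = 0;
  for (let i = 0; i <= samples; i++) {
    const v = -1 + (2 * i) / samples;
    worst = Math.max(worst, Math.abs(roundTrip(v, numBins) - v));
  }
  return worst;
}

for (const bins of [16, 64, 256]) {
  console.log(bins, maxQuantisationError(bins), observedWorstError(bins));
}
```

With 256 bins the bound is 2 / 512, roughly 0.0039 in normalised units; whether that is acceptable depends on the physical range each dimension maps to. Policies that output continuous values directly do not have this fixed resolution floor.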
Related Models
Gemini Robotics (2025)
Google DeepMind's vision-language-action model based on Gemini 2.0. Generalist robot policy with strong dexterity.
Gemini Robotics-ER
Embodied-reasoning variant of Gemini Robotics. Enhanced 3D spatial reasoning and trajectory planning.
LeRobot SmolVLA
Hugging Face's 450M-parameter VLA pretrained on 487 community LeRobot datasets. Runs on consumer GPUs.
NVIDIA Cosmos-Predict-1
NVIDIA's world foundation model for physical AI. Diffusion-based video prediction for robotics simulation.
Start using Google RT-2-X today
Get started with free credits. No credit card required. Access Google RT-2-X and 100+ other models through a single API.