Google RT-2-X

Google DeepMind
VLA / Robotics

Google's vision-language-action (VLA) model from the RT-X collaboration. Trained on Open-X-Embodiment (22 robots, 527 skills), it shows positive transfer across robot embodiments.

Research-only model
Google RT-2-X runs on physical robot hardware and is not exposed via the Railwail API yet.
TL;DR · Last updated May 16, 2026

Google RT-2-X is a VLA / robotics AI model from Google DeepMind, priced at €0.000 per 1M input tokens with an unknown context window.

Pricing

Price per Generation
Per generation: Free

API Integration

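The snippet below shows the call pattern of our OpenAI-compatible API with the "rt-2-x" model ID. Google RT-2-X is not exposed via the Railwail API yet, so these calls will only work once it is.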

Install
npm install railwail
JavaScript / TypeScript
import railwail from "railwail";

const rw = railwail("YOUR_API_KEY");

// Simple — just pass a string
const reply = await rw.run("rt-2-x", "Hello! What can you do?");
console.log(reply);

// With message history
const reply2 = await rw.run("rt-2-x", [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Explain quantum computing simply." },
]);
console.log(reply2);

// Full response with usage info
const res = await rw.chat("rt-2-x", [
  { role: "user", content: "Hello!" },
], { temperature: 0.7, max_tokens: 500 });
console.log(res.choices[0].message.content);
console.log(res.usage);

Specifications
Developer: Google DeepMind
Category: VLA / Robotics
Supported Formats: image, text
Tags: google, vla, robotics, research-only, weights-closed

Deep dive — Google DeepMind's Google RT-2-X

About Google DeepMind
Founded 2010 · London, UK / Mountain View, USA

RT-2-X is Google DeepMind's flagship Robotic Transformer 2 (RT-2) model retrained on the Open-X-Embodiment dataset - the first large-scale, multi-institution effort to assemble a unified robot-learning dataset spanning many labs and robots. Open-X-Embodiment was organised in 2023 by Google DeepMind together with 21+ academic and industry institutions (Stanford, UC Berkeley, CMU, MIT, Toyota Research Institute, etc.), producing the RT-X dataset of ~1 million trajectories across 22 robot embodiments. RT-2-X extends RT-2's Vision-Language-Action recipe - using a PaLM-E / PaLI-X style VLM as backbone and emitting actions as text tokens - to this cross-embodiment corpus, demonstrating positive transfer across robots and a new state of the art on generalist manipulation at the time of release. RT-2-X is research-only and not publicly callable; it remains a key academic reference and the conceptual parent of subsequent open VLAs.

Architecture
Vision-Language-Action transformer (PaLI / PaLM-E backbone, discrete action tokens)

RT-2-X follows the RT-2 design: a large Vision-Language Model (PaLI-X or PaLM-E) is co-fine-tuned on web-scale vision-language data and on robot demonstration data, where robot actions are tokenised as strings of natural-language-like tokens (each action dimension binned and rendered as a token). The same next-token prediction objective therefore trains the model on both internet-scale image-text data and on robot trajectories, allowing the resulting policy to inherit web knowledge (object semantics, OCR, common sense) and route it to motor commands. RT-2-X is the version of this recipe trained on the Open-X-Embodiment / RT-X dataset - ~1 million trajectories across 22 robot embodiments - rather than only on Google's internal kitchen-robot dataset. Public results report 5B and 55B variants, with the 55B model showing the strongest generalisation, especially when prompted with unseen language commands or unseen object combinations.
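The action tokenisation described above can be sketched as follows. This is a minimal illustration and not RT-2-X's actual vocabulary or code: the bin count (256), the normalised action range of [-1, 1], the space-separated integer rendering, and all function names are assumptions made for the example.

```typescript
// Hypothetical bin count; the real value is not stated in this page.
const NUM_BINS = 256;

interface ActionRange {
  low: number;
  high: number;
}

// Map one continuous action value to an integer bin index in [0, numBins - 1].
function toBin(value: number, range: ActionRange, numBins: number = NUM_BINS): number {
  const clamped = Math.min(Math.max(value, range.low), range.high);
  const fraction = (clamped - range.low) / (range.high - range.low);
  // The upper edge (fraction === 1) would land in bin numBins, so cap it.
  return Math.min(numBins - 1, Math.floor(fraction * numBins));
}

// Map a bin index back to the centre of its bin.
function fromBin(bin: number, range: ActionRange, numBins: number = NUM_BINS): number {
  const width = (range.high - range.low) / numBins;
  return range.low + (bin + 0.5) * width;
}

// Render an action vector as a text string: one bin index per action dimension.
function encodeAction(action: number[], ranges: ActionRange[]): string {
  if (action.length !== ranges.length) {
    throw new Error("action and ranges must have the same length");
  }
  return action.map((value, i) => toBin(value, ranges[i])).join(" ");
}

// Parse the text string back into approximate continuous controls.
function decodeAction(text: string, ranges: ActionRange[], numBins: number = NUM_BINS): number[] {
  const bins = text.trim().split(/\s+/).map((token) => Number.parseInt(token, 10));
  if (bins.length !== ranges.length) {
    throw new Error("token count does not match the number of action dimensions");
  }
  if (bins.some((b) => !Number.isInteger(b) || b < 0 || b >= numBins)) {
    throw new Error("token is not a valid bin index");
  }
  return bins.map((b, i) => fromBin(b, ranges[i], numBins));
}

// Usage: a three-dimensional action in an assumed [-1, 1] range.
const exampleRanges: ActionRange[] = [
  { low: -1, high: 1 },
  { low: -1, high: 1 },
  { low: -1, high: 1 },
];
const exampleTokens = encodeAction([-1, 0, 1], exampleRanges);
console.log(exampleTokens); // "0 128 255"
console.log(decodeAction(exampleTokens, exampleRanges));
```

Because the action becomes an ordinary string, it can share a next-token prediction objective with image-text data, which is the property the paragraph above relies on.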

Parameters: Up to 55B (RT-2-X variants: 5B and 55B)
Context: unknown
What it can do
  • Generalist VLA trained on Open-X-Embodiment (22 robots)
  • Inherits web-scale knowledge from PaLI / PaLM-E backbones
  • Discrete action-token decoding (text-like vocabulary)
  • Positive transfer across robot embodiments
  • Strong emergent semantic reasoning (e.g. 'pick up the extinct animal')
  • 5B and 55B parameter variants
  • Reference architecture for the modern VLA paradigm
  • Co-training on internet data + robot demos
  • Best for: research, citation, conceptual baseline for VLAs.
Training & License

Co-trained on internet-scale vision-language data (PaLI / PaLM-E corpora) plus ~1 million robot trajectories from the Open-X-Embodiment (RT-X) dataset across 22 robot embodiments. Action targets are continuous controls discretised into tokens.
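One way to picture co-training is a batch sampler that draws each example from either the web data or the robot data. The sketch below is illustrative only: the mixing fraction, the per-example sampling scheme, the seeded generator, and all names are assumptions, since the actual data mixture used for RT-2-X is not given here.

```typescript
type DataSource = "web" | "robot";

// Small seeded pseudo-random generator (mulberry32) so the sketch is reproducible.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // value in [0, 1)
  };
}

// Decide, example by example, which dataset each slot of a batch is drawn from.
// robotFraction is a hypothetical hyperparameter, not a published RT-2-X value.
function sampleBatchSources(
  batchSize: number,
  robotFraction: number,
  seed: number,
): DataSource[] {
  if (robotFraction < 0 || robotFraction > 1) {
    throw new Error("robotFraction must be between 0 and 1");
  }
  const random = mulberry32(seed);
  const sources: DataSource[] = [];
  for (let i = 0; i < batchSize; i++) {
    sources.push(random() < robotFraction ? "robot" : "web");
  }
  return sources;
}

// Usage: an 8-example batch with an assumed 50/50 mixture.
console.log(sampleBatchSources(8, 0.5, 42));
```

Both kinds of example feed the same next-token loss; only the target differs (caption or answer text for web data, action tokens for robot data).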

License: Research-only - Google DeepMind has not publicly released the RT-2-X weights, code or API. Some Open-X-Embodiment data and smaller RT-X reproductions are available, but the proprietary RT-2-X checkpoints are not.

Known limitations
  • Closed weights - no public API or download
  • Inference latency too high for very fast control loops
  • Discrete tokens limit smoothness vs diffusion / flow-matching policies
  • Cross-embodiment transfer still constrained by action-space differences
  • Long-horizon tasks need external prompt decomposition
  • Dataset skew toward kitchen / tabletop tasks
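
Two of the limitations above, latency and discrete tokens, reduce to simple arithmetic. The sketch below is a back-of-the-envelope helper under stated assumptions: one forward pass per action, uniform binning decoded to bin centres, and purely hypothetical input numbers that are not measured RT-2-X figures.

```typescript
// Highest closed-loop control rate (Hz) if every action needs one forward pass.
function maxControlRateHz(inferenceLatencyMs: number): number {
  if (inferenceLatencyMs <= 0) {
    throw new Error("latency must be positive");
  }
  return 1000 / inferenceLatencyMs;
}

// Worst-case error when a range is split into equal bins and decoded to bin centres.
function maxQuantisationError(low: number, high: number, numBins: number): number {
  if (numBins <= 0 || high <= low) {
    throw new Error("need numBins > 0 and high > low");
  }
  return (high - low) / (2 * numBins);
}

// Hypothetical numbers: a 250 ms forward pass and 256 bins over [-1, 1].
console.log(maxControlRateHz(250)); // 4
console.log(maxQuantisationError(-1, 1, 256)); // 0.00390625
```

A policy limited to a few actions per second cannot close a fast control loop on its own, and the fixed bin width sets a floor on how finely a single token can express a control value.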
