RDT-1B
Tsinghua's 1B-parameter diffusion-transformer policy for bimanual manipulation. Predicts the next 64 actions per inference.
RDT-1B is a VLA / robotics AI model from Custom, priced at €0.000 per 1M input tokens with an unknown context window.
Pricing
API Integration
Use our OpenAI-compatible API to integrate RDT-1B into your application.
npm install railwail
import railwail from "railwail";
const rw = railwail("YOUR_API_KEY");
// Simple: just pass a string
const reply = await rw.run("rdt-1b", "Hello! What can you do?");
console.log(reply);
// With message history
const reply2 = await rw.run("rdt-1b", [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "Explain quantum computing simply." },
]);
console.log(reply2);
// Full response with usage info
const res = await rw.chat("rdt-1b", [
{ role: "user", content: "Hello!" },
], { temperature: 0.7, max_tokens: 500 });
console.log(res.choices[0].message.content);
console.log(res.usage);
Deep dive: Tsinghua University (TSAIL / IIIS)'s RDT-1B
Robotics Diffusion Transformer (RDT) is a generalist bimanual manipulation policy developed at Tsinghua University's TSAIL / Institute for Interdisciplinary Information Sciences (IIIS), home of Jun Zhu's diffusion-modelling group. RDT-1B, introduced in October 2024, is one of the first publicly released billion-scale diffusion-based Vision-Language-Action models, specifically designed for two-arm robots such as Aloha, Mobile Aloha and a custom bimanual platform used by the authors. The project is positioned as a Chinese academic counterpart to π0 and OpenVLA, with open weights released on Hugging Face under a permissive licence and the explicit aim of enabling fully reproducible bimanual VLA research.
RDT-1B is a 1-billion-parameter Diffusion Transformer (DiT) trained as a Vision-Language-Action policy. Inputs are multi-view RGB observations (left + right + overhead), proprioception for both arms and any gripper / mobile-base degrees of freedom, plus a natural-language instruction encoded by a text encoder. The conditioning tokens are fed through a transformer trunk, while a diffusion head denoises continuous action chunks for both arms in a unified action space, allowing coordinated dual-arm motion. Training is done in two stages: a large multi-robot pretraining phase on >1M episodes drawn from public datasets including Open-X-Embodiment and curated bimanual corpora, followed by fine-tuning on the authors' own 6,000-episode bimanual dataset spanning ~300 tasks. The authors report strong results on dexterous bimanual tasks such as folding T-shirts, pouring, and tool use.
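To make the idea of a diffusion head denoising an action chunk more concrete, here is a toy JavaScript sketch. It is not RDT-1B's network or sampler: `toyDenoiser` is a hypothetical stand-in that always predicts the same smooth trajectory, `ACTION_DIM = 14` is an assumed two-arm layout, and the update rule is a simplified deterministic refinement (move a fraction of the way toward the predicted clean chunk at each step) rather than a real DDPM or DPM-Solver schedule.

```javascript
// Toy illustration only: NOT RDT-1B's actual sampler or network.
const CHUNK_LEN = 64;  // actions predicted per inference (from the model summary)
const ACTION_DIM = 14; // assumption: 7 degrees of freedom per arm

// Small deterministic PRNG (mulberry32) so the sketch is reproducible.
function makeRng(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical denoiser: ignores the noisy values and returns a fixed smooth
// target. A real model would condition on images, proprioception and text.
function toyDenoiser(noisyChunk, cond) {
  return noisyChunk.map((row, t) =>
    row.map((_, d) => cond.amplitude * Math.sin((2 * Math.PI * t) / CHUNK_LEN + d))
  );
}

// Iterative refinement: start from noise, then step toward the prediction.
function sampleActionChunk(denoiser, cond, numSteps, rng) {
  // Uniform noise in [-1, 1) for simplicity; real samplers use Gaussian noise.
  let x = Array.from({ length: CHUNK_LEN }, () =>
    Array.from({ length: ACTION_DIM }, () => 2 * rng() - 1)
  );
  for (let k = 0; k < numSteps; k++) {
    const predictedClean = denoiser(x, cond);
    const remaining = numSteps - k; // the final step lands on the prediction
    x = x.map((row, t) =>
      row.map((v, d) => v + (predictedClean[t][d] - v) / remaining)
    );
  }
  return x;
}

const chunk = sampleActionChunk(toyDenoiser, { amplitude: 0.5 }, 5, makeRng(0));
console.log(chunk.length, chunk[0].length); // 64 14
```

The point of the sketch is the shape of the computation: one inference call runs the denoiser several times and returns a whole 64-step chunk for both arms at once, rather than one action per call.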
- Parameters: 1B
- Context: unknown
- 1B-parameter Diffusion Transformer VLA
- Designed for bimanual manipulation (Aloha-class robots)
- Trained on >1M cross-embodiment episodes + 6k bimanual demos
- Continuous action chunks for both arms in a unified space
- Diffusion head produces smooth coordinated motion
- Open weights on Hugging Face (permissive licence)
- Strong results on folding, pouring and tool use
- Reproducible training and evaluation code
- Best for: bimanual manipulation research, two-arm fine-tuning.
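One way to picture a unified action space is a fixed-length vector in which every robot writes its degrees of freedom into agreed slots and marks the unused slots with a mask. The sketch below is illustrative only: the 16-slot layout, the slot names and the `toUnified` helper are all assumptions made for this example, not RDT-1B's actual index assignment.

```javascript
// Hypothetical slot layout: not RDT-1B's real index assignment.
const UNIFIED_DIM = 16;
const SLOT = {};
for (let j = 0; j < 6; j++) {
  SLOT[`left_joint_${j}`] = j;      // slots 0..5
  SLOT[`right_joint_${j}`] = 7 + j; // slots 7..12
}
SLOT.left_gripper = 6;
SLOT.right_gripper = 13;
SLOT.base_vx = 14; // mobile-base forward velocity
SLOT.base_wz = 15; // mobile-base yaw rate

// Map a robot's named action values into the shared vector.
// The mask records which slots this embodiment actually uses.
function toUnified(namedAction) {
  const values = new Array(UNIFIED_DIM).fill(0);
  const mask = new Array(UNIFIED_DIM).fill(0);
  for (const [name, value] of Object.entries(namedAction)) {
    const idx = SLOT[name];
    if (idx === undefined) throw new Error(`unknown degree of freedom: ${name}`);
    values[idx] = value;
    mask[idx] = 1;
  }
  return { values, mask };
}

// A single-arm robot fills only the right-arm slots; the rest stay masked out.
const singleArm = toUnified({ right_joint_0: 0.1, right_gripper: 1 });
console.log(singleArm.mask.join(""));
```

With a layout like this, a single-arm dataset and a two-arm dataset produce vectors of the same length, which is what lets one policy train on both.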
Training data: Pretraining on >1 million robot episodes from Open-X-Embodiment and other public datasets, followed by fine-tuning on a curated bimanual dataset of ~6,000 episodes covering ~300 tasks collected with Aloha-class hardware.
License: Open weights released on Hugging Face under the permissive MIT licence; primarily intended for research use.
Known limitations
- Primarily targets bimanual Aloha-class hardware
- Requires diffusion sampling at inference (multiple steps)
- Limited language reasoning compared to LLM-backed VLAs
- Generalisation to single-arm or mobile platforms needs adapters
- Mostly indoor-lab evaluation
- Smaller pretraining text corpus than RT-2-X / OpenVLA
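The cost of multi-step diffusion sampling is partly offset by action chunking, because each inference yields 64 actions. The rough arithmetic can be sketched as below. The inputs (5 sampling steps, 40 ms per step, a 25 Hz control rate) are made-up illustrative numbers, not measured RDT-1B figures, and `amortizedCost` is a hypothetical helper.

```javascript
// Back-of-envelope estimate; all example inputs are illustrative assumptions.
function amortizedCost({ samplingSteps, msPerStep, chunkLen, controlHz }) {
  // Total time for one inference: one network evaluation per sampling step.
  const inferenceMs = samplingSteps * msPerStep;
  // Wall-clock time the robot spends executing one full chunk.
  const chunkDurationMs = (chunkLen * 1000) / controlHz;
  return {
    inferenceMs,
    networkEvalsPerAction: samplingSteps / chunkLen,
    chunkDurationMs,
    // True if the next chunk can be computed while the current one executes.
    keepsUp: inferenceMs <= chunkDurationMs,
  };
}

const est = amortizedCost({
  samplingSteps: 5, // assumed
  msPerStep: 40,    // assumed
  chunkLen: 64,     // from the model summary
  controlHz: 25,    // assumed
});
console.log(est);
```

Under these assumed numbers, one inference takes 200 ms and covers 2,560 ms of motion, so sampling cost per executed action is small. Executing fewer actions from each chunk before replanning shifts the balance the other way.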
Related Models
Gemini Robotics (2025)
Google DeepMind's vision-language-action model based on Gemini 2.0. Generalist robot policy with strong dexterity.
Gemini Robotics-ER
Embodied-reasoning variant of Gemini Robotics. Enhanced 3D spatial reasoning and trajectory planning.
Google RT-2-X
Google's VLA from the RT-X collaboration. Trained on Open-X-Embodiment (22 robots, 527 skills), with positive transfer across embodiments.
LeRobot SmolVLA
Hugging Face's 450M-parameter VLA pretrained on 487 community LeRobot datasets. Runs on consumer GPUs.
Start using RDT-1B today
Get started with free credits. No credit card required. Access RDT-1B and 100+ other models through a single API.