Gemini Robotics-ER
Embodied-reasoning variant of Gemini Robotics. Enhanced 3D spatial reasoning and trajectory planning.
Gemini Robotics-ER is a VLA / robotics AI model from Google DeepMind, priced at €0.000 per 1M input tokens. Its context window is unknown.
API Integration
Use our OpenAI-compatible API to integrate Gemini Robotics-ER into your application.
npm install railwail
import railwail from "railwail";
const rw = railwail("YOUR_API_KEY");
// Simple — just pass a string
const reply = await rw.run("gemini-robotics-er", "Hello! What can you do?");
console.log(reply);
// With message history
const reply2 = await rw.run("gemini-robotics-er", [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "Explain quantum computing simply." },
]);
console.log(reply2);
// Full response with usage info
const res = await rw.chat("gemini-robotics-er", [
{ role: "user", content: "Hello!" },
], { temperature: 0.7, max_tokens: 500 });
console.log(res.choices[0].message.content);
console.log(res.usage);
Deep dive: Google DeepMind's Gemini Robotics-ER
Google DeepMind announced Gemini Robotics-ER (Embodied Reasoning) alongside Gemini Robotics in March 2025. While Gemini Robotics is the action-producing VLA, Gemini Robotics-ER is the reasoning-focused sibling: a Vision-Language Model variant of Gemini 2.0 specialised for spatial understanding, 3D grounding, point/box prediction, trajectory planning and code generation for robotics. It is designed to be combined with classical motion planners, low-level controllers or with the Gemini Robotics VLA itself. DeepMind positions Gemini Robotics-ER as a 'reasoning brain' that a robot stack can call with multimodal prompts to decompose tasks, locate objects in 2D / 3D, and emit waypoints or Python control code. As with Gemini Robotics, access is limited to research and partner programs.
Visit Google DeepMind →
Gemini Robotics-ER is a fine-tuned variant of Gemini 2.0 specialised for embodied perception and planning rather than direct control. The architecture preserves the multimodal Transformer backbone of Gemini 2.0 (image, video, text, code) but is post-trained on a curated corpus of embodied tasks: object detection in 2D and 3D, point and bounding-box prediction, grasp prediction, motion-trajectory generation, and code-as-policy outputs that call robot APIs. It can accept egocentric robot camera streams and a natural-language task description, then produce structured outputs such as pixel-space points to grasp, 3D coordinates relative to the camera, planning steps, or Python snippets that drive a downstream controller. In combination with the Gemini Robotics VLA, Robotics-ER provides high-level reasoning while the VLA handles closed-loop low-level actions.
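As a rough illustration of consuming such structured output, the sketch below maps predicted points onto the pixel grid of a camera frame. It assumes the model was prompted to return a JSON array of objects shaped like {"point": [y, x], "label": "..."} with coordinates normalised to a 0-1000 range. That response shape and the helper name pointsToPixels are assumptions made for this example, not something this page specifies.

```javascript
// Hypothetical helper: map normalised [y, x] points (assumed 0-1000 range)
// onto the pixel grid of the camera frame they were predicted for.
function pointsToPixels(modelText, imageWidth, imageHeight, scale = 1000) {
  const items = JSON.parse(modelText);
  if (!Array.isArray(items)) {
    throw new TypeError("expected a JSON array of point objects");
  }
  // Clamp so a slightly out-of-range prediction still lands inside the image.
  const clamp = (value, max) => Math.min(max, Math.max(0, value));
  return items.map(({ point, label }) => {
    const [y, x] = point;
    return {
      label,
      px: clamp(Math.round((x / scale) * imageWidth), imageWidth - 1),
      py: clamp(Math.round((y / scale) * imageHeight), imageHeight - 1),
    };
  });
}

// Example with a hand-written response string (not real model output).
const sample = '[{"point": [500, 250], "label": "mug handle"}]';
console.log(pointsToPixels(sample, 640, 480));
```

Keeping the parsing step separate from the model call makes it easy to validate or reject a malformed response before anything reaches a controller.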
- Parameters: Undisclosed (Gemini 2.0-class)
- Context: unknown
- Spatial reasoning over 2D / 3D scenes
- Point and bounding-box prediction for objects and grasps
- Trajectory waypoint generation
- Code-as-policy generation (Python that calls robot APIs)
- Compositional task planning from natural language
- Pair with motion planners or with Gemini Robotics VLA
- Multimodal context: images, video, text, robot state
- Improved zero-shot performance on embodied QA benchmarks
- Best for: planning and grounding modules in research robotics stacks.
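One way a planning stack might use a predicted 2D point is to lift it into a 3D camera-frame target using a depth reading. The sketch below applies the standard pinhole camera model. The function name deprojectPixel and the intrinsics values are illustrative assumptions, and a real system would take them from its own camera calibration.

```javascript
// Hypothetical helper: back-project a pixel (u, v) with a measured depth into
// camera-frame coordinates using the pinhole model:
//   x = (u - cx) * z / fx,  y = (v - cy) * z / fy,  z = depth
function deprojectPixel(u, v, depthMeters, { fx, fy, cx, cy }) {
  if (!(depthMeters > 0)) {
    throw new RangeError("depth must be a positive number of metres");
  }
  return {
    x: ((u - cx) * depthMeters) / fx,
    y: ((v - cy) * depthMeters) / fy,
    z: depthMeters,
  };
}

// Made-up intrinsics for a 640x480 camera, for illustration only.
const intrinsics = { fx: 600, fy: 600, cx: 320, cy: 240 };
console.log(deprojectPixel(160, 240, 0.75, intrinsics));
```

The resulting point is expressed in the camera frame. A motion planner would still need the camera-to-robot transform before treating it as a grasp target.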
Training: Gemini 2.0 multimodal pretraining plus embodied post-training on object-detection, 3D grounding, grasp prediction, trajectory planning, and code-generation tasks for robotic control. Draws on Google's internal robot datasets and curated public embodied datasets.
License: Research-only / partner access through Google DeepMind. Not publicly downloadable.
Known limitations
- No direct low-level action output
- Requires downstream controller or planner
- Closed model - no public weights or API
- Spatial reasoning still imperfect on cluttered scenes
- Latency too high for tight inner control loops
- Generalisation depends on prompt and tool stack
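The latency and controller limitations above suggest a two-rate design, in which a slow reasoning call refreshes the goal only occasionally while a fast local controller runs on every tick. The toy simulation below sketches that pattern in one dimension. The simulate function, the plan callback (a stand-in for a reasoning-model call) and the proportional controller are all assumptions for illustration, not part of any Gemini Robotics-ER interface.

```javascript
// Hypothetical two-rate loop: the planner is consulted every `planEvery` ticks,
// while a simple proportional controller updates the position on every tick.
function simulate({ ticks, planEvery, gain, plan }) {
  let position = 0;
  let target = 0;
  let plannerCalls = 0;
  for (let t = 0; t < ticks; t++) {
    if (t % planEvery === 0) {
      target = plan(t, position); // slow, infrequent: new goal from the planner
      plannerCalls++;
    }
    position += gain * (target - position); // fast, every tick: track the goal
  }
  return { position, plannerCalls };
}

// The planner here always asks for position 1.0; a real one would reason over images.
const result = simulate({ ticks: 100, planEvery: 25, gain: 0.2, plan: () => 1.0 });
console.log(result);
```

In this run the planner is consulted four times over 100 controller ticks, which is the kind of ratio that keeps a high-latency model out of the inner loop.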
Related Models
View all VLA / Robotics
Gemini Robotics (2025)
Google DeepMind's vision-language-action model based on Gemini 2.0. Generalist robot policy with strong dexterity.
Google RT-2-X
Google's VLA from the RT-X collaboration. Trained on Open X-Embodiment (22 robots, 527 skills), with positive transfer reported.
LeRobot SmolVLA
Hugging Face's 450M-parameter VLA, pretrained on 487 community LeRobot datasets. Runs on consumer GPUs.
NVIDIA Cosmos-Predict-1
NVIDIA's world foundation model for physical AI. Diffusion-based video prediction for robotics simulation.
Start using Gemini Robotics-ER today
Get started with free credits. No credit card required. Access Gemini Robotics-ER and 100+ other models through a single API.