NVIDIA Cosmos-Predict-1
NVIDIA's world foundation model for physical AI. Diffusion-based video prediction for robotics simulation.
NVIDIA Cosmos-Predict-1 is a VLA / robotics AI model from Custom, priced at €0.000 per 1M input tokens, with an unknown context window.
Pricing
API Integration
Use our OpenAI-compatible API to integrate NVIDIA Cosmos-Predict-1 into your application.
npm install railwail
import railwail from "railwail";
const rw = railwail("YOUR_API_KEY");
// Simple — just pass a string
const reply = await rw.run("cosmos-predict-1", "Hello! What can you do?");
console.log(reply);
// With message history
const reply2 = await rw.run("cosmos-predict-1", [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Explain quantum computing simply." },
]);
console.log(reply2);
// Full response with usage info
const res = await rw.chat("cosmos-predict-1", [
  { role: "user", content: "Hello!" },
], { temperature: 0.7, max_tokens: 500 });
console.log(res.choices[0].message.content);
console.log(res.usage);
Deep dive: NVIDIA Cosmos-Predict-1
NVIDIA is the dominant supplier of GPUs for AI training and inference and runs a large in-house research organisation across robotics, simulation, and generative modelling. NVIDIA Cosmos was announced at CES 2025 as a family of 'World Foundation Models' (WFMs) for Physical AI: models that predict how the physical world evolves given video, language, and action conditioning. Cosmos is positioned as a developer platform for robotics and autonomous-vehicle teams to generate synthetic training data, run policy evaluations in simulation, and bootstrap Vision-Language-Action (VLA) pipelines. The 'Predict-1' track focuses on diffusion-based prediction of future video conditioned on text and/or first-frame inputs. It ships in 7B and 14B parameter variants with open weights under the NVIDIA Open Model License.
Cosmos-Predict-1 is a diffusion world model that predicts future video frames conditioned on text prompts, a starting frame, or short context clips. It uses a 3D causal video tokenizer (Cosmos Tokenizer) to compress video into spatio-temporal latents, then runs a Diffusion Transformer in latent space with cross-attention to text embeddings produced by a T5-XXL encoder. Training data is a curated corpus of ~20 million hours of driving, robotics, and human-activity video, filtered for motion quality, captioning coverage and safety. The model is not itself a VLA controller, but is the world-model backbone of NVIDIA's Cosmos stack: Cosmos-Predict generates rollouts; Cosmos-Reason adds VLM reasoning over predicted futures; and Cosmos-Transfer adapts simulation-to-real video. In a VLA pipeline it provides synthetic 'imagined' trajectories and dense reward / value signals, and is used to evaluate manipulation and driving policies offline at scale.
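To make the tokenizer step concrete, the sketch below computes the size of the latent grid that a causal video tokenizer would hand to the Diffusion Transformer. The helper name latentShape and the 8x temporal / 8x spatial compression factors are illustrative assumptions, not values stated on this page. It also assumes the common causal convention in which the first frame is encoded on its own and the remaining frames are compressed in groups.

```javascript
// Hypothetical helper: estimates the latent grid produced by a causal
// spatio-temporal video tokenizer. The compression factors are assumptions.
function latentShape(frames, height, width, { temporal = 8, spatial = 8 } = {}) {
  if (!Number.isInteger(frames) || frames < 1 || (frames - 1) % temporal !== 0) {
    throw new RangeError(`frames must be 1 + k * ${temporal}`);
  }
  if (height % spatial !== 0 || width % spatial !== 0) {
    throw new RangeError(`height and width must be multiples of ${spatial}`);
  }
  return {
    // The first frame is encoded alone; the rest are grouped `temporal` at a time.
    latentFrames: 1 + (frames - 1) / temporal,
    latentHeight: height / spatial,
    latentWidth: width / spatial,
  };
}

// Example clip: 121 frames at 704 x 1280 pixels (an illustrative size).
const shape = latentShape(121, 704, 1280);
const latentPositions = shape.latentFrames * shape.latentHeight * shape.latentWidth;
console.log(shape, latentPositions);
```

Under these assumed factors, the diffusion model works on a 16 x 88 x 160 latent grid instead of 121 full-resolution frames, which is why running the transformer in latent space is far cheaper than operating on pixels.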
- Parameters: 7B and 14B variants (Predict-1)
- Context: unknown
- Predicts future video conditioned on text, image, or video context
- Two open-weight variants: Predict-1-7B and Predict-1-14B
- Generates physically plausible motion for driving, manipulation, and humanoid scenes
- Integrates with NVIDIA Isaac, Omniverse, and DRIVE pipelines
- Used as a synthetic data engine for VLA / autonomy training
- Supports prompt upsampling via Cosmos-Reason VLM
- Cosmos Tokenizer (3D causal VAE) can be reused as a video encoder
- Released alongside Cosmos-Reason and Cosmos-Transfer for a full Physical AI stack
- Best for: synthetic data, world-model research, robotics simulation, AV training.
Trained on ~20 million hours of curated physical-world video (driving, robotics manipulation, humanoid / first-person, navigation) sourced from licensed and open datasets, with multi-stage filtering for motion quality, caption alignment and safety. Text conditioning uses T5-XXL embeddings.
License: NVIDIA Open Model License, a custom NVIDIA license with its own usage conditions (not Apache 2.0); weights are downloadable from Hugging Face and NGC.
Known limitations
- Short rollouts (a few seconds) before drift dominates
- Not a controller - cannot produce robot actions on its own
- Heavy GPU footprint for the 14B variant
- Domain skew toward driving / Western indoor scenes
- Output is video, not joint commands - needs downstream policy
- Restricted commercial license (Open Model License, not Apache)
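The GPU-footprint point can be put in rough numbers. The sketch below estimates the memory needed just to hold the diffusion model's weights, assuming 16-bit (2-byte) parameters. The helper name weightMemoryGB is hypothetical, and the estimate leaves out activations, the video tokenizer, and the text encoder, so real usage will be higher.

```javascript
// Hypothetical helper: memory for model weights only, in decimal gigabytes.
// Assumes every parameter is stored at `bytesPerParam` bytes (2 = fp16/bf16).
function weightMemoryGB(paramsBillions, bytesPerParam = 2) {
  // (paramsBillions * 1e9 params) * bytesPerParam / (1e9 bytes per GB)
  return paramsBillions * bytesPerParam;
}

for (const size of [7, 14]) {
  console.log(`${size}B weights: ~${weightMemoryGB(size)} GB at 16-bit precision`);
}
```

By this estimate the 14B variant needs about 28 GB for weights alone versus about 14 GB for the 7B variant, before any memory for the latent video being denoised.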
Related Models
Gemini Robotics (2025)
Google DeepMind's vision-language-action model based on Gemini 2.0. Generalist robot policy with strong dexterity.
Gemini Robotics-ER
Embodied-reasoning variant of Gemini Robotics. Enhanced 3D spatial reasoning and trajectory planning.
Google RT-2-X
Google's VLA from the RT-X collaboration. Trained on Open-X-Embodiment (22 robots, 527 skills) and shows positive transfer.
LeRobot SmolVLA
Hugging Face's 450M VLA pretrained on 487 community LeRobot datasets. Runs on consumer GPUs.
Start using NVIDIA Cosmos-Predict-1 today
Get started with free credits. No credit card required. Access NVIDIA Cosmos-Predict-1 and 100+ other models through a single API.