NVIDIA Cosmos-Predict-1

Custom
VLA / Robotics

NVIDIA's world foundation model for physical AI. Diffusion-based video prediction for robotics simulation.

Research-only model
NVIDIA Cosmos-Predict-1 is a research-only, open-weights world model and is not exposed via the Railwail API yet.
Not API-accessible
TL;DR · Last updated May 16, 2026

NVIDIA Cosmos-Predict-1 is a VLA / robotics AI model from Custom, listed at €0.000 per 1M input tokens with an unknown context window.

Direct API access coming soon

Pricing

Price per Generation
Per generation: Free

API Integration

Railwail exposes an OpenAI-compatible API. NVIDIA Cosmos-Predict-1 is not yet available through it, so the snippet below shows the call pattern to use once direct API access is enabled.

Install
npm install railwail
JavaScript / TypeScript
import railwail from "railwail";

const rw = railwail("YOUR_API_KEY");

// Simple — just pass a string
const reply = await rw.run("cosmos-predict-1", "Hello! What can you do?");
console.log(reply);

// With message history
const reply2 = await rw.run("cosmos-predict-1", [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Explain quantum computing simply." },
]);
console.log(reply2);

// Full response with usage info
const res = await rw.chat("cosmos-predict-1", [
  { role: "user", content: "Hello!" },
], { temperature: 0.7, max_tokens: 500 });
console.log(res.choices[0].message.content);
console.log(res.usage);
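
For orientation, "OpenAI-compatible" usually means the request body follows the chat-completions shape (`model`, `messages`, optional sampling parameters). The sketch below builds such a body for the two input styles shown above. The helper name `buildChatRequest` and its types are hypothetical, and how the `railwail` client actually maps `rw.run` and `rw.chat` onto a request is an assumption, not something this page documents.

```typescript
// Hypothetical helper: builds a chat-completions-style request body.
// Field names follow the OpenAI chat format; nothing here is Railwail-specific.
type Role = "system" | "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

interface ChatOptions {
  temperature?: number;
  max_tokens?: number;
}

interface ChatRequest extends ChatOptions {
  model: string;
  messages: ChatMessage[];
}

function buildChatRequest(
  model: string,
  input: string | ChatMessage[],
  options: ChatOptions = {},
): ChatRequest {
  // A bare string is treated as a single user message.
  const messages: ChatMessage[] =
    typeof input === "string" ? [{ role: "user", content: input }] : input;
  return { model, messages, ...options };
}

const body = buildChatRequest("cosmos-predict-1", "Hello!", {
  temperature: 0.7,
  max_tokens: 500,
});
console.log(JSON.stringify(body, null, 2));
```

Keeping the body construction separate from the network call makes it easy to inspect or log exactly what would be sent before the model becomes API-accessible.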
Specifications
Developer: Custom
Category: VLA / Robotics
Supported Formats: image, text
Tags: nvidia, cosmos, vla, robotics, research-only, open-weights, world-model

Deep dive: NVIDIA Cosmos-Predict-1

About NVIDIA
Founded 1993 · Santa Clara, California, USA

NVIDIA is the dominant supplier of GPUs for AI training and inference and runs a large in-house research organisation across robotics, simulation, and generative modelling. NVIDIA Cosmos was announced at CES 2025 as a family of 'World Foundation Models' (WFMs) for Physical AI - models that predict how the physical world evolves given video, language, and action conditioning. Cosmos is positioned as a developer platform for robotics and autonomous-vehicle teams to generate synthetic training data, run policy evaluations in simulation, and bootstrap Vision-Language-Action (VLA) pipelines. The 'Predict-1' track focuses on diffusion-based video-future prediction conditioned on text and/or first-frame inputs and ships in 7B and 14B parameter variants with open weights under the NVIDIA Open Model License.

Architecture
Diffusion-based world foundation model (text/video-to-video) for Physical AI

Cosmos-Predict-1 is a diffusion world model that predicts future video frames conditioned on text prompts, a starting frame, or short context clips. It uses a 3D causal video tokenizer (Cosmos Tokenizer) to compress video into spatio-temporal latents, then runs a Diffusion Transformer in latent space with cross-attention to text embeddings produced by a T5-XXL encoder. Training data is a curated corpus of ~20 million hours of driving, robotics, and human-activity video, filtered for motion quality, captioning coverage and safety. The model is not itself a VLA controller, but is the world-model backbone of NVIDIA's Cosmos stack: Cosmos-Predict generates rollouts; Cosmos-Reason adds VLM reasoning over predicted futures; and Cosmos-Transfer adapts simulation-to-real video. In a VLA pipeline it provides synthetic 'imagined' trajectories and dense reward / value signals, and is used to evaluate manipulation and driving policies offline at scale.
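
To make the "compress video into spatio-temporal latents" step concrete, here is a minimal sketch of how the latent grid size could be computed for a causal video tokenizer. The compression factors (8x temporal, 8x spatial), the input resolution, and the convention that the first frame is encoded on its own are assumptions chosen for illustration; this page does not state the actual values, and `latentShape` is a hypothetical helper.

```typescript
// Illustrative only: the factors and frame counts below are assumptions,
// not values documented on this page.
interface VideoShape {
  frames: number;
  height: number;
  width: number;
}

interface Compression {
  temporal: number; // frames merged per latent step (after the first frame)
  spatial: number;  // downsampling factor applied to height and width
}

// Assumed causal scheme: the first frame gets its own latent step, and the
// remaining frames are compressed in groups of `temporal`.
function latentShape(video: VideoShape, c: Compression): VideoShape {
  if ((video.frames - 1) % c.temporal !== 0) {
    throw new Error("frames - 1 must be divisible by the temporal factor");
  }
  if (video.height % c.spatial !== 0 || video.width % c.spatial !== 0) {
    throw new Error("height and width must be divisible by the spatial factor");
  }
  return {
    frames: 1 + (video.frames - 1) / c.temporal,
    height: video.height / c.spatial,
    width: video.width / c.spatial,
  };
}

const example = latentShape(
  { frames: 121, height: 704, width: 1280 },
  { temporal: 8, spatial: 8 },
);
console.log(example); // { frames: 16, height: 88, width: 160 }
```

Under these assumed factors the Diffusion Transformer would operate on a 16 x 88 x 160 latent grid rather than 121 full-resolution frames, which is the reason latent-space diffusion is tractable for video at all.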

Parameters: 7B and 14B variants (Predict-1)
Context: unknown
What it can do
  • Predicts future video conditioned on text, image, or video context
  • Two open-weight variants: Predict-1-7B and Predict-1-14B
  • Generates physically plausible motion for driving, manipulation, and humanoid scenes
  • Integrates with NVIDIA Isaac, Omniverse, and DRIVE pipelines
  • Used as synthetic data engine for VLA / autonomy training
  • Supports prompt upsampling via Cosmos-Reason VLM
  • Cosmos Tokenizer (3D causal VAE) can be reused as a video encoder
  • Released alongside Cosmos-Reason and Cosmos-Transfer for full Physical AI stack
  • Best for: synthetic data, world-model research, robotics simulation, AV training.
Training & License

Trained on ~20 million hours of curated physical-world video (driving, robotics manipulation, humanoid / first-person, navigation) sourced from licensed and open datasets, with multi-stage filtering for motion quality, caption alignment and safety. Text conditioning uses T5-XXL embeddings.

License: NVIDIA Open Model License - research-only / developer use with restrictions; weights downloadable from Hugging Face and NGC.

Known limitations
  • Short rollouts (a few seconds) before drift dominates
  • Not a controller - cannot produce robot actions on its own
  • Heavy GPU footprint for the 14B variant
  • Domain skew toward driving / Western indoor scenes
  • Output is video, not joint commands - needs downstream policy
  • Restricted commercial license (Open Model License, not Apache)
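
As a rough illustration of the GPU-footprint point, the memory needed just to hold the weights can be estimated as parameter count times bytes per parameter. The sketch below assumes 16-bit weights (2 bytes per parameter), which this page does not confirm, and it ignores activations, the video tokenizer, and the text encoder, so real requirements will be higher. `weightMemoryGB` is a hypothetical helper.

```typescript
// Weights-only estimate in decimal gigabytes (1 GB = 1e9 bytes).
// paramsBillions * 1e9 params * bytesPerParam / 1e9 bytes per GB
function weightMemoryGB(paramsBillions: number, bytesPerParam: number): number {
  return paramsBillions * bytesPerParam;
}

const BYTES_PER_PARAM_16BIT = 2; // assumption: bf16 / fp16 weights

console.log(weightMemoryGB(7, BYTES_PER_PARAM_16BIT));  // 14
console.log(weightMemoryGB(14, BYTES_PER_PARAM_16BIT)); // 28
```

Even this lower bound puts the 14B variant at roughly double the 7B variant before any inference-time memory is counted.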
