How much does NVIDIA Cosmos-Predict-1 cost via Railwail?

No monthly minimum, no subscription. Start with €5 free credits.

What is the context window of NVIDIA Cosmos-Predict-1?

The context window of NVIDIA Cosmos-Predict-1 is listed as unknown.

How fast is NVIDIA Cosmos-Predict-1?

Latency depends on prompt length and load, and is typically 200 ms to 2 s for short prompts. We measure p50/p95 in real time on /rankings.
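For readers unfamiliar with p50/p95, here is a minimal sketch of how such percentiles could be computed from raw latency samples. It uses the nearest-rank method, which is an assumption: the page does not say how /rankings computes its figures. The sample values are made up for illustration.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[rank - 1];
}

// Hypothetical latency samples in milliseconds (illustrative, not measured).
const latenciesMs = [200, 250, 300, 400, 2000];
const p50 = percentile(latenciesMs, 50); // 300
const p95 = percentile(latenciesMs, 95); // 2000
console.log({ p50, p95 });
```

With only five samples, a single slow request sets the p95, which is why tail percentiles are usually read over large sample windows.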

Is NVIDIA Cosmos-Predict-1 better than Gemini Robotics (2025)?

It depends on your use case. NVIDIA Cosmos-Predict-1 (Custom) and Gemini Robotics (2025) (Google DeepMind) are both strong choices in VLA / robotics. Compare them side-by-side at /compare/cosmos-predict-1-vs-gemini-robotics-2025.

Does NVIDIA Cosmos-Predict-1 support image input (vision)?

Yes. NVIDIA Cosmos-Predict-1 accepts image inputs in addition to text. Send images via the standard OpenAI-compatible `messages` array with `image_url` content blocks. Supported input types: image, text.
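As a rough sketch, a user message carrying an `image_url` content block in the OpenAI-compatible shape might be built as follows. The helper name `buildImageMessage`, the prompt text and the example URL are illustrative assumptions. Whether Railwail accepts this payload for this model is not confirmed here, since the page also states the model is not yet exposed via the API.

```typescript
// Content parts in the OpenAI-compatible chat format.
type TextPart = { type: "text"; text: string };
type ImagePart = { type: "image_url"; image_url: { url: string } };
type ChatMessage = {
  role: "system" | "user" | "assistant";
  content: string | Array<TextPart | ImagePart>;
};

// Hypothetical helper: pairs a text prompt with one image reference.
function buildImageMessage(prompt: string, imageUrl: string): ChatMessage {
  return {
    role: "user",
    content: [
      { type: "text", text: prompt },
      { type: "image_url", image_url: { url: imageUrl } },
    ],
  };
}

// Placeholder URL, for illustration only.
const message = buildImageMessage(
  "Describe the scene in this frame.",
  "https://example.com/frame.png",
);
console.log(JSON.stringify(message, null, 2));
```

The resulting object would go into the `messages` array alongside any system or earlier user messages.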

NVIDIA Cosmos-Predict-1

Custom

VLA / Robotics

NVIDIA's world foundation model for physical AI. Diffusion-based video prediction for robotics simulation.

Research-only model

NVIDIA Cosmos-Predict-1 runs on physical robot hardware and is not exposed via the Railwail API yet.

Not API-accessible

TL;DR · Last updated June 24, 2026

NVIDIA Cosmos-Predict-1 is a VLA / robotics AI model from Custom, priced at €0.000 per 1M input tokens with an unknown context window.

Direct API access coming soon

Pricing

Price per Generation

Per generation: Free

API Integration

Use our OpenAI-compatible API to integrate NVIDIA Cosmos-Predict-1 into your application.

Install

npm install railwail

JavaScript / TypeScript

import railwail from "railwail";

const rw = railwail("YOUR_API_KEY");

// Simple — just pass a string
const reply = await rw.run("cosmos-predict-1", "Hello! What can you do?");
console.log(reply);

// With message history
const reply2 = await rw.run("cosmos-predict-1", [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Explain quantum computing simply." },
]);
console.log(reply2);

// Full response with usage info
const res = await rw.chat("cosmos-predict-1", [
  { role: "user", content: "Hello!" },
], { temperature: 0.7, max_tokens: 500 });
console.log(res.choices[0].message.content);
console.log(res.usage);
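
Because this page lists the model as not yet exposed via the API, calls like the ones above may fail. Below is a minimal sketch of a guard around such a call. The `RunFn` type is an assumed shape that mirrors the string form of `rw.run` shown above, and `tryRun` and `unavailableRun` are hypothetical names. It also assumes the client signals failure by rejecting its promise, which this page does not confirm.

```typescript
// Assumed shape of a run function such as rw.run (string-prompt form only).
type RunFn = (model: string, prompt: string) => Promise<string>;
type RunResult = { ok: true; reply: string } | { ok: false; error: string };

// Hypothetical wrapper: converts a rejected call into a plain result object.
async function tryRun(run: RunFn, model: string, prompt: string): Promise<RunResult> {
  try {
    const reply = await run(model, prompt);
    return { ok: true, reply };
  } catch (err) {
    const error = err instanceof Error ? err.message : String(err);
    return { ok: false, error };
  }
}

// Stand-in for a client call that always rejects, so the sketch runs offline.
const unavailableRun: RunFn = async (model) => {
  throw new Error(`model ${model} is not available`);
};

tryRun(unavailableRun, "cosmos-predict-1", "Hello!").then((result) => {
  console.log(result);
});
```

In an application, `unavailableRun` would be replaced with a function that forwards to the real client, for example `(model, prompt) => rw.run(model, prompt)`.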

Specifications

Developer

Custom

Deep dive: NVIDIA's Cosmos-Predict-1

About NVIDIA

Founded 1993 · Santa Clara, California, USA

NVIDIA is the dominant supplier of GPUs for AI training and inference and runs a large in-house research organisation across robotics, simulation, and generative modelling. NVIDIA Cosmos was announced at CES 2025 as a family of 'World Foundation Models' (WFMs) for Physical AI - models that predict how the physical world evolves given video, language, and action conditioning. Cosmos is positioned as a developer platform for robotics and autonomous-vehicle teams to generate synthetic training data, run policy evaluations in simulation, and bootstrap Vision-Language-Action (VLA) pipelines. The 'Predict-1' track focuses on diffusion-based video-future prediction conditioned on text and/or first-frame inputs and ships in 7B and 14B parameter variants with open weights under the NVIDIA Open Model License.

Architecture

Diffusion-based world foundation model (text/video-to-video) for Physical AI

Cosmos-Predict-1 is a diffusion world model that predicts future video frames conditioned on text prompts, a starting frame, or short context clips. It uses a 3D causal video tokenizer (Cosmos Tokenizer) to compress video into spatio-temporal latents, then runs a Diffusion Transformer in latent space with cross-attention to text embeddings produced by a T5-XXL encoder. Training data is a curated corpus of ~20 million hours of driving, robotics, and human-activity video, filtered for motion quality, captioning coverage and safety. The model is not itself a VLA controller, but is the world-model backbone of NVIDIA's Cosmos stack: Cosmos-Predict generates rollouts; Cosmos-Reason adds VLM reasoning over predicted futures; and Cosmos-Transfer adapts simulation-to-real video. In a VLA pipeline it provides synthetic 'imagined' trajectories and dense reward / value signals, and is used to evaluate manipulation and driving policies offline at scale.
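
To make the tokenizer step concrete, here is a small sketch of the shape arithmetic for a causal 3D video tokenizer. The compression factors (8x temporal, 8x8 spatial), the treatment of the first frame as its own latent frame, and the example clip size are assumptions chosen for illustration. This section does not state the actual factors used by Cosmos Tokenizer.

```typescript
type VideoShape = { frames: number; height: number; width: number };

// Hypothetical helper: latent grid size for a causal tokenizer that encodes
// the first frame alone and then groups the remaining frames temporally.
function latentShape(
  video: VideoShape,
  temporalFactor = 8, // assumed temporal compression
  spatialFactor = 8,  // assumed spatial compression per axis
): VideoShape {
  const { frames, height, width } = video;
  if (frames < 1 || (frames - 1) % temporalFactor !== 0) {
    throw new Error("frames must be 1 + k * temporalFactor");
  }
  if (height % spatialFactor !== 0 || width % spatialFactor !== 0) {
    throw new Error("height and width must be divisible by spatialFactor");
  }
  return {
    frames: 1 + (frames - 1) / temporalFactor,
    height: height / spatialFactor,
    width: width / spatialFactor,
  };
}

// Illustrative clip: 121 frames at 704 x 1280 pixels.
const clip: VideoShape = { frames: 121, height: 704, width: 1280 };
const latent = latentShape(clip);
console.log(latent); // { frames: 16, height: 88, width: 160 }
```

Under these assumed factors, the Diffusion Transformer would denoise a 16 x 88 x 160 latent grid instead of 121 full-resolution frames, which is the main reason diffusion is run in latent space.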

Parameters: 7B and 14B variants (Predict-1)
Context: unknown

What it can do

Predicts future video conditioned on text, image, or video context
Two open-weight variants: Predict-1-7B and Predict-1-14B
Generates physically plausible motion for driving, manipulation, and humanoid scenes
Integrates with NVIDIA Isaac, Omniverse, and DRIVE pipelines
Used as a synthetic data engine for VLA / autonomy training
Supports prompt upsampling via Cosmos-Reason VLM
Cosmos Tokenizer (3D causal VAE) can be reused as a video encoder
Released alongside Cosmos-Reason and Cosmos-Transfer for a full Physical AI stack
Best for: synthetic data, world-model research, robotics simulation, AV training.

Training & License

Trained on ~20 million hours of curated physical-world video (driving, robotics manipulation, humanoid / first-person, navigation) sourced from licensed and open datasets, with multi-stage filtering for motion quality, caption alignment and safety. Text conditioning uses T5-XXL embeddings.

License: NVIDIA Open Model License - research-only / developer use with restrictions; weights downloadable from Hugging Face and NGC.

Known limitations