How much does CogVideoX-5B (open) cost via Railwail?

CogVideoX-5B (open) is currently listed as free per generation. There is no monthly minimum and no subscription, and new accounts start with €5 in free credits.

What is the context window of CogVideoX-5B (open)?

CogVideoX-5B (open) is a video generation model, so it has no chat-style context window. Its text prompt is limited to 226 tokens.

How fast is CogVideoX-5B (open)?

Video generation runs as an asynchronous job, so latency is measured in seconds rather than milliseconds: typically 30 s to 2 min, depending on prompt and load. We measure p50/p95 in real time on /rankings.

Is CogVideoX-5B (open) better than Google Veo 2?

It depends on your use case. CogVideoX-5B (open) (Replicate) and Google Veo 2 (Google DeepMind) are both strong choices in video generation. Compare them side-by-side at /compare/cogvideox-5b-open-vs-google-veo-2.

Does CogVideoX-5B (open) support image input (vision)?

Yes. CogVideoX-5B (open) accepts image inputs in addition to text; the image-to-video variant (CogVideoX-5B-I2V) conditions generation on a first frame. Send images via the standard OpenAI-compatible `messages` array with `image_url` content blocks. Supported input formats: text, image.
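The request shape described in this answer can be sketched as follows. This is a minimal sketch that only builds the payload: the `buildImageToVideoMessages` helper and the example URL are hypothetical, and how Railwail routes `image_url` blocks to the image-to-video variant is an assumption to check against the API documentation.

```typescript
// Content-block types following the OpenAI-compatible chat format named above.
type TextBlock = { type: "text"; text: string };
type ImageBlock = { type: "image_url"; image_url: { url: string } };
type Message = {
  role: "user" | "system";
  content: string | Array<TextBlock | ImageBlock>;
};

// Hypothetical helper: pairs a text prompt with a first-frame image.
function buildImageToVideoMessages(prompt: string, imageUrl: string): Message[] {
  // Accept only http(s) URLs or inline data:image/ URIs.
  if (!/^https?:\/\/|^data:image\//.test(imageUrl)) {
    throw new Error("imageUrl must be an http(s) URL or a data:image/ URI");
  }
  return [
    {
      role: "user",
      content: [
        { type: "text", text: prompt },
        { type: "image_url", image_url: { url: imageUrl } },
      ],
    },
  ];
}

const messages = buildImageToVideoMessages(
  "The boat drifts slowly toward the horizon",
  "https://example.com/first-frame.png", // placeholder URL
);
console.log(JSON.stringify(messages, null, 2));
```

The resulting array is what would be passed in place of the plain message list in a chat-style call.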

CogVideoX-5B (open)


Replicate

Video Generation

Zhipu/Tsinghua's 5B open text-to-video model. 720x480 @ 8fps, 6s clips, image-to-video variant available.

Queue video with CogVideoX-5B (open)

Video generation runs asynchronously: we queue a job, which typically takes 30 s to 2 min, and you can track it in your history.
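Tracking a queued job from code usually comes down to polling its status until it finishes. The sketch below shows one way to do that with a growing delay and a timeout sized to the 30 s to 2 min range quoted above. The `JobStatus` shape, the state names, and the `fetchStatus` callback are all hypothetical stand-ins; this page does not document a job-status endpoint.

```typescript
// Job states assumed for illustration; the real API may use other names.
type JobStatus = {
  state: "queued" | "running" | "succeeded" | "failed";
  videoUrl?: string;
};

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Grows the delay by 1.5x per attempt, capped at maxDelayMs.
function nextDelay(delayMs: number, maxDelayMs: number): number {
  return Math.min(delayMs * 1.5, maxDelayMs);
}

// Polls `fetchStatus` until the job succeeds, fails, or `timeoutMs` elapses.
async function waitForVideo(
  fetchStatus: () => Promise<JobStatus>,
  { initialDelayMs = 5_000, maxDelayMs = 20_000, timeoutMs = 180_000 } = {},
): Promise<string> {
  const deadline = Date.now() + timeoutMs;
  let delay = initialDelayMs;
  while (Date.now() < deadline) {
    const status = await fetchStatus();
    if (status.state === "succeeded" && status.videoUrl) return status.videoUrl;
    if (status.state === "failed") throw new Error("video job failed");
    await sleep(delay);
    delay = nextDelay(delay, maxDelayMs);
  }
  throw new Error("timed out waiting for video job");
}

// Usage with a stub that succeeds on the third poll (stands in for a real status call).
let calls = 0;
const stubStatus = async (): Promise<JobStatus> =>
  ++calls < 3
    ? { state: "running" }
    : { state: "succeeded", videoUrl: "https://example.com/out.mp4" };

waitForVideo(stubStatus, { initialDelayMs: 10, maxDelayMs: 20, timeoutMs: 1_000 })
  .then((url) => console.log("video ready:", url));
```

Starting at a few seconds and capping the delay keeps request volume low for jobs that are known to take tens of seconds at minimum.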

TL;DR · Last updated June 24, 2026

CogVideoX-5B (open) is a video generation AI model developed by THUDM / Zhipu AI and served via Replicate. It is currently free per generation and accepts text prompts of up to 226 tokens.


Pricing

Price per Generation

Per generation: Free

API Integration

Use our OpenAI-compatible API to integrate CogVideoX-5B (open) into your application.

Install

npm install railwail

JavaScript / TypeScript

import railwail from "railwail";

const rw = railwail("YOUR_API_KEY");

// Simple: pass the text prompt as a string
const reply = await rw.run("cogvideox-5b-open", "A red fox running through fresh snow at sunrise, cinematic lighting");
console.log(reply);

// With message history
const reply2 = await rw.run("cogvideox-5b-open", [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Explain quantum computing simply." },
]);
console.log(reply2);

// Full response with usage info
const res = await rw.chat("cogvideox-5b-open", [
  { role: "user", content: "Hello!" },
], { temperature: 0.7, max_tokens: 500 });
console.log(res.choices[0].message.content);
console.log(res.usage);

Specifications

Developer

THUDM (Tsinghua University KEG Lab) / Zhipu AI, served via Replicate

Deep dive — THUDM (Tsinghua University KEG Lab) / Zhipu AI's CogVideoX-5B (open)

About THUDM (Tsinghua University KEG Lab) / Zhipu AI

Founded 2019 · Beijing, China

THUDM (the Knowledge Engineering Group at Tsinghua University) and its commercial arm Zhipu AI are among China's leading open-source generative-AI research groups. The lab is best known for the GLM language-model family (GLM-130B, ChatGLM) and the CogVLM/CogVideo vision projects. CogVideo was introduced in 2022 as one of the first publicly released 9B-parameter text-to-video transformers, followed by CogVideoX in August 2024, which released 2B and 5B variants under an open-weight licence. Zhipu AI raised more than $1.5B by 2025 and is one of the four 'AI tigers' of China alongside Moonshot, MiniMax and Baichuan. CogVideoX is widely adopted as a research and fine-tuning base because of its permissive licence and reproducible training recipe.


Architecture

Diffusion Transformer (DiT) with expert Transformer blocks and 3D causal VAE

CogVideoX is a latent video diffusion model built around a 3D causal Variational Autoencoder (VAE) that compresses video into a compact latent grid along both the spatial and temporal axes. On top of this latent space, an Expert Transformer (a diffusion transformer with separate text and video expert streams sharing self-attention) jointly denoises text-conditioned latent video at multiple resolutions. The architecture uses 3D Rotary Position Embeddings (3D-RoPE) for spatio-temporal positions, Adaptive LayerNorm modulated by the diffusion timestep, and a v-prediction diffusion objective for training. Training proceeds in stages: first low-resolution images, then short low-resolution videos, then high-resolution video up to 720x480 at 8 fps for 6 seconds. The team curated a large filtered video corpus with dense bilingual captions produced by a fine-tuned vision-language model. CogVideoX-5B supports text-to-video, and an image-to-video variant (CogVideoX-5B-I2V) is also available.

Parameters: 5 billion (also 2B variant)
Context: 226 tokens
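To make the figures above concrete, the following sketch works out how many tokens the transformer attends over for one native clip. The clip settings and the 226-token prompt limit come from this page; the compression factors (4x temporal, 8x spatial) and the 2x2 patch size are assumptions taken from the CogVideoX paper and should be checked against it.

```typescript
// Clip settings and prompt limit stated on this page.
const width = 720, height = 480, fps = 8, seconds = 6, textTokens = 226;

// Assumptions from the CogVideoX paper, not from this page:
// the 3D causal VAE compresses 4x in time and 8x along each spatial axis,
// and the transformer groups latents into 2x2 spatial patches.
const temporalStride = 4, spatialStride = 8, patchSize = 2;

const frames = fps * seconds + 1;                        // first frame plus 48 more
const latentFrames = (frames - 1) / temporalStride + 1;  // causal VAE keeps the first frame
const latentH = height / spatialStride;
const latentW = width / spatialStride;
const videoTokens = latentFrames * (latentH / patchSize) * (latentW / patchSize);
const sequenceLength = videoTokens + textTokens;

console.log({ frames, latentFrames, latentH, latentW, videoTokens, sequenceLength });
// → frames 49, latentFrames 13, latentH 60, latentW 90, videoTokens 17550, sequenceLength 17776
```

Under these assumptions the text prompt is a small fraction of the sequence, which is why the 226-token limit describes the prompt rather than a chat-style context window.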

What it can do

Open-weight 5B text-to-video model (weights released under the CogVideoX Model Licence)
Generates 6-second clips at 720x480 / 8 fps natively (~49 frames)
Image-to-video variant CogVideoX-5B-I2V for conditioning on a first frame
Strong prompt adherence on complex compositions and motion verbs
Bilingual English/Chinese prompting
Runs on a single 24-48 GB consumer-class GPU with optimisations (CPU offload, INT8)
Fine-tunable with LoRA and full-parameter fine-tuning
Frequently extended by the community for longer durations via temporal tiling
Best for: research, open-source pipelines, custom fine-tunes, on-prem video generation.
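The GPU-memory point in the list above follows from simple arithmetic on the parameter count. The sketch below estimates weight-only memory for the 5-billion-parameter transformer at a few precisions. It is a rough illustration: it ignores the text encoder, the VAE, and activations, so real usage is higher, and the helper name is hypothetical.

```typescript
// Weight-only memory for a 5-billion-parameter model at several precisions.
const params = 5e9;
const bytesPerParam = { fp32: 4, bf16: 2, int8: 1 } as const;
type Precision = keyof typeof bytesPerParam;

// Hypothetical helper: gigabytes (10^9 bytes) needed to hold the weights.
function weightGigabytes(precision: Precision): number {
  return (params * bytesPerParam[precision]) / 1e9;
}

for (const p of Object.keys(bytesPerParam) as Precision[]) {
  console.log(`${p}: ~${weightGigabytes(p)} GB of weights`);
}
// → fp32: ~20 GB, bf16: ~10 GB, int8: ~5 GB
```

Halving the bytes per parameter halves the weight footprint, which is the effect INT8 quantisation has here; CPU offload instead moves idle components out of GPU memory.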

Training & License

Trained on a curated multi-million-clip video corpus with bilingual dense captions generated by a fine-tuned video captioning model. Data is heavily filtered for aesthetic quality, motion coherence and caption alignment. Exact token / clip counts are reported in the paper.

License: Open weights under the CogVideoX Model Licence (free for research and commercial use with attribution).

Known limitations

Maximum native duration ~6 seconds
Resolution capped at 720x480 in the 5B base model
No audio generation
Slower than closed commercial APIs on similar hardware
Occasional anatomical artifacts and limb drift on fast motion


Related Models


Google Veo 2

Google DeepMind

Google's state-of-the-art video generation model. Simulates real-world physics with various visual styles.

€5.00

Google Veo 3

Google DeepMind

Google's Veo 3. High-fidelity text-to-video with native audio generation, up to 8s clips.

€0.75

Google Veo 3 (Replicate)

Google DeepMind

Google's Veo 3 served via Replicate. Text-to-video with native synchronized audio generation. High-fidelity motion and scene coherence in short clips.

€8.00

Google Veo 3.1

Google DeepMind

Latest Veo with image-to-video and context-aware audio

€6.00

Start using CogVideoX-5B (open) today

Get started with free credits. No credit card required. Access CogVideoX-5B (open) and 100+ other models through a single API.

