CogVideoX-5B (open)

Zhipu/Tsinghua's 5B open text-to-video model. 720x480 @ 8fps, 6s clips, image-to-video variant available.

Queue video with CogVideoX-5B (open)
Video generation runs asynchronously — we'll queue a job and you can track it in your history.
Generates as an async job — typically 30 s to 2 min.
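
Because generation runs as a queued job, client code usually polls for completion. The sketch below shows one way to do that. The `JobStatus` shape and the `getStatus` callback are hypothetical placeholders, not part of any documented API; swap in whatever status call your client actually exposes.

```typescript
// Hypothetical job shape: these field names are assumptions for illustration.
type JobStatus =
  | { state: "queued" | "running" }
  | { state: "succeeded"; videoUrl: string }
  | { state: "failed"; error: string };

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Polls `getStatus` until the job succeeds, fails, or `timeoutMs` elapses.
async function waitForVideo(
  getStatus: () => Promise<JobStatus>,
  intervalMs = 5_000,
  timeoutMs = 5 * 60_000,
): Promise<string> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const status = await getStatus();
    if (status.state === "succeeded") return status.videoUrl;
    if (status.state === "failed") throw new Error(status.error);
    await sleep(intervalMs);
  }
  throw new Error("Timed out waiting for the video job");
}
```

A 5 s interval with a 5 min timeout leaves headroom over the typical 30 s to 2 min generation time.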
TL;DR · Last updated May 16, 2026

CogVideoX-5B (open) is a video generation AI model from Replicate, priced at €0.000 per 1M input tokens, with no context window listed.

Pricing

Price per Generation
Per generation: Free

API Integration

Use our OpenAI-compatible API to integrate CogVideoX-5B (open) into your application.

Install
npm install railwail
JavaScript / TypeScript
import railwail from "railwail";

const rw = railwail("YOUR_API_KEY");

// Simple — just pass a string
const reply = await rw.run("cogvideox-5b-open", "Hello! What can you do?");
console.log(reply);

// With message history
const reply2 = await rw.run("cogvideox-5b-open", [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Explain quantum computing simply." },
]);
console.log(reply2);

// Full response with usage info
const res = await rw.chat("cogvideox-5b-open", [
  { role: "user", content: "Hello!" },
], { temperature: 0.7, max_tokens: 500 });
console.log(res.choices[0].message.content);
console.log(res.usage);

Specifications
Developer: Replicate
Category: Video Generation
Supported Formats: text, image
Tags: zhipu, tsinghua, cogvideox, text-to-video, open-weights, pricing-tbd

Deep dive — THUDM (Tsinghua University KEG Lab) / Zhipu AI's CogVideoX-5B (open)

About THUDM (Tsinghua University KEG Lab) / Zhipu AI
Founded 2019 · Beijing, China

THUDM (the Knowledge Engineering Group at Tsinghua University) and its commercial arm Zhipu AI are among China's leading open-source generative-AI research groups. The lab is best known for the GLM language-model family (GLM-130B, ChatGLM) and the CogVLM/CogVideo vision projects. CogVideo was introduced in 2022 as one of the first publicly released 9B-parameter text-to-video transformers, followed by CogVideoX in August 2024, which released 2B and 5B variants under an open-weight licence. Zhipu AI raised more than $1.5B by 2025 and is one of the four 'AI tigers' of China alongside Moonshot, MiniMax and Baichuan. CogVideoX is widely adopted as a research and fine-tuning base because of its permissive licence and reproducible training recipe.

Architecture
Diffusion Transformer (DiT) with expert Transformer blocks and 3D causal VAE

CogVideoX is a latent video diffusion model built around a 3D causal Variational Autoencoder that compresses video into a compact latent grid along both the spatial and temporal axes. On top of this latent space, an Expert Transformer (a diffusion transformer in which text and video tokens get separate expert normalisation but share full self-attention) denoises the text-conditioned video latents. The architecture uses 3D Rotary Position Embeddings (3D-RoPE) for spatio-temporal positions, an Adaptive LayerNorm modulated by the diffusion timestep, and a v-prediction diffusion objective with a zero-terminal-SNR noise schedule. Training proceeds in stages: first low-resolution images, then short low-resolution videos, then higher-resolution video up to 720x480 at 8 fps for 6 seconds. The team curated a large filtered video corpus with dense bilingual captions produced by a fine-tuned vision-language model. CogVideoX-5B supports text-to-video, and an image-to-video variant (CogVideoX-5B-I2V) is also available.
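
To make the latent compression concrete, here is a small arithmetic sketch of how a 6-second clip maps to a transformer sequence. The compression factors (4x temporal, 8x8 spatial) and the 2x2 patch size are assumptions taken from the CogVideoX paper rather than from this page, and the helper names are illustrative only.

```typescript
// Assumed figures (CogVideoX paper, not stated on this page):
// 4x temporal and 8x8 spatial VAE compression, then 2x2 spatial patches.
const TEMPORAL_FACTOR = 4;
const SPATIAL_FACTOR = 8;
const PATCH = 2;
const TEXT_TOKENS = 226; // prompt length limit listed under "Context"

interface Clip {
  width: number;
  height: number;
  seconds: number;
  fps: number;
}

function frameCount(c: Clip): number {
  // One extra leading frame: the causal VAE encodes the first frame on its own.
  return c.seconds * c.fps + 1;
}

function videoTokens(c: Clip): number {
  const latentFrames = (frameCount(c) - 1) / TEMPORAL_FACTOR + 1;
  const tokensPerFrame =
    (c.width / SPATIAL_FACTOR / PATCH) * (c.height / SPATIAL_FACTOR / PATCH);
  return latentFrames * tokensPerFrame;
}

const clip: Clip = { width: 720, height: 480, seconds: 6, fps: 8 };
console.log(frameCount(clip)); // 49
console.log(videoTokens(clip) + TEXT_TOKENS); // 17776
```

Under these assumptions the 49 input frames become 13 latent frames of 45x30 patches each, so attention runs over roughly 17.8k tokens per clip.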

Parameters
5 billion (also 2B variant)
Context
226 tokens
What it can do
  • Open-weight 5B text-to-video model (weights released under the CogVideoX Model Licence)
  • Generates 6-second clips at 720x480 / 8 fps natively (~49 frames)
  • Image-to-video variant CogVideoX-5B-I2V for conditioning on a first frame
  • Strong prompt adherence on complex compositions and motion verbs
  • Bilingual English/Chinese prompting
  • Runs on a single 24-48 GB consumer-class GPU with optimisations (CPU offload, INT8)
  • Fine-tunable with LoRA and full-parameter fine-tuning
  • Frequently extended by the community for longer durations via temporal tiling
  • Best for: research, open-source pipelines, custom fine-tunes, on-prem video generation.
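
As a rough guide to the GPU-memory bullet above, the weight footprint scales with parameter count times bytes per parameter. The sketch below is a back-of-the-envelope estimate only: it counts the 5B transformer weights and ignores the text encoder, the VAE, and activation memory, all of which add to the real total. The function and constant names are illustrative.

```typescript
type Precision = "fp32" | "bf16" | "int8";

const BYTES_PER_PARAM: Record<Precision, number> = {
  fp32: 4,
  bf16: 2,
  int8: 1,
};

// Weight-only estimate in gigabytes (10^9 bytes); excludes activations.
function weightGigabytes(params: number, precision: Precision): number {
  return (params * BYTES_PER_PARAM[precision]) / 1e9;
}

const PARAMS = 5e9; // 5 billion, as listed under "Parameters"
for (const p of ["fp32", "bf16", "int8"] as Precision[]) {
  console.log(`${p}: ${weightGigabytes(PARAMS, p)} GB`);
}
```

By this estimate, INT8 quantisation halves the weight footprint relative to BF16, which is part of why quantisation and CPU offload bring the model within reach of a single GPU.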
Training & License

Trained on a curated multi-million-clip video corpus with bilingual dense captions generated by a fine-tuned video captioning model. Data is heavily filtered for aesthetic quality, motion coherence and caption alignment. Exact token / clip counts are reported in the paper.

License: Open weights under the CogVideoX Model Licence (free for research and commercial use with attribution).

Known limitations
  • Maximum native duration ~6 seconds
  • Resolution capped at 720x480 in the 5B base model
  • No audio generation
  • Slower than closed commercial APIs on similar hardware
  • Occasional anatomical artifacts and limb drift on fast motion
