CogVideoX-5B (open)
Zhipu/Tsinghua's 5B open text-to-video model. 720x480 @ 8fps, 6s clips, image-to-video variant available.
CogVideoX-5B (open) is a video generation AI model served via Replicate, priced at €0.000 per 1M input tokens, with an unknown context window.
API Integration
Use our OpenAI-compatible API to integrate CogVideoX-5B (open) into your application.
npm install railwail
import railwail from "railwail";
const rw = railwail("YOUR_API_KEY");
// Simple: just pass a string
const reply = await rw.run("cogvideox-5b-open", "Hello! What can you do?");
console.log(reply);
// With message history
const reply2 = await rw.run("cogvideox-5b-open", [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "Explain quantum computing simply." },
]);
console.log(reply2);
// Full response with usage info
const res = await rw.chat("cogvideox-5b-open", [
{ role: "user", content: "Hello!" },
], { temperature: 0.7, max_tokens: 500 });
console.log(res.choices[0].message.content);
console.log(res.usage);
Deep dive: THUDM (Tsinghua University KEG Lab) / Zhipu AI's CogVideoX-5B (open)
THUDM (the Knowledge Engineering Group at Tsinghua University) and its commercial arm Zhipu AI are among China's leading open-source generative-AI research groups. The lab is best known for the GLM language-model family (GLM-130B, ChatGLM) and the CogVLM/CogVideo vision projects. CogVideo was introduced in 2022 as one of the first publicly released 9B-parameter text-to-video transformers, followed by CogVideoX in August 2024, which released 2B and 5B variants under an open-weight licence. Zhipu AI raised more than $1.5B by 2025 and is one of the four 'AI tigers' of China alongside Moonshot, MiniMax and Baichuan. CogVideoX is widely adopted as a research and fine-tuning base because of its permissive licence and reproducible training recipe.
CogVideoX is a latent video diffusion model built around a 3D causal Variational Autoencoder that compresses video into a compact latent grid across both spatial and temporal axes. On top of this latent space, an Expert Transformer (a diffusion transformer that processes text and video tokens in one sequence with shared self-attention and modality-specific expert adaptive LayerNorms) denoises the text-conditioned latent video at multiple resolutions. The architecture employs 3D Rotary Position Embeddings (3D-RoPE) for spatio-temporal positions, adaptive LayerNorm modulated by the diffusion timestep, and a v-prediction diffusion objective with a zero-terminal-SNR noise schedule. Training proceeds in stages: first low-resolution images, then short low-resolution videos, then high-resolution video up to 720x480 at 8 fps for 6 seconds. The team curated a large filtered video corpus with dense bilingual captions produced by a fine-tuned vision-language model. CogVideoX-5B supports text-to-video, and an image-to-video variant (CogVideoX-5B-I2V) is also available.
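To make the "compact latent grid" concrete, here is a minimal sketch of the arithmetic for a native clip. The compression factors (4x in time, 8x in each spatial dimension) are assumptions taken from how the CogVideoX VAE is commonly described, not figures stated on this page, and the function names are hypothetical.

```javascript
// Assumed compression factors of the 3D causal VAE (not stated on this page).
const TEMPORAL_FACTOR = 4;
const SPATIAL_FACTOR = 8;

// One initial frame plus fps * seconds further frames (matches the ~49 frames quoted for 6 s at 8 fps).
function frameCount(fps, seconds) {
  return fps * seconds + 1;
}

// A causal VAE encodes the first frame on its own, then groups the remaining frames.
function latentGrid(frames, height, width) {
  if ((frames - 1) % TEMPORAL_FACTOR !== 0) {
    throw new RangeError("frames - 1 must be divisible by the temporal factor");
  }
  if (height % SPATIAL_FACTOR !== 0 || width % SPATIAL_FACTOR !== 0) {
    throw new RangeError("height and width must be divisible by the spatial factor");
  }
  return {
    frames: (frames - 1) / TEMPORAL_FACTOR + 1,
    height: height / SPATIAL_FACTOR,
    width: width / SPATIAL_FACTOR,
  };
}

const frames = frameCount(8, 6);            // 49
const grid = latentGrid(frames, 480, 720);  // { frames: 13, height: 60, width: 90 }
console.log(frames, grid);
```

Under these assumptions the transformer denoises a 13 x 60 x 90 latent grid rather than 49 full-resolution frames, which is what keeps a 5B model tractable on a single GPU.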
- Parameters: 5 billion (also 2B variant)
- Context: 226 tokens
- Open-weight 5B text-to-video model (weights released under the CogVideoX Model Licence)
- Generates 6-second clips at 720x480 / 8 fps natively (~49 frames)
- Image-to-video variant CogVideoX-5B-I2V for conditioning on a first frame
- Strong prompt adherence on complex compositions and motion verbs
- Bilingual English/Chinese prompting
- Runs on a single 24-48 GB consumer-class GPU with optimisations (CPU offload, INT8)
- Fine-tunable with LoRA and full-parameter fine-tuning
- Frequently extended by community for longer durations via temporal tiling
- Best for: research, open-source pipelines, custom fine-tunes, on-prem video generation.
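The GPU figure in the list above can be sanity-checked with a back-of-the-envelope estimate of weight memory. This is a rough sketch only: it counts the 5B transformer weights and ignores the text encoder, the VAE, and activations, all of which add to the real footprint. The helper name is hypothetical.

```javascript
// Bytes needed to store one parameter at each precision.
const BYTES_PER_PARAM = { fp32: 4, bf16: 2, int8: 1 };

// Weight-only memory in GiB for a model with `params` parameters.
function weightGiB(params, dtype) {
  const bytes = BYTES_PER_PARAM[dtype];
  if (bytes === undefined) throw new RangeError(`unknown dtype: ${dtype}`);
  return (params * bytes) / 2 ** 30;
}

for (const dtype of Object.keys(BYTES_PER_PARAM)) {
  // fp32 ≈ 18.6 GiB, bf16 ≈ 9.3 GiB, int8 ≈ 4.7 GiB
  console.log(dtype, weightGiB(5e9, dtype).toFixed(1), "GiB");
}
```

The gap between roughly 9 GiB of BF16 weights and a 24 GB card is what activations and the auxiliary models consume, which is why CPU offload and INT8 quantisation matter on smaller GPUs.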
Trained on a curated multi-million-clip video corpus with bilingual dense captions generated by a fine-tuned video captioning model. Data is heavily filtered for aesthetic quality, motion coherence and caption alignment. Exact token / clip counts are reported in the paper.
License: Open weights under the CogVideoX Model Licence (free for research and commercial use with attribution).
Known limitations
- Maximum native duration ~6 seconds
- Resolution capped at 720x480 in the 5B base model
- No audio generation
- Generation is typically slower than closed commercial APIs
- Occasional anatomical artifacts and limb drift on fast motion
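The ~6 second cap is the reason community pipelines tile in time. As an illustration only (this is not CogVideoX's own method, and the 8-frame overlap is a hypothetical default), a longer clip can be planned as overlapping fixed-length windows of 49 frames:

```javascript
// Plan overlapping windows [start, end) of `windowSize` frames that cover `totalFrames`.
// The last window is shifted back so that it still has the full length.
function planWindows(totalFrames, windowSize = 49, overlap = 8) {
  if (totalFrames < 1) throw new RangeError("totalFrames must be positive");
  if (overlap < 0 || overlap >= windowSize) {
    throw new RangeError("overlap must be in [0, windowSize)");
  }
  const stride = windowSize - overlap;
  const windows = [];
  for (let start = 0; ; start += stride) {
    if (start + windowSize >= totalFrames) {
      // Final window: anchor it to the end of the clip.
      const lastStart = Math.max(0, totalFrames - windowSize);
      windows.push({ start: lastStart, end: totalFrames });
      break;
    }
    windows.push({ start, end: start + windowSize });
  }
  return windows;
}

// 15 s at 8 fps (121 frames) needs three windows: [0,49), [41,90), [72,121)
console.log(planWindows(121));
```

How the overlapping frames are blended or used as conditioning is left out here; that choice is where tiling approaches differ most.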
Related Models
Google Veo 2
Google's state-of-the-art video generation model. Simulates real-world physics with various visual styles.
Google Veo 3
Google's Veo 3. High-fidelity text-to-video with native audio generation, up to 8s clips.
Google Veo 3.1
Latest Veo with image-to-video and context-aware audio
Kling v3
Cinematic video up to 15s with multi-shot and native audio
Start using CogVideoX-5B (open) today
Get started with free credits. No credit card required. Access CogVideoX-5B (open) and 100+ other models through a single API.