HunyuanVideo
Tencent's open-source video generation model. Strong visual quality with diverse style support.
HunyuanVideo is a video generation AI model from Tencent, priced at €0.000 per 1M input tokens with an unknown context window.
Examples
See what HunyuanVideo can generate
Underwater Scene
"Camera gliding through a vibrant coral reef teeming with tropical fish, sunlight filtering through the crystal-clear water creating dancing light patterns on the ocean floor, documentary style"
Fantasy Animation
"A tiny glowing fairy emerging from an opening flower bud in an enchanted forest, sparkles trailing behind her wings as she takes flight, magical atmosphere with bioluminescent plants"
API Integration
Use our OpenAI-compatible API to integrate HunyuanVideo into your application.
npm install railwail

import railwail from "railwail";
const rw = railwail("YOUR_API_KEY");
// Simple: just pass a string
const reply = await rw.run("hunyuan-video", "Hello! What can you do?");
console.log(reply);
// With message history
const reply2 = await rw.run("hunyuan-video", [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Explain quantum computing simply." },
]);
console.log(reply2);
// Full response with usage info
const res = await rw.chat("hunyuan-video", [
  { role: "user", content: "Hello!" },
], { temperature: 0.7, max_tokens: 500 });
console.log(res.choices[0].message.content);
console.log(res.usage);
Deep dive: Tencent's HunyuanVideo
Tencent Holdings is one of the largest technology and entertainment conglomerates in the world, founded in 1998 by Ma Huateng (Pony Ma) and four co-founders in Shenzhen. Tencent's AI Lab, founded in 2016, and the Hunyuan team (the company's foundation-model group) developed the Hunyuan family of text, image, 3D and video models. HunyuanVideo, released in December 2024 with its 13B foundation model open-sourced, was the largest publicly released open-weight video diffusion model at the time, and rapidly became a popular base for community fine-tunes (LoRAs, ControlNets, audio-driven extensions). The model is hosted on Hugging Face and GitHub under a custom licence that allows research and limited commercial use.
HunyuanVideo is a 13B-parameter Diffusion Transformer (DiT) operating on latents from a 3D causal VAE. Its denoiser uses a dual-stream to single-stream design inspired by FLUX.1: dedicated text and video streams first process their modalities separately, and their tokens are then concatenated for joint self-attention. Text conditioning combines a CLIP-like vision-language encoder and a multilingual large-language-model text encoder (MLLM) for stronger prompt understanding, especially on long captions. The model is trained with Flow Matching and 3D Rotary Position Embeddings, and uses a progressive curriculum from images to short videos to long videos at 720p / 24 fps. The accompanying paper details a meticulously curated multi-billion-clip dataset with hierarchical filtering and dense bilingual captions. HunyuanVideo supports text-to-video natively, and the open release includes a high-quality 5-second / 720p model; community follow-ups added image-to-video, audio-driven and longer-duration variants.
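To make the Flow Matching objective mentioned above more concrete, here is a minimal toy sketch in plain JavaScript. It is not HunyuanVideo's training code. It assumes the common linear-path formulation, in which a noisy sample is a straight-line blend of noise and data and the network is regressed onto the constant velocity between them. All function names, the three-element stand-in "latents" and the 10-step Euler sampler are illustrative assumptions.

```javascript
// Toy illustration of a flow-matching target (assumption: linear path).
// x_t = (1 - t) * noise + t * data, so the target velocity is data - noise.

// Point on the straight path between a noise sample and a data sample.
function interpolate(noise, data, t) {
  return noise.map((n, i) => (1 - t) * n + t * data[i]);
}

// Velocity the network would be regressed onto for this (noise, data) pair.
function targetVelocity(noise, data) {
  return data.map((d, i) => d - noise[i]);
}

// Mean squared error between a predicted and a target velocity.
function flowMatchingLoss(predicted, target) {
  const sum = predicted.reduce((acc, p, i) => acc + (p - target[i]) ** 2, 0);
  return sum / predicted.length;
}

// Sampling: integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 (data)
// with fixed-size Euler steps.
function eulerSample(noise, velocityFn, steps) {
  let x = noise.slice();
  const dt = 1 / steps;
  for (let k = 0; k < steps; k++) {
    const v = velocityFn(x, k * dt);
    x = x.map((xi, i) => xi + dt * v[i]);
  }
  return x;
}

// Hypothetical stand-ins; a real model works on VAE latents of video clips.
const noise = [0.5, -1.0, 2.0];
const data = [1.0, 0.0, -1.0];

const xHalf = interpolate(noise, data, 0.5);
const velocity = targetVelocity(noise, data);
const perfectLoss = flowMatchingLoss(velocity, velocity);
// With the exact velocity, Euler integration ends (up to rounding) on the data sample.
const sample = eulerSample(noise, () => velocity, 10);

console.log({ xHalf, velocity, perfectLoss, sample });
```

In a real model the constant `velocity` function is replaced by the denoiser's prediction, conditioned on the text embeddings, and the loss is averaged over random timesteps and training clips.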
- Parameters: 13 billion
- Context: unknown
- 13B open-weight text-to-video model (one of the largest released openly)
- Generates ~5-second clips at 720p / 24 fps natively
- Bilingual Chinese/English prompt handling via MLLM text encoder
- Dual-stream DiT architecture inspired by FLUX.1
- Strong prompt adherence on long, dense captions
- Active community ecosystem (image-to-video, LoRAs, audio-sync)
- Runs on 60-80 GB GPU memory (FP16), 40 GB with optimisations
- Permissive licence for non-commercial research and limited commercial use
- Best for: open-source video pipelines, research, on-prem creative tooling.
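A rough back-of-envelope calculation helps put the figures in the list above in context. The sketch below only uses numbers stated in the list (13 billion parameters, FP16, about 5 seconds at 720p / 24 fps). It assumes that 720p means 1280x720 and that a gigabyte is 10^9 bytes. The function names are illustrative.

```javascript
// Back-of-envelope numbers derived from the feature list (illustrative only).
const PARAMS = 13e9;        // 13 billion parameters
const BYTES_PER_FP16 = 2;   // FP16 stores each value in 2 bytes
const CLIP_SECONDS = 5;     // approximate native clip length
const FPS = 24;
const WIDTH = 1280;         // assumption: 720p means 1280x720 (16:9)
const HEIGHT = 720;

// Memory needed just to hold the weights, in gigabytes (10^9 bytes).
function weightGigabytes(params, bytesPerParam) {
  return (params * bytesPerParam) / 1e9;
}

// Number of frames in a clip of the given length and frame rate.
function frameCount(seconds, fps) {
  return Math.round(seconds * fps);
}

const weightsGB = weightGigabytes(PARAMS, BYTES_PER_FP16);
const frames = frameCount(CLIP_SECONDS, FPS);
const pixelsPerClip = frames * WIDTH * HEIGHT;

console.log(`FP16 weights: ${weightsGB} GB`);
console.log(`Frames per clip: ${frames}`);
console.log(`Pixels per clip: ${pixelsPerClip}`);
```

The weights alone account for about 26 GB in FP16. The gap up to the quoted 60-80 GB is presumably taken by activations, the text encoders and the VAE, although the list does not break that down.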
Training data: Multi-billion-clip curated video corpus with hierarchical aesthetic, motion and caption-quality filtering, plus dense bilingual captions produced by an in-house MLLM captioner.
Licence: Tencent Hunyuan Community Licence. Weights are free for research and limited commercial use with attribution; restrictions apply above certain thresholds and in certain jurisdictions.
Known limitations
- Native duration ~5 seconds
- 720p only (extensions require upscalers)
- High VRAM requirements vs smaller open models
- No native audio
- Licence has thresholds for very large commercial deployers
Related Models
Google Veo 2
Google's state-of-the-art video generation model. Simulates real-world physics with various visual styles.
Google Veo 3
Google's Veo 3. High-fidelity text-to-video with native audio generation, up to 8s clips.
Google Veo 3.1
Latest Veo with image-to-video and context-aware audio.
Kling v3
Cinematic video up to 15s with multi-shot and native audio.
Start using HunyuanVideo today
Get started with free credits. No credit card required. Access HunyuanVideo and 100+ other models through a single API.