HunyuanVideo
Tencent's 13B open-weights video diffusion transformer. SOTA among open video models at release.
HunyuanVideo is a video generation AI model from Tencent, priced at €0.000 per 1M input tokens with an unknown context window.
API Integration
Use our OpenAI-compatible API to integrate HunyuanVideo into your application.
npm install railwail
import railwail from "railwail";
const rw = railwail("YOUR_API_KEY");
// Simple: just pass a string
const reply = await rw.run("hunyuanvideo-tencent", "Hello! What can you do?");
console.log(reply);
// With message history
const reply2 = await rw.run("hunyuanvideo-tencent", [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "Explain quantum computing simply." },
]);
console.log(reply2);
// Full response with usage info
const res = await rw.chat("hunyuanvideo-tencent", [
{ role: "user", content: "Hello!" },
], { temperature: 0.7, max_tokens: 500 });
console.log(res.choices[0].message.content);
console.log(res.usage);
Deep dive: Tencent's HunyuanVideo
Tencent Holdings, founded in 1998 by Ma Huateng and four co-founders in Shenzhen, is one of the world's largest internet conglomerates and runs WeChat, Tencent Games and Tencent Cloud. Tencent's Hunyuan team developed the company's foundation-model family (Hunyuan Large for text, Hunyuan DiT for image, Hunyuan3D, and HunyuanVideo for video). HunyuanVideo (December 2024) was the first 13B-parameter open-weight video diffusion model and remains one of the largest publicly released video generators. This entry covers both the open release and the Tencent-hosted commercial API surfaced through Hunyuan Cloud, WeChat creator tools and Tencent's own ad platforms.
HunyuanVideo combines a 3D causal Variational Autoencoder for video compression with a 13B-parameter Diffusion Transformer denoiser. The denoiser adopts a FLUX.1-style dual-stream architecture: separate text and video streams process each modality independently before merging into single-stream joint self-attention. Text conditioning fuses signals from a CLIP-style vision-language encoder and a large multimodal LLM, which the Tencent team found materially improves prompt fidelity on long captions. Training uses Flow Matching with 3D Rotary Position Embeddings and a progressive curriculum (images, low-res videos, high-res videos) on a heavily filtered multi-billion-clip multilingual corpus with dense bilingual captions. The Tencent-hosted commercial endpoint adds longer-duration generation (via chained extensions), 1080p upscaling, image-to-video conditioning and a 'sound' module producing aligned audio.
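To make the Flow Matching objective mentioned above more concrete, here is a minimal sketch of the generic linear-path formulation on plain arrays. It is not taken from the HunyuanVideo code: the function names are hypothetical, the convention that t = 0 is noise and t = 1 is data is an assumption (conventions vary between papers), and a real model would operate on VAE latents and predict the velocity with the Diffusion Transformer.

```javascript
// Linear-path flow matching on plain arrays (illustrative only).
// Assumption: x0 is a noise sample, x1 is a data (latent) sample, t is in [0, 1].

// Point on the straight line between noise and data at time t.
function interpolate(x0, x1, t) {
  return x0.map((n, i) => (1 - t) * n + t * x1[i]);
}

// Regression target for the velocity field along that line: x1 - x0.
function targetVelocity(x0, x1) {
  return x1.map((d, i) => d - x0[i]);
}

// Mean squared error between a predicted velocity and the target.
function flowMatchingLoss(predicted, x0, x1) {
  const target = targetVelocity(x0, x1);
  const sum = predicted.reduce((acc, p, i) => acc + (p - target[i]) ** 2, 0);
  return sum / predicted.length;
}

const noise = [0.5, -1.0, 2.0];
const latent = [1.5, 0.0, -2.0];

const xt = interpolate(noise, latent, 0.25); // noisy input the denoiser would see
const v = targetVelocity(noise, latent);     // what the denoiser should predict

console.log(xt);                                // [ 0.75, -0.75, 1 ]
console.log(v);                                 // [ 1, 1, -4 ]
console.log(flowMatchingLoss(v, noise, latent)); // 0 for a perfect prediction
```

In training, t would be sampled at random for each example and the loss averaged over a batch; the sketch fixes t only to keep the numbers easy to check.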
- Parameters: 13 billion
- Context: unknown
- 13B open-weight base model with permissive Hunyuan Community Licence
- 5-second 720p / 24 fps native clips; extended to 10-15s on commercial endpoint
- Strong bilingual (Chinese/English) prompt understanding via MLLM text encoder
- Dual-stream DiT with high prompt adherence on dense captions
- Image-to-video and audio-sync available on commercial Tencent API
- Massive open-source ecosystem of LoRAs, control nets and pipelines
- Available on Hugging Face, GitHub and Tencent Hunyuan Cloud
- Strong on cinematic motion, lighting and dynamic camera
- Best for: research, open pipelines, Chinese-market creative tools, branded campaigns.
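As a rough illustration of what the native clip spec in the list above implies, the sketch below computes the frame count and uncompressed size of a 5-second 720p / 24 fps clip. The assumptions are mine, not the source's: 720p is taken as 1280x720, pixels as 8-bit RGB (3 bytes), and the helper name is hypothetical. The released model's exact frame count may differ from this simple product.

```javascript
// Back-of-the-envelope size of an uncompressed clip.
// Assumptions (not from the source): 720p = 1280x720, 3 bytes per pixel (8-bit RGB).
function rawClipBytes({ width, height, fps, seconds, bytesPerPixel }) {
  const frames = fps * seconds;
  return { frames, bytes: width * height * bytesPerPixel * frames };
}

const clip = rawClipBytes({
  width: 1280,
  height: 720,
  fps: 24,
  seconds: 5,
  bytesPerPixel: 3,
});

console.log(clip.frames);                            // 120
console.log((clip.bytes / 1e6).toFixed(1) + " MB");  // 331.8 MB
```

A few hundred megabytes of raw pixels per clip is one reason the denoiser works on compressed VAE latents and not on frames directly.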
Training data: Multi-billion-clip curated video corpus with hierarchical filtering for aesthetics, motion and caption alignment, plus dense bilingual captions from an in-house MLLM captioner. Exact size disclosed only in summary form in the paper.
License: Tencent Hunyuan Community Licence on weights (free for research and limited commercial use with attribution; thresholds for very large deployers). Commercial API governed by Tencent Cloud terms.
Known limitations
- Native open weights limited to ~5s / 720p
- Audio only on commercial Tencent endpoint
- High VRAM requirements for local inference (~60-80 GB FP16)
- Commercial API has stricter content moderation on Chinese surfaces
- Licence has thresholds for very large commercial users
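The VRAM figure in the list above can be put in context with simple arithmetic: 13 billion parameters at 2 bytes each (FP16) account for about 26 GB before any activations are allocated. The sketch below is an estimate under that assumption only; the helper name is hypothetical, and attributing the rest of the ~60-80 GB to activations, text encoders and the VAE is an inference, not something the source states.

```javascript
// Memory needed just to hold model weights, in decimal gigabytes.
function weightMemoryGB(parameterCount, bytesPerParameter) {
  return (parameterCount * bytesPerParameter) / 1e9;
}

const fp16Weights = weightMemoryGB(13e9, 2); // FP16 = 2 bytes per parameter

// Headroom left inside the ~60-80 GB range quoted in the limitations list.
const headroomLow = 60 - fp16Weights;
const headroomHigh = 80 - fp16Weights;

console.log(fp16Weights);               // 26
console.log(headroomLow, headroomHigh); // 34 54
```

The same helper gives a quick sense of what lower-precision weights would save, for example `weightMemoryGB(13e9, 1)` for 8-bit weights, though whether a given quantised build exists or preserves quality is outside what this page covers.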
Related Models
Google Veo 2
Google's state-of-the-art video generation model. Simulates real-world physics with various visual styles.
Google Veo 3
Google's Veo 3. High-fidelity text-to-video with native audio generation, up to 8s clips.
Google Veo 3.1
Latest Veo with image-to-video and context-aware audio
Kling v3
Cinematic video up to 15s with multi-shot and native audio