Wan 2.2 Text-to-Video
Ultra-cheap T2V for pennies
Wan 2.2 Text-to-Video is a video generation AI model from Replicate, priced at €0.000 per 1M input tokens, with an unknown context window.
Examples
See what Wan 2.2 Text-to-Video can generate
Quick
"Cat playing with yarn on wooden floor"
API Integration
Use our OpenAI-compatible API to integrate Wan 2.2 Text-to-Video into your application.
npm install railwail
import railwail from "railwail";
const rw = railwail("YOUR_API_KEY");
// Simple — just pass a string
const reply = await rw.run("wan-t2v", "Hello! What can you do?");
console.log(reply);
// With message history
const reply2 = await rw.run("wan-t2v", [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "Explain quantum computing simply." },
]);
console.log(reply2);
// Full response with usage info
const res = await rw.chat("wan-t2v", [
{ role: "user", content: "Hello!" },
], { temperature: 0.7, max_tokens: 500 });
console.log(res.choices[0].message.content);
console.log(res.usage);
Deep dive: Alibaba (Tongyi Wanxiang Lab)'s Wan 2.2 Text-to-Video
Alibaba's Tongyi Lab in Hangzhou runs the Qwen LLM family and the Wanxiang generative-media family. After Wan 2.0 (mid-2024) and Wan 2.1 (early 2025), the team released Wan 2.2 in 2025 as its next-generation open-weight video model. Wan 2.2 ships as purpose-tuned variants for Text-to-Video, Image-to-Video and Audio/A2V. Wan 2.2 Text-to-Video is the flagship pure-text-conditioned variant and replaces Wan 2.1 T2V-14B as the principal open-weight text-to-video reference for the Chinese research community. The Wan team consistently ranks near the top of open-model VBench leaderboards and ships reproducible training code under a permissive Wan-series licence.
Wan 2.2 Text-to-Video is a Diffusion Transformer operating on a 3D causal Wan-VAE latent. Wan 2.2 introduces architectural refinements over Wan 2.1: improved 3D Rotary Position Embeddings, larger attention windows, and (in the flagship) a Mixture-of-Experts feed-forward design that routes tokens to specialist experts. Text conditioning uses a Qwen-family multilingual encoder with strong Chinese-English capability. The denoiser is trained with Flow Matching on a curated multi-million-clip multilingual video corpus with synthetic dense bilingual captions. Native generation is 5 seconds at 720p / 24 fps (with 1080p extensions). The training recipe and weights are open-source on Hugging Face and GitHub under the Wan-series permissive licence, designed to enable broad commercial and research use.
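To make the Flow Matching training objective mentioned above more concrete, here is a minimal sketch on a toy three-element latent. It assumes a straight-line path between noise and data (a common rectified-flow style formulation); the page does not say which Flow Matching variant Wan 2.2 uses, and the function names and sample values are purely illustrative.

```javascript
// Illustrative flow-matching target on a toy latent vector.
// Assumption: straight-line path between noise and data; the exact
// formulation used to train Wan 2.2 is not stated in the source.

// Point on the straight path between noise x0 and data x1 at time t in [0, 1].
function interpolate(x0, x1, t) {
  return x0.map((n, i) => (1 - t) * n + t * x1[i]);
}

// Velocity the denoiser would be trained to predict along that path.
// For a straight line it is constant in t: x1 - x0.
function targetVelocity(x0, x1) {
  return x1.map((d, i) => d - x0[i]);
}

// Mean squared error between a predicted velocity and the target velocity.
function flowMatchingLoss(predicted, target) {
  const sum = predicted.reduce((acc, p, i) => acc + (p - target[i]) ** 2, 0);
  return sum / predicted.length;
}

const noise = [0.5, -1.0, 2.0];  // x0: sample from the noise prior (toy values)
const latent = [1.5, 0.0, -2.0]; // x1: clean video latent (toy values)
const t = 0.25;                  // time step, normally sampled at random

const xt = interpolate(noise, latent, t); // noisy input the model would see
const v = targetVelocity(noise, latent);  // regression target for the model

console.log("x_t:", xt);
console.log("target velocity:", v);
console.log("loss of an all-zero prediction:", flowMatchingLoss([0, 0, 0], v));
```

In a real model the prediction would come from the Diffusion Transformer, conditioned on the text embedding and the time step; here an all-zero vector stands in for it only to show how the loss is computed.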
- Parameters: 14 billion (flagship); smaller variants available
- Context: unknown
- Open-weight text-to-video flagship at 14B parameters (smaller variants available)
- 5-second 720p / 24 fps generation natively, 1080p extensions
- Bilingual Chinese/English prompts via Qwen-based text encoder
- MoE-style scaling and improved 3D RoPE in flagship variant
- Permissive Wan-series licence for research and commercial use
- Top-tier results on VBench among open-weight models
- Active community ecosystem (LoRAs, fine-tunes, ComfyUI nodes)
- Reproducible training recipe and code
- Best for: open-source video pipelines, research, on-prem creative tooling, branded fine-tunes.
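As a rough worked example of what "5 seconds at 720p / 24 fps" means in raw data terms, the sketch below counts frames and uncompressed RGB bytes. It assumes an exact 5.0-second clip, a 16:9 frame of 1280×720 pixels and 3 bytes per pixel; none of these specifics are stated on this page, and the model's actual frame count may differ slightly.

```javascript
// Back-of-the-envelope size of one native clip.
// Assumptions (not from the source): exactly 5 s, 1280x720 frames, 8-bit RGB.

const DURATION_SECONDS = 5;
const FPS = 24;
const WIDTH = 1280;  // assumed 16:9 width for 720p
const HEIGHT = 720;
const BYTES_PER_PIXEL = 3; // 8-bit R, G and B channels

// Number of frames in the clip.
function frameCount(seconds, fps) {
  return seconds * fps;
}

// Uncompressed size in bytes of the whole clip in pixel space.
function rawClipBytes(frames, width, height, bytesPerPixel) {
  return frames * width * height * bytesPerPixel;
}

const frames = frameCount(DURATION_SECONDS, FPS);
const rawBytes = rawClipBytes(frames, WIDTH, HEIGHT, BYTES_PER_PIXEL);

console.log("frames:", frames);
console.log("raw RGB size (MB):", rawBytes / 1e6);
```

Under these assumptions a clip is 120 frames and roughly 332 MB of raw pixels, which gives a sense of why video diffusion models work on a compressed VAE latent rather than directly on pixels.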
Training data: Curated multi-million-clip multilingual video corpus filtered for aesthetics, motion and caption quality, with dense bilingual captions; specifics documented in Wan technical materials.
License: Open weights under an Apache-style permissive licence (Wan-series release).
Known limitations
- Native duration 5 seconds
- No native audio
- High VRAM requirements for the 14B flagship
- Closed leaders (Veo 3, Sora 2, Kling v3) still ahead on absolute fidelity
- Resolution capped at 720p natively (1080p only in extended modes)
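To put the VRAM limitation in perspective, the following sketch estimates the memory needed just to hold 14 billion weights at a few common numeric precisions. It is weights-only arithmetic: activations, the text encoder, the VAE and any framework overhead are excluded, and the precisions listed are assumptions, since this page gives no official VRAM figure.

```javascript
// Weights-only memory estimate for a 14B-parameter model.
// Assumption: every parameter is stored at the same precision; real
// deployments may mix precisions or offload parts of the model.

const PARAM_COUNT = 14e9; // "14 billion (flagship)" from the spec list

// Bytes per parameter for a few common storage formats.
const BYTES_PER_PARAM = { fp32: 4, bf16: 2, int8: 1 };

// Memory in decimal gigabytes (1 GB = 1e9 bytes) needed for the weights.
function weightMemoryGB(paramCount, bytesPerParam) {
  return (paramCount * bytesPerParam) / 1e9;
}

const estimates = Object.fromEntries(
  Object.entries(BYTES_PER_PARAM).map(([precision, bytes]) => [
    precision,
    weightMemoryGB(PARAM_COUNT, bytes),
  ])
);

console.log(estimates);
```

Even at 16-bit precision the weights alone come to about 28 GB, more than most consumer GPUs offer, which is consistent with the limitation listed above and with smaller variants being offered alongside the flagship.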
Related Models
Google Veo 2
Google's state-of-the-art video generation model. Simulates real-world physics with various visual styles.
Google Veo 3
Google's Veo 3. High-fidelity text-to-video with native audio generation, up to 8s clips.
Google Veo 3.1
Latest Veo with image-to-video and context-aware audio
Kling v3
Cinematic video up to 15s with multi-shot and native audio
Start using Wan 2.2 Text-to-Video today
Get started with free credits. No credit card required. Access Wan 2.2 Text-to-Video and 100+ other models through a single API.