Wan 2.1 (Alibaba)
Alibaba's Wan 2.1 open-weights video diffusion model. Diffusion Transformer (DiT) family at 1.3B and 14B parameters; supports T2V and I2V.
Wan 2.1 (Alibaba) is a video generation AI model from Replicate, priced at €0.000 per 1M input tokens with an unknown context window.
API Integration
Use our OpenAI-compatible API to integrate Wan 2.1 (Alibaba) into your application.
npm install railwail
import railwail from "railwail";
const rw = railwail("YOUR_API_KEY");
// Simple — just pass a string
const reply = await rw.run("wan-2-1-alibaba", "Hello! What can you do?");
console.log(reply);
// With message history
const reply2 = await rw.run("wan-2-1-alibaba", [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "Explain quantum computing simply." },
]);
console.log(reply2);
// Full response with usage info
const res = await rw.chat("wan-2-1-alibaba", [
{ role: "user", content: "Hello!" },
], { temperature: 0.7, max_tokens: 500 });
console.log(res.choices[0].message.content);
console.log(res.usage);
Deep dive: Alibaba (Tongyi Wanxiang Lab)'s Wan 2.1 (Alibaba)
Alibaba was founded in 1999 by Jack Ma and 17 co-founders in Hangzhou and is one of China's two largest cloud and e-commerce conglomerates. Its Tongyi Lab (Tongyi Qianwen / Wanxiang) runs Alibaba's foundation-model research and shipped the Qwen large-language-model family, the Wanxiang image generator and the Wan video-generation family. Wan 2.0 launched in mid-2024 and Wan 2.1 in early 2025 as an open-weight diffusion-transformer family covering text-to-video, image-to-video and first-/last-frame conditioning, with a permissive licence designed to enable broad community adoption. Alibaba positioned Wan 2.1 as the strongest open-weight Chinese video model at its launch, with public weights and reproducible training recipes.
Wan 2.1 is a family of open-weight Diffusion Transformer (DiT) video models released by Alibaba's Tongyi Wanxiang Lab. The family includes a 1.3B-parameter text-to-video model that runs on consumer GPUs and 14B-parameter text-to-video and image-to-video models targeted at high-fidelity creative work. All variants operate on a 3D causal Variational Autoencoder (Wan-VAE) that compresses video into a spatio-temporal latent grid; a DiT trained with Flow Matching then denoises that latent. Text conditioning uses a multilingual umT5 text encoder with strong Chinese-English capability. The Wan 2.1 release on GitHub and Hugging Face also includes weights and inference code for first-frame and last-frame conditioning. Native generation is about 5 seconds per clip at 832x480 (1.3B) or up to 720p (14B), at 16 fps. The team reports state-of-the-art results on VBench among open models.
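To make the latent grid and the Flow Matching denoising step more concrete, here is a minimal sketch in plain JavaScript. It is not the Wan 2.1 implementation. The compression factors (4x in time, 8x per spatial side, first frame encoded on its own) are assumptions taken from the Wan 2.1 technical report, and the function names latentGridShape, eulerSample and oracleVelocity are hypothetical. The sampler uses a toy "oracle" velocity field in place of the trained DiT, and assumes the convention that t = 0 is noise and t = 1 is data.

```javascript
// Assumption: Wan-VAE compresses 4x along time and 8x along height and width,
// and the causal VAE encodes the first frame separately (so frames = 4k + 1).
function latentGridShape(frames, height, width) {
  const TEMPORAL = 4;
  const SPATIAL = 8;
  if ((frames - 1) % TEMPORAL !== 0) {
    throw new Error("frames must have the form 4k + 1");
  }
  if (height % SPATIAL !== 0 || width % SPATIAL !== 0) {
    throw new Error("height and width must be multiples of 8");
  }
  return {
    frames: 1 + (frames - 1) / TEMPORAL,
    height: height / SPATIAL,
    width: width / SPATIAL,
  };
}

// Toy flow-matching sampler: Euler integration of dx/dt = v(x, t)
// from t = 0 (noise) to t = 1 (data). In a real model, `velocity`
// would be the trained network evaluated on the latent grid.
function eulerSample(noise, velocity, steps) {
  let x = noise.slice();
  const dt = 1 / steps;
  for (let k = 0; k < steps; k++) {
    const t = k * dt;
    const v = velocity(x, t);
    x = x.map((xi, i) => xi + dt * v[i]);
  }
  return x;
}

// Hypothetical stand-in for the DiT: the exact velocity that moves a point
// along a straight line toward a known target, reaching it at t = 1.
const target = [0.5, -1.0, 2.0];
const oracleVelocity = (x, t) => x.map((xi, i) => (target[i] - xi) / (1 - t));

console.log(latentGridShape(81, 480, 832));
console.log(eulerSample([0.1, 0.2, -0.3], oracleVelocity, 10));
```

Under these assumptions, an 81-frame 832x480 clip maps to a 21 x 60 x 104 latent grid, which is the tensor the denoiser actually works on. The sampler lands on the target only because the velocity here is exact; a learned velocity field is approximate, so step count and solver choice affect quality.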
- Parameters: 1.3 billion (T2V-1.3B) and 14 billion (T2V-14B / I2V-14B)
- Context: unknown
- Open-weight DiT family at 1.3B (consumer GPU) and 14B (high-fidelity) sizes
- Text-to-video, image-to-video, first-frame and last-frame conditioning
- Bilingual Chinese/English prompts via a multilingual umT5 text encoder
- Permissive licence aimed at broad community adoption
- Strong performance on VBench among open-weight models
- Runs locally on consumer GPUs (the 1.3B variant needs roughly 8 GB of VRAM)
- Active community ecosystem (LoRAs, ComfyUI nodes, fine-tunes)
- Reproducible training recipe documented in technical report
- Best for: open-source video pipelines, research, on-prem creative tools, custom fine-tunes.
Training data: Curated multi-million-clip multilingual video corpus filtered for aesthetics, motion and caption quality, with dense bilingual captions; specifics documented in the Wan 2.1 technical report.
License: Open weights under the Apache 2.0 licence (Wan 2.1 release), suitable for research and commercial use.
Known limitations
- Native duration 5 seconds
- Resolution capped at 720p in 14B base model, 832x480 in 1.3B
- No native audio
- High VRAM requirements for 14B variant
- Quality below closed leaders (Veo 3, Sora 2, Kling v3) at similar duration
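The VRAM limitation above can be illustrated with back-of-the-envelope arithmetic: the memory needed just to hold the weights is the parameter count times the bytes per parameter. The sketch below is a rough estimate under that assumption only. The helper weightMemoryGB is hypothetical, and the figure excludes the text encoder, the VAE, activations and any offloading or quantisation a given runtime applies.

```javascript
// Bytes per parameter for common storage precisions.
const BYTES_PER_PARAM = { fp32: 4, bf16: 2, fp16: 2, int8: 1 };

// Memory needed to hold the model weights alone, in decimal gigabytes.
function weightMemoryGB(paramCount, precision) {
  const bytes = BYTES_PER_PARAM[precision];
  if (bytes === undefined) {
    throw new Error(`unknown precision: ${precision}`);
  }
  return (paramCount * bytes) / 1e9;
}

// Parameter counts as listed for the two Wan 2.1 sizes.
for (const [name, params] of [["1.3B", 1.3e9], ["14B", 14e9]]) {
  const gb = weightMemoryGB(params, "bf16").toFixed(1);
  console.log(`${name} bf16 weights: ${gb} GB`);
}
```

At bf16 this gives about 2.6 GB for the 1.3B model and about 28 GB for the 14B model, so the larger model's weights alone exceed a 24 GB card before any activations are counted.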
Related Models
Google Veo 2
Google's state-of-the-art video generation model. Simulates real-world physics with various visual styles.
Google Veo 3
Google's Veo 3. High-fidelity text-to-video with native audio generation, up to 8s clips.
Google Veo 3.1
Latest Veo with image-to-video and context-aware audio
Kling v3
Cinematic video up to 15s with multi-shot and native audio