Qwen2-VL-72B Instruct
Alibaba's 72B vision-language model with M-RoPE and dynamic resolution. Strong document and video understanding.
Qwen2-VL-72B Instruct is a multimodal AI model from Alibaba / Qwen, priced at €0.000 per 1M input tokens, with a 32.8K-token context window.
Pricing
API Integration
Use our OpenAI-compatible API to integrate Qwen2-VL-72B Instruct into your application.
npm install railwail
import railwail from "railwail";
const rw = railwail("YOUR_API_KEY");
// Simple — just pass a string
const reply = await rw.run("qwen2-vl-72b-instruct", "Hello! What can you do?");
console.log(reply);
// With message history
const reply2 = await rw.run("qwen2-vl-72b-instruct", [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "Explain quantum computing simply." },
]);
console.log(reply2);
// Full response with usage info
const res = await rw.chat("qwen2-vl-72b-instruct", [
{ role: "user", content: "Hello!" },
], { temperature: 0.7, max_tokens: 500 });
console.log(res.choices[0].message.content);
console.log(res.usage);
Deep dive: Alibaba DAMO Academy (Qwen Team)'s Qwen2-VL-72B Instruct
The Qwen (Tongyi Qianwen) team sits inside Alibaba Cloud's DAMO Academy, the company's research arm founded in 2017 in Hangzhou. The team is led by Junyang Lin and Le Hou and counts dozens of researchers across NLP, vision and speech. Qwen has produced one of the most prolific open-source model lines in the world, including Qwen1.5, Qwen2 (June 2024), Qwen2.5 (September 2024), the Code, Math, Audio and VL (vision-language) families, and the January 2025 release of Qwen2.5-VL. Qwen2-VL launched in August 2024 in 2B, 7B and 72B sizes, with weights published on Hugging Face and ModelScope; the 72B Instruct variant became one of the top open-weights vision-language models worldwide, frequently matching closed-source peers on OCR-heavy benchmarks like DocVQA and ChartQA. Alibaba offers Qwen models commercially through Alibaba Cloud and Bailian.
Qwen2-VL-72B-Instruct combines the Qwen2 72B decoder-only Transformer with a custom 675M ViT vision encoder using Naive Dynamic Resolution: instead of resizing every image to a fixed grid, the encoder accepts the native resolution and generates a variable number of visual tokens per image. The model also introduces Multimodal Rotary Position Embedding (M-RoPE) that encodes positions in time (for video), height and width separately, enabling single-stream multimodal video understanding. The model supports up to 20 minutes of video input via uniform frame sampling, single-frame image input at variable resolution up to ~16K visual tokens, and a 131,072-token text context window. Training proceeded in three stages: contrastive vision-language pretraining, multimodal pretraining on interleaved image-text and video-text data, and supervised fine-tuning with chain-of-thought multimodal instructions. Weights are released under the Qwen licence (free for commercial use under specific terms).
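To make the "variable number of visual tokens" idea concrete, here is a minimal sketch that estimates how many visual tokens an image of a given size would produce. It assumes 14-pixel ViT patches and a 2x2 patch merge (so one visual token covers roughly 28x28 pixels); those two constants come from the public Qwen2-VL description and are not stated on this page. The function name `estimateVisualTokens` and the rounding strategy are illustrative assumptions, and the real preprocessing may differ in detail.

```javascript
// Assumed constants (not stated on this page): 14 px ViT patches, and
// 2x2 neighbouring patches merged into a single visual token.
const PATCH_SIZE = 14;
const MERGE_SIZE = 2;
const CELL = PATCH_SIZE * MERGE_SIZE; // 28 px of image per visual token edge

// Hypothetical helper: estimate the visual-token grid for one image.
// maxTokens defaults to 16384, matching the "~16K visual tokens" cap above.
function estimateVisualTokens(width, height, maxTokens = 16384) {
  if (!(width > 0 && height > 0)) {
    throw new RangeError("width and height must be positive");
  }
  // Snap each side to the nearest multiple of 28 px, with at least one cell.
  let cols = Math.max(1, Math.round(width / CELL));
  let rows = Math.max(1, Math.round(height / CELL));

  // If the grid exceeds the budget, shrink both sides by the same factor
  // so the aspect ratio is approximately preserved.
  if (cols * rows > maxTokens) {
    const scale = Math.sqrt(maxTokens / (cols * rows));
    cols = Math.max(1, Math.floor(cols * scale));
    rows = Math.max(1, Math.floor(rows * scale));
  }
  return { cols, rows, tokens: cols * rows };
}

console.log(estimateVisualTokens(224, 224));   // small thumbnail
console.log(estimateVisualTokens(1280, 720));  // 720p screenshot
console.log(estimateVisualTokens(8000, 8000)); // very large scan, capped
```

Under these assumptions a 224x224 thumbnail costs 64 visual tokens while a 720p frame costs over a thousand, which is why token usage for image requests scales with resolution rather than being a flat per-image fee.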
- Parameters: 72B (~73B with vision encoder)
- Context: 131.1K tokens
- Open-weights 72B vision-language model under permissive Qwen licence
- Naive Dynamic Resolution: native image aspect ratio without fixed grid
- Multimodal Rotary Position Embedding (M-RoPE) for joint image and video
- Up to 20 minutes of video understanding
- 131K-token text context
- Top open-weights scores on DocVQA, ChartQA, MathVista, RealWorldQA
- Strong OCR across English, Chinese, Japanese, Korean and European languages
- Best for: open-weights document AI, video QA, OCR-heavy multilingual workloads
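The M-RoPE item in the list above can be illustrated with a simplified sketch of how three-part position IDs (time, height, width) might be assigned to a mixed sequence. It follows the scheme described for Qwen2-VL: text tokens use the same ID in all three components, image tokens share one temporal ID while height and width follow the grid, video frames increment the temporal ID, and each new segment starts after the largest ID used so far. The function `mropePositions` and the segment shape are hypothetical names for illustration, not part of any SDK.

```javascript
// Hypothetical helper: assign [t, h, w] position IDs to a token sequence.
// segments: array of
//   { type: "text",  length }               - plain text tokens
//   { type: "image", gridH, gridW }         - one image as a token grid
//   { type: "video", frames, gridH, gridW } - several frames of token grids
function mropePositions(segments) {
  const positions = [];
  let next = 0; // first position ID not yet used by any component

  for (const seg of segments) {
    if (seg.type === "text") {
      // Text behaves like ordinary 1D RoPE: all three components are equal.
      for (let i = 0; i < seg.length; i++) {
        positions.push([next + i, next + i, next + i]);
      }
      next += seg.length;
    } else if (seg.type === "image" || seg.type === "video") {
      const frames = seg.type === "video" ? seg.frames : 1;
      for (let t = 0; t < frames; t++) {
        for (let h = 0; h < seg.gridH; h++) {
          for (let w = 0; w < seg.gridW; w++) {
            positions.push([next + t, next + h, next + w]);
          }
        }
      }
      // The next segment starts one past the largest ID used in this one.
      next += Math.max(frames, seg.gridH, seg.gridW);
    } else {
      throw new TypeError(`unknown segment type: ${seg.type}`);
    }
  }
  return positions;
}

const ids = mropePositions([
  { type: "text", length: 2 },
  { type: "image", gridH: 2, gridW: 2 },
  { type: "text", length: 1 },
]);
console.log(ids);
```

Because an image advances the running position by its largest grid dimension rather than by its token count, long visual inputs consume far fewer position IDs than the same number of text tokens would.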
Multi-stage curriculum: contrastive vision-language pretraining on large web image-text pairs, multimodal pretraining on interleaved image-text and video-text data, and supervised fine-tuning on curated chain-of-thought multimodal instructions.
License: Qwen Licence (commercial use permitted under 100M MAU; bespoke licence required above).
Known limitations
- Serving 72B requires multi-GPU infrastructure
- Video understanding limited to about 20 minutes, using uniform frame sampling
- Hallucination on extreme OCR cases
- Licence has MAU and competing-services restrictions
- Audio input requires separate Qwen-Audio model
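The uniform frame sampling mentioned in the limitations above can be pictured with a short sketch: pick a fixed number of evenly spaced frame indices across the whole clip, so longer videos are covered more sparsely. This is an illustrative assumption about what "uniform sampling" means here; the page does not specify the exact sampling rule, and `sampleFrameIndices` is a hypothetical helper, not part of the railwail SDK.

```javascript
// Hypothetical helper: choose evenly spaced frame indices from a clip.
// totalFrames: number of frames in the video (integer >= 1)
// numSamples:  how many frames to keep (integer >= 1)
function sampleFrameIndices(totalFrames, numSamples) {
  if (!Number.isInteger(totalFrames) || totalFrames < 1) {
    throw new RangeError("totalFrames must be a positive integer");
  }
  if (!Number.isInteger(numSamples) || numSamples < 1) {
    throw new RangeError("numSamples must be a positive integer");
  }
  // Never ask for more frames than the clip contains.
  const n = Math.min(numSamples, totalFrames);
  // A single sample takes the middle frame.
  if (n === 1) return [Math.floor((totalFrames - 1) / 2)];
  // Otherwise spread samples from the first frame to the last.
  const step = (totalFrames - 1) / (n - 1);
  return Array.from({ length: n }, (_, i) => Math.round(i * step));
}

// A 20-minute clip at 30 fps has 36,000 frames; keeping 8 of them
// leaves gaps of several thousand frames between samples.
console.log(sampleFrameIndices(36000, 8));
```

With a fixed frame budget, the gap between sampled frames grows with clip length, so brief events in long videos can fall between samples.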
Related Models
Claude Opus 4.7
Anthropic's April 2026 flagship. 87.6% on SWE-bench Verified, 3x higher image resolution, output self-verification, vision + reasoning.
Claude Sonnet 4.6
Anthropic's balanced mid-tier model from February 2026. Best price/performance for production workloads: 5x cheaper than Opus, near-flagship quality.
Depth Anything v2
Monocular depth-estimation model trained on 595k labeled and 62M unlabeled images. Strong zero-shot generalization in indoor and outdoor scenes.
GPT-5.4
OpenAI's unified flagship combining GPT and o-series reasoning into one model. 1M context, multimodal, top SWE-Bench Pro and OSWorld scores.
Start using Qwen2-VL-72B Instruct today
Get started with free credits. No credit card required. Access Qwen2-VL-72B Instruct and 100+ other models through a single API.