Qwen2-VL-72B Instruct

Alibaba / Qwen
Multimodal

Alibaba's 72B vision-language model with M-RoPE and dynamic resolution. Strong document and video understanding.

TL;DR · Last updated May 16, 2026

Qwen2-VL-72B Instruct is a multimodal AI model from Alibaba / Qwen, priced at €0.000 per 1M input tokens, with a 32.8K-token context window.


Pricing

Price per Generation
Per generation: Free

API Integration

Use our OpenAI-compatible API to integrate Qwen2-VL-72B Instruct into your application.

Install
npm install railwail
JavaScript / TypeScript
import railwail from "railwail";

const rw = railwail("YOUR_API_KEY");

// Simple — just pass a string
const reply = await rw.run("qwen2-vl-72b-instruct", "Hello! What can you do?");
console.log(reply);

// With message history
const reply2 = await rw.run("qwen2-vl-72b-instruct", [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Explain quantum computing simply." },
]);
console.log(reply2);

// Full response with usage info
const res = await rw.chat("qwen2-vl-72b-instruct", [
  { role: "user", content: "Hello!" },
], { temperature: 0.7, max_tokens: 500 });
console.log(res.choices[0].message.content);
console.log(res.usage);
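
The examples above send text only. Because the API is described as OpenAI-compatible, an image question would plausibly use OpenAI-style content parts (`text` and `image_url`). The sketch below only builds such a message; `buildImageQuestion` is a hypothetical helper, and whether this API accepts content arrays is an assumption that should be checked against the official documentation.

```typescript
// OpenAI-style content parts. Assumption: this API accepts them unchanged.
type TextPart = { type: "text"; text: string };
type ImagePart = { type: "image_url"; image_url: { url: string } };
type UserMessage = { role: "user"; content: Array<TextPart | ImagePart> };

// Hypothetical helper: pairs one question with one or more image URLs.
function buildImageQuestion(question: string, imageUrls: string[]): UserMessage {
  if (imageUrls.length === 0) {
    throw new Error("at least one image URL is required");
  }
  const images: ImagePart[] = imageUrls.map((url) => ({
    type: "image_url",
    image_url: { url },
  }));
  // Images first, then the question that refers to them.
  return { role: "user", content: [...images, { type: "text", text: question }] };
}

const message = buildImageQuestion("What is the invoice total?", [
  "https://example.com/invoice.png", // placeholder URL
]);
console.log(JSON.stringify(message, null, 2));
```

If the client forwards message content as-is, the result could be passed in the message array of `rw.chat("qwen2-vl-72b-instruct", [message])`.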

Specifications
Context window: 32,768 tokens
Max output: 8,192 tokens
Developer: Alibaba / Qwen
Category: Multimodal
Supported formats: text, image, video
Tags: qwen, alibaba, multimodal, vision, open-weights, video-understanding, pricing-tbd
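
The context-window and max-output figures above bound what `max_tokens` can usefully be set to. The following is a minimal sketch of that arithmetic. It assumes prompt and generated tokens share the 32,768-token window, which this page does not state explicitly, and `maxTokensFor` is a hypothetical helper.

```typescript
const CONTEXT_WINDOW = 32768; // from the specifications on this page
const MAX_OUTPUT = 8192; // from the specifications on this page

// Assumption: prompt tokens and generated tokens share one context window.
// Returns the largest usable max_tokens value for a given prompt size.
function maxTokensFor(promptTokens: number, requested: number): number {
  if (promptTokens < 0 || requested < 0) {
    throw new RangeError("token counts must be non-negative");
  }
  const remaining = CONTEXT_WINDOW - promptTokens;
  if (remaining <= 0) return 0; // prompt alone fills the window
  return Math.min(requested, MAX_OUTPUT, remaining);
}

console.log(maxTokensFor(1000, 500)); // 500
console.log(maxTokensFor(30000, 8192)); // 2768
```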

Deep dive — Alibaba DAMO Academy (Qwen Team)'s Qwen2-VL-72B Instruct

About Alibaba DAMO Academy (Qwen Team)
Founded 2017 · Hangzhou, China

The Qwen (Tongyi Qianwen) team sits inside Alibaba Cloud's DAMO Academy, the company's research arm founded in 2017 in Hangzhou. The team is led by Junyang Lin and Le Hou and counts dozens of researchers across NLP, vision and speech. Qwen has produced one of the most prolific open-source model lines in the world, including Qwen1.5, Qwen2 (June 2024), Qwen2.5 (September 2024), the Code, Math, Audio and VL (vision-language) families, and the January 2025 release of Qwen2.5-VL. Qwen2-VL launched in August 2024 in 2B, 7B and 72B sizes, with weights published on Hugging Face and ModelScope (the 72B weights followed in September 2024). The 72B Instruct variant became one of the top open-weights vision-language models, frequently matching closed-source peers on OCR-heavy benchmarks such as DocVQA and ChartQA. Alibaba offers Qwen models commercially through Alibaba Cloud and Bailian.

Architecture
Decoder-only Transformer with Naive Dynamic Resolution Vision Transformer

Qwen2-VL-72B-Instruct combines the Qwen2 72B decoder-only Transformer with a custom 675M-parameter ViT vision encoder that uses Naive Dynamic Resolution: instead of resizing every image to a fixed grid, the encoder accepts the native resolution and produces a variable number of visual tokens per image. The model also introduces Multimodal Rotary Position Embedding (M-RoPE), which encodes position along three separate axes (time for video, plus height and width), so text, images and video can be processed in a single token stream. It supports up to 20 minutes of video input via uniform frame sampling, single-image input at variable resolution up to ~16K visual tokens, and a 131,072-token text context window (the Specifications section of this page lists a 32,768-token window for the hosted model). Training proceeded in three stages: contrastive vision-language pretraining, multimodal pretraining on interleaved image-text and video-text data, and supervised fine-tuning with chain-of-thought multimodal instructions. Weights are released under the Qwen licence (free for commercial use under specific terms).
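
To make the "variable number of visual tokens" concrete, here is a rough estimator. It is a sketch under assumptions not stated on this page: a 14-pixel ViT patch with 2×2 patch merging (so one visual token covers a 28×28-pixel cell), as described in the published Qwen2-VL materials, and a cap of 16,384 tokens standing in for the "~16K" figure. The rounding and down-scaling rules are simplified and are not the model's actual preprocessing.

```typescript
const PATCH = 14; // assumed ViT patch size in pixels
const MERGE = 2; // assumed 2x2 patch merging before the language model
const CELL = PATCH * MERGE; // 28 pixels per visual token side
const MAX_VISUAL_TOKENS = 16384; // stand-in for the "~16K" limit

// Estimates the visual-token grid for an image at its native resolution.
function visualTokens(width: number, height: number): { cols: number; rows: number; tokens: number } {
  let cols = Math.max(1, Math.round(width / CELL));
  let rows = Math.max(1, Math.round(height / CELL));
  if (cols * rows > MAX_VISUAL_TOKENS) {
    // Shrink both sides by the same factor to keep the aspect ratio.
    const scale = Math.sqrt(MAX_VISUAL_TOKENS / (cols * rows));
    cols = Math.max(1, Math.floor(cols * scale));
    rows = Math.max(1, Math.floor(rows * scale));
    // Extreme aspect ratios: clamp the longer side as a last resort.
    if (cols * rows > MAX_VISUAL_TOKENS) {
      if (cols > rows) cols = Math.floor(MAX_VISUAL_TOKENS / rows);
      else rows = Math.floor(MAX_VISUAL_TOKENS / cols);
    }
  }
  return { cols, rows, tokens: cols * rows };
}

console.log(visualTokens(224, 224).tokens); // 64
console.log(visualTokens(448, 224).tokens); // 128
```

Under these assumptions a 224×224 image costs 64 visual tokens, while a wide 448×224 image costs 128, without being squashed into a square grid.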

Parameters: 72B (~73B with vision encoder)
Context: 131.1K tokens
What it can do
  • Open-weights 72B vision-language model under the Qwen licence
  • Naive Dynamic Resolution: images keep their native aspect ratio instead of being resized to a fixed grid
  • Multimodal Rotary Position Embedding (M-RoPE) for joint image and video position encoding
  • Up to 20 minutes of video understanding
  • 131K-token text context
  • Top open-weights scores on DocVQA, ChartQA, MathVista, RealWorldQA
  • Strong OCR across English, Chinese, Japanese, Korean and European languages
  • Best for: open-weights document AI, video QA, OCR-heavy multilingual workloads
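
The M-RoPE item above can be illustrated with a small position-ID generator. This is a sketch of the scheme as publicly described for Qwen2-VL, not the model's code: text tokens get the same index on all three axes, vision tokens get separate time, height and width indices, and the next segment starts after the largest index used so far. The `Segment` type and `mropePositions` function are hypothetical names.

```typescript
type Segment =
  | { kind: "text"; length: number }
  | { kind: "vision"; frames: number; rows: number; cols: number };

// One [time, height, width] triple per token.
type Pos = [number, number, number];

function mropePositions(segments: Segment[]): Pos[] {
  const out: Pos[] = [];
  let next = 0; // first free position index
  for (const seg of segments) {
    if (seg.kind === "text") {
      // Text behaves like ordinary 1D RoPE: all three axes share one index.
      for (let i = 0; i < seg.length; i++) {
        out.push([next + i, next + i, next + i]);
      }
      next += seg.length;
    } else {
      // Vision tokens: each axis counts independently from the same offset.
      for (let f = 0; f < seg.frames; f++) {
        for (let r = 0; r < seg.rows; r++) {
          for (let c = 0; c < seg.cols; c++) {
            out.push([next + f, next + r, next + c]);
          }
        }
      }
      next += Math.max(seg.frames, seg.rows, seg.cols);
    }
  }
  return out;
}

// Two text tokens, a single image with a 2x2 token grid, one more text token.
const positions = mropePositions([
  { kind: "text", length: 2 },
  { kind: "vision", frames: 1, rows: 2, cols: 2 },
  { kind: "text", length: 1 },
]);
console.log(positions);
```

For a single image the time index stays constant while height and width vary; for video, the time index advances once per sampled frame.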
Training & License

Training follows a multi-stage curriculum: contrastive vision-language pretraining on large web image-text pairs, then multimodal pretraining on interleaved image-text and video-text data, and finally supervised fine-tuning on curated chain-of-thought multimodal instructions.

License: Qwen Licence (commercial use is permitted for products with fewer than 100M monthly active users; a bespoke licence is required above that threshold).

Known limitations
  • Serving the 72B model requires multi-GPU infrastructure
  • Video understanding is limited to about 20 minutes, using uniform frame sampling
  • Can hallucinate on extreme OCR cases
  • Licence has MAU and competing-services restrictions
  • Audio input requires a separate Qwen-Audio model
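
The uniform frame sampling mentioned in the list above can be sketched as follows. The page does not say how many frames are sampled or at what frame rate, so `sampleCount` and the numbers in the usage line are illustrative assumptions, and `uniformFrameIndices` is a hypothetical helper that takes the centre of each equal-width bin.

```typescript
// Picks up to sampleCount frame indices spread evenly across a video.
function uniformFrameIndices(totalFrames: number, sampleCount: number): number[] {
  if (totalFrames <= 0 || sampleCount <= 0) return [];
  // Never ask for more frames than the video contains.
  const n = Math.min(Math.floor(totalFrames), Math.floor(sampleCount));
  // Split the video into n equal bins and take the centre frame of each.
  return Array.from({ length: n }, (_, i) =>
    Math.floor(((i + 0.5) * totalFrames) / n),
  );
}

console.log(uniformFrameIndices(100, 4)); // [ 12, 37, 62, 87 ]

// Illustrative only: a 20-minute clip at an assumed 30 fps, sampled 64 times.
const indices = uniformFrameIndices(20 * 60 * 30, 64);
console.log(indices.length); // 64
```

Because the sample count is fixed, longer videos are sampled more sparsely, which is one reason very long clips lose fine-grained temporal detail.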


Start using Qwen2-VL-72B Instruct today

Get started with free credits. No credit card required. Access Qwen2-VL-72B Instruct and 100+ other models through a single API.