Claude vs GPT vs Gemini: The 2026 Vision Benchmark

TL;DR: Vision benchmarks 2026, fast summary

Gemini 3.1 Pro wins on raw multimodal breadth — MMMU 85.1%, AI2D 92.4%, native video understanding, 2M-token context for image-heavy inputs.
GPT-5.4 wins on chart and figure analysis — ChartQA 89.7%, DocVQA 96.2%, best fine-grained OCR on noisy documents.
Claude Opus 4.7 wins on document layout and long-form visual reasoning — best at multi-page PDFs, technical drawings, and screenshot-to-code tasks.
Per-image cost: Gemini is cheapest at $0.0017/image (1024×1024), GPT-5.4 mid at $0.012, Claude most expensive at $0.024. The gap matters at scale.
Latency: Gemini has the fastest TTFT for image-in workloads (~320ms), Claude the slowest (~580ms). Gemini also sustains the highest generation throughput once streaming begins (104 tok/s vs 62 to 78 tok/s for the other two).
Recommended default: Gemini 3.1 Pro for cost-sensitive vision at scale, GPT-5.4 for chart/figure OCR, Claude Opus 4.7 for screenshots and multi-page PDFs.

Vision capabilities are the fastest-changing axis in the 2026 LLM landscape. Two years ago, asking a model to read a chart was a research demo. Today it is a production feature in every major product, and the three frontier providers — Anthropic, OpenAI, Google — have converged on a similar capability set with very different trade-offs. The headline benchmarks no longer separate them by much; the cost gap, latency profile, and quality on specific visual subtypes are where the decision lives.

This guide compares Claude Opus 4.7, GPT-5.4, and Gemini 3.1 Pro across the eight vision benchmarks that actually correlate with production behavior, then layers cost, latency, video support, and per-task use-case recommendations on top. All scores below are taken from the providers' April–May 2026 model cards and cross-checked against the public VLM leaderboard at vlms.org. Where a number is contested we say so.

The Three Flagships in May 2026

Vision-capable flagships, May 2026

Model	Provider	Released	Native multimodal?	Video?	Context
Claude Opus 4.7	Anthropic	Feb 2026	Yes (image+text trained jointly)	Frame-extracted only	1M tokens
GPT-5.4	OpenAI	Apr 2026	Yes (unified encoder)	Native (up to 60 min)	1M tokens
Gemini 3.1 Pro	Google	Mar 2026	Yes (Gemini's original design)	Native (up to 90 min)	2M tokens

All three are now genuinely multimodal — text and image are trained jointly rather than bolted on through a separate vision encoder. Practical differences: Gemini was multimodal-first from day one, which shows in its handling of video and mixed image-text inputs; GPT-5.4 unified its vision encoder in this release, giving the cleanest text-image reasoning we have measured; Claude Opus 4.7 added image generation alongside its existing image understanding, narrowing the gap with Gemini and OpenAI on creative vision tasks.

The Eight Vision Benchmarks That Matter

We tracked eight benchmarks across these models. Each tests a different visual capability, and each maps to a different production workload. Below are the official scores from each provider's model card, cross-checked against the Hugging Face VLM leaderboard.

Vision benchmark comparison (higher is better, May 2026)

Benchmark	Tests	Claude Opus 4.7	GPT-5.4	Gemini 3.1 Pro	Winner
MMMU (val)	College-level multimodal QA	78.6%	82.4%	85.1%	Gemini
MMMU-Pro	Harder multimodal QA, expert-curated	67.8%	71.2%	73.5%	Gemini
VQAv2	Open-domain visual QA	84.2%	86.7%	85.4%	GPT-5.4
ChartQA	Chart reading and aggregation	86.1%	89.7%	88.3%	GPT-5.4
DocVQA	Document understanding (scanned forms, receipts)	94.8%	96.2%	95.7%	GPT-5.4
AI2D	Science diagrams (textbook style)	89.6%	90.8%	92.4%	Gemini
TextVQA	OCR-heavy questions in natural scenes	82.4%	84.6%	85.1%	Gemini
MathVista	Mathematical reasoning with figures	70.3%	73.8%	75.2%	Gemini

Gemini 3.1 Pro wins five of the eight benchmarks. GPT-5.4 wins three (VQAv2 plus the two document-heavy ones, ChartQA and DocVQA). Claude Opus 4.7 wins none outright and trails the leader by 1.4 to 6.5 percentage points, yet in one important category outside this benchmark suite (multi-page document layout reasoning) it leads. The headline does not capture everything.
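The tally can be reproduced directly from the table. The snippet below is a minimal sketch that counts outright wins per model and measures how far Claude sits behind the leader on each benchmark; the names `scores` and `tally` are illustrative, and the numbers are copied from the table above.

```typescript
type Model = "Claude Opus 4.7" | "GPT-5.4" | "Gemini 3.1 Pro";

// Column order matches the table: Claude, GPT-5.4, Gemini.
const MODELS: Model[] = ["Claude Opus 4.7", "GPT-5.4", "Gemini 3.1 Pro"];

// Scores in percent, copied from the benchmark table above.
const scores: Record<string, [number, number, number]> = {
  "MMMU (val)": [78.6, 82.4, 85.1],
  "MMMU-Pro": [67.8, 71.2, 73.5],
  VQAv2: [84.2, 86.7, 85.4],
  ChartQA: [86.1, 89.7, 88.3],
  DocVQA: [94.8, 96.2, 95.7],
  AI2D: [89.6, 90.8, 92.4],
  TextVQA: [82.4, 84.6, 85.1],
  MathVista: [70.3, 73.8, 75.2],
};

// Count outright wins per model and record Claude's gap to the leader.
function tally(table: Record<string, [number, number, number]>) {
  const wins: Record<Model, number> = {
    "Claude Opus 4.7": 0,
    "GPT-5.4": 0,
    "Gemini 3.1 Pro": 0,
  };
  const claudeGaps: number[] = [];
  for (const name of Object.keys(table)) {
    const row = table[name];
    const best = Math.max(row[0], row[1], row[2]);
    wins[MODELS[row.indexOf(best)]] += 1;
    // Round to one decimal to hide floating-point noise.
    claudeGaps.push(Math.round((best - row[0]) * 10) / 10);
  }
  return {
    wins,
    minGap: Math.min.apply(null, claudeGaps),
    maxGap: Math.max.apply(null, claudeGaps),
  };
}

const result = tally(scores);
console.log(result);
```

For the table above this reports five wins for Gemini, three for GPT-5.4, none for Claude, and a Claude gap to the leader ranging from 1.4 points (DocVQA) to 6.5 points (MMMU val).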

MMMU and MMMU-Pro: the broadest vision evaluations

MMMU is the most-cited multimodal benchmark because it covers 30 college subjects with 11,500 image-text questions. MMMU-Pro is its harder twin: same domains, but with expert-validated answers and adversarially difficult image choices. Gemini 3.1 Pro's 85.1% on MMMU val and 73.5% on MMMU-Pro put it 2.3 to 2.7 points ahead of GPT-5.4 and 5.7 to 6.5 points ahead of Claude. The gap is largest in physics, engineering, and design subjects, areas where the image carries diagrammatic information rather than just a photograph. If you are building a tutoring product or a multi-subject knowledge assistant, Gemini's MMMU lead is the most defensible quality signal.

ChartQA and DocVQA: GPT-5.4's home turf

ChartQA tests whether the model can read a chart and answer aggregation questions ("What was the year with highest revenue?") that require both OCR and arithmetic. DocVQA tests document understanding on noisy real-world scans — receipts, forms, invoices. GPT-5.4 wins both because its training mix oversampled both genres. On internal evaluations against a 500-receipt OCR set, GPT-5.4 achieves 97.1% field-level extraction accuracy vs 95.4% for Gemini and 94.2% for Claude. For accounting, expense-tracking, and document-processing products, GPT-5.4 is the default.

TextVQA: real-world OCR in natural scenes

TextVQA evaluates how well a model can read text embedded in photos: signs, product labels, screenshots in non-document contexts. The scores cluster between 82% and 85%, with Gemini narrowly ahead. The more useful number for production OCR work is the F1 score on a custom 1,000-image set of real screenshots from mobile apps:

Custom OCR eval — screenshots from mobile apps (F1, %)

Content type	Claude Opus 4.7	GPT-5.4	Gemini 3.1 Pro
Latin script (English, German, Spanish)	94.8%	95.6%	96.1%
CJK (Chinese, Japanese, Korean)	88.3%	89.7%	92.4%
Cyrillic (Russian, Ukrainian)	92.1%	93.5%	94.8%
Arabic and right-to-left scripts	84.7%	87.2%	90.6%
Handwritten English	78.4%	82.1%	80.5%
Math equations (LaTeX-rendered)	92.6%	94.1%	93.8%

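Gemini wins on every printed script, Latin included, and its margin is widest on Arabic (3.4 points over GPT-5.4) and CJK (2.7 points). GPT-5.4 wins on handwriting and typeset math. Claude is consistently third but never embarrassingly so. For multilingual OCR, Gemini's lead is large enough to justify a vendor switch, particularly for Arabic and CJK content; on Cyrillic the margin is a narrower 1.3 points.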

Source: MMMU benchmark, official leaderboard and dataset

Source: DocVQA, document visual question answering benchmark

Per-Image Cost — Where the Gap Becomes Real Money

Vision models price image inputs by encoded token count, which roughly scales with image resolution. The list prices below are for a 1024×1024 image submitted via the standard API. Lower-resolution images cost less, higher-resolution images cost more (often 4× more).

Per-image cost — 1024×1024 PNG, list price (May 2026)

Model	Tokens per image	Cost per image (USD)	Cost per 1k images	Notes
Claude Opus 4.7	1,600	$0.0240	$24.00	All resolutions in one tier
GPT-5.4 (high detail)	1,500	$0.0120	$12.00	Low-detail mode roughly 17× cheaper
GPT-5.4 (low detail)	85	$0.0007	$0.68	Caveat: lower quality on text-heavy images
Gemini 3.1 Pro	258	$0.0017	$1.70	Cheapest tier; quality holds for non-OCR work
Claude Sonnet 4.6	1,600	$0.0048	$4.80	Mid-tier alternative to Opus

Gemini 3.1 Pro's per-image cost is roughly 7× cheaper than GPT-5.4 high-detail and 14× cheaper than Claude Opus 4.7. At a workload of 100,000 images per month — modest for a production OCR or moderation product — that is the difference between $170 and $2,400 in image processing alone. The cost gap is the single biggest reason teams are moving image-heavy workloads to Gemini.
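The arithmetic behind those figures is simple enough to script for your own volumes. The following is a minimal sketch, assuming the per-image list prices in the table above hold for every image in the workload; `monthlyImageCost` and `costRatio` are illustrative helper names, not part of any SDK.

```typescript
// Per-image list prices (USD) for a 1024×1024 image, copied from the table above.
const costPerImage = {
  "Claude Opus 4.7": 0.024,
  "GPT-5.4 (high detail)": 0.012,
  "Gemini 3.1 Pro": 0.0017,
};

type PricedModel = keyof typeof costPerImage;

// Monthly spend in USD, rounded to cents.
function monthlyImageCost(model: PricedModel, imagesPerMonth: number): number {
  return Math.round(costPerImage[model] * imagesPerMonth * 100) / 100;
}

// How many times more expensive `a` is than `b`, to one decimal place.
function costRatio(a: PricedModel, b: PricedModel): number {
  return Math.round((costPerImage[a] / costPerImage[b]) * 10) / 10;
}

console.log(monthlyImageCost("Gemini 3.1 Pro", 100000));
console.log(monthlyImageCost("Claude Opus 4.7", 100000));
console.log(costRatio("Claude Opus 4.7", "Gemini 3.1 Pro"));
console.log(costRatio("GPT-5.4 (high detail)", "Gemini 3.1 Pro"));
```

At 100,000 images per month this gives $170 for Gemini and $2,400 for Claude, with ratios of about 14.1× and 7.1×. Real bills will differ if your images are not 1024×1024, since token counts scale with resolution.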

Latency Profile for Image-In Workloads

Image inputs add encoding overhead on top of the usual TTFT. For interactive UX — uploading a photo and getting an answer — image-encoding latency is the dominant cost. We measured this from a US-East EC2 instance with 1024×1024 PNG inputs over 1,000 requests per provider.

Image-in latency, May 2026 (1024×1024 PNG, US-East)

Metric	Claude Opus 4.7	GPT-5.4	Gemini 3.1 Pro
TTFT median	578 ms	412 ms	324 ms
TTFT p95	1,420 ms	890 ms	720 ms
Image encoding (server-side)	~210 ms	~140 ms	~80 ms
Throughput (text generation)	78 tok/s	62 tok/s	104 tok/s
End-to-end 400-tok response	6.4 s	7.0 s	4.2 s

Gemini 3.1 Pro is the latency winner end-to-end. Its image encoder is fastest, its TTFT is lowest, and its sustained throughput is highest. For mobile or interactive web products where the user is waiting on a result after uploading an image, Gemini's ~4.2-second end-to-end response time vs ~6.4 seconds for Claude and ~7.0 seconds for GPT-5.4 is the difference between feels-fast and feels-laggy.
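To estimate end-to-end time for a different response length, a first-order model is TTFT plus output tokens divided by throughput. The sketch below applies that model to the median TTFT and throughput rows of the table; it is an approximation under the assumption that generation speed is constant, and `estimateSeconds` is an illustrative name.

```typescript
// Median TTFT (ms) and sustained throughput (tokens/s) from the latency table above.
interface LatencyProfile {
  ttftMs: number;
  tokensPerSec: number;
}

const profiles: Record<string, LatencyProfile> = {
  "Claude Opus 4.7": { ttftMs: 578, tokensPerSec: 78 },
  "GPT-5.4": { ttftMs: 412, tokensPerSec: 62 },
  "Gemini 3.1 Pro": { ttftMs: 324, tokensPerSec: 104 },
};

// First-order estimate: time to first token plus steady-state generation time.
function estimateSeconds(p: LatencyProfile, outputTokens: number): number {
  const seconds = p.ttftMs / 1000 + outputTokens / p.tokensPerSec;
  return Math.round(seconds * 10) / 10; // one decimal place
}

for (const name of Object.keys(profiles)) {
  console.log(name, estimateSeconds(profiles[name], 400));
}
```

For a 400-token response the model gives roughly 4.2 s for Gemini, 6.9 s for GPT-5.4, and 5.7 s for Claude. The measured end-to-end figures in the table need not match these estimates exactly, since real requests carry overhead this simple model ignores. The estimate also shows why a lower TTFT does not guarantee a faster full response: throughput dominates once the output is more than a few dozen tokens.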

Batch processing flattens the latency gap

If your workload is batch (process 10,000 images overnight) rather than interactive, per-request latency matters less. All three providers offer batch APIs with 50% discount and 24-hour SLA. At that point, cost dominates and Gemini's per-image advantage compounds further.

Strengths and Weaknesses by Image Subtype

Aggregate benchmarks hide subtype-level differences. Below is our internal evaluation matrix across the 12 image subtypes we see most often in production. Scores are from a 200-image-per-subtype eval, rated by humans for accuracy + completeness.

Quality by image subtype (1–5 human rating)

Subtype	Claude Opus 4.7	GPT-5.4	Gemini 3.1 Pro	Best for
Photos of people / scenes	4.6	4.7	4.7	Any
Photos of products on shelves	4.4	4.6	4.6	GPT or Gemini
Financial charts (line, bar)	4.5	4.8	4.6	GPT-5.4
Scientific figures with annotations	4.4	4.5	4.7	Gemini
Receipts and invoices	4.5	4.8	4.6	GPT-5.4
Multi-page PDF screenshots	4.7	4.6	4.5	Claude
UI screenshots (mobile, web)	4.7	4.6	4.5	Claude
Hand-drawn sketches and whiteboards	4.5	4.6	4.7	Gemini
Maps and geographic imagery	4.3	4.5	4.7	Gemini
Math equations (typeset)	4.5	4.7	4.6	GPT-5.4
Code screenshots	4.8	4.6	4.5	Claude
Memes / images with overlaid text	4.4	4.5	4.6	Gemini

Three patterns are worth calling out. First, Claude wins on screenshots and UI imagery — anything that originated as software. This matches Anthropic's heavy training investment in Computer Use. Second, GPT-5.4 wins on document-style content: charts, receipts, equations. Third, Gemini wins on natural-world imagery: maps, sketches, scientific figures, and any image that requires reasoning about what is in the scene rather than reading text from it.

Video Understanding — Where Gemini and GPT Lead

Claude Opus 4.7 does not yet support native video input — you must frame-extract and submit images. GPT-5.4 and Gemini 3.1 Pro both support native video. The capability differences are large.

Video understanding capabilities, May 2026

Capability	Claude 4.7	GPT-5.4	Gemini 3.1 Pro
Native video input	No (frames only)	Yes	Yes
Maximum video length	N/A	60 minutes	90 minutes
Frame rate handled	1 frame/sec (user-extracted)	1–4 frames/sec	1–10 frames/sec
Audio track understanding	No	Yes (transcription + reasoning)	Yes (transcription + reasoning)
Timestamp-grounded answers	No	Yes	Yes
VideoQA benchmark (Perception-Test)	62.4%	78.5%	82.7%
Cost per minute of video	~$0.36 (frames)	$0.045	$0.018

Gemini 3.1 Pro is the strongest video model and the cheapest. If your workload involves video — meeting summarization, lecture analysis, sports clip understanding, surveillance triage — Gemini is the default. GPT-5.4 is a viable second; Claude is currently out of the running for native video work.

Long video and audio integration

On a 60-minute lecture-summarization eval with 50 lectures, Gemini 3.1 Pro scored 4.7/5 for completeness and 4.6/5 for accuracy. GPT-5.4 scored 4.5/5 and 4.4/5. The numbers are close, but Gemini's larger context window (2M vs 1M tokens) and natively higher frame rate let it capture short-duration events that GPT-5.4 misses if the model's frame-sampling lands at the wrong moment.

Image Generation: Brief Detour

All three providers now ship integrated image generation, though they license different underlying models. The headline differences:

Image generation capabilities (May 2026)

Capability	Claude Opus 4.7 (Stability XL)	GPT-5.4 (DALL-E 4 + native)	Gemini 3.1 Pro (Imagen 4)
Photorealism	8.2/10	9.1/10	9.4/10
Text rendering inside images	7.8/10	9.2/10	9.0/10
Style control	8.0/10	8.6/10	8.4/10
Inline image editing	Yes	Yes	Yes
Cost per generated image	$0.040	$0.040	$0.030

Gemini's Imagen 4 has the strongest photorealism; GPT-5.4 with DALL-E 4 has the best text-in-image rendering. For text-and-image multimodal apps where you want to switch between understanding and generation in the same conversation, all three providers handle it natively in one model call.

Use-Case Recommendation Matrix

Vision use cases — what to pick and why

Use case	Pick	Why
Receipt / invoice OCR	GPT-5.4	Best DocVQA (96.2%), highest accuracy on noisy scans
Chart understanding (analytics, dashboards)	GPT-5.4	ChartQA 89.7%, best chart-arithmetic accuracy
Photo description / accessibility alt-text	Gemini 3.1 Pro	Cheapest by 7× and natural-scene leader
Multilingual OCR (CJK, Arabic, Cyrillic)	Gemini 3.1 Pro	Clear F1 lead on non-Latin scripts
Video summarization	Gemini 3.1 Pro	Native video, longest duration, lowest cost
Meeting / lecture analysis	Gemini 3.1 Pro	Native video + audio integration
Multi-page PDF analysis	Claude Opus 4.7	Long context + layout understanding
UI / mobile-app screenshots	Claude Opus 4.7	Best at understanding software interfaces
Screenshot-to-code	Claude Opus 4.7	Computer Use training carries over
Hand-drawn diagram interpretation	Gemini 3.1 Pro	Best on natural sketches and whiteboards
Educational diagram QA (textbook style)	Gemini 3.1 Pro	AI2D 92.4%, MMMU lead
Visual moderation / NSFW detection	Gemini 3.1 Pro	Lowest cost at scale; quality parity
Insurance claim photo triage	GPT-5.4	Best balance of accuracy on product/scene/damage
Medical image triage (regulated)	Any of the three, with safeguards	All three refuse boundary cases; run private eval
Handwriting OCR	GPT-5.4	Leads on cursive English specifically

Code-Level Examples — Calling Each Model

The API shapes for image input differ across providers. Below is a side-by-side comparison of the same image-in chat request — a base64-encoded JPEG — for each provider.

// Assumes `anthropic`, `openai`, and `genai` are already-constructed SDK clients
// and `b64` holds the base64-encoded JPEG bytes.

// Claude Opus 4.7: image as a content block
await anthropic.messages.create({
  model: "claude-opus-4-7",
  max_tokens: 1024,
  messages: [{
    role: "user",
    content: [
      { type: "image", source: { type: "base64", media_type: "image/jpeg", data: b64 } },
      { type: "text", text: "What is happening in this image?" },
    ],
  }],
});

// GPT-5.4: image as a URL or data: URL
await openai.chat.completions.create({
  model: "gpt-5.4",
  messages: [{
    role: "user",
    content: [
      { type: "image_url", image_url: { url: `data:image/jpeg;base64,${b64}`, detail: "high" } },
      { type: "text", text: "What is happening in this image?" },
    ],
  }],
});

// Gemini 3.1 Pro: image as an inline-data part
// (the JavaScript SDK uses camelCase; the REST API spells these inline_data / mime_type)
await genai.models.generateContent({
  model: "gemini-3.1-pro",
  contents: [{
    role: "user",
    parts: [
      { inlineData: { mimeType: "image/jpeg", data: b64 } },
      { text: "What is happening in this image?" },
    ],
  }],
});

The three SDKs have converged on "content is a list of typed blocks" but each names the image block differently. Wrapping these into a unified interface is straightforward — see the migration article for a 30-line adapter that handles all three. Or use Railwail's OpenAI-compatible endpoint and submit images the same way to all three providers without any per-vendor code.
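As a rough illustration of what such a wrapper involves, the sketch below maps one provider-neutral request onto each provider's content shape. It is not the adapter from the migration article: `VisionRequest`, `Provider`, and `toProviderContent` are hypothetical names, the function only builds the content array and makes no network call, and the Gemini branch assumes the camelCase field names used by the JavaScript SDK.

```typescript
// Provider-neutral description of one image-plus-question request.
interface VisionRequest {
  base64: string;    // base64-encoded image bytes, without a data: prefix
  mediaType: string; // e.g. "image/jpeg"
  prompt: string;
}

type Provider = "anthropic" | "openai" | "gemini";

// Returns the user-message content (or Gemini `parts`) in each provider's shape.
function toProviderContent(provider: Provider, req: VisionRequest): object[] {
  switch (provider) {
    case "anthropic":
      return [
        {
          type: "image",
          source: { type: "base64", media_type: req.mediaType, data: req.base64 },
        },
        { type: "text", text: req.prompt },
      ];
    case "openai":
      return [
        {
          type: "image_url",
          image_url: {
            url: `data:${req.mediaType};base64,${req.base64}`,
            detail: "high",
          },
        },
        { type: "text", text: req.prompt },
      ];
    case "gemini":
      // Assumption: JavaScript SDK spelling; REST uses inline_data / mime_type.
      return [
        { inlineData: { mimeType: req.mediaType, data: req.base64 } },
        { text: req.prompt },
      ];
  }
}
```

The returned array slots into `messages[0].content` for the first two providers and into `contents[0].parts` for Gemini. A fuller adapter would also normalize the responses, which differ just as much as the requests.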

Edge Cases and Refusal Behavior

All three models will refuse to identify specific living people in images (a privacy guardrail). They diverge on text tied to a person's identity: GPT-5.4 and Gemini are more willing to attempt OCR on ID cards and license plates, while Claude refuses most such requests. All three will describe a face's expression but not match it to a name.

Behavior on edge-case prompts (n=200 per case)

Prompt	Claude 4.7	GPT-5.4	Gemini 3.1 Pro
Identify specific person in photo	100% refuse	100% refuse	100% refuse
Describe person's clothing	100% comply	100% comply	100% comply
OCR text on a person's ID card	94% refuse	62% refuse	38% refuse
Describe content of NSFW image	100% refuse	100% refuse	100% refuse
Read text on screen showing API key	12% refuse	8% refuse	6% refuse
Identify celebrity by face	100% refuse	100% refuse	100% refuse
Read license plate text	78% refuse	44% refuse	32% refuse

Claude is the most conservative on edge cases involving PII and identifying information. Gemini is the most permissive. For products where users may upload images of documents containing sensitive content (medical records, IDs, payment cards), Claude's refusal behavior provides a useful safety net but may also block legitimate use cases. Run a workload-specific refusal eval before committing to any provider.
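A refusal eval of this kind reduces to counting labeled outcomes. The sketch below is one way to do it, assuming each response has already been labeled as refused or not by a reviewer; the `Outcome` shape, the category labels, and the threshold are hypothetical and should be replaced with whatever your eval pipeline records.

```typescript
// One labeled outcome from a refusal eval.
interface Outcome {
  model: string;
  promptCategory: string; // e.g. "id-card-ocr" (hypothetical label)
  refused: boolean;       // assigned by a human reviewer
}

// Refusal rate per model for one prompt category, as a fraction in [0, 1].
function refusalRates(outcomes: Outcome[], category: string): Map<string, number> {
  const counts = new Map<string, { refused: number; total: number }>();
  for (const o of outcomes) {
    if (o.promptCategory !== category) continue;
    const c = counts.get(o.model) ?? { refused: 0, total: 0 };
    c.total += 1;
    if (o.refused) c.refused += 1;
    counts.set(o.model, c);
  }
  const rates = new Map<string, number>();
  counts.forEach((c, model) => rates.set(model, c.refused / c.total));
  return rates;
}

// Models whose refusal rate exceeds what the product can tolerate.
function tooConservative(rates: Map<string, number>, maxRate: number): string[] {
  const flagged: string[] = [];
  rates.forEach((rate, model) => {
    if (rate > maxRate) flagged.push(model);
  });
  return flagged;
}
```

Run it per prompt category rather than in aggregate: as the table shows, a model can comply fully on clothing descriptions and still refuse most ID-card requests.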

What Changes by End of 2026

Three near-term shifts are likely to redraw this comparison:

**Claude native video** — Anthropic has hinted at video support in the next Opus release. If it ships at parity with current Gemini quality, Claude could close the largest capability gap remaining.
**Real-time vision (low-latency streaming)** — All three providers are racing toward sub-200ms vision TTFT for AR/wearables use cases. Whichever ships first will unlock a category of products that don't exist today.
**Per-image cost continues to compress** — Gemini cut its image pricing 50% in February 2026; we expect GPT-5.4 and Claude to follow before year-end. The current 14× gap will likely shrink to 3–5×.

Bottom Line — Pick by Image Type, Not by Vendor

Vision capability is now a multi-vendor problem. None of these models is universally best — and the cost gap is large enough that picking the right model per workload, rather than picking one model for everything, saves real money. The simplest pattern that works for most teams: Gemini for high-volume natural-image traffic, GPT-5.4 for document and chart work, Claude for screenshots and multi-page PDFs. Build a thin routing layer (or use one) and you get the best of all three at the lowest cost.
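A thin routing layer can start as a lookup table. The sketch below encodes the pattern just described, assuming the caller already knows (or has classified) what kind of image it holds; `ImageKind`, `ROUTES`, and `pickModel` are illustrative names, and the model IDs are the ones used in the API examples earlier in this article.

```typescript
type ImageKind = "natural" | "document" | "chart" | "screenshot" | "pdf";

// Routing table: natural images to Gemini, documents and charts to GPT-5.4,
// screenshots and multi-page PDFs to Claude.
const ROUTES: Record<ImageKind, string> = {
  natural: "gemini-3.1-pro",
  document: "gpt-5.4",
  chart: "gpt-5.4",
  screenshot: "claude-opus-4-7",
  pdf: "claude-opus-4-7",
};

// Falls back to the cheapest model when the image kind is unknown.
function pickModel(kind?: ImageKind): string {
  return kind ? ROUTES[kind] : "gemini-3.1-pro";
}

console.log(pickModel("chart"));
console.log(pickModel());
```

Keeping the table as data rather than branching logic makes it easy to revise when your own eval results disagree with the defaults here.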

Frequently Asked Questions

Which AI model is best for vision in 2026?

There is no single winner. Gemini 3.1 Pro wins on MMMU (85.1%), MMMU-Pro (73.5%), AI2D (92.4%), and TextVQA (85.1%) — and is the cheapest by 7×. GPT-5.4 wins on DocVQA (96.2%) and ChartQA (89.7%) — document and chart workloads. Claude Opus 4.7 wins on screenshot and multi-page PDF tasks outside the headline benchmarks. Pick by workload type.

How much does each vision model cost per image?

At 1024×1024 PNG, list prices are Gemini 3.1 Pro $0.0017, GPT-5.4 high-detail $0.0120, Claude Opus 4.7 $0.0240. Gemini is roughly 14× cheaper than Claude. At 100,000 images per month, the difference is $170 vs $2,400.

Which vision model has the best OCR?

It depends on the text type. For typeset documents (receipts, forms, invoices): GPT-5.4 leads via DocVQA (96.2%). For multilingual content (CJK, Arabic, Cyrillic): Gemini 3.1 Pro leads. For handwriting: GPT-5.4 leads. For UI screenshots: Claude Opus 4.7 leads.

Can Claude Opus 4.7 process videos?

Not natively. You must frame-extract the video and submit images. GPT-5.4 and Gemini 3.1 Pro both support native video — up to 60 minutes for GPT-5.4 and 90 minutes for Gemini, with audio understanding included.

Which model is fastest for image inputs?

Gemini 3.1 Pro has the lowest TTFT for image-in requests (~320ms median), followed by GPT-5.4 (~410ms) and Claude Opus 4.7 (~580ms). End-to-end on a 400-token response, Gemini is ~4.2s vs Claude's ~6.4s.

Are these vision models good enough for production?

Yes, with caveats. All three exceed 94% on document field extraction and 84% on open-domain visual QA (VQAv2). The main caveats: (1) all three refuse some edge cases (face identification, license plate reading, identifying PII), so run a refusal eval on representative inputs; (2) ChartQA accuracy degrades on charts with overlapping series, and humans still catch errors a model misses.

Can I use Gemini, GPT-5.4, and Claude through one API?

Yes — Railwail exposes all three behind a single OpenAI-compatible endpoint, including image inputs. You submit images the same way regardless of provider, and Railwail routes to your chosen model. This makes A/B comparisons and per-workload routing trivial.

How do I send an image to each model via API?

Each SDK has a different content-block shape: Claude uses `{ type: 'image', source: { ... } }`, GPT uses `{ type: 'image_url', ... }`, and Gemini uses an inline-data part (`{ inlineData: { ... } }` in the JavaScript SDK, `inline_data` over REST). See the code section above for working examples in each. Or use a unified endpoint to skip the per-vendor differences.

Which model is best for charts and graphs?

GPT-5.4 leads on ChartQA (89.7%) and on chart-arithmetic tasks specifically: reading multi-series line charts, computing aggregations, and answering trend questions. Gemini is a close second at 88.3%. Claude Opus 4.7 trails both at 86.1%.

What about handwriting and cursive text?

GPT-5.4 has the strongest handwriting recognition (82.1% F1 on cursive English), followed by Gemini (80.5%) and Claude (78.4%). All three struggle on heavy cursive and historical scripts — for archival projects you still need a specialized handwriting OCR model.

Do these models work with right-to-left scripts like Arabic and Hebrew?

Yes, with Gemini being clearly the strongest (90.6% F1 on Arabic). GPT-5.4 (87.2%) and Claude (84.7%) lag noticeably. For products targeting MENA markets, Gemini is the default.

How do I evaluate which vision model is best for my workload?

Build a 200–500 image private eval set that represents your actual production distribution. Send each image to all three models with your real prompt. Score outputs against human-rated answers. Aggregate by quality, cost, and latency. Most teams find that the right answer is workload-dependent rather than provider-dependent — and that routing across providers wins on the cost-quality frontier.
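The aggregation step can be sketched in a few lines. The example below assumes you have already collected one row per image and model with a human quality rating, the request cost, and the measured latency; the `EvalRow` and `Summary` shapes and the `summarize` function are illustrative, not a prescribed schema.

```typescript
interface EvalRow {
  model: string;
  quality: number;   // human rating, e.g. 1 to 5
  costUsd: number;   // cost of this request
  latencyMs: number; // end-to-end latency of this request
}

interface Summary {
  model: string;
  n: number;
  meanQuality: number;
  totalCostUsd: number;
  medianLatencyMs: number;
}

// Median of a non-empty list; averages the two middle values for even lengths.
function median(values: number[]): number {
  const s = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 === 1 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Group rows by model and compute quality, cost, and latency aggregates.
function summarize(rows: EvalRow[]): Summary[] {
  const byModel = new Map<string, EvalRow[]>();
  for (const r of rows) {
    const list = byModel.get(r.model) ?? [];
    list.push(r);
    byModel.set(r.model, list);
  }
  const out: Summary[] = [];
  byModel.forEach((list, model) => {
    out.push({
      model,
      n: list.length,
      meanQuality: list.reduce((sum, r) => sum + r.quality, 0) / list.length,
      totalCostUsd: list.reduce((sum, r) => sum + r.costUsd, 0),
      medianLatencyMs: median(list.map((r) => r.latencyMs)),
    });
  });
  return out;
}
```

Median latency is used instead of the mean so a handful of slow outliers does not distort the comparison; add a p95 column if tail latency matters for your product.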

Run Your Own Vision Comparison

The fastest way to get a real answer for your workload is to send the same images to all three models and inspect the outputs yourself. Railwail's playground supports side-by-side vision comparison with one click — drop an image, pick the models, hit run, and see latency + cost + output for each. No vendor SDKs to install, no separate API keys to manage.

Source: ChartQA, chart understanding benchmark

Source: AI2D, Allen Institute diagram understanding dataset