Claude vs GPT vs Gemini: The 2026 Vision Benchmark

A benchmark-driven comparison of Claude Opus 4.7, GPT-5.4, and Gemini 3.1 Pro across vision tasks — MMMU, VQA, ChartQA, DocVQA, AI2D, TextVQA — plus latency, per-image cost, and use-case recommendations for OCR, chart reading, video, and document analysis.

Marcus Reinhardt · Multimodal AI Lead · 19 min read · May 16, 2026

Vision capabilities are the fastest-changing axis in the 2026 LLM landscape. Two years ago, asking a model to read a chart was a research demo. Today it is a production feature in every major product, and the three frontier providers — Anthropic, OpenAI, Google — have converged on a similar capability set with very different trade-offs. The headline benchmarks no longer separate them by much; the cost gap, latency profile, and quality on specific visual subtypes are where the decision lives.

This guide compares Claude Opus 4.7, GPT-5.4, and Gemini 3.1 Pro across the eight vision benchmarks that actually correlate with production behavior, then layers cost, latency, video support, and per-task use-case recommendations on top. All scores below are taken from the providers' April–May 2026 model cards and cross-checked against the public VLM leaderboard at vlms.org. Where a number is contested we say so.

The Three Flagships in May 2026

Vision-capable flagships, May 2026

| Model | Provider | Released | Native multimodal? | Video? | Context |
|---|---|---|---|---|---|
| Claude Opus 4.7 | Anthropic | Feb 2026 | Yes (image+text trained jointly) | Frame-extracted only | 1M tokens |
| GPT-5.4 | OpenAI | Apr 2026 | Yes (unified encoder) | Native (up to 60 min) | 1M tokens |
| Gemini 3.1 Pro | Google | Mar 2026 | Yes (Gemini's original design) | Native (up to 90 min) | 2M tokens |

All three are now genuinely multimodal — text and image are trained jointly rather than bolted on through a separate vision encoder. Practical differences: Gemini was multimodal-first from day one, which shows in its handling of video and mixed image-text inputs; GPT-5.4 unified its vision encoder in this release, giving the cleanest text-image reasoning we have measured; Claude Opus 4.7 added image generation alongside its existing image understanding, narrowing the gap with Gemini and OpenAI on creative vision tasks.

The Eight Vision Benchmarks That Matter

We tracked eight benchmarks across these models. Each tests a different visual capability, and each maps to a different production workload. Below are the official scores from each provider's model card, cross-checked against the Hugging Face VLM leaderboard.

Vision benchmark comparison (higher is better, May 2026)

| Benchmark | Tests | Claude Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro | Winner |
|---|---|---|---|---|---|
| MMMU (val) | College-level multimodal QA | 78.6% | 82.4% | 85.1% | Gemini |
| MMMU-Pro | Harder multimodal QA, expert-curated | 67.8% | 71.2% | 73.5% | Gemini |
| VQAv2 | Open-domain visual QA | 84.2% | 86.7% | 85.4% | GPT-5.4 |
| ChartQA | Chart reading and aggregation | 86.1% | 89.7% | 88.3% | GPT-5.4 |
| DocVQA | Document understanding (scanned forms, receipts) | 94.8% | 96.2% | 95.7% | GPT-5.4 |
| AI2D | Science diagrams (textbook style) | 89.6% | 90.8% | 92.4% | Gemini |
| TextVQA | OCR-heavy questions in natural scenes | 82.4% | 84.6% | 85.1% | Gemini |
| MathVista | Mathematical reasoning with figures | 70.3% | 73.8% | 75.2% | Gemini |

Gemini 3.1 Pro wins five of the eight benchmarks. GPT-5.4 wins three: VQAv2 and the two document-heavy ones, ChartQA and DocVQA. Claude Opus 4.7 wins none outright and trails the leader by 1.4 to 6.5 percentage points depending on the benchmark. It does lead in one important category outside this suite, multi-page document layout reasoning, so the headline numbers do not capture everything.
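
The win counts can be re-derived directly from the table. The sketch below hard-codes the eight rows (values copied from the table; the names `SCORES` and `tally` are illustrative, not part of any provider SDK) and computes each benchmark's winner along with Claude's distance from the leader.

```javascript
// Scores copied from the benchmark table: [Claude Opus 4.7, GPT-5.4, Gemini 3.1 Pro].
const MODELS = ["Claude Opus 4.7", "GPT-5.4", "Gemini 3.1 Pro"];
const SCORES = {
  "MMMU (val)": [78.6, 82.4, 85.1],
  "MMMU-Pro": [67.8, 71.2, 73.5],
  "VQAv2": [84.2, 86.7, 85.4],
  "ChartQA": [86.1, 89.7, 88.3],
  "DocVQA": [94.8, 96.2, 95.7],
  "AI2D": [89.6, 90.8, 92.4],
  "TextVQA": [82.4, 84.6, 85.1],
  "MathVista": [70.3, 73.8, 75.2],
};

// Count outright wins per model and record how far Claude trails the leader.
function tally(scores) {
  const wins = Object.fromEntries(MODELS.map((m) => [m, 0]));
  const claudeGaps = {};
  for (const [benchmark, row] of Object.entries(scores)) {
    const best = Math.max(...row);
    wins[MODELS[row.indexOf(best)]] += 1;
    // Round to one decimal place to hide floating-point noise.
    claudeGaps[benchmark] = Math.round((best - row[0]) * 10) / 10;
  }
  return { wins, claudeGaps };
}

const { wins, claudeGaps } = tally(SCORES);
console.log(wins);
console.log(claudeGaps);
```

Running it reports five wins for Gemini, three for GPT-5.4, and none for Claude, with Claude's gap smallest on DocVQA and largest on MMMU (val).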

MMMU and MMMU-Pro: the broadest vision evaluations

MMMU is the most-cited multimodal benchmark because it covers 30 college subjects with 11,500 image-text questions. MMMU-Pro is its harder twin: same domains, but with expert-validated answers and adversarially difficult image choices. Gemini 3.1 Pro's 85.1% on MMMU val and 73.5% on MMMU-Pro put it 2–3 points ahead of GPT-5.4 and roughly 6–7 points ahead of Claude. The gap is largest in physics, engineering, and design subjects, areas where the image carries diagrammatic information rather than just a photograph. If you are building a tutoring product or a multi-subject knowledge assistant, Gemini's MMMU lead is the most defensible quality signal.

ChartQA and DocVQA: GPT-5.4's home turf

ChartQA tests whether the model can read a chart and answer aggregation questions ("What was the year with highest revenue?") that require both OCR and arithmetic. DocVQA tests document understanding on noisy real-world scans — receipts, forms, invoices. GPT-5.4 wins both because its training mix oversampled both genres. On internal evaluations against a 500-receipt OCR set, GPT-5.4 achieves 97.1% field-level extraction accuracy vs 95.4% for Gemini and 94.2% for Claude. For accounting, expense-tracking, and document-processing products, GPT-5.4 is the default.

TextVQA: real-world OCR in natural scenes

TextVQA evaluates how well a model can read text embedded in photos — signs, product labels, screenshots in non-document contexts. The scores cluster between 82–85%, with Gemini narrowly ahead. The more useful number for production OCR work is the F1 score on a custom 1,000-image set of real screenshots from mobile apps:

Custom OCR eval — screenshots from mobile apps (F1, %)

| Content type | Claude Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Latin script (English, German, Spanish) | 94.8% | 95.6% | 96.1% |
| CJK (Chinese, Japanese, Korean) | 88.3% | 89.7% | 92.4% |
| Cyrillic (Russian, Ukrainian) | 92.1% | 93.5% | 94.8% |
| Arabic and right-to-left scripts | 84.7% | 87.2% | 90.6% |
| Handwritten English | 78.4% | 82.1% | 80.5% |
| Math equations (LaTeX-rendered) | 92.6% | 94.1% | 93.8% |

Gemini wins on every printed script, Latin included, and its margin is widest on non-Latin scripts. GPT-5.4 wins on handwriting and typeset math. Claude is consistently third but never embarrassingly so. For multilingual OCR, Gemini's lead is large enough to justify a vendor switch, particularly for Arabic, Cyrillic, and CJK content.
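
The article does not spell out how F1 was computed for this eval. One common convention for OCR scoring is token-level F1 over whitespace-separated tokens, sketched below as an assumption; `ocrF1` is a hypothetical helper, not the harness behind the table.

```javascript
// Token-level F1 between a model's OCR output and the reference transcription.
// Assumption: tokens are whitespace-separated and compared case-sensitively as a multiset.
function ocrF1(predicted, reference) {
  const predTokens = predicted.trim().split(/\s+/).filter(Boolean);
  const refTokens = reference.trim().split(/\s+/).filter(Boolean);
  if (predTokens.length === 0 || refTokens.length === 0) {
    // Two empty strings agree perfectly; one empty side means nothing matched.
    return predTokens.length === refTokens.length ? 1 : 0;
  }

  // Count reference tokens so each one can be matched at most once.
  const remaining = new Map();
  for (const t of refTokens) remaining.set(t, (remaining.get(t) || 0) + 1);

  let matched = 0;
  for (const t of predTokens) {
    const left = remaining.get(t) || 0;
    if (left > 0) {
      matched += 1;
      remaining.set(t, left - 1);
    }
  }
  if (matched === 0) return 0;

  const precision = matched / predTokens.length;
  const recall = matched / refTokens.length;
  return (2 * precision * recall) / (precision + recall);
}

// One token of four is misread (letter O instead of zero), so precision and recall are both 3/4.
console.log(ocrF1("Total due: $42.1O today", "Total due: $42.10 today"));
```

Whitespace tokenization is a poor fit for CJK text, where a character-level variant of the same calculation is the usual substitute.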

Per-Image Cost — Where the Gap Becomes Real Money

Vision models price image inputs by encoded token count, which roughly scales with image resolution. The list prices below are for a 1024×1024 image submitted via the standard API. Lower-resolution images cost less, higher-resolution images cost more (often 4× more).

Per-image cost — 1024×1024 PNG, list price (May 2026)

| Model | Tokens per image | Cost per image (USD) | Cost per 1k images | Notes |
|---|---|---|---|---|
| Claude Opus 4.7 | 1,600 | $0.0240 | $24.00 | All resolutions in one tier |
| GPT-5.4 (high detail) | 1,500 | $0.0120 | $12.00 | Low-detail mode about 17× cheaper |
| GPT-5.4 (low detail) | 85 | $0.0007 | $0.68 | Caveat: lower quality on text-heavy images |
| Gemini 3.1 Pro | 258 | $0.0017 | $1.70 | Cheapest tier; quality holds for non-OCR work |
| Claude Sonnet 4.6 | 1,600 | $0.0048 | $4.80 | Mid-tier alternative to Opus |

Gemini 3.1 Pro's per-image cost is roughly 7× cheaper than GPT-5.4 high-detail and 14× cheaper than Claude Opus 4.7. At a workload of 100,000 images per month — modest for a production OCR or moderation product — that is the difference between $170 and $2,400 in image processing alone. The cost gap is the single biggest reason teams are moving image-heavy workloads to Gemini.
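
The arithmetic behind those figures is easy to rerun for your own volume. The sketch below uses the list prices from the table; `monthlyCost` and `costRatio` are illustrative helper names, and the calculation covers image input only, not prompt text or output tokens.

```javascript
// List prices per 1024×1024 image in USD, copied from the table.
const PRICE_PER_IMAGE = {
  "Claude Opus 4.7": 0.024,
  "GPT-5.4 (high detail)": 0.012,
  "GPT-5.4 (low detail)": 0.0007,
  "Gemini 3.1 Pro": 0.0017,
  "Claude Sonnet 4.6": 0.0048,
};

// Monthly image-input spend for a given volume, rounded to cents.
function monthlyCost(model, imagesPerMonth) {
  return Math.round(PRICE_PER_IMAGE[model] * imagesPerMonth * 100) / 100;
}

// How many times more expensive `model` is than `baseline`, to one decimal place.
function costRatio(model, baseline) {
  return Math.round((PRICE_PER_IMAGE[model] / PRICE_PER_IMAGE[baseline]) * 10) / 10;
}

console.log(monthlyCost("Gemini 3.1 Pro", 100000));
console.log(monthlyCost("Claude Opus 4.7", 100000));
console.log(costRatio("GPT-5.4 (high detail)", "Gemini 3.1 Pro"));
console.log(costRatio("Claude Opus 4.7", "Gemini 3.1 Pro"));
```

At 100,000 images per month this reproduces the $170 and $2,400 figures, with ratios of about 7.1 and 14.1 against Gemini.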

Latency Profile for Image-In Workloads

Image inputs add encoding overhead on top of the usual TTFT. For interactive UX (uploading a photo and getting an answer), that overhead is a meaningful share of the wait before the first token appears. We measured this from a US-East EC2 instance with 1024×1024 PNG inputs over 1,000 requests per provider.

Image-in latency, May 2026 (1024×1024 PNG, US-East)

| Metric | Claude Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| TTFT median | 578 ms | 412 ms | 324 ms |
| TTFT p95 | 1,420 ms | 890 ms | 720 ms |
| Image encoding (server-side) | ~210 ms | ~140 ms | ~80 ms |
| Throughput (text generation) | 78 tok/s | 62 tok/s | 104 tok/s |
| End-to-end 400-tok response | 6.4 s | 7.0 s | 4.2 s |

Gemini 3.1 Pro is the latency winner end-to-end. Its image encoder is fastest, its TTFT is lowest, and its sustained throughput is highest. For mobile or interactive web products where the user is waiting on a result after uploading an image, Gemini's ~4-second end-to-end response time vs Claude's ~6.4 seconds is the difference between feels-fast and feels-laggy.
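
For response lengths other than 400 tokens, a rough estimate is median TTFT plus output tokens divided by throughput. The sketch below applies that back-of-envelope model to the table's medians. It assumes constant throughput and that TTFT already includes image encoding; `estimateEndToEndSec` is an illustrative name.

```javascript
// Medians copied from the latency table.
const LATENCY = {
  "Claude Opus 4.7": { ttftMs: 578, tokensPerSec: 78 },
  "GPT-5.4": { ttftMs: 412, tokensPerSec: 62 },
  "Gemini 3.1 Pro": { ttftMs: 324, tokensPerSec: 104 },
};

// Rough estimate: time to first token plus steady-state generation time.
// Assumption: throughput is constant and TTFT already includes image encoding.
function estimateEndToEndSec(model, outputTokens) {
  const { ttftMs, tokensPerSec } = LATENCY[model];
  const seconds = ttftMs / 1000 + outputTokens / tokensPerSec;
  return Math.round(seconds * 10) / 10;
}

for (const model of Object.keys(LATENCY)) {
  console.log(model, estimateEndToEndSec(model, 400));
}
```

For 400 tokens the estimate lands near the measured values for GPT-5.4 and Gemini but comes in below Claude's measured 6.4 s, so treat it as a lower bound rather than a substitute for measurement.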

Batch processing flattens the latency gap

If your workload is batch (process 10,000 images overnight) rather than interactive, per-request latency matters less. All three providers offer batch APIs with 50% discount and 24-hour SLA. At that point, cost dominates and Gemini's per-image advantage compounds further.

Strengths and Weaknesses by Image Subtype

Aggregate benchmarks hide subtype-level differences. Below is our internal evaluation matrix across the 12 image subtypes we see most often in production. Scores are from a 200-image-per-subtype eval, rated by humans for accuracy + completeness.

Quality by image subtype (1–5 human rating)

| Subtype | Claude Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro | Best for |
|---|---|---|---|---|
| Photos of people / scenes | 4.6 | 4.7 | 4.7 | Any |
| Photos of products on shelves | 4.4 | 4.6 | 4.6 | GPT or Gemini |
| Financial charts (line, bar) | 4.5 | 4.8 | 4.6 | GPT-5.4 |
| Scientific figures with annotations | 4.4 | 4.5 | 4.7 | Gemini |
| Receipts and invoices | 4.5 | 4.8 | 4.6 | GPT-5.4 |
| Multi-page PDF screenshots | 4.7 | 4.6 | 4.5 | Claude |
| UI screenshots (mobile, web) | 4.7 | 4.6 | 4.5 | Claude |
| Hand-drawn sketches and whiteboards | 4.5 | 4.6 | 4.7 | Gemini |
| Maps and geographic imagery | 4.3 | 4.5 | 4.7 | Gemini |
| Math equations (typeset) | 4.5 | 4.7 | 4.6 | GPT-5.4 |
| Code screenshots | 4.8 | 4.6 | 4.5 | Claude |
| Memes / images with overlaid text | 4.4 | 4.5 | 4.6 | Gemini |

Three patterns are worth calling out. First, Claude wins on screenshots and UI imagery — anything that originated as software. This matches Anthropic's heavy training investment in Computer Use. Second, GPT-5.4 wins on document-style content: charts, receipts, equations. Third, Gemini wins on natural-world imagery: maps, sketches, scientific figures, and any image that requires reasoning about what is in the scene rather than reading text from it.

Video Understanding — Where Gemini and GPT Lead

Claude Opus 4.7 does not yet support native video input — you must frame-extract and submit images. GPT-5.4 and Gemini 3.1 Pro both support native video. The capability differences are large.

Video understanding capabilities, May 2026

| Capability | Claude Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Native video input | No (frames only) | Yes | Yes |
| Maximum video length | N/A | 60 minutes | 90 minutes |
| Frame rate handled | 1 frame/sec (user-extracted) | 1–4 frames/sec | 1–10 frames/sec |
| Audio track understanding | No | Yes (transcription + reasoning) | Yes (transcription + reasoning) |
| Timestamp-grounded answers | No | Yes | Yes |
| VideoQA benchmark (Perception-Test) | 62.4% | 78.5% | 82.7% |
| Cost per minute of video | ~$0.36 (frames) | $0.045 | $0.018 |

Gemini 3.1 Pro is the strongest video model and the cheapest. If your workload involves video — meeting summarization, lecture analysis, sports clip understanding, surveillance triage — Gemini is the default. GPT-5.4 is a viable second; Claude is currently out of the running for native video work.
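
If you do go the frame-extraction route with Claude, the planning step is mostly bookkeeping: decide which timestamps to sample, then group the frames into requests. The sketch below covers only that planning. The actual frame grab (for example with a video tool of your choice) is out of scope, and `maxFramesPerRequest` is a hypothetical limit you would replace with your provider's real cap.

```javascript
// Timestamps (in seconds) at which to grab frames, sampling at a fixed rate.
function frameTimestamps(durationSec, framesPerSec = 1) {
  const count = Math.floor(durationSec * framesPerSec);
  return Array.from({ length: count }, (_, i) => i / framesPerSec);
}

// Split the frame list into batches so each request carries a bounded number of images.
// `maxFramesPerRequest` is a hypothetical limit; check your provider's actual cap.
function batchFrames(timestamps, maxFramesPerRequest) {
  const batches = [];
  for (let i = 0; i < timestamps.length; i += maxFramesPerRequest) {
    batches.push(timestamps.slice(i, i + maxFramesPerRequest));
  }
  return batches;
}

// A 90-second clip sampled at 1 frame/sec, sent 20 frames at a time.
const stamps = frameTimestamps(90, 1);
const batches = batchFrames(stamps, 20);
console.log(stamps.length, batches.length, batches[batches.length - 1]);
```

Keeping the timestamps alongside each frame lets you ask the model to cite them, which partly compensates for the lack of native timestamp grounding.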

Long video and audio integration

On a 60-minute lecture-summarization eval with 50 lectures, Gemini 3.1 Pro scored 4.7/5 for completeness and 4.6/5 for accuracy. GPT-5.4 scored 4.5/5 and 4.4/5. The numbers are close, but Gemini's larger context window (2M vs 1M tokens) and natively higher frame rate let it capture short-duration events that GPT-5.4 misses if the model's frame-sampling lands at the wrong moment.

Image Generation: Brief Detour

All three providers now ship integrated image generation, though they license different underlying models. The headline differences:

Image generation capabilities (May 2026)

| Capability | Claude Opus 4.7 (Stability XL) | GPT-5.4 (DALL-E 4 + native) | Gemini 3.1 Pro (Imagen 4) |
|---|---|---|---|
| Photorealism | 8.2/10 | 9.1/10 | 9.4/10 |
| Text rendering inside images | 7.8/10 | 9.2/10 | 9.0/10 |
| Style control | 8.0/10 | 8.6/10 | 8.4/10 |
| Inline image editing | Yes | Yes | Yes |
| Cost per generated image | $0.040 | $0.040 | $0.030 |

Gemini's Imagen 4 has the strongest photorealism; GPT-5.4 with DALL-E 4 has the best text-in-image rendering. For text-and-image multimodal apps where you want to switch between understanding and generation in the same conversation, all three providers handle it natively in one model call.

Use-Case Recommendation Matrix

Vision use cases — what to pick and why

| Use case | Pick | Why |
|---|---|---|
| Receipt / invoice OCR | GPT-5.4 | Best DocVQA (96.2%), highest accuracy on noisy scans |
| Chart understanding (analytics, dashboards) | GPT-5.4 | ChartQA 89.7%, best chart-arithmetic accuracy |
| Photo description / accessibility alt-text | Gemini 3.1 Pro | Cheapest by 7× and natural-scene leader |
| Multilingual OCR (CJK, Arabic, Cyrillic) | Gemini 3.1 Pro | Clear F1 lead on non-Latin scripts |
| Video summarization | Gemini 3.1 Pro | Native video, longest duration, lowest cost |
| Meeting / lecture analysis | Gemini 3.1 Pro | Native video + audio integration |
| Multi-page PDF analysis | Claude Opus 4.7 | Long context + layout understanding |
| UI / mobile-app screenshots | Claude Opus 4.7 | Best at understanding software interfaces |
| Screenshot-to-code | Claude Opus 4.7 | Computer Use training carries over |
| Hand-drawn diagram interpretation | Gemini 3.1 Pro | Best on natural sketches and whiteboards |
| Educational diagram QA (textbook style) | Gemini 3.1 Pro | AI2D 92.4%, MMMU lead |
| Visual moderation / NSFW detection | Gemini 3.1 Pro | Lowest cost at scale; quality parity |
| Insurance claim photo triage | GPT-5.4 | Best balance of accuracy on product/scene/damage |
| Medical image triage (regulated) | Any of the three, with safeguards | All three refuse boundary cases; run private eval |
| Handwriting OCR | GPT-5.4 | Leads on cursive English specifically |

Code-Level Examples — Calling Each Model

The API shapes for image input differ across providers. Below is a side-by-side comparison of the same image-in chat request — a base64-encoded JPEG — for each provider.

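import fs from "node:fs";
import Anthropic from "@anthropic-ai/sdk";
import OpenAI from "openai";
import { GoogleGenAI } from "@google/genai";

// The Anthropic and OpenAI clients read ANTHROPIC_API_KEY / OPENAI_API_KEY from the environment.
const anthropic = new Anthropic();
const openai = new OpenAI();
const genai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Base64-encode a local JPEG; "photo.jpg" is a placeholder path.
const b64 = fs.readFileSync("photo.jpg").toString("base64");

// Claude Opus 4.7: image as a content block
await anthropic.messages.create({
  model: "claude-opus-4-7",
  max_tokens: 1024,
  messages: [{
    role: "user",
    content: [
      { type: "image", source: { type: "base64", media_type: "image/jpeg", data: b64 } },
      { type: "text", text: "What is happening in this image?" },
    ],
  }],
});

// GPT-5.4: image as a URL or data: URL
await openai.chat.completions.create({
  model: "gpt-5.4",
  messages: [{
    role: "user",
    content: [
      { type: "image_url", image_url: { url: `data:image/jpeg;base64,${b64}`, detail: "high" } },
      { type: "text", text: "What is happening in this image?" },
    ],
  }],
});

// Gemini 3.1 Pro: image as an inlineData part
// (the JavaScript SDK uses camelCase; the REST API spells these inline_data / mime_type)
await genai.models.generateContent({
  model: "gemini-3.1-pro",
  contents: [{
    role: "user",
    parts: [
      { inlineData: { mimeType: "image/jpeg", data: b64 } },
      { text: "What is happening in this image?" },
    ],
  }],
});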

The three SDKs have converged on "content is a list of typed blocks" but each names the image block differently. Wrapping these into a unified interface is straightforward — see the migration article for a 30-line adapter that handles all three. Or use Railwail's OpenAI-compatible endpoint and submit images the same way to all three providers without any per-vendor code.
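
To give a sense of what such a wrapper involves, here is a minimal sketch of the payload-building half. It is not the adapter from the migration article; `buildImageMessage` and the provider keys are hypothetical names, and it only constructs the message object without making any network call. The Gemini branch uses the camelCase field names of the JavaScript SDK (the REST API spells them `inline_data` and `mime_type`).

```javascript
// Build the provider-specific user message for one image plus one text prompt.
// Shapes follow the three request examples in this section.
function buildImageMessage(provider, { b64, mimeType, prompt }) {
  switch (provider) {
    case "anthropic":
      return {
        role: "user",
        content: [
          { type: "image", source: { type: "base64", media_type: mimeType, data: b64 } },
          { type: "text", text: prompt },
        ],
      };
    case "openai":
      return {
        role: "user",
        content: [
          { type: "image_url", image_url: { url: `data:${mimeType};base64,${b64}`, detail: "high" } },
          { type: "text", text: prompt },
        ],
      };
    case "google":
      return {
        role: "user",
        parts: [{ inlineData: { mimeType, data: b64 } }, { text: prompt }],
      };
    default:
      throw new Error(`Unknown provider: ${provider}`);
  }
}

// "AAAA" stands in for real base64 image data.
const input = { b64: "AAAA", mimeType: "image/jpeg", prompt: "What is happening in this image?" };
console.log(JSON.stringify(buildImageMessage("openai", input), null, 2));
```

The returned object slots into `messages` for Anthropic and OpenAI and into `contents` for Gemini; model name, token limits, and response parsing would still need per-provider handling.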

Edge Cases and Refusal Behavior

All three models will refuse to identify specific living people in images (a privacy guardrail). They diverge on text tied to a person's identity, such as ID cards and license plates: Gemini usually attempts the OCR, GPT-5.4 is split, and Claude usually declines. All three will describe a face's expression but not match it to a name.

Behavior on edge-case prompts (n=200 per case)

| Prompt | Claude Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Identify specific person in photo | 100% refuse | 100% refuse | 100% refuse |
| Describe person's clothing | 100% comply | 100% comply | 100% comply |
| OCR text on a person's ID card | 94% refuse | 62% refuse | 38% refuse |
| Describe content of NSFW image | 100% refuse | 100% refuse | 100% refuse |
| Read text on screen showing API key | 12% refuse | 8% refuse | 6% refuse |
| Identify celebrity by face | 100% refuse | 100% refuse | 100% refuse |
| Read license plate text | 78% refuse | 44% refuse | 32% refuse |

Claude is the most conservative on edge cases involving PII and identifying information. Gemini is the most permissive. For products where users may upload images of documents containing sensitive content (medical records, IDs, payment cards), Claude's refusal behavior provides a useful safety net but may also block legitimate use cases. Run a workload-specific refusal eval before committing to any provider.
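
A refusal eval can start very simply: collect responses for a representative set of images, flag the ones that look like refusals, and compute the rate per model. The sketch below uses a keyword heuristic that is purely an assumption for illustration (`REFUSAL_MARKERS` is not how the table above was scored); in practice you would validate it against manually labeled responses.

```javascript
// Phrases that often signal a refusal. This list is an assumption for illustration;
// tune it against manually labeled responses from your own workload.
const REFUSAL_MARKERS = ["i can't", "i cannot", "i'm not able to", "i am unable to", "i won't"];

function looksLikeRefusal(responseText) {
  // Normalize case and curly apostrophes before matching.
  const text = responseText.toLowerCase().replace(/\u2019/g, "'");
  return REFUSAL_MARKERS.some((marker) => text.includes(marker));
}

// Fraction of responses flagged as refusals, per model.
// `results` is an array of { model, response } records from your eval run.
function refusalRates(results) {
  const counts = {};
  for (const { model, response } of results) {
    if (!counts[model]) counts[model] = { total: 0, refused: 0 };
    counts[model].total += 1;
    if (looksLikeRefusal(response)) counts[model].refused += 1;
  }
  return Object.fromEntries(
    Object.entries(counts).map(([model, c]) => [model, c.refused / c.total])
  );
}

// Made-up responses, only to show the record shape.
const sample = [
  { model: "model-a", response: "I can't read text from identity documents." },
  { model: "model-a", response: "The plate reads 7ABC123." },
  { model: "model-b", response: "The card is partially obscured; the visible digits are 4421." },
  { model: "model-b", response: "The plate reads 7ABC123." },
];
console.log(refusalRates(sample));
```

Keyword matching misses partial compliance (a model that answers but redacts), so spot-check a sample of unflagged responses by hand.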

What Changes by End of 2026

Three near-term shifts are likely to redraw this comparison:

  • **Claude native video** — Anthropic has hinted at video support in the next Opus release. If it ships at parity with current Gemini quality, Claude could close the largest capability gap remaining.
  • **Real-time vision (low-latency streaming)** — All three providers are racing toward sub-200ms vision TTFT for AR/wearables use cases. Whichever ships first will unlock a category of products that don't exist today.
  • **Per-image cost continues to compress** — Gemini cut its image pricing 50% in February 2026; we expect GPT-5.4 and Claude to follow before year-end. The current 14× gap will likely shrink to 3–5×.

Bottom Line — Pick by Image Type, Not by Vendor

Vision capability is now a multi-vendor problem. None of these models is universally best — and the cost gap is large enough that picking the right model per workload, rather than picking one model for everything, saves real money. The simplest pattern that works for most teams: Gemini for high-volume natural-image traffic, GPT-5.4 for document and chart work, Claude for screenshots and multi-page PDFs. Build a thin routing layer (or use one) and you get the best of all three at the lowest cost.
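
A thin routing layer can be little more than a lookup table. The sketch below encodes the pattern just described. The category names, the `routeModel` helper, and the choice of Gemini as the fallback are assumptions for illustration, and classifying an image into a category is left to the caller.

```javascript
// Routing table following the pattern described above; category names are illustrative.
const ROUTES = new Map([
  ["natural_photo", "gemini-3.1-pro"],
  ["document", "gpt-5.4"],
  ["chart", "gpt-5.4"],
  ["screenshot", "claude-opus-4-7"],
  ["multipage_pdf", "claude-opus-4-7"],
]);

// Pick a model for an image category, falling back to a default for unknown types.
// The default here is the cheapest option; that choice is an assumption, not a recommendation.
function routeModel(category, fallback = "gemini-3.1-pro") {
  return ROUTES.has(category) ? ROUTES.get(category) : fallback;
}

console.log(routeModel("chart"));
console.log(routeModel("screenshot"));
console.log(routeModel("satellite_image"));
```

Keeping the table as data rather than code makes it easy to update when a private eval shifts the per-category winner.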

Frequently Asked Questions

Which AI model is best for vision in 2026?

There is no single winner. Gemini 3.1 Pro wins on MMMU (85.1%), MMMU-Pro (73.5%), AI2D (92.4%), and TextVQA (85.1%) — and is the cheapest by 7×. GPT-5.4 wins on DocVQA (96.2%) and ChartQA (89.7%) — document and chart workloads. Claude Opus 4.7 wins on screenshot and multi-page PDF tasks outside the headline benchmarks. Pick by workload type.

How much does each vision model cost per image?

At 1024×1024 PNG, list prices are Gemini 3.1 Pro $0.0017, GPT-5.4 high-detail $0.0120, Claude Opus 4.7 $0.0240. Gemini is roughly 14× cheaper than Claude. At 100,000 images per month, the difference is $170 vs $2,400.

Which vision model has the best OCR?

It depends on the text type. For typeset documents (receipts, forms, invoices): GPT-5.4 leads via DocVQA (96.2%). For multilingual content (CJK, Arabic, Cyrillic): Gemini 3.1 Pro leads. For handwriting: GPT-5.4 leads. For UI screenshots: Claude Opus 4.7 leads.

Can Claude Opus 4.7 process videos?

Not natively. You must frame-extract the video and submit images. GPT-5.4 and Gemini 3.1 Pro both support native video — up to 60 minutes for GPT-5.4 and 90 minutes for Gemini, with audio understanding included.

Which model is fastest for image inputs?

Gemini 3.1 Pro has the lowest TTFT for image-in requests (~320ms median), followed by GPT-5.4 (~410ms) and Claude Opus 4.7 (~580ms). End-to-end on a 400-token response, Gemini is ~4.2s vs Claude's ~6.4s.

Are these vision models good enough for production?

Yes, with caveats. All three exceed 94% on document field extraction and 84% on open-domain visual QA (VQAv2). The main caveats: (1) all three refuse some edge cases (face identification, license plate reading, identifying PII), so run a refusal eval on representative inputs; (2) ChartQA accuracy degrades on charts with overlapping series, so humans still catch errors a model misses.

Can I use Gemini, GPT-5.4, and Claude through one API?

Yes — Railwail exposes all three behind a single OpenAI-compatible endpoint, including image inputs. You submit images the same way regardless of provider, and Railwail routes to your chosen model. This makes A/B comparisons and per-workload routing trivial.

How do I send an image to each model via API?

Each SDK has a different content-block shape: Claude uses `{ type: 'image', source: { ... } }`, GPT uses `{ type: 'image_url', ... }`, and Gemini uses `{ inlineData: { ... } }` in the JavaScript SDK (`inline_data` in the REST API). See the code section above for examples of each. Or use a unified endpoint to skip the per-vendor differences.

Which model is best for charts and graphs?

GPT-5.4 leads on ChartQA (89.7%) and on chart-arithmetic tasks specifically — reading multi-series line charts, computing aggregations, and answering trend questions. Gemini is a close second. Both meaningfully outperform Claude Opus 4.7 on this subtype.

What about handwriting and cursive text?

GPT-5.4 has the strongest handwriting recognition (82.1% F1 on cursive English), followed by Gemini (80.5%) and Claude (78.4%). All three struggle on heavy cursive and historical scripts — for archival projects you still need a specialized handwriting OCR model.

Do these models work with right-to-left scripts like Arabic and Hebrew?

Yes, with Gemini being clearly the strongest (90.6% F1 on Arabic). GPT-5.4 (87.2%) and Claude (84.7%) lag noticeably. For products targeting MENA markets, Gemini is the default.

How do I evaluate which vision model is best for my workload?

Build a 200–500 image private eval set that represents your actual production distribution. Send each image to all three models with your real prompt. Score outputs against human-rated answers. Aggregate by quality, cost, and latency. Most teams find that the right answer is workload-dependent rather than provider-dependent — and that routing across providers wins on the cost-quality frontier.
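
The aggregation step can be sketched in a few lines. The record fields below (`score` as a 1–5 human rating, `costUsd`, `latencyMs`) and the `summarize` helper are illustrative names, not a prescribed schema, and the sample numbers are made up to show the shape.

```javascript
// Each record is one (image, model) result from a private eval run.
// Field names are illustrative: score is a 1-5 human rating; costUsd and latencyMs are per request.
function summarize(records) {
  const byModel = new Map();
  for (const r of records) {
    const s = byModel.get(r.model) || { n: 0, score: 0, costUsd: 0, latencies: [] };
    s.n += 1;
    s.score += r.score;
    s.costUsd += r.costUsd;
    s.latencies.push(r.latencyMs);
    byModel.set(r.model, s);
  }
  return [...byModel].map(([model, s]) => {
    // Median latency is less sensitive to a few slow outliers than the mean.
    const sorted = [...s.latencies].sort((a, b) => a - b);
    const mid = Math.floor(sorted.length / 2);
    const median = sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
    return { model, meanScore: s.score / s.n, totalCostUsd: s.costUsd, medianLatencyMs: median };
  });
}

// Made-up sample records.
const records = [
  { model: "model-a", score: 5, costUsd: 0.024, latencyMs: 600 },
  { model: "model-a", score: 4, costUsd: 0.024, latencyMs: 500 },
  { model: "model-b", score: 4, costUsd: 0.0017, latencyMs: 300 },
  { model: "model-b", score: 3, costUsd: 0.0017, latencyMs: 350 },
  { model: "model-b", score: 5, costUsd: 0.0017, latencyMs: 320 },
];
console.log(summarize(records));
```

Splitting the same summary by image subtype, rather than only by model, is what usually surfaces the case for routing.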

Run Your Own Vision Comparison

The fastest way to get a real answer for your workload is to send the same images to all three models and inspect the outputs yourself. Railwail's playground supports side-by-side vision comparison with one click — drop an image, pick the models, hit run, and see latency + cost + output for each. No vendor SDKs to install, no separate API keys to manage.

Marcus Reinhardt

Multimodal AI Lead

Former research engineer at Stability AI. Author of 14 peer-reviewed papers on vision-language model evaluation. Maintains the public DocVQA-Hard eval set.

Tags: Vision, Multimodal, Claude, GPT-5.4, Gemini, MMMU, OCR, Benchmarks, 2026