Multimodal Models

Models that combine text, vision, and other modalities

Multimodal models for vision, OCR, and document understanding

Multimodal models accept text plus images (sometimes plus audio or video) and produce text output. Reach for one when your input contains images and your output is structured information: extract an invoice, describe a chart, transcribe a handwritten note, answer questions about a UI screenshot.

57 models available

Claude Opus 4.7

Multimodal, Anthropic
New, Popular

Anthropic's April 2026 flagship. 87.6% on SWE-bench Verified, 3x higher image resolution, output self-verification, vision + reasoning.

Free
anthropic, flagship, reasoning

Claude Sonnet 4.6

Multimodal, Anthropic
New, Popular

Anthropic's balanced mid-tier model from February 2026. Best price/performance for production workloads: 5x cheaper than Opus, near-flagship quality.

Free
anthropic, balanced, production

Depth Anything v2

Multimodal, Replicate
Popular

Monocular depth-estimation model trained on 595k labeled and 62M unlabeled images. Strong zero-shot generalization in indoor and outdoor scenes.

€0.005
replicate, depth, vision-understanding

Gemini 3 Flash

Multimodal, Google DeepMind
New, Popular

Google's April 2026 fast multimodal model. Combines Gemini 3 Pro's reasoning with Flash-tier latency and price. Default model in the Gemini app.

Free
google, deepmind, balanced

Gemini 3.1 Pro

Multimodal, Google DeepMind
New, Popular

Google DeepMind's February 2026 flagship. 2M-token context, native multimodal (text/image/audio/video), Deep Think reasoning.

Free
google, deepmind, flagship

GPT-5.4

Multimodal, OpenAI
New, Popular

OpenAI's unified flagship combining GPT and o-series reasoning into one model. 1M context, multimodal, top SWE-Bench Pro and OSWorld scores.

Free
openai, flagship, reasoning

GPT-5.4 Mini

Multimodal, OpenAI
New, Popular

OpenAI's efficient mid-tier model. 2x faster than its predecessor, 400k context, approaches GPT-5.4 quality on SWE-Bench Pro at a fraction of the cost.

Free
openai, balanced, cost-efficient

Grok 4.3

Multimodal, xAI
New, Popular

xAI's May 2026 flagship. 1M context, vision, always-on reasoning, real-time X/web retrieval via DeepSearch.

Free
xai, flagship, reasoning

SAM 2 (Segment Anything 2)

Multimodal, Meta
Popular

Meta Segment Anything 2. Promptable segmentation across images and video with temporal memory. Zero-shot, point/box/mask prompts, fast on a single H100.

€0.01
replicate, segmentation, meta

Claude Haiku 4.5

Multimodal, Anthropic
New

Anthropic's fastest and cheapest 4.x model. Strong vision and tool use at ultra-low latency, ideal for high-concurrency workloads.

Free
anthropic, cost-efficient, low-latency

CogVLM2 19B

Multimodal, Replicate

Tsinghua CogVLM2 19B with Llama-3 8B base plus 11B vision expert. Strong document understanding and visual reasoning, 8k context.

€0.01
replicate, multimodal, vision-understanding

DeepSeek-VL 7B

Multimodal, Replicate

DeepSeek-VL 7B chat model. Vision-language model with hybrid vision encoder and strong real-world visual question answering performance.

€0.008
replicate, multimodal, vision-understanding

Detectron2

Multimodal, Replicate

Meta Detectron2 object-detection and segmentation toolkit. Mask R-CNN, Cascade R-CNN, Panoptic FPN and many other model variants in one wrapper.

€0.008
replicate, segmentation, vision-understanding

DINOv2

Multimodal, Replicate

Meta DINOv2 self-supervised vision backbone. Pretrained features for classification, segmentation and depth without task-specific fine-tuning.

€0.005
replicate, segmentation, vision-understanding

Donut Document

Multimodal, Replicate

Naver CLOVA Donut OCR-free document-understanding transformer. End-to-end JSON extraction from forms, receipts and invoices without explicit OCR.

€0.008
replicate, ocr, vision-understanding

Dots OCR

Multimodal, Replicate

Rednote Hilab Dots OCR. End-to-end document parsing model with layout, text and reading-order prediction in one transformer.

€0.008
replicate, ocr, vision-understanding

DWPose

Multimodal, Replicate

DWPose whole-body 2D pose estimator. Two-stage knowledge-distilled model with strong accuracy on face, hands and body keypoints simultaneously.

€0.005
replicate, pose, vision-understanding

EasyOCR

Multimodal, Replicate

JaidedAI EasyOCR. Simple Python OCR wrapper supporting 80+ languages with deep-learning text detection and recognition.

€0.002
replicate, ocr, vision-understanding

Florence-2 Large

Multimodal, Microsoft

Microsoft Florence-2 Large. Unified prompt-based vision foundation model for captioning, detection, segmentation and OCR with a single 770M-param backbone.

€0.008
replicate, multimodal, vision-understanding

Florence-2 Segmentation

Multimodal, Community

Microsoft Florence-2 unified vision model with referring expression segmentation. Text-prompted region and mask generation in one model.

€0.009
replicate, segmentation, vision-understanding

GLPN Depth

Multimodal, Replicate

Global-Local Path Networks depth-estimation model. Combines hierarchical transformer encoder with selective feature fusion for sharp boundaries.

€0.004
replicate, depth, vision-understanding

GOT-OCR 2.0

Multimodal, Replicate

StepFun GOT-OCR 2.0. Unified end-to-end OCR-2.0 model handling text, formulas, charts, sheet music and geometric shapes in one architecture.

€0.009
replicate, ocr, vision-understanding

GPT-5.4 Nano

Multimodal, OpenAI
New

OpenAI's smallest and cheapest GPT-5.4 variant. Built for high-volume classification, extraction and coding subagents at edge-grade latency.

Free
openai, cost-efficient, low-latency

Grok 2 Vision

Multimodal, xAI

xAI's vision-capable Grok 2 snapshot. Image-in, text-out with strong multilingual instruction following.

Free
xai, vision, legacy

Grok 4.1 Fast

Multimodal, xAI
New

xAI's cost-efficient high-throughput model. 2M context, optional reasoning, optimized for agentic loops and real-time apps.

Free
xai, cost-efficient, vision

Grounded-SAM

Multimodal, Replicate

Grounding DINO plus SAM. Open-vocabulary text-prompted detection and segmentation in one pipeline for fully-automatic mask generation.

€0.01
replicate, segmentation, vision-understanding

HRNet Pose

Multimodal, Replicate

Microsoft HRNet high-resolution pose-estimation backbone. Parallel multi-resolution streams yield strong accuracy on COCO keypoint benchmarks.

€0.005
replicate, pose, vision-understanding

Idefics3 8B

Multimodal, Replicate

Hugging Face Idefics3 8B. Llama-3 based open-source vision-language model with strong document QA and chart-understanding performance.

€0.007
replicate, multimodal, vision-understanding

InternVL 2.5

Multimodal, Replicate

OpenGVLab InternVL 2.5 78B. Open-source vision-language model approaching GPT-4o on the MMMU, OCRBench and MathVista benchmarks.

€0.03
replicate, multimodal, vision-understanding

LayoutLMv3

Multimodal, Microsoft

Microsoft LayoutLMv3 multimodal document model. Unified text/image masking pretraining for form understanding, receipts and document QA.

€0.007
replicate, ocr, vision-understanding

Llama 3.2 90B Vision (multimodal)

Multimodal, Meta

Meta's flagship vision-language model. 90B parameters, image understanding + chat, strong VQA performance.

Free
meta, llama, multimodal

Llama 3.2 Vision 90B

Multimodal, Meta

Meta Llama 3.2 90B Vision. Largest open-weights Llama vision model. Strong visual reasoning, chart, OCR and document understanding.

€0.02
replicate, multimodal, vision-understanding

LLaVA-OneVision 72B

Multimodal, Replicate

LMMs-Lab LLaVA-OneVision 72B. Unified single-image, multi-image and video instruction-tuned VLM with task-transfer across modalities.

€0.02
replicate, multimodal, vision-understanding

Lotus-G

Multimodal, Replicate

Lotus generative depth model. Treats depth as a generation task using a diffusion model, producing higher-fidelity depth on textured surfaces.

€0.01
replicate, depth, vision-understanding

Marigold

Multimodal, Replicate

ETH Zurich Marigold. Diffusion-based monocular depth-estimation model fine-tuned from Stable Diffusion with strong fine-detail recovery.

€0.01
replicate, depth, vision-understanding

Marker PDF Extract

Multimodal, Replicate

Marker PDF-to-Markdown conversion pipeline. Combines layout, OCR and equation models to produce clean Markdown with preserved tables and formulas.

€0.008
replicate, ocr, vision-understanding

Mask2Former

Multimodal, Replicate

Meta Mask2Former universal image-segmentation transformer. Single architecture for panoptic, instance and semantic segmentation tasks.

€0.009
replicate, segmentation, vision-understanding

MediaPipe Pose

Multimodal, Google DeepMind

Google MediaPipe Pose. Lightweight on-device-friendly 33-keypoint 3D pose estimator with optional segmentation mask output.

€0.003
replicate, pose, vision-understanding

MiDaS v3.1

Multimodal, Replicate

Intel MiDaS v3.1 relative depth-estimation model. Robust zero-shot single-image depth across diverse domains and resolutions.

€0.004
replicate, depth, vision-understanding

MiniCPM-V 2.6

Multimodal, Replicate

OpenBMB MiniCPM-V 2.6. 8B vision-language model with strong single-image, multi-image and video understanding plus OCR capabilities.

€0.008
replicate, multimodal, vision-understanding

Mistral OCR

Multimodal, Mistral AI

Mistral OCR API. Document-understanding model with strong table and equation extraction, and structured JSON output.

€0.001
mistral, ocr, vision-understanding

Mistral Pixtral Large (124B)

Multimodal, Mistral AI

Mistral's 124B multimodal flagship. 123B decoder + 1B vision encoder, 128k ctx, up to 30 images per request.

Free
mistral, pixtral, multimodal

MMPose

Multimodal, Replicate

OpenMMLab MMPose toolbox. Wraps RTMPose, HRNet, HigherHRNet and many other pose models behind a unified inference API.

€0.006
replicate, pose, vision-understanding

olmOCR

Multimodal, Replicate

Allen AI olmOCR. Open-source 7B vision-language model fine-tuned for high-fidelity document parsing including math, code and tables.

€0.01
replicate, ocr, vision-understanding

OpenPose

Multimodal, Replicate

CMU OpenPose multi-person 2D pose estimator. Real-time keypoint detection for body, hand, face and foot using Part Affinity Fields.

€0.005
replicate, pose, vision-understanding

PaddleOCR v3

Multimodal, Replicate

Baidu PaddleOCR v3 PP-OCR pipeline. Lightweight detector plus recognizer optimized for production use with 80+ language support.

€0.003
replicate, ocr, vision-understanding

Phi-3.5 Vision

Multimodal, Microsoft

Microsoft Phi-3.5 Vision Instruct. Small (4.2B) multimodal model with strong document, OCR and multi-image reasoning at low cost.

€0.005
replicate, multimodal, vision-understanding

Qwen2-VL-72B Instruct

Multimodal, Alibaba / Qwen

Alibaba's 72B vision-language model with M-RoPE and dynamic resolution. Strong document and video understanding.

Free
qwen, alibaba, multimodal

Reka Core

Multimodal, Custom

Reka's frontier multimodal model supporting text, image, video and audio inputs.

Free
reka, multimodal, video-understanding

Reka Edge

Multimodal, Custom

Reka's small on-device-friendly multimodal model. ~7B parameters, 16k context.

Free
reka, multimodal, edge

Reka Flash

Multimodal, Custom

Reka's 21B dense multimodal model balancing speed and quality. Up to 128k context.

Free
reka, multimodal, cost-efficient

SAM HQ

Multimodal, Replicate

ETH Zurich SAM-HQ. High-quality mask refinement on top of SAM. Sharper edges and finer structure than the original Segment Anything model.

€0.01
replicate, segmentation, vision-understanding

SegFormer B5

Multimodal, Replicate

NVIDIA SegFormer-B5 semantic segmentation. Hierarchical transformer encoder with lightweight MLP decoder, strong ADE20K and Cityscapes results.

€0.007
replicate, segmentation, vision-understanding

TrOCR Large

Multimodal, Microsoft

Microsoft TrOCR large transformer-based OCR. End-to-end visual encoder plus text decoder, trained on synthetic and printed real-world data.

€0.004
replicate, ocr, vision-understanding

ViTPose

Multimodal, Replicate

ViTPose plain-vision-transformer pose estimator. State-of-the-art keypoint accuracy on MS-COCO with a minimal architecture.

€0.006
replicate, pose, vision-understanding

Yi-VL 34B

Multimodal, 01.AI

01.AI Yi-VL 34B vision-language model. Bilingual (CN/EN) image understanding, strong CMMMU and MMMU performance among open-weights VLMs.

€0.02
replicate, multimodal, vision-understanding

ZoeDepth

Multimodal, Replicate

Intel ZoeDepth metric depth-estimation model. Combines relative-depth pretraining with metric fine-tuning for absolute distance in real units.

€0.005
replicate, depth, vision-understanding

Top multimodal model picks

Hand-picked across four common criteria and resolved against the live catalog, so the picks track price and performance changes.

Best overall
Claude Opus 4.7

Anthropic's April 2026 flagship. 87.6% on SWE-bench Verified, 3x higher image resolution, output self-verification, vision + reasoning.

Cheapest
GPT-5.4 Mini

OpenAI's efficient mid-tier model. 2x faster than its predecessor, 400k context, approaches GPT-5.4 quality on SWE-Bench Pro at a fraction of the cost.

Longest context
Gemini 3.1 Pro

Google DeepMind's February 2026 flagship. 2M-token context, native multimodal (text/image/audio/video), Deep Think reasoning.

Fastest
Claude Opus 4.7

Anthropic's April 2026 flagship. 87.6% on SWE-bench Verified, 3x higher image resolution, output self-verification, vision + reasoning.


Pricing is per-token, as with text-only LLMs, with one twist: each image is billed as a block of tokens, typically 250-1,500 depending on resolution and detail mode. A standard 1024×1024 image costs roughly the same as a 1,000-word text input. High-detail mode, which preserves fine text and small UI elements, costs 2-4× more. Plan budgets accordingly: a workload that processes 10,000 receipt scans per day can easily run €10-€50 per day at flagship rates.
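
The budget arithmetic above can be sketched in a few lines of Python. This is a rough estimate under stated assumptions, not a billing formula: the function name, the 1,000 tokens-per-image figure, and the €1-€5 per million input tokens are illustrative placeholders, and output tokens are ignored. Check your provider's current price sheet for real numbers.

```python
def daily_image_cost(images_per_day: int,
                     tokens_per_image: int,
                     price_per_million_tokens: float,
                     detail_multiplier: float = 1.0) -> float:
    """Estimate the daily input cost (in EUR) of the image tokens alone.

    detail_multiplier models high-detail mode (the text above quotes 2-4x).
    """
    total_tokens = images_per_day * tokens_per_image * detail_multiplier
    return total_tokens / 1_000_000 * price_per_million_tokens


# Hypothetical workload: 10,000 receipt scans per day at ~1,000 tokens each.
for price in (1.0, 3.0, 5.0):  # assumed EUR per million input tokens
    cost = daily_image_cost(10_000, 1_000, price)
    print(f"EUR {price:.2f}/M tokens -> EUR {cost:.2f} per day")
```

With these assumed rates the estimate spans €10 to €50 per day, matching the range quoted above. Passing `detail_multiplier=3.0` shows how quickly high-detail mode moves the total.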

The trade-off is among OCR accuracy, reasoning quality, and cost. Flagships (GPT-5.4, Claude Opus 4.7, Gemini 3.1 Pro) read complex layouts and reason over chart contents reliably. Specialized OCR-first models (Qwen 2.5 VL, Pixtral, InternVL) sometimes outperform them on pure text extraction at a fraction of the cost. For pure document-to-JSON pipelines, a specialized model with a strict JSON schema usually wins. For document-and-reasoning workloads ('extract this invoice AND tell me if the tax math is right'), flagships win.
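
Whichever model does the extraction, a document-to-JSON pipeline benefits from checking the output on your side. The sketch below is one possible shape, using only the Python standard library: the field names in `INVOICE_FIELDS`, the `parse_invoice` and `tax_math_ok` helpers, and the sample payload are all hypothetical, and amounts are assumed to arrive as decimal strings.

```python
import json
from decimal import Decimal

# Hypothetical strict schema: every field is required, no extras allowed.
# Amounts are kept as strings so Decimal can parse them without float rounding.
INVOICE_FIELDS = {
    "invoice_number": str,
    "currency": str,
    "net_total": str,
    "tax_rate": str,
    "tax_amount": str,
    "gross_total": str,
}


def parse_invoice(raw: str) -> dict:
    """Parse model output and reject anything that deviates from the schema."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = INVOICE_FIELDS.keys() - data.keys()
    extra = data.keys() - INVOICE_FIELDS.keys()
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
    for name, expected in INVOICE_FIELDS.items():
        if not isinstance(data[name], expected):
            raise ValueError(f"{name} must be of type {expected.__name__}")
    return data


def tax_math_ok(inv: dict, tolerance: Decimal = Decimal("0.01")) -> bool:
    """Check that net * rate matches the tax and net + tax matches the gross."""
    net = Decimal(inv["net_total"])
    rate = Decimal(inv["tax_rate"])
    tax = Decimal(inv["tax_amount"])
    gross = Decimal(inv["gross_total"])
    return abs(net * rate - tax) <= tolerance and abs(net + tax - gross) <= tolerance


# Made-up model output for illustration only.
sample = json.dumps({
    "invoice_number": "INV-0001",
    "currency": "EUR",
    "net_total": "100.00",
    "tax_rate": "0.19",
    "tax_amount": "19.00",
    "gross_total": "119.00",
})

invoice = parse_invoice(sample)
print("tax math consistent:", tax_math_ok(invoice))
```

Doing the arithmetic check in code keeps the cheaper extraction-only model viable for the 'is the tax math right' half of the task when the rule is this mechanical. Reasoning that cannot be written down as a rule is where a flagship earns its price.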

Watch out for resolution limits: most models downscale very large images before processing, which can destroy fine text in screenshots and dense documents. Pre-process to preserve readability: slice tall documents into single-page images and upscale low-resolution scans before sending. Also watch for hallucinations in hard-to-read regions; multimodal models tend to confidently invent text where the source is illegible.
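
Slicing a tall document comes down to computing crop rectangles. The following is a minimal sketch in plain Python: `slice_boxes`, its `page_height` parameter, and the optional `overlap` (so a text line cut at a boundary appears whole in at least one slice) are illustrative choices, not part of any model's API.

```python
def slice_boxes(width: int, height: int, page_height: int, overlap: int = 0):
    """Return (left, upper, right, lower) crop boxes covering a tall image.

    Consecutive boxes share `overlap` pixels; the last box may be shorter.
    """
    if page_height <= 0 or not 0 <= overlap < page_height:
        raise ValueError("need page_height > 0 and 0 <= overlap < page_height")
    boxes = []
    top = 0
    while top < height:
        bottom = min(top + page_height, height)
        boxes.append((0, top, width, bottom))
        if bottom == height:
            break
        top = bottom - overlap  # step back so neighbouring slices overlap
    return boxes


# A hypothetical 1000 x 2500 px scan cut into 1000 px pages with 100 px overlap.
for box in slice_boxes(1000, 2500, 1000, overlap=100):
    print(box)
```

The boxes use the same (left, upper, right, lower) order that Pillow's `Image.crop` accepts, so if Pillow is available each one can be passed straight to `image.crop(box)` before the slice is sent to the model.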

The top picks above cover the best overall flagship, the cheapest workhorse, the longest-context model, and the fastest option.
