Multimodal Models

Models that combine text, vision, and other modalities

Multimodal models for vision, OCR, and document understanding

Multimodal models accept text plus images (and sometimes audio or video) and produce text output. Reach for one when the input contains images and the output is structured information: extracting an invoice, describing a chart, transcribing a handwritten note, or answering questions about a UI screenshot.

57 models available

Claude Opus 4.7

Multimodal · Anthropic
New · Popular

Anthropic's April 2026 flagship. 87.6% on SWE-bench Verified, 3x higher image resolution, output self-verification, vision + reasoning.

Free
anthropic · flagship · reasoning

Claude Sonnet 4.6

Multimodal · Anthropic
New · Popular

Anthropic's balanced mid-tier model from February 2026. Best price/performance for production workloads: 5x cheaper than Opus, near-flagship quality.

Free
anthropic · balanced · production

Depth Anything v2

Multimodal · Replicate
Popular

Monocular depth-estimation model trained on 595k labeled and 62M unlabeled images. Strong zero-shot generalization in indoor and outdoor scenes.

€0.005
replicate · depth · vision-understanding

Gemini 3 Flash

Multimodal · Google DeepMind
New · Popular

Google's April 2026 fast multimodal model. Combines Gemini 3 Pro's reasoning with Flash-tier latency and price. Default model in the Gemini app.

Free
google · deepmind · balanced

Gemini 3.1 Pro

Multimodal · Google DeepMind
New · Popular

Google DeepMind's February 2026 flagship. 2M-token context, native multimodal (text/image/audio/video), Deep Think reasoning.

Free
google · deepmind · flagship

GPT-5.4

Multimodal · OpenAI
New · Popular

OpenAI's unified flagship combining GPT and o-series reasoning into one model. 1M context, multimodal, top SWE-Bench Pro and OSWorld scores.

Free
openai · flagship · reasoning

GPT-5.4 Mini

Multimodal · OpenAI
New · Popular

OpenAI's efficient mid-tier model. 2x faster than its predecessor, 400k context, approaches GPT-5.4 quality on SWE-Bench Pro at a fraction of the cost.

Free
openai · balanced · cost-efficient

Grok 4.3

Multimodal · xAI
New · Popular

xAI's May 2026 flagship. 1M context, vision, always-on reasoning, real-time X/web retrieval via DeepSearch.

Free
xai · flagship · reasoning

SAM 2 (Segment Anything 2)

Multimodal · Meta
Popular

Meta Segment Anything 2. Promptable segmentation across images and video with temporal memory. Zero-shot, point/box/mask prompts, fast on a single H100.

€0.01
replicate · segmentation · meta

Claude Haiku 4.5

Multimodal · Anthropic
New

Anthropic's fastest and cheapest 4.x model. Strong vision and tool use at ultra-low latency, ideal for high-concurrency workloads.

Free
anthropic · cost-efficient · low-latency

CogVLM2 19B

Multimodal · Replicate

Tsinghua CogVLM2 19B with Llama-3 8B base plus 11B vision expert. Strong document understanding and visual reasoning, 8k context.

€0.01
replicate · multimodal · vision-understanding

DeepSeek-VL 7B

Multimodal · Replicate

DeepSeek-VL 7B chat model. Vision-language model with hybrid vision encoder and strong real-world visual question answering performance.

€0.008
replicate · multimodal · vision-understanding

Detectron2

Multimodal · Replicate

Meta Detectron2 object-detection and segmentation toolkit. Mask R-CNN, Cascade R-CNN, panoptic FPN and many other model variants in one wrapper.

€0.008
replicate · segmentation · vision-understanding

DINOv2

Multimodal · Replicate

Meta DINOv2 self-supervised vision backbone. Pretrained features for classification, segmentation and depth without task-specific fine-tuning.

€0.005
replicate · segmentation · vision-understanding

Donut Document

Multimodal · Replicate

Naver CLOVA Donut OCR-free document-understanding transformer. End-to-end JSON extraction from forms, receipts and invoices without explicit OCR.

€0.008
replicate · ocr · vision-understanding

Dots OCR

Multimodal · Replicate

Rednote Hilab Dots OCR. End-to-end document parsing model with layout, text and reading-order prediction in one transformer.

€0.008
replicate · ocr · vision-understanding

DWPose

Multimodal · Replicate

DWPose whole-body 2D pose estimator. Two-stage knowledge-distilled model with strong accuracy on face, hands and body keypoints simultaneously.

€0.005
replicate · pose · vision-understanding

EasyOCR

Multimodal · Replicate

JaidedAI EasyOCR. Simple Python OCR wrapper supporting 80+ languages with deep-learning text detection and recognition.

€0.002
replicate · ocr · vision-understanding

Florence-2 Large

Multimodal · Microsoft

Microsoft Florence-2 Large. Unified prompt-based vision foundation model for captioning, detection, segmentation and OCR with a single 770M-param backbone.

€0.008
replicate · multimodal · vision-understanding

Florence-2 Segmentation

Multimodal · Community

Microsoft Florence-2 unified vision model with referring expression segmentation. Text-prompted region and mask generation in one model.

€0.009
replicate · segmentation · vision-understanding

GLPN Depth

Multimodal · Replicate

Global-Local Path Networks depth-estimation model. Combines hierarchical transformer encoder with selective feature fusion for sharp boundaries.

€0.004
replicate · depth · vision-understanding

GOT-OCR 2.0

Multimodal · Replicate

StepFun GOT-OCR 2.0. Unified end-to-end OCR-2.0 model handling text, formulas, charts, sheet music and geometric shapes in one architecture.

€0.009
replicate · ocr · vision-understanding

GPT-5.4 Nano

Multimodal · OpenAI
New

OpenAI's smallest and cheapest GPT-5.4 variant. Built for high-volume classification, extraction and coding subagents at edge-grade latency.

Free
openai · cost-efficient · low-latency

Grok 2 Vision

Multimodal · xAI

xAI's vision-capable Grok 2 snapshot. Image-in, text-out with strong multilingual instruction following.

Free
xai · vision · legacy

Grok 4.1 Fast

Multimodal · xAI
New

xAI's cost-efficient high-throughput model. 2M context, optional reasoning, optimized for agentic loops and real-time apps.

Free
xai · cost-efficient · vision

Grounded-SAM

Multimodal · Replicate

Grounding DINO plus SAM. Open-vocabulary text-prompted detection and segmentation in one pipeline for fully-automatic mask generation.

€0.01
replicate · segmentation · vision-understanding

HRNet Pose

Multimodal · Replicate

Microsoft HRNet high-resolution pose-estimation backbone. Parallel multi-resolution streams yield strong accuracy on COCO keypoint benchmarks.

€0.005
replicate · pose · vision-understanding

Idefics3 8B

Multimodal · Replicate

Hugging Face Idefics3 8B. Llama-3 based open-source vision-language model with strong document QA and chart-understanding performance.

€0.007
replicate · multimodal · vision-understanding

InternVL 2.5

Multimodal · Replicate

OpenGVLab InternVL 2.5 78B. Open-source vision-language model approaching GPT-4o on the MMMU, OCRBench and MathVista benchmarks.

€0.03
replicate · multimodal · vision-understanding

LayoutLMv3

Multimodal · Microsoft

Microsoft LayoutLMv3 multimodal document model. Unified text/image masking pretraining for form understanding, receipts and document QA.

€0.007
replicate · ocr · vision-understanding

Llama 3.2 90B Vision (multimodal)

Multimodal · Meta

Meta's flagship vision-language model. 90B parameters, image understanding + chat, strong VQA performance.

Free
meta · llama · multimodal

Llama 3.2 Vision 90B

Multimodal · Meta

Meta Llama 3.2 90B Vision. Largest open-weights Llama vision model. Strong visual reasoning, chart, OCR and document understanding.

€0.02
replicate · multimodal · vision-understanding

LLaVA-OneVision 72B

Multimodal · Replicate

LMMs-Lab LLaVA-OneVision 72B. Unified single-image, multi-image and video instruction-tuned VLM with task-transfer across modalities.

€0.02
replicate · multimodal · vision-understanding

Lotus-G

Multimodal · Replicate

Lotus generative depth model. Treats depth as a generation task using a diffusion model, producing higher-fidelity depth on textured surfaces.

€0.01
replicate · depth · vision-understanding

Marigold

Multimodal · Replicate

ETH Zurich Marigold. Diffusion-based monocular depth-estimation model fine-tuned from Stable Diffusion with strong fine-detail recovery.

€0.01
replicate · depth · vision-understanding

Marker PDF Extract

Multimodal · Replicate

Marker PDF-to-Markdown conversion pipeline. Combines layout, OCR and equation models to produce clean Markdown with preserved tables and formulas.

€0.008
replicate · ocr · vision-understanding

Mask2Former

Multimodal · Replicate

Meta Mask2Former universal image-segmentation transformer. Single architecture for panoptic, instance and semantic segmentation tasks.

€0.009
replicate · segmentation · vision-understanding

MediaPipe Pose

Multimodal · Google DeepMind

Google MediaPipe Pose. Lightweight on-device-friendly 33-keypoint 3D pose estimator with optional segmentation mask output.

€0.003
replicate · pose · vision-understanding

MiDaS v3.1

Multimodal · Replicate

Intel MiDaS v3.1 relative depth-estimation model. Robust zero-shot single-image depth across diverse domains and resolutions.

€0.004
replicate · depth · vision-understanding

MiniCPM-V 2.6

Multimodal · Replicate

OpenBMB MiniCPM-V 2.6. 8B vision-language model with strong single-image, multi-image and video understanding plus OCR capabilities.

€0.008
replicate · multimodal · vision-understanding

Mistral OCR

Multimodal · Mistral AI

Mistral OCR API. Document-understanding model with strong table and equation extraction, and structured JSON output.

€0.001
mistral · ocr · vision-understanding

Mistral Pixtral Large (124B)

Multimodal · Mistral AI

Mistral's 124B multimodal flagship. 123B decoder + 1B vision encoder, 128k context, up to 30 images per request.

Free
mistral · pixtral · multimodal

MMPose

Multimodal · Replicate

OpenMMLab MMPose toolbox. Wraps RTMPose, HRNet, HigherHRNet and many other pose models behind a unified inference API.

€0.006
replicate · pose · vision-understanding

olmOCR

Multimodal · Replicate

Allen AI olmOCR. Open-source 7B vision-language model fine-tuned for high-fidelity document parsing including math, code and tables.

€0.01
replicate · ocr · vision-understanding

OpenPose

Multimodal · Replicate

CMU OpenPose multi-person 2D pose estimator. Real-time keypoint detection for body, hand, face and foot using Part Affinity Fields.

€0.005
replicate · pose · vision-understanding

PaddleOCR v3

Multimodal · Replicate

Baidu PaddleOCR v3 PP-OCR pipeline. Lightweight detector plus recognizer optimized for production use with 80+ language support.

€0.003
replicate · ocr · vision-understanding

Phi-3.5 Vision

Multimodal · Microsoft

Microsoft Phi-3.5 Vision Instruct. Small (4.2B) multimodal model with strong document, OCR and multi-image reasoning at low cost.

€0.005
replicate · multimodal · vision-understanding

Qwen2-VL-72B Instruct

Multimodal · Alibaba / Qwen

Alibaba's 72B vision-language model with M-RoPE and dynamic resolution. Strong document and video understanding.

Free
qwen · alibaba · multimodal

Reka Core

Multimodal · Custom

Reka's frontier multimodal model supporting text, image, video and audio inputs.

Free
reka · multimodal · video-understanding

Reka Edge

Multimodal · Custom

Reka's small on-device-friendly multimodal model. ~7B parameters, 16k context.

Free
reka · multimodal · edge

Reka Flash

Multimodal · Custom

Reka's 21B dense multimodal model balancing speed and quality. Up to 128k context.

Free
reka · multimodal · cost-efficient

SAM HQ

Multimodal · Replicate

ETH Zurich SAM-HQ. High-quality mask refinement on top of SAM. Sharper edges and finer structure than the original Segment Anything model.

€0.01
replicate · segmentation · vision-understanding

SegFormer B5

Multimodal · Replicate

NVIDIA SegFormer-B5 semantic segmentation. Hierarchical transformer encoder with lightweight MLP decoder, strong ADE20K and Cityscapes results.

€0.007
replicate · segmentation · vision-understanding

TrOCR Large

Multimodal · Microsoft

Microsoft TrOCR large transformer-based OCR. End-to-end visual encoder plus text decoder, trained on synthetic and printed real-world data.

€0.004
replicate · ocr · vision-understanding

ViTPose

Multimodal · Replicate

ViTPose plain-vision-transformer pose estimator. State-of-the-art keypoint accuracy on MS-COCO with a minimal architecture.

€0.006
replicate · pose · vision-understanding

Yi-VL 34B

Multimodal · 01.AI

01.AI Yi-VL 34B vision-language model. Bilingual (CN/EN) image understanding, strong CMMMU and MMMU performance among open-weights VLMs.

€0.02
replicate · multimodal · vision-understanding

ZoeDepth

Multimodal · Replicate

Intel ZoeDepth metric depth-estimation model. Combines relative-depth pretraining with metric fine-tuning for absolute distance in real units.

€0.005
replicate · depth · vision-understanding

Top multimodal model picks

Hand-picked across four common criteria and resolved against the live catalog, so the picks track price and performance changes.

Best overall
Claude Opus 4.7

Anthropic's April 2026 flagship. 87.6% on SWE-bench Verified, 3x higher image resolution, output self-verification, vision + reasoning.

Cheapest
GPT-5.4 Mini

OpenAI's efficient mid-tier model. 2x faster than its predecessor, 400k context, approaches GPT-5.4 quality on SWE-Bench Pro at a fraction of the cost.

Longest context
Gemini 3.1 Pro

Google DeepMind's February 2026 flagship. 2M-token context, native multimodal (text/image/audio/video), Deep Think reasoning.

Fastest
Claude Opus 4.7

Anthropic's April 2026 flagship. 87.6% on SWE-bench Verified, 3x higher image resolution, output self-verification, vision + reasoning.


Pricing is per token, as with standard LLMs, with one twist: each image is billed as a fixed number of tokens, typically 250-1,500 depending on resolution and detail mode. A standard 1024×1024 image costs roughly the same as a 1,000-word text input. High-detail mode, which preserves fine text and small UI elements, costs 2-4× more. Budget accordingly: a workload that processes 10,000 receipt scans per day can easily reach €10-€50 per day at flagship rates.
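As a rough illustration of the budgeting arithmetic above, the sketch below multiplies images per day by tokens per image and a detail multiplier. The 500-token figure, the 3× multiplier and the rate of €3 per million input tokens are assumptions chosen for illustration, not prices from this catalog; `ImagePricing` and `daily_image_cost_eur` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImagePricing:
    """Hypothetical pricing inputs; replace them with your provider's real numbers."""
    tokens_per_image: int                 # somewhere in the 250-1,500 range quoted above
    detail_multiplier: float              # 1.0 for standard, 2-4 for high-detail mode
    eur_per_million_input_tokens: float   # assumed rate, not taken from any price list


def daily_image_cost_eur(images_per_day: int, pricing: ImagePricing) -> float:
    """Estimate the daily input cost of the images alone (prompt and output tokens excluded)."""
    tokens = images_per_day * pricing.tokens_per_image * pricing.detail_multiplier
    return tokens / 1_000_000 * pricing.eur_per_million_input_tokens


# Assumed values for a receipt-scanning workload.
standard = ImagePricing(tokens_per_image=500, detail_multiplier=1.0,
                        eur_per_million_input_tokens=3.0)
high_detail = ImagePricing(tokens_per_image=500, detail_multiplier=3.0,
                           eur_per_million_input_tokens=3.0)

print(f"standard:    EUR {daily_image_cost_eur(10_000, standard):.2f}/day")
print(f"high-detail: EUR {daily_image_cost_eur(10_000, high_detail):.2f}/day")
```

With these assumed inputs the two estimates come to €15 and €45 per day, inside the range quoted above. Text prompt and output tokens are billed on top and are not included here.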

The trade-off is between OCR accuracy, reasoning quality, and cost. The flagships (GPT-5 Vision, Claude 4.6, Gemini 2.5) read complex layouts and reason about chart content very reliably. Specialized OCR-first models (Qwen 2.5 VL, Pixtral, InternVL) sometimes beat them at pure text extraction for a fraction of the cost. For pure document-to-JSON pipelines, a specialized model with a strict JSON schema almost always wins. For document-plus-reasoning workloads ('extract this invoice AND tell me whether the VAT adds up'), the flagships win.
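To make the "strict JSON schema" idea concrete, here is a minimal sketch that rejects any model output deviating from a fixed set of fields, then runs the VAT check as plain arithmetic. The field names, the `parse_invoice` and `vat_is_consistent` helpers, and the sample string are all hypothetical; the sample stands in for a model response, and no API call is made.

```python
import json
from decimal import Decimal

# Hypothetical invoice schema: field name -> required JSON type.
# Amounts are decimal strings so that no float rounding creeps in.
INVOICE_FIELDS = {
    "invoice_number": str,
    "net_amount": str,
    "vat_rate": str,
    "vat_amount": str,
    "total": str,
}


def parse_invoice(raw: str) -> dict:
    """Parse model output and reject anything that deviates from the schema."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if set(data) != set(INVOICE_FIELDS):
        diff = sorted(set(data) ^ set(INVOICE_FIELDS))
        raise ValueError(f"unexpected or missing fields: {diff}")
    for name, expected in INVOICE_FIELDS.items():
        if not isinstance(data[name], expected):
            raise ValueError(f"{name} must be {expected.__name__}")
    return data


def vat_is_consistent(invoice: dict, tolerance: Decimal = Decimal("0.01")) -> bool:
    """Check net * rate == VAT and net + VAT == total, within a rounding tolerance."""
    net = Decimal(invoice["net_amount"])
    rate = Decimal(invoice["vat_rate"])
    vat = Decimal(invoice["vat_amount"])
    total = Decimal(invoice["total"])
    return abs(net * rate - vat) <= tolerance and abs(net + vat - total) <= tolerance


# Stand-in for a model response (made-up values).
sample = ('{"invoice_number": "INV-001", "net_amount": "100.00", '
          '"vat_rate": "0.23", "vat_amount": "23.00", "total": "123.00"}')
invoice = parse_invoice(sample)
print(vat_is_consistent(invoice))
```

Doing the arithmetic in code after extraction is one option for the simple "does the VAT add up" case; questions that need judgment about the document's content are where the paragraph above suggests a flagship model.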

Watch out for resolution limits: most models downscale very large images before processing, which can destroy fine text in screenshots and dense documents. Pre-process to preserve legibility: split tall documents into single-page images and upscale low-resolution scans before sending them. Also watch for hallucinations in hard-to-read regions; multimodal models tend to confidently invent text where the source is illegible.
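The pre-processing step can be sketched as pure geometry, independent of any imaging library. The helpers below are hypothetical, and the page height, overlap and minimum short side are assumed thresholds to be tuned per model. The boxes use the (left, upper, right, lower) order that Pillow's `Image.crop` accepts, should you choose to apply them with Pillow.

```python
def page_boxes(width: int, height: int, max_page_height: int, overlap: int = 0):
    """Split a tall image into (left, top, right, bottom) crop boxes, top to bottom.

    A small overlap keeps a line of text that straddles a cut readable
    in at least one of the two neighbouring pages.
    """
    if max_page_height <= overlap:
        raise ValueError("max_page_height must exceed overlap")
    boxes = []
    top = 0
    step = max_page_height - overlap
    while top < height:
        bottom = min(top + max_page_height, height)
        boxes.append((0, top, width, bottom))
        if bottom == height:
            break
        top += step
    return boxes


def upscale_factor(width: int, height: int, min_short_side: int) -> float:
    """Return the scale needed to bring the short side up to min_short_side (never below 1.0)."""
    short = min(width, height)
    if short <= 0:
        raise ValueError("image dimensions must be positive")
    return max(1.0, min_short_side / short)


# A 1000x2500 screenshot cut into pages of at most 1000 px with 100 px overlap.
for box in page_boxes(1000, 2500, max_page_height=1000, overlap=100):
    print(box)

# A 600x800 scan scaled so that its short side reaches 1200 px.
print(upscale_factor(600, 800, min_short_side=1200))
```

Upscaling does not recover detail that the scan never captured; it only helps keep small glyphs above the size at which a model's own downscaling would blur them.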

The top picks above cover the most accurate flagship, the cheapest workhorse, the model with the longest context, and the fastest streaming option.
