Multimodal Models
Models that combine text, vision, and other modalities
Modelos multimodais para vision, OCR e document understanding
Os modelos multimodais aceitam texto mais imagens (por vezes mais áudio ou vídeo) e produzem output em texto. Recorre-se a um quando o input contém imagens e o output é informação estruturada: extrair uma fatura, descrever um gráfico, transcrever uma nota manuscrita, responder a perguntas sobre um screenshot de UI.
57 models available
Claude Opus 4.7
Anthropic's April 2026 flagship. 87.6% on SWE-bench Verified, 3x higher image resolution, output self-verification, vision + reasoning.
Claude Sonnet 4.6
Anthropic's balanced mid-tier model from February 2026. Best price/performance for production workloads: 5x cheaper than Opus, near-flagship quality.
Depth Anything v2
Monocular depth-estimation model trained on 595k labeled and 62M unlabeled images. Strong zero-shot generalization in indoor and outdoor scenes.
Gemini 3 Flash
Google's April 2026 fast multimodal model. Combines Gemini 3 Pro's reasoning with Flash-tier latency and price. Default model in the Gemini app.
Gemini 3.1 Pro
Google DeepMind's February 2026 flagship. 2M-token context, native multimodal (text/image/audio/video), Deep Think reasoning.
GPT-5.4
OpenAI's unified flagship combining GPT and o-series reasoning into one model. 1M context, multimodal, top SWE-Bench Pro and OSWorld scores.
GPT-5.4 Mini
OpenAI's efficient mid-tier model. 2x faster than its predecessor, 400k context, approaches GPT-5.4 quality on SWE-Bench Pro at a fraction of the cost.
Grok 4.3
xAI's May 2026 flagship. 1M context, vision, always-on reasoning, real-time X/web retrieval via DeepSearch.
SAM 2 (Segment Anything 2)
Meta Segment Anything 2. Promptable segmentation across images and video with temporal memory. Zero-shot, point/box/mask prompts, fast on a single H100.
Claude Haiku 4.5
Anthropic's fastest and cheapest 4.x model. Strong vision and tool use at ultra-low latency, ideal for high-concurrency workloads.
CogVLM2 19B
Tsinghua CogVLM2 19B with Llama-3 8B base plus 11B vision expert. Strong document understanding and visual reasoning, 8k context.
DeepSeek-VL 7B
DeepSeek-VL 7B chat model. Vision-language model with hybrid vision encoder and strong real-world visual question answering performance.
Detectron2
Meta Detectron2 object-detection and segmentation toolkit. Mask R-CNN, Cascade R-CNN, panoptic FPN and many other model variants in one wrapper.
DINOv2
Meta DINOv2 self-supervised vision backbone. Pretrained features for classification, segmentation and depth without task-specific fine-tuning.
Donut Document
Naver CLOVA Donut OCR-free document-understanding transformer. End-to-end JSON extraction from forms, receipts and invoices without explicit OCR.
Dots OCR
Rednote Hilab Dots OCR. End-to-end document parsing model with layout, text and reading-order prediction in one transformer.
DWPose
DWPose whole-body 2D pose estimator. Two-stage knowledge-distilled model with strong accuracy on face, hands and body keypoints simultaneously.
EasyOCR
JaidedAI EasyOCR. Simple Python OCR wrapper supporting 80+ languages with deep-learning text detection and recognition.
Florence-2 Large
Microsoft Florence-2 Large. Unified prompt-based vision foundation model for captioning, detection, segmentation and OCR with a single 770M-param backbone.
Florence-2 Segmentation
Microsoft Florence-2 unified vision model with referring expression segmentation. Text-prompted region and mask generation in one model.
GLPN Depth
Global-Local Path Networks depth-estimation model. Combines hierarchical transformer encoder with selective feature fusion for sharp boundaries.
GOT-OCR 2.0
StepFun GOT-OCR 2.0. Unified end-to-end OCR-2.0 model handling text, formulas, charts, sheet music and geometric shapes in one architecture.
GPT-5.4 Nano
OpenAI's smallest and cheapest GPT-5.4 variant. Built for high-volume classification, extraction and coding subagents at edge-grade latency.
Grok 2 Vision
xAI's vision-capable Grok 2 snapshot. Image-in, text-out with strong multilingual instruction following.
Grok 4.1 Fast
xAI's cost-efficient high-throughput model. 2M context, optional reasoning, optimized for agentic loops and real-time apps.
Grounded-SAM
Grounding DINO plus SAM. Open-vocabulary text-prompted detection and segmentation in one pipeline for fully-automatic mask generation.
HRNet Pose
Microsoft HRNet high-resolution pose-estimation backbone. Parallel multi-resolution streams yield strong accuracy on COCO keypoint benchmarks.
Idefics3 8B
Hugging Face Idefics3 8B. Llama-3 based open-source vision-language model with strong document QA and chart-understanding performance.
InternVL 2.5
OpenGVLab InternVL 2.5 78B. Open-source vision-language model approaching GPT-4o on MMMU, OCRBench and Math-Vista benchmarks.
LayoutLMv3
Microsoft LayoutLMv3 multimodal document model. Unified text/image masking pretraining for form understanding, receipts and document QA.
Llama 3.2 90B Vision (multimodal)
Meta's flagship vision-language model. 90B parameters, image understanding + chat, strong VQA performance.
Llama 3.2 Vision 90B
Meta Llama 3.2 90B Vision. Largest open-weights Llama vision model. Strong visual reasoning, chart, OCR and document understanding.
LLaVA-OneVision 72B
LMMs-Lab LLaVA-OneVision 72B. Unified single-image, multi-image and video instruction-tuned VLM with task-transfer across modalities.
Lotus-G
Lotus generative depth model. Treats depth as a generation task using a diffusion model, producing higher-fidelity depth on textured surfaces.
Marigold
ETH Zurich Marigold. Diffusion-based monocular depth-estimation model fine-tuned from Stable Diffusion with strong fine-detail recovery.
Marker PDF Extract
Marker PDF-to-Markdown conversion pipeline. Combines layout, OCR and equation models to produce clean Markdown with preserved tables and formulas.
Mask2Former
Meta Mask2Former universal image-segmentation transformer. Single architecture for panoptic, instance and semantic segmentation tasks.
MediaPipe Pose
Google MediaPipe Pose. Lightweight on-device-friendly 33-keypoint 3D pose estimator with optional segmentation mask output.
MiDaS v3.1
Intel MiDaS v3.1 relative depth-estimation model. Robust zero-shot single-image depth across diverse domains and resolutions.
MiniCPM-V 2.6
OpenBMB MiniCPM-V 2.6. 8B vision-language model with strong single-image, multi-image and video understanding plus OCR capabilities.
Mistral OCR
Mistral OCR API. Document-understanding model with strong table and equation extraction, and structured JSON output.
Mistral Pixtral Large (124B)
Mistral's 124B multimodal flagship. 123B decoder + 1B vision encoder, 128k ctx, up to 30 images per request.
MMPose
OpenMMLab MMPose toolbox. Wraps RTMPose, HRNet, HigherHRNet and many other pose models behind a unified inference API.
olmOCR
Allen AI olmOCR. Open-source 7B vision-language model fine-tuned for high-fidelity document parsing including math, code and tables.
OpenPose
CMU OpenPose multi-person 2D pose estimator. Real-time keypoint detection for body, hand, face and foot using Part Affinity Fields.
PaddleOCR v3
Baidu PaddleOCR v3 PP-OCR pipeline. Lightweight detector plus recognizer optimized for production use with 80+ language support.
Phi-3.5 Vision
Microsoft Phi-3.5 Vision Instruct. Small (4.2B) multimodal model with strong document, OCR and multi-image reasoning at low cost.
Qwen2-VL-72B Instruct
Alibaba's 72B vision-language model with M-RoPE and dynamic resolution. Strong document and video understanding.
Reka Core
Reka's frontier multimodal model supporting text, image, video and audio inputs.
Reka Edge
Reka's small on-device-friendly multimodal model. ~7B parameters, 16k context.
Reka Flash
Reka's 21B dense multimodal model balancing speed and quality. Up to 128k context.
SAM HQ
ETH Zurich SAM-HQ. High-quality mask refinement on top of SAM. Sharper edges and finer structure than the original Segment Anything model.
Segformer B5
NVIDIA SegFormer-B5 semantic segmentation. Hierarchical transformer encoder with lightweight MLP decoder, strong ADE20k and Cityscapes results.
TrOCR Large
Microsoft TrOCR large transformer-based OCR. End-to-end visual encoder plus text decoder, trained on synthetic and printed real-world data.
ViTPose
ViTPose plain-vision-transformer pose estimator. State-of-the-art keypoint accuracy on MS-COCO with a minimal architecture.
Yi-VL 34B
01.AI Yi-VL 34B vision-language model. Bilingual (CN/EN) image understanding, strong CMMMU and MMMU performance among open-weights VLMs.
ZoeDepth
Intel ZoeDepth metric depth-estimation model. Combines relative-depth pretraining with metric fine-tuning for absolute distance in real units.
Top multimodal models picks
Hand-picked across four common criteria — resolved against the live catalog so the picks track price and performance changes.
Anthropic's April 2026 flagship. 87.6% on SWE-bench Verified, 3x higher image resolution, output self-verification, vision + reasoning.
Learn moreOpenAI's efficient mid-tier model. 2x faster than its predecessor, 400k context, approaches GPT-5.4 quality on SWE-Bench Pro at a fraction of the cost.
Learn moreGoogle DeepMind's February 2026 flagship. 2M-token context, native multimodal (text/image/audio/video), Deep Think reasoning.
Learn moreAnthropic's April 2026 flagship. 87.6% on SWE-bench Verified, 3x higher image resolution, output self-verification, vision + reasoning.
Learn moreO pricing é por token como nos LLMs normais, com uma particularidade: cada imagem é contada como um número fixo de tokens — tipicamente 250-1.500 tokens conforme a resolução e o modo de detalhe. Uma imagem padrão de 1024×1024 custa aproximadamente o mesmo que um input de texto de 1.000 palavras. O modo high-detail (preservando texto fino e elementos pequenos de UI) custa 2-4× mais. Planeie orçamentos em conformidade — uma carga que processa 10.000 scans de recibos por dia pode facilmente atingir €10-€50 por dia a tarifas flagship.
O compromisso é exatidão de OCR, qualidade de raciocínio e custo. Os flagships (GPT-5 Vision, Claude 4.6, Gemini 2.5) leem layouts complexos e raciocinam sobre o conteúdo de gráficos com grande fiabilidade. Modelos especializados OCR-first (Qwen 2.5 VL, Pixtral, InternVL) por vezes superam na extração pura de texto a uma fração do custo. Para pipelines puras de documento-para-JSON, um modelo especializado com um JSON schema estrito ganha quase sempre. Para cargas de documento-e-raciocínio ('extrai esta fatura E diz-me se as contas do IVA estão certas'), ganham os flagships.
Atenção aos limites de resolução: a maior parte dos modelos reduz imagens muito grandes antes de processar, o que pode destruir texto fino em screenshots e em documentos densos. Pré-processe — corte documentos altos em imagens single-page, faça upscale de scans de baixa resolução antes de enviar — para preservar a legibilidade. Atenção também a alucinações em regiões difíceis de ler; os modelos multimodais tendem a inventar com confiança texto onde a fonte é ilegível.
As top picks acima cobrem o flagship mais exato, o cavalo de batalha mais barato, o que suporta resolução mais alta e a opção de streaming mais rápida.
Popular use cases
Common patterns built with multimodal models on Railwail.
Frequently asked questions
Start Building with AI
Access all models through a single API. Get free credits when you sign up — no credit card required.