Multimodal Models
Models that combine text, vision, and other modalities
Modelli multimodali per vision, OCR e document understanding
I modelli multimodali accettano testo più immagini (a volte più audio o video) e producono output testuale. Si ricorre a uno di questi quando l'input contiene immagini e l'output è informazione strutturata: estrarre una fattura, descrivere un grafico, trascrivere una nota scritta a mano, rispondere a domande su uno screenshot di UI.
57 models available
Claude Opus 4.7
Anthropic's April 2026 flagship. 87.6% on SWE-bench Verified, 3x higher image resolution, output self-verification, vision + reasoning.
Claude Sonnet 4.6
Anthropic's balanced mid-tier model from February 2026. Best price/performance for production workloads: 5x cheaper than Opus, near-flagship quality.
Depth Anything v2
Monocular depth-estimation model trained on 595k labeled and 62M unlabeled images. Strong zero-shot generalization in indoor and outdoor scenes.
Gemini 3 Flash
Google's April 2026 fast multimodal model. Combines Gemini 3 Pro's reasoning with Flash-tier latency and price. Default model in the Gemini app.
Gemini 3.1 Pro
Google DeepMind's February 2026 flagship. 2M-token context, native multimodal (text/image/audio/video), Deep Think reasoning.
GPT-5.4
OpenAI's unified flagship combining GPT and o-series reasoning into one model. 1M context, multimodal, top SWE-Bench Pro and OSWorld scores.
GPT-5.4 Mini
OpenAI's efficient mid-tier model. 2x faster than its predecessor, 400k context, approaches GPT-5.4 quality on SWE-Bench Pro at a fraction of the cost.
Grok 4.3
xAI's May 2026 flagship. 1M context, vision, always-on reasoning, real-time X/web retrieval via DeepSearch.
SAM 2 (Segment Anything 2)
Meta Segment Anything 2. Promptable segmentation across images and video with temporal memory. Zero-shot, point/box/mask prompts, fast on a single H100.
Claude Haiku 4.5
Anthropic's fastest and cheapest 4.x model. Strong vision and tool use at ultra-low latency, ideal for high-concurrency workloads.
CogVLM2 19B
Tsinghua CogVLM2 19B with Llama-3 8B base plus 11B vision expert. Strong document understanding and visual reasoning, 8k context.
DeepSeek-VL 7B
DeepSeek-VL 7B chat model. Vision-language model with hybrid vision encoder and strong real-world visual question answering performance.
Detectron2
Meta Detectron2 object-detection and segmentation toolkit. Mask R-CNN, Cascade R-CNN, panoptic FPN and many other model variants in one wrapper.
DINOv2
Meta DINOv2 self-supervised vision backbone. Pretrained features for classification, segmentation and depth without task-specific fine-tuning.
Donut Document
Naver CLOVA Donut OCR-free document-understanding transformer. End-to-end JSON extraction from forms, receipts and invoices without explicit OCR.
Dots OCR
Rednote Hilab Dots OCR. End-to-end document parsing model with layout, text and reading-order prediction in one transformer.
DWPose
DWPose whole-body 2D pose estimator. Two-stage knowledge-distilled model with strong accuracy on face, hands and body keypoints simultaneously.
EasyOCR
JaidedAI EasyOCR. Simple Python OCR wrapper supporting 80+ languages with deep-learning text detection and recognition.
Florence-2 Large
Microsoft Florence-2 Large. Unified prompt-based vision foundation model for captioning, detection, segmentation and OCR with a single 770M-param backbone.
Florence-2 Segmentation
Microsoft Florence-2 unified vision model with referring expression segmentation. Text-prompted region and mask generation in one model.
GLPN Depth
Global-Local Path Networks depth-estimation model. Combines hierarchical transformer encoder with selective feature fusion for sharp boundaries.
GOT-OCR 2.0
StepFun GOT-OCR 2.0. Unified end-to-end OCR-2.0 model handling text, formulas, charts, sheet music and geometric shapes in one architecture.
GPT-5.4 Nano
OpenAI's smallest and cheapest GPT-5.4 variant. Built for high-volume classification, extraction and coding subagents at edge-grade latency.
Grok 2 Vision
xAI's vision-capable Grok 2 snapshot. Image-in, text-out with strong multilingual instruction following.
Grok 4.1 Fast
xAI's cost-efficient high-throughput model. 2M context, optional reasoning, optimized for agentic loops and real-time apps.
Grounded-SAM
Grounding DINO plus SAM. Open-vocabulary text-prompted detection and segmentation in one pipeline for fully-automatic mask generation.
HRNet Pose
Microsoft HRNet high-resolution pose-estimation backbone. Parallel multi-resolution streams yield strong accuracy on COCO keypoint benchmarks.
Idefics3 8B
Hugging Face Idefics3 8B. Llama-3 based open-source vision-language model with strong document QA and chart-understanding performance.
InternVL 2.5
OpenGVLab InternVL 2.5 78B. Open-source vision-language model approaching GPT-4o on MMMU, OCRBench and Math-Vista benchmarks.
LayoutLMv3
Microsoft LayoutLMv3 multimodal document model. Unified text/image masking pretraining for form understanding, receipts and document QA.
Llama 3.2 90B Vision (multimodal)
Meta's flagship vision-language model. 90B parameters, image understanding + chat, strong VQA performance.
Llama 3.2 Vision 90B
Meta Llama 3.2 90B Vision. Largest open-weights Llama vision model. Strong visual reasoning, chart, OCR and document understanding.
LLaVA-OneVision 72B
LMMs-Lab LLaVA-OneVision 72B. Unified single-image, multi-image and video instruction-tuned VLM with task-transfer across modalities.
Lotus-G
Lotus generative depth model. Treats depth as a generation task using a diffusion model, producing higher-fidelity depth on textured surfaces.
Marigold
ETH Zurich Marigold. Diffusion-based monocular depth-estimation model fine-tuned from Stable Diffusion with strong fine-detail recovery.
Marker PDF Extract
Marker PDF-to-Markdown conversion pipeline. Combines layout, OCR and equation models to produce clean Markdown with preserved tables and formulas.
Mask2Former
Meta Mask2Former universal image-segmentation transformer. Single architecture for panoptic, instance and semantic segmentation tasks.
MediaPipe Pose
Google MediaPipe Pose. Lightweight on-device-friendly 33-keypoint 3D pose estimator with optional segmentation mask output.
MiDaS v3.1
Intel MiDaS v3.1 relative depth-estimation model. Robust zero-shot single-image depth across diverse domains and resolutions.
MiniCPM-V 2.6
OpenBMB MiniCPM-V 2.6. 8B vision-language model with strong single-image, multi-image and video understanding plus OCR capabilities.
Mistral OCR
Mistral OCR API. Document-understanding model with strong table and equation extraction, and structured JSON output.
Mistral Pixtral Large (124B)
Mistral's 124B multimodal flagship. 123B decoder + 1B vision encoder, 128k ctx, up to 30 images per request.
MMPose
OpenMMLab MMPose toolbox. Wraps RTMPose, HRNet, HigherHRNet and many other pose models behind a unified inference API.
olmOCR
Allen AI olmOCR. Open-source 7B vision-language model fine-tuned for high-fidelity document parsing including math, code and tables.
OpenPose
CMU OpenPose multi-person 2D pose estimator. Real-time keypoint detection for body, hand, face and foot using Part Affinity Fields.
PaddleOCR v3
Baidu PaddleOCR v3 PP-OCR pipeline. Lightweight detector plus recognizer optimized for production use with 80+ language support.
Phi-3.5 Vision
Microsoft Phi-3.5 Vision Instruct. Small (4.2B) multimodal model with strong document, OCR and multi-image reasoning at low cost.
Qwen2-VL-72B Instruct
Alibaba's 72B vision-language model with M-RoPE and dynamic resolution. Strong document and video understanding.
Reka Core
Reka's frontier multimodal model supporting text, image, video and audio inputs.
Reka Edge
Reka's small on-device-friendly multimodal model. ~7B parameters, 16k context.
Reka Flash
Reka's 21B dense multimodal model balancing speed and quality. Up to 128k context.
SAM HQ
ETH Zurich SAM-HQ. High-quality mask refinement on top of SAM. Sharper edges and finer structure than the original Segment Anything model.
Segformer B5
NVIDIA SegFormer-B5 semantic segmentation. Hierarchical transformer encoder with lightweight MLP decoder, strong ADE20k and Cityscapes results.
TrOCR Large
Microsoft TrOCR large transformer-based OCR. End-to-end visual encoder plus text decoder, trained on synthetic and printed real-world data.
ViTPose
ViTPose plain-vision-transformer pose estimator. State-of-the-art keypoint accuracy on MS-COCO with a minimal architecture.
Yi-VL 34B
01.AI Yi-VL 34B vision-language model. Bilingual (CN/EN) image understanding, strong CMMMU and MMMU performance among open-weights VLMs.
ZoeDepth
Intel ZoeDepth metric depth-estimation model. Combines relative-depth pretraining with metric fine-tuning for absolute distance in real units.
Top multimodal models picks
Hand-picked across four common criteria — resolved against the live catalog so the picks track price and performance changes.
Anthropic's April 2026 flagship. 87.6% on SWE-bench Verified, 3x higher image resolution, output self-verification, vision + reasoning.
Learn moreOpenAI's efficient mid-tier model. 2x faster than its predecessor, 400k context, approaches GPT-5.4 quality on SWE-Bench Pro at a fraction of the cost.
Learn moreGoogle DeepMind's February 2026 flagship. 2M-token context, native multimodal (text/image/audio/video), Deep Think reasoning.
Learn moreAnthropic's April 2026 flagship. 87.6% on SWE-bench Verified, 3x higher image resolution, output self-verification, vision + reasoning.
Learn moreIl pricing è per token come negli LLM normali, con un dettaglio: ogni immagine viene conteggiata come un numero fisso di token — tipicamente 250-1.500 token a seconda della risoluzione e della modalità detail. Un'immagine 1024×1024 standard costa più o meno come un input di 1.000 parole di testo. La modalità ad alto dettaglio (preserva testo fine e piccoli elementi UI) costa 2-4× di più. Pianificate i budget di conseguenza — un carico di lavoro che processa 10.000 scansioni di ricevute al giorno può facilmente arrivare a €10-€50 al giorno a tariffe flagship.
Il compromesso è accuratezza OCR, qualità di ragionamento e costo. I flagship (GPT-5 Vision, Claude 4.6, Gemini 2.5) leggono layout complessi e ragionano sui contenuti dei grafici in modo molto affidabile. I modelli specializzati OCR-first (Qwen 2.5 VL, Pixtral, InternVL) a volte superano sull'estrazione di puro testo a una frazione del costo. Per pipeline pure documento-to-JSON, un modello specializzato con uno schema JSON stretto di solito vince. Per carichi documento-e-ragionamento ('estrai questa fattura E dimmi se i calcoli IVA sono corretti'), vincono i flagship.
Attenzione ai limiti di risoluzione: la maggior parte dei modelli ridimensiona le immagini molto grandi prima di processarle, e questo può distruggere il testo fine negli screenshot e nei documenti densi. Pre-processate — affettate i documenti alti in immagini single-page, fate upscale delle scansioni a bassa risoluzione prima di inviarle — per preservare la leggibilità . Attenzione anche alle allucinazioni sulle regioni difficili da leggere; i modelli multimodali tendono a inventare confidentemente testo dove la fonte è illeggibile.
Le top picks qui sopra coprono il flagship più accurato, il workhorse più economico, quello che supporta la risoluzione più alta e l'opzione di streaming più veloce.
Popular use cases
Common patterns built with multimodal models on Railwail.
Frequently asked questions
Start Building with AI
Access all models through a single API. Get free credits when you sign up — no credit card required.