Multimodal Models
Models that combine text, vision, and other modalities
Multimodal models for vision, OCR, and document understanding
Multimodal models accept text plus images (sometimes plus audio or video) and produce text output. Reach for one when your input contains images and your output is structured information: extract an invoice, describe a chart, transcribe a handwritten note, answer questions about a UI screenshot.
57 models available
Claude Opus 4.7
Anthropic's April 2026 flagship. 87.6% on SWE-bench Verified, 3x higher image resolution, output self-verification, vision + reasoning.
Claude Sonnet 4.6
Anthropic's balanced mid-tier model from February 2026. Best price/performance for production workloads: 5x cheaper than Opus, near-flagship quality.
Depth Anything v2
Monocular depth-estimation model trained on 595k labeled and 62M unlabeled images. Strong zero-shot generalization in indoor and outdoor scenes.
Gemini 3 Flash
Google's April 2026 fast multimodal model. Combines Gemini 3 Pro's reasoning with Flash-tier latency and price. Default model in the Gemini app.
Gemini 3.1 Pro
Google DeepMind's February 2026 flagship. 2M-token context, native multimodal (text/image/audio/video), Deep Think reasoning.
GPT-5.4
OpenAI's unified flagship combining GPT and o-series reasoning into one model. 1M context, multimodal, top SWE-Bench Pro and OSWorld scores.
GPT-5.4 Mini
OpenAI's efficient mid-tier model. 2x faster than its predecessor, 400k context, approaches GPT-5.4 quality on SWE-Bench Pro at a fraction of the cost.
Grok 4.3
xAI's May 2026 flagship. 1M context, vision, always-on reasoning, real-time X/web retrieval via DeepSearch.
SAM 2 (Segment Anything 2)
Meta Segment Anything 2. Promptable segmentation across images and video with temporal memory. Zero-shot, point/box/mask prompts, fast on a single H100.
Claude Haiku 4.5
Anthropic's fastest and cheapest 4.x model. Strong vision and tool use at ultra-low latency, ideal for high-concurrency workloads.
CogVLM2 19B
Tsinghua CogVLM2 19B with Llama-3 8B base plus 11B vision expert. Strong document understanding and visual reasoning, 8k context.
DeepSeek-VL 7B
DeepSeek-VL 7B chat model. Vision-language model with hybrid vision encoder and strong real-world visual question answering performance.
Detectron2
Meta Detectron2 object-detection and segmentation toolkit. Mask R-CNN, Cascade R-CNN, panoptic FPN and many other model variants in one wrapper.
DINOv2
Meta DINOv2 self-supervised vision backbone. Pretrained features for classification, segmentation and depth without task-specific fine-tuning.
Donut Document
Naver CLOVA Donut OCR-free document-understanding transformer. End-to-end JSON extraction from forms, receipts and invoices without explicit OCR.
Dots OCR
Rednote Hilab Dots OCR. End-to-end document parsing model with layout, text and reading-order prediction in one transformer.
DWPose
DWPose whole-body 2D pose estimator. Two-stage knowledge-distilled model with strong accuracy on face, hands and body keypoints simultaneously.
EasyOCR
JaidedAI EasyOCR. Simple Python OCR wrapper supporting 80+ languages with deep-learning text detection and recognition.
Florence-2 Large
Microsoft Florence-2 Large. Unified prompt-based vision foundation model for captioning, detection, segmentation and OCR with a single 770M-param backbone.
Florence-2 Segmentation
Microsoft Florence-2 unified vision model with referring expression segmentation. Text-prompted region and mask generation in one model.
GLPN Depth
Global-Local Path Networks depth-estimation model. Combines hierarchical transformer encoder with selective feature fusion for sharp boundaries.
GOT-OCR 2.0
StepFun GOT-OCR 2.0. Unified end-to-end OCR-2.0 model handling text, formulas, charts, sheet music and geometric shapes in one architecture.
GPT-5.4 Nano
OpenAI's smallest and cheapest GPT-5.4 variant. Built for high-volume classification, extraction and coding subagents at edge-grade latency.
Grok 2 Vision
xAI's vision-capable Grok 2 snapshot. Image-in, text-out with strong multilingual instruction following.
Grok 4.1 Fast
xAI's cost-efficient high-throughput model. 2M context, optional reasoning, optimized for agentic loops and real-time apps.
Grounded-SAM
Grounding DINO plus SAM. Open-vocabulary text-prompted detection and segmentation in one pipeline for fully-automatic mask generation.
HRNet Pose
Microsoft HRNet high-resolution pose-estimation backbone. Parallel multi-resolution streams yield strong accuracy on COCO keypoint benchmarks.
Idefics3 8B
Hugging Face Idefics3 8B. Llama-3 based open-source vision-language model with strong document QA and chart-understanding performance.
InternVL 2.5
OpenGVLab InternVL 2.5 78B. Open-source vision-language model approaching GPT-4o on the MMMU, OCRBench and MathVista benchmarks.
LayoutLMv3
Microsoft LayoutLMv3 multimodal document model. Unified text/image masking pretraining for form understanding, receipts and document QA.
Llama 3.2 90B Vision (multimodal)
Meta's flagship vision-language model. 90B parameters, image understanding + chat, strong VQA performance.
Llama 3.2 Vision 90B
Meta Llama 3.2 90B Vision. Largest open-weights Llama vision model. Strong visual reasoning, chart, OCR and document understanding.
LLaVA-OneVision 72B
LMMs-Lab LLaVA-OneVision 72B. Unified single-image, multi-image and video instruction-tuned VLM with task-transfer across modalities.
Lotus-G
Lotus generative depth model. Treats depth as a generation task using a diffusion model, producing higher-fidelity depth on textured surfaces.
Marigold
ETH Zurich Marigold. Diffusion-based monocular depth-estimation model fine-tuned from Stable Diffusion with strong fine-detail recovery.
Marker PDF Extract
Marker PDF-to-Markdown conversion pipeline. Combines layout, OCR and equation models to produce clean Markdown with preserved tables and formulas.
Mask2Former
Meta Mask2Former universal image-segmentation transformer. Single architecture for panoptic, instance and semantic segmentation tasks.
MediaPipe Pose
Google MediaPipe Pose. Lightweight on-device-friendly 33-keypoint 3D pose estimator with optional segmentation mask output.
MiDaS v3.1
Intel MiDaS v3.1 relative depth-estimation model. Robust zero-shot single-image depth across diverse domains and resolutions.
MiniCPM-V 2.6
OpenBMB MiniCPM-V 2.6. 8B vision-language model with strong single-image, multi-image and video understanding plus OCR capabilities.
Mistral OCR
Mistral OCR API. Document-understanding model with strong table and equation extraction, and structured JSON output.
Mistral Pixtral Large (124B)
Mistral's 124B multimodal flagship. 123B decoder + 1B vision encoder, 128k context, up to 30 images per request.
MMPose
OpenMMLab MMPose toolbox. Wraps RTMPose, HRNet, HigherHRNet and many other pose models behind a unified inference API.
olmOCR
Allen AI olmOCR. Open-source 7B vision-language model fine-tuned for high-fidelity document parsing including math, code and tables.
OpenPose
CMU OpenPose multi-person 2D pose estimator. Real-time keypoint detection for body, hand, face and foot using Part Affinity Fields.
PaddleOCR v3
Baidu PaddleOCR v3 PP-OCR pipeline. Lightweight detector plus recognizer optimized for production use with 80+ language support.
Phi-3.5 Vision
Microsoft Phi-3.5 Vision Instruct. Small (4.2B) multimodal model with strong document, OCR and multi-image reasoning at low cost.
Qwen2-VL-72B Instruct
Alibaba's 72B vision-language model with M-RoPE and dynamic resolution. Strong document and video understanding.
Reka Core
Reka's frontier multimodal model supporting text, image, video and audio inputs.
Reka Edge
Reka's small on-device-friendly multimodal model. ~7B parameters, 16k context.
Reka Flash
Reka's 21B dense multimodal model balancing speed and quality. Up to 128k context.
SAM HQ
ETH Zurich SAM-HQ. High-quality mask refinement on top of SAM. Sharper edges and finer structure than the original Segment Anything model.
SegFormer B5
NVIDIA SegFormer-B5 semantic segmentation. Hierarchical transformer encoder with lightweight MLP decoder, strong ADE20K and Cityscapes results.
TrOCR Large
Microsoft TrOCR large transformer-based OCR. End-to-end visual encoder plus text decoder, trained on synthetic and printed real-world data.
ViTPose
ViTPose plain-vision-transformer pose estimator. State-of-the-art keypoint accuracy on MS-COCO with a minimal architecture.
Yi-VL 34B
01.AI Yi-VL 34B vision-language model. Bilingual (CN/EN) image understanding, strong CMMMU and MMMU performance among open-weights VLMs.
ZoeDepth
Intel ZoeDepth metric depth-estimation model. Combines relative-depth pretraining with metric fine-tuning for absolute distance in real units.
Top multimodal model picks
Hand-picked across four common criteria and resolved against the live catalog, so the picks track price and performance changes.
Claude Opus 4.7
Anthropic's April 2026 flagship. 87.6% on SWE-bench Verified, 3x higher image resolution, output self-verification, vision + reasoning.
GPT-5.4 Mini
OpenAI's efficient mid-tier model. 2x faster than its predecessor, 400k context, approaches GPT-5.4 quality on SWE-Bench Pro at a fraction of the cost.
Gemini 3.1 Pro
Google DeepMind's February 2026 flagship. 2M-token context, native multimodal (text/image/audio/video), Deep Think reasoning.
Claude Opus 4.7
Anthropic's April 2026 flagship. 87.6% on SWE-bench Verified, 3x higher image resolution, output self-verification, vision + reasoning.
Pricing is per-token, as with regular LLMs, with one twist: each image is billed as a block of tokens, typically 250-1,500 depending on resolution and detail mode. A standard 1024×1024 image costs roughly the same as a 1,000-word text input. High-detail mode (which preserves fine text and small UI elements) costs 2-4× more. Budget accordingly: a workload that processes 10,000 receipt scans per day can easily run €10-€50 per day at flagship rates.
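The receipt-scan budget can be reproduced with simple arithmetic. The sketch below is illustrative only: the function name is hypothetical, the 1,000 tokens per image sits inside the range quoted above, and the €1 and €5 per-million-token rates are assumed placeholders rather than catalog prices. Output tokens are ignored.

```python
def daily_image_cost(images_per_day, tokens_per_image, price_per_million_tokens):
    """Estimate the daily input cost of image tokens only (output tokens excluded)."""
    total_tokens = images_per_day * tokens_per_image
    return total_tokens / 1_000_000 * price_per_million_tokens


# Assumed workload: 10,000 receipt scans per day at about 1,000 tokens each.
# The two rates are illustrative placeholders, not real prices.
low = daily_image_cost(10_000, 1_000, 1.0)
high = daily_image_cost(10_000, 1_000, 5.0)
print(f"EUR {low:.2f} to EUR {high:.2f} per day")
```

Under these assumptions the workload sends 10 million image tokens per day. Switching to high-detail mode would multiply `tokens_per_image` by the 2-4× factor mentioned above.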
The trade-off is between OCR accuracy, reasoning quality, and cost. Flagships (GPT-5 Vision, Claude 4.6, Gemini 2.5) read complex layouts and reason over chart contents reliably. Specialized OCR-first models (Qwen 2.5 VL, Pixtral, InternVL) sometimes outperform them on pure text extraction at a fraction of the cost. For pure document-to-JSON pipelines, a specialized model with a strict JSON schema usually wins. For document-and-reasoning workloads ("extract this invoice AND tell me if the tax math is right"), flagships win.
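Whichever model produces the JSON, it can help to enforce the schema on your side and cross-check the arithmetic deterministically. The following is a minimal sketch using only the Python standard library. The field names, the choice to carry amounts as strings, and the one-cent tolerance are all assumptions made for illustration; they are not a format defined by any model or by the catalog.

```python
import json
from decimal import Decimal

# Hypothetical schema: every field is required, and amounts arrive as strings
# so they can be parsed as exact decimals instead of floats.
INVOICE_FIELDS = {
    "invoice_number": str,
    "net": str,
    "tax_rate": str,
    "tax": str,
    "total": str,
}


def parse_invoice(raw):
    """Parse model output and reject anything that is not exactly the expected shape."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if set(data) != set(INVOICE_FIELDS):
        odd = sorted(set(data) ^ set(INVOICE_FIELDS))
        raise ValueError(f"unexpected or missing fields: {odd}")
    for name, expected_type in INVOICE_FIELDS.items():
        if not isinstance(data[name], expected_type):
            raise ValueError(f"field {name!r} has the wrong type")
    return data


def tax_math_ok(invoice, tolerance=Decimal("0.01")):
    """Check that net * tax_rate matches tax and net + tax matches total."""
    net, rate, tax, total = (
        Decimal(invoice[key]) for key in ("net", "tax_rate", "tax", "total")
    )
    return abs(net * rate - tax) <= tolerance and abs(net + tax - total) <= tolerance


sample = (
    '{"invoice_number": "INV-001", "net": "100.00", '
    '"tax_rate": "0.19", "tax": "19.00", "total": "119.00"}'
)
invoice = parse_invoice(sample)
print(tax_math_ok(invoice))
```

A local check like this does not replace a flagship model's reasoning over the document, but it catches malformed output and simple arithmetic mismatches before they reach downstream systems.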
Watch out for resolution limits: most models downscale very large images before processing, which can destroy fine text in screenshots and dense documents. To preserve readability, pre-process before sending: slice tall documents into single-page images and upscale low-resolution scans. Also watch for hallucinations in hard-to-read regions; multimodal models tend to confidently invent text where the source is illegible.
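Slicing can be sketched as a small helper that computes crop boxes for a tall image. This is an illustrative approach, not a prescribed one: it cuts at a fixed height rather than at detected page boundaries, and the function name, the `max_height` limit, and the overlap (added so a text line cut at a boundary appears whole in one slice) are assumptions. The boxes use the `(left, upper, right, lower)` order that Pillow's `Image.crop` accepts.

```python
def slice_boxes(width, height, max_height, overlap=0):
    """Return (left, upper, right, lower) boxes covering a tall image top to bottom."""
    if max_height <= 0 or not 0 <= overlap < max_height:
        raise ValueError("need max_height > 0 and 0 <= overlap < max_height")
    boxes = []
    top = 0
    while True:
        bottom = min(top + max_height, height)
        boxes.append((0, top, width, bottom))
        if bottom >= height:
            break
        # Step back by the overlap so adjacent slices share a band of pixels.
        top = bottom - overlap
    return boxes


# Example: a 1000x2500 scan, sliced to at most 1000 px tall with 100 px overlap.
for box in slice_boxes(1000, 2500, max_height=1000, overlap=100):
    print(box)
```

Each box can then be cropped and sent as its own image, keeping every slice under whatever size the target model processes without downscaling.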
The top picks above cover the most accurate flagship, the cheapest workhorse, the model with the highest supported image resolution, and the fastest streaming option.