Multimodal Models

BLIP

Salesforce BLIP. Vision-language model for image captioning and visual question answering. Given an image, it writes a short natural-language caption, or answers a question about the image when one is supplied. A widely used baseline for automatic captioning.

€1.00

replicate, blip, captioning

Claude 3.5 Sonnet (vision)

Anthropic Claude 3.5 Sonnet with image input. 200k context, strong on dense documents, tables, charts and handwriting. Reliable structured extraction from screenshots and scans.

anthropic, vision, multimodal

Claude Opus 4.7

Anthropic's April 2026 flagship. 87.6% on SWE-bench Verified, 3x higher image resolution, output self-verification, vision + reasoning.

anthropic, flagship, reasoning

Claude Sonnet 4.6

Anthropic's balanced mid-tier model from February 2026. Best price/performance for production workloads: 5x cheaper than Opus, near-flagship quality.

anthropic, balanced, production

CLIP Interrogator

pharmapsychotic's CLIP Interrogator. Takes an image and produces a Stable-Diffusion-style text prompt by combining BLIP captioning with CLIP to rank likely subjects, artists, mediums and styles. Commonly used to reverse-engineer a prompt from an existing picture.

€1.00

replicate, clip-interrogator, captioning
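
The ranking step described for CLIP Interrogator can be illustrated with a minimal sketch. It assumes the image and each candidate phrase have already been embedded into the same vector space; the three-dimensional vectors, the phrase list and the function names below are made up for illustration and are not the tool's actual code or data.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_phrases(image_vec, phrase_vecs, top_k=2):
    """Return the top_k phrases whose embeddings best match the image embedding."""
    ranked = sorted(
        phrase_vecs,
        key=lambda phrase: cosine(image_vec, phrase_vecs[phrase]),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(caption, image_vec, phrase_vecs, top_k=2):
    """Append the best-matching phrases to a caption to form a text prompt."""
    return ", ".join([caption] + rank_phrases(image_vec, phrase_vecs, top_k))


# Toy embeddings (hypothetical values, not real CLIP outputs).
image_vec = [0.9, 0.1, 0.2]
phrase_vecs = {
    "oil painting": [0.8, 0.2, 0.1],
    "photograph": [0.1, 0.9, 0.1],
    "impressionism": [0.7, 0.0, 0.4],
}

# The caption stands in for what a captioning model might return.
prompt = build_prompt("a boat on a lake", image_vec, phrase_vecs)
print(prompt)  # a boat on a lake, oil painting, impressionism
```

In a real pipeline the embeddings would come from a CLIP image and text encoder, and the candidate lists would be far longer.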

Depth Anything v2

Monocular depth-estimation model trained on 595k synthetic labeled images and more than 62M real unlabeled images. Strong zero-shot generalization in indoor and outdoor scenes.

€0.005

Gemini 1.5 Pro (vision)

Google Gemini 1.5 Pro with native multimodal input. Reads images, long PDFs, audio and video in up to a 2M-token context, useful for whole-document and long-video understanding.

google, gemini, vision

Gemini 3 Flash

Google's April 2026 fast multimodal model. Combines Gemini 3 Pro's reasoning with Flash-tier latency and price. Default model in the Gemini app.

google, deepmind, balanced

Gemini 3.1 Pro

Google DeepMind's February 2026 flagship. 2M-token context, native multimodal (text/image/audio/video), Deep Think reasoning.

google, deepmind, flagship

GPT-4o (vision)

OpenAI's GPT-4o with native image input. Handles text and images in a single context, 128k window, strong on chart reading, document QA, screenshots and visual reasoning.

openai, vision, multimodal

GPT-5.4

OpenAI's unified flagship combining GPT and o-series reasoning into one model. 1M context, multimodal, top SWE-Bench Pro and OSWorld scores.

openai, flagship, reasoning

GPT-5.4 Mini

OpenAI's efficient mid-tier model. 2x faster than its predecessor, 400k context, approaches GPT-5.4 quality on SWE-Bench Pro at a fraction of the cost.

openai, balanced, cost-efficient

Grok 4.3

Multimodal, xAI

xAI's May 2026 flagship. 1M context, vision, always-on reasoning, real-time X/web retrieval via DeepSearch.

xai, flagship, reasoning

SAM 2 (Segment Anything 2)

Multimodal, Meta

Meta Segment Anything 2. Promptable segmentation across images and video with temporal memory. Zero-shot, point/box/mask prompts, fast on a single H100.

replicate, segmentation, meta

BLIP Image Captioning Large

Multimodal, huggingface

Salesforce BLIP large checkpoint for image captioning, served through Hugging Face Inference. Given a photo it returns a short English caption. The large variant gives more accurate captions than the base model and is a common drop-in for alt-text and image indexing.

€1.00

huggingface, blip, captioning

Claude Haiku 4.5

New

Anthropic's fastest and cheapest 4.x model. Strong vision and tool use at ultra-low latency, ideal for high-concurrency workloads.

anthropic, cost-efficient, low-latency

CogVLM2 19B

Tsinghua CogVLM2 19B with Llama-3 8B base plus 11B vision expert. Strong document understanding and visual reasoning, 8k context.

DeepSeek-VL 7B

DeepSeek-VL 7B chat model. Vision-language model with hybrid vision encoder and strong real-world visual question answering performance.

Donut Document

Naver CLOVA Donut OCR-free document-understanding transformer. End-to-end JSON extraction from forms, receipts and invoices without explicit OCR.

Dots OCR

Rednote Hilab Dots OCR. End-to-end document parsing model with layout, text and reading-order prediction in one transformer.

EasyOCR

JaidedAI EasyOCR. Simple Python OCR wrapper supporting 80+ languages with deep-learning text detection and recognition.

€0.002

Florence-2 Large

Microsoft Florence-2 Large. Unified prompt-based vision foundation model for captioning, detection, segmentation and OCR with a single 770M-param backbone.

Florence-2 Segmentation

Microsoft Florence-2 unified vision model with referring expression segmentation. Text-prompted region and mask generation in one model.

Gemini 1.5 Flash (vision)

Google Gemini 1.5 Flash, the fast low-cost multimodal model. 1M-token context, image/audio/video input, good for high-volume captioning, classification and long-video skim tasks.

google, gemini, vision

GLPN Depth

Global-Local Path Networks depth-estimation model. Combines hierarchical transformer encoder with selective feature fusion for sharp boundaries.

€0.004

GOT-OCR 2.0

StepFun GOT-OCR 2.0. Unified end-to-end OCR-2.0 model handling text, formulas, charts, sheet music and geometric shapes in one architecture.

GPT-4o mini (vision)

OpenAI's small multimodal model with image input. Much cheaper than GPT-4o, 128k context, good for high-volume captioning, OCR-style reads, tagging and screenshot understanding.

openai, vision, multimodal

GPT-5.4 Nano

New

OpenAI's smallest and cheapest GPT-5.4 variant. Built for high-volume classification, extraction and coding subagents at edge-grade latency.

openai, cost-efficient, low-latency

Grok 2 Vision

Multimodal, xAI

xAI's vision-capable Grok 2 snapshot. Image-in, text-out with strong multilingual instruction following.

xai, vision, legacy

Grok 4.1 Fast

Multimodal, xAI

New

xAI's cost-efficient high-throughput model. 2M context, optional reasoning, optimized for agentic loops and real-time apps.

xai, cost-efficient, vision

Grounded-SAM

Grounding DINO plus SAM. Open-vocabulary, text-prompted detection and segmentation in one pipeline for fully automatic mask generation.

Idefics3 8B

Hugging Face Idefics3 8B. Llama-3 based open-source vision-language model with strong document QA and chart-understanding performance.

€0.007

Llama 3.2 Vision 11B (Ollama)

Meta Llama 3.2 11B Vision served via Ollama on Replicate. Open-weights multimodal model for image captioning, document and chart reading, and visual question answering.

replicate, meta, llama

Llama 3.2 Vision 90B

Meta Llama 3.2 90B Vision. Largest open-weights Llama vision model. Strong visual reasoning, chart, OCR and document understanding.

€0.02

LLaVA 1.6 Vicuna 13B

LLaVA 1.6 (LLaVA-NeXT) with a Vicuna-13B language backbone. Open vision-language chat model that describes images, answers questions, reads charts and reasons about scenes. Version 1.6 adds higher input resolution and better OCR and reasoning than LLaVA 1.5.

€2.00

replicate, llava, captioning

LLaVA v1.6 34B

LLaVA v1.6 on a Nous-Hermes-2 34B base, served on Replicate. Open-source vision-language assistant for image question answering, description and visual reasoning at higher resolution.

replicate, llava, vision-understanding

Lotus-G

Lotus generative depth model. Treats depth as a generation task using a diffusion model, producing higher-fidelity depth on textured surfaces.

Marigold

ETH Zurich Marigold. Diffusion-based monocular depth-estimation model fine-tuned from Stable Diffusion with strong fine-detail recovery.

Marker PDF Extract

Marker PDF-to-Markdown conversion pipeline. Combines layout, OCR and equation models to produce clean Markdown with preserved tables and formulas.

Mask2Former

Meta Mask2Former universal image-segmentation transformer. Single architecture for panoptic, instance and semantic segmentation tasks.

MiDaS v3.1

Intel MiDaS v3.1 relative depth-estimation model. Robust zero-shot single-image depth across diverse domains and resolutions.

€0.004

MiniCPM-V 2.6

OpenBMB MiniCPM-V 2.6. 8B vision-language model with strong single-image, multi-image and video understanding plus OCR capabilities.

Molmo 7B

Allen AI Molmo 7B-D on Replicate. Open vision-language model trained on the PixMo data, notable for pointing at and locating objects in images, not just describing them.

replicate, allenai, molmo

Moondream2

Moondream2 small vision-language model on Replicate. About 1.9B params, designed to run on edge devices, handles captioning, visual QA and short OCR-style reads at very low cost.

€0.003

replicate, moondream, vision-understanding

olmOCR

Allen AI olmOCR. Open-source 7B vision-language model fine-tuned for high-fidelity document parsing including math, code and tables.

OpenPose

CMU OpenPose multi-person 2D pose estimator. Real-time keypoint detection for body, hand, face and foot using Part Affinity Fields.

€0.005

replicate, pose, vision-understanding

PaddleOCR v3

Baidu PaddleOCR v3 PP-OCR pipeline. Lightweight detector plus recognizer optimized for production use with 80+ language support.

€0.003

Qwen2-VL 7B Instruct

Alibaba Qwen2-VL 7B served on Replicate. Open-weights vision-language model that chats about images and video, with dynamic resolution and strong OCR and document QA for its size.

replicate, qwen, alibaba

Qwen2.5-VL 7B Instruct (HF)

Multimodal, huggingface

Alibaba Qwen2.5-VL 7B via Hugging Face Inference. Open-weights image-text-to-text model with improved OCR, chart and table reading, object grounding and long-document understanding.

€0.007

huggingface, qwen, alibaba

Reka Core

Multimodal, Custom

Reka's frontier multimodal model supporting text, image, video and audio inputs.

reka, multimodal, video-understanding

Reka Edge

Multimodal, Custom

Reka's small on-device-friendly multimodal model. ~7B parameters, 16k context.

reka, multimodal, edge

Reka Flash

Multimodal, Custom

Reka's 21B dense multimodal model balancing speed and quality. Up to 128k context.

reka, multimodal, cost-efficient

SegFormer B5

NVIDIA SegFormer-B5 semantic segmentation. Hierarchical transformer encoder with lightweight MLP decoder, strong ADE20K and Cityscapes results.

€0.007

Yi-VL 34B

Multimodal, 01.AI

01.AI Yi-VL 34B vision-language model. Bilingual (CN/EN) image understanding, strong CMMMU and MMMU performance among open-weights VLMs.

€0.02

ZoeDepth

Intel ZoeDepth metric depth-estimation model. Combines relative-depth pretraining with metric fine-tuning for absolute distance in real units.

€0.005
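
The difference between relative and metric depth in the entries above can be made concrete with a small sketch. A relative prediction is only meaningful up to an unknown scale and shift, so one common way to compare it with real measurements is a least-squares fit of those two numbers. This is an illustration only, not how ZoeDepth produces metric output (the entry above describes metric fine-tuning); the function name and the sample values are hypothetical.

```python
def fit_scale_shift(relative, reference):
    """Least-squares scale s and shift t minimizing sum((s * r + t - m) ** 2)."""
    n = len(relative)
    if n != len(reference) or n < 2:
        raise ValueError("need two or more paired samples")
    mean_r = sum(relative) / n
    mean_m = sum(reference) / n
    var_r = sum((r - mean_r) ** 2 for r in relative)
    if var_r == 0:
        raise ValueError("relative values are constant; scale is undefined")
    cov = sum((r - mean_r) * (m - mean_m) for r, m in zip(relative, reference))
    scale = cov / var_r
    shift = mean_m - scale * mean_r
    return scale, shift


# Hypothetical samples: unitless relative predictions and reference values in metres.
relative = [0.0, 0.5, 1.0]
reference_m = [1.0, 2.0, 3.0]

scale, shift = fit_scale_shift(relative, reference_m)
aligned_m = [scale * r + shift for r in relative]
print(scale, shift, aligned_m)
```

A metric model skips this step because its output is already expressed in real units.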

Top multimodal model picks

Hand-picked across four common criteria and resolved against the live catalog, so the picks track price and performance changes.

Best overall

BLIP

Cheapest

GPT-5.4 Mini

Longest context

Gemini 1.5 Pro (vision)

Fastest

BLIP
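
As a rough sketch of how a pick could be resolved against catalog data, the snippet below selects the entry with the largest context window. The `Entry` type and `pick` helper are hypothetical and say nothing about how this catalog is actually implemented; the three context sizes are taken from the descriptions in the listing above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical catalog record with only the fields this sketch needs."""
    name: str
    context_tokens: int


def pick(entries, key, highest=True):
    """Return the entry with the highest (or lowest) value of key."""
    entries = list(entries)
    if not entries:
        raise ValueError("catalog is empty")
    return max(entries, key=key) if highest else min(entries, key=key)


catalog = [
    Entry("Claude 3.5 Sonnet (vision)", 200_000),
    Entry("Gemini 1.5 Pro (vision)", 2_000_000),
    Entry("GPT-4o (vision)", 128_000),
]

longest_context = pick(catalog, key=lambda entry: entry.context_tokens)
print(longest_context.name)  # Gemini 1.5 Pro (vision)
```

Re-running the same selection whenever the catalog changes is what would let a pick follow updated figures.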