Multimodal Models

Models that combine text, vision, and other modalities

12 models available

Gemini 2.0 Flash (Multimodal)

Multimodal | Google DeepMind
Popular

Google's multimodal model accepting text, images, audio, and video. Native multimodal understanding across input types.

€0.007
vision, audio, video-understanding

GPT-4o (Vision)

Multimodal | OpenAI
Popular

GPT-4o's vision capabilities. Analyze images, charts, documents, and screenshots with detailed understanding and reasoning.

€0.01
vision, document-analysis, charts

Claude 3.5 Sonnet (Vision)

Multimodal | Anthropic

Claude's vision capabilities. Excellent at analyzing images, documents, and code screenshots with detailed, accurate descriptions.

€0.02
vision, documents, code-screenshots

CogVLM

Multimodal | Community

Powerful visual language model from Tsinghua. Deep image understanding with detailed visual reasoning.

€0.005
vision, reasoning, detailed

Florence 2

Multimodal | Microsoft

Microsoft's foundation vision model. Object detection, captioning, segmentation, and OCR in one model.

€0.003
multi-task, Microsoft, OCR

InternVL 2

Multimodal | InternVL

Open-source vision-language model rivaling GPT-4V. Strong visual understanding across diverse domains.

€0.005
GPT-4V rival, open-source, 26B

LLaVA 1.6 34B

Multimodal | Together AI

Open-source multimodal model combining language and vision. Strong visual understanding with conversational capabilities.

€0.004
open-source, vision, conversational

LLaVA 1.6 13B

Multimodal | Community

Open-source multimodal model. Analyze and describe images with natural language understanding.

€0.003
vision, open-source, analysis

Moondream 2

Multimodal | Community

Tiny but capable vision-language model. It has only 1.8B parameters yet handles image understanding well for its size.

€0.001
tiny, 1.8B, efficient

OCR with GPT-4o

Multimodal | Community

Accurate text extraction from images using GPT-4o vision. Extract text, tables, and structured data.

€0.01
OCR, text-extraction, tables

Pixtral Large

Multimodal | Mistral AI

Mistral's vision-language model. 124B parameters with native image understanding, document analysis, and visual reasoning.

€0.01
vision, 124B, document-analysis

Qwen VL Plus

Multimodal | Community

Alibaba's vision-language model. Strong at document understanding, charts, and multilingual visual QA.

€0.003
documents, charts, multilingual
