Multimodal Models
Models that combine text, vision, and other modalities
12 models available
Gemini 2.0 Flash (Multimodal)
Google's multimodal model accepting text, images, audio, and video. Native multimodal understanding across input types.
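A minimal sketch of a mixed text-and-image request, assuming the google-generativeai Python SDK and a "gemini-2.0-flash" model id (the SDK choice, model id, and file name are assumptions; check the provider's documentation):

    # Sketch: multimodal prompt to Gemini 2.0 Flash via the google-generativeai SDK.
    # The model id "gemini-2.0-flash" and the local file name are assumptions.
    import google.generativeai as genai
    import PIL.Image

    genai.configure(api_key="YOUR_API_KEY")  # placeholder credential
    model = genai.GenerativeModel("gemini-2.0-flash")

    image = PIL.Image.open("chart.png")  # hypothetical input image
    response = model.generate_content(["Summarize what this chart shows.", image])
    print(response.text)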
GPT-4o (Vision)
GPT-4o's vision capabilities. Analyze images, charts, documents, and screenshots with detailed understanding and reasoning.
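A minimal sketch of an image-analysis request using the official OpenAI Python SDK; the image URL and prompt are placeholders:

    # Sketch: image analysis with GPT-4o via the OpenAI chat completions API.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "What trend does this chart show?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/chart.png"}},  # placeholder URL
            ],
        }],
    )
    print(response.choices[0].message.content)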
Claude 3.5 Sonnet (Vision)
Claude's vision capabilities. Excellent at analyzing images, documents, and code screenshots with detailed, accurate descriptions.
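A minimal sketch of sending a screenshot to Claude with the Anthropic Python SDK; the file name and model alias are assumptions:

    # Sketch: image analysis with Claude 3.5 Sonnet via the Anthropic messages API.
    import base64
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    with open("screenshot.png", "rb") as f:  # hypothetical input file
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

    message = client.messages.create(
        model="claude-3-5-sonnet-latest",  # model alias is an assumption
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64", "media_type": "image/png", "data": image_data}},
                {"type": "text", "text": "Describe what this screenshot shows."},
            ],
        }],
    )
    print(message.content[0].text)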
CogVLM
Vision-language model from Tsinghua University's THUDM group and Zhipu AI. Deep image understanding with detailed visual reasoning.
Florence 2
Microsoft's foundation vision model. Object detection, captioning, segmentation, and OCR in one model.
InternVL 2
Open-source vision-language model from OpenGVLab (Shanghai AI Laboratory), competitive with GPT-4V. Strong visual understanding across diverse domains.
LLaVA 1.6 34B
Open-source multimodal model combining language and vision. Strong visual understanding with conversational capabilities.
LLaVA 1.6 13B
The smaller 13B variant of LLaVA 1.6. Analyze and describe images with natural language understanding at a lower compute cost than the 34B model.
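Because both LLaVA 1.6 sizes are open weights, they can also be run locally. A minimal sketch with Hugging Face transformers, assuming the community "llava-hf/llava-v1.6-vicuna-13b-hf" checkpoint and its USER/ASSISTANT prompt template (repo id and template are assumptions; the 34B checkpoint uses a different template):

    # Sketch: local LLaVA 1.6 13B inference with transformers (assumed repo id).
    import torch
    from PIL import Image
    from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

    model_id = "llava-hf/llava-v1.6-vicuna-13b-hf"  # assumed community checkpoint
    processor = LlavaNextProcessor.from_pretrained(model_id)
    model = LlavaNextForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )

    image = Image.open("photo.jpg")  # hypothetical input image
    prompt = "USER: <image>\nWhat is happening in this picture? ASSISTANT:"
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=100)
    print(processor.decode(output[0], skip_special_tokens=True))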
Moondream 2
Tiny but capable vision-language model. Only 1.8B params yet surprisingly good at image understanding.
OCR with GPT-4o
Accurate text extraction from images using GPT-4o vision. Extract text, tables, and structured data.
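A minimal OCR sketch with the OpenAI SDK: send a scanned page as a base64 data URL and ask for the text and any tables back as Markdown (file name and prompt are placeholders):

    # Sketch: OCR and table extraction with GPT-4o vision.
    import base64
    from openai import OpenAI

    client = OpenAI()
    with open("invoice.png", "rb") as f:  # hypothetical scanned document
        b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Extract all text from this image. Return any tables as Markdown."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    print(response.choices[0].message.content)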
Pixtral Large
Mistral's vision-language model. 124B parameters with native image understanding, document analysis, and visual reasoning.
Qwen VL Plus
Alibaba's vision-language model. Strong at document understanding, charts, and multilingual visual QA.
Start Building with AI
Access all models through a single API. Get free credits when you sign up — no credit card required.
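A minimal sketch of calling one of the catalog models through a single OpenAI-compatible endpoint; the base URL, environment variable, and model id below are assumptions, not the platform's documented values:

    # Sketch: unified access through an OpenAI-compatible gateway (URL and ids assumed).
    import os
    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.example.com/v1",    # hypothetical gateway URL
        api_key=os.environ["PLATFORM_API_KEY"],   # hypothetical credential
    )
    response = client.chat.completions.create(
        model="pixtral-large",  # hypothetical id for a model from the catalog above
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/photo.jpg"}},  # placeholder
            ],
        }],
    )
    print(response.choices[0].message.content)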