Llama 3.2 90B Vision (multimodal)
Meta's flagship vision-language model. 90B parameters, image understanding + chat, strong VQA performance.
Llama 3.2 90B Vision (multimodal) is a multimodal AI model from Meta, priced at €1.20 per 1M input tokens with a 131.1K-token context window.
Pricing
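As a rough guide to what the listed input rate means in practice, the sketch below estimates input cost from a token count. It assumes only the €1.20 per 1M input tokens figure quoted on this page; output-token pricing is not stated here and is left out. The helper name `estimateInputCostEur` is illustrative and not part of any SDK.

```javascript
// Input price quoted on this page: EUR 1.20 per 1,000,000 input tokens.
const INPUT_PRICE_EUR_PER_MILLION = 1.2;

// Hypothetical helper (not part of the railwail SDK):
// estimates the input-side cost of a request in euros.
function estimateInputCostEur(inputTokens) {
  if (!Number.isFinite(inputTokens) || inputTokens < 0) {
    throw new RangeError("inputTokens must be a non-negative number");
  }
  return (inputTokens / 1_000_000) * INPUT_PRICE_EUR_PER_MILLION;
}

// Example: a 50,000-token prompt.
console.log(estimateInputCostEur(50_000).toFixed(4)); // prints 0.0600
```

Actual billing may round differently or include output tokens, so treat the result as an estimate only.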
API Integration
Use our OpenAI-compatible API to integrate Llama 3.2 90B Vision (multimodal) into your application.
npm install railwail
import railwail from "railwail";
const rw = railwail("YOUR_API_KEY");
// Simple: just pass a string
const reply = await rw.run("llama-3-2-90b-vision-mm", "Hello! What can you do?");
console.log(reply);
// With message history
const reply2 = await rw.run("llama-3-2-90b-vision-mm", [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "Explain quantum computing simply." },
]);
console.log(reply2);
// Full response with usage info
const res = await rw.chat("llama-3-2-90b-vision-mm", [
{ role: "user", content: "Hello!" },
], { temperature: 0.7, max_tokens: 500 });
console.log(res.choices[0].message.content);
console.log(res.usage);
Deep dive: Meta AI (FAIR)'s Llama 3.2 90B Vision (multimodal)
Meta AI is the research arm of Meta Platforms, established in 2013 as Facebook AI Research (FAIR) by Yann LeCun. FAIR has open-sourced many foundational projects, including the PyTorch framework and models such as RoBERTa, DETR, SAM and the LLaMA family. LLaMA 1 was released in February 2023, Llama 2 in July 2023, Llama 3 in April 2024 and Llama 3.1 (including the 405B model) in July 2024. Llama 3.2 launched in September 2024 at Meta Connect, introducing the first multimodal models in the family (vision-enabled 11B and 90B) together with small on-device text-only siblings (1B, 3B). The Llama 3.2 vision weights are released under the Llama 3.2 Community Licence and are available through Meta's partner ecosystem (Hugging Face, Amazon Bedrock, Azure AI Studio, Google Vertex AI, Together AI, Groq, Fireworks).
Llama 3.2 90B Vision combines the 70B-parameter Llama 3.1 text backbone with a Vision Transformer image encoder, integrated via cross-attention adapter layers; the vision components bring the total to 90B parameters. The design is similar in spirit to Flamingo but reuses the LLaMA architecture. The vision tower converts each image into a sequence of visual tokens, which are injected into specific cross-attention layers of the LLM decoder. The original text-only weights remain frozen during the multimodal training stage, which preserves text-only performance. Pretraining used 6B image-text pairs, followed by multi-stage supervised fine-tuning and Direct Preference Optimisation (DPO) on a curated set of image instructions, math and chart data. The model supports a 128K context window and accepts image inputs up to 1120x1120 natively (with tiling for larger images). It does not support video or audio. Llama 3.2 90B Vision is released under the Llama 3.2 Community Licence (free for commercial use under 700M MAU).
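To stay within the 1120x1120 native input size mentioned above, a client could downscale larger images before upload instead of relying on tiling. The sketch below only computes target dimensions; the helper name `fitWithinNativeSize` is hypothetical, and how a hosted endpoint actually handles oversized images is not documented on this page.

```javascript
// Native image size stated on this page.
const MAX_SIDE = 1120;

// Hypothetical helper: returns dimensions scaled to fit inside
// maxSide x maxSide while preserving aspect ratio.
// Images already within the limit are returned unchanged.
function fitWithinNativeSize(width, height, maxSide = MAX_SIDE) {
  if (!(width > 0) || !(height > 0)) {
    throw new RangeError("width and height must be positive");
  }
  // Never upscale: the scale factor is capped at 1.
  const scale = Math.min(1, maxSide / Math.max(width, height));
  return {
    width: Math.max(1, Math.round(width * scale)),
    height: Math.max(1, Math.round(height * scale)),
    scaled: scale < 1,
  };
}

console.log(fitWithinNativeSize(2240, 1120));
// prints { width: 1120, height: 560, scaled: true }
```

Downscaling trades fine detail for a predictable input size, so small text in dense documents may be better served by cropping the region of interest first.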
- Parameters: 90B
- Context: 128K tokens
- Open-weights 90B vision-language model under the Llama 3.2 Community Licence
- 128K token context window
- Image input up to 1120x1120 with tiling for larger images
- Chart, diagram, OCR and document understanding
- Strong on MMMU, MathVista, ChartQA and DocVQA among open-weights models
- Multilingual: English, German, French, Italian, Portuguese, Spanish, Hindi, Thai
- Tool use and JSON output via Llama 3.1 alignment recipe
- Best for: open-weights multimodal apps, on-premise document AI, indie research
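The image-input capability listed above is typically exercised by sending an image alongside a text question. Since the API is described as OpenAI-compatible, the sketch below builds a message in the OpenAI chat-completions style, where `content` is an array of `text` and `image_url` parts. Whether `rw.chat` accepts this exact shape is an assumption not confirmed on this page; `buildVisionMessages` and the example URL are hypothetical.

```javascript
// Hypothetical helper: builds an OpenAI-style message list containing
// one text question and one image reference.
// Assumption: the endpoint accepts `image_url` content parts the way
// OpenAI's chat completions API does.
function buildVisionMessages(question, imageUrl) {
  if (typeof question !== "string" || question.length === 0) {
    throw new TypeError("question must be a non-empty string");
  }
  if (typeof imageUrl !== "string" || imageUrl.length === 0) {
    throw new TypeError("imageUrl must be a non-empty string");
  }
  return [
    {
      role: "user",
      content: [
        { type: "text", text: question },
        { type: "image_url", image_url: { url: imageUrl } },
      ],
    },
  ];
}

const messages = buildVisionMessages(
  "What does this chart show?",
  "https://example.com/chart.png" // placeholder URL
);
console.log(JSON.stringify(messages, null, 2));

// Possible usage, if the SDK forwards messages unchanged (unverified):
// const res = await rw.chat("llama-3-2-90b-vision-mm", messages, { max_tokens: 500 });
```

Keeping the payload construction separate from the network call makes it easy to inspect or log the request before sending it.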
Pretrained on 6B image-text pairs from public web and licensed sources, followed by supervised fine-tuning and DPO on curated multimodal instruction data. Text knowledge is inherited from Llama 3.1, which was pretrained on about 15T tokens.
License: Llama 3.2 Community Licence. Free for commercial use up to 700M monthly active users (MAU); redistribution must include the licence and acceptable use policy.
Known limitations
- No video or audio input
- Latency and cost dominated by 90B params; requires multi-GPU serving
- Licence requires separate terms from Meta above 700M monthly active users, which affects the largest hyperscaler use cases
- Vision quality below GPT-4o and Claude 3.5 Sonnet on the hardest charts
- English-centric in the vision domain
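The multi-GPU requirement noted above follows from simple arithmetic on the parameter count. The sketch below estimates the memory needed for the weights alone; the 16-bit precision and the 80 GB per-GPU figure are assumptions chosen for illustration, and the estimate ignores the KV cache, activations and any serving overhead.

```javascript
// Rough weight-memory estimate in GB: parameters x bytes per parameter.
function weightMemoryGB(params, bytesPerParam) {
  return (params * bytesPerParam) / 1e9;
}

// Minimum number of GPUs needed just to hold the weights.
function minGpusForWeights(params, bytesPerParam, gpuMemoryGB) {
  return Math.ceil(weightMemoryGB(params, bytesPerParam) / gpuMemoryGB);
}

const PARAMS = 90e9; // 90B parameters, as stated on this page

// Assumption: 2 bytes per parameter (16-bit weights).
console.log(weightMemoryGB(PARAMS, 2)); // prints 180

// Assumption: 80 GB of memory per GPU.
console.log(minGpusForWeights(PARAMS, 2, 80)); // prints 3
```

Real deployments usually need more headroom than this lower bound, and quantised weights would reduce the figure accordingly.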
Related Models
Claude Opus 4.7
Anthropic's April 2026 flagship. 87.6% on SWE-bench Verified, 3x higher image resolution, output self-verification, vision + reasoning.
Claude Sonnet 4.6
Anthropic's balanced mid-tier model from February 2026. Best price/performance for production workloads: 5x cheaper than Opus, near-flagship quality.
Depth Anything v2
Monocular depth-estimation model trained on 595k labeled and 62M unlabeled images. Strong zero-shot generalization in indoor and outdoor scenes.
GPT-5.4
OpenAI's unified flagship combining GPT and o-series reasoning into one model. 1M context, multimodal, top SWE-Bench Pro and OSWorld scores.