Llama 3.2 90B Vision (multimodal)

Meta
Multimodal

Meta's flagship vision-language model. 90B parameters, image understanding + chat, strong VQA performance.

TL;DR · Last updated May 16, 2026

Llama 3.2 90B Vision (multimodal) is a multimodal AI model from Meta, priced at €1.20 per 1M input tokens, with a 131,072-token context window.
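
As a rough illustration of the per-token price quoted above, the sketch below estimates the input-side cost of a request in euros. Only the €1.20 per 1M input tokens figure comes from this page; the helper name `estimateInputCostEur` is hypothetical, and output-token pricing is not stated here, so it is left out.

```typescript
// Input price quoted on this page: EUR 1.20 per 1,000,000 input tokens.
const INPUT_PRICE_EUR_PER_MILLION = 1.2;

// Hypothetical helper: estimate the input-side cost of a request in euros.
// Output-token pricing is not listed on this page, so it is not modelled.
function estimateInputCostEur(inputTokens: number): number {
  if (!Number.isFinite(inputTokens) || inputTokens < 0) {
    throw new RangeError("inputTokens must be a non-negative finite number");
  }
  return (inputTokens / 1_000_000) * INPUT_PRICE_EUR_PER_MILLION;
}

// Example: a prompt that fills the whole 131,072-token context window.
const fullContextCost = estimateInputCostEur(131_072);
```

At this rate, even a prompt that fills the entire context window costs well under one euro on the input side.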

Pricing

Price per generation: Free

API Integration

Use our OpenAI-compatible API to integrate Llama 3.2 90B Vision (multimodal) into your application.

Install
npm install railwail
JavaScript / TypeScript
import railwail from "railwail";

const rw = railwail("YOUR_API_KEY");

// Simple: just pass a string
const reply = await rw.run("llama-3-2-90b-vision-mm", "Hello! What can you do?");
console.log(reply);

// With message history
const reply2 = await rw.run("llama-3-2-90b-vision-mm", [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Explain quantum computing simply." },
]);
console.log(reply2);

// Full response with usage info
const res = await rw.chat("llama-3-2-90b-vision-mm", [
  { role: "user", content: "Hello!" },
], { temperature: 0.7, max_tokens: 500 });
console.log(res.choices[0].message.content);
console.log(res.usage);
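
The snippets above send text only. Since the API is described as OpenAI-compatible, an image prompt could plausibly use OpenAI-style content parts. The sketch below only builds such a message and makes no network call. Whether `rw.chat` accepts `image_url` parts is an assumption that this page does not confirm, and `buildImageMessage` is a hypothetical helper.

```typescript
// OpenAI-style content parts for a mixed text + image user message.
type TextPart = { type: "text"; text: string };
type ImagePart = { type: "image_url"; image_url: { url: string } };
type UserMessage = { role: "user"; content: Array<TextPart | ImagePart> };

// Hypothetical helper: pair a question with one image (https URL or data URL).
function buildImageMessage(question: string, imageUrl: string): UserMessage {
  if (!/^(https:\/\/|data:image\/)/.test(imageUrl)) {
    throw new Error("imageUrl must be an https URL or a data:image/ URL");
  }
  return {
    role: "user",
    content: [
      { type: "text", text: question },
      { type: "image_url", image_url: { url: imageUrl } },
    ],
  };
}

// Example message; the URL is a placeholder.
const imageMessage = buildImageMessage(
  "What does this chart show?",
  "https://example.com/chart.png",
);

// Assumed usage, mirroring the rw.chat call above (not verified here):
// const res = await rw.chat("llama-3-2-90b-vision-mm", [imageMessage], { max_tokens: 500 });
```

Keeping the message construction separate from the request makes it easy to validate inputs before spending any tokens.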
Specifications
Context window: 131,072 tokens
Max output: 8,192 tokens
Developer: Meta
Category: Multimodal
Supported formats: text, image
Tags: meta, llama, multimodal, vision, open-weights
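
The context-window and max-output limits listed above interact when choosing `max_tokens`. The sketch below clamps a requested value to both limits. It assumes that prompt and completion tokens share the same 131,072-token window, which this page does not state explicitly, and `clampMaxTokens` is a hypothetical helper.

```typescript
// Limits taken from the specifications above.
const CONTEXT_WINDOW_TOKENS = 131_072;
const MAX_OUTPUT_TOKENS = 8_192;

// Hypothetical helper: pick a max_tokens value that fits the remaining window.
// Assumes prompt and completion tokens share one context window.
function clampMaxTokens(promptTokens: number, requested: number): number {
  if (!Number.isInteger(promptTokens) || promptTokens < 0) {
    throw new RangeError("promptTokens must be a non-negative integer");
  }
  if (!Number.isInteger(requested) || requested < 1) {
    throw new RangeError("requested must be a positive integer");
  }
  const remaining = CONTEXT_WINDOW_TOKENS - promptTokens;
  if (remaining < 1) {
    throw new RangeError("prompt leaves no room for output in the context window");
  }
  return Math.min(requested, MAX_OUTPUT_TOKENS, remaining);
}

// Example: a long prompt leaves less room than the nominal output cap.
const clampedForLongPrompt = clampMaxTokens(130_000, 8_192);
```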

Deep dive: Meta AI (FAIR)'s Llama 3.2 90B Vision (multimodal)

About Meta AI (FAIR)
Founded 2013 · Menlo Park, California, USA

Meta AI is the research arm of Meta Platforms, established in 2013 as Facebook AI Research (FAIR) by Yann LeCun. FAIR has open-sourced many foundational tools and models, including PyTorch, RoBERTa, DETR, SAM and the Llama family. LLaMA 1 was released in February 2023, Llama 2 in July 2023, Llama 3 in April 2024 and Llama 3.1 (including the 405B model) in July 2024. Llama 3.2 launched in September 2024 at Meta Connect, introducing the first multimodal models in the Llama family (vision-enabled 11B and 90B) together with small on-device text-only siblings (1B, 3B). All Llama 3.2 vision weights are released under the Llama 3.2 Community Licence and are available to enterprise customers through Meta's partner ecosystem (Hugging Face, AWS Bedrock, Azure AI Studio, Google Vertex, Together AI, Groq, Fireworks).

Architecture
Decoder-only Transformer with cross-attended vision encoder

Llama 3.2 90B Vision pairs the 70B-parameter Llama 3.1 text backbone with a Vision Transformer image encoder; the vision components extend the total to 90B parameters. The encoder is integrated through cross-attention adapter layers, similar in spirit to Flamingo but built on the Llama architecture. The vision tower turns each image into a sequence of visual tokens, which are injected into specific cross-attention layers of the LLM decoder. The original text-only weights stay frozen during the multimodal training stage, which preserves text-only performance. Pretraining used 6B image-text pairs, followed by multi-stage supervised fine-tuning and Direct Preference Optimisation (DPO) on a curated set of image instructions, math and chart data. The model supports a 128K-token context window and accepts images up to 1120x1120 natively (with tiling for larger images). It does not support video or audio input. Llama 3.2 90B Vision is released under the Llama 3.2 Community Licence (free for commercial use below 700M monthly active users).
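
To make the cross-attention idea concrete, here is a toy single-head sketch in which text hidden states act as queries and visual tokens act as keys and values. It is not Meta's implementation. The learned projections are omitted, and the tanh gate is an assumption borrowed from Flamingo-style adapters. The name `crossAttend` and the tiny matrices are purely illustrative.

```typescript
type Matrix = number[][];

// Numerically stable softmax over one row of attention scores.
function softmax(scores: number[]): number[] {
  const max = Math.max(...scores);
  const exps = scores.map((s) => Math.exp(s - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Toy single-head cross-attention: each text state (query) attends over the
// visual tokens (keys and values). Learned Q/K/V projections are omitted
// (treated as identity) to keep the sketch short.
function crossAttend(textStates: Matrix, visualTokens: Matrix, gate: number): Matrix {
  if (visualTokens.length === 0) {
    throw new Error("need at least one visual token");
  }
  const dim = visualTokens[0].length;
  return textStates.map((query) => {
    if (query.length !== dim) {
      throw new Error("text and visual dimensions must match");
    }
    // Scaled dot-product scores between this query and every visual token.
    const weights = softmax(
      visualTokens.map(
        (key) => key.reduce((acc, k, i) => acc + k * query[i], 0) / Math.sqrt(dim),
      ),
    );
    // Residual update: text state plus gated weighted sum of visual tokens.
    // The tanh gate is an assumed, Flamingo-style detail.
    return query.map(
      (q, i) =>
        q + Math.tanh(gate) * visualTokens.reduce((acc, v, j) => acc + weights[j] * v[i], 0),
    );
  });
}

// Example: two text positions attending over two visual tokens.
const fusedStates = crossAttend([[1, 0], [0, 1]], [[0.5, 0.5], [1, -1]], 1);
```

With `gate` set to 0 the text states pass through unchanged, which shows how an adapter of this kind can be added without disturbing the text-only path.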

Parameters: 90B
Context: 128K tokens
What it can do
  • Open-weights 90B vision-language model under the Llama 3.2 Community Licence
  • 128K token context window
  • Image input up to 1120x1120 with tiling for larger images
  • Chart, diagram, OCR and document understanding
  • Strong on MMMU, MathVista, ChartQA and DocVQA among open-weights models
  • Multilingual text support: English, German, French, Italian, Portuguese, Spanish, Hindi, Thai
  • Tool use and JSON output via Llama 3.1 alignment recipe
  • Best for: open-weights multimodal apps, on-premise document AI, indie research
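
The 1120x1120 native image size listed above suggests one conservative client-side option: downscale larger images before upload so they fit within that square. The sketch below only computes target dimensions while preserving aspect ratio. It sidesteps the model's tiling, whose details this page does not give, rather than reproducing it, and `fitWithinNative` is a hypothetical helper.

```typescript
// Native maximum side length quoted in the capability list above.
const MAX_NATIVE_SIDE = 1120;

type Size = { width: number; height: number };

// Hypothetical helper: scale an image down (never up) so that both sides fit
// within maxSide, preserving the aspect ratio.
function fitWithinNative(size: Size, maxSide: number = MAX_NATIVE_SIDE): Size {
  const { width, height } = size;
  if (!(width > 0) || !(height > 0)) {
    throw new RangeError("width and height must be positive");
  }
  const scale = Math.min(1, maxSide / width, maxSide / height);
  return {
    width: Math.max(1, Math.round(width * scale)),
    height: Math.max(1, Math.round(height * scale)),
  };
}

// Example: a wide 2240x1120 scan is halved; an 800x600 photo is left alone.
const resizedScan = fitWithinNative({ width: 2240, height: 1120 });
const untouchedPhoto = fitWithinNative({ width: 800, height: 600 });
```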
Training & License

Pretrained on 6B image-text pairs from public web and licensed sources; supervised fine-tuning and DPO on curated multimodal instruction data. Text knowledge inherited from Llama 3.1 (15T tokens).

License: Llama 3.2 Community Licence. Free for commercial use by organisations with up to 700M monthly active users (MAU); redistribution must include the licence and the acceptable use policy.

Known limitations
  • No video or audio input
  • Latency and cost dominated by 90B params; requires multi-GPU serving
  • Licence requires organisations above 700M monthly active users to obtain a separate licence from Meta
  • Vision quality below GPT-4o and Claude 3.5 Sonnet on the hardest chart tasks
  • English-centric for image+text tasks
