How much does Llama 3.2 90B Vision (multimodal) cost via Railwail?

Input: €1.20 per 1M tokens. Output: €1.20 per 1M tokens. No monthly minimum, no subscription. Start with €5 free credits.
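
As a rough illustration of the arithmetic behind these rates, the sketch below estimates the cost of one request and the number of tokens a given budget covers. The function and constant names are hypothetical helpers, not part of the Railwail SDK, and the result ignores any rounding or billing rules the platform may apply.

```typescript
// Rates taken from the pricing answer above (EUR per 1M tokens).
const INPUT_EUR_PER_MILLION = 1.2;
const OUTPUT_EUR_PER_MILLION = 1.2;

// Hypothetical helper: estimate the cost in EUR of one request.
function estimateCostEur(inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens / 1_000_000) * INPUT_EUR_PER_MILLION +
    (outputTokens / 1_000_000) * OUTPUT_EUR_PER_MILLION
  );
}

// Hypothetical helper: total tokens (input plus output) a budget covers.
// A single division works only because both rates are identical here.
function tokensForBudget(budgetEur: number): number {
  return Math.floor((budgetEur / INPUT_EUR_PER_MILLION) * 1_000_000);
}

// 10,000 input tokens plus 2,000 output tokens.
console.log(estimateCostEur(10_000, 2_000).toFixed(4)); // prints 0.0144

// Tokens covered by the 5 EUR of free credits.
console.log(tokensForBudget(5)); // prints 4166666
```

In other words, the free credits cover a little over 4 million tokens at the listed rates.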

What is the context window of Llama 3.2 90B Vision (multimodal)?

Llama 3.2 90B Vision (multimodal) supports a 131,072-token (128K) context window, enough for long documents, technical manuals, and extended analysis.

How fast is Llama 3.2 90B Vision (multimodal)?

Latency depends on prompt length and load; it is typically 200 ms to 2 s for short prompts. We publish real-time p50 and p95 latency at /rankings.
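
For readers unfamiliar with p50 and p95, the sketch below computes them from a handful of latency samples using the nearest-rank method. This is only an illustration: the method Railwail uses on /rankings is not stated here, and the sample values are made up.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = [...samplesMs].sort((a, b) => a - b); // numeric ascending sort
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based rank
  return sorted[rank - 1];
}

// Made-up latency samples in milliseconds, for illustration only.
const samples = [250, 200, 2000, 400, 300];
console.log(percentile(samples, 50)); // prints 300
console.log(percentile(samples, 95)); // prints 2000
```

The p95 value is dominated by the slowest requests, which is why it can sit far above the median when a few calls hit a loaded backend.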

Is Llama 3.2 90B Vision (multimodal) better than Claude Opus 4.7?

It depends on your use case. Llama 3.2 90B Vision (multimodal) from Meta and Claude Opus 4.7 from Anthropic are both strong multimodal models. Compare them side by side at /compare/llama-3-2-90b-vision-mm-vs-claude-opus-4-7.

Does Llama 3.2 90B Vision (multimodal) support image input (vision)?

Yes. Llama 3.2 90B Vision (multimodal) accepts image inputs in addition to text. Send images via the standard OpenAI-compatible `messages` array with `image_url` content blocks. Supported input types: text, image.
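
A minimal sketch of what such a message can look like, following the OpenAI chat format for `image_url` content blocks. The helper names and the example URL are hypothetical, and whether the Railwail SDK's `chat` method accepts array-valued `content` is an assumption you should confirm against its documentation.

```typescript
// Content-part shapes as used by the OpenAI chat format.
type TextPart = { type: "text"; text: string };
type ImagePart = { type: "image_url"; image_url: { url: string } };
type UserMessage = { role: "user"; content: Array<TextPart | ImagePart> };

// Hypothetical helper: pair a text prompt with one image reference.
function buildVisionMessage(prompt: string, imageUrl: string): UserMessage {
  return {
    role: "user",
    content: [
      { type: "text", text: prompt },
      { type: "image_url", image_url: { url: imageUrl } },
    ],
  };
}

// Hypothetical helper: wrap base64 image bytes in a data URL, which the
// OpenAI format also allows in place of an https URL.
function toDataUrl(base64: string, mimeType: string): string {
  return `data:${mimeType};base64,${base64}`;
}

// The URL below is a placeholder, not a real asset.
const message = buildVisionMessage(
  "What does this chart show?",
  "https://example.com/chart.png",
);
console.log(JSON.stringify(message, null, 2));
```

The resulting object would be passed as one element of the `messages` array in a chat request.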

Llama 3.2 90B Vision (multimodal)

Name: Llama 3.2 90B Vision (multimodal)
Brand: Together AI
SKU: llama-3-2-90b-vision-mm
Price: 0.0012 EUR per 1K tokens (equivalent to €1.20 per 1M tokens)
Availability: InStock

Pricing

Price per Generation

Per generation: Free

API Integration

Use our OpenAI-compatible API to integrate Llama 3.2 90B Vision (multimodal) into your application.

Install

npm install railwail

JavaScript / TypeScript

import railwail from "railwail";

const rw = railwail("YOUR_API_KEY");

// Simple — just pass a string
const reply = await rw.run("llama-3-2-90b-vision-mm", "Hello! What can you do?");
console.log(reply);

// With message history
const reply2 = await rw.run("llama-3-2-90b-vision-mm", [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Explain quantum computing simply." },
]);
console.log(reply2);

// Full response with usage info
const res = await rw.chat("llama-3-2-90b-vision-mm", [
  { role: "user", content: "Hello!" },
], { temperature: 0.7, max_tokens: 500 });
console.log(res.choices[0].message.content);
console.log(res.usage);

Specifications

Context window

131,072 tokens

Max output

8,192 tokens
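
The two limits above interact: 131,072 tokens is 128 × 1,024, which is why the same window is also described as 128K. The sketch below checks whether a request fits, assuming (as is common for chat models, but not confirmed here) that generated tokens count toward the context window. The function names are hypothetical.

```typescript
// Limits from the specifications above.
const CONTEXT_WINDOW_TOKENS = 131_072; // 128 * 1024
const MAX_OUTPUT_TOKENS = 8_192;

// Hypothetical helper: largest prompt that still leaves room for the
// requested output, assuming output tokens share the context window.
function maxPromptTokens(maxOutputTokens: number): number {
  if (maxOutputTokens < 0 || maxOutputTokens > MAX_OUTPUT_TOKENS) {
    throw new RangeError(`max_tokens must be between 0 and ${MAX_OUTPUT_TOKENS}`);
  }
  return CONTEXT_WINDOW_TOKENS - maxOutputTokens;
}

// Hypothetical helper: does prompt plus requested output fit the window?
function fitsContext(promptTokens: number, maxOutputTokens: number): boolean {
  return (
    maxOutputTokens <= MAX_OUTPUT_TOKENS &&
    promptTokens + maxOutputTokens <= CONTEXT_WINDOW_TOKENS
  );
}

console.log(maxPromptTokens(8_192)); // prints 122880
console.log(fitsContext(120_000, 8_192)); // prints true
console.log(fitsContext(125_000, 8_192)); // prints false
```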

Developer

Deep dive — Meta AI (FAIR)'s Llama 3.2 90B Vision (multimodal)

About Meta AI (FAIR)

Founded 2013 · Menlo Park, California, USA

Meta AI is the research arm of Meta Platforms, established in 2013 as Facebook AI Research (FAIR) under Yann LeCun. FAIR has open-sourced many foundational tools and models, including the PyTorch framework, RoBERTa, DETR, SAM and the Llama family. LLaMA 1 was released in February 2023, Llama 2 in July 2023, Llama 3 in April 2024 and Llama 3.1 (including the 405B model) in July 2024. Llama 3.2 launched in September 2024 at Meta Connect, introducing the first multimodal models in the Llama family (vision-enabled 11B and 90B) together with small on-device text-only siblings (1B and 3B). All Llama 3.2 vision weights are released under the Llama 3.2 Community Licence and are available to enterprise customers through Meta's partner ecosystem (Hugging Face, AWS Bedrock, Azure AI Studio, Google Vertex, Together AI, Groq, Fireworks).

Visit Meta AI (FAIR) →

Architecture

Decoder-only Transformer with cross-attended vision encoder

Llama 3.2 90B Vision combines the 70B-parameter Llama 3.1 text backbone with a Vision Transformer image encoder; the vision components bring the total to 90B parameters. The encoder is integrated via cross-attention adapter layers, similar in spirit to Flamingo but reusing the Llama architecture. The vision tower converts each image into a sequence of visual tokens, which are injected into specific cross-attention layers of the LLM decoder. The original text-only weights remain frozen during the multimodal training stage, which preserves text-only performance. Pretraining used 6B image-text pairs, followed by multi-stage supervised fine-tuning and Direct Preference Optimisation (DPO) on a curated set of image instructions, math and chart data. The model supports a 128K context window and accepts image inputs up to 1120x1120 natively (with tiling for larger images). It does not support video or audio. Llama 3.2 90B Vision is released under the Llama 3.2 Community Licence (free for commercial use under 700M monthly active users).
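
If you would rather send images that already fit the native 1120x1120 limit than rely on tiling, one option is to compute a downscaled size client-side before encoding. The sketch below shows only that size calculation; it is a hypothetical helper and says nothing about how the model or the hosting provider actually preprocesses images.

```typescript
// Native input limit quoted in the architecture description above.
const MAX_SIDE_PX = 1120;

// Hypothetical helper: scale dimensions down (never up) so both sides fit
// within maxSide while preserving the aspect ratio.
function fitWithin(
  width: number,
  height: number,
  maxSide: number = MAX_SIDE_PX,
): { width: number; height: number } {
  if (width <= 0 || height <= 0) {
    throw new RangeError("width and height must be positive");
  }
  const scale = Math.min(1, maxSide / width, maxSide / height);
  return {
    width: Math.max(1, Math.round(width * scale)),
    height: Math.max(1, Math.round(height * scale)),
  };
}

console.log(fitWithin(2240, 1120)); // prints { width: 1120, height: 560 }
console.log(fitWithin(800, 600)); // prints { width: 800, height: 600 }
```

Downscaling trades fine detail for predictability, so for dense documents or small text it may be better to send the full-resolution image.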

Parameters: 90B
Context: 128K tokens

What it can do

Open-weights 90B vision-language model under the Llama 3.2 Community Licence
128K token context window
Image input up to 1120x1120 with tiling for larger images
Chart, diagram, OCR and document understanding
Strong on MMMU, MathVista, ChartQA and DocVQA among open-weights models
Multilingual: English, German, French, Italian, Portuguese, Spanish, Hindi, Thai
Tool use and JSON output via Llama 3.1 alignment recipe
Best for: open-weights multimodal apps, on-premise document AI, indie research

Training & License

Pretrained on 6B image-text pairs from public web and licensed sources; supervised fine-tuning and DPO on curated multimodal instruction data. Text knowledge inherited from Llama 3.1 (15T tokens).

License: Llama 3.2 Community Licence. Free for commercial use up to 700M monthly active users (MAU); redistribution must include the licence and acceptable use policy.

Known limitations