Grok 2 Vision
xAI's vision-capable Grok 2 snapshot. Image-in, text-out with strong multilingual instruction following.
Grok 2 Vision is a multimodal AI model from xAI, priced at €2.00 per 1M input tokens, with a 32.8K-token context window.
Pricing
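As a rough illustration of the input price quoted above (€2.00 per 1M input tokens), the sketch below estimates the input-side cost of a request. The helper name and constant are hypothetical and not part of any SDK; output-token pricing is not stated on this page, so it is left out.

```javascript
// Input price quoted on this page: EUR 2.00 per 1,000,000 input tokens.
const INPUT_PRICE_EUR_PER_MILLION = 2.0;

// Hypothetical helper: estimates the input-side cost only.
function estimateInputCostEur(inputTokens) {
  if (!Number.isInteger(inputTokens) || inputTokens < 0) {
    throw new RangeError("inputTokens must be a non-negative integer");
  }
  return (inputTokens / 1_000_000) * INPUT_PRICE_EUR_PER_MILLION;
}

// 250,000 input tokens at EUR 2.00 per million
console.log(estimateInputCostEur(250_000).toFixed(2)); // prints "0.50"
```

Image inputs also consume input tokens, so actual usage is best read from the usage field of a real response rather than estimated from text length alone.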
API Integration
Use our OpenAI-compatible API to integrate Grok 2 Vision into your application.
npm install railwail
import railwail from "railwail";
const rw = railwail("YOUR_API_KEY");
// Simple: just pass a string
const reply = await rw.run("grok-2-vision", "Hello! What can you do?");
console.log(reply);
// With message history
const reply2 = await rw.run("grok-2-vision", [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "Explain quantum computing simply." },
]);
console.log(reply2);
// Full response with usage info
const res = await rw.chat("grok-2-vision", [
{ role: "user", content: "Hello!" },
], { temperature: 0.7, max_tokens: 500 });
console.log(res.choices[0].message.content);
console.log(res.usage);
Deep dive: xAI's Grok 2 Vision
xAI was founded in March 2023 by Elon Musk together with co-founders from DeepMind, OpenAI, Google Research and Microsoft Research, including Igor Babuschkin, Manuel Kroiss, Yuhuai Wu (now back at Google), Christian Szegedy, Jimmy Ba, Toby Pohlen, Ross Nordeen, Kyle Kosic and Greg Yang. The company is closely affiliated with X (formerly Twitter), Tesla and SpaceX. xAI raised a $6B Series B in May 2024, followed by a $6B Series C in December 2024 at a reported $50B valuation, with backers including Andreessen Horowitz, Sequoia, Fidelity, Kingdom Holding, Lightspeed and Saudi Prince Alwaleed. The flagship Grok model family launched in late 2023 with Grok-1 (later open-sourced under Apache 2.0), followed by Grok-2 in August 2024 and Grok-3 in February 2025. Grok 2 Vision arrived in October 2024 as xAI's first multimodal model with image input, available through X Premium and the xAI API.
Grok 2 Vision (model id grok-2-vision-1212 and successors) is a multimodal large language model that adds an image encoder to xAI's Grok 2 text backbone. The architecture is described as following the now-standard multimodal LLM pattern: a Vision Transformer encodes the input image into visual tokens, which are projected into the LLM token space and concatenated with the text tokens before the decoder. xAI has not published a technical paper, but the model card mentions a 'mixture of public web data, X data and licensed sources' with a knowledge cutoff in mid-2024. The model accepts up to 10 images per request, with a maximum image side of around 8,000 pixels, and supports the standard chat/completion API with a 131,072-token context window. Grok 2 Vision is positioned as a competitor to GPT-4o and Claude 3.5 Sonnet for chart understanding, OCR-heavy documents and screenshot reasoning. xAI ships safety filters consistent with its stated 'maximum truth-seeking' posture, which is more permissive on controversial content than OpenAI's.
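To make the image-input limits concrete, here is a minimal sketch of building a single user message that carries text plus images. It assumes the OpenAI-style content-parts format (a text part followed by image_url parts) is accepted by the OpenAI-compatible endpoint, which this page does not confirm. The helper buildVisionMessage and the example URL are hypothetical; the 10-image cap comes from the description above.

```javascript
const MAX_IMAGES_PER_REQUEST = 10; // limit stated in the model description

// Hypothetical helper: builds one user message in the OpenAI-style
// content-parts format (one text part followed by image_url parts).
function buildVisionMessage(prompt, imageUrls) {
  if (imageUrls.length === 0) {
    throw new Error("at least one image is required");
  }
  if (imageUrls.length > MAX_IMAGES_PER_REQUEST) {
    throw new RangeError(`at most ${MAX_IMAGES_PER_REQUEST} images per request`);
  }
  return {
    role: "user",
    content: [
      { type: "text", text: prompt },
      ...imageUrls.map((url) => ({ type: "image_url", image_url: { url } })),
    ],
  };
}

const message = buildVisionMessage("What trend does this chart show?", [
  "https://example.com/chart.png", // placeholder URL
]);
console.log(JSON.stringify(message, null, 2));
```

If the SDK forwards content parts unchanged, a message built this way could be passed in the message array given to rw.chat("grok-2-vision", ...); treat that as an assumption to verify against the SDK documentation.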
- Parameters: Undisclosed
- Context: 131.1K tokens
- Image and text input (up to 10 images per request)
- 131,072-token context window
- Chart, diagram and screenshot reasoning
- OCR-heavy document understanding (PDFs as images)
- Real-time search-grounded responses via X / Grok web tool
- JSON / structured output and function calling
- More permissive content policy than OpenAI / Anthropic on controversial topics
- Best for: chart and screenshot QA, X-integrated agents, code-with-image bug reports
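The function-calling capability listed above can be sketched as follows, assuming the endpoint uses the OpenAI-style tools schema and returns tool_calls on the assistant message. Neither detail is confirmed on this page. The tool get_chart_value, the helper parseToolCalls and the mocked response are all made up for illustration.

```javascript
// Tool definition in the OpenAI-style function-calling schema.
// "get_chart_value" is a made-up tool used only for illustration.
const tools = [
  {
    type: "function",
    function: {
      name: "get_chart_value",
      description: "Return the value of a chart series at a given x position",
      parameters: {
        type: "object",
        properties: {
          series: { type: "string" },
          x: { type: "number" },
        },
        required: ["series", "x"],
      },
    },
  },
];

// Hypothetical helper: extracts tool calls from an assistant message and
// parses their JSON-encoded arguments (args is null if parsing fails).
function parseToolCalls(assistantMessage) {
  const calls = assistantMessage.tool_calls ?? [];
  return calls.map((call) => {
    let args = null;
    try {
      args = JSON.parse(call.function.arguments);
    } catch {
      // Malformed JSON from the model: leave args as null.
    }
    return { id: call.id, name: call.function.name, args };
  });
}

// Mocked assistant message in the assumed response shape (no network call).
const mockMessage = {
  role: "assistant",
  content: null,
  tool_calls: [
    {
      id: "call_1",
      type: "function",
      function: {
        name: "get_chart_value",
        arguments: '{"series":"revenue","x":2023}',
      },
    },
  ],
};

console.log(tools[0].function.name, parseToolCalls(mockMessage));
```

Parsing arguments defensively matters because the model returns them as a JSON string, and that string is not guaranteed to be valid JSON.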
Training data: Not disclosed. xAI references 'public web data, licensed third-party data and X user posts that have opted in', with a knowledge cutoff in mid-2024.
License: Proprietary commercial API and X Premium product. Generated outputs may be used commercially under the xAI terms.
Known limitations
- Closed weights, hosted only
- No video or audio input (image-only multimodal)
- Quality on math / vision benchmarks below GPT-4o and Claude 3.5 Sonnet
- Lighter safety filtering may produce unsafe content
- Knowledge cutoff mid-2024 without web tool
Related Models
Claude Opus 4.7
Anthropic's April 2026 flagship. 87.6% on SWE-bench Verified, 3x higher image resolution, output self-verification, vision + reasoning.
Claude Sonnet 4.6
Anthropic's balanced mid-tier model from February 2026. Best price/performance for production workloads: 5x cheaper than Opus, near-flagship quality.
Depth Anything v2
Monocular depth-estimation model trained on 595k labeled and 62M unlabeled images. Strong zero-shot generalization in indoor and outdoor scenes.
GPT-5.4
OpenAI's unified flagship combining GPT and o-series reasoning into one model. 1M context, multimodal, top SWE-Bench Pro and OSWorld scores.
Start using Grok 2 Vision today
Get started with free credits. No credit card required. Access Grok 2 Vision and 100+ other models through a single API.