Grok 2 Vision

xAI
Multimodal

xAI's vision-capable Grok 2 snapshot. Image-in, text-out with strong multilingual instruction following.

TL;DR · Last updated May 16, 2026

Grok 2 Vision is a multimodal AI model from xAI, priced at €2.00 per 1M input tokens with a 32.8K-token context window.

Direct API access coming soon

Pricing

Price per Generation
Per generation: Free
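
For budgeting, the per-token input price quoted in the TL;DR (€2.00 per 1M input tokens) can be turned into a quick estimate. The sketch below is illustrative only: the function name is hypothetical, it covers input tokens alone because this page lists no output price, and it does not model how images are counted as tokens.

```typescript
// Input price quoted in the TL;DR: EUR 2.00 per 1,000,000 input tokens.
const INPUT_PRICE_EUR_PER_MILLION = 2.0;

/** Estimated input cost in euros for a given number of input tokens. */
function estimateInputCostEur(inputTokens: number): number {
  if (!Number.isFinite(inputTokens) || inputTokens < 0) {
    throw new RangeError("inputTokens must be a non-negative number");
  }
  return (inputTokens / 1_000_000) * INPUT_PRICE_EUR_PER_MILLION;
}

// A prompt that fills the whole 32,768-token context window.
console.log(estimateInputCostEur(32_768).toFixed(4)); // prints "0.0655"
```

At the quoted rate, even a prompt that fills the full context window costs under seven euro cents of input.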

API Integration

Use our OpenAI-compatible API to integrate Grok 2 Vision into your application.

Install
npm install railwail
JavaScript / TypeScript
import railwail from "railwail";

const rw = railwail("YOUR_API_KEY");

// Simple — just pass a string
const reply = await rw.run("grok-2-vision", "Hello! What can you do?");
console.log(reply);

// With message history
const reply2 = await rw.run("grok-2-vision", [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Explain quantum computing simply." },
]);
console.log(reply2);

// Full response with usage info
const res = await rw.chat("grok-2-vision", [
  { role: "user", content: "Hello!" },
], { temperature: 0.7, max_tokens: 500 });
console.log(res.choices[0].message.content);
console.log(res.usage);

Specifications
Context window: 32,768 tokens
Max output: 4,096 tokens
Developer: xAI
Category: Multimodal
Supported formats: text, image
Tags: xai, vision, legacy
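
The two limits in the specifications interact when choosing `max_tokens`. The sketch below assumes, as is typical for chat APIs but not confirmed on this page, that prompt and completion tokens share the 32,768-token context window. The helper name is hypothetical.

```typescript
// Limits taken from the Specifications section.
const CONTEXT_WINDOW_TOKENS = 32_768;
const MAX_OUTPUT_TOKENS = 4_096;

/**
 * Clamp a requested max_tokens value so that the prompt plus the completion
 * fits in the context window and the completion respects the output cap.
 */
function clampMaxTokens(promptTokens: number, requested: number): number {
  const remaining = CONTEXT_WINDOW_TOKENS - promptTokens;
  if (remaining <= 0) {
    throw new RangeError("prompt already fills the context window");
  }
  return Math.min(requested, MAX_OUTPUT_TOKENS, remaining);
}

console.log(clampMaxTokens(1_000, 500));    // 500: the request fits as-is
console.log(clampMaxTokens(1_000, 10_000)); // 4096: capped by max output
console.log(clampMaxTokens(30_000, 4_096)); // 2768: capped by remaining context
```

The returned value could be passed as `max_tokens` in the options object shown in the `rw.chat` example above.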

Deep dive — xAI's Grok 2 Vision

About xAI
Founded 2023 · Palo Alto, California, USA

xAI was founded in March 2023 by Elon Musk together with co-founders from DeepMind, OpenAI, Google Research and Microsoft Research, including Igor Babuschkin, Manuel Kroiss, Yuhuai Wu (now back at Google), Christian Szegedy, Jimmy Ba, Toby Pohlen, Ross Nordeen, Kyle Kosic and Greg Yang. The company is closely affiliated with X (formerly Twitter), Tesla and SpaceX.

xAI raised a $6B Series B in May 2024, followed by a $6B Series C in December 2024 at a reported $50B valuation, with backers including Andreessen Horowitz, Sequoia, Fidelity, Lightspeed and Prince Alwaleed bin Talal's Kingdom Holding.

The flagship Grok model family began with Grok-1 in late 2023 (its weights were later released under Apache 2.0), followed by Grok-2 in August 2024 and Grok-3 in February 2025. Grok 2 Vision arrived in October 2024 as xAI's first multimodal model with image input, available through X Premium and the xAI API.

Architecture
Decoder-only Transformer with vision encoder (multimodal LLM)

Grok 2 Vision (model id grok-2-vision-1212 and successors) is a multimodal large language model that adds an image encoder to xAI's Grok 2 text backbone. xAI has not published a technical paper, so the internals are unconfirmed; the model is presumed to follow the now-standard projection-based multimodal LLM pattern, in which a Vision Transformer encodes the input image into visual tokens that are projected into the LLM token space and concatenated with the text tokens before the decoder. The model card mentions a 'mixture of public web data, X data and licensed sources' with a knowledge cutoff in mid-2024.

The model accepts up to 10 images per request, with a maximum image side of around 8,000 pixels, and supports the standard chat/completion API with a 32,768-token context window. Grok 2 Vision is positioned as a competitor to GPT-4o and Claude 3.5 Sonnet for chart understanding, OCR-heavy documents and screenshot reasoning. xAI ships safety filters consistent with its stated 'maximum truth-seeking' posture, which is more permissive on controversial content than OpenAI's.
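
The integration examples earlier on this page send text only. As a hedged sketch of image input, the helper below builds a user message in the OpenAI-style content-parts format (`image_url` and `text` parts) and enforces the 10-image limit mentioned above. It assumes the OpenAI-compatible endpoint accepts this message shape for this model, which this page does not confirm; `buildVisionMessage` and the example URL are hypothetical.

```typescript
// OpenAI-style content parts for a multimodal chat message.
type TextPart = { type: "text"; text: string };
type ImagePart = { type: "image_url"; image_url: { url: string } };
type UserMessage = { role: "user"; content: Array<TextPart | ImagePart> };

// Per-request image limit stated in the architecture notes.
const MAX_IMAGES_PER_REQUEST = 10;

/** Build one user message containing the images followed by the text prompt. */
function buildVisionMessage(prompt: string, imageUrls: string[]): UserMessage {
  if (imageUrls.length === 0) {
    throw new Error("at least one image URL is required");
  }
  if (imageUrls.length > MAX_IMAGES_PER_REQUEST) {
    throw new RangeError(`at most ${MAX_IMAGES_PER_REQUEST} images per request`);
  }
  return {
    role: "user",
    content: [
      ...imageUrls.map((url): ImagePart => ({ type: "image_url", image_url: { url } })),
      { type: "text", text: prompt },
    ],
  };
}

const visionMessage = buildVisionMessage("What trend does this chart show?", [
  "https://example.com/chart.png", // placeholder URL
]);
console.log(JSON.stringify(visionMessage, null, 2));
```

If the gateway forwards content arrays unchanged, a message like this could be placed in the message list passed to `rw.chat`; that behaviour would need to be verified against the SDK.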

Parameters: Undisclosed
Context: 32.8K tokens
What it can do
  • Image and text input (up to 10 images per request)
  • 32,768-token context window
  • Chart, diagram and screenshot reasoning
  • OCR-heavy document understanding (PDFs as images)
  • Real-time search-grounded responses via X / Grok web tool
  • JSON / structured output and function calling
  • More permissive content policy than OpenAI / Anthropic on controversial topics
  • Best for: chart and screenshot QA, X-integrated agents, code-with-image bug reports
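
To illustrate the function-calling and structured-output item in the list above, the sketch below defines a tool in the OpenAI-style `tools` format and validates the JSON arguments string that such a tool call carries. The tool name, its schema and the `ChartReading` type are hypothetical, and whether this gateway passes tool definitions through to the model is an assumption.

```typescript
// OpenAI-style tool definition: a named function with a JSON Schema for its arguments.
type ToolDefinition = {
  type: "function";
  function: { name: string; description: string; parameters: Record<string, unknown> };
};

// Hypothetical tool for chart QA: the model reports one value it read off a chart.
const recordChartReadingTool: ToolDefinition = {
  type: "function",
  function: {
    name: "record_chart_reading",
    description: "Record a single data point read from a chart image.",
    parameters: {
      type: "object",
      properties: {
        series: { type: "string" },
        value: { type: "number" },
        unit: { type: "string" },
      },
      required: ["series", "value"],
    },
  },
};

type ChartReading = { series: string; value: number; unit?: string };

/** Parse and validate the arguments string of a record_chart_reading tool call. */
function parseChartReading(argumentsJson: string): ChartReading {
  const parsed: unknown = JSON.parse(argumentsJson);
  if (typeof parsed !== "object" || parsed === null) {
    throw new TypeError("tool arguments must be a JSON object");
  }
  const { series, value, unit } = parsed as Record<string, unknown>;
  if (typeof series !== "string" || typeof value !== "number") {
    throw new TypeError("series must be a string and value a number");
  }
  const reading: ChartReading = { series, value };
  if (typeof unit === "string") {
    reading.unit = unit;
  } else if (unit !== undefined) {
    throw new TypeError("unit must be a string when present");
  }
  return reading;
}

console.log(recordChartReadingTool.function.name);
console.log(parseChartReading('{"series":"revenue","value":4.2,"unit":"EUR m"}'));
```

Validating the arguments on the client side is a deliberate choice here: model-produced JSON is not guaranteed to match the declared schema.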
Training & License

Not disclosed. xAI references 'public web data, licensed third-party data and X user posts that have opted in', with a knowledge cutoff in mid-2024.

License: Proprietary commercial API and X Premium product. Generated outputs may be used commercially under the xAI terms.

Known limitations
  • Closed weights, hosted only
  • No video or audio input (image-only multimodal)
  • Quality on math / vision benchmarks below GPT-4o and Claude 3.5 Sonnet
  • Lighter safety filtering may produce unsafe content
  • Knowledge cutoff mid-2024 without web tool
