How much does Grok 2 Vision cost via Railwail?

Input: €2.00 per 1M tokens. Output: €10.00 per 1M tokens. No monthly minimum, no subscription. Start with €5 free credits.
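As a rough worked example of the rates above, the sketch below estimates the cost of a single request. The helper name `estimateCostEur` is hypothetical and not part of any Railwail SDK, and the page does not say how image inputs are converted into tokens, so this only covers token counts you already know.

```typescript
// Rates quoted above, in EUR per 1M tokens.
const INPUT_EUR_PER_MILLION = 2.0;
const OUTPUT_EUR_PER_MILLION = 10.0;

// Hypothetical helper: estimates the cost of one request in EUR
// from its input and output token counts.
function estimateCostEur(inputTokens: number, outputTokens: number): number {
  const inputCost = (inputTokens / 1e6) * INPUT_EUR_PER_MILLION;
  const outputCost = (outputTokens / 1e6) * OUTPUT_EUR_PER_MILLION;
  return inputCost + outputCost;
}

// Example: 1,500 input tokens and 500 output tokens.
// 0.003 EUR (input) + 0.005 EUR (output) = 0.008 EUR.
console.log(estimateCostEur(1500, 500).toFixed(4)); // prints "0.0080"
```

At these rates, output tokens cost five times as much as input tokens, so capping `max_tokens` is usually the quickest way to bound spend.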

What is the context window of Grok 2 Vision?

Grok 2 Vision supports a 32,768-token (32.8K) context window, enough for long reports, chapters of technical manuals, and extended analysis.

How fast is Grok 2 Vision?

Latency depends on prompt length and load, typically 200 ms to 2 s for short prompts. We measure p50/p95 latency in real time at /rankings.
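To make the p50/p95 figures concrete, here is a minimal sketch of computing percentiles from your own latency samples using the nearest-rank method. This is an assumption about one reasonable way to do it; the page does not say how /rankings computes its numbers, and the sample values below are invented for illustration, not measurements.

```typescript
// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) {
    throw new Error("percentile() needs at least one sample");
  }
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Illustrative (made-up) request latencies in milliseconds.
const latenciesMs = [220, 310, 450, 480, 520, 640, 900, 1200, 1500, 1900];

console.log(percentile(latenciesMs, 50)); // prints 520
console.log(percentile(latenciesMs, 95)); // prints 1900
```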

Is Grok 2 Vision better than BLIP?

It depends on your use case. Grok 2 Vision (xAI) and BLIP (Salesforce) are both strong choices in the multimodal category. Compare them side by side at /compare/grok-2-vision-vs-blip-captioning.

Does Grok 2 Vision support image input (vision)?

Yes — Grok 2 Vision accepts image inputs in addition to text. Send images via the standard OpenAI-compatible `messages` array with `image_url` content blocks. Supported formats: text, image.
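The sketch below shows what such a message could look like in the OpenAI chat format, where a user message's `content` is an array of `text` and `image_url` parts. The helper `buildImageMessage` and the example URL are hypothetical, and whether the `railwail` client forwards array-valued `content` unchanged is an assumption to verify against its documentation.

```typescript
// Content parts as used by the OpenAI chat completions format.
type TextPart = { type: "text"; text: string };
type ImagePart = { type: "image_url"; image_url: { url: string } };
type UserMessage = { role: "user"; content: Array<TextPart | ImagePart> };

// Hypothetical helper: builds one user message containing a question
// followed by one image_url part per image.
function buildImageMessage(question: string, imageUrls: string[]): UserMessage {
  const images = imageUrls.map(
    (url): ImagePart => ({ type: "image_url", image_url: { url } })
  );
  return { role: "user", content: [{ type: "text", text: question }, ...images] };
}

// Placeholder URL for illustration only.
const message = buildImageMessage("What does this chart show?", [
  "https://example.com/chart.png",
]);
console.log(JSON.stringify(message, null, 2));
```

The resulting object would go into the `messages` array in place of a plain string `content`.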

Grok 2 Vision

xAI

Multimodal

xAI's vision-capable Grok 2 snapshot. Image-in, text-out with strong multilingual instruction following.

TL;DR · Last updated June 24, 2026

Grok 2 Vision is a multimodal AI model from xAI, priced at €2.00 per 1M input tokens with a 32.8K-token context window.

Direct API access coming soon

Pricing

Price per Generation

Per generation: Free

API Integration

Use our OpenAI-compatible API to integrate Grok 2 Vision into your application.

Install

npm install railwail

JavaScript / TypeScript

import railwail from "railwail";

const rw = railwail("YOUR_API_KEY");

// Simple — just pass a string
const reply = await rw.run("grok-2-vision", "Hello! What can you do?");
console.log(reply);

// With message history
const reply2 = await rw.run("grok-2-vision", [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Explain quantum computing simply." },
]);
console.log(reply2);

// Full response with usage info
const res = await rw.chat("grok-2-vision", [
  { role: "user", content: "Hello!" },
], { temperature: 0.7, max_tokens: 500 });
console.log(res.choices[0].message.content);
console.log(res.usage);

Specifications

Context window

32,768 tokens

Max output

4,096 tokens
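A minimal sketch of how these two limits might be combined into a prompt budget. It assumes, as is common for chat APIs but not stated on this page, that prompt and completion tokens share the same context window; the helper name `maxPromptTokens` is hypothetical.

```typescript
// Limits from the specifications above.
const CONTEXT_WINDOW_TOKENS = 32768;
const MAX_OUTPUT_TOKENS = 4096;

// Hypothetical helper: how many prompt tokens remain once room is
// reserved for the requested completion length.
// Assumption: prompt and completion share one context window.
function maxPromptTokens(requestedOutputTokens: number): number {
  // Clamp the request to the range the model can actually produce.
  const reserved = Math.min(Math.max(requestedOutputTokens, 0), MAX_OUTPUT_TOKENS);
  return CONTEXT_WINDOW_TOKENS - reserved;
}

console.log(maxPromptTokens(500));  // prints 32268
console.log(maxPromptTokens(4096)); // prints 28672
```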

Developer

xAI

Deep dive — xAI's Grok 2 Vision

About xAI

Founded 2023 · Palo Alto, California, USA

xAI was founded in March 2023 by Elon Musk together with co-founders from DeepMind, OpenAI, Google Research and Microsoft Research, including Igor Babuschkin, Manuel Kroiss, Yuhuai Wu (now back at Google), Christian Szegedy, Jimmy Ba, Toby Pohlen, Ross Nordeen, Kyle Kosic and Greg Yang. The company is closely affiliated with X (formerly Twitter), Tesla and SpaceX. xAI raised a $6B Series B in May 2024, followed by a $6B Series C in December 2024 at a reported $50B valuation, with backers including Andreessen Horowitz, Sequoia, Fidelity, Kingdom Holding, Lightspeed and Saudi Prince Alwaleed. The flagship Grok model family launched in late 2023 with Grok-1 (briefly open-sourced under Apache 2.0), followed by Grok-2 in August 2024 and Grok-3 in February 2025. Grok 2 Vision arrived in October 2024 as xAI's first multimodal model with image input, made available via the X premium feature and the xAI API.

Visit xAI →

Architecture

Decoder-only Transformer with vision encoder (multimodal LLM)

Grok 2 Vision (model id grok-2-vision-1212 and successors) is a multimodal large language model that adds an image encoder to xAI's Grok 2 text backbone. The architecture follows the now-standard projection-based multimodal LLM pattern: a Vision Transformer encodes the input image into visual tokens, which are projected into the LLM token space and concatenated with text tokens before the decoder. xAI has not published a technical paper, but the model card mentions a 'mixture of public web data, X data and licensed sources' with a knowledge cutoff in mid-2024. The model accepts up to 10 images per request, with a maximum image side of around 8,000 pixels, and supports the standard chat/completion API with a 32,768-token context window. Grok 2 Vision is positioned as a competitor to GPT-4o and Claude 3.5 Sonnet for chart understanding, OCR-heavy documents and screenshot reasoning. xAI ships safety filters consistent with its stated 'maximum truth-seeking' posture, which is more permissive on controversial content than OpenAI's.
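The request limits mentioned above (up to 10 images, longest side around 8,000 pixels) can be checked on the client before sending. The sketch below is one way to do that; the helper `validateImages` is hypothetical, and the 8,000 px threshold is only the approximate figure given here, not an exact documented limit.

```typescript
type ImageInfo = { width: number; height: number };

// Limits as described above; the pixel limit is approximate.
const MAX_IMAGES_PER_REQUEST = 10;
const MAX_IMAGE_SIDE_PX = 8000;

// Hypothetical pre-flight check: returns a list of problems,
// or an empty array if the images fit within the stated limits.
function validateImages(images: ImageInfo[]): string[] {
  const problems: string[] = [];
  if (images.length > MAX_IMAGES_PER_REQUEST) {
    problems.push(`too many images: ${images.length} > ${MAX_IMAGES_PER_REQUEST}`);
  }
  images.forEach((img, index) => {
    const longestSide = Math.max(img.width, img.height);
    if (longestSide > MAX_IMAGE_SIDE_PX) {
      problems.push(`image ${index} is too large: longest side ${longestSide}px`);
    }
  });
  return problems;
}

console.log(validateImages([
  { width: 1920, height: 1080 },
  { width: 9000, height: 4000 },
]));
// prints [ 'image 1 is too large: longest side 9000px' ]
```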

Parameters: Undisclosed
Context: 32.8K tokens

What it can do

Image and text input (up to 10 images per request)
32,768-token context window
Chart, diagram and screenshot reasoning
OCR-heavy document understanding (PDFs as images)
Real-time search-grounded responses via X / Grok web tool
JSON / structured output and function calling
More permissive content policy than OpenAI / Anthropic on controversial topics
Best for: chart and screenshot QA, X-integrated agents, code-with-image bug reports
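For the function-calling capability listed above, OpenAI-compatible APIs describe each tool as a JSON Schema and return tool-call arguments as a JSON-encoded string. The sketch below illustrates that shape only. The tool `read_chart_value` is invented for illustration, and whether Railwail passes a `tools` parameter through to Grok 2 Vision is an assumption this page does not confirm.

```typescript
// Tool definition shape used by the OpenAI chat completions format.
type ToolDefinition = {
  type: "function";
  function: {
    name: string;
    description: string;
    parameters: Record<string, unknown>; // JSON Schema object
  };
};

// Hypothetical tool, invented for this example.
const readChartValue: ToolDefinition = {
  type: "function",
  function: {
    name: "read_chart_value",
    description: "Report a single value read from a chart image.",
    parameters: {
      type: "object",
      properties: {
        series: { type: "string" },
        year: { type: "number" },
      },
      required: ["series", "year"],
    },
  },
};

// Tool-call arguments arrive as a JSON string; parse and check that
// the result is a plain object before using it.
function parseToolArguments(raw: string): Record<string, unknown> {
  const parsed: unknown = JSON.parse(raw);
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new Error("tool arguments must be a JSON object");
  }
  return parsed as Record<string, unknown>;
}

const args = parseToolArguments('{"series":"revenue","year":2024}');
console.log(readChartValue.function.name, args.series, args.year);
// prints: read_chart_value revenue 2024
```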

Training & License

Not disclosed. xAI references 'public web data, licensed third-party data and X user posts that have opted in', with a knowledge cutoff in mid-2024.

License: Proprietary commercial API and X Premium product. Generated outputs may be used commercially under the xAI terms.

Known limitations

Closed weights, hosted only
No video or audio input (image-only multimodal)
Quality on math / vision benchmarks below GPT-4o and Claude 3.5 Sonnet
Lighter safety filtering may produce unsafe content
Knowledge cutoff mid-2024 without web tool

Related Models

BLIP

Salesforce

Salesforce BLIP. Vision-language model for image captioning and visual question answering. Given an image it writes a short natural-language caption, or answers a question about the image when one is supplied. A widely used baseline for automatic captioning.

€1.00

CLIP Interrogator

Community

pharmapsychotic's CLIP Interrogator. Takes an image and produces a Stable-Diffusion-style text prompt by combining BLIP captioning with CLIP to rank likely subjects, artists, mediums and styles. Commonly used to reverse-engineer a prompt from an existing picture.

€1.00

Claude 3.5 Sonnet (vision)

Anthropic

Anthropic Claude 3.5 Sonnet with image input. 200k context, strong on dense documents, tables, charts and handwriting. Reliable structured extraction from screenshots and scans.

Free

Claude Opus 4.7