Why an AI Model Comparison Matrix?
The AI model landscape changed faster in 2025 than in the entire decade preceding it. By mid-2026 we count 275+ generally available frontier and open-weight models on Railwail alone, across text generation, image generation, video, audio synthesis, embedding, code completion, multimodal reasoning, and the emerging category of Vision-Language-Action (VLA) models for robotics. Each model carries its own pricing structure, context window, latency profile, and capability matrix. Picking the right one for a workload has become a non-trivial exercise that can swing infrastructure costs by an order of magnitude.
This page exists to make that exercise tractable. Rather than clicking through 275 individual model detail pages, you can filter the entire catalog by what you actually care about (category, provider, price ceiling, minimum context window, vision support, tool use, streaming) and sort the surviving shortlist by the dimension that matters most for your project (cheapest output? longest context? lowest latency? highest ELO ranking?). Once you have a 2-4 model shortlist, click Compare on each row to enable the diff-highlight feature: cells where the selected models differ are highlighted in yellow, so you can see at a glance what distinguishes them.
How to read each column
- Model + Provider: The model name and the upstream provider it routes through. Some models (e.g. Llama 70B variants) are routable through multiple providers; the column shows the canonical route Railwail uses by default, which is generally the cheapest available with comparable latency.
- Category: The model's primary modality: text, image, video, audio, speech_tts, transcription_stt, embedding, code, multimodal, or vla_robotics. Multimodal models accept multiple input modalities (e.g. text + images); models tagged code are text models specialised for source-code generation with elevated training weight on programming corpora.
- Context window: The maximum prompt size the model accepts, measured in tokens. A token is roughly 4 characters of English text, or about three quarters of a word, so a 200K-token window covers ~150K words, about the length of a 500-page novel. For non-text models (image, video) this column shows the maximum conditioning input size if applicable; otherwise it is blank.
- Input / Output price (€/1M): Per-token pricing in EUR, displayed as cost per million tokens to make cross-model comparison numerically intuitive. Input is what you pay for the prompt; output is what you pay for the generated completion. For most models, output is 2-5x more expensive than input, so keep that in mind when estimating workload cost.
- ELO rating: Drawn from real head-to-head user votes on the Arena leaderboard. 1500 is the starting baseline. A 200-point gap implies the higher-rated model wins ~76% of head-to-head matchups, all else equal. ELO is the best available proxy for "which model do humans actually prefer in blind tests," but it does not measure correctness on domain-specific benchmarks.
- Status pills: "Featured" models are hand-curated as the strongest in their category, typically what we'd recommend to a new user. "New" models were added in the last 30 days.
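Two of the figures above are easy to sanity-check with a few lines of arithmetic. The sketch below is illustrative only: the function names and the example prices are invented for the demonstration, and it assumes the Arena uses the standard Elo expected-score formula (which is consistent with the ~76% figure quoted for a 200-point gap).

```python
def request_cost_eur(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost of one request in EUR, given prices quoted per 1M tokens."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


def elo_win_probability(rating_gap):
    """Expected win rate of the higher-rated model under the standard Elo formula."""
    return 1 / (1 + 10 ** (-rating_gap / 400))


# Hypothetical prices: 1.00 EUR/1M input tokens, 4.00 EUR/1M output tokens.
cost = request_cost_eur(10_000, 2_000, 1.00, 4.00)
print(f"{cost:.4f} EUR per request")      # 0.0180 EUR per request
print(f"{elo_win_probability(200):.0%}")  # 76%
```

Multiply the per-request figure by your expected request volume to get a rough monthly estimate; because output tokens usually cost more, completion-heavy workloads are the ones where the output price column matters most.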
Three view modes for three workflows
The Compact view shows the eight essential columns: Model, Category, Context, Input price, Output price, ELO, Status, Actions. Use this when you're surveying the field and want maximum screen density.
The Detailed view adds Max Output Tokens, Avg Latency, Tags (capability flags), Supported Formats, and Last Updated. Use this when you're doing a serious technical evaluation and need every dimension visible at once.
The Pricing-focused view drops the Category, ELO, and Status columns and surfaces the Fixed € column, which is useful for image/video/audio models that bill per generation rather than per token. Use this when your decision is purely cost-driven.
Designed for power users
Every filter and sort is written to the URL as a query parameter. That means you can apply a complex filter (say, "all text-generation models from OpenAI, Anthropic, and Google, with at least 100K context and ≤€3/1M output, sorted by ELO descending"), copy the URL, and share it in Slack or paste it into Notion. Whoever opens the link sees the exact same filtered view. The same applies to embedding the matrix in internal docs or runbooks: a permalink survives indefinitely.
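To make the idea concrete, here is a minimal sketch of how a filter state round-trips through a query string. The base URL and every parameter name below are hypothetical placeholders; check the address bar after applying filters to see the names the matrix actually uses.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Hypothetical base URL and parameter names; the real ones may differ.
BASE_URL = "https://example.com/models/compare"

filters = {
    "category": "text",
    "provider": "openai,anthropic,google",
    "min_context": 100_000,
    "max_output_price": 3,
    "sort": "elo",
    "order": "desc",
}

# Serialise the filter state into a shareable link.
share_url = f"{BASE_URL}?{urlencode(filters)}"
print(share_url)

# Anyone opening the link can recover the same state from the URL alone.
query = urlsplit(share_url).query
restored = {key: values[0] for key, values in parse_qs(query).items()}
print(restored["provider"])
```

Because the whole view is encoded in the link, nothing has to be stored server-side for a shared view to be reproducible.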
The CSV export button dumps the currently filtered and sorted view as a comma-separated file you can drop into Excel, Google Sheets, Pandas, or any other analytical tool. All columns are included, with prices already multiplied out to €/1M so you don't need to do the arithmetic yourself.
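Once you have an export, the same filter-then-sort workflow is easy to reproduce offline. The sketch below uses only the Python standard library; the column names and the three rows are invented stand-ins, so adjust them to match the header row of your actual export.

```python
import csv
import io

# Stand-in for an exported file; column names and rows are invented for illustration.
SAMPLE_EXPORT = """model,provider,context_window,input_price_eur_per_1m,output_price_eur_per_1m,elo
model-a,provider-x,128000,1.00,4.00,1510
model-b,provider-y,200000,2.50,2.90,1580
model-c,provider-x,32000,0.20,0.60,1450
"""


def shortlist(csv_text, min_context, max_output_price):
    """Keep rows that meet both thresholds, highest ELO first."""
    rows = csv.DictReader(io.StringIO(csv_text))
    kept = [
        row for row in rows
        if int(row["context_window"]) >= min_context
        and float(row["output_price_eur_per_1m"]) <= max_output_price
    ]
    return sorted(kept, key=lambda row: int(row["elo"]), reverse=True)


for row in shortlist(SAMPLE_EXPORT, min_context=100_000, max_output_price=3.0):
    print(row["model"], row["elo"])  # model-b 1580
```

For a real file, replace `io.StringIO(csv_text)` with an open file handle; the rest of the logic stays the same.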
The diff-highlight feature is the fastest way to narrow a shortlist: add 2-4 models to your compare cart via the Compare button on each row, and every cell where the selected models differ lights up yellow. You see instantly that, for example, GPT-4o and Claude 3.5 Sonnet have similar context windows but Claude's output is 40% cheaper, or that Gemini 1.5 Pro's latency is roughly twice that of its peers despite its 1M context window. These are the comparisons that drive real engineering decisions.
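The underlying logic is simple enough to sketch in a few lines. This is not the matrix's actual implementation, just one plausible way to express "highlight a column when the selected models disagree"; the field names and values are placeholders.

```python
def differing_columns(models):
    """Return the column names whose values are not identical across all selected models."""
    columns = models[0].keys()
    return [col for col in columns if len({m[col] for m in models}) > 1]


# Placeholder rows, not real catalog values.
selected = [
    {"context_window": 128_000, "output_price": 4.00, "vision": True},
    {"context_window": 128_000, "output_price": 2.40, "vision": True},
]
print(differing_columns(selected))  # ['output_price']
```

Columns that come back from `differing_columns` are the ones that would be highlighted; everything else is identical across the shortlist and can be ignored when choosing between the candidates.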
Live data, hourly updates
The matrix is regenerated every hour via Next.js Incremental Static Regeneration. Prices change, new models drop, latency numbers shift, and the matrix reflects all of it within 60 minutes. Our model database is the same source of truth that Railwail's API gateway routes against, so what you see here is what you'd actually pay if you called the model right now, allowing for that hourly refresh lag.
The matrix lives alongside our curated comparison landing page (head-to-head deep-dives with code examples) and the Arena leaderboard (live ELO rankings). The matrix is the breadth view; deep-dives are the depth view. Use both.