AI Model Leaderboard
Live ranking of 275+ AI models across 8 dimensions. Updated every 60 minutes.
275 models · 9 providers
Leaders at a glance
Cheapest input
Lowest input price per 1M tokens. Top 25 models, updated hourly.
| Rank | Model | Provider | New | Input price (USD / 1M tokens) |
|---|---|---|---|---|
| 1 | Grok 4.3 | Custom | New | $1.00 |
| 2 | Gemini 3 Flash | Google | New | $1.00 |
| 3 | GPT-5.4 Mini | OpenAI | New | $1.00 |
| 4 | Claude Haiku 4.5 | Anthropic | New | $1.00 |
| 5 | Gemini 3.1 Pro | Google | New | $2.00 |
| 6 | GPT-5.4 | OpenAI | New | $3.00 |
| 7 | Claude Sonnet 4.6 | Anthropic | New | $3.00 |
| 8 | Claude Opus 4.7 | Anthropic | New | $5.00 |
| 9 | | | | $20.00 |
| 10 | Cartesia Sonic | Custom | | $30.00 |
| 11 | Voyage AI voyage-3 | Custom | | $60.00 |
| 12 | | | | $100.00 |
| 13 | Reka Edge | Custom | | $100.00 |
| 14 | | | | $150.00 |
| 15 | Voyage AI voyage-code-3 | Custom | | $180.00 |
| 16 | MiniMax-01 | Custom | | $200.00 |
| 17 | Reka Flash | Custom | | $200.00 |
| 18 | AI21 Jamba 1.5 Mini | Custom | | $200.00 |
| 19 | Text Embedding 3 Small | OpenAI | | $200.00 |
| 20 | DeepSeek V3.1 | DeepSeek | | $270.00 |
| 21 | ElevenLabs v3 (alpha) | ElevenLabs | | $300.00 |
| 22 | | | | $300.00 |
| 23 | Kimi K2 (Moonshot) | Custom | | $600.00 |
| 24 | Qwen 3 235B Instruct | Together | | $900.00 |
| 25 | Perplexity Sonar | Custom | | $1,000.00 |
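Ranking by cheapest input comes down to normalising every list price to USD per 1M tokens, dropping free models, and keeping the 25 lowest. The sketch below is a minimal illustration under stated assumptions: each price is assumed to be stored together with a billing unit, and the `ModelPrice` type, its fields, and the unit names in `UNITS_PER_MILLION` are hypothetical, not Railwail's actual schema or pipeline.

```python
from dataclasses import dataclass

# Hypothetical billing units and how many of each make up 1M tokens.
UNITS_PER_MILLION = {
    "per_token": 1_000_000,
    "per_1k_tokens": 1_000,
    "per_1m_tokens": 1,
}


@dataclass
class ModelPrice:
    name: str
    input_price: float  # list price in USD, quoted in `unit`
    unit: str           # a key of UNITS_PER_MILLION


def usd_per_million(price: ModelPrice) -> float:
    """Normalise a list price to USD per 1M input tokens."""
    return price.input_price * UNITS_PER_MILLION[price.unit]


def cheapest_input(models: list[ModelPrice], top_n: int = 25) -> list[tuple[str, float]]:
    """Return (name, USD per 1M tokens) for the top_n cheapest paid models."""
    # Free models are excluded so the ranking reflects production economics.
    paid = [(m.name, usd_per_million(m)) for m in models if m.input_price > 0]
    # sorted() is stable: models with equal prices keep their input order.
    return sorted(paid, key=lambda pair: pair[1])[:top_n]
```

Because Python's sort is stable, tied models (such as the four at $1.00 in the table) keep whatever order they arrived in; the page does not say how ties are actually broken.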
How we rank
- Cost (input / output): Normalised to USD per 1M tokens from public provider list prices, refreshed weekly. Free models are excluded from cost rankings so that the leaderboard reflects production economics.
- Context window: Taken from each provider's official model card and capped at the input-side window; output-only context is reported separately.
- Latency: p50 (median) measured from Railwail's own request logs over the trailing 30 days, with a minimum of 100 requests per model. Latency is end-to-end (queue + provider + network).
- Popularity: Total job count over the last 30 days, excluding test traffic and synthetic load. Repeat usage by a single user is down-weighted so that large customers do not skew the ranking.
- Freshness: The provider's official public release date. Models marked New were released in the last 30 days.
- Community rating: Elo rating derived from head-to-head Arena votes by Railwail users. Unrated models default to 1500, and a rating is considered stable only after more than 30 matches.
- Best for code: Models tagged for coding (category "code", or tags including "coding" or "developer"), ordered by popularity within that cohort. This empirically tracks real developer adoption better than synthetic benchmarks.
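The Latency rule reduces to a median over a 30-day window with a 100-request floor. A minimal sketch, assuming each log entry is a hypothetical `(timestamp, latency_ms)` pair whose latency already covers queue, provider, and network time; models below the floor get no p50 rather than a noisy one.

```python
import statistics
from datetime import datetime, timedelta
from typing import Optional

WINDOW = timedelta(days=30)  # trailing window from the methodology
MIN_SAMPLES = 100            # minimum requests per model


def p50_latency_ms(log: list[tuple[datetime, float]], now: datetime) -> Optional[float]:
    """Median end-to-end latency over the trailing window, or None if under-sampled."""
    cutoff = now - WINDOW
    recent = [latency for ts, latency in log if ts >= cutoff]
    if len(recent) < MIN_SAMPLES:
        return None  # too few requests to publish a stable p50
    return statistics.median(recent)
```

Returning `None` instead of a value keeps thinly used models off the latency ranking entirely, which matches the stated minimum sample threshold.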
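The Community rating can be sketched with the standard Elo update: with expected score E_A = 1 / (1 + 10^((R_B - R_A) / 400)), each vote moves model A by K(S_A - E_A) and model B by the same amount in the opposite direction. The page does not state Railwail's K-factor or how ties are scored, so `K_FACTOR = 32` and the 0.5 tie score below are assumptions, as are the hypothetical `Rating` type and `record_match` helper.

```python
from dataclasses import dataclass

DEFAULT_RATING = 1500.0  # unrated models start here
K_FACTOR = 32.0          # assumption: Railwail's K-factor is not published
STABLE_AFTER = 30        # stable only with more than 30 matches


@dataclass
class Rating:
    value: float = DEFAULT_RATING
    matches: int = 0

    @property
    def stable(self) -> bool:
        return self.matches > STABLE_AFTER


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def record_match(a: Rating, b: Rating, score_a: float, k: float = K_FACTOR) -> None:
    """Update both ratings after one vote; score_a is 1 (A wins), 0.5 (tie) or 0."""
    delta = k * (score_a - expected_score(a.value, b.value))
    a.value += delta
    b.value -= delta  # zero-sum: B moves by the same amount in the opposite direction
    a.matches += 1
    b.matches += 1
```

For two fresh models at 1500, E_A = 0.5, so a single win moves the winner to 1516 and the loser to 1484.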
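The Best for code cohort is a filter followed by a sort. This sketch assumes a hypothetical `ModelInfo` record whose `popularity` field already holds the 30-day popularity score (test traffic removed and per-user weighting applied upstream); the real weighting is not specified on this page.

```python
from dataclasses import dataclass, field

CODE_TAGS = {"coding", "developer"}


@dataclass
class ModelInfo:
    name: str
    category: str
    tags: set[str] = field(default_factory=set)
    popularity: float = 0.0  # hypothetical precomputed 30-day popularity score


def best_for_code(models: list[ModelInfo]) -> list[str]:
    """Models in the coding cohort, most popular first."""
    cohort = [m for m in models if m.category == "code" or m.tags & CODE_TAGS]
    # Stable sort: equally popular models keep their input order.
    cohort.sort(key=lambda m: m.popularity, reverse=True)
    return [m.name for m in cohort]
```

A model qualifies through either its category or a single matching tag, so a general chat model tagged "coding" competes in the same cohort as dedicated code models.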
Why no benchmarks?
MMLU, HumanEval, MT-Bench and similar benchmarks are increasingly contaminated by training-set leakage and gamed through prompt engineering. They tell you nothing about a model's real cost in production, its tail latency, or whether developers keep choosing it after the launch hype fades. This leaderboard uses only observable, real-world signals: what people pay, how long they wait, and what they choose again.
Spot something off? We update prices and specs every week, but errors still creep in.
Submit a correction →