AI Model Leaderboard

Live ranking of 275+ AI models across 8 dimensions. Updated every 60 minutes.

275 models · 9 providers

Leaders at a glance

Cheapest input

Lowest input price per 1M tokens. Top 25 models, updated hourly.

Rank  Model              Provider      Input price (per 1M tokens)  Badge
1     Grok 4.3           Custom        $1.00                        New
2     (not listed)       (not listed)  $1.00
3     GPT-5.4 Mini       OpenAI        $1.00                        New
4     Claude Haiku 4.5   Anthropic     $1.00                        New
5     (not listed)       (not listed)  $2.00
6     GPT-5.4            OpenAI        $3.00                        New
7     Claude Sonnet 4.6  Anthropic     $3.00                        New
8     Claude Opus 4.7    Anthropic     $5.00                        New
9     (not listed)       (not listed)  $20.00
10    (not listed)       (not listed)  $30.00
11    (not listed)       (not listed)  $60.00
12    (not listed)       (not listed)  $100.00
13    Reka Edge          Custom        $100.00
14    (not listed)       (not listed)  $150.00
15    (not listed)       (not listed)  $180.00
16    (not listed)       (not listed)  $200.00
17    (not listed)       (not listed)  $200.00
18    (not listed)       (not listed)  $200.00
19    (not listed)       (not listed)  $200.00
20    (not listed)       (not listed)  $270.00
21    (not listed)       (not listed)  $300.00
22    (not listed)       (not listed)  $300.00
23    (not listed)       (not listed)  $600.00
24    (not listed)       (not listed)  $900.00
25    (not listed)       (not listed)  $1000.00
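
As a rough illustration of how a "cheapest input" list like this could be produced, the Python sketch below sorts models by input price, drops free models (as the cost rule under "How we rank" describes), and keeps the top entries. The `Model` class, the `cheapest_input` function, the tie-breaking behaviour, and the zero-priced "Example Free Model" entry are all assumptions made for illustration, not Railwail's actual implementation. The prices are those of the named rows in the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    provider: str
    input_usd_per_1m: float  # list price, USD per 1M input tokens


def cheapest_input(models, top_n=25):
    """Return up to top_n paid models, cheapest input price first.

    Assumption: a price of 0 marks a free model, which is excluded.
    Ties keep their original order because sorted() is stable.
    """
    paid = [m for m in models if m.input_usd_per_1m > 0]
    return sorted(paid, key=lambda m: m.input_usd_per_1m)[:top_n]


# Named rows from the list, plus one hypothetical free model.
catalog = [
    Model("Claude Opus 4.7", "Anthropic", 5.00),
    Model("Grok 4.3", "Custom", 1.00),
    Model("Example Free Model", "Hypothetical", 0.00),
    Model("GPT-5.4 Mini", "OpenAI", 1.00),
    Model("Reka Edge", "Custom", 100.00),
    Model("Claude Haiku 4.5", "Anthropic", 1.00),
    Model("GPT-5.4", "OpenAI", 3.00),
    Model("Claude Sonnet 4.6", "Anthropic", 3.00),
]

ranked = cheapest_input(catalog)
for rank, m in enumerate(ranked, start=1):
    print(f"{rank:>2}  {m.name:<18} {m.provider:<10} ${m.input_usd_per_1m:.2f}")
```

How the real leaderboard orders models with identical prices is not stated; a production version would need an explicit secondary sort key.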

How we rank

Cost (input / output)
Normalised to USD per 1M tokens, sourced from public provider list prices, refreshed weekly. Free models are excluded from cost rankings so the leaderboard reflects production economics.
Context window
Taken from each provider's official model card. Capped at the input-side window — output-only context is reported separately.
Latency
p50 measured from Railwail's own request logs over the trailing 30 days, with a minimum sample threshold of 100 requests per model. Latency is end-to-end (queue + provider + network).
Popularity
Total job count over the last 30 days. Excludes test traffic and synthetic load. Repeat usage by a single user is down-weighted to avoid skew from large customers.
Freshness
Provider's official public release date. Models marked New were released in the last 30 days.
Community rating
Elo rating derived from head-to-head Arena votes by Railwail users. Unrated models default to 1500. We require more than 30 matches before a rating is considered stable.
Best for code
Models tagged for coding (category code or tags including coding / developer), ordered by popularity within the cohort. This ordering empirically tracks real developer adoption better than synthetic benchmarks.
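
The numeric thresholds in the rules above (100 latency samples, a 30-day "New" window, a default rating of 1500, more than 30 matches for stability) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the update rule is the standard Elo formula rather than anything Railwail documents, the K-factor of 32 is a guess, and treating the 30-day window as inclusive is also an assumption.

```python
import statistics
from datetime import date, timedelta

MIN_LATENCY_SAMPLES = 100  # from the Latency rule
NEW_WINDOW_DAYS = 30       # from the Freshness rule
DEFAULT_ELO = 1500.0       # from the Community rating rule
STABLE_MATCHES = 30        # stable only after more than 30 matches
K_FACTOR = 32.0            # assumption: the source does not state a K-factor


def p50_latency(samples_ms):
    """Median latency, or None when there are too few samples to rank."""
    if len(samples_ms) < MIN_LATENCY_SAMPLES:
        return None
    return statistics.median(samples_ms)


def is_new(release_date, today):
    """True when the release date falls within the last 30 days (inclusive)."""
    age = today - release_date
    return timedelta(0) <= age <= timedelta(days=NEW_WINDOW_DAYS)


def elo_update(rating_a, rating_b, score_a):
    """Standard Elo update; score_a is 1.0 (A wins), 0.5 (draw), 0.0 (A loses)."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = K_FACTOR * (score_a - expected_a)
    return rating_a + delta, rating_b - delta


def is_stable(match_count):
    """A rating counts as stable only after more than 30 matches."""
    return match_count > STABLE_MATCHES


# Two unrated models meet and the first wins.
winner, loser = elo_update(DEFAULT_ELO, DEFAULT_ELO, 1.0)
print(winner, loser)  # 1516.0 1484.0
```

With equal starting ratings the expected score is 0.5, so a win moves each side by half the K-factor. A different K-factor would change the step size but not the structure.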

Why no benchmarks?

MMLU, HumanEval, MT-Bench and similar are increasingly contaminated by training-set leakage and gamed via prompt engineering. They tell you nothing about a model's real cost in production, its tail latency, or whether developers actually keep choosing it after the launch hype fades. This leaderboard uses observable, real-world signals only — what people pay, how long they wait, and what they choose again.

Spot something off? We update prices and specs every week — but errors creep in.

Submit a correction →
