Railwail Research · Annual Report

The State of AI APIs 2026

Q: What is The State of AI APIs 2026?

It is Railwail Research's annual industry report on the AI inference ecosystem. The 2026 edition covers eleven topics: frontier model releases in the first half of 2026, the open-source surge, per-token pricing trends, latency benchmarks across providers, modality expansion, geographic compliance frameworks, developer survey insights, the agentic era, predictions for Q3-Q4 2026, and how Railwail's unified API fits into the new landscape.

Q: Who wrote the report and where do the numbers come from?

The report is authored by the Railwail Research Team. Pricing and release data come from public provider documentation. Latency benchmarks are first-party measurements collected via Railwail's gateway across one week of production traffic in May 2026. Market-share and use-case figures are estimates derived from a Railwail customer-base survey (n=842) cross-referenced against public statements by Anthropic, OpenAI, and Google. We label any figure that is an estimate inline so readers can apply their own confidence interval.

Q: Is the data first-party or estimated?

Pricing, model-release dates, context-window numbers, and compliance framework dates are public-record facts and are first-party verified. Latency benchmarks are first-party measurements taken through Railwail's gateway with the methodology described in section 5. Developer-survey share figures are estimates triangulated from a Railwail customer survey, public LinkedIn job postings mentioning specific providers, and Reddit/Hacker News sentiment scrapes. The survey sample is not representative of all global developers — it skews toward EU SMB and indie builders.

Q: How often is the report updated?

The State of AI APIs is published annually in May. Inter-year refreshes happen quarterly for the model-release table and pricing tables when a vendor announces a new tier. The freshness badge at the top of the page reflects the most recent material update.

Q: Why does context-window size matter so much in 2026?

Context window has become the dominant axis of competition because three workloads cannot be done well without it: (1) long-document RAG without complex retrieval pipelines, (2) full-codebase reasoning for software-engineering agents, and (3) video and long-audio understanding. A 2M-token window holds approximately 1,500 pages of text or 90 minutes of video transcript. Below ~256K, you are routinely paying retrieval engineering costs that 2026's flagship models simply do not require.

Q: Are open-source models actually competitive with closed-source in 2026?

On well-defined benchmarks like MMLU, MMLU-Pro, GSM8K, and HumanEval, the gap closed in 2025 and is essentially zero for general knowledge tasks. The closed-source lead persists in agentic tool use, computer use, very long-context reasoning, and the polish of safety filtering. DeepSeek V4 (1.6T MoE, open weights) is within 1-2 points of GPT-5.4 on most reasoning benchmarks and ahead on math. Llama 4 405B is competitive on coding. For most production workloads outside of frontier agentic uses, open-source on dedicated inference (Groq, Cerebras, SambaNova, or self-host) is a credible primary choice.

Q: How much have per-token prices fallen?

Flagship input pricing has fallen approximately 16x between mid-2023 (GPT-4 at $30/1M input) and mid-2026 (GPT-5.4 at $1.80/1M input). Flagship output is down 8x. Mid-tier and cheapest-tier pricing has fallen even faster — the cheapest viable production model in 2026 is around $0.02/1M input, a 25x reduction from 2023. Add prompt caching (4-10x on long prefixes), batch APIs (50% discount), and the effective cost for many workloads is now 50-100x cheaper than three years ago.

Q: What is prompt caching and why is everyone shipping it?

Prompt caching lets a provider cache the key-value tensors for a long static prefix (a system prompt, a document, a few-shot example block) and reuse them across subsequent requests at a 4-10x discount. Anthropic shipped it in 2024, Google added Context Caching in 2024, and OpenAI shipped Cached Input pricing in late 2024. By mid-2026 every serious provider offers some form of it. For RAG and long-document workloads it is the single largest cost lever available — often larger than choosing a cheaper model.

Q: What does the EU AI Act mean for API consumers in mid-2026?

The General-Purpose AI obligations of the Act take effect on 2 August 2026. From that date, providers of General-Purpose AI models placed on the EU market must publish a sufficiently detailed summary of training data, comply with EU copyright law, and demonstrate technical documentation. As an API consumer this means your model provider's compliance posture flows down to you when you ship a product into the EU. Choosing an EU-hosted gateway like Railwail collapses your own compliance review into one vendor and one DPA.

Q: What is the agentic era and is it real or marketing?

It is both. Agentic uses — tool calls, multi-step planning, computer use, browser agents — moved from <2% of API workloads in early 2025 to roughly 8% in mid-2026 according to our survey. The underlying capability is real: Claude 4.7 Opus solves 78% of SWE-Bench Verified, Computer Use does end-to-end browser navigation, and Anthropic's Memory + Letta-style long-running agents persist state across days. The marketing layer is also real and outpaces capability for many vendor demos. The practical question for builders is whether their workload tolerates non-deterministic execution paths and 10-100x cost vs. a single chat call.

An annual industry report on the AI inference ecosystem. 9,000 words, eleven sections, six tables, twenty-five questions answered. Models, pricing, latency, open-source, compliance, agentic workloads, and what the back half of 2026 looks like.

9,200 words ~40 min readLast updated about 2 months ago Railwail Research Team

1. Executive Summary
2. The Model Race
3. The Open-Source Surge
4. Pricing Trends
5. Latency Benchmarks
6. Modality Expansion
7. Geographic Compliance
8. Developer Survey
9. The Agentic Era
10. Predictions Q3-Q4
11. How Railwail Fits
FAQ · 24 questions
References

Section 1

Executive Summary

The AI inference ecosystem in mid-2026 is more competitive, cheaper, and structurally more open than at any prior point in its short history. Eleven flagship frontier models shipped in the first twenty weeks of 2026 alone. Three vendors — OpenAI, Anthropic, and Google — each hold roughly a quarter of paying developer share, with the remaining quarter increasingly split between xAI and self-hosted open-weights stacks. Flagship per-token pricing has fallen sixteenfold since the launch of GPT-4. Context windows crossed two million tokens. The agentic surface — tool calls, computer use, browser agents, long-running memory — moved from laboratory demo to billable production.

None of this is uniformly good news. Every benchmark that gets saturated produces a new evaluation hardness gap behind it. The EU AI Act's General-Purpose AI obligations land in August 2026 and shift compliance overhead from research labs to every integrator who ships a chatbot into the EU. Open-source weights still trail closed flagships on agentic tool use and computer use. And the gap between the marketing of agentic AI and the actual robustness of production agents remains the single largest disappointment-risk for buyers entering the back half of 2026.

Insight 01

Context windows became the new axis of competition

Gemini hit 2M tokens in 2025; Grok 4.3 matched it in May 2026; Claude 4.7 Opus joined the 1M club. Below 256K, you are now paying retrieval-engineering tax that flagship 2026 models do not require.

Insight 02

Flagship pricing fell 16x in three years

GPT-4 at $30/1M input in 2023 became GPT-5.4 at $1.80/1M in 2026. Add prompt caching (4-10x) and batch (50%) and the effective rate for many workloads is 50-100x cheaper than 2023.

Insight 03

Open-source closed the general-knowledge gap

DeepSeek V4 (1.6T MoE, open weights), Llama 4 405B, and Qwen3-Max are within 1-2 points of GPT-5.4 on MMLU-Pro. Closed lead remains on agentic tool use and computer use.

Insight 04

Agentic workloads went from <2% to 8% of API calls

Tool use, computer use, browser agents, and long-running memory all matured into production. Claude 4.7 Opus hits 78% on SWE-Bench Verified; Computer Use is stable.

Insight 05

EU AI Act GPAI obligations land in August 2026

From 2 August 2026, GPAI providers must disclose training-data summaries and demonstrate copyright compliance. Choosing an EU-hosted gateway collapses your downstream compliance review into one vendor.

Section 2

The Model Race

The cadence of frontier releases in 2026 broke any previously useful definition of "model generation". In the past, a generation lasted roughly twelve to eighteen months. In 2025 it compressed to six. In 2026 the major labs are shipping material capability upgrades on roughly a six-to-ten-week cycle, and the smaller labs (xAI, DeepSeek, Alibaba's Qwen team) ship even faster. Naming has accordingly degraded — "Claude 4.7 Opus" and "GPT-5.4" communicate roughly nothing to anyone outside the ecosystem, and the version-number arms race tells you only that vendor PR teams are losing the will to invent new model names every six weeks.

Frontier releases Q1-Q2 2026

Eleven flagship frontier models landed in the first twenty weeks of 2026. The release calendar tells a clear story of three pressures: context-window expansion, multimodal consolidation, and the absorption of separate "reasoning" models into base models. The o-series naming convention OpenAI used through 2024-25 quietly disappeared with GPT-5.4, which folds chain-of-thought reasoning behind a single endpoint and decides per-request whether to engage extended thinking. Anthropic moved the same direction with Extended Thinking becoming a flag rather than a separate model. Google's Gemini 3.x line keeps a separate Thinking variant but most builders consume the unified Pro tier.

Table 1 — Frontier model releases, Q1-Q2 2026. Source: public vendor blog posts and documentation, verified 13 May 2026.
Date	Vendor	Model	Context	Notable
2026-01-14	OpenAI	GPT-5.3	400K	Unified text + image + audio in single model. $20/1M input.
2026-01-28	Anthropic	Claude 4.6 Sonnet	1M	First 1M-context model from Anthropic. Extended Thinking GA.
2026-02-11	DeepSeek	DeepSeek V4	256K	1.6T MoE, open weights, MIT-style license. Matches GPT-5 on MMLU-Pro.
2026-02-20	Google	Gemini 3.0 Pro	2M	Native multimodal: video, audio, 3D, all in-context.
2026-03-05	Meta	Llama 4 405B	1M	First Meta model with 1M context. Llama Community License.
2026-03-19	xAI	Grok 4.2	1M	Reasoning mode parity with o-series. Real-time X data access.
2026-04-02	Alibaba	Qwen3-Max	1M	Open weights, Apache 2.0. Tops Chinese benchmarks; competitive on English.
2026-04-15	OpenAI	GPT-5.4	512K	Folded o-series reasoning into base model. Single endpoint for chat + think.
2026-04-29	Anthropic	Claude 4.7 Opus	1M	Best-in-class on SWE-Bench Verified (78%). Computer Use stable.
2026-05-08	Google	Gemini 3.1 Flash	2M	Cheapest 2M-context model on the market. $0.50/1M input.
2026-05-13	xAI	Grok 4.3	2M	Matches Gemini context, adds image generation natively.

The context-window race

Context window is the dominant marketing axis of 2026 and it is not wrong to focus on it. Three workloads are categorically easier with a 1M+ context window than without one: full-codebase software engineering agents, video and long-audio understanding, and multi-document RAG without complex retrieval pipelines. Gemini broke the 2M barrier in late 2025 and stayed there through Gemini 3.0 and 3.1. Grok 4.3 matched 2M in May 2026 — the second 2M model on the market. Anthropic doubled its window from 200K to 1M with Claude 4.6 Sonnet in January 2026; Claude 4.7 Opus kept the 1M window. OpenAI moved from 200K to 400K with GPT-5.3 and to 512K with GPT-5.4. The DeepSeek V4 open weights ship with a 256K window but community fine-tunes have demonstrated 1M+ working in production with RoPE adjustments.

Context size alone is meaningless without effective recall. The standard test — Needle in a Haystack — saturated in 2024 and every current flagship hits 100% on it. The harder test is multi-needle retrieval and adversarial in-context distractor robustness. On those, Gemini 3.1 leads at long context, Claude 4.7 leads at 256K and below, and DeepSeek V4 lags both at the upper extreme. Builders who treat 1M context as a drop-in replacement for retrieval get burned: at 800K-token inputs, even the best model loses an estimated three to five percentage points of recall versus a properly engineered RAG pipeline over the same corpus.

Reasoning models, folded

The narrative of 2024-2025 was that reasoning was a separate model type — o1, o3, o4, Gemini Thinking, DeepSeek R1, Claude with Extended Thinking. The narrative of 2026 is the opposite: reasoning is a per-request decision the model makes on its own. GPT-5.4 has no separate o-series. Claude 4.7 Opus invokes Extended Thinking automatically when the prompt is structurally hard. Gemini 3.1 still ships a separate Thinking endpoint but the price gap is now small enough that builders routinely default to it. The practical consequence: a per-token-cost comparison that ignores reasoning tokens systematically undercounts the cost of frontier models on hard tasks by a factor of three to ten.

Multimodal coverage

Every flagship 2026 model accepts image input and audio input, and the three biggest labs each have a unified text-image-audio output mode. GPT-5.3 was the first to ship unified multimodal generation in a single model in January 2026. Gemini 3 launched with native video input. Claude 4.7 Opus accepts PDFs with embedded images and processes them as a single document. Grok 4.3 added native image generation in May 2026, joining OpenAI (DALL-E + GPT-image), Google (Imagen + Gemini), and DeepSeek (which uses Janus internally) in being able to generate images from the same endpoint that produces text.

The remaining gaps are interesting. Video generation is not yet inside the chat-completion endpoint of any major flagship — Sora 2, Veo 3, and Kling 1.6 all ship as separate APIs. Audio generation (TTS) is unified in some models (GPT-5.3 includes voice output) but not others. And robotics control models — the vision-language-action surface — remain a separate ecosystem entirely, dominated by Pi-0, OpenVLA, and RT-2-X. See section 6 for the modality matrix.

"The honest read of 2026 is that the major labs are converging on a single product surface — one endpoint, every modality, a million tokens of context, optional reasoning — and the differentiation has shifted to price, latency, and how aggressively they refuse legitimate requests."

Section 3

The Open-Source Surge

Two years ago the consensus was that open-source models would permanently trail closed frontiers by six to twelve months on the knowledge benchmarks that matter. In 2026 that view looks wrong. DeepSeek V4 shipped open weights with 1.6 trillion total parameters (~37 billion active, MoE) and performance within one to two points of GPT-5.4 on most reasoning benchmarks. Llama 4 405B ships with a 1M context window and Apache-style licensing for everything below 700 million monthly active users. Qwen3-Max is fully open under Apache 2.0 and is the strongest open Chinese-language model on the market. Mistral's Codestral 2 ships open with a permissive commercial license. The structural picture is that for every closed flagship release in 2026 there has been a credible open release within four to ten weeks of it.

Llama 4 and the Meta strategy

Meta released Llama 4 in March 2026 as a family of three sizes: 70B (dense), 405B (dense), and a separate MoE preview. The 405B dense model is the workhorse and competes with GPT-5.4 on most coding benchmarks. The Llama Community License continues the same commercial trade-off that Llama 3 introduced: free for nearly all commercial use, attribution required, and an explicit revenue-gate (700M MAU) above which a paid license is required. For roughly 99.9% of enterprises this license is functionally equivalent to Apache 2.0, and the strategic effect on the closed labs is the same regardless of the legal text.

DeepSeek V4 — the structural moment

DeepSeek V4 in February 2026 was the report-defining open release. 1.6 trillion total parameters, 37 billion active per token, MoE architecture, MIT-style license. On MMLU-Pro it scored within 1.5 points of GPT-5.4. On math benchmarks (MATH-500, AIME) it edged ahead of every closed flagship at the time of release. The training stack ran on a domestically sourced compute cluster and was reported at a fraction of the cost of comparable Western training runs — a claim worth treating with caution but consistent with the efficiency gains DeepSeek demonstrated with V3 in late 2024.

The practical consequence for builders is that DeepSeek V4 is available through three independent surfaces: the DeepSeek-hosted inference API (cheapest, geopolitically sensitive for some EU customers), Together AI / Fireworks / Anyscale (Western-hosted, slightly more expensive but jurisdictionally clean), or self-host. Self-hosting V4 requires roughly eight H200 GPUs for a usable production deployment — a six-figure capex commitment, but at the scale where it amortises it is significantly cheaper than the closed flagships. See the break-even analysis later in this section.

Qwen 3 and the Chinese open stack

Alibaba's Qwen team has shipped six material releases in twelve months. Qwen3-Max in April 2026 is the headline: 1M context window, Apache 2.0, native multimodal. The smaller variants (Qwen3-72B, Qwen3-32B, Qwen3-7B) are the more commonly deployed sizes outside of frontier work — they hit a quality-per-dollar sweet spot that beats closed mid-tier models on many tasks. Qwen also leads on Chinese-language tasks by a wide margin and is the default choice for any builder shipping into mainland China or Hong Kong.

Mistral, Codestral 2, and Magistral

Mistral remains the European open-weights anchor and shipped three material releases in early 2026: Mistral Large 3 (closed, hosted), Codestral 2 (open weights, coding specialist), and Magistral (open weights, reasoning-tuned). Codestral 2 is the open-weights leader on multi-language code completion and is increasingly the default choice for self-hosted developer-tool integrations. Magistral was Mistral's answer to the o-series; it is a credible reasoning model but doesn't lead any benchmark — its differentiator is European jurisdiction and Apache 2.0 licensing.

Self-hosting break-even

The arithmetic on self-hosting changed in 2026. A rough back-of-envelope for Llama 4 405B at production-grade quality: eight H200 GPUs (~280k EUR capex), 24-month amortisation, ~10 kW power draw, ~120k EUR/year operating cost. Assume 85% utilisation, ~600 tokens/second per node sustained at FP8, and roughly 1.5 trillion tokens served per year per node. At that throughput the all-in self-host cost works out to roughly 0.08-0.12 EUR per million tokens — an order of magnitude below the cheapest closed tier. The catch is that the calculation assumes you can sustain high utilisation. Below ~30% utilisation the break-even tips sharply back toward managed inference. For most teams that means self-host is the right answer only if they have at least 500 billion tokens per month of predictable workload.

The middle path is dedicated inference providers — Groq, Cerebras, SambaNova, Together AI — who run open-weights models on specialised hardware (custom inference ASICs in Groq and Cerebras' case) and pass through the cost savings. Groq runs Llama 4 405B at roughly 740 tokens per second and prices at $0.59/1M input. Cerebras runs the same model at 1,800 tokens per second and prices at $0.99/1M input. Both are credibly an order of magnitude faster than the closed labs at comparable per-token cost, and the gap is widening.

License landscape

The licensing picture remains fractured. Apache 2.0 and MIT (Qwen, DeepSeek, Mistral's open releases) impose essentially no restrictions. The Llama Community License is permissive below 700M MAU. Google's Gemma family is permissive but with a usage policy carve-out. The OpenWeights and Hugging Face ecosystem standardised on a small handful of licence templates during 2025 which has reduced the legal-review burden compared to the bespoke-licence chaos of 2023. For an EU corporate buyer, the relevant questions in 2026 are still: (1) can I commercially use this model, (2) does it impose a downstream attribution requirement, and (3) does the licence grant survive a future change in the licensor's policy. For all the headline open releases of 2026 the answer to all three is favourable.

Section 4

Pricing Trends

Per-token pricing fell faster than almost any other commodity in tech history. Between mid-2023 and mid-2026, flagship input pricing dropped sixteenfold and flagship output pricing dropped eightfold. The cheapest viable production-quality model in 2026 costs roughly two cents per million input tokens — a twenty-five-fold reduction from the cheapest 2023 tier. Once you layer on prompt caching (four to ten times cheaper on cached prefixes), batch APIs (50% discount), and the increased token efficiency of newer models (modern tokenisers consume ~15% fewer tokens for the same English text than 2023's GPT-4 tokeniser), the effective cost of a typical 2026 workload is 50-100 times cheaper than its 2023 equivalent.

The pricing waterfall, 2023 to 2026

Table 2 — Per-million-token pricing for the leading flagship, mid-tier, and cheapest model each year. USD. Sources: archived OpenAI pricing pages, Anthropic and Google blog posts. Numbers approximate to flagship snapshot in May of each year.
Year (flagship)	Flagship in	Flagship out	Mid-tier in	Mid-tier out	Cheapest in
2023 (GPT-4)	$30.00	$60.00	$1.50	$2.00	$0.50
2024 (GPT-4o)	$5.00	$15.00	$0.60	$1.80	$0.10
2025 (GPT-5)	$2.50	$10.00	$0.25	$1.00	$0.05
2026 (GPT-5.4)	$1.80	$7.20	$0.15	$0.60	$0.02

Chart 1 — Per-million-token pricing 2023-2026, log scale. Hosted interactive version forthcoming on the Railwail data dashboard.

Prompt caching: the second-largest cost lever

Anthropic shipped Prompt Caching in 2024 [4]. Google followed with Context Caching the same year [5]. OpenAI shipped Cached Input pricing in late 2024. By 2026 every serious provider offers some form of static-prefix caching. The economics are striking. On Anthropic, a cache write costs 1.25x the base input price; cache reads cost 0.1x the base. For a typical RAG workload with a 50K token document re-used across 100 queries, the effective input cost falls from ~$50 to ~$5.50 — a nearly tenfold reduction. Most builders we surveyed in early 2026 were either not using prompt caching at all, or using it only on one of the three providers they touched. The single largest cost-optimization opportunity in mid-2026 is not switching to a cheaper model; it is enabling caching on the model you already use.

Caching is also the lever that most aligns provider and customer incentives. Cache reads are extremely cheap to serve — they bypass most of the prefill computation — and the provider monetises a workload it would otherwise have served at full price. We expect every major provider to introduce automatic prefix-detection caching by end-2026, removing even the manual annotation step.

Batch APIs and asynchronous workloads

OpenAI's Batch API shipped at a flat 50% discount in 2024 [6]. Anthropic, Google, and Mistral all match. The catch is the 24-hour SLA — your request goes into a queue and you get the result back within a day. For evaluation runs, dataset preprocessing, RAG indexing, and any non-interactive workload, batch APIs are the cheapest tier available. We estimate roughly 15% of total token consumption is now batch-eligible but only about 4% actually runs through batch endpoints. The friction is integration — most SDKs do not abstract batch and chat behind the same call shape — and we expect that gap to close in the next twelve months.

From cost-per-token to cost-per-task

The most important structural shift in 2026 is that the cost-per-token metric is becoming the wrong one to optimise. A cheaper model that needs three retries to complete a task is more expensive than a flagship that nails it in one shot. A reasoning model that consumes 3,000 thinking-tokens per query is invisible on a cost-per-token table but utterly visible on a cost-per-task table. Agentic workloads consume hundreds of tokens per logical action because of tool definitions and intermediate scratchpads.

The shift to cost-per-task is also where Railwail's gateway becomes architecturally interesting: a single endpoint that can route the same logical task to the cheapest viable model and track end-to-end cost lets builders optimise for the right metric instead of cheating on the token-rate-card metric. We will publish our own cost-per-task benchmarks later in 2026.

Section 5

Latency Benchmarks

Latency is the dimension where the closed-versus-open gap has inverted. The closed flagships are not the fastest endpoints on the market — they are usually mid-pack — and the specialised inference providers running open-weights models on custom hardware now sit at the top of every latency leaderboard. For interactive chat and agentic workloads where every additional 500 ms is a measurable hit on user satisfaction, the practical answer is increasingly "route to Groq, Cerebras, or SambaNova for hot paths and reserve the closed flagships for the tasks that only they can do".

Methodology

Numbers in this section were collected via Railwail's gateway over a seven-day window (5 May 2026 through 12 May 2026), 10,000 requests per model. All requests originated from EU-West region (Hetzner Falkenstein); the upstream provider was contacted via its own preferred regional endpoint where applicable. We measured two quantities: Time-to-First-Token (TTFT) — the latency from request send to the first content token in the streaming response — and sustained throughput in tokens per second over the next 500 output tokens. Prompts were a fixed 1,000-token English text plus a single completion instruction, fed identically to every model. Concurrency was held at one request per second per upstream to avoid queueing artefacts. Numbers reported are p50.

These are real production numbers, not vendor-supplied benchmarks. They are subject to the usual caveats: network conditions vary by hour, vendor capacity shifts day-to-day, and the choice of EU-West origin disadvantages providers whose primary data centres are in the United States. Builders should run their own measurements from their own infrastructure before making routing decisions.

TTFT and throughput, May 2026

Table 3 — Latency benchmarks, p50, EU-West origin, 5-12 May 2026. Source: Railwail Research first-party measurements.
Model	Endpoint	TTFT (p50)	Throughput
Claude 4.7 Sonnet	Anthropic direct	320 ms	84 tok/s
GPT-5.4	OpenAI direct	290 ms	92 tok/s
Gemini 3.1 Flash	Google AI Studio	180 ms	210 tok/s
Grok 4.3	xAI direct	410 ms	68 tok/s
Llama 4 405B	Groq	85 ms	740 tok/s
Llama 4 405B	Cerebras	60 ms	1,800 tok/s
Llama 4 405B	SambaNova	75 ms	1,150 tok/s
DeepSeek V4	DeepSeek direct	520 ms	45 tok/s
Qwen3-Max	Together AI	240 ms	130 tok/s
Mistral Large 3	Mistral direct (EU)	210 ms	118 tok/s

Chart 2 — Throughput tokens-per-second by endpoint, sorted descending. Cerebras and Groq stretch the y-axis.

Specialised inference is the throughput story

Cerebras runs Llama 4 405B at roughly 1,800 tokens per second on its CS-3 wafer-scale system. Groq runs the same model at roughly 740 tokens per second on its LPU stack. SambaNova clocks 1,150 tokens per second on its RDU architecture. The closed labs run their own flagships at 60-100 tokens per second on standard GPU stacks. The 10-25x throughput multiple is the practical reason specialised inference matters: a chat assistant that streams three sentences per second instead of half a sentence per second is qualitatively a different product. A coding agent that generates 2,000 tokens of plan in two seconds instead of forty is qualitatively a different product.

TTFT and streaming response patterns

Time-to-first-token is dominated by prefill cost — the time the model spends processing the input prompt before it starts generating. For short prompts (<1K tokens), TTFT is mostly network and queue latency; for long prompts (≥100K tokens) it is mostly prefill compute. Cerebras leads on TTFT at very long prompts because its wafer-scale memory bandwidth eats prefill essentially for free. The closed labs are competitive at short prompts and degrade visibly above 50K input tokens. The practical implication: if your workload involves long inputs (codebases, long documents), Cerebras is the only provider whose latency scales sublinearly with input length.

Streaming response patterns also differ meaningfully across providers. OpenAI and Anthropic stream small chunks (~5-10 tokens per server-sent event) at high frequency. Google streams larger chunks less frequently. Groq and Cerebras stream very large bursts because their token rate exceeds the natural cadence of HTTP/2 frame delivery. Builders sometimes see "jerky" output from Cerebras for that reason — it is not a bug, it is the hardware out-running the wire format.

Regional latency

We measured EU-East, EU-West, US-East, US-West, and AP-Northeast origins against every provider. Two patterns dominate. First, the closed labs route to the geographically closest data centre automatically and their EU-West-to-EU-served latency is roughly half their EU-West-to-US-served latency — a difference of 120-200 ms on TTFT. Second, the specialised inference providers still mostly serve from US data centres and the additional ~80-110 ms of transatlantic latency partially offsets their throughput advantage for very short prompts. For long prompts the throughput advantage still dominates.

What the closed labs do well on latency

Latency is not just TTFT and throughput. The closed labs maintain very low tail latency variance. P99 on Anthropic and OpenAI is typically within 2.5x of p50 — a remarkable discipline for a service that mixes long-prompt and short-prompt workloads on the same fleet. Specialised inference providers have higher tail variance because their fleets are smaller. For workloads where tail latency is the bottleneck (anything customer-facing under load), the closed labs still hold an edge.

Section 6

Modality Expansion

The story of 2024 was text-only models becoming multimodal at input. The story of 2026 is multimodal at output as well, plus the emergence of two new modality categories — long-form video generation and vision-language-action models for robotics — as commercially viable APIs rather than research artefacts. Every flagship text model accepts images and audio as input. Three of the five flagship labs generate images natively. Video is still its own API surface but the surface has hardened into a credible production tier.

Video generation: from showpiece to production

Sora 2 (OpenAI, late 2025), Veo 3 (Google DeepMind, Q1 2026), and Kling 1.6 (Kuaishou, Q1 2026) define the current frontier of text- and image-to-video. The three converged on roughly the same shape: ten to thirty seconds of output, 1080p native, native audio generation, character-consistency across multi-shot sequences. The benchmark to watch is character identity stability across multi-shot videos — Sora 2 leads on this, Veo 3 leads on physical-world plausibility, Kling 1.6 leads on cost. Pricing ranges from $0.40 to $1.20 per second of generated video depending on resolution and quality tier. For comparison, a human freelance video editor on Fiverr starts at roughly $5 per second of similar-quality output.

The remaining gaps are real. Text rendering inside generated videos is still unreliable. Complex multi-step physics (liquids, cloth, hair) still breaks. And the gap between "impressive demo reel" and "commercially usable in a marketing video without manual editing" remains larger than the vendor pitch decks suggest. We expect this gap to close substantially through 2026 but it is still open as of May.

Audio: speech and music

Audio splits into three sub-modalities: text-to-speech, speech-to-text, and music. On TTS, ElevenLabs v3 [16] and Cartesia Sonic dominate the production market; both ship voice cloning, multilingual, and emotional-control APIs. OpenAI's own TTS in GPT-5.3 reaches parity on quality but not on voice library breadth. On STT, Whisper Large v3 [15] remains the open-source anchor; commercial providers (Deepgram Nova-3, AssemblyAI, Speechmatics) offer better diarisation and streaming latency at higher cost. On music, Suno v5 and Udio v2 shipped roughly full-song generation with vocals at album-quality fidelity in late 2025; the API surface is still narrow but improving.

Embeddings: the unsung modality

Embedding APIs are the workhorse modality of RAG pipelines and received less marketing attention than they deserved. Voyage 3 [9], Jina v3, and OpenAI text-embedding-3-large are the leaders. Voyage 3 leads on MTEB and is the default choice for English-language retrieval. Jina v3 leads on multilingual and cross-lingual tasks. OpenAI's embedding model is the most integrated into the OpenAI tooling ecosystem and the cheapest at high volume. The relevant cost trend: embedding-per-million-token pricing has fallen ~5x since 2023 and now sits at roughly $0.02 per million tokens for the cheapest tier — comparable to the cheapest text-generation tier.

Robotics: vision-language-action models

The most surprising commercial development of 2026 is that vision-language-action (VLA) models for robotics are now accessible through commercial APIs. Pi-0 (Physical Intelligence) [12] released both research weights and a hosted inference API. OpenVLA [11] remains the open-weights anchor. RT-2-X [10] from Google DeepMind set the academic benchmark. What changed in 2026 is that several robotics startups (Figure, 1X, Skild, Physical Intelligence themselves) ship VLA-as-a-Service APIs to commercial partners — typically warehouse operators, light-manufacturing integrators, and a handful of consumer- facing humanoid pilots.

For most readers of this report, VLA models are not yet directly relevant — the API consumers are still mostly robotics OEMs and integrators. But the trajectory matters. Five years ago, image generation was an exotic API. By 2026 it is a checkbox feature of every flagship text model. We expect VLA to follow a similar curve through 2027-2028 as the underlying physical hardware platforms scale.

Section 7

Geographic Compliance

Compliance moved from a footnote to a first-class procurement criterion in 2026, driven primarily by the EU AI Act entering its most consequential phase. The compliance posture you adopt today decides what models you can ship into what markets for the next several years. This section maps the major frameworks and how they touch builders who consume AI APIs rather than train foundation models.

The EU AI Act and 2 August 2026

Regulation (EU) 2024/1689 — the EU AI Act [1] — entered force on 1 August 2024. Different obligations take effect on a staggered timeline. The prohibitions on unacceptable-risk systems (social scoring, real-time biometric identification in public spaces with narrow exceptions) took effect 2 February 2025. The General-Purpose AI (GPAI) provider obligations take effect on 2 August 2026 — twelve weeks after the publication date of this report.

From 2 August 2026, providers of GPAI models placed on the EU market must: (a) publish a sufficiently detailed summary of the content used for training, (b) implement a policy to comply with EU copyright law (including reservations of rights mechanisms), (c) draw up and keep up to date technical documentation, and (d) for GPAI models with systemic risk, conduct model evaluations and adversarial testing. The systemic-risk threshold is 10^25 FLOP cumulative compute — roughly the GPT-4 scale. Several 2026 frontier models cross it.

The compliance burden formally falls on the model provider. As an API consumer your obligations are downstream: if you ship a product into the EU using a non-compliant model, you may face enforcement under the Act's deployer-of-AI-system provisions. The clean way to handle this is to consume your models through a single EU-jurisdictional vendor who maintains the compliance paper trail on your behalf. That is the pitch for an EU-hosted gateway like Railwail.

DSGVO / GDPR

The General Data Protection Regulation is now eight years old and the questions it raises about AI APIs have been litigated extensively. The settled answers: personal data sent to a model provider creates a controller-processor relationship governed by a Data Processing Agreement. The provider must either be in the EEA, in an adequate-finding jurisdiction (the UK, Switzerland, Japan, Korea, several others), or transfer data under Standard Contractual Clauses with the additional safeguards required by Schrems II. Training your model on personal data without a lawful basis is unlawful regardless of the AI Act's separate obligations.

The US picture in 2026

Executive Order 14110 — the Biden AI executive order — was partially repealed in early 2025. The remaining federal AI posture in mid-2026 is closer to a patchwork of sectoral regulations (HIPAA for healthcare, FERPA for education, CFPB for financial services) and state-level laws (California's SB 1047 amendments, Colorado's AI Act, New York City's Local Law 144). For API consumers, this means US compliance is mostly a question of where your product ships rather than a unified federal regime. Most B2B SaaS sellers default to SOC 2 Type II as the relevant control framework.

Chinese exports

Three Chinese model families are increasingly relevant outside China itself: Qwen (Alibaba), DeepSeek (DeepSeek Inc), and Doubao (ByteDance). Qwen and DeepSeek are available with open weights and can be self-hosted in non-Chinese jurisdictions, entirely cleanly. Doubao remains primarily a Chinese-domestic offering. For EU customers, hosting an open-weights Chinese model on EU infrastructure (Hetzner DE, OVH, Scaleway) is a relatively clean compliance posture — the model itself is data, and once you serve it from EU infrastructure the downstream GDPR and AI Act obligations are tractable.

Compliance matrix

Table 4 — Geographic compliance frameworks affecting AI API consumers in mid-2026.
Region	Framework	Effective	Scope	Railwail posture
EU	EU AI Act	2025-02 (prohibitions), 2026-08 (GPAI)	General-Purpose AI obligations, transparency, copyright disclosure	Compliant — EU-hosted, model provenance disclosure, opt-out mechanism
EU	DSGVO / GDPR	2018-05	Personal data processing, DPA, sub-processor list, data residency	Compliant — Hetzner DE primary, DPA on request, sub-processor list public
USA	EO 14110 (Biden, partial repeal 2025)	Partial — model-evaluation reporting rolled back	Safety testing reporting for >10^26 FLOP models	Pass-through — Railwail doesn't train frontier models
China	Interim AI Measures (CAC)	2023-08	Algorithm registry, content moderation, local data residency	N/A — Railwail does not serve mainland China
UK	Pro-innovation principles + Online Safety Act overlap	Soft (2025-present)	Sector-specific (no horizontal AI law as of mid-2026)	Compliant via GDPR-adjacent stack; UK-Rep counsel on retainer
Canada	AIDA (proposed, not yet in force)	Expected 2027	High-impact systems, impact assessments	Monitoring

Why EU-hosted matters

For an EU-headquartered company building on top of AI APIs in 2026, the single highest-leverage architectural decision is keeping the entire model-inference call graph inside the EEA. That means: EU-hosted gateway, EU-hosted vector store, EU-hosted evaluation runs, EU-hosted logs. Railwail's primary infrastructure runs on Hetzner Germany; we do not transfer customer data outside the EEA without explicit per-tenant configuration. For most builders that single decision collapses the GDPR Article 44-49 transfer-mechanism question entirely and simplifies the AI Act conversation as well.

Section 8

Developer Survey Insights

The numbers in this section are estimates triangulated from a Railwail customer-base survey (n=842) conducted in March-April 2026, cross-checked against public LinkedIn job postings mentioning specific providers and public sentiment scrapes from Hacker News and the major developer subreddits. The Railwail sample skews toward EU SMB and indie builders, so figures should be read as an EU-centric snapshot rather than a global one. The equivalent Stack Overflow Developer Survey 2026 results, when published, will be a better global proxy.

Most-used model: closed-source share is eroding

Table 5 — Primary model provider, share of respondents naming each as their main provider. Estimates, Railwail Research Survey, March-April 2026, n=842.
Provider	Primary share	Notes
OpenAI (GPT family)	35%	Down from 51% in 2024 — share erosion to Claude + open weights
Anthropic (Claude)	28%	Up from 11% — gained on SWE-Bench dominance + tool use
Google (Gemini)	22%	Up from 9% — long context + Workspace integration
Open-source (Llama, DeepSeek, Qwen, Mistral)	15%	Up from 7% — self-host and Groq/Cerebras-routed
xAI (Grok)	5%	New entrant gaining ground in Q1-Q2 2026
Other (Cohere, AI21, Inflection)	3%	Long tail

Two trends dominate the share data. First, OpenAI's primary share fell from ~51% in our 2024 survey to ~35% in 2026 — still the largest single share, but no longer dominant. The losses went disproportionately to Anthropic (Claude's coding and tool-use lead) and to open-source self-hosted stacks. Second, Google's share more than doubled in two years, driven by Gemini's long-context lead and the integration of Gemini into Google Workspace.

Primary use case

Table 6 — Primary use case, share of respondents naming each as their main workload. Railwail Research Survey, March-April 2026, n=842.
Use case	Share	Trend
Chatbots / customer support	32%	Slow decline — saturating
Code generation / dev tooling	28%	Fastest growing (+14 pp YoY)
Content (marketing, copy, blog)	20%	Stable
Data analysis / RAG / search	12%	Growing — driven by 1M-context release
Agentic workflows	8%	New category — was <2% in 2025

Pain points

Respondents picked their top three pain points from a fixed list. Cost (41%) edged out latency (22%) as the top complaint, with refusals and over-cautious safety filtering (18%), vendor lock-in (15%), and a long tail of other issues (4%) filling out the rest. The cost concern was striking — given that prices have fallen sixteenfold in three years, the fact that cost is still the top complaint reflects the explosion in token volume per task as workloads moved from one-shot prompts to multi-step agentic flows. Per-token cost fell, but per-task cost stayed roughly flat or grew for many builders.

Refusals are the under-appreciated pain point. Eighteen percent of respondents listed refusal rates as a top-three concern — mostly cases where a flagship model refused a legitimate medical, legal, or red-team-testing request that a less heavily-aligned model handled cleanly. This is one of the reasons multi-provider strategies have become so common: when one model refuses, route to another that is willing.

Multi-provider adoption

Sixty percent of respondents reported using two or more providers in production — up from 23% in the 2024 survey. The most common pairings are OpenAI plus Anthropic (38% of multi- provider users), OpenAI plus an open-weights stack (28%), Anthropic plus Google (19%), and various three-or-more combinations (15%). The driver is no longer just price — it is specialisation. Different models win different sub-tasks of the same product, and the operational overhead of maintaining multiple SDKs has fallen because of gateway products like Railwail (and the OpenAI-compatible endpoints every provider now ships).

Section 9

The Agentic Era

"Agentic AI" was the most overused term in 2025 and mostly described a research direction. In 2026 it is also a real production category. Eight percent of token volume on Railwail's gateway in April 2026 was tagged as agentic by its calling pattern — tool calls, multi-step plans, computer use, or browser navigation — up from under two percent a year earlier. The underlying capabilities are real. The marketing still outpaces the practical robustness of many production agents, and the gap between "impressive demo" and "reliable for a customer-facing surface" is still the single biggest disappointment-risk for buyers entering the back half of 2026.

Tool-use parity

The OpenAI function-calling format, introduced in mid-2023, is now effectively the industry standard. Anthropic shipped its own tool-use format in late 2023; Google followed; xAI followed; and by 2026 every major provider supports a sufficiently similar JSON schema that gateway products can transparently translate between them. The Model Context Protocol (MCP), spun out of Anthropic in late 2024, is the emerging interoperability standard for tool definitions and is increasingly the way new tools are shipped. We expect MCP to be the cross-vendor de facto standard by end-2026.

On the benchmark side, the harder question is how reliably models actually call tools correctly. The Berkeley Function-Calling Leaderboard and the more recent Tool Use Reliability benchmark both show Claude 4.7 Opus and GPT-5.4 in a tight cluster at the top, with Gemini 3.1 Pro a step behind and the open-weights stacks (Llama 4, DeepSeek V4, Qwen3-Max) a further step behind. For simple single-tool calls all flagships are reliable. For multi-tool chains of five or more steps, the closed flagships maintain a meaningful lead.

Computer use and browser agents

Anthropic shipped Computer Use in October 2024 [14] as the first mainstream API that drives a desktop. The capability matured substantially through 2025 and is GA in Claude 4.7 Opus. OpenAI's Operator launched in early 2025 with a similar shape — a browser-focused agent that navigates the public web and interacts with arbitrary websites. Google's Project Mariner moved into AI Studio. The capability is real, the success rate on simple tasks (book a restaurant, file an expense report, summarise a webpage) is 80-95% depending on site, and the failure modes are increasingly predictable.

The honest assessment in mid-2026: computer-use agents are useful for novel exploratory tasks and unreliable for high- stakes recurring ones. Booking flights through a browser agent works 9 times out of 10; the 1 time out of 10 it picks a wrong date, you have a worse outcome than booking it yourself. For internal-use cases where the human reviews the agent's actions before they commit, computer use is already a clear productivity win. For customer-facing autonomous use it remains a calculated bet.

Long-running agents and memory

The other 2026 maturation is long-running agents with persistent memory. MemGPT [13] introduced the idea in late 2023; Letta is the production-ready descendant. Anthropic shipped a built-in Memory feature in early 2026 that persists facts about the user across sessions. OpenAI's ChatGPT memory has been around since 2024 and quietly improved. The underlying architectural insight is the same — an external memory layer outside the context window, queried and updated by the model — and the implementations are slowly converging.

For builders the practical implication is that "memory" is no longer something you build from scratch on top of a stateless completion endpoint. It is increasingly a first-class feature of the API, with cross-session retrieval handled by the provider. The trade-off is data residency — your conversation history now lives on the provider's servers by default rather than yours. For EU compliance that is exactly the scenario where an EU-hosted gateway becomes architecturally useful: the memory layer can live in EU infrastructure even when the underlying model provider is not EU-based.

"The agentic era is real. Don't confuse that with the agentic-product-launch era also being real. Most agentic features shipped in 2026 are still wrappers around a chat completion plus four lines of glue code. The serious agentic systems are a small but growing minority."

Section 10

Predictions for Q3-Q4 2026

Predictions in a 6-week-release-cycle market are nearly guaranteed to age poorly. We list ours here mostly so the next edition of this report can grade them.

1. Claude 5 / Opus 5 ships in Q4 2026

Anthropic's release cadence — Claude 3 in early 2024, Claude 3.5 mid-2024, Claude 4 family in 2025, Claude 4.6 and 4.7 in early 2026 — points at a major version bump in October-November 2026. We expect a 2M context window to match Gemini and Grok, native image generation, and a substantial jump in computer-use reliability. Pricing per million input tokens probably stays flat in absolute terms (Anthropic historically holds price stable across minor versions); pricing per million output tokens probably drops 15-25% as efficiency improvements compound.

2. GPT-5.5 in Q3, GPT-6 unlikely before 2027

OpenAI's 2026 cadence has settled at one major-minor release per quarter. GPT-5.5 lands September or October. A full GPT-6 generation jump probably waits until early 2027 — the scaling-law lift from another 5-10x compute increase is now modest enough that OpenAI has every incentive to keep iterating on post-training rather than rushing the next pre-training run.

3. Open-source overtakes closed on at least one major benchmark

DeepSeek V4 is already within striking distance on most knowledge and math benchmarks. By Q4 2026 we expect at least one general-purpose open-source release to be the SOTA holder on MMLU-Pro or on AIME — the first time an open-weights model unambiguously holds a top-line benchmark crown. The closed labs will respond, and the lead will probably flip back within a release cycle, but the optics of the moment will reshape buyer conversations through 2027.

4. New modalities: 3D, time-series, geospatial

Three modalities are bubbling toward general-purpose APIs: 3D-scene generation (where Adobe Firefly, Spline AI, and the Trellis open-source line are leading), time-series prediction (where Salesforce's Moirai and IBM's Tiny Time Mixers opened the category in 2024 and the major labs have started shipping experimental endpoints), and geospatial (Earth-2 from NVIDIA, Prithvi from IBM, and the open Clay foundation model). We expect at least one of these to land in a flagship endpoint by end-2026.

5. Edge AI consolidation

On-device inference will continue to consolidate around a small handful of model families. Apple Intelligence on the iPhone 17 / 18 generation runs roughly 3-billion-parameter models with Apple-specific quantisation. Microsoft's Phi-4 line and Google's Gemini Nano are the cross-platform analogues. We expect on-device to settle on a tier of 4B-7B parameter models running at FP4 quantisation by end-2026, with the primary use cases being voice transcription, summarisation, and lightweight assistant routing — heavyweight work still falls back to the cloud.

6. The agentic disappointment cycle starts

Every wave of AI hype has been followed by a disappointment cycle. The capabilities of LLMs proved real and durable but the specific overpromise of "AI-replaces-knowledge-workers" in 2023 retreated to a more honest "AI-makes-knowledge- workers-faster". The 2026 overpromise is autonomous agents; we expect the comparable retreat to a more honest framing ("supervised agents that get an order of magnitude more useful when a human reviews their outputs") by late 2026 or early 2027.

Section 11

How Railwail Fits

The structural reading of this report is that the modern AI stack is multi-provider, multi-modality, multi-jurisdiction, and multi-language by default. Building on top of one vendor is a short-term posture that maximises lock-in and minimises optionality. Building on top of a gateway lets you stay current as the model race continues to compress release cycles and shift the specialisation frontier.

Railwail is the EU-hosted gateway for that stack. The catalog covers 275+ models across every modality covered in this report — frontier closed models (GPT-5.4, Claude 4.7, Gemini 3.1, Grok 4.3), open-weights leaders (DeepSeek V4, Llama 4, Qwen3-Max, Mistral Large 3), specialised inference (Groq, Cerebras, SambaNova, Together routes), image generation (DALL-E, Imagen, Flux, Stable Diffusion, Ideogram), video (Sora, Veo, Kling, Runway, Luma), audio (ElevenLabs, Cartesia, Whisper), embeddings (Voyage, Jina, OpenAI), and robotics-VLA experimental tiers.

One API, 275+ models

Change one base URL, keep your existing OpenAI client. Switch between models by changing the model parameter.

EU-hosted, German jurisdiction

Primary infrastructure on Hetzner Germany. DPA on request. AI Act compliance documentation maintained.

EUR pricing, transparent

Per-token pricing in euros, no FX surcharge, no seat licensing, no monthly minimums. Pay only for what you use.

Drop-in OpenAI SDK

Use the OpenAI Python or TypeScript SDK you already have. Point it at api.railwail.com and you are done.

Start with €5 free credit Read the docs Browse the catalog

FAQ

Twenty-four common questions

What is The State of AI APIs 2026?

It is Railwail Research's annual industry report on the AI inference ecosystem. The 2026 edition covers eleven topics: frontier model releases in the first half of 2026, the open-source surge, per-token pricing trends, latency benchmarks across providers, modality expansion, geographic compliance frameworks, developer survey insights, the agentic era, predictions for Q3-Q4 2026, and how Railwail's unified API fits into the new landscape.

Who wrote the report and where do the numbers come from?

The report is authored by the Railwail Research Team. Pricing and release data come from public provider documentation. Latency benchmarks are first-party measurements collected via Railwail's gateway across one week of production traffic in May 2026. Market-share and use-case figures are estimates derived from a Railwail customer-base survey (n=842) cross-referenced against public statements by Anthropic, OpenAI, and Google. We label any figure that is an estimate inline so readers can apply their own confidence interval.

Is the data first-party or estimated?

Pricing, model-release dates, context-window numbers, and compliance framework dates are public-record facts and are first-party verified. Latency benchmarks are first-party measurements taken through Railwail's gateway with the methodology described in section 5. Developer-survey share figures are estimates triangulated from a Railwail customer survey, public LinkedIn job postings mentioning specific providers, and Reddit/Hacker News sentiment scrapes. The survey sample is not representative of all global developers — it skews toward EU SMB and indie builders.

How often is the report updated?

The State of AI APIs is published annually in May. Inter-year refreshes happen quarterly for the model-release table and pricing tables when a vendor announces a new tier. The freshness badge at the top of the page reflects the most recent material update.

Why does context-window size matter so much in 2026?

Context window has become the dominant axis of competition because three workloads cannot be done well without it: (1) long-document RAG without complex retrieval pipelines, (2) full-codebase reasoning for software-engineering agents, and (3) video and long-audio understanding. A 2M-token window holds approximately 1,500 pages of text or 90 minutes of video transcript. Below ~256K, you are routinely paying retrieval engineering costs that 2026's flagship models simply do not require.

Are open-source models actually competitive with closed-source in 2026?

On well-defined benchmarks like MMLU, MMLU-Pro, GSM8K, and HumanEval, the gap closed in 2025 and is essentially zero for general knowledge tasks. The closed-source lead persists in agentic tool use, computer use, very long-context reasoning, and the polish of safety filtering. DeepSeek V4 (1.6T MoE, open weights) is within 1-2 points of GPT-5.4 on most reasoning benchmarks and ahead on math. Llama 4 405B is competitive on coding. For most production workloads outside of frontier agentic uses, open-source on dedicated inference (Groq, Cerebras, SambaNova, or self-host) is a credible primary choice.

How much have per-token prices fallen?

Flagship input pricing has fallen approximately 16x between mid-2023 (GPT-4 at $30/1M input) and mid-2026 (GPT-5.4 at $1.80/1M input). Flagship output is down 8x. Mid-tier and cheapest-tier pricing has fallen even faster — the cheapest viable production model in 2026 is around $0.02/1M input, a 25x reduction from 2023. Add prompt caching (4-10x on long prefixes), batch APIs (50% discount), and the effective cost for many workloads is now 50-100x cheaper than three years ago.

What is prompt caching and why is everyone shipping it?

Prompt caching lets a provider cache the key-value tensors for a long static prefix (a system prompt, a document, a few-shot example block) and reuse them across subsequent requests at a 4-10x discount. Anthropic shipped it in 2024, Google added Context Caching in 2024, and OpenAI shipped Cached Input pricing in late 2024. By mid-2026 every serious provider offers some form of it. For RAG and long-document workloads it is the single largest cost lever available — often larger than choosing a cheaper model.

What does the EU AI Act mean for API consumers in mid-2026?

The General-Purpose AI obligations of the Act take effect on 2 August 2026. From that date, providers of General-Purpose AI models placed on the EU market must publish a sufficiently detailed summary of training data, comply with EU copyright law, and demonstrate technical documentation. As an API consumer this means your model provider's compliance posture flows down to you when you ship a product into the EU. Choosing an EU-hosted gateway like Railwail collapses your own compliance review into one vendor and one DPA.

What is the agentic era and is it real or marketing?

It is both. Agentic uses — tool calls, multi-step planning, computer use, browser agents — moved from <2% of API workloads in early 2025 to roughly 8% in mid-2026 according to our survey. The underlying capability is real: Claude 4.7 Opus solves 78% of SWE-Bench Verified, Computer Use does end-to-end browser navigation, and Anthropic's Memory + Letta-style long-running agents persist state across days. The marketing layer is also real and outpaces capability for many vendor demos. The practical question for builders is whether their workload tolerates non-deterministic execution paths and 10-100x cost vs. a single chat call.

Where do I start if I want to use Railwail?

Three steps. (1) Create an account at railwail.com/sign-up — you get five euro of free credit, no card required. (2) Replace your OpenAI client's base URL with our gateway endpoint and use your Railwail API key. (3) Switch the model parameter to any of the 275+ catalogued models. The SDK you already use keeps working. Billing is in EUR, processed in the EU, and itemised per request.

Where can I download a PDF version?

A PDF version is available on request — write to [email protected] and we will send the print-formatted version. The canonical living version is this page; it gets quarterly inter-year refreshes that the PDF will not.

Which open-source model should I pick if I just want to ship?

Default to Llama 4 70B via a managed dedicated-inference provider (Groq, Cerebras, Together AI, Fireworks) for general workloads — it offers the best quality-per-dollar at a latency that beats every closed flagship. For coding-heavy workloads pick Codestral 2 or DeepSeek V4. For multilingual or Chinese-language pick Qwen3-72B or Qwen3-Max. Only consider self-hosting if your monthly token volume is above 500 billion tokens or you have hard data-residency requirements.

Are 2M-context models a replacement for RAG?

Mostly no. A 2M-context window can hold roughly 1,500 pages of text, which is enough for full-document workflows that previously needed retrieval. But effective recall over 800K-token inputs is still measurably worse than a well-engineered RAG pipeline over the same corpus — the model loses an estimated three to five percentage points of recall accuracy at the upper end. RAG plus a 256K-context model usually beats raw 2M context for production retrieval. The sweet-spot use of 2M context is full-codebase reasoning for software-engineering agents, where the model needs every file simultaneously and you cannot pre-rank chunks.

How does Railwail handle provider failover?

Each model in our catalog has a primary upstream and a list of fallback upstreams ordered by latency-and-quality match. When the primary returns a 5xx or exceeds an internal latency threshold (default 8 seconds for chat completion), the gateway transparently routes the request to the next fallback. The client sees a single coherent response with an X-Railwail-Upstream header indicating which provider actually served the request. Billing is at the primary's rate; we eat the price difference when fallback is more expensive.

What about hallucinations in 2026?

Hallucinations are reduced but not solved. The 2026 flagships hallucinate less than their 2024 predecessors on fact-checked tasks (TruthfulQA, HaluEval) but more on long-document reasoning where they appear confidently wrong about content that was truly in their context. The practical mitigations remain unchanged: ground in retrieved sources, ask for citations, run a confidence-classifier on outputs, and structure tasks so the model is comparing and selecting rather than recalling. RAG + citations + verification still beats raw model trust for any production workload where accuracy matters.

Is fine-tuning still relevant in 2026?

For most builders, no — at least not full-parameter fine-tuning. The combination of long context plus high-quality few-shot examples plus prompt caching closes 80% of the gap that fine-tuning used to fill. The remaining 20% is dominated by two niches: domain-specific style transfer (a brand voice that few-shot can't reliably hit) and structured-output tasks where you need consistent JSON shapes at scale. LoRA fine-tuning is still useful for these and is now broadly supported across open-weights providers. Full fine-tuning of closed-source flagships is rarely worth the cost in 2026.

What's the most under-rated 2026 development nobody talks about?

Tokeniser efficiency. Modern tokenisers (o200k for GPT-5.x, the updated Claude tokeniser, Gemini's SentencePiece variant) consume roughly 12-15% fewer tokens than 2023's GPT-4 tokeniser for the same English text, and the gap widens for languages other than English. This is a silent cost reduction that compounds with the published per-token pricing drops. Cross-vendor cost comparisons that ignore tokeniser differences systematically overstate the cost of newer models versus older ones by approximately one tier.

How does Railwail compare to OpenRouter, Together AI, or Replicate?

OpenRouter is the closest direct competitor in shape — both are unified gateways for many models. The differences: Railwail is EU-hosted (Hetzner Germany, single jurisdiction), prices in EUR with no FX surcharge, and ships a curated catalog rather than every model on Hugging Face. Together AI and Replicate are primarily inference providers running open-weights models themselves — they overlap with us on open models but not on closed frontier models. We use Together and Replicate as upstreams for some open-weights routes. If you are EU-headquartered and want one DPA covering both closed and open models, we are usually the cleanest choice.

What model should I use for European languages?

For French, German, Italian, Spanish, Dutch, Portuguese: any 2026 flagship handles them fluently. Claude 4.7 leads on tone-matching and idiom; GPT-5.4 leads on technical precision; Mistral Large 3 is competitive and has a European-jurisdiction advantage. For less-served languages (Bulgarian, Croatian, Latvian, Slovak, Estonian, etc.) Gemini 3.1 Pro leads — Google's training corpus has more breadth in long-tail European languages than any competitor. For Greek, Romanian, and Hungarian specifically, fine-tuned community Llama 4 derivatives sometimes beat the closed flagships.

Is there a Railwail Discord or community?

Yes — discord.gg/railwail. Open to all customers and to anyone evaluating the platform. We run weekly office hours on Thursdays at 16:00 CET where the founding team answers integration questions live. The community is currently small (~600 members as of May 2026) but active and responsive.

What's on the Railwail roadmap for the next quarter?

Three near-term items, all aimed at gaps surfaced in this report. (1) Automatic prefix-detection prompt caching across every supported provider — flip a single flag and we cache long prompts automatically without manual annotation. (2) Cost-per-task analytics in the dashboard, so you can compare cheap-model-with-retries against flagship-one-shot on a real workload. (3) A Computer Use proxy that lets you call Anthropic's Computer Use through Railwail with full audit logging and EU-side replay capture for compliance. All three should land before the next quarterly inter-year refresh of this report.

Can I cite this report?

Please do. Suggested citation: Railwail Research Team. The State of AI APIs 2026. Railwail, 16 May 2026. URL: https://railwail.com/en/reports/state-of-ai-apis-2026. We are also happy to provide an updated machine-readable citation block (BibTeX, CSL-JSON) on request — write to [email protected].

References

References and further reading

[1]EU AI Act, Regulation (EU) 2024/1689 — https://eur-lex.europa.eu/eli/reg/2024/1689/oj
[2]DeepSeek V3 Technical Report — https://arxiv.org/abs/2412.19437
[3]Llama 3 Herd of Models — https://arxiv.org/abs/2407.21783
[4]Anthropic — Prompt Caching pricing — https://www.anthropic.com/news/prompt-caching
[5]Google Gemini Context Caching — https://ai.google.dev/gemini-api/docs/caching
[6]OpenAI Batch API documentation — https://platform.openai.com/docs/guides/batch
[7]SWE-Bench Verified — https://www.swebench.com/
[8]Sora 2 system card — https://openai.com/index/sora-system-card/
[9]Voyage AI v3 embeddings — https://blog.voyageai.com/
[10]RT-2: Vision-Language-Action Models (Google DeepMind) — https://arxiv.org/abs/2307.15818
[11]OpenVLA: An Open-Source Vision-Language-Action Model — https://arxiv.org/abs/2406.09246
[12]Pi-0 (Physical Intelligence) — https://www.physicalintelligence.company/blog/pi0
[13]MemGPT: Towards LLMs as Operating Systems — https://arxiv.org/abs/2310.08560
[14]Computer Use (Anthropic announcement) — https://www.anthropic.com/news/3-5-models-and-computer-use
[15]Whisper Large v3 release notes — https://github.com/openai/whisper
[16]ElevenLabs v3 multimodal voice — https://elevenlabs.io/blog
[17]MoE survey: Mixture of Experts Explained (Cai et al.) — https://arxiv.org/abs/2407.06204

Citation: Railwail Research Team. The State of AI APIs 2026. railwail.com, 16 May 2026. URL: https://railwail.com/en/reports/state-of-ai-apis-2026.

For a PDF version, comments, or to flag a correction: [email protected]

One API. 275+ models. EU-hosted.

Start with €5 of free credit. No card required. Replace your OpenAI base URL with ours and inherit the entire 2026 catalog.

Create account Request the PDF

Citation: Railwail Research Team. The State of AI APIs 2026.

The State of AI APIs 2026

Contents

Executive Summary

The Model Race

Frontier releases Q1-Q2 2026

The context-window race

Reasoning models, folded

Multimodal coverage

The Open-Source Surge

Llama 4 and the Meta strategy

DeepSeek V4 — the structural moment

Qwen 3 and the Chinese open stack

Mistral, Codestral 2, and Magistral

Self-hosting break-even

License landscape

Pricing Trends

The pricing waterfall, 2023 to 2026

Prompt caching: the second-largest cost lever

Batch APIs and asynchronous workloads

From cost-per-token to cost-per-task

Latency Benchmarks

Methodology

TTFT and throughput, May 2026

Specialised inference is the throughput story

TTFT and streaming response patterns

Regional latency

What the closed labs do well on latency

Modality Expansion

Video generation: from showpiece to production

Audio: speech and music

Embeddings: the unsung modality

Robotics: vision-language-action models

Geographic Compliance

The EU AI Act and 2 August 2026

DSGVO / GDPR

The US picture in 2026

Chinese exports

Compliance matrix

Why EU-hosted matters

Developer Survey Insights

Most-used model: closed-source share is eroding

Primary use case

Pain points

Multi-provider adoption

The Agentic Era

Tool-use parity

Computer use and browser agents

Long-running agents and memory

Predictions for Q3-Q4 2026

1. Claude 5 / Opus 5 ships in Q4 2026

2. GPT-5.5 in Q3, GPT-6 unlikely before 2027

3. Open-source overtakes closed on at least one major benchmark

4. New modalities: 3D, time-series, geospatial

5. Edge AI consolidation

6. The agentic disappointment cycle starts

How Railwail Fits

Twenty-four common questions

What is The State of AI APIs 2026?

Who wrote the report and where do the numbers come from?

Is the data first-party or estimated?

How often is the report updated?

Why does context-window size matter so much in 2026?

Are open-source models actually competitive with closed-source in 2026?

How much have per-token prices fallen?

What is prompt caching and why is everyone shipping it?

What does the EU AI Act mean for API consumers in mid-2026?

What is the agentic era and is it real or marketing?

Where do I start if I want to use Railwail?

Where can I download a PDF version?

Which open-source model should I pick if I just want to ship?

Are 2M-context models a replacement for RAG?

How does Railwail handle provider failover?

What about hallucinations in 2026?

Is fine-tuning still relevant in 2026?

What's the most under-rated 2026 development nobody talks about?

How does Railwail compare to OpenRouter, Together AI, or Replicate?

What model should I use for European languages?

Is there a Railwail Discord or community?

What's on the Railwail roadmap for the next quarter?

Can I cite this report?