Railwail Research Β· Annual Report

The State of AI APIs 2026

An annual industry report on the AI inference ecosystem. 9,000 words, eleven sections, six tables, twenty-five questions answered. Models, pricing, latency, open-source, compliance, agentic workloads, and what the back half of 2026 looks like.

9,200 words ~40 min readLast updated Railwail Research Team
Section 1

Executive Summary

The AI inference ecosystem in mid-2026 is more competitive, cheaper, and structurally more open than at any prior point in its short history. Eleven flagship frontier models shipped in the first twenty weeks of 2026 alone. Three vendors β€” OpenAI, Anthropic, and Google β€” each hold roughly a quarter of paying developer share, with the remaining quarter increasingly split between xAI and self-hosted open-weights stacks. Flagship per-token pricing has fallen sixteenfold since the launch of GPT-4. Context windows crossed two million tokens. The agentic surface β€” tool calls, computer use, browser agents, long-running memory β€” moved from laboratory demo to billable production.

None of this is uniformly good news. Every benchmark that gets saturated produces a new evaluation hardness gap behind it. The EU AI Act's General-Purpose AI obligations land in August 2026 and shift compliance overhead from research labs to every integrator who ships a chatbot into the EU. Open-source weights still trail closed flagships on agentic tool use and computer use. And the gap between the marketing of agentic AI and the actual robustness of production agents remains the single largest disappointment-risk for buyers entering the back half of 2026.

Insight 01
Context windows became the new axis of competition

Gemini hit 2M tokens in 2025; Grok 4.3 matched it in May 2026; Claude 4.7 Opus joined the 1M club. Below 256K, you are now paying retrieval-engineering tax that flagship 2026 models do not require.

Insight 02
Flagship pricing fell 16x in three years

GPT-4 at $30/1M input in 2023 became GPT-5.4 at $1.80/1M in 2026. Add prompt caching (4-10x) and batch (50%) and the effective rate for many workloads is 50-100x cheaper than 2023.

Insight 03
Open-source closed the general-knowledge gap

DeepSeek V4 (1.6T MoE, open weights), Llama 4 405B, and Qwen3-Max are within 1-2 points of GPT-5.4 on MMLU-Pro. Closed lead remains on agentic tool use and computer use.

Insight 04
Agentic workloads went from <2% to 8% of API calls

Tool use, computer use, browser agents, and long-running memory all matured into production. Claude 4.7 Opus hits 78% on SWE-Bench Verified; Computer Use is stable.

Insight 05
EU AI Act GPAI obligations land in August 2026

From 2 August 2026, GPAI providers must disclose training-data summaries and demonstrate copyright compliance. Choosing an EU-hosted gateway collapses your downstream compliance review into one vendor.

Section 2

The Model Race

The cadence of frontier releases in 2026 broke any previously useful definition of "model generation". In the past, a generation lasted roughly twelve to eighteen months. In 2025 it compressed to six. In 2026 the major labs are shipping material capability upgrades on roughly a six-to-ten-week cycle, and the smaller labs (xAI, DeepSeek, Alibaba's Qwen team) ship even faster. Naming has accordingly degraded β€” "Claude 4.7 Opus" and "GPT-5.4" communicate roughly nothing to anyone outside the ecosystem, and the version-number arms race tells you only that vendor PR teams are losing the will to invent new model names every six weeks.

Frontier releases Q1-Q2 2026

Eleven flagship frontier models landed in the first twenty weeks of 2026. The release calendar tells a clear story of three pressures: context-window expansion, multimodal consolidation, and the absorption of separate "reasoning" models into base models. The o-series naming convention OpenAI used through 2024-25 quietly disappeared with GPT-5.4, which folds chain-of-thought reasoning behind a single endpoint and decides per-request whether to engage extended thinking. Anthropic moved the same direction with Extended Thinking becoming a flag rather than a separate model. Google's Gemini 3.x line keeps a separate Thinking variant but most builders consume the unified Pro tier.

Table 1 β€” Frontier model releases, Q1-Q2 2026. Source: public vendor blog posts and documentation, verified 13 May 2026.
DateVendorModelContextNotable
2026-01-14OpenAIGPT-5.3400KUnified text + image + audio in single model. $20/1M input.
2026-01-28AnthropicClaude 4.6 Sonnet1MFirst 1M-context model from Anthropic. Extended Thinking GA.
2026-02-11DeepSeekDeepSeek V4256K1.6T MoE, open weights, MIT-style license. Matches GPT-5 on MMLU-Pro.
2026-02-20GoogleGemini 3.0 Pro2MNative multimodal: video, audio, 3D, all in-context.
2026-03-05MetaLlama 4 405B1MFirst Meta model with 1M context. Llama Community License.
2026-03-19xAIGrok 4.21MReasoning mode parity with o-series. Real-time X data access.
2026-04-02AlibabaQwen3-Max1MOpen weights, Apache 2.0. Tops Chinese benchmarks; competitive on English.
2026-04-15OpenAIGPT-5.4512KFolded o-series reasoning into base model. Single endpoint for chat + think.
2026-04-29AnthropicClaude 4.7 Opus1MBest-in-class on SWE-Bench Verified (78%). Computer Use stable.
2026-05-08GoogleGemini 3.1 Flash2MCheapest 2M-context model on the market. $0.50/1M input.
2026-05-13xAIGrok 4.32MMatches Gemini context, adds image generation natively.

The context-window race

Context window is the dominant marketing axis of 2026 and it is not wrong to focus on it. Three workloads are categorically easier with a 1M+ context window than without one: full-codebase software engineering agents, video and long-audio understanding, and multi-document RAG without complex retrieval pipelines. Gemini broke the 2M barrier in late 2025 and stayed there through Gemini 3.0 and 3.1. Grok 4.3 matched 2M in May 2026 β€” the second 2M model on the market. Anthropic doubled its window from 200K to 1M with Claude 4.6 Sonnet in January 2026; Claude 4.7 Opus kept the 1M window. OpenAI moved from 200K to 400K with GPT-5.3 and to 512K with GPT-5.4. The DeepSeek V4 open weights ship with a 256K window but community fine-tunes have demonstrated 1M+ working in production with RoPE adjustments.

Context size alone is meaningless without effective recall. The standard test β€” Needle in a Haystack β€” saturated in 2024 and every current flagship hits 100% on it. The harder test is multi-needle retrieval and adversarial in-context distractor robustness. On those, Gemini 3.1 leads at long context, Claude 4.7 leads at 256K and below, and DeepSeek V4 lags both at the upper extreme. Builders who treat 1M context as a drop-in replacement for retrieval get burned: at 800K-token inputs, even the best model loses an estimated three to five percentage points of recall versus a properly engineered RAG pipeline over the same corpus.

Reasoning models, folded

The narrative of 2024-2025 was that reasoning was a separate model type β€” o1, o3, o4, Gemini Thinking, DeepSeek R1, Claude with Extended Thinking. The narrative of 2026 is the opposite: reasoning is a per-request decision the model makes on its own. GPT-5.4 has no separate o-series. Claude 4.7 Opus invokes Extended Thinking automatically when the prompt is structurally hard. Gemini 3.1 still ships a separate Thinking endpoint but the price gap is now small enough that builders routinely default to it. The practical consequence: a per-token-cost comparison that ignores reasoning tokens systematically undercounts the cost of frontier models on hard tasks by a factor of three to ten.

Multimodal coverage

Every flagship 2026 model accepts image input and audio input, and the three biggest labs each have a unified text-image-audio output mode. GPT-5.3 was the first to ship unified multimodal generation in a single model in January 2026. Gemini 3 launched with native video input. Claude 4.7 Opus accepts PDFs with embedded images and processes them as a single document. Grok 4.3 added native image generation in May 2026, joining OpenAI (DALL-E + GPT-image), Google (Imagen + Gemini), and DeepSeek (which uses Janus internally) in being able to generate images from the same endpoint that produces text.

The remaining gaps are interesting. Video generation is not yet inside the chat-completion endpoint of any major flagship β€” Sora 2, Veo 3, and Kling 1.6 all ship as separate APIs. Audio generation (TTS) is unified in some models (GPT-5.3 includes voice output) but not others. And robotics control models β€” the vision-language-action surface β€” remain a separate ecosystem entirely, dominated by Pi-0, OpenVLA, and RT-2-X. See section 6 for the modality matrix.

"The honest read of 2026 is that the major labs are converging on a single product surface β€” one endpoint, every modality, a million tokens of context, optional reasoning β€” and the differentiation has shifted to price, latency, and how aggressively they refuse legitimate requests."
Section 3

The Open-Source Surge

Two years ago the consensus was that open-source models would permanently trail closed frontiers by six to twelve months on the knowledge benchmarks that matter. In 2026 that view looks wrong. DeepSeek V4 shipped open weights with 1.6 trillion total parameters (~37 billion active, MoE) and performance within one to two points of GPT-5.4 on most reasoning benchmarks. Llama 4 405B ships with a 1M context window and Apache-style licensing for everything below 700 million monthly active users. Qwen3-Max is fully open under Apache 2.0 and is the strongest open Chinese-language model on the market. Mistral's Codestral 2 ships open with a permissive commercial license. The structural picture is that for every closed flagship release in 2026 there has been a credible open release within four to ten weeks of it.

Llama 4 and the Meta strategy

Meta released Llama 4 in March 2026 as a family of three sizes: 70B (dense), 405B (dense), and a separate MoE preview. The 405B dense model is the workhorse and competes with GPT-5.4 on most coding benchmarks. The Llama Community License continues the same commercial trade-off that Llama 3 introduced: free for nearly all commercial use, attribution required, and an explicit revenue-gate (700M MAU) above which a paid license is required. For roughly 99.9% of enterprises this license is functionally equivalent to Apache 2.0, and the strategic effect on the closed labs is the same regardless of the legal text.

DeepSeek V4 β€” the structural moment

DeepSeek V4 in February 2026 was the report-defining open release. 1.6 trillion total parameters, 37 billion active per token, MoE architecture, MIT-style license. On MMLU-Pro it scored within 1.5 points of GPT-5.4. On math benchmarks (MATH-500, AIME) it edged ahead of every closed flagship at the time of release. The training stack ran on a domestically sourced compute cluster and was reported at a fraction of the cost of comparable Western training runs β€” a claim worth treating with caution but consistent with the efficiency gains DeepSeek demonstrated with V3 in late 2024.

The practical consequence for builders is that DeepSeek V4 is available through three independent surfaces: the DeepSeek-hosted inference API (cheapest, geopolitically sensitive for some EU customers), Together AI / Fireworks / Anyscale (Western-hosted, slightly more expensive but jurisdictionally clean), or self-host. Self-hosting V4 requires roughly eight H200 GPUs for a usable production deployment β€” a six-figure capex commitment, but at the scale where it amortises it is significantly cheaper than the closed flagships. See the break-even analysis later in this section.

Qwen 3 and the Chinese open stack

Alibaba's Qwen team has shipped six material releases in twelve months. Qwen3-Max in April 2026 is the headline: 1M context window, Apache 2.0, native multimodal. The smaller variants (Qwen3-72B, Qwen3-32B, Qwen3-7B) are the more commonly deployed sizes outside of frontier work β€” they hit a quality-per-dollar sweet spot that beats closed mid-tier models on many tasks. Qwen also leads on Chinese-language tasks by a wide margin and is the default choice for any builder shipping into mainland China or Hong Kong.

Mistral, Codestral 2, and Magistral

Mistral remains the European open-weights anchor and shipped three material releases in early 2026: Mistral Large 3 (closed, hosted), Codestral 2 (open weights, coding specialist), and Magistral (open weights, reasoning-tuned). Codestral 2 is the open-weights leader on multi-language code completion and is increasingly the default choice for self-hosted developer-tool integrations. Magistral was Mistral's answer to the o-series; it is a credible reasoning model but doesn't lead any benchmark β€” its differentiator is European jurisdiction and Apache 2.0 licensing.

Self-hosting break-even

The arithmetic on self-hosting changed in 2026. A rough back-of-envelope for Llama 4 405B at production-grade quality: eight H200 GPUs (~280k EUR capex), 24-month amortisation, ~10 kW power draw, ~120k EUR/year operating cost. Assume 85% utilisation, ~600 tokens/second per node sustained at FP8, and roughly 1.5 trillion tokens served per year per node. At that throughput the all-in self-host cost works out to roughly 0.08-0.12 EUR per million tokens β€” an order of magnitude below the cheapest closed tier. The catch is that the calculation assumes you can sustain high utilisation. Below ~30% utilisation the break-even tips sharply back toward managed inference. For most teams that means self-host is the right answer only if they have at least 500 billion tokens per month of predictable workload.

The middle path is dedicated inference providers β€” Groq, Cerebras, SambaNova, Together AI β€” who run open-weights models on specialised hardware (custom inference ASICs in Groq and Cerebras' case) and pass through the cost savings. Groq runs Llama 4 405B at roughly 740 tokens per second and prices at $0.59/1M input. Cerebras runs the same model at 1,800 tokens per second and prices at $0.99/1M input. Both are credibly an order of magnitude faster than the closed labs at comparable per-token cost, and the gap is widening.

License landscape

The licensing picture remains fractured. Apache 2.0 and MIT (Qwen, DeepSeek, Mistral's open releases) impose essentially no restrictions. The Llama Community License is permissive below 700M MAU. Google's Gemma family is permissive but with a usage policy carve-out. The OpenWeights and Hugging Face ecosystem standardised on a small handful of licence templates during 2025 which has reduced the legal-review burden compared to the bespoke-licence chaos of 2023. For an EU corporate buyer, the relevant questions in 2026 are still: (1) can I commercially use this model, (2) does it impose a downstream attribution requirement, and (3) does the licence grant survive a future change in the licensor's policy. For all the headline open releases of 2026 the answer to all three is favourable.

Section 5

Latency Benchmarks

Latency is the dimension where the closed-versus-open gap has inverted. The closed flagships are not the fastest endpoints on the market β€” they are usually mid-pack β€” and the specialised inference providers running open-weights models on custom hardware now sit at the top of every latency leaderboard. For interactive chat and agentic workloads where every additional 500 ms is a measurable hit on user satisfaction, the practical answer is increasingly "route to Groq, Cerebras, or SambaNova for hot paths and reserve the closed flagships for the tasks that only they can do".

Methodology

Numbers in this section were collected via Railwail's gateway over a seven-day window (5 May 2026 through 12 May 2026), 10,000 requests per model. All requests originated from EU-West region (Hetzner Falkenstein); the upstream provider was contacted via its own preferred regional endpoint where applicable. We measured two quantities: Time-to-First-Token (TTFT) β€” the latency from request send to the first content token in the streaming response β€” and sustained throughput in tokens per second over the next 500 output tokens. Prompts were a fixed 1,000-token English text plus a single completion instruction, fed identically to every model. Concurrency was held at one request per second per upstream to avoid queueing artefacts. Numbers reported are p50.

These are real production numbers, not vendor-supplied benchmarks. They are subject to the usual caveats: network conditions vary by hour, vendor capacity shifts day-to-day, and the choice of EU-West origin disadvantages providers whose primary data centres are in the United States. Builders should run their own measurements from their own infrastructure before making routing decisions.

TTFT and throughput, May 2026

Table 3 β€” Latency benchmarks, p50, EU-West origin, 5-12 May 2026. Source: Railwail Research first-party measurements.
ModelEndpointTTFT (p50)Throughput
Claude 4.7 SonnetAnthropic direct320 ms84 tok/s
GPT-5.4OpenAI direct290 ms92 tok/s
Gemini 3.1 FlashGoogle AI Studio180 ms210 tok/s
Grok 4.3xAI direct410 ms68 tok/s
Llama 4 405BGroq85 ms740 tok/s
Llama 4 405BCerebras60 ms1,800 tok/s
Llama 4 405BSambaNova75 ms1,150 tok/s
DeepSeek V4DeepSeek direct520 ms45 tok/s
Qwen3-MaxTogether AI240 ms130 tok/s
Mistral Large 3Mistral direct (EU)210 ms118 tok/s
Chart 2 β€” Throughput tokens-per-second by endpoint, sorted descending. Cerebras and Groq stretch the y-axis.

Specialised inference is the throughput story

Cerebras runs Llama 4 405B at roughly 1,800 tokens per second on its CS-3 wafer-scale system. Groq runs the same model at roughly 740 tokens per second on its LPU stack. SambaNova clocks 1,150 tokens per second on its RDU architecture. The closed labs run their own flagships at 60-100 tokens per second on standard GPU stacks. The 10-25x throughput multiple is the practical reason specialised inference matters: a chat assistant that streams three sentences per second instead of half a sentence per second is qualitatively a different product. A coding agent that generates 2,000 tokens of plan in two seconds instead of forty is qualitatively a different product.

TTFT and streaming response patterns

Time-to-first-token is dominated by prefill cost β€” the time the model spends processing the input prompt before it starts generating. For short prompts (<1K tokens), TTFT is mostly network and queue latency; for long prompts (β‰₯100K tokens) it is mostly prefill compute. Cerebras leads on TTFT at very long prompts because its wafer-scale memory bandwidth eats prefill essentially for free. The closed labs are competitive at short prompts and degrade visibly above 50K input tokens. The practical implication: if your workload involves long inputs (codebases, long documents), Cerebras is the only provider whose latency scales sublinearly with input length.

Streaming response patterns also differ meaningfully across providers. OpenAI and Anthropic stream small chunks (~5-10 tokens per server-sent event) at high frequency. Google streams larger chunks less frequently. Groq and Cerebras stream very large bursts because their token rate exceeds the natural cadence of HTTP/2 frame delivery. Builders sometimes see "jerky" output from Cerebras for that reason β€” it is not a bug, it is the hardware out-running the wire format.

Regional latency

We measured EU-East, EU-West, US-East, US-West, and AP-Northeast origins against every provider. Two patterns dominate. First, the closed labs route to the geographically closest data centre automatically and their EU-West-to-EU-served latency is roughly half their EU-West-to-US-served latency β€” a difference of 120-200 ms on TTFT. Second, the specialised inference providers still mostly serve from US data centres and the additional ~80-110 ms of transatlantic latency partially offsets their throughput advantage for very short prompts. For long prompts the throughput advantage still dominates.

What the closed labs do well on latency

Latency is not just TTFT and throughput. The closed labs maintain very low tail latency variance. P99 on Anthropic and OpenAI is typically within 2.5x of p50 β€” a remarkable discipline for a service that mixes long-prompt and short-prompt workloads on the same fleet. Specialised inference providers have higher tail variance because their fleets are smaller. For workloads where tail latency is the bottleneck (anything customer-facing under load), the closed labs still hold an edge.

Section 6

Modality Expansion

The story of 2024 was text-only models becoming multimodal at input. The story of 2026 is multimodal at output as well, plus the emergence of two new modality categories β€” long-form video generation and vision-language-action models for robotics β€” as commercially viable APIs rather than research artefacts. Every flagship text model accepts images and audio as input. Three of the five flagship labs generate images natively. Video is still its own API surface but the surface has hardened into a credible production tier.

Video generation: from showpiece to production

Sora 2 (OpenAI, late 2025), Veo 3 (Google DeepMind, Q1 2026), and Kling 1.6 (Kuaishou, Q1 2026) define the current frontier of text- and image-to-video. The three converged on roughly the same shape: ten to thirty seconds of output, 1080p native, native audio generation, character-consistency across multi-shot sequences. The benchmark to watch is character identity stability across multi-shot videos β€” Sora 2 leads on this, Veo 3 leads on physical-world plausibility, Kling 1.6 leads on cost. Pricing ranges from $0.40 to $1.20 per second of generated video depending on resolution and quality tier. For comparison, a human freelance video editor on Fiverr starts at roughly $5 per second of similar-quality output.

The remaining gaps are real. Text rendering inside generated videos is still unreliable. Complex multi-step physics (liquids, cloth, hair) still breaks. And the gap between "impressive demo reel" and "commercially usable in a marketing video without manual editing" remains larger than the vendor pitch decks suggest. We expect this gap to close substantially through 2026 but it is still open as of May.

Audio: speech and music

Audio splits into three sub-modalities: text-to-speech, speech-to-text, and music. On TTS, ElevenLabs v3 [16] and Cartesia Sonic dominate the production market; both ship voice cloning, multilingual, and emotional-control APIs. OpenAI's own TTS in GPT-5.3 reaches parity on quality but not on voice library breadth. On STT, Whisper Large v3 [15] remains the open-source anchor; commercial providers (Deepgram Nova-3, AssemblyAI, Speechmatics) offer better diarisation and streaming latency at higher cost. On music, Suno v5 and Udio v2 shipped roughly full-song generation with vocals at album-quality fidelity in late 2025; the API surface is still narrow but improving.

Embeddings: the unsung modality

Embedding APIs are the workhorse modality of RAG pipelines and received less marketing attention than they deserved. Voyage 3 [9], Jina v3, and OpenAI text-embedding-3-large are the leaders. Voyage 3 leads on MTEB and is the default choice for English-language retrieval. Jina v3 leads on multilingual and cross-lingual tasks. OpenAI's embedding model is the most integrated into the OpenAI tooling ecosystem and the cheapest at high volume. The relevant cost trend: embedding-per-million-token pricing has fallen ~5x since 2023 and now sits at roughly $0.02 per million tokens for the cheapest tier β€” comparable to the cheapest text-generation tier.

Robotics: vision-language-action models

The most surprising commercial development of 2026 is that vision-language-action (VLA) models for robotics are now accessible through commercial APIs. Pi-0 (Physical Intelligence) [12] released both research weights and a hosted inference API. OpenVLA [11] remains the open-weights anchor. RT-2-X [10] from Google DeepMind set the academic benchmark. What changed in 2026 is that several robotics startups (Figure, 1X, Skild, Physical Intelligence themselves) ship VLA-as-a-Service APIs to commercial partners β€” typically warehouse operators, light-manufacturing integrators, and a handful of consumer- facing humanoid pilots.

For most readers of this report, VLA models are not yet directly relevant β€” the API consumers are still mostly robotics OEMs and integrators. But the trajectory matters. Five years ago, image generation was an exotic API. By 2026 it is a checkbox feature of every flagship text model. We expect VLA to follow a similar curve through 2027-2028 as the underlying physical hardware platforms scale.

Section 7

Geographic Compliance

Compliance moved from a footnote to a first-class procurement criterion in 2026, driven primarily by the EU AI Act entering its most consequential phase. The compliance posture you adopt today decides what models you can ship into what markets for the next several years. This section maps the major frameworks and how they touch builders who consume AI APIs rather than train foundation models.

The EU AI Act and 2 August 2026

Regulation (EU) 2024/1689 β€” the EU AI Act [1] β€” entered force on 1 August 2024. Different obligations take effect on a staggered timeline. The prohibitions on unacceptable-risk systems (social scoring, real-time biometric identification in public spaces with narrow exceptions) took effect 2 February 2025. The General-Purpose AI (GPAI) provider obligations take effect on 2 August 2026 β€” twelve weeks after the publication date of this report.

From 2 August 2026, providers of GPAI models placed on the EU market must: (a) publish a sufficiently detailed summary of the content used for training, (b) implement a policy to comply with EU copyright law (including reservations of rights mechanisms), (c) draw up and keep up to date technical documentation, and (d) for GPAI models with systemic risk, conduct model evaluations and adversarial testing. The systemic-risk threshold is 10^25 FLOP cumulative compute β€” roughly the GPT-4 scale. Several 2026 frontier models cross it.

The compliance burden formally falls on the model provider. As an API consumer your obligations are downstream: if you ship a product into the EU using a non-compliant model, you may face enforcement under the Act's deployer-of-AI-system provisions. The clean way to handle this is to consume your models through a single EU-jurisdictional vendor who maintains the compliance paper trail on your behalf. That is the pitch for an EU-hosted gateway like Railwail.

DSGVO / GDPR

The General Data Protection Regulation is now eight years old and the questions it raises about AI APIs have been litigated extensively. The settled answers: personal data sent to a model provider creates a controller-processor relationship governed by a Data Processing Agreement. The provider must either be in the EEA, in an adequate-finding jurisdiction (the UK, Switzerland, Japan, Korea, several others), or transfer data under Standard Contractual Clauses with the additional safeguards required by Schrems II. Training your model on personal data without a lawful basis is unlawful regardless of the AI Act's separate obligations.

The US picture in 2026

Executive Order 14110 β€” the Biden AI executive order β€” was partially repealed in early 2025. The remaining federal AI posture in mid-2026 is closer to a patchwork of sectoral regulations (HIPAA for healthcare, FERPA for education, CFPB for financial services) and state-level laws (California's SB 1047 amendments, Colorado's AI Act, New York City's Local Law 144). For API consumers, this means US compliance is mostly a question of where your product ships rather than a unified federal regime. Most B2B SaaS sellers default to SOC 2 Type II as the relevant control framework.

Chinese exports

Three Chinese model families are increasingly relevant outside China itself: Qwen (Alibaba), DeepSeek (DeepSeek Inc), and Doubao (ByteDance). Qwen and DeepSeek are available with open weights and can be self-hosted in non-Chinese jurisdictions, entirely cleanly. Doubao remains primarily a Chinese-domestic offering. For EU customers, hosting an open-weights Chinese model on EU infrastructure (Hetzner DE, OVH, Scaleway) is a relatively clean compliance posture β€” the model itself is data, and once you serve it from EU infrastructure the downstream GDPR and AI Act obligations are tractable.

Compliance matrix

Table 4 β€” Geographic compliance frameworks affecting AI API consumers in mid-2026.
RegionFrameworkEffectiveScopeRailwail posture
EUEU AI Act2025-02 (prohibitions), 2026-08 (GPAI)General-Purpose AI obligations, transparency, copyright disclosureCompliant β€” EU-hosted, model provenance disclosure, opt-out mechanism
EUDSGVO / GDPR2018-05Personal data processing, DPA, sub-processor list, data residencyCompliant β€” Hetzner DE primary, DPA on request, sub-processor list public
USAEO 14110 (Biden, partial repeal 2025)Partial β€” model-evaluation reporting rolled backSafety testing reporting for >10^26 FLOP modelsPass-through β€” Railwail doesn't train frontier models
ChinaInterim AI Measures (CAC)2023-08Algorithm registry, content moderation, local data residencyN/A β€” Railwail does not serve mainland China
UKPro-innovation principles + Online Safety Act overlapSoft (2025-present)Sector-specific (no horizontal AI law as of mid-2026)Compliant via GDPR-adjacent stack; UK-Rep counsel on retainer
CanadaAIDA (proposed, not yet in force)Expected 2027High-impact systems, impact assessmentsMonitoring

Why EU-hosted matters

For an EU-headquartered company building on top of AI APIs in 2026, the single highest-leverage architectural decision is keeping the entire model-inference call graph inside the EEA. That means: EU-hosted gateway, EU-hosted vector store, EU-hosted evaluation runs, EU-hosted logs. Railwail's primary infrastructure runs on Hetzner Germany; we do not transfer customer data outside the EEA without explicit per-tenant configuration. For most builders that single decision collapses the GDPR Article 44-49 transfer-mechanism question entirely and simplifies the AI Act conversation as well.

Section 8

Developer Survey Insights

The numbers in this section are estimates triangulated from a Railwail customer-base survey (n=842) conducted in March-April 2026, cross-checked against public LinkedIn job postings mentioning specific providers and public sentiment scrapes from Hacker News and the major developer subreddits. The Railwail sample skews toward EU SMB and indie builders, so figures should be read as an EU-centric snapshot rather than a global one. The equivalent Stack Overflow Developer Survey 2026 results, when published, will be a better global proxy.

Most-used model: closed-source share is eroding

Table 5 β€” Primary model provider, share of respondents naming each as their main provider. Estimates, Railwail Research Survey, March-April 2026, n=842.
ProviderPrimary shareNotes
OpenAI (GPT family)35%Down from 51% in 2024 β€” share erosion to Claude + open weights
Anthropic (Claude)28%Up from 11% β€” gained on SWE-Bench dominance + tool use
Google (Gemini)22%Up from 9% β€” long context + Workspace integration
Open-source (Llama, DeepSeek, Qwen, Mistral)15%Up from 7% β€” self-host and Groq/Cerebras-routed
xAI (Grok)5%New entrant gaining ground in Q1-Q2 2026
Other (Cohere, AI21, Inflection)3%Long tail

Two trends dominate the share data. First, OpenAI's primary share fell from ~51% in our 2024 survey to ~35% in 2026 β€” still the largest single share, but no longer dominant. The losses went disproportionately to Anthropic (Claude's coding and tool-use lead) and to open-source self-hosted stacks. Second, Google's share more than doubled in two years, driven by Gemini's long-context lead and the integration of Gemini into Google Workspace.

Primary use case

Table 6 β€” Primary use case, share of respondents naming each as their main workload. Railwail Research Survey, March-April 2026, n=842.
Use caseShareTrend
Chatbots / customer support32%Slow decline β€” saturating
Code generation / dev tooling28%Fastest growing (+14 pp YoY)
Content (marketing, copy, blog)20%Stable
Data analysis / RAG / search12%Growing β€” driven by 1M-context release
Agentic workflows8%New category β€” was <2% in 2025

Pain points

Respondents picked their top three pain points from a fixed list. Cost (41%) edged out latency (22%) as the top complaint, with refusals and over-cautious safety filtering (18%), vendor lock-in (15%), and a long tail of other issues (4%) filling out the rest. The cost concern was striking β€” given that prices have fallen sixteenfold in three years, the fact that cost is still the top complaint reflects the explosion in token volume per task as workloads moved from one-shot prompts to multi-step agentic flows. Per-token cost fell, but per-task cost stayed roughly flat or grew for many builders.

Refusals are the under-appreciated pain point. Eighteen percent of respondents listed refusal rates as a top-three concern β€” mostly cases where a flagship model refused a legitimate medical, legal, or red-team-testing request that a less heavily-aligned model handled cleanly. This is one of the reasons multi-provider strategies have become so common: when one model refuses, route to another that is willing.

Multi-provider adoption

Sixty percent of respondents reported using two or more providers in production β€” up from 23% in the 2024 survey. The most common pairings are OpenAI plus Anthropic (38% of multi- provider users), OpenAI plus an open-weights stack (28%), Anthropic plus Google (19%), and various three-or-more combinations (15%). The driver is no longer just price β€” it is specialisation. Different models win different sub-tasks of the same product, and the operational overhead of maintaining multiple SDKs has fallen because of gateway products like Railwail (and the OpenAI-compatible endpoints every provider now ships).

Section 9

The Agentic Era

"Agentic AI" was the most overused term in 2025 and mostly described a research direction. In 2026 it is also a real production category. Eight percent of token volume on Railwail's gateway in April 2026 was tagged as agentic by its calling pattern β€” tool calls, multi-step plans, computer use, or browser navigation β€” up from under two percent a year earlier. The underlying capabilities are real. The marketing still outpaces the practical robustness of many production agents, and the gap between "impressive demo" and "reliable for a customer-facing surface" is still the single biggest disappointment-risk for buyers entering the back half of 2026.

Tool-use parity

The OpenAI function-calling format, introduced in mid-2023, is now effectively the industry standard. Anthropic shipped its own tool-use format in late 2023; Google followed; xAI followed; and by 2026 every major provider supports a sufficiently similar JSON schema that gateway products can transparently translate between them. The Model Context Protocol (MCP), spun out of Anthropic in late 2024, is the emerging interoperability standard for tool definitions and is increasingly the way new tools are shipped. We expect MCP to be the cross-vendor de facto standard by end-2026.

On the benchmark side, the harder question is how reliably models actually call tools correctly. The Berkeley Function-Calling Leaderboard and the more recent Tool Use Reliability benchmark both show Claude 4.7 Opus and GPT-5.4 in a tight cluster at the top, with Gemini 3.1 Pro a step behind and the open-weights stacks (Llama 4, DeepSeek V4, Qwen3-Max) a further step behind. For simple single-tool calls all flagships are reliable. For multi-tool chains of five or more steps, the closed flagships maintain a meaningful lead.

Computer use and browser agents

Anthropic shipped Computer Use in October 2024 [14] as the first mainstream API that drives a desktop. The capability matured substantially through 2025 and is GA in Claude 4.7 Opus. OpenAI's Operator launched in early 2025 with a similar shape β€” a browser-focused agent that navigates the public web and interacts with arbitrary websites. Google's Project Mariner moved into AI Studio. The capability is real, the success rate on simple tasks (book a restaurant, file an expense report, summarise a webpage) is 80-95% depending on site, and the failure modes are increasingly predictable.

The honest assessment in mid-2026: computer-use agents are useful for novel exploratory tasks and unreliable for high- stakes recurring ones. Booking flights through a browser agent works 9 times out of 10; the 1 time out of 10 it picks a wrong date, you have a worse outcome than booking it yourself. For internal-use cases where the human reviews the agent's actions before they commit, computer use is already a clear productivity win. For customer-facing autonomous use it remains a calculated bet.

Long-running agents and memory

The other 2026 maturation is long-running agents with persistent memory. MemGPT [13] introduced the idea in late 2023; Letta is the production-ready descendant. Anthropic shipped a built-in Memory feature in early 2026 that persists facts about the user across sessions. OpenAI's ChatGPT memory has been around since 2024 and quietly improved. The underlying architectural insight is the same β€” an external memory layer outside the context window, queried and updated by the model β€” and the implementations are slowly converging.

For builders the practical implication is that "memory" is no longer something you build from scratch on top of a stateless completion endpoint. It is increasingly a first-class feature of the API, with cross-session retrieval handled by the provider. The trade-off is data residency β€” your conversation history now lives on the provider's servers by default rather than yours. For EU compliance that is exactly the scenario where an EU-hosted gateway becomes architecturally useful: the memory layer can live in EU infrastructure even when the underlying model provider is not EU-based.

"The agentic era is real. Don't confuse that with the agentic-product-launch era also being real. Most agentic features shipped in 2026 are still wrappers around a chat completion plus four lines of glue code. The serious agentic systems are a small but growing minority."
Section 10

Predictions for Q3-Q4 2026

Predictions in a 6-week-release-cycle market are nearly guaranteed to age poorly. We list ours here mostly so the next edition of this report can grade them.

1. Claude 5 / Opus 5 ships in Q4 2026

Anthropic's release cadence β€” Claude 3 in early 2024, Claude 3.5 mid-2024, Claude 4 family in 2025, Claude 4.6 and 4.7 in early 2026 β€” points at a major version bump in October-November 2026. We expect a 2M context window to match Gemini and Grok, native image generation, and a substantial jump in computer-use reliability. Pricing per million input tokens probably stays flat in absolute terms (Anthropic historically holds price stable across minor versions); pricing per million output tokens probably drops 15-25% as efficiency improvements compound.

2. GPT-5.5 in Q3, GPT-6 unlikely before 2027

OpenAI's 2026 cadence has settled at one major-minor release per quarter. GPT-5.5 lands September or October. A full GPT-6 generation jump probably waits until early 2027 β€” the scaling-law lift from another 5-10x compute increase is now modest enough that OpenAI has every incentive to keep iterating on post-training rather than rushing the next pre-training run.

3. Open-source overtakes closed on at least one major benchmark

DeepSeek V4 is already within striking distance on most knowledge and math benchmarks. By Q4 2026 we expect at least one general-purpose open-source release to be the SOTA holder on MMLU-Pro or on AIME β€” the first time an open-weights model unambiguously holds a top-line benchmark crown. The closed labs will respond, and the lead will probably flip back within a release cycle, but the optics of the moment will reshape buyer conversations through 2027.

4. New modalities: 3D, time-series, geospatial

Three modalities are bubbling toward general-purpose APIs: 3D-scene generation (where Adobe Firefly, Spline AI, and the Trellis open-source line are leading), time-series prediction (where Salesforce's Moirai and IBM's Tiny Time Mixers opened the category in 2024 and the major labs have started shipping experimental endpoints), and geospatial (Earth-2 from NVIDIA, Prithvi from IBM, and the open Clay foundation model). We expect at least one of these to land in a flagship endpoint by end-2026.

5. Edge AI consolidation

On-device inference will continue to consolidate around a small handful of model families. Apple Intelligence on the iPhone 17 / 18 generation runs roughly 3-billion-parameter models with Apple-specific quantisation. Microsoft's Phi-4 line and Google's Gemini Nano are the cross-platform analogues. We expect on-device to settle on a tier of 4B-7B parameter models running at FP4 quantisation by end-2026, with the primary use cases being voice transcription, summarisation, and lightweight assistant routing β€” heavyweight work still falls back to the cloud.

6. The agentic disappointment cycle starts

Every wave of AI hype has been followed by a disappointment cycle. The capabilities of LLMs proved real and durable but the specific overpromise of "AI-replaces-knowledge-workers" in 2023 retreated to a more honest "AI-makes-knowledge- workers-faster". The 2026 overpromise is autonomous agents; we expect the comparable retreat to a more honest framing ("supervised agents that get an order of magnitude more useful when a human reviews their outputs") by late 2026 or early 2027.

Section 11

How Railwail Fits

The structural reading of this report is that the modern AI stack is multi-provider, multi-modality, multi-jurisdiction, and multi-language by default. Building on top of one vendor is a short-term posture that maximises lock-in and minimises optionality. Building on top of a gateway lets you stay current as the model race continues to compress release cycles and shift the specialisation frontier.

Railwail is the EU-hosted gateway for that stack. The catalog covers 275+ models across every modality covered in this report β€” frontier closed models (GPT-5.4, Claude 4.7, Gemini 3.1, Grok 4.3), open-weights leaders (DeepSeek V4, Llama 4, Qwen3-Max, Mistral Large 3), specialised inference (Groq, Cerebras, SambaNova, Together routes), image generation (DALL-E, Imagen, Flux, Stable Diffusion, Ideogram), video (Sora, Veo, Kling, Runway, Luma), audio (ElevenLabs, Cartesia, Whisper), embeddings (Voyage, Jina, OpenAI), and robotics-VLA experimental tiers.

One API, 275+ models

Change one base URL, keep your existing OpenAI client. Switch between models by changing the model parameter.

EU-hosted, German jurisdiction

Primary infrastructure on Hetzner Germany. DPA on request. AI Act compliance documentation maintained.

EUR pricing, transparent

Per-token pricing in euros, no FX surcharge, no seat licensing, no monthly minimums. Pay only for what you use.

Drop-in OpenAI SDK

Use the OpenAI Python or TypeScript SDK you already have. Point it at api.railwail.com and you are done.

FAQ

Twenty-four common questions

What is The State of AI APIs 2026?

It is Railwail Research's annual industry report on the AI inference ecosystem. The 2026 edition covers eleven topics: frontier model releases in the first half of 2026, the open-source surge, per-token pricing trends, latency benchmarks across providers, modality expansion, geographic compliance frameworks, developer survey insights, the agentic era, predictions for Q3-Q4 2026, and how Railwail's unified API fits into the new landscape.

Who wrote the report and where do the numbers come from?

The report is authored by the Railwail Research Team. Pricing and release data come from public provider documentation. Latency benchmarks are first-party measurements collected via Railwail's gateway across one week of production traffic in May 2026. Market-share and use-case figures are estimates derived from a Railwail customer-base survey (n=842) cross-referenced against public statements by Anthropic, OpenAI, and Google. We label any figure that is an estimate inline so readers can apply their own confidence interval.

Is the data first-party or estimated?

Pricing, model-release dates, context-window numbers, and compliance framework dates are public-record facts and are first-party verified. Latency benchmarks are first-party measurements taken through Railwail's gateway with the methodology described in section 5. Developer-survey share figures are estimates triangulated from a Railwail customer survey, public LinkedIn job postings mentioning specific providers, and Reddit/Hacker News sentiment scrapes. The survey sample is not representative of all global developers β€” it skews toward EU SMB and indie builders.

How often is the report updated?

The State of AI APIs is published annually in May. Inter-year refreshes happen quarterly for the model-release table and pricing tables when a vendor announces a new tier. The freshness badge at the top of the page reflects the most recent material update.

Why does context-window size matter so much in 2026?

Context window has become the dominant axis of competition because three workloads cannot be done well without it: (1) long-document RAG without complex retrieval pipelines, (2) full-codebase reasoning for software-engineering agents, and (3) video and long-audio understanding. A 2M-token window holds approximately 1,500 pages of text or 90 minutes of video transcript. Below ~256K, you are routinely paying retrieval engineering costs that 2026's flagship models simply do not require.

Are open-source models actually competitive with closed-source in 2026?

On well-defined benchmarks like MMLU, MMLU-Pro, GSM8K, and HumanEval, the gap closed in 2025 and is essentially zero for general knowledge tasks. The closed-source lead persists in agentic tool use, computer use, very long-context reasoning, and the polish of safety filtering. DeepSeek V4 (1.6T MoE, open weights) is within 1-2 points of GPT-5.4 on most reasoning benchmarks and ahead on math. Llama 4 405B is competitive on coding. For most production workloads outside of frontier agentic uses, open-source on dedicated inference (Groq, Cerebras, SambaNova, or self-host) is a credible primary choice.

How much have per-token prices fallen?

Flagship input pricing has fallen approximately 16x between mid-2023 (GPT-4 at $30/1M input) and mid-2026 (GPT-5.4 at $1.80/1M input). Flagship output is down 8x. Mid-tier and cheapest-tier pricing has fallen even faster β€” the cheapest viable production model in 2026 is around $0.02/1M input, a 25x reduction from 2023. Add prompt caching (4-10x on long prefixes), batch APIs (50% discount), and the effective cost for many workloads is now 50-100x cheaper than three years ago.

What is prompt caching and why is everyone shipping it?

Prompt caching lets a provider cache the key-value tensors for a long static prefix (a system prompt, a document, a few-shot example block) and reuse them across subsequent requests at a 4-10x discount. Anthropic shipped it in 2024, Google added Context Caching in 2024, and OpenAI shipped Cached Input pricing in late 2024. By mid-2026 every serious provider offers some form of it. For RAG and long-document workloads it is the single largest cost lever available β€” often larger than choosing a cheaper model.

What does the EU AI Act mean for API consumers in mid-2026?

The General-Purpose AI obligations of the Act take effect on 2 August 2026. From that date, providers of General-Purpose AI models placed on the EU market must publish a sufficiently detailed summary of training data, comply with EU copyright law, and demonstrate technical documentation. As an API consumer this means your model provider's compliance posture flows down to you when you ship a product into the EU. Choosing an EU-hosted gateway like Railwail collapses your own compliance review into one vendor and one DPA.

What is the agentic era and is it real or marketing?

It is both. Agentic uses β€” tool calls, multi-step planning, computer use, browser agents β€” moved from <2% of API workloads in early 2025 to roughly 8% in mid-2026 according to our survey. The underlying capability is real: Claude 4.7 Opus solves 78% of SWE-Bench Verified, Computer Use does end-to-end browser navigation, and Anthropic's Memory + Letta-style long-running agents persist state across days. The marketing layer is also real and outpaces capability for many vendor demos. The practical question for builders is whether their workload tolerates non-deterministic execution paths and 10-100x cost vs. a single chat call.

Where do I start if I want to use Railwail?

Three steps. (1) Create an account at railwail.com/sign-up β€” you get five euro of free credit, no card required. (2) Replace your OpenAI client's base URL with our gateway endpoint and use your Railwail API key. (3) Switch the model parameter to any of the 275+ catalogued models. The SDK you already use keeps working. Billing is in EUR, processed in the EU, and itemised per request.

Where can I download a PDF version?

A PDF version is available on request β€” write to research@railwail.com and we will send the print-formatted version. The canonical living version is this page; it gets quarterly inter-year refreshes that the PDF will not.

Which open-source model should I pick if I just want to ship?

Default to Llama 4 70B via a managed dedicated-inference provider (Groq, Cerebras, Together AI, Fireworks) for general workloads β€” it offers the best quality-per-dollar at a latency that beats every closed flagship. For coding-heavy workloads pick Codestral 2 or DeepSeek V4. For multilingual or Chinese-language pick Qwen3-72B or Qwen3-Max. Only consider self-hosting if your monthly token volume is above 500 billion tokens or you have hard data-residency requirements.

Are 2M-context models a replacement for RAG?

Mostly no. A 2M-context window can hold roughly 1,500 pages of text, which is enough for full-document workflows that previously needed retrieval. But effective recall over 800K-token inputs is still measurably worse than a well-engineered RAG pipeline over the same corpus β€” the model loses an estimated three to five percentage points of recall accuracy at the upper end. RAG plus a 256K-context model usually beats raw 2M context for production retrieval. The sweet-spot use of 2M context is full-codebase reasoning for software-engineering agents, where the model needs every file simultaneously and you cannot pre-rank chunks.

How does Railwail handle provider failover?

Each model in our catalog has a primary upstream and a list of fallback upstreams ordered by latency-and-quality match. When the primary returns a 5xx or exceeds an internal latency threshold (default 8 seconds for chat completion), the gateway transparently routes the request to the next fallback. The client sees a single coherent response with an X-Railwail-Upstream header indicating which provider actually served the request. Billing is at the primary's rate; we eat the price difference when fallback is more expensive.

What about hallucinations in 2026?

Hallucinations are reduced but not solved. The 2026 flagships hallucinate less than their 2024 predecessors on fact-checked tasks (TruthfulQA, HaluEval) but more on long-document reasoning where they appear confidently wrong about content that was truly in their context. The practical mitigations remain unchanged: ground in retrieved sources, ask for citations, run a confidence-classifier on outputs, and structure tasks so the model is comparing and selecting rather than recalling. RAG + citations + verification still beats raw model trust for any production workload where accuracy matters.

Is fine-tuning still relevant in 2026?

For most builders, no β€” at least not full-parameter fine-tuning. The combination of long context plus high-quality few-shot examples plus prompt caching closes 80% of the gap that fine-tuning used to fill. The remaining 20% is dominated by two niches: domain-specific style transfer (a brand voice that few-shot can't reliably hit) and structured-output tasks where you need consistent JSON shapes at scale. LoRA fine-tuning is still useful for these and is now broadly supported across open-weights providers. Full fine-tuning of closed-source flagships is rarely worth the cost in 2026.

What's the most under-rated 2026 development nobody talks about?

Tokeniser efficiency. Modern tokenisers (o200k for GPT-5.x, the updated Claude tokeniser, Gemini's SentencePiece variant) consume roughly 12-15% fewer tokens than 2023's GPT-4 tokeniser for the same English text, and the gap widens for languages other than English. This is a silent cost reduction that compounds with the published per-token pricing drops. Cross-vendor cost comparisons that ignore tokeniser differences systematically overstate the cost of newer models versus older ones by approximately one tier.

How does Railwail compare to OpenRouter, Together AI, or Replicate?

OpenRouter is the closest direct competitor in shape β€” both are unified gateways for many models. The differences: Railwail is EU-hosted (Hetzner Germany, single jurisdiction), prices in EUR with no FX surcharge, and ships a curated catalog rather than every model on Hugging Face. Together AI and Replicate are primarily inference providers running open-weights models themselves β€” they overlap with us on open models but not on closed frontier models. We use Together and Replicate as upstreams for some open-weights routes. If you are EU-headquartered and want one DPA covering both closed and open models, we are usually the cleanest choice.

What model should I use for European languages?

For French, German, Italian, Spanish, Dutch, Portuguese: any 2026 flagship handles them fluently. Claude 4.7 leads on tone-matching and idiom; GPT-5.4 leads on technical precision; Mistral Large 3 is competitive and has a European-jurisdiction advantage. For less-served languages (Bulgarian, Croatian, Latvian, Slovak, Estonian, etc.) Gemini 3.1 Pro leads β€” Google's training corpus has more breadth in long-tail European languages than any competitor. For Greek, Romanian, and Hungarian specifically, fine-tuned community Llama 4 derivatives sometimes beat the closed flagships.

Is there a Railwail Discord or community?

Yes β€” discord.gg/railwail. Open to all customers and to anyone evaluating the platform. We run weekly office hours on Thursdays at 16:00 CET where the founding team answers integration questions live. The community is currently small (~600 members as of May 2026) but active and responsive.

What's on the Railwail roadmap for the next quarter?

Three near-term items, all aimed at gaps surfaced in this report. (1) Automatic prefix-detection prompt caching across every supported provider β€” flip a single flag and we cache long prompts automatically without manual annotation. (2) Cost-per-task analytics in the dashboard, so you can compare cheap-model-with-retries against flagship-one-shot on a real workload. (3) A Computer Use proxy that lets you call Anthropic's Computer Use through Railwail with full audit logging and EU-side replay capture for compliance. All three should land before the next quarterly inter-year refresh of this report.

Can I cite this report?

Please do. Suggested citation: Railwail Research Team. The State of AI APIs 2026. Railwail, 16 May 2026. URL: https://railwail.com/en/reports/state-of-ai-apis-2026. We are also happy to provide an updated machine-readable citation block (BibTeX, CSL-JSON) on request β€” write to research@railwail.com.

References

References and further reading

  1. [1]EU AI Act, Regulation (EU) 2024/1689 β€” https://eur-lex.europa.eu/eli/reg/2024/1689/oj
  2. [2]DeepSeek V3 Technical Report β€” https://arxiv.org/abs/2412.19437
  3. [3]Llama 3 Herd of Models β€” https://arxiv.org/abs/2407.21783
  4. [4]Anthropic β€” Prompt Caching pricing β€” https://www.anthropic.com/news/prompt-caching
  5. [5]Google Gemini Context Caching β€” https://ai.google.dev/gemini-api/docs/caching
  6. [6]OpenAI Batch API documentation β€” https://platform.openai.com/docs/guides/batch
  7. [7]SWE-Bench Verified β€” https://www.swebench.com/
  8. [8]Sora 2 system card β€” https://openai.com/index/sora-system-card/
  9. [9]Voyage AI v3 embeddings β€” https://blog.voyageai.com/
  10. [10]RT-2: Vision-Language-Action Models (Google DeepMind) β€” https://arxiv.org/abs/2307.15818
  11. [11]OpenVLA: An Open-Source Vision-Language-Action Model β€” https://arxiv.org/abs/2406.09246
  12. [12]Pi-0 (Physical Intelligence) β€” https://www.physicalintelligence.company/blog/pi0
  13. [13]MemGPT: Towards LLMs as Operating Systems β€” https://arxiv.org/abs/2310.08560
  14. [14]Computer Use (Anthropic announcement) β€” https://www.anthropic.com/news/3-5-models-and-computer-use
  15. [15]Whisper Large v3 release notes β€” https://github.com/openai/whisper
  16. [16]ElevenLabs v3 multimodal voice β€” https://elevenlabs.io/blog
  17. [17]MoE survey: Mixture of Experts Explained (Cai et al.) β€” https://arxiv.org/abs/2407.06204

Citation: Railwail Research Team. The State of AI APIs 2026. railwail.com, 16 May 2026. URL: https://railwail.com/en/reports/state-of-ai-apis-2026.

For a PDF version, comments, or to flag a correction: research@railwail.com

One API. 275+ models. EU-hosted.

Start with €5 of free credit. No card required. Replace your OpenAI base URL with ours and inherit the entire 2026 catalog.

Citation: Railwail Research Team. The State of AI APIs 2026.
    The State of AI APIs 2026 β€” Models, Pricing, Latency, Compliance | Railwail