Open-Source vs Closed-API LLMs: When Does Self-Hosting Pay Off in 2026?

TL;DR: When self-hosting beats API calls (the short answer)

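Break-even for Llama 3.3 70B self-hosted vs Together AI serverless, on raw compute: each 2×H100 pair has to serve roughly 76M output tokens/day on 3-year reserved pricing ($67/day), or about 164M/day on-demand ($144/day), to match $0.88/M. That corresponds to roughly 40% sustained utilization at on-demand H100 pricing.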
Break-even for DeepSeek V4 self-hosted vs Fireworks serverless: ~80M tokens/day sustained (≈2.4B/month) — DeepSeek is so cheap on serverless that self-hosting rarely wins on pure cost.
Operational overhead adds ~$400k/year fully loaded (ML platform engineer + on-call + monitoring). Bake this into TCO before deciding.
Hybrid is the right default: route 80% of traffic to a serverless open-source model (cheap + zero overhead), reserve 20% for a closed-source flagship on hard cases (quality + reliability).
Closed-source has real value beyond benchmarks: SLA, predictable pricing, no GPU shortage risk, fewer security audits, better fine-print on data handling. The break-even must account for what the API includes that self-hosting doesn't.
Recommended default: Stay on serverless until 2B+ monthly tokens or a hard data-residency requirement forces the switch. Then go self-hosted on the model that fits your workload — Qwen 3 or DeepSeek V4 for general, Llama 3.3 70B for the cheapest path.

The cost comparison between open-source and closed-source LLMs has become a moving target. Two years ago, the answer was obvious: closed-source for quality, open-source for cost. In 2026, the math is more interesting. Open-source models are now within 2–5 percentage points of closed-source flagships on most benchmarks. Serverless open-source providers have driven per-token prices down by 80% in 18 months. Closed-source providers have responded with their own cost cuts and aggressive caching discounts. The result is a genuinely difficult three-way decision: closed-source API, serverless open-source, or self-hosted open-source.

This guide walks through the actual TCO math at three workload sizes, the operational overhead that benchmark posts always forget to mention, the hybrid architecture patterns that win in practice, and the situations where each option is the right call. All numbers are from May 2026 list pricing, our internal serving-stack benchmarks, and the public model cards.

The Three Architectures

Three deployment patterns compared (May 2026)

Pattern	Example stack	Cost per 1M output tokens	Operational overhead
Closed-source API	GPT-5.4 via OpenAI	$32.00	≈$0
Serverless open-source	Llama 3.3 70B via Together AI	$0.88	≈$0
Self-hosted open-source	Llama 3.3 70B on 2×H100	$0.50–$1.20	$400k/year fully loaded

On per-token cost alone, the comparison looks brutal: GPT-5.4 lists at $32 per million output tokens; Llama 3.3 70B on Together is $0.88; self-hosted is $0.50–$1.20 depending on utilization. The 35× cost gap between closed-source and open-source is real. But raw per-token cost is not the right metric — TCO is. And TCO depends on volume, latency requirements, data sensitivity, and how much engineering time you can spend on infrastructure rather than product.

The Real Cost of Each Pattern

Closed-source API: predictable but expensive

When you call GPT-5.4 or Claude Opus 4.7 via the official API, you pay the per-token price and get everything else included: SLA, scaling, model updates, prompt caching, regional failover, security audits, compliance certifications. There is essentially no operational overhead — the provider takes care of GPU provisioning, version management, and incident response. The cost structure is purely variable: $0 fixed cost, $X per million tokens.

Two underappreciated facts about closed-source pricing. First, the list price is rarely what you pay at scale. Enterprise contracts typically include 30–50% volume discounts plus prompt-caching credits worth another 20–40% effective discount. Second, the closed-source providers absorb the GPU shortage problem entirely — you do not have to worry about H100 availability, regional capacity, or queueing behind other tenants. This is worth real money during shortage cycles (which have happened twice in the past two years).

Serverless open-source: the goldilocks zone

Serverless open-source providers (Together AI, Fireworks, DeepInfra, Anyscale, Replicate) host open-source models and bill per token. You get most of the closed-source operational benefits (no GPU management, instant scaling, OpenAI-compatible API) at a fraction of the cost. The trade-offs: SLAs are weaker, tail latency is higher during demand spikes, model lifecycle is provider-controlled (a model can disappear with 30 days' notice), and prompt caching is not yet at parity with the closed-source flagships.

Serverless open-source per-million-output pricing (May 2026)

Model	Together AI	Fireworks AI	DeepInfra
Llama 3.3 70B Instruct	$0.88	$0.90	$0.79
Llama 3.1 405B Instruct	$3.50	$3.20	$2.95
DeepSeek V4	$1.20	$1.10	$1.10
Qwen 3 235B	$0.85	$0.79	$0.72
Mixtral 8x22B	$1.20	$1.20	$1.05
DeepSeek R2 (reasoning)	$2.40	$2.20	$2.10
Llama 3.3 70B Turbo (FP8)	$0.40	$0.38	$0.35

DeepInfra is cheapest or tied for cheapest on every row. Fireworks is consistently fastest. Together has the widest model catalog. For most production workloads the right strategy is to start with whichever provider has the model you need at the best price-latency combination, then add a second provider as a hot standby for failover. Multi-vendor routing also gives you leverage in pricing negotiations, because providers respond to credible competitive threats.

Self-hosted: capex + opex + soul

Self-hosting is where the cost story gets complicated. Per-token cost looks lower on paper, but to compute the actual TCO you have to account for: GPU capex (or rental), serving stack engineering, model update pipeline, monitoring, on-call coverage, security patching, and the opportunity cost of engineers doing infrastructure work instead of product work. For most teams, the realistic break-even where self-hosting beats serverless is much higher than the naive per-token math suggests.

Self-hosted GPU stack — list prices for common Llama 3.3 70B configurations

Config	GPU rate (Lambda, on-demand)	GPU rate (3yr reserved)	Capex (own hardware)
2× H100 80GB (FP8)	$6.00/hr	$2.80/hr	≈$120,000
1× H100 80GB (INT4)	$3.00/hr	$1.50/hr	≈$60,000
2× A100 80GB (FP8)	$3.40/hr	$1.60/hr	≈$60,000
8× L40S (cheaper alternative)	$3.20/hr	$1.50/hr	≈$48,000

Beyond those raw prices, the realistic per-token math gets noisier. At 70% sustained utilization on 2×H100 (the typical operating point for a serious production deployment), Llama 3.3 70B costs around $0.50/M output tokens at on-demand cloud rates, or $0.23/M with 3-year reserved capacity. At 30% utilization (a more conservative number for bursty workloads), it's about $1.18/M on-demand, because the same GPU bill is spread over fewer tokens. Compare that with Together AI's $0.88/M serverless price: on-demand self-hosting only wins above roughly 40% sustained utilization, since $0.50 × 0.70 / 0.88 ≈ 0.40.
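
The utilization arithmetic above can be sketched in a few lines. This is a minimal illustration rather than a pricing tool: it assumes the GPU bill is fixed per hour, so per-token cost scales inversely with utilization, and it anchors on the $0.50/M at 70% figure quoted above. The function names are invented for this example.

```typescript
// Assumption: the GPU bill is fixed per hour, so cost per token scales as 1 / utilization.
// Anchor point taken from the text: $0.50 per 1M output tokens at 70% utilization (on-demand).
const ANCHOR = { costPerMillion: 0.5, utilization: 0.7 };

/** Estimated self-hosted cost per 1M output tokens at a given utilization (0 < u <= 1). */
function costPerMillionAt(utilization: number, anchor = ANCHOR): number {
  if (utilization <= 0 || utilization > 1) {
    throw new RangeError("utilization must be in (0, 1]");
  }
  return (anchor.costPerMillion * anchor.utilization) / utilization;
}

/** Utilization at which self-hosting matches a serverless price per 1M tokens. */
function breakEvenUtilization(serverlessPricePerMillion: number, anchor = ANCHOR): number {
  return (anchor.costPerMillion * anchor.utilization) / serverlessPricePerMillion;
}

console.log(costPerMillionAt(0.3).toFixed(2));      // cost at 30% utilization
console.log(breakEvenUtilization(0.88).toFixed(2)); // utilization needed to match $0.88/M
```

Swapping in the reserved-capacity anchor ($0.23/M at 70%) shows how much lower the utilization bar gets once the hourly rate drops.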

The Break-Even Calculator

Let's compute the actual break-even for the most common configuration: Llama 3.3 70B on 2× H100 vs the same model on Together AI's serverless tier. The math depends on three variables: tokens per day, GPU utilization, and operational overhead amortization.

Daily monetary cost — Llama 3.3 70B at three usage levels

Daily output tokens	Together AI serverless	Self-hosted 2×H100 (on-demand)	Self-hosted (3yr reserved)	Break-even verdict
1M tokens (10% utilization)	$0.88	$144 (2×$72)	$67	Serverless wins by 80–160×
10M tokens	$8.80	$144	$67	Serverless wins by 7–16×
25M tokens (50% utilization)	$22.00	$144	$67	Serverless wins by 3–7×
50M tokens	$44.00	$144	$67	Serverless wins by 1.5–3×
100M tokens	$88.00	$288 (4×H100)	$134	Serverless wins by 1.5–3×
500M tokens	$440.00	$1,440 (20×H100)	$670	Serverless wins by 1.5–3×

Reading the table: with GPU count scaling as shown, self-hosted compute works out to about $1.34 per million output tokens on reserved capacity and $2.88 on-demand, so serverless at $0.88 stays cheaper on every row. Self-hosting wins on raw daily compute only when each 2×H100 pair serves more than the table assumes: about 76 million output tokens per day at $67/day reserved, or about 164 million at $144/day on-demand. Below that, serverless is cheaper despite the markup.

But this is the wrong math

Daily compute cost is not TCO. To get the real picture, add operational overhead: an ML platform engineer at a fully loaded $220k/year, an on-call rotation across 3 engineers (each spending ~10% of their time, $66k/year in total), monitoring infrastructure ($24k/year), security audits ($30k/year), and model-update labor (1 person-week per model release × 4 releases per year = $40k/year). Total: $380k/year, or $1,041/day.
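
As a quick check, the line items above can be summed directly. The sketch below just restates the paragraph's figures; the key names are made up, and the on-call line assumes the same $220k fully loaded cost per engineer, which is implied by the $66k figure.

```typescript
// Annual overhead line items in USD, as listed in the paragraph above.
const overheadPerYear: Record<string, number> = {
  mlPlatformEngineer: 220_000,
  onCallRotation: 3 * 22_000, // 3 engineers x ~10% of an assumed $220k fully loaded cost
  monitoring: 24_000,
  securityAudits: 30_000,
  modelUpdateLabor: 40_000,
};

const totalPerYear = Object.values(overheadPerYear).reduce((sum, v) => sum + v, 0);
const perDay = totalPerYear / 365;

console.log(totalPerYear);       // 380000
console.log(Math.round(perDay)); // 1041
```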

Self-hosting TCO including operational overhead

Daily output tokens	Compute cost (3yr reserved)	+ Daily overhead ($1,041)	Total daily cost	vs Serverless
10M	$67	$1,041	$1,108	Serverless wins 126×
25M	$67	$1,041	$1,108	Serverless wins 50×
50M	$67	$1,041	$1,108	Serverless wins 25×
100M	$134	$1,041	$1,175	Serverless wins 13×
250M	$335	$1,041	$1,376	Serverless wins 6×
500M	$670	$1,041	$1,711	Serverless wins 4×
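1B (35% util on 16×H100)	$1,340	$1,041	$2,381	Serverless wins 2.7×
2.5B (44% util on 32×H100)	$3,350	$1,041	$4,391	Serverless wins 2×
10B+ tokens	$13,400+	$1,041	$14,441	Serverless wins 1.6×

Once you include operational overhead, the picture shifts dramatically. At the per-token compute cost these rows imply (about $1.34 per million on reserved capacity), self-hosting never catches $0.88/M serverless, and the fixed $1,041/day dominates at low volume. A crossover only appears if the fleet runs hot enough to reach something like the $0.23/M reserved figure quoted earlier, and even then the overhead is only recovered somewhere above 1 billion daily tokens (30B+ monthly). Below that, which is where 95% of production AI workloads actually sit, serverless is materially cheaper. The list-price comparison badly understates the operational tax.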
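
To see how the fixed overhead interacts with per-token prices, here is a small sketch. It assumes self-hosted compute has a constant marginal cost per million tokens (real fleets scale in steps of whole GPUs), and the function name is invented for this illustration.

```typescript
/**
 * Daily volume (in millions of output tokens) above which self-hosting is cheaper
 * than serverless, or Infinity if it never is.
 */
function breakEvenMillionsPerDay(
  serverlessPerMillion: number, // serverless list price per 1M tokens, e.g. 0.88
  selfHostedPerMillion: number, // assumed marginal compute cost per 1M tokens
  overheadPerDay: number,       // fixed operational overhead per day, e.g. 1041
): number {
  const savingPerMillion = serverlessPerMillion - selfHostedPerMillion;
  // If self-hosted compute alone costs as much as serverless, overhead is never recovered.
  return savingPerMillion > 0 ? overheadPerDay / savingPerMillion : Infinity;
}

// Reserved capacity at high utilization (the $0.23/M figure from earlier in the article)
console.log(breakEvenMillionsPerDay(0.88, 0.23, 1041));
// A fleet whose compute cost per token is above the serverless price
console.log(breakEvenMillionsPerDay(0.88, 1.34, 1041));
```

The result is very sensitive to the per-token compute cost you plug in, so it is worth measuring that number on your own stack before trusting any break-even figure.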

When Closed-Source Still Wins

Even with the cost gap between closed-source and open-source narrowing to 30–50× on serverless, there are workloads where closed-source remains the right answer.

Frontier reasoning where the 4–6 point benchmark gap matters

On the hardest reasoning tasks (multi-step engineering problems, ambiguous customer-support cases, research-level QA) the 4–6 point benchmark gap between open-source and closed-source flagships compounds into noticeably worse outcomes. For an agentic coding assistant trying to land a PR on its first attempt, going from Claude Opus 4.7's 74.5% SWE-bench Verified to DeepSeek V4's 67.8% means roughly 1 in 4 patches requires human intervention (25.5% unresolved) vs about 1 in 3 for the open-source model (32.2% unresolved). At scale, that human-intervention cost dominates the per-token savings.
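
A rough way to size that gap is to convert pass rates into expected human interventions. The sketch below assumes a benchmark pass rate carries over directly to first-attempt success in production, which is a simplification; the function name is hypothetical.

```typescript
/** Expected number of patches (out of `attempts`) that fail and need a human, given a pass rate in percent. */
function expectedInterventions(attempts: number, passRatePercent: number): number {
  return Math.round((attempts * (100 - passRatePercent)) / 100);
}

console.log(expectedInterventions(1000, 74.5)); // 255
console.log(expectedInterventions(1000, 67.8)); // 322
```

Multiplying the difference by your own cost per engineer intervention gives a number you can compare against the per-token savings.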

Compliance and contractual requirements

Enterprises with SOC 2, HIPAA, or PCI-DSS compliance requirements often find closed-source vendors easier to onboard — they ship pre-built compliance attestations and ready-made Data Processing Agreements. Open-source serverless providers are catching up but are often a quarter or two behind on the latest certification cycle. Self-hosted, you own the compliance work entirely. If your sales cycle requires SOC 2 from your AI vendor, closed-source is the fastest path.

Latency-sensitive interactive UX

On short responses where TTFT dominates perceived latency, closed-source flagships still have an edge: GPT-5.4 median TTFT is 278ms vs Together's Llama 3.3 70B at ~340ms and self-hosted at ~280ms (if your stack is well-tuned). The gap is small enough that it does not always justify the cost, but for sub-second chat UX it is worth measuring.

Multimodal and frontier features

Computer Use, native video, real-time voice, image generation tied to the same model — these features arrive in closed-source 6–18 months before they reach open-source. If your product needs these, closed-source is currently the only path.

The Hybrid Architecture That Wins in Practice

The pattern most successful teams have converged on: route most traffic to a serverless open-source model, escalate hard cases to a closed-source flagship, never self-host until forced to. This captures the cost advantage of open-source for the 80% of traffic where the quality gap doesn't matter, the quality advantage of closed-source on the 20% where it does, and zero operational overhead from both. The implementation is straightforward.

```typescript
// hybrid-router.ts: simple example of cost-aware routing
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.RAILWAIL_API_KEY,
  baseURL: "https://railwail.com/v1",
});

type Message = { role: "system" | "user" | "assistant"; content: string };

const CHEAP_MODEL = "qwen-3-235b";
const FLAGSHIP_MODEL = "claude-opus-4-7";

async function complete(model: string, messages: Message[]) {
  const r = await client.chat.completions.create({ model, messages });
  return { text: r.choices[0].message.content ?? "", model };
}

async function route(messages: Message[]) {
  // Heuristic: estimate size at ~4 characters per token and scan the last user turn
  const totalTokens = messages.reduce((s, m) => s + m.content.length / 4, 0);
  const lastUser = messages.findLast((m) => m.role === "user")?.content ?? "";
  const looksHard =
    /multi-step|prove|derive|optimi[sz]e|refactor entire|architect/i.test(lastUser) ||
    totalTokens > 30_000;
  return complete(looksHard ? FLAGSHIP_MODEL : CHEAP_MODEL, messages);
}

// Placeholder quality check (not defined in the original snippet): swap in your own validator
function needsEscalation(text: string): boolean {
  return text.trim().length === 0;
}

// Add a quality-validation second pass
async function withEscalation(messages: Message[]) {
  const first = await route(messages);
  if (first.model !== CHEAP_MODEL || !needsEscalation(first.text)) return first;
  return complete(FLAGSHIP_MODEL, [
    ...messages,
    { role: "assistant", content: first.text },
    { role: "user", content: "Reconsider and make this rigorous." },
  ]);
}
```

This is the dumb version of routing — heuristic rules on message content and length. Smarter versions use a small classifier model to estimate task difficulty, route accordingly, and track quality metrics per route. For most teams, even the dumb version saves 60–80% on the AI bill vs routing everything to a closed-source flagship.
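
The savings range can be sanity-checked with a blended-price calculation. This is a simplified sketch: it uses only the output-token list prices from the comparison table earlier in the article, ignores input tokens, and ignores the double billing that happens when a request is answered by the cheap model and then escalated. The function names are made up.

```typescript
// Output-token list prices per 1M tokens, from the comparison table earlier in the article.
const CHEAP_PER_MILLION = 0.88;    // Llama 3.3 70B on Together AI
const FLAGSHIP_PER_MILLION = 32.0; // GPT-5.4

/** Blended price per 1M output tokens when `escalatedShare` of traffic goes to the flagship. */
function blendedPrice(escalatedShare: number): number {
  return (1 - escalatedShare) * CHEAP_PER_MILLION + escalatedShare * FLAGSHIP_PER_MILLION;
}

/** Fractional saving versus sending everything to the flagship. */
function savingVsFlagshipOnly(escalatedShare: number): number {
  return 1 - blendedPrice(escalatedShare) / FLAGSHIP_PER_MILLION;
}

console.log(savingVsFlagshipOnly(0.2).toFixed(2));  // "0.78"
console.log(savingVsFlagshipOnly(0.35).toFixed(2)); // "0.63"
```

Under these assumptions the saving is driven almost entirely by the escalation share, so the router's false-positive rate on "hard" requests is the number to watch.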

Cost Modeling at Three Realistic Scales

Scale 1 — Startup launching a chat product (100M tokens/month)

Monthly cost at 100M total tokens (60% input, 40% output)

Stack	Monthly cost	Notes
100% GPT-5.4	$1,760	Predictable, no infrastructure work
100% Llama 3.3 70B on Together AI	$56	31× cheaper, quality usually fine
Hybrid: 80% Llama 3.3, 20% Claude	$310	Quality safety net for hard cases
Self-hosted Llama 3.3 70B	$2,008 + overhead	Loses badly at this scale

At startup scale, stay on serverless. Closed-source-only is fine if you have margin to spare and want zero infrastructure work. Pure serverless open-source is the cheapest option; hybrid (mostly open-source, with closed-source escalation) is the cost-optimal default if you want a quality safety net. Self-hosting at this scale loses by 30×+ once you include any operational overhead.

Scale 2 — Growth-stage company (5B tokens/month)

Monthly cost at 5B tokens

Stack	Monthly cost	Notes
100% GPT-5.4	$88,000	Heavy at this scale
100% Claude Opus 4.7	$165,000	Quality premium worth it only for specific workloads
100% Llama 3.3 70B on Together AI	$2,800	31× cheaper than GPT-5.4
Hybrid: 80% Llama 3.3, 20% Claude	$15,440	Sweet spot — quality safety net + cost
Self-hosted Llama 3.3 70B (3yr reserved)	$22,000 + $400k/yr overhead = $55,333/mo	Operational overhead dominates

Pure serverless open-source is the cheapest line in the table, and hybrid serverless is the cheapest path that keeps a closed-source safety net: you pay flagship prices only on the 20% of traffic that needs it. Self-hosting still loses because the operational overhead dwarfs the compute savings. If your business actually needs closed-source quality on 100% of traffic, the answer is to negotiate a volume discount, not to switch architectures.

Scale 3 — Enterprise at scale (50B tokens/month)

Monthly cost at 50B tokens

Stack	Monthly cost	Notes
100% GPT-5.4 (with enterprise discount, ~40% off)	$528,000	Negotiate hard
100% Llama 3.3 70B serverless	$28,000	Likely getting reserved pricing already
Hybrid: 80% Llama 3.3, 20% Claude (discounted)	$129,000	Most workloads land here
Self-hosted Llama 3.3 70B reserved	$220,000 + $400k overhead = $253,000	Self-hosting starts to win on dollar terms
Self-hosted DeepSeek V4 reserved	$330,000 + $400k overhead = $363,000	Higher quality but compute-heavier

At enterprise scale, self-hosting starts to compete on raw dollar terms — particularly if you already have ML platform engineers on staff (in which case the marginal overhead is closer to $200k than $400k). For most enterprises, hybrid serverless is still the right answer because it preserves engineering capacity for product work and avoids GPU-shortage risk. Self-hosting is a strategic decision driven by data residency, IP, or extreme scale rather than by per-token cost.

The Hidden Failure Modes

Serverless: provider lock-in via model lifecycle

Serverless providers deprecate models on their own schedule. Together AI dropped Mixtral 8x7B with 30 days notice in 2025; Fireworks deprecated older Llama variants on a similar cadence. If your production prompts are tuned against a specific model checkpoint, an upstream deprecation forces a re-evaluation cycle. The mitigation is to keep two serverless vendors active (so an EOL on one doesn't break you) and to have a quality regression suite that you can run against any candidate replacement.
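
A regression gate of the kind described above can be quite small. The sketch below is one possible shape, not a prescribed harness: the `Case` and `Outputs` types, the function names, and the 2-point tolerance are all assumptions, and collecting the model outputs is left to whatever client you already use.

```typescript
type Case = { prompt: string; check: (output: string) => boolean };
type Outputs = Record<string, string>; // prompt -> model output, collected separately

/** Fraction of cases whose output passes its check (missing outputs count as failures). */
function passRate(cases: Case[], outputs: Outputs): number {
  if (cases.length === 0) return 0;
  const passed = cases.filter((c) => c.check(outputs[c.prompt] ?? "")).length;
  return passed / cases.length;
}

/** True if the candidate model's pass rate is within `maxDrop` of the current model's. */
function isSafeReplacement(
  cases: Case[],
  baseline: Outputs,
  candidate: Outputs,
  maxDrop = 0.02, // assumed tolerance; tune per product
): boolean {
  return passRate(cases, candidate) >= passRate(cases, baseline) - maxDrop;
}

// Tiny illustrative suite with hand-written outputs
const suite: Case[] = [
  { prompt: "2+2?", check: (o) => o.includes("4") },
  { prompt: "Capital of France?", check: (o) => /paris/i.test(o) },
];
const current: Outputs = { "2+2?": "4", "Capital of France?": "Paris" };
const replacement: Outputs = { "2+2?": "4", "Capital of France?": "Lyon" };
console.log(isSafeReplacement(suite, current, replacement)); // false
```

Running the same suite against both vendors on a schedule means a deprecation notice triggers a comparison you have already automated.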

Self-hosting: GPU-shortage cycles

We have lived through two H100 supply crunches in the past 24 months. Each lasted 6–10 weeks. During each crunch, cloud H100 hourly rates roughly doubled and capacity was queued by region. Closed-source and serverless open-source providers absorbed this; self-hosted teams either paid the 2× rate or queued requests. For products with high availability requirements, the right move is to keep a closed-source vendor warm even if you primarily self-host — a small percentage of traffic to GPT-5.4 in shortage windows lets you ride out crunches without an outage.
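
Keeping a vendor warm usually comes down to an ordered failover list. The following sketch assumes each provider is wrapped in a simple `complete` function (for example around an OpenAI-compatible client); the `Provider` type and `withFailover` name are invented for this illustration, and a production version would likely add timeouts and retry limits.

```typescript
type Provider = { name: string; complete: (prompt: string) => Promise<string> };

/** Tries providers in order and returns the first successful completion. */
async function withFailover(
  providers: Provider[],
  prompt: string,
): Promise<{ provider: string; text: string }> {
  const errors: string[] = [];
  for (const p of providers) {
    try {
      return { provider: p.name, text: await p.complete(prompt) };
    } catch (err) {
      // Record the failure and fall through to the next provider in the list
      errors.push(`${p.name}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`all providers failed: ${errors.join("; ")}`);
}
```

Listing the self-hosted endpoint first and the closed-source API second matches the shortage scenario described above: the fallback only sees traffic when the primary errors out.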

Closed-source: pricing changes you can't control

Closed-source providers have raised prices twice in the past three years (and cut them more often, but the increases hurt budgets). A model that costs you $X today might cost you $1.4X 18 months from now. Open-source pricing has only moved down, and self-hosting locks in your cost basis entirely. For finance-team-facing predictability, open-source is better.

Decision Matrix

What to pick — by your real situation

Your situation	Recommended pattern	Why
Pre-product, validating PMF	100% closed-source (GPT-5.4)	Zero infra work, ship in days
Early product, finding patterns	Hybrid — 80% open-source serverless, 20% closed escalation	Cheap default, quality safety net
Growth-stage, 1-10B tokens/month	Hybrid serverless (multi-vendor for failover)	Cost-optimal, low overhead
Enterprise, 50B+ tokens/month, no data residency requirement	Hybrid serverless with reserved capacity	Vendor negotiation power, still no overhead
Enterprise, regulated industry, on-prem requirement	Self-hosted Qwen 3 / Llama 3.3 + closed-source fallback for non-regulated traffic	Data residency forces self-hosting
Code-agent product needing maximum SWE-bench score	Claude Opus 4.7 primary, DeepSeek V4 for cheap pre-screening	6-point benchmark gap matters for code-agent products
Multilingual content product	Qwen 3 235B (serverless) primary, GPT-5.4 for hard non-English cases	Open-source has parity here at a fraction of cost
Vision-heavy product	Closed-source (Gemini for cost, GPT-5.4 for charts)	Open-source vision is not yet at parity at the flagship size

Operational Checklist Before Switching to Self-Hosting

If you are considering self-hosting, these are the questions to answer before committing capital. We have seen multiple teams skip these and end up with a self-hosted stack costing more than the serverless option they replaced.

**What is your sustained utilization?** Below 50%, self-hosting almost never wins. Measure your actual utilization on a representative serverless workload before extrapolating.
**Who owns the on-call rotation?** Self-hosted production needs 24/7 coverage. If your engineering team is under 20 people, dedicating 3 of them to LLM ops is a real product-velocity tradeoff.
**What is your model-update cadence?** Open-source models release every 3–6 months. Plan for 1 person-week of evaluation + re-deployment per release.
**Have you priced in GPU shortage risk?** Maintain a fallback to closed-source or serverless during shortage cycles. Cost-model this — it can change the break-even calculation.
**Is your tokenizer compatible?** Switching from a closed-source flagship to a self-hosted model usually means a different tokenizer, which means recounting tokens and possibly retuning prompts.
**Does your latency SLA survive cold starts?** Self-hosted models take 4–8 minutes to load. If you cannot keep pods warm, your TTFT becomes random during scaling events.
**What is your compliance posture?** Self-hosting moves all data-handling compliance work to you. Confirm you have the SOC 2 / HIPAA / GDPR audit budget for this before deciding.

Migration Patterns Without Vendor Lock-In

The architectural decision that matters most is not which model you pick; it is whether your code can swap models in one line. The pattern we recommend: write all LLM calls against an OpenAI-compatible interface (the major closed-source providers, the serverless open-source providers, and most self-hosted stacks support this format). When you switch providers, change a `baseURL` and a `model` name. Nothing else.

```typescript
// Single-interface pattern: works across closed and open providers
import OpenAI from "openai";

type Provider = "openai" | "anthropic" | "together" | "fireworks" | "railwail" | "self-hosted";
type Message = { role: "system" | "user" | "assistant"; content: string };

function makeClient(provider: Provider) {
  const config = {
    openai: { url: "https://api.openai.com/v1", key: process.env.OPENAI_API_KEY },
    anthropic: { url: "https://api.anthropic.com/v1", key: process.env.ANTHROPIC_API_KEY },
    together: { url: "https://api.together.xyz/v1", key: process.env.TOGETHER_API_KEY },
    fireworks: { url: "https://api.fireworks.ai/inference/v1", key: process.env.FIREWORKS_API_KEY },
    railwail: { url: "https://railwail.com/v1", key: process.env.RAILWAIL_API_KEY },
    "self-hosted": { url: process.env.SELF_HOSTED_URL!, key: "local" },
  }[provider];
  return new OpenAI({ baseURL: config.url, apiKey: config.key });
}

export async function chat(provider: Provider, model: string, messages: Message[]) {
  return makeClient(provider).chat.completions.create({ model, messages });
}
```

Building against this abstraction from day one means you can re-run the cost decision every quarter as prices move. The cost of building the abstraction is small (≈100 lines of TypeScript); the cost of NOT building it is being locked into whichever vendor you chose first.

What Will Change by End of 2026

**Open-source flagships will close another 2–3 benchmark points** — both DeepSeek and Alibaba have public roadmaps targeting closed-source parity by Q4. This will further tip the cost-quality math toward open-source.
**Closed-source providers will keep cutting prices** — OpenAI cut input prices 3 times in 2025 alone. Expect a 30–40% reduction by year-end.
**Serverless providers will offer tighter SLAs** — Fireworks and Together both signaled enterprise-tier SLA improvements in Q3. This narrows the closed-vs-serverless reliability gap.
**Specialized hardware will arrive** — Groq, SambaNova, and Cerebras all claim 5–10× throughput on open-source models at competitive prices. If the price-performance numbers hold up, expect serverless costs to drop another 50%.

Bottom Line — Start Hybrid, Stay Hybrid

The clearest takeaway from the May 2026 cost math: for the vast majority of production AI workloads, the right architecture is hybrid serverless — most traffic to an open-source model on a low-cost provider, escalation to a closed-source flagship for hard cases. This pattern beats both pure closed-source (cost) and pure self-hosting (overhead) for any workload below ~30 billion tokens per month. Above that scale, self-hosting becomes a strategic question driven by data residency or extreme cost discipline, not a default.

Build behind an OpenAI-compatible abstraction so that switching models is a configuration change, not a code change. Maintain two serverless vendors so that one EOL or outage does not break you. Keep a closed-source flagship warm even if 95% of traffic goes to open-source — the quality safety net is cheap when you only invoke it on 5% of requests. Re-evaluate the cost math quarterly; the numbers will keep moving.

Frequently Asked Questions

When does self-hosting an LLM beat using a serverless API?

On raw compute cost, a 2×H100 pair at $67/day (3-year reserved) has to serve about 76 million output tokens per day to match Together AI's $0.88/M, or about 164 million per day at $144/day on-demand. Once you include operational overhead (ML platform engineer, on-call, monitoring: about $400k/year fully loaded), the break-even moves past 1 billion daily tokens (30 billion monthly), and only if the fleet runs hot enough to keep per-token compute well below the serverless price. Most production workloads sit well below that, so serverless remains cheaper.

How much does it cost to use Llama 3.3 70B via API in 2026?

On serverless providers (May 2026 list pricing): Together AI $0.88/M output tokens, Fireworks $0.90, DeepInfra $0.79. That's roughly 30–35× cheaper than GPT-5.4 ($32/M output) and 80× cheaper than Claude Opus 4.7 ($75/M output).

Which is cheaper — Together AI, Fireworks, or DeepInfra?

DeepInfra is consistently cheapest on list pricing. Fireworks is fastest. Together has the broadest model catalog. For most workloads, pick by price-latency combination on the specific model you need; the gaps are small enough that running A/B between two providers is more informative than picking based on aggregate marketing.

What is the total cost of ownership for self-hosted LLMs?

Beyond GPU cost: ~$220k/year for an ML platform engineer (fully loaded), ~$66k/year for on-call coverage across 3 engineers, ~$24k/year for monitoring, ~$30k/year for security audits, ~$40k/year for model-update labor. Total operational overhead is roughly $380–400k/year before any GPU bill. This is the line item teams systematically underestimate.

Should I use closed-source or open-source LLMs in production?

Both, via a hybrid architecture. Route 80% of traffic to a serverless open-source model (Llama 3.3, Qwen 3, DeepSeek V4) for the bulk of requests where the 4–6 point quality gap doesn't matter. Escalate the hardest 20% to a closed-source flagship (GPT-5.4 or Claude Opus 4.7) where the quality advantage is worth the cost. This pattern beats pure-closed-source on cost and pure-open-source on quality.

Are open-source LLMs good enough for production in 2026?

Yes. Llama 3.3 70B, Qwen 3 235B, and DeepSeek V4 are all within 2–6 percentage points of closed-source flagships on every major benchmark. For 80% of production workloads — chat, summarization, classification, document QA, code generation — they perform indistinguishably at a fraction of the cost. The remaining 20% of workloads (frontier reasoning, agentic engineering at the highest reliability tier, advanced multimodal) still favor closed-source.

How do I avoid vendor lock-in when picking an LLM provider?

Build against an OpenAI-compatible interface (every major closed-source provider, all serverless open-source providers, and most self-hosted stacks support this format). When you switch providers, change a `baseURL` and a `model` string. Nothing else changes. Keep at least two providers active so an EOL or outage on one doesn't break you.

What GPUs do I need to self-host an open-source LLM?

Depends on the model. Llama 3.3 70B at FP8: 2× H100 80GB. Llama 3.1 405B at FP8: 8× H100. Qwen 3 235B at FP8: 4× H100. DeepSeek V4 at FP8: 8× H100. At INT4 quantization (3–5% quality loss), each footprint roughly halves. Cloud rates from Lambda or RunPod range from $3–$24/hour.

What about prompt caching — does that change the math?

Yes, meaningfully. All major closed-source providers and most serverless open-source providers now offer 70–90% discounts on cached input tokens. For agentic workloads with stable system prompts and tool schemas, this brings effective per-token costs down by 30–50%. The math favors providers that have shipped prompt caching at production quality (OpenAI, Anthropic, Fireworks, Together).
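
The effect of a caching discount depends on how much of your input is actually served from cache. A minimal sketch of that relationship, with an invented function name and a placeholder list price of $1.00 per million tokens:

```typescript
/**
 * Effective input price per 1M tokens when `cachedShare` of input tokens
 * hit the cache and cached tokens are discounted by `cacheDiscount`.
 */
function effectiveInputPrice(
  listPricePerMillion: number,
  cachedShare: number,   // 0..1, fraction of input tokens served from cache
  cacheDiscount: number, // 0..1, e.g. 0.8 for an 80% discount on cached tokens
): number {
  return listPricePerMillion * (1 - cachedShare * cacheDiscount);
}

// Placeholder list price of $1.00/M, half the input cached, 80% discount on cached tokens
console.log(effectiveInputPrice(1.0, 0.5, 0.8).toFixed(2)); // "0.60"
```

This only models input tokens; output tokens are not cached, so the overall bill falls by less than the input figure suggests.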

How do I calculate the break-even point for my workload?

Three numbers: (1) your sustained daily output token volume, (2) your target latency (which determines minimum GPU count and utilization), (3) your operational budget including engineering time. Plug these into the table in this article. Generally, below 100M daily tokens stay on serverless; between 100M and 1B do a hybrid serverless approach with possible reserved capacity; above 1B start evaluating self-hosting.
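
The rule of thumb in this answer can be written down as a first-pass filter. The thresholds below are copied from the sentence above; the function and type names are hypothetical, and latency and operational budget still need a separate look.

```typescript
type Recommendation =
  | "serverless"
  | "hybrid serverless (consider reserved capacity)"
  | "evaluate self-hosting";

/** First-pass recommendation from sustained daily output tokens alone. */
function recommend(dailyOutputTokens: number): Recommendation {
  if (dailyOutputTokens < 100e6) return "serverless";
  if (dailyOutputTokens <= 1e9) return "hybrid serverless (consider reserved capacity)";
  return "evaluate self-hosting";
}

console.log(recommend(25e6));  // "serverless"
console.log(recommend(400e6)); // "hybrid serverless (consider reserved capacity)"
console.log(recommend(3e9));   // "evaluate self-hosting"
```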

What are the hidden costs of using closed-source LLMs?

Pricing changes you cannot control (closed-source has raised prices twice in the past three years), vendor-dependent feature roadmaps (a deprecated capability becomes your problem), regional availability constraints, and stricter content policies that may refuse legitimate use cases. The lock-in cost is largely mitigated if you build against an OpenAI-compatible interface and maintain a second provider as warm standby.

When should I worry about GPU shortages affecting self-hosting?

Always have a fallback. We have lived through two H100 supply crunches in 24 months; each lasted 6–10 weeks with cloud rates doubling and capacity queued by region. Keep a closed-source or serverless provider warm even if you primarily self-host. Cost-model the shortage scenario before committing to self-hosting — it can change the break-even calculation.

Build a Hybrid Stack in 5 Minutes

Railwail exposes every major closed-source and serverless open-source model through one OpenAI-compatible endpoint. Route most traffic to Llama 3.3, Qwen 3, or DeepSeek V4 at pass-through serverless prices; escalate to Claude Opus 4.7 or GPT-5.4 for hard cases. Get per-route cost analytics and switch model strings without redeploying code. Start with free credits and check the math against your real workload.

Source: Together AI, serverless inference pricing

Source: Fireworks AI, serverless inference pricing

Source: DeepInfra, serverless inference pricing