Open-Source vs Closed-API LLMs: When Does Self-Hosting Pay Off in 2026?
Comparison

Open-Source vs Closed-API LLMs: When Does Self-Hosting Pay Off in 2026?

The full TCO comparison: serverless API (Together AI, Fireworks, DeepInfra) vs self-hosted H100 clusters vs closed-source flagships. Break-even tokens calculator, hybrid architecture patterns, and the hidden costs that benchmarks ignore.

Anjali Mehtaยท ML Infrastructure Architect23 min readMay 16, 2026

The cost comparison between open-source and closed-source LLMs has become a moving target. Two years ago, the answer was obvious: closed-source for quality, open-source for cost. In 2026, the math is more interesting. Open-source models are now within 2โ€“5 percentage points of closed-source flagships on most benchmarks. Serverless open-source providers have driven per-token prices down by 80% in 18 months. Closed-source providers have responded with their own cost cuts and aggressive caching discounts. The result is a genuinely difficult three-way decision: closed-source API, serverless open-source, or self-hosted open-source.

This guide walks through the actual TCO math at three workload sizes, the operational overhead that benchmark posts always forget to mention, the hybrid architecture patterns that win in practice, and the situations where each option is the right call. All numbers are from May 2026 list pricing, our internal serving-stack benchmarks, and the public model cards.

The Three Architectures

Three deployment patterns compared (May 2026)

PatternExample stackPer-1M-output cost (Llama 3.3 70B)Operational overhead
Closed-source APIGPT-5.4 via OpenAI$32.00โ‰ˆ$0
Serverless open-sourceLlama 3.3 70B via Together AI$0.88โ‰ˆ$0
Self-hosted open-sourceLlama 3.3 70B on 2ร—H100$0.50โ€“$1.20$400k/year fully loaded

On per-token cost alone, the comparison looks brutal: GPT-5.4 lists at $32 per million output tokens; Llama 3.3 70B on Together is $0.88; self-hosted is $0.50โ€“$1.20 depending on utilization. The 35ร— cost gap between closed-source and open-source is real. But raw per-token cost is not the right metric โ€” TCO is. And TCO depends on volume, latency requirements, data sensitivity, and how much engineering time you can spend on infrastructure rather than product.

The Real Cost of Each Pattern

Closed-source API: predictable but expensive

When you call GPT-5.4 or Claude Opus 4.7 via the official API, you pay the per-token price and get everything else included: SLA, scaling, model updates, prompt caching, regional failover, security audits, compliance certifications. There is essentially no operational overhead โ€” the provider takes care of GPU provisioning, version management, and incident response. The cost structure is purely variable: $0 fixed cost, $X per million tokens.

Two underappreciated facts about closed-source pricing. First, the list price is rarely what you pay at scale. Enterprise contracts typically include 30โ€“50% volume discounts plus prompt-caching credits worth another 20โ€“40% effective discount. Second, the closed-source providers absorb the GPU shortage problem entirely โ€” you do not have to worry about H100 availability, regional capacity, or queueing behind other tenants. This is worth real money during shortage cycles (which have happened twice in the past two years).

Serverless open-source: the goldilocks zone

Serverless open-source providers โ€” Together AI, Fireworks, DeepInfra, AnyScale, Replicate โ€” host open-source models and bill per token. You get most of the closed-source operational benefits (no GPU management, instant scaling, OpenAI-compatible API) at a fraction of the cost. The trade-offs: SLAs are weaker, tail latency is higher during demand spikes, model lifecycle is provider-controlled (a model can disappear with 30 days notice), and prompt caching is not yet at parity with the closed-source flagships.

Serverless open-source per-million-output pricing (May 2026)

ModelTogether AIFireworks AIDeepInfra
Llama 3.3 70B Instruct$0.88$0.90$0.79
Llama 3.3 405B Instruct$3.50$3.20$2.95
DeepSeek V4$1.20$1.10$1.10
Qwen 3 235B$0.85$0.79$0.72
Mixtral 8x22B$1.20$1.20$1.05
DeepSeek R2 (reasoning)$2.40$2.20$2.10
Llama 3.3 70B Turbo (FP8)$0.40$0.38$0.35

DeepInfra is consistently cheapest. Fireworks is consistently fastest. Together has the widest model catalog. For most production workloads the right strategy is to start with whichever provider has the model you need at the best price-latency combination, then add a second provider as a hot-standby for failover. Multi-vendor routing also negotiates pricing โ€” providers respond to credible competitive threats.

Self-hosted: capex + opex + soul

Self-hosting is where the cost story gets complicated. Per-token cost looks lower on paper, but to compute the actual TCO you have to account for: GPU capex (or rental), serving stack engineering, model update pipeline, monitoring, on-call coverage, security patching, and the opportunity cost of engineers doing infrastructure work instead of product work. For most teams, the realistic break-even where self-hosting beats serverless is much higher than the naive per-token math suggests.

Self-hosted GPU stack โ€” list prices for common Llama 3.3 70B configurations

ConfigGPU rate (Lambda, on-demand)GPU rate (3yr reserved)Capex (own hardware)
2ร— H100 80GB (FP8)$6.00/hr$2.80/hrโ‰ˆ$120,000
1ร— H100 80GB (INT4)$3.00/hr$1.50/hrโ‰ˆ$60,000
2ร— A100 80GB (FP8)$3.40/hr$1.60/hrโ‰ˆ$60,000
8ร— L40S (cheaper alternative)$3.20/hr$1.50/hrโ‰ˆ$48,000

Below those raw prices, the realistic per-token math gets noisier. At 70% sustained utilization on 2ร—H100 (the typical operating point for a serious production deployment), Llama 3.3 70B costs around $0.50/M output tokens at on-demand cloud rates, $0.23/M with 3-year reserved capacity. At 30% utilization (a more conservative number for bursty workloads), it's $1.18/M on-demand. Compare to Together AI's $0.88/M serverless price โ€” self-hosting only wins above ~50% sustained utilization.

The Break-Even Calculator

Let's compute the actual break-even for the most common configuration: Llama 3.3 70B on 2ร— H100 vs the same model on Together AI's serverless tier. The math depends on three variables: tokens per day, GPU utilization, and operational overhead amortization.

Daily monetary cost โ€” Llama 3.3 70B at three usage levels

Daily output tokensTogether AI serverlessSelf-hosted 2ร—H100 (on-demand)Self-hosted (3yr reserved)Break-even verdict
1M tokens (10% utilization)$0.88$144 (2ร—$72)$67Serverless wins by 80โ€“160ร—
10M tokens$8.80$144$67Serverless wins by 7โ€“16ร—
25M tokens (50% utilization)$22.00$144$67Self-hosted reserved wins
50M tokens$44.00$144$67Self-hosted wins clearly
100M tokens$88.00$288 (4ร—H100)$134Self-hosted wins at scale
500M tokens$440.00$1,440 (20ร—H100)$670Self-hosted wins by 2โ€“4ร—

Reading the table: self-hosting wins on raw daily compute cost starting around 25 million output tokens per day with reserved capacity, or 80โ€“100 million daily tokens with on-demand pricing. Below that, serverless is cheaper despite the markup.

But this is the wrong math

Daily compute cost is not TCO. To get the real picture, add operational overhead: ML platform engineer at fully loaded $220k/year, on-call rotation across 3 engineers (each spending ~10% of time = $66k/year), monitoring infrastructure ($24k/year), security audits ($30k/year), and model-update labor (1 person-week per model release ร— 4 releases per year = $40k/year). Total: $380k/year, or $1,041/day.

Self-hosting TCO including operational overhead

Daily output tokensCompute cost (3yr reserved)+ Daily overhead ($1,041)Total daily costvs Serverless
10M$67$1,041$1,108Serverless wins 126ร—
25M$67$1,041$1,108Serverless wins 50ร—
50M$67$1,041$1,108Serverless wins 25ร—
100M$134$1,041$1,175Serverless wins 13ร—
250M$335$1,041$1,376Serverless wins 6ร—
500M$670$1,041$1,711Serverless wins 4ร—
1B (35% util on 16ร—H100)$1,340$1,041$2,381Self-hosted wins 1.8ร—
2.5B (44% util on 32ร—H100)$3,350$1,041$4,391Self-hosted wins 5ร—
10B+ tokens$13,400+$1,041$14,441Self-hosted wins 6ร—+

Once you include operational overhead, the break-even shifts dramatically. Self-hosting only beats serverless above roughly 1 billion daily tokens (30B monthly). Below that โ€” which is where 95% of production AI workloads actually sit โ€” serverless is materially cheaper. The list-price comparison badly understates the operational tax.

When Closed-Source Still Wins

Even with the cost gap between closed-source and open-source narrowing to 30โ€“50ร— on serverless, there are workloads where closed-source remains the right answer.

Frontier reasoning where the 4-6 point benchmark gap matters

On the hardest reasoning tasks โ€” multi-step engineering problems, ambiguous customer-support cases, research-level QA โ€” the 4โ€“6 point benchmark gap between open-source and closed-source flagships compounds into noticeably worse outcomes. For an agentic coding assistant trying to land a PR on its first attempt, going from Claude Opus 4.7's 74.5% SWE-bench Verified to DeepSeek V4's 67.8% means roughly 1 in 5 patches requires human intervention vs 1 in 4 for the open-source model. At scale, that human-intervention cost dominates the per-token savings.

Compliance and contractual requirements

Enterprises with SOC 2, HIPAA, or PCI-DSS compliance requirements often find closed-source vendors easier to onboard โ€” they ship pre-built compliance attestations and ready-made Data Processing Agreements. Open-source serverless providers are catching up but are often a quarter or two behind on the latest certification cycle. Self-hosted, you own the compliance work entirely. If your sales cycle requires SOC 2 from your AI vendor, closed-source is the fastest path.

Latency-sensitive interactive UX

On short responses where TTFT dominates perceived latency, closed-source flagships still have an edge: GPT-5.4 median TTFT is 278ms vs Together's Llama 3.3 70B at ~340ms and self-hosted at ~280ms (if your stack is well-tuned). The gap is small enough that it does not always justify the cost, but for sub-second chat UX it is worth measuring.

Multimodal and frontier features

Computer Use, native video, real-time voice, image generation tied to the same model โ€” these features arrive in closed-source 6โ€“18 months before they reach open-source. If your product needs these, closed-source is currently the only path.

The Hybrid Architecture That Wins in Practice

The pattern most successful teams have converged on: route most traffic to a serverless open-source model, escalate hard cases to a closed-source flagship, never self-host until forced to. This captures the cost advantage of open-source for the 80% of traffic where the quality gap doesn't matter, the quality advantage of closed-source on the 20% where it does, and zero operational overhead from both. The implementation is straightforward.

// hybrid-router.ts โ€” simple example of cost-aware routing
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.RAILWAIL_API_KEY,
  baseURL: "https://railwail.com/v1",
});

async function route(messages: Array<{ role: string; content: string }>) {
  // Heuristic: pick the right model for the request shape
  const totalTokens = messages.reduce((s, m) => s + m.content.length / 4, 0);
  const lastUser = messages.findLast((m) => m.role === "user")?.content ?? "";
  const looksHard =
    /multi-step|prove|derive|optimi[sz]e|refactor entire|architect/i.test(lastUser) ||
    totalTokens > 30_000;

  const model = looksHard ? "claude-opus-4-7" : "qwen-3-235b";
  const r = await client.chat.completions.create({ model, messages });
  return { text: r.choices[0].message.content ?? "", model };
}

// Add a quality-validation second pass
async function withEscalation(messages: any[]) {
  const first = await route(messages);
  if (first.model === "qwen-3-235b" && needsEscalation(first.text)) {
    return await client.chat.completions.create({
      model: "claude-opus-4-7",
      messages: [...messages, { role: "assistant", content: first.text }, { role: "user", content: "Reconsider โ€” make this rigorous." }],
    });
  }
  return first;
}

This is the dumb version of routing โ€” heuristic rules on message content and length. Smarter versions use a small classifier model to estimate task difficulty, route accordingly, and track quality metrics per route. For most teams, even the dumb version saves 60โ€“80% on the AI bill vs routing everything to a closed-source flagship.

Sponsored

Auto-Routing Across Open and Closed Models

Railwail's router sends cost-sensitive traffic to open-source models and escalates hard cases to closed-source flagships. Set quality thresholds, see per-route cost analytics, switch models without code changes.

Cost Modeling at Three Realistic Scales

Scale 1 โ€” Startup launching a chat product (100M tokens/month)

Monthly cost at 100M total tokens (60% input, 40% output)

StackMonthly costNotes
100% GPT-5.4$1,760Predictable, no infrastructure work
100% Llama 3.3 70B on Together AI$5631ร— cheaper, quality usually fine
Hybrid: 80% Llama 3.3, 20% Claude$310Quality safety net for hard cases
Self-hosted Llama 3.3 70B$2,008 + overheadLoses badly at this scale

At startup scale, stay on serverless. Closed-source-only is fine if you have margin to spare and want zero infrastructure work. Hybrid (mostly open-source, with closed-source escalation) is the cost-optimal default. Self-hosting at this scale loses by 30ร—+ once you include any operational overhead.

Scale 2 โ€” Growth-stage company (5B tokens/month)

Monthly cost at 5B tokens

StackMonthly costNotes
100% GPT-5.4$88,000Heavy at this scale
100% Claude Opus 4.7$165,000Quality premium worth it only for specific workloads
100% Llama 3.3 70B on Together AI$2,80031ร— cheaper than GPT-5.4
Hybrid: 80% Llama 3.3, 20% Claude$15,440Sweet spot โ€” quality safety net + cost
Self-hosted Llama 3.3 70B (3yr reserved)$22,000 + $400k/yr overhead = $55,333/moOperational overhead dominates

Hybrid serverless is dramatically the cheapest path at this scale โ€” closed-source only on the 20% of traffic that needs it. Self-hosting still loses because the operational overhead dwarfs the compute savings. If your business actually needs closed-source quality on 100% of traffic, the answer is to negotiate a volume discount, not to switch architectures.

Scale 3 โ€” Enterprise at scale (50B tokens/month)

Monthly cost at 50B tokens

StackMonthly costNotes
100% GPT-5.4 (with enterprise discount, ~40% off)$528,000Negotiate hard
100% Llama 3.3 70B serverless$28,000Likely getting reserved pricing already
Hybrid: 80% Llama 3.3, 20% Claude (discounted)$129,000Most workloads land here
Self-hosted Llama 3.3 70B reserved$220,000 + $400k overhead = $253,000Self-hosting starts to win on dollar terms
Self-hosted DeepSeek V4 reserved$330,000 + $400k overhead = $363,000Higher quality but compute-heavier

At enterprise scale, self-hosting starts to compete on raw dollar terms โ€” particularly if you already have ML platform engineers on staff (in which case the marginal overhead is closer to $200k than $400k). For most enterprises, hybrid serverless is still the right answer because it preserves engineering capacity for product work and avoids GPU-shortage risk. Self-hosting is a strategic decision driven by data residency, IP, or extreme scale rather than by per-token cost.

The Hidden Failure Modes

Serverless: provider lock-in via model lifecycle

Serverless providers deprecate models on their own schedule. Together AI dropped Mixtral 8x7B with 30 days notice in 2025; Fireworks deprecated older Llama variants on a similar cadence. If your production prompts are tuned against a specific model checkpoint, an upstream deprecation forces a re-evaluation cycle. The mitigation is to keep two serverless vendors active (so an EOL on one doesn't break you) and to have a quality regression suite that you can run against any candidate replacement.

Self-hosting: GPU-shortage cycles

We have lived through two H100 supply crunches in the past 24 months. Each lasted 6โ€“10 weeks. During each crunch, cloud H100 hourly rates roughly doubled and capacity was queued by region. Closed-source and serverless open-source providers absorbed this; self-hosted teams either paid the 2ร— rate or queued requests. For products with high availability requirements, the right move is to keep a closed-source vendor warm even if you primarily self-host โ€” a small percentage of traffic to GPT-5.4 in shortage windows lets you ride out crunches without an outage.

Closed-source: pricing changes you can't control

Closed-source providers have raised prices twice in the past three years (and cut them more often, but the increases hurt budgets). A model that costs you $X today might cost you $1.4X 18 months from now. Open-source pricing has only moved down, and self-hosting locks in your cost basis entirely. For finance-team-facing predictability, open-source is better.

Decision Matrix

What to pick โ€” by your real situation

Your situationRecommended patternWhy
Pre-product, validating PMF100% closed-source (GPT-5.4)Zero infra work, ship in days
Early product, finding patternsHybrid โ€” 80% open-source serverless, 20% closed escalationCheap default, quality safety net
Growth-stage, 1-10B tokens/monthHybrid serverless (multi-vendor for failover)Cost-optimal, low overhead
Enterprise, 50B+ tokens/month, no data residency requirementHybrid serverless with reserved capacityVendor negotiation power, still no overhead
Enterprise, regulated industry, on-prem requirementSelf-hosted Qwen 3 / Llama 3.3 + closed-source fallback for non-regulated trafficData residency forces self-hosting
Code-agent product needing maximum SWE-bench scoreClaude Opus 4.7 primary, DeepSeek V4 for cheap pre-screening6-point benchmark gap matters for code-agent products
Multilingual content productQwen 3 235B (serverless) primary, GPT-5.4 for hard non-English casesOpen-source has parity here at a fraction of cost
Vision-heavy productClosed-source (Gemini for cost, GPT-5.4 for charts)Open-source vision is not yet at parity at the flagship size

Operational Checklist Before Switching to Self-Hosting

If you are considering self-hosting, these are the questions to answer before committing capital. We have seen multiple teams skip these and end up with a self-hosted stack costing more than the serverless option they replaced.

  • **What is your sustained utilization?** Below 50%, self-hosting almost never wins. Measure your actual utilization on a representative serverless workload before extrapolating.
  • **Who owns the on-call rotation?** Self-hosted production needs 24/7 coverage. If your engineering team is under 20 people, dedicating 3 of them to LLM ops is a real product-velocity tradeoff.
  • **What is your model-update cadence?** Open-source models release every 3โ€“6 months. Plan for 1 person-week of evaluation + re-deployment per release.
  • **Have you priced in GPU shortage risk?** Maintain a fallback to closed-source or serverless during shortage cycles. Cost-model this โ€” it can change the break-even calculation.
  • **Is your tokenizer compatible?** Switching from a closed-source flagship to a self-hosted model usually means a different tokenizer, which means recounting tokens and possibly retuning prompts.
  • **Does your latency SLA survive cold starts?** Self-hosted models take 4โ€“8 minutes to load. If you cannot keep pods warm, your TTFT becomes random during scaling events.
  • **What is your compliance posture?** Self-hosting moves all data-handling compliance work to you. Confirm you have the SOC 2 / HIPAA / GDPR audit budget for this before deciding.

Migration Patterns Without Vendor Lock-In

The architectural decision that matters most is not which model you pick โ€” it is whether your code can swap models in one line. The pattern we recommend: write all LLM calls against an OpenAI-compatible interface (every closed-source provider, every serverless open-source provider, and most self-hosted stacks support this format). When you switch providers, change a `baseURL` and a `model` name. Nothing else.

// Single-interface pattern โ€” works across closed and open providers
import OpenAI from "openai";

function makeClient(provider: "openai" | "anthropic" | "together" | "fireworks" | "railwail" | "self-hosted") {
  const config = {
    openai: { url: "https://api.openai.com/v1", key: process.env.OPENAI_API_KEY },
    anthropic: { url: "https://api.anthropic.com/v1", key: process.env.ANTHROPIC_API_KEY },
    together: { url: "https://api.together.xyz/v1", key: process.env.TOGETHER_API_KEY },
    fireworks: { url: "https://api.fireworks.ai/inference/v1", key: process.env.FIREWORKS_API_KEY },
    railwail: { url: "https://railwail.com/v1", key: process.env.RAILWAIL_API_KEY },
    "self-hosted": { url: process.env.SELF_HOSTED_URL!, key: "local" },
  }[provider];
  return new OpenAI({ baseURL: config.url, apiKey: config.key });
}

export async function chat(
  provider: Parameters<typeof makeClient>[0],
  model: string,
  messages: any[],
) {
  return makeClient(provider).chat.completions.create({ model, messages });
}

Building against this abstraction from day one means you can re-run the cost decision every quarter as prices move. The cost of building the abstraction is small (โ‰ˆ100 lines of TypeScript); the cost of NOT building it is being locked into whichever vendor you chose first.

What Will Change by End of 2026

  • **Open-source flagships will close another 2โ€“3 benchmark points** โ€” both DeepSeek and Alibaba have public roadmaps targeting closed-source parity by Q4. This will further tip the cost-quality math toward open-source.
  • **Closed-source providers will keep cutting prices** โ€” OpenAI cut input prices 3 times in 2025 alone. Expect a 30โ€“40% reduction by year-end.
  • **Serverless providers will offer tighter SLAs** โ€” Fireworks and Together both signaled enterprise-tier SLA improvements in Q3. This narrows the closed-vs-serverless reliability gap.
  • **Specialized hardware will arrive** โ€” Groq, SambaNova, and Cerebras all claim 5โ€“10ร— throughput on open-source models at competitive prices. If the price-performance numbers hold up, expect serverless costs to drop another 50%.

Bottom Line โ€” Start Hybrid, Stay Hybrid

The clearest takeaway from the May 2026 cost math: for the vast majority of production AI workloads, the right architecture is hybrid serverless โ€” most traffic to an open-source model on a low-cost provider, escalation to a closed-source flagship for hard cases. This pattern beats both pure closed-source (cost) and pure self-hosting (overhead) for any workload below ~30 billion tokens per month. Above that scale, self-hosting becomes a strategic question driven by data residency or extreme cost discipline, not a default.

Build behind an OpenAI-compatible abstraction so that switching models is a configuration change, not a code change. Maintain two serverless vendors so that one EOL or outage does not break you. Keep a closed-source flagship warm even if 95% of traffic goes to open-source โ€” the quality safety net is cheap when you only invoke it on 5% of requests. Re-evaluate the cost math quarterly; the numbers will keep moving.

Frequently Asked Questions

When does self-hosting an LLM beat using a serverless API?

On raw compute cost, self-hosting wins above roughly 25 million daily output tokens with 3-year reserved capacity, or 80โ€“100 million daily tokens on-demand. But once you include operational overhead (ML platform engineer, on-call, monitoring โ€” about $400k/year fully loaded), the break-even moves to roughly 1 billion daily tokens (30 billion monthly). Most production workloads sit well below that, so serverless remains cheaper.

How much does it cost to use Llama 3.3 70B via API in 2026?

On serverless providers (May 2026 list pricing): Together AI $0.88/M output tokens, Fireworks $0.90, DeepInfra $0.79. That's roughly 30โ€“35ร— cheaper than GPT-5.4 ($32/M output) and 80ร— cheaper than Claude Opus 4.7 ($75/M output).

Which is cheaper โ€” Together AI, Fireworks, or DeepInfra?

DeepInfra is consistently cheapest on list pricing. Fireworks is fastest. Together has the broadest model catalog. For most workloads, pick by price-latency combination on the specific model you need; the gaps are small enough that running A/B between two providers is more informative than picking based on aggregate marketing.

What is the total cost of ownership for self-hosted LLMs?

Beyond GPU cost: ~$220k/year for an ML platform engineer (fully loaded), ~$66k/year for on-call coverage across 3 engineers, ~$24k/year for monitoring, ~$30k/year for security audits, ~$40k/year for model-update labor. Total operational overhead is roughly $380โ€“400k/year before any GPU bill. This is the line item teams systematically underestimate.

Should I use closed-source or open-source LLMs in production?

Both, via a hybrid architecture. Route 80% of traffic to a serverless open-source model (Llama 3.3, Qwen 3, DeepSeek V4) for the bulk of requests where the 4โ€“6 point quality gap doesn't matter. Escalate the hardest 20% to a closed-source flagship (GPT-5.4 or Claude Opus 4.7) where the quality advantage is worth the cost. This pattern beats pure-closed-source on cost and pure-open-source on quality.

Are open-source LLMs good enough for production in 2026?

Yes. Llama 3.3 70B, Qwen 3 235B, and DeepSeek V4 are all within 2โ€“6 percentage points of closed-source flagships on every major benchmark. For 80% of production workloads โ€” chat, summarization, classification, document QA, code generation โ€” they perform indistinguishably at a fraction of the cost. The remaining 20% of workloads (frontier reasoning, agentic engineering at the highest reliability tier, advanced multimodal) still favor closed-source.

How do I avoid vendor lock-in when picking an LLM provider?

Build against an OpenAI-compatible interface (every major closed-source provider, all serverless open-source providers, and most self-hosted stacks support this format). When you switch providers, change a `baseURL` and a `model` string. Nothing else changes. Keep at least two providers active so an EOL or outage on one doesn't break you.

What GPUs do I need to self-host an open-source LLM?

Depends on the model. Llama 3.3 70B at FP8: 2ร— H100 80GB. Llama 3.3 405B at FP8: 8ร— H100. Qwen 3 235B at FP8: 4ร— H100. DeepSeek V4 at FP8: 8ร— H100. At INT4 quantization (3-5% quality loss), each footprint roughly halves. Cloud rates from Lambda or RunPod range from $3โ€“$24/hour.

What about prompt caching โ€” does that change the math?

Yes, meaningfully. All major closed-source providers and most serverless open-source providers now offer 70โ€“90% discounts on cached input tokens. For agentic workloads with stable system prompts and tool schemas, this brings effective per-token costs down by 30โ€“50%. The math favors providers that have shipped prompt caching at production quality (OpenAI, Anthropic, Fireworks, Together).

How do I calculate the break-even point for my workload?

Three numbers: (1) your sustained daily output token volume, (2) your target latency (which determines minimum GPU count and utilization), (3) your operational budget including engineering time. Plug these into the table in this article. Generally, below 100M daily tokens stay on serverless; between 100M and 1B do a hybrid serverless approach with possible reserved capacity; above 1B start evaluating self-hosting.

What are the hidden costs of using closed-source LLMs?

Pricing changes you cannot control (closed-source has raised prices twice in the past three years), vendor-dependent feature roadmaps (a deprecated capability becomes your problem), regional availability constraints, and stricter content policies that may refuse legitimate use cases. The lock-in cost is largely mitigated if you build against an OpenAI-compatible interface and maintain a second provider as warm standby.

When should I worry about GPU shortages affecting self-hosting?

Always have a fallback. We have lived through two H100 supply crunches in 24 months; each lasted 6โ€“10 weeks with cloud rates doubling and capacity queued by region. Keep a closed-source or serverless provider warm even if you primarily self-host. Cost-model the shortage scenario before committing to self-hosting โ€” it can change the break-even calculation.

Build a Hybrid Stack in 5 Minutes

Railwail exposes every major closed-source and serverless open-source model through one OpenAI-compatible endpoint. Route most traffic to Llama 3.3, Qwen 3, or DeepSeek V4 at pass-through serverless prices; escalate to Claude Opus 4.7 or GPT-5.4 for hard cases. Get per-route cost analytics and switch model strings without redeploying code. Start with free credits and check the math against your real workload.

Sponsored

One API. Every Major Model. Hybrid Routing Included.

Route cost-sensitive traffic to open-source, escalate hard cases to closed-source flagships. Per-route cost analytics, no vendor lock-in, free to start.

Anjali Mehta

Anjali Mehta

ML Infrastructure Architect

Built production LLM serving stacks at three Series B companies. Author of the open-source vLLM cost calculator. Speaks regularly at MLOps Community events.

Tags:
Open Source
Self-Hosting
TCO
Break-Even
Llama
DeepSeek
Qwen
Together
Fireworks
2026