The cost comparison between open-source and closed-source LLMs has become a moving target. Two years ago, the answer was obvious: closed-source for quality, open-source for cost. In 2026, the math is more interesting. Open-source models are now within 2โ5 percentage points of closed-source flagships on most benchmarks. Serverless open-source providers have driven per-token prices down by 80% in 18 months. Closed-source providers have responded with their own cost cuts and aggressive caching discounts. The result is a genuinely difficult three-way decision: closed-source API, serverless open-source, or self-hosted open-source.
This guide walks through the actual TCO math at three workload sizes, the operational overhead that benchmark posts always forget to mention, the hybrid architecture patterns that win in practice, and the situations where each option is the right call. All numbers are from May 2026 list pricing, our internal serving-stack benchmarks, and the public model cards.
The Three Architectures
Three deployment patterns compared (May 2026)
| Pattern | Example stack | Per-1M-output cost (Llama 3.3 70B) | Operational overhead |
|---|---|---|---|
| Closed-source API | GPT-5.4 via OpenAI | $32.00 | โ$0 |
| Serverless open-source | Llama 3.3 70B via Together AI | $0.88 | โ$0 |
| Self-hosted open-source | Llama 3.3 70B on 2รH100 | $0.50โ$1.20 | $400k/year fully loaded |
On per-token cost alone, the comparison looks brutal: GPT-5.4 lists at $32 per million output tokens; Llama 3.3 70B on Together is $0.88; self-hosted is $0.50โ$1.20 depending on utilization. The 35ร cost gap between closed-source and open-source is real. But raw per-token cost is not the right metric โ TCO is. And TCO depends on volume, latency requirements, data sensitivity, and how much engineering time you can spend on infrastructure rather than product.
The Real Cost of Each Pattern
Closed-source API: predictable but expensive
When you call GPT-5.4 or Claude Opus 4.7 via the official API, you pay the per-token price and get everything else included: SLA, scaling, model updates, prompt caching, regional failover, security audits, compliance certifications. There is essentially no operational overhead โ the provider takes care of GPU provisioning, version management, and incident response. The cost structure is purely variable: $0 fixed cost, $X per million tokens.
Two underappreciated facts about closed-source pricing. First, the list price is rarely what you pay at scale. Enterprise contracts typically include 30โ50% volume discounts plus prompt-caching credits worth another 20โ40% effective discount. Second, the closed-source providers absorb the GPU shortage problem entirely โ you do not have to worry about H100 availability, regional capacity, or queueing behind other tenants. This is worth real money during shortage cycles (which have happened twice in the past two years).
Serverless open-source: the goldilocks zone
Serverless open-source providers โ Together AI, Fireworks, DeepInfra, AnyScale, Replicate โ host open-source models and bill per token. You get most of the closed-source operational benefits (no GPU management, instant scaling, OpenAI-compatible API) at a fraction of the cost. The trade-offs: SLAs are weaker, tail latency is higher during demand spikes, model lifecycle is provider-controlled (a model can disappear with 30 days notice), and prompt caching is not yet at parity with the closed-source flagships.
Serverless open-source per-million-output pricing (May 2026)
| Model | Together AI | Fireworks AI | DeepInfra |
|---|---|---|---|
| Llama 3.3 70B Instruct | $0.88 | $0.90 | $0.79 |
| Llama 3.3 405B Instruct | $3.50 | $3.20 | $2.95 |
| DeepSeek V4 | $1.20 | $1.10 | $1.10 |
| Qwen 3 235B | $0.85 | $0.79 | $0.72 |
| Mixtral 8x22B | $1.20 | $1.20 | $1.05 |
| DeepSeek R2 (reasoning) | $2.40 | $2.20 | $2.10 |
| Llama 3.3 70B Turbo (FP8) | $0.40 | $0.38 | $0.35 |
DeepInfra is consistently cheapest. Fireworks is consistently fastest. Together has the widest model catalog. For most production workloads the right strategy is to start with whichever provider has the model you need at the best price-latency combination, then add a second provider as a hot-standby for failover. Multi-vendor routing also negotiates pricing โ providers respond to credible competitive threats.
Self-hosted: capex + opex + soul
Self-hosting is where the cost story gets complicated. Per-token cost looks lower on paper, but to compute the actual TCO you have to account for: GPU capex (or rental), serving stack engineering, model update pipeline, monitoring, on-call coverage, security patching, and the opportunity cost of engineers doing infrastructure work instead of product work. For most teams, the realistic break-even where self-hosting beats serverless is much higher than the naive per-token math suggests.
Self-hosted GPU stack โ list prices for common Llama 3.3 70B configurations
| Config | GPU rate (Lambda, on-demand) | GPU rate (3yr reserved) | Capex (own hardware) |
|---|---|---|---|
| 2ร H100 80GB (FP8) | $6.00/hr | $2.80/hr | โ$120,000 |
| 1ร H100 80GB (INT4) | $3.00/hr | $1.50/hr | โ$60,000 |
| 2ร A100 80GB (FP8) | $3.40/hr | $1.60/hr | โ$60,000 |
| 8ร L40S (cheaper alternative) | $3.20/hr | $1.50/hr | โ$48,000 |
Below those raw prices, the realistic per-token math gets noisier. At 70% sustained utilization on 2รH100 (the typical operating point for a serious production deployment), Llama 3.3 70B costs around $0.50/M output tokens at on-demand cloud rates, $0.23/M with 3-year reserved capacity. At 30% utilization (a more conservative number for bursty workloads), it's $1.18/M on-demand. Compare to Together AI's $0.88/M serverless price โ self-hosting only wins above ~50% sustained utilization.
The Break-Even Calculator
Let's compute the actual break-even for the most common configuration: Llama 3.3 70B on 2ร H100 vs the same model on Together AI's serverless tier. The math depends on three variables: tokens per day, GPU utilization, and operational overhead amortization.
Daily monetary cost โ Llama 3.3 70B at three usage levels
| Daily output tokens | Together AI serverless | Self-hosted 2รH100 (on-demand) | Self-hosted (3yr reserved) | Break-even verdict |
|---|---|---|---|---|
| 1M tokens (10% utilization) | $0.88 | $144 (2ร$72) | $67 | Serverless wins by 80โ160ร |
| 10M tokens | $8.80 | $144 | $67 | Serverless wins by 7โ16ร |
| 25M tokens (50% utilization) | $22.00 | $144 | $67 | Self-hosted reserved wins |
| 50M tokens | $44.00 | $144 | $67 | Self-hosted wins clearly |
| 100M tokens | $88.00 | $288 (4รH100) | $134 | Self-hosted wins at scale |
| 500M tokens | $440.00 | $1,440 (20รH100) | $670 | Self-hosted wins by 2โ4ร |
Reading the table: self-hosting wins on raw daily compute cost starting around 25 million output tokens per day with reserved capacity, or 80โ100 million daily tokens with on-demand pricing. Below that, serverless is cheaper despite the markup.
But this is the wrong math
Daily compute cost is not TCO. To get the real picture, add operational overhead: ML platform engineer at fully loaded $220k/year, on-call rotation across 3 engineers (each spending ~10% of time = $66k/year), monitoring infrastructure ($24k/year), security audits ($30k/year), and model-update labor (1 person-week per model release ร 4 releases per year = $40k/year). Total: $380k/year, or $1,041/day.
Self-hosting TCO including operational overhead
| Daily output tokens | Compute cost (3yr reserved) | + Daily overhead ($1,041) | Total daily cost | vs Serverless |
|---|---|---|---|---|
| 10M | $67 | $1,041 | $1,108 | Serverless wins 126ร |
| 25M | $67 | $1,041 | $1,108 | Serverless wins 50ร |
| 50M | $67 | $1,041 | $1,108 | Serverless wins 25ร |
| 100M | $134 | $1,041 | $1,175 | Serverless wins 13ร |
| 250M | $335 | $1,041 | $1,376 | Serverless wins 6ร |
| 500M | $670 | $1,041 | $1,711 | Serverless wins 4ร |
| 1B (35% util on 16รH100) | $1,340 | $1,041 | $2,381 | Self-hosted wins 1.8ร |
| 2.5B (44% util on 32รH100) | $3,350 | $1,041 | $4,391 | Self-hosted wins 5ร |
| 10B+ tokens | $13,400+ | $1,041 | $14,441 | Self-hosted wins 6ร+ |
Once you include operational overhead, the break-even shifts dramatically. Self-hosting only beats serverless above roughly 1 billion daily tokens (30B monthly). Below that โ which is where 95% of production AI workloads actually sit โ serverless is materially cheaper. The list-price comparison badly understates the operational tax.
When Closed-Source Still Wins
Even with the cost gap between closed-source and open-source narrowing to 30โ50ร on serverless, there are workloads where closed-source remains the right answer.
Frontier reasoning where the 4-6 point benchmark gap matters
On the hardest reasoning tasks โ multi-step engineering problems, ambiguous customer-support cases, research-level QA โ the 4โ6 point benchmark gap between open-source and closed-source flagships compounds into noticeably worse outcomes. For an agentic coding assistant trying to land a PR on its first attempt, going from Claude Opus 4.7's 74.5% SWE-bench Verified to DeepSeek V4's 67.8% means roughly 1 in 5 patches requires human intervention vs 1 in 4 for the open-source model. At scale, that human-intervention cost dominates the per-token savings.
Compliance and contractual requirements
Enterprises with SOC 2, HIPAA, or PCI-DSS compliance requirements often find closed-source vendors easier to onboard โ they ship pre-built compliance attestations and ready-made Data Processing Agreements. Open-source serverless providers are catching up but are often a quarter or two behind on the latest certification cycle. Self-hosted, you own the compliance work entirely. If your sales cycle requires SOC 2 from your AI vendor, closed-source is the fastest path.
Latency-sensitive interactive UX
On short responses where TTFT dominates perceived latency, closed-source flagships still have an edge: GPT-5.4 median TTFT is 278ms vs Together's Llama 3.3 70B at ~340ms and self-hosted at ~280ms (if your stack is well-tuned). The gap is small enough that it does not always justify the cost, but for sub-second chat UX it is worth measuring.
Multimodal and frontier features
Computer Use, native video, real-time voice, image generation tied to the same model โ these features arrive in closed-source 6โ18 months before they reach open-source. If your product needs these, closed-source is currently the only path.
The Hybrid Architecture That Wins in Practice
The pattern most successful teams have converged on: route most traffic to a serverless open-source model, escalate hard cases to a closed-source flagship, never self-host until forced to. This captures the cost advantage of open-source for the 80% of traffic where the quality gap doesn't matter, the quality advantage of closed-source on the 20% where it does, and zero operational overhead from both. The implementation is straightforward.
// hybrid-router.ts โ simple example of cost-aware routing
import OpenAI from "openai";
const client = new OpenAI({
apiKey: process.env.RAILWAIL_API_KEY,
baseURL: "https://railwail.com/v1",
});
async function route(messages: Array<{ role: string; content: string }>) {
// Heuristic: pick the right model for the request shape
const totalTokens = messages.reduce((s, m) => s + m.content.length / 4, 0);
const lastUser = messages.findLast((m) => m.role === "user")?.content ?? "";
const looksHard =
/multi-step|prove|derive|optimi[sz]e|refactor entire|architect/i.test(lastUser) ||
totalTokens > 30_000;
const model = looksHard ? "claude-opus-4-7" : "qwen-3-235b";
const r = await client.chat.completions.create({ model, messages });
return { text: r.choices[0].message.content ?? "", model };
}
// Add a quality-validation second pass
async function withEscalation(messages: any[]) {
const first = await route(messages);
if (first.model === "qwen-3-235b" && needsEscalation(first.text)) {
return await client.chat.completions.create({
model: "claude-opus-4-7",
messages: [...messages, { role: "assistant", content: first.text }, { role: "user", content: "Reconsider โ make this rigorous." }],
});
}
return first;
}This is the dumb version of routing โ heuristic rules on message content and length. Smarter versions use a small classifier model to estimate task difficulty, route accordingly, and track quality metrics per route. For most teams, even the dumb version saves 60โ80% on the AI bill vs routing everything to a closed-source flagship.
Sponsored
Auto-Routing Across Open and Closed Models
Railwail's router sends cost-sensitive traffic to open-source models and escalates hard cases to closed-source flagships. Set quality thresholds, see per-route cost analytics, switch models without code changes.
Cost Modeling at Three Realistic Scales
Scale 1 โ Startup launching a chat product (100M tokens/month)
Monthly cost at 100M total tokens (60% input, 40% output)
| Stack | Monthly cost | Notes |
|---|---|---|
| 100% GPT-5.4 | $1,760 | Predictable, no infrastructure work |
| 100% Llama 3.3 70B on Together AI | $56 | 31ร cheaper, quality usually fine |
| Hybrid: 80% Llama 3.3, 20% Claude | $310 | Quality safety net for hard cases |
| Self-hosted Llama 3.3 70B | $2,008 + overhead | Loses badly at this scale |
At startup scale, stay on serverless. Closed-source-only is fine if you have margin to spare and want zero infrastructure work. Hybrid (mostly open-source, with closed-source escalation) is the cost-optimal default. Self-hosting at this scale loses by 30ร+ once you include any operational overhead.
Scale 2 โ Growth-stage company (5B tokens/month)
Monthly cost at 5B tokens
| Stack | Monthly cost | Notes |
|---|---|---|
| 100% GPT-5.4 | $88,000 | Heavy at this scale |
| 100% Claude Opus 4.7 | $165,000 | Quality premium worth it only for specific workloads |
| 100% Llama 3.3 70B on Together AI | $2,800 | 31ร cheaper than GPT-5.4 |
| Hybrid: 80% Llama 3.3, 20% Claude | $15,440 | Sweet spot โ quality safety net + cost |
| Self-hosted Llama 3.3 70B (3yr reserved) | $22,000 + $400k/yr overhead = $55,333/mo | Operational overhead dominates |
Hybrid serverless is dramatically the cheapest path at this scale โ closed-source only on the 20% of traffic that needs it. Self-hosting still loses because the operational overhead dwarfs the compute savings. If your business actually needs closed-source quality on 100% of traffic, the answer is to negotiate a volume discount, not to switch architectures.
Scale 3 โ Enterprise at scale (50B tokens/month)
Monthly cost at 50B tokens
| Stack | Monthly cost | Notes |
|---|---|---|
| 100% GPT-5.4 (with enterprise discount, ~40% off) | $528,000 | Negotiate hard |
| 100% Llama 3.3 70B serverless | $28,000 | Likely getting reserved pricing already |
| Hybrid: 80% Llama 3.3, 20% Claude (discounted) | $129,000 | Most workloads land here |
| Self-hosted Llama 3.3 70B reserved | $220,000 + $400k overhead = $253,000 | Self-hosting starts to win on dollar terms |
| Self-hosted DeepSeek V4 reserved | $330,000 + $400k overhead = $363,000 | Higher quality but compute-heavier |
At enterprise scale, self-hosting starts to compete on raw dollar terms โ particularly if you already have ML platform engineers on staff (in which case the marginal overhead is closer to $200k than $400k). For most enterprises, hybrid serverless is still the right answer because it preserves engineering capacity for product work and avoids GPU-shortage risk. Self-hosting is a strategic decision driven by data residency, IP, or extreme scale rather than by per-token cost.
The Hidden Failure Modes
Serverless: provider lock-in via model lifecycle
Serverless providers deprecate models on their own schedule. Together AI dropped Mixtral 8x7B with 30 days notice in 2025; Fireworks deprecated older Llama variants on a similar cadence. If your production prompts are tuned against a specific model checkpoint, an upstream deprecation forces a re-evaluation cycle. The mitigation is to keep two serverless vendors active (so an EOL on one doesn't break you) and to have a quality regression suite that you can run against any candidate replacement.
Self-hosting: GPU-shortage cycles
We have lived through two H100 supply crunches in the past 24 months. Each lasted 6โ10 weeks. During each crunch, cloud H100 hourly rates roughly doubled and capacity was queued by region. Closed-source and serverless open-source providers absorbed this; self-hosted teams either paid the 2ร rate or queued requests. For products with high availability requirements, the right move is to keep a closed-source vendor warm even if you primarily self-host โ a small percentage of traffic to GPT-5.4 in shortage windows lets you ride out crunches without an outage.
Closed-source: pricing changes you can't control
Closed-source providers have raised prices twice in the past three years (and cut them more often, but the increases hurt budgets). A model that costs you $X today might cost you $1.4X 18 months from now. Open-source pricing has only moved down, and self-hosting locks in your cost basis entirely. For finance-team-facing predictability, open-source is better.
Decision Matrix
What to pick โ by your real situation
| Your situation | Recommended pattern | Why |
|---|---|---|
| Pre-product, validating PMF | 100% closed-source (GPT-5.4) | Zero infra work, ship in days |
| Early product, finding patterns | Hybrid โ 80% open-source serverless, 20% closed escalation | Cheap default, quality safety net |
| Growth-stage, 1-10B tokens/month | Hybrid serverless (multi-vendor for failover) | Cost-optimal, low overhead |
| Enterprise, 50B+ tokens/month, no data residency requirement | Hybrid serverless with reserved capacity | Vendor negotiation power, still no overhead |
| Enterprise, regulated industry, on-prem requirement | Self-hosted Qwen 3 / Llama 3.3 + closed-source fallback for non-regulated traffic | Data residency forces self-hosting |
| Code-agent product needing maximum SWE-bench score | Claude Opus 4.7 primary, DeepSeek V4 for cheap pre-screening | 6-point benchmark gap matters for code-agent products |
| Multilingual content product | Qwen 3 235B (serverless) primary, GPT-5.4 for hard non-English cases | Open-source has parity here at a fraction of cost |
| Vision-heavy product | Closed-source (Gemini for cost, GPT-5.4 for charts) | Open-source vision is not yet at parity at the flagship size |
Operational Checklist Before Switching to Self-Hosting
If you are considering self-hosting, these are the questions to answer before committing capital. We have seen multiple teams skip these and end up with a self-hosted stack costing more than the serverless option they replaced.
- **What is your sustained utilization?** Below 50%, self-hosting almost never wins. Measure your actual utilization on a representative serverless workload before extrapolating.
- **Who owns the on-call rotation?** Self-hosted production needs 24/7 coverage. If your engineering team is under 20 people, dedicating 3 of them to LLM ops is a real product-velocity tradeoff.
- **What is your model-update cadence?** Open-source models release every 3โ6 months. Plan for 1 person-week of evaluation + re-deployment per release.
- **Have you priced in GPU shortage risk?** Maintain a fallback to closed-source or serverless during shortage cycles. Cost-model this โ it can change the break-even calculation.
- **Is your tokenizer compatible?** Switching from a closed-source flagship to a self-hosted model usually means a different tokenizer, which means recounting tokens and possibly retuning prompts.
- **Does your latency SLA survive cold starts?** Self-hosted models take 4โ8 minutes to load. If you cannot keep pods warm, your TTFT becomes random during scaling events.
- **What is your compliance posture?** Self-hosting moves all data-handling compliance work to you. Confirm you have the SOC 2 / HIPAA / GDPR audit budget for this before deciding.
Migration Patterns Without Vendor Lock-In
The architectural decision that matters most is not which model you pick โ it is whether your code can swap models in one line. The pattern we recommend: write all LLM calls against an OpenAI-compatible interface (every closed-source provider, every serverless open-source provider, and most self-hosted stacks support this format). When you switch providers, change a `baseURL` and a `model` name. Nothing else.
// Single-interface pattern โ works across closed and open providers
import OpenAI from "openai";
function makeClient(provider: "openai" | "anthropic" | "together" | "fireworks" | "railwail" | "self-hosted") {
const config = {
openai: { url: "https://api.openai.com/v1", key: process.env.OPENAI_API_KEY },
anthropic: { url: "https://api.anthropic.com/v1", key: process.env.ANTHROPIC_API_KEY },
together: { url: "https://api.together.xyz/v1", key: process.env.TOGETHER_API_KEY },
fireworks: { url: "https://api.fireworks.ai/inference/v1", key: process.env.FIREWORKS_API_KEY },
railwail: { url: "https://railwail.com/v1", key: process.env.RAILWAIL_API_KEY },
"self-hosted": { url: process.env.SELF_HOSTED_URL!, key: "local" },
}[provider];
return new OpenAI({ baseURL: config.url, apiKey: config.key });
}
export async function chat(
provider: Parameters<typeof makeClient>[0],
model: string,
messages: any[],
) {
return makeClient(provider).chat.completions.create({ model, messages });
}Building against this abstraction from day one means you can re-run the cost decision every quarter as prices move. The cost of building the abstraction is small (โ100 lines of TypeScript); the cost of NOT building it is being locked into whichever vendor you chose first.
What Will Change by End of 2026
- **Open-source flagships will close another 2โ3 benchmark points** โ both DeepSeek and Alibaba have public roadmaps targeting closed-source parity by Q4. This will further tip the cost-quality math toward open-source.
- **Closed-source providers will keep cutting prices** โ OpenAI cut input prices 3 times in 2025 alone. Expect a 30โ40% reduction by year-end.
- **Serverless providers will offer tighter SLAs** โ Fireworks and Together both signaled enterprise-tier SLA improvements in Q3. This narrows the closed-vs-serverless reliability gap.
- **Specialized hardware will arrive** โ Groq, SambaNova, and Cerebras all claim 5โ10ร throughput on open-source models at competitive prices. If the price-performance numbers hold up, expect serverless costs to drop another 50%.
Bottom Line โ Start Hybrid, Stay Hybrid
The clearest takeaway from the May 2026 cost math: for the vast majority of production AI workloads, the right architecture is hybrid serverless โ most traffic to an open-source model on a low-cost provider, escalation to a closed-source flagship for hard cases. This pattern beats both pure closed-source (cost) and pure self-hosting (overhead) for any workload below ~30 billion tokens per month. Above that scale, self-hosting becomes a strategic question driven by data residency or extreme cost discipline, not a default.
Build behind an OpenAI-compatible abstraction so that switching models is a configuration change, not a code change. Maintain two serverless vendors so that one EOL or outage does not break you. Keep a closed-source flagship warm even if 95% of traffic goes to open-source โ the quality safety net is cheap when you only invoke it on 5% of requests. Re-evaluate the cost math quarterly; the numbers will keep moving.
Frequently Asked Questions
When does self-hosting an LLM beat using a serverless API?
On raw compute cost, self-hosting wins above roughly 25 million daily output tokens with 3-year reserved capacity, or 80โ100 million daily tokens on-demand. But once you include operational overhead (ML platform engineer, on-call, monitoring โ about $400k/year fully loaded), the break-even moves to roughly 1 billion daily tokens (30 billion monthly). Most production workloads sit well below that, so serverless remains cheaper.
How much does it cost to use Llama 3.3 70B via API in 2026?
On serverless providers (May 2026 list pricing): Together AI $0.88/M output tokens, Fireworks $0.90, DeepInfra $0.79. That's roughly 30โ35ร cheaper than GPT-5.4 ($32/M output) and 80ร cheaper than Claude Opus 4.7 ($75/M output).
Which is cheaper โ Together AI, Fireworks, or DeepInfra?
DeepInfra is consistently cheapest on list pricing. Fireworks is fastest. Together has the broadest model catalog. For most workloads, pick by price-latency combination on the specific model you need; the gaps are small enough that running A/B between two providers is more informative than picking based on aggregate marketing.
What is the total cost of ownership for self-hosted LLMs?
Beyond GPU cost: ~$220k/year for an ML platform engineer (fully loaded), ~$66k/year for on-call coverage across 3 engineers, ~$24k/year for monitoring, ~$30k/year for security audits, ~$40k/year for model-update labor. Total operational overhead is roughly $380โ400k/year before any GPU bill. This is the line item teams systematically underestimate.
Should I use closed-source or open-source LLMs in production?
Both, via a hybrid architecture. Route 80% of traffic to a serverless open-source model (Llama 3.3, Qwen 3, DeepSeek V4) for the bulk of requests where the 4โ6 point quality gap doesn't matter. Escalate the hardest 20% to a closed-source flagship (GPT-5.4 or Claude Opus 4.7) where the quality advantage is worth the cost. This pattern beats pure-closed-source on cost and pure-open-source on quality.
Are open-source LLMs good enough for production in 2026?
Yes. Llama 3.3 70B, Qwen 3 235B, and DeepSeek V4 are all within 2โ6 percentage points of closed-source flagships on every major benchmark. For 80% of production workloads โ chat, summarization, classification, document QA, code generation โ they perform indistinguishably at a fraction of the cost. The remaining 20% of workloads (frontier reasoning, agentic engineering at the highest reliability tier, advanced multimodal) still favor closed-source.
How do I avoid vendor lock-in when picking an LLM provider?
Build against an OpenAI-compatible interface (every major closed-source provider, all serverless open-source providers, and most self-hosted stacks support this format). When you switch providers, change a `baseURL` and a `model` string. Nothing else changes. Keep at least two providers active so an EOL or outage on one doesn't break you.
What GPUs do I need to self-host an open-source LLM?
Depends on the model. Llama 3.3 70B at FP8: 2ร H100 80GB. Llama 3.3 405B at FP8: 8ร H100. Qwen 3 235B at FP8: 4ร H100. DeepSeek V4 at FP8: 8ร H100. At INT4 quantization (3-5% quality loss), each footprint roughly halves. Cloud rates from Lambda or RunPod range from $3โ$24/hour.
What about prompt caching โ does that change the math?
Yes, meaningfully. All major closed-source providers and most serverless open-source providers now offer 70โ90% discounts on cached input tokens. For agentic workloads with stable system prompts and tool schemas, this brings effective per-token costs down by 30โ50%. The math favors providers that have shipped prompt caching at production quality (OpenAI, Anthropic, Fireworks, Together).
How do I calculate the break-even point for my workload?
Three numbers: (1) your sustained daily output token volume, (2) your target latency (which determines minimum GPU count and utilization), (3) your operational budget including engineering time. Plug these into the table in this article. Generally, below 100M daily tokens stay on serverless; between 100M and 1B do a hybrid serverless approach with possible reserved capacity; above 1B start evaluating self-hosting.
What are the hidden costs of using closed-source LLMs?
Pricing changes you cannot control (closed-source has raised prices twice in the past three years), vendor-dependent feature roadmaps (a deprecated capability becomes your problem), regional availability constraints, and stricter content policies that may refuse legitimate use cases. The lock-in cost is largely mitigated if you build against an OpenAI-compatible interface and maintain a second provider as warm standby.
When should I worry about GPU shortages affecting self-hosting?
Always have a fallback. We have lived through two H100 supply crunches in 24 months; each lasted 6โ10 weeks with cloud rates doubling and capacity queued by region. Keep a closed-source or serverless provider warm even if you primarily self-host. Cost-model the shortage scenario before committing to self-hosting โ it can change the break-even calculation.
Build a Hybrid Stack in 5 Minutes
Railwail exposes every major closed-source and serverless open-source model through one OpenAI-compatible endpoint. Route most traffic to Llama 3.3, Qwen 3, or DeepSeek V4 at pass-through serverless prices; escalate to Claude Opus 4.7 or GPT-5.4 for hard cases. Get per-route cost analytics and switch model strings without redeploying code. Start with free credits and check the math against your real workload.
Sponsored
One API. Every Major Model. Hybrid Routing Included.
Route cost-sensitive traffic to open-source, escalate hard cases to closed-source flagships. Per-route cost analytics, no vendor lock-in, free to start.
