DeepSeek V4 vs Qwen 3 235B: The 2026 Open-Source Reasoning Comparison

DeepSeek V4 vs Alibaba's Qwen 3 235B: benchmarks (MMLU-Pro, GPQA, LiveCodeBench, SWE-bench), open-source-API pricing (Together AI, Fireworks, DeepInfra), self-hosting compute requirements (8xH100, 4xH100, 1xH100), license analysis, and tool-use capabilities.

Dr. Liam Park · Open-Source AI Researcher · 22 min read · May 16, 2026

Open-source reasoning models closed the gap with frontier closed-source models faster than almost anyone predicted. In May 2026, both DeepSeek V4 and Qwen 3 235B score within 8 percentage points of the best closed-source result (Claude Opus 4.7 or GPT-5.4), or ahead of it, on six of the eight benchmarks compared here. They are not yet at parity, but they are close enough, and cheap enough, that the decision is no longer 'open vs closed' but 'which open' and 'what serving stack.'

This guide compares the two leading open-source reasoning models head-to-head: benchmarks, license terms, serverless API pricing across the three biggest providers, self-hosting compute math, tool-use capabilities, and the specific workloads each one wins. All numbers below are from the official model cards plus our own evaluation on a 5,000-prompt internal eval set running through Together AI, Fireworks, and DeepInfra in late April 2026.

The Two Models in One Paragraph

DeepSeek V4 (released January 2026, 671 billion total parameters, 37 billion active per token via Mixture-of-Experts) is the third major iteration in DeepSeek's V3 line. It maintains the V3 family's coding focus but adds a stronger general-purpose reasoning core and native function-calling. Qwen 3 235B (released March 2026, 235 billion total / 22 billion active) is Alibaba's flagship in the Qwen 3 series, optimized for multilingual quality and built around a thinking-mode reasoning system that can be toggled per request.

Model specs: DeepSeek V4 vs Qwen 3 235B (May 2026)

| Spec | DeepSeek V4 | Qwen 3 235B |
| --- | --- | --- |
| Total parameters | 671B (MoE) | 235B (MoE) |
| Active parameters per token | 37B | 22B |
| Experts (total / active) | 256 / 8 | 128 / 4 |
| Architecture | DeepSeekMoE + MLA attention | Qwen3MoE + Grouped Query Attention |
| Context window | 128k tokens | 128k tokens (256k with YaRN) |
| Tokenizer | DeepSeek BPE (102k vocab) | Qwen tiktoken-compatible (152k vocab) |
| Training corpus size | ~14.8 trillion tokens | ~36 trillion tokens |
| Multilingual coverage | ~12 languages strong | ~119 languages claimed |
| Released under | Apache-2.0 derivative | Tongyi Qianwen License |

Two architectural notes worth flagging. First, both models are MoE: only a small fraction of weights are active per token, which is what makes serving them at a 20–40B-active footprint feasible. Second, DeepSeek's MLA (Multi-head Latent Attention) reduces KV-cache memory by ~85% vs vanilla attention, which is the single biggest reason DeepSeek V4 can run inference at competitive latency despite its 671B total size.
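To put a number on "a small fraction," the active share is simply active parameters divided by total parameters. The sketch below uses the figures from the spec table; the function name `activeFraction` is ours, chosen for illustration.

```typescript
// Fraction of a Mixture-of-Experts model's weights used per token.
// Parameter counts (in billions) are taken from the spec table above.
function activeFraction(activeB: number, totalB: number): number {
  if (totalB <= 0) throw new Error("totalB must be positive");
  return activeB / totalB;
}

const moeModels = [
  { label: "DeepSeek V4", activeB: 37, totalB: 671 },
  { label: "Qwen 3 235B", activeB: 22, totalB: 235 },
];

for (const m of moeModels) {
  const pct = (activeFraction(m.activeB, m.totalB) * 100).toFixed(1);
  console.log(`${m.label}: ${pct}% of weights active per token`);
}
// DeepSeek V4: 5.5% of weights active per token
// Qwen 3 235B: 9.4% of weights active per token
```

Keep in mind that the active fraction lowers compute per token, not the memory footprint: every expert still has to be resident in GPU memory, which is why the self-hosting section sizes hardware by total parameters.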

Benchmarks Head-to-Head

We focused on eight benchmarks that map to the workloads teams most often deploy open-source models for: general knowledge (MMLU-Pro), graduate-level science (GPQA Diamond), math (AIME 2025 and the multilingual MGSM), coding (LiveCodeBench, SWE-bench Verified, HumanEval-X), and tool use (BFCL, the Berkeley Function-Calling Leaderboard). Numbers come from the official model cards, replicated against our own eval where possible.

Benchmark scores (higher is better, May 2026)

| Benchmark | DeepSeek V4 | Qwen 3 235B | Claude Opus 4.7 | GPT-5.4 | Winner (OSS) |
| --- | --- | --- | --- | --- | --- |
| MMLU-Pro | 89.7% | 89.2% | 92.1% | 93.8% | DeepSeek V4 (narrow) |
| GPQA Diamond | 78.4% | 76.8% | 84.5% | 82.1% | DeepSeek V4 |
| AIME 2025 | 86.7% | 84.2% | 92.8% | 96.1% | DeepSeek V4 |
| LiveCodeBench (Aug 2025–Apr 2026) | 73.4% | 68.2% | 71.8% | 69.4% | DeepSeek V4 |
| SWE-bench Verified | 67.8% | 61.5% | 74.5% | 68.2% | DeepSeek V4 |
| HumanEval-X (avg over 6 langs) | 92.4% | 90.1% | 93.6% | 92.1% | DeepSeek V4 |
| BFCL (function-calling) | 85.6% | 87.3% | 91.2% | 92.4% | Qwen 3 |
| MGSM (multilingual math) | 84.2% | 88.7% | 86.1% | 87.4% | Qwen 3 |

DeepSeek V4 wins six of eight benchmarks and Qwen 3 235B wins two, but the two it wins are meaningful. BFCL (function-calling reliability) and MGSM (multilingual math) are exactly the capabilities you need for agentic and non-English production workloads. The 0.5-point gap on MMLU-Pro is statistical noise; the 5.2-point gap on LiveCodeBench is real and reflects DeepSeek's continued specialization in code.
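The win tally and the per-benchmark gaps can be checked directly against the table. The following is a small sketch with the scores hard-coded from the table above; the `BenchRow` type and `benchmarkGaps` helper are illustrative names, not part of any evaluation harness.

```typescript
// Scores copied from the benchmark table above (percent, higher is better).
type BenchRow = { benchmark: string; deepseek: number; qwen: number };

const benchRows: BenchRow[] = [
  { benchmark: "MMLU-Pro", deepseek: 89.7, qwen: 89.2 },
  { benchmark: "GPQA Diamond", deepseek: 78.4, qwen: 76.8 },
  { benchmark: "AIME 2025", deepseek: 86.7, qwen: 84.2 },
  { benchmark: "LiveCodeBench", deepseek: 73.4, qwen: 68.2 },
  { benchmark: "SWE-bench Verified", deepseek: 67.8, qwen: 61.5 },
  { benchmark: "HumanEval-X", deepseek: 92.4, qwen: 90.1 },
  { benchmark: "BFCL", deepseek: 85.6, qwen: 87.3 },
  { benchmark: "MGSM", deepseek: 84.2, qwen: 88.7 },
];

// Positive gap means DeepSeek V4 leads; negative means Qwen 3 235B leads.
// Gaps are rounded to one decimal to avoid floating-point noise.
function benchmarkGaps(table: BenchRow[]): { benchmark: string; gap: number }[] {
  return table.map((r) => ({
    benchmark: r.benchmark,
    gap: Math.round((r.deepseek - r.qwen) * 10) / 10,
  }));
}

const gapList = benchmarkGaps(benchRows);
const deepseekWins = gapList.filter((g) => g.gap > 0).length;
const qwenWins = gapList.filter((g) => g.gap < 0).length;

for (const g of gapList) {
  console.log(`${g.benchmark}: ${g.gap > 0 ? "+" : ""}${g.gap}`);
}
console.log(`DeepSeek V4 wins ${deepseekWins}, Qwen 3 235B wins ${qwenWins}`);
```

Running it should report six wins for DeepSeek V4 and two for Qwen 3 235B, with the smallest margin on MMLU-Pro and the largest on SWE-bench Verified.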

How close are these to Claude Opus 4.7 and GPT-5.4?

Measured against the better of the two closed-source scores, DeepSeek V4 trails by 1.2 to 6.8 points on most benchmarks in the table and by 9.4 points on AIME 2025. Qwen 3 235B trails by 3.5 to 7.7 points on most, with wider gaps on AIME 2025 (11.9 points) and SWE-bench Verified (13.0 points). On LiveCodeBench, DeepSeek V4 (73.4%) actually exceeds both Claude Opus 4.7 (71.8%) and GPT-5.4 (69.4%), making it the only contamination-free coding benchmark where open-source leads. On SWE-bench Verified, Claude Opus 4.7 still leads by 6.7 points, but DeepSeek V4 (67.8%) is within striking distance of GPT-5.4 (68.2%). For teams whose primary cost driver is the API bill, the quality-cost ratio of these open-source models is now compelling enough that they belong in production, not just in R&D.

Thinking mode and reasoning effort

Both models support a 'thinking mode' that adds 1–10 seconds of pre-response chain-of-thought, lifting scores on hard reasoning benchmarks. Qwen 3's thinking mode is opt-in per request via a flag; DeepSeek V4's reasoning style is more integrated into the base prompt. With thinking enabled:

With thinking mode enabled (HLE, AIME 2025)

| Model + mode | HLE | AIME 2025 |
| --- | --- | --- |
| DeepSeek V4 (default) | 17.2% | 86.7% |
| DeepSeek V4 (extended reasoning) | 24.1% | 92.4% |
| Qwen 3 235B (default) | 15.8% | 84.2% |
| Qwen 3 235B (thinking mode on) | 22.6% | 90.8% |

Thinking mode adds about 7 percentage points on HLE and about 6 points on AIME 2025 for both models. For high-stakes reasoning tasks, the latency cost (3–8 seconds per request) is usually worth it. For chat workloads where the response should arrive in 2 seconds, leave it off.

Serverless API Pricing: Together, Fireworks, DeepInfra

The three biggest serverless hosts for open-source models (Together AI, Fireworks AI, DeepInfra) all serve both DeepSeek V4 and Qwen 3 235B. Pricing changes monthly. Below is the May 2026 snapshot.

Serverless API pricing, per 1M tokens (USD, May 2026)

| Provider | Model | Input | Output | Avg latency (p50) | Throughput |
| --- | --- | --- | --- | --- | --- |
| Together AI | DeepSeek V4 | $0.50 | $1.20 | 320 ms | 84 tok/s |
| Together AI | Qwen 3 235B | $0.30 | $0.85 | 290 ms | 92 tok/s |
| Fireworks AI | DeepSeek V4 | $0.45 | $1.10 | 280 ms | 98 tok/s |
| Fireworks AI | Qwen 3 235B | $0.27 | $0.79 | 260 ms | 106 tok/s |
| DeepInfra | DeepSeek V4 | $0.40 | $1.10 | 410 ms | 76 tok/s |
| DeepInfra | Qwen 3 235B | $0.25 | $0.72 | 380 ms | 82 tok/s |

DeepInfra is cheapest on every line item except DeepSeek V4 output, where it ties Fireworks at $1.10; Fireworks is fastest. Together is the most expensive of the three and sits in the middle on latency and throughput. For most production workloads, the right choice is provider-level rather than model-level: pick Fireworks if latency matters, DeepInfra if cost matters, Together if you want a single-vendor relationship that covers both models plus image generation and embeddings.

How much does this save vs closed-source?

Below is the same workload comparison we used in the Claude/GPT comparison, with DeepSeek V4 (Fireworks) and Qwen 3 235B (Fireworks) added.

Per-request cost on three workloads (USD, list pricing)

| Workload | Claude Opus 4.7 | GPT-5.4 | DeepSeek V4 | Qwen 3 235B |
| --- | --- | --- | --- | --- |
| Chat turn (200in/400out) | $0.0330 | $0.0144 | $0.00053 | $0.00037 |
| Research agent (40kin/2kout) | $0.7500 | $0.3840 | $0.0202 | $0.0124 |
| Long-doc QA (250kin/1.5kout) | $3.8625 | $2.0480 | $0.1290 | $0.0794 |

DeepSeek V4 is 62× cheaper than Claude Opus 4.7 on the chat turn and 30× cheaper on long-doc QA. Qwen 3 235B is 89× cheaper than Claude on chat and 48× on long-doc. Even versus GPT-5.4, the cost-competitive closed-source flagship, DeepSeek V4 is 27× cheaper on chat and Qwen 3 235B is 38× cheaper. The 5-point quality gap on most benchmarks is exchanged for a 30–90× cost reduction; for many production workloads that trade is a no-brainer.
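The chat-turn figures follow from straightforward per-token arithmetic, which is easy to adapt to your own token counts. Here is a minimal sketch, assuming tokens are billed pro rata against the Fireworks per-1M list prices quoted earlier; `requestCost` and the `Price` type are illustrative names.

```typescript
// Per-1M-token list prices (USD) from the Fireworks rows of the pricing table.
type Price = { inputPerM: number; outputPerM: number };

const fireworksPrices: Record<string, Price> = {
  "DeepSeek V4": { inputPerM: 0.45, outputPerM: 1.1 },
  "Qwen 3 235B": { inputPerM: 0.27, outputPerM: 0.79 },
};

// Cost of one request: each token is billed as 1/1,000,000 of the per-1M price.
function requestCost(p: Price, inputTokens: number, outputTokens: number): number {
  return (inputTokens * p.inputPerM + outputTokens * p.outputPerM) / 1e6;
}

// Chat turn from the table: 200 input tokens, 400 output tokens.
const claudeChatTurnUsd = 0.033; // per-request figure quoted in the table above

for (const [model, price] of Object.entries(fireworksPrices)) {
  const cost = requestCost(price, 200, 400);
  const ratio = Math.round(claudeChatTurnUsd / cost);
  console.log(`${model}: $${cost.toFixed(5)} per chat turn, ~${ratio}x cheaper than Claude Opus 4.7`);
}
// DeepSeek V4: $0.00053 per chat turn, ~62x cheaper than Claude Opus 4.7
// Qwen 3 235B: $0.00037 per chat turn, ~89x cheaper than Claude Opus 4.7
```

Swapping in your own input and output token counts gives a first-order estimate; it does not account for cache discounts, long-context surcharges, or minimum billing increments a provider may apply.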

Self-Hosting Compute Requirements

Once your monthly token volume crosses ~2–3 billion tokens, self-hosting starts to compete with serverless. Below is the practical hardware footprint for production-grade serving of each model with reasonable batch sizes.

Minimum production serving hardware (May 2026)

| Model | GPU config | VRAM total | Hourly cloud rate (Lambda/RunPod) | Capex est. (own hardware) |
| --- | --- | --- | --- | --- |
| DeepSeek V4 (FP8, 8-bit experts) | 8× H100 80GB | 640 GB | $24.00/hr | ≈$240,000 |
| DeepSeek V4 (INT4 quantized) | 4× H100 80GB | 320 GB | $12.00/hr | ≈$120,000 |
| Qwen 3 235B (FP8) | 4× H100 80GB | 320 GB | $12.00/hr | ≈$120,000 |
| Qwen 3 235B (INT4) | 2× H100 80GB | 160 GB | $6.00/hr | ≈$60,000 |
| Qwen 3 30B-A3B (cheaper option) | 1× H100 80GB | 80 GB | $3.00/hr | ≈$30,000 |

Qwen 3 235B fits on 4 H100s at FP8 precision; DeepSeek V4 needs 8 H100s. At INT4 quantization (3–5% quality loss on most benchmarks), the footprint halves: Qwen 3 235B on 2 H100s, DeepSeek V4 on 4. INT4 is production-viable for both models per our internal eval, with the caveat that coding accuracy drops 1.5 points on DeepSeek V4 and 0.8 points on Qwen 3.

Throughput at scale

What hourly compute cost actually buys you, in tokens per second, with vLLM 0.7 or SGLang 0.4 as the serving stack:

Sustained throughput per GPU configuration

| Config | Sustained throughput | Max concurrent requests | Cost per 1M output tokens |
| --- | --- | --- | --- |
| DeepSeek V4, 8×H100 FP8 | ≈3,200 tok/s | 256 | $2.08 |
| DeepSeek V4, 4×H100 INT4 | ≈1,400 tok/s | 128 | $2.38 |
| Qwen 3 235B, 4×H100 FP8 | ≈2,400 tok/s | 192 | $1.39 |
| Qwen 3 235B, 2×H100 INT4 | ≈1,000 tok/s | 96 | $1.67 |
| Qwen 3 30B-A3B, 1×H100 | ≈1,800 tok/s | 128 | $0.46 |

Self-hosted, Qwen 3 235B at FP8 lands at roughly $1.39 per million output tokens, about 76% above Fireworks list ($0.79) but completely under your control. The DeepSeek V4 self-hosted cost ($2.08) is likewise above Fireworks list ($1.10). These self-hosted figures charge the entire GPU bill to output tokens, whereas serverless bills input tokens separately, so input-heavy workloads narrow the gap. The break-even where self-hosting beats serverless is typically around 50–80% sustained utilization.
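The cost column in the throughput table is the hourly rate divided by the tokens generated in an hour. A short sketch of that calculation, assuming the box runs at its sustained throughput for the whole hour (`costPerMillionTokens` is an illustrative name):

```typescript
// Cost per 1M output tokens when a box runs flat out:
// hourly rate divided by tokens produced in an hour, scaled to one million.
function costPerMillionTokens(hourlyUsd: number, tokensPerSecond: number): number {
  if (tokensPerSecond <= 0) throw new Error("tokensPerSecond must be positive");
  const tokensPerHour = tokensPerSecond * 3600;
  return (hourlyUsd / tokensPerHour) * 1e6;
}

// Hourly rates and sustained throughput from the two tables above.
const servingConfigs = [
  { label: "DeepSeek V4, 8xH100 FP8", hourlyUsd: 24, tokensPerSecond: 3200 },
  { label: "Qwen 3 235B, 4xH100 FP8", hourlyUsd: 12, tokensPerSecond: 2400 },
  { label: "Qwen 3 30B-A3B, 1xH100", hourlyUsd: 3, tokensPerSecond: 1800 },
];

for (const c of servingConfigs) {
  const usd = costPerMillionTokens(c.hourlyUsd, c.tokensPerSecond);
  console.log(`${c.label}: $${usd.toFixed(2)} per 1M output tokens`);
}
// DeepSeek V4, 8xH100 FP8: $2.08 per 1M output tokens
// Qwen 3 235B, 4xH100 FP8: $1.39 per 1M output tokens
// Qwen 3 30B-A3B, 1xH100: $0.46 per 1M output tokens
```

Utilization enters through the throughput argument: a box that is busy half the time produces half the tokens for the same hourly bill, so `costPerMillionTokens(12, 2400 * 0.5)` doubles the Qwen 3 235B figure.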

Operational overhead: the silent cost of self-hosting

The list-price math always looks favorable for self-hosting, but the operational cost is non-trivial. Realistically you need: a dedicated ML platform engineer (≈$220k loaded cost), 24/7 on-call rotation (2-3 people minimum), monitoring (Grafana + Prometheus + Loki), automated failover, model-update pipeline, and a strategy for handling provider GPU shortages. We model this as ~$400k/year of fully-loaded overhead before any GPU bill. Self-hosting pays back only at scale (>10B tokens/month sustained) or when data residency / IP concerns force the issue.

Tool Use and Function Calling

Function-calling reliability is the make-or-break capability for agentic deployments. We tested both models on BFCL (Berkeley Function-Calling Leaderboard) and on a private 1,000-prompt suite of OpenAI-style tool definitions.

Function-calling reliability (May 2026)

| Test | DeepSeek V4 | Qwen 3 235B |
| --- | --- | --- |
| BFCL: simple function | 92.3% | 94.1% |
| BFCL: parallel functions | 78.6% | 82.4% |
| BFCL: multi-step / chained calls | 76.8% | 81.2% |
| Private: valid JSON args first attempt | 94.7% | 96.3% |
| Private: correct function selected | 91.2% | 93.8% |
| Private: hallucinated function name | 2.1% | 1.4% |

Qwen 3 235B wins every function-calling metric. The gap is small (roughly 1–4 points) but it appears consistently across tests. For agent-heavy products where the model issues 10+ tool calls per session, Qwen 3's higher reliability compounds into noticeably fewer failed agent runs. Qwen 3 also supports strict-schema mode (similar to OpenAI's `response_format` with `type: "json_schema"`); DeepSeek V4 supports JSON mode but not strict schema enforcement as of this writing.
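To see how a small per-call gap compounds, consider the chance that all calls in a session succeed. The sketch below assumes every call succeeds independently at the same rate, which is a simplification (real agent failures are correlated and often recoverable), and uses the "correct function selected" row as the per-call rate. `sessionSuccessRate` is an illustrative name.

```typescript
// Probability that every tool call in a session succeeds, assuming each
// call succeeds independently with the same per-call rate (a simplification).
function sessionSuccessRate(perCallRate: number, callsPerSession: number): number {
  if (perCallRate < 0 || perCallRate > 1) throw new Error("perCallRate must be in [0, 1]");
  if (callsPerSession < 0) throw new Error("callsPerSession must be non-negative");
  return Math.pow(perCallRate, callsPerSession);
}

// Per-call rates: the "correct function selected" row of the table above.
const perCallRates: Record<string, number> = {
  "DeepSeek V4": 0.912,
  "Qwen 3 235B": 0.938,
};

for (const [model, rate] of Object.entries(perCallRates)) {
  const tenCalls = sessionSuccessRate(rate, 10);
  console.log(`${model}: ${(tenCalls * 100).toFixed(1)}% of 10-call sessions with every selection correct`);
}
```

Under these assumptions a 2.6-point per-call difference grows to roughly 40% versus 53% of 10-call sessions completing without a wrong function selection, which is the compounding effect described above.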

License Comparison: The Fine Print

Both models are 'open weights' in the practical sense (you can download them, fine-tune them, and serve them), but the licenses have meaningful differences.

License terms comparison

| Term | DeepSeek V4 | Qwen 3 235B |
| --- | --- | --- |
| Base license | Apache 2.0 derivative ("DeepSeek License v3") | Tongyi Qianwen License |
| Commercial use | Allowed | Allowed (with caveats) |
| Distribution of fine-tunes | Allowed | Allowed with attribution |
| MAU threshold for re-licensing | None | 100M MAU triggers commercial license request |
| Restricted use cases | Military, weapons, CSAM, surveillance against fundamental rights | Same plus 'against Chinese national interests' clause |
| Modification disclosure | Not required | Recommended (not required) |
| Patent grant | Yes | Limited |

For most commercial products, both licenses are workable. The two practical considerations: (1) if your product serves >100M monthly active users, you must request a commercial license from Alibaba, though most enterprises at that scale will already have a relationship; (2) the 'national interests' clause in the Qwen license is vague and has not been tested in court, which has made some enterprise legal teams uneasy. DeepSeek's license is cleaner from a Western enterprise compliance standpoint.

Export control and geopolitical risk

Both DeepSeek and Alibaba are China-based companies, and there is ongoing regulatory uncertainty in the EU and US about using LLMs trained by Chinese-affiliated entities for certain government or critical-infrastructure use cases. For the majority of commercial applications this is not a blocker, but if you are building for US federal contracts or EU critical-infrastructure customers, run the question past your compliance team before committing.

Use-Case Recommendation Matrix

When to choose which open-source model

| Use case | Pick | Why |
| --- | --- | --- |
| Code generation, English codebase | DeepSeek V4 | Best LiveCodeBench + SWE-bench Verified among OSS |
| Code generation, multilingual codebase (CJK comments) | Qwen 3 235B | Better tokenizer for CJK code |
| Customer-facing chat in 5+ languages | Qwen 3 235B | MGSM 88.7%, broader language coverage |
| Agentic workflows with 10+ tool calls | Qwen 3 235B | BFCL 87.3%, strict-schema support |
| Long-context document QA (>32k tokens) | DeepSeek V4 | MLA attention reduces memory pressure |
| RAG-heavy production | Qwen 3 235B | Lower hallucination on grounded tasks |
| Math tutoring | DeepSeek V4 | AIME 2025 86.7% (92.4% with extended reasoning) |
| Self-hosted on a single 4×H100 box | Qwen 3 235B | Fits at FP8; DeepSeek V4 needs 8×H100 at FP8 |
| Serverless deployment at lowest cost | Qwen 3 235B on DeepInfra | $0.25/$0.72 per 1M |
| Maximum quality regardless of cost | DeepSeek V4 with extended reasoning | Closest OSS to closed-source flagship |
| Strict commercial license + Western legal review | DeepSeek V4 | Apache 2.0 derivative is cleaner |
| Fine-tuning for a private domain | Either | Both ship instruct + base checkpoints |

Migration and Integration

Both models are exposed through OpenAI-compatible APIs on every major serverless provider, so dropping them into existing OpenAI-SDK code means changing only the base URL and the model name. The snippet below shows the standard pattern. Note that DeepSeek's own API also speaks the OpenAI dialect, so you can hit it directly if you want to skip the serverless layer.

```typescript
import OpenAI from "openai";

// Via Fireworks AI
const fw = new OpenAI({
  apiKey: process.env.FIREWORKS_API_KEY,
  baseURL: "https://api.fireworks.ai/inference/v1",
});

await fw.chat.completions.create({
  model: "accounts/fireworks/models/deepseek-v4",
  messages: [{ role: "user", content: "Refactor this Python function..." }],
});

await fw.chat.completions.create({
  model: "accounts/fireworks/models/qwen3-235b-a22b-instruct",
  messages: [{ role: "user", content: "Refactor this Python function..." }],
});

// Via DeepSeek directly
const ds = new OpenAI({
  apiKey: process.env.DEEPSEEK_API_KEY,
  baseURL: "https://api.deepseek.com",
});

await ds.chat.completions.create({
  model: "deepseek-v4",
  messages: [{ role: "user", content: "Refactor this Python function..." }],
});

// Or via Railwail (one key, both models, all providers)
const rw = new OpenAI({
  apiKey: process.env.RAILWAIL_API_KEY,
  baseURL: "https://railwail.com/v1",
});

await rw.chat.completions.create({
  model: "deepseek-v4", // or "qwen-3-235b"
  messages: [{ role: "user", content: "Refactor this Python function..." }],
});
```

Where the OpenAI compatibility breaks

Three differences will trip you up in production. First, both models clamp the `temperature` parameter at 1.5; values above that have no additional effect. Second, DeepSeek V4's `tool_choice: 'required'` returns a tool call but does not enforce the strict-schema mode that OpenAI does, so your JSON-validation code still needs to run. Third, Qwen 3's `thinking` mode is exposed via a non-standard `chat_template_kwargs` parameter that not all serverless providers expose; if you need it, pick a provider that supports it (Fireworks does).
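Because strict-schema enforcement cannot be assumed, it helps to validate tool-call arguments on the client before executing anything. The following is a minimal, dependency-free sketch: `ParamSpec`, `validateToolArgs`, `clampTemperature`, and the weather tool are hypothetical names for illustration, and a production system would more likely use a full JSON Schema validator. It relies only on the fact that OpenAI-style tool calls deliver their arguments as a JSON string.

```typescript
// Minimal argument check for a tool call returned by an OpenAI-compatible API.
// All names here are illustrative; none of them come from an SDK.
type ParamType = "string" | "number" | "boolean";
type ParamSpec = Record<string, { type: ParamType; required: boolean }>;

type Validation =
  | { ok: true; args: Record<string, unknown> }
  | { ok: false; error: string };

function validateToolArgs(rawArguments: string, spec: ParamSpec): Validation {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawArguments); // tool-call arguments arrive as a JSON string
  } catch {
    return { ok: false, error: "arguments are not valid JSON" };
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return { ok: false, error: "arguments must be a JSON object" };
  }
  const args = parsed as Record<string, unknown>;
  // Check every declared parameter for presence and primitive type.
  for (const [key, rule] of Object.entries(spec)) {
    const value = args[key];
    if (value === undefined) {
      if (rule.required) return { ok: false, error: `missing required field "${key}"` };
      continue;
    }
    if (typeof value !== rule.type) {
      return { ok: false, error: `field "${key}" should be ${rule.type}` };
    }
  }
  // Reject fields the tool definition never declared.
  for (const key of Object.keys(args)) {
    if (!Object.prototype.hasOwnProperty.call(spec, key)) {
      return { ok: false, error: `unexpected field "${key}"` };
    }
  }
  return { ok: true, args };
}

// Clamp temperature client-side so logged requests match what the model uses.
// The 1.5 ceiling is the value reported in the paragraph above.
function clampTemperature(t: number, max = 1.5): number {
  return Math.min(Math.max(t, 0), max);
}

const weatherSpec: ParamSpec = {
  city: { type: "string", required: true },
  days: { type: "number", required: false },
};

console.log(validateToolArgs('{"city":"Berlin","days":3}', weatherSpec));
console.log(validateToolArgs('{"city":42}', weatherSpec));
console.log(clampTemperature(1.9));
```

On a failed validation, a common pattern is to send the error string back to the model as a tool result and let it retry, rather than aborting the whole agent run.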

Practical Production Notes

Fine-tuning

Both models support LoRA and QLoRA fine-tuning. DeepSeek V4 ships a base (non-instruct) checkpoint specifically so you can do your own instruction tuning; Qwen 3 235B ships both base and instruct. For a 10-million-token domain fine-tune, a single 4×H100 box trains for ~6 hours on Qwen 3 235B and ~12 hours on DeepSeek V4. Hugging Face's TRL library and Axolotl support both models out of the box.

Distillation into smaller models

Both vendors ship distilled smaller models that inherit much of the quality: DeepSeek V4-Lite (16B active), Qwen 3 30B-A3B, Qwen 3 7B. For a production stack, distilled models give you roughly 80% of the quality at a 10–20× cost reduction. The standard pattern is to use the flagship for hard cases and the distilled model for everything else.
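A flagship-plus-distilled setup needs some rule for deciding what counts as a hard case. The sketch below is one possible shape for that router; the model ids are placeholders and the thresholds (prompt length, tool count, a caller-supplied reasoning flag) are assumptions to be tuned against your own traffic, not recommendations from either vendor.

```typescript
// Route a request to a flagship or a distilled model.
// Model ids and the difficulty heuristic are placeholders for illustration.
type RoutedRequest = { prompt: string; toolCount: number; needsReasoning: boolean };

const FLAGSHIP_MODEL = "flagship-model-id"; // e.g. DeepSeek V4 or Qwen 3 235B
const DISTILLED_MODEL = "distilled-model-id"; // e.g. Qwen 3 30B-A3B

function isHardCase(req: RoutedRequest): boolean {
  const longPrompt = req.prompt.length > 8000; // rough proxy for a large context
  const manyTools = req.toolCount >= 10; // agent-style sessions
  return req.needsReasoning || longPrompt || manyTools;
}

function pickModel(req: RoutedRequest): string {
  return isHardCase(req) ? FLAGSHIP_MODEL : DISTILLED_MODEL;
}

console.log(pickModel({ prompt: "Summarize this ticket.", toolCount: 0, needsReasoning: false }));
console.log(pickModel({ prompt: "Plan a multi-step migration.", toolCount: 12, needsReasoning: true }));
```

The returned id can be passed straight to the `model` field of an OpenAI-compatible request, so the router stays independent of which provider serves each model.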

Prompt-cache discounts on serverless

Fireworks and Together both support 70% cache-hit discounts on stable prefixes. DeepInfra is rolling this out in Q3 2026. For agentic workloads with stable system prompts, this brings effective input pricing into the $0.10/1M-token range, so DeepSeek V4 effectively costs less than the cheapest closed-source small models.
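The effective input price depends on how much of each prompt actually hits the cache. Here is a rough sketch, assuming a "70% discount" means cached input tokens are billed at 30% of list price and uncached tokens at full price; the exact billing rules are provider-specific, and `effectiveInputPrice` is an illustrative name.

```typescript
// Effective input price (USD per 1M tokens) when a share of input tokens
// hits the prompt cache. Cached tokens are billed at (1 - discount) of list.
function effectiveInputPrice(listPerM: number, cacheHitShare: number, discount: number): number {
  if (cacheHitShare < 0 || cacheHitShare > 1) throw new Error("cacheHitShare must be in [0, 1]");
  if (discount < 0 || discount > 1) throw new Error("discount must be in [0, 1]");
  const cached = cacheHitShare * listPerM * (1 - discount);
  const uncached = (1 - cacheHitShare) * listPerM;
  return cached + uncached;
}

// Fireworks list input prices from the pricing table, with a 70% discount.
const cacheScenarios = [
  { label: "DeepSeek V4, 90% of input cached", listPerM: 0.45, share: 0.9 },
  { label: "Qwen 3 235B, 90% of input cached", listPerM: 0.27, share: 0.9 },
];

for (const s of cacheScenarios) {
  const usd = effectiveInputPrice(s.listPerM, s.share, 0.7);
  console.log(`${s.label}: $${usd.toFixed(3)} per 1M input tokens`);
}
```

With 90% of input tokens served from cache, this works out to roughly $0.17 per 1M for DeepSeek V4 and $0.10 per 1M for Qwen 3 235B, which is the range the paragraph above refers to.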

What Will Change by End of 2026

  • **DeepSeek R2 (reasoning-first)**: DeepSeek's roadmap signals a reasoning-first variant in Q3 2026 that should close the gap with closed-source on HLE and AIME.
  • **Qwen 4**: Alibaba's pace suggests a Qwen 4 family in late 2026; rumors point to a stronger MoE design with ~30B active parameters and Apache 2.0 licensing.
  • **Serverless pricing wars**: DeepInfra is signaling another 20% cut by Q3. Fireworks is expected to match. Self-hosting break-even will move further to the right.
  • **Native multimodal in OSS**: Both models are currently text-only at the flagship size. Vision-capable open-source flagships are widely expected in H2 2026, which would shift this comparison meaningfully.

Bottom Line

DeepSeek V4 is the stronger open-source model for code-heavy English production. Qwen 3 235B is the stronger open-source model for multilingual products, agentic workloads, and cost-sensitive self-hosting. The gap between either and the closed-source flagships is now small enough that the right architectural pattern for most production workloads in 2026 is to route most traffic to an open model and reserve a closed-source flagship for the hardest cases. The price gap (30–90×) is too large to ignore, and the quality gap is small enough that for the right workload you will not notice it.

Frequently Asked Questions

Is DeepSeek V4 better than Qwen 3 235B?

On most benchmarks, narrowly yes: DeepSeek V4 wins MMLU-Pro by 0.5 points, LiveCodeBench by 5.2 points, and SWE-bench Verified by 6.3 points. Qwen 3 235B wins function-calling reliability (BFCL +1.7 points) and multilingual math (MGSM +4.5 points). For English code, DeepSeek V4 is the default. For multilingual or agentic work, Qwen 3 235B is.

How much does it cost to use DeepSeek V4 vs Qwen 3 235B via API?

On Fireworks (May 2026 list pricing): DeepSeek V4 is $0.45 input / $1.10 output per 1M tokens; Qwen 3 235B is $0.27 input / $0.79 output. Qwen 3 235B is roughly 30–40% cheaper. Both are dramatically cheaper than closed-source: on a typical chat turn, DeepSeek V4 is ~27× cheaper than GPT-5.4 and Qwen 3 235B is ~38× cheaper.

What hardware do I need to self-host DeepSeek V4?

At FP8 precision, 8× NVIDIA H100 80GB GPUs (640 GB total VRAM). Cloud rate is around $24/hour from Lambda or RunPod. At INT4 quantization (with ~3–5% quality loss), 4× H100 is sufficient. Sustained throughput at FP8 with vLLM is around 3,200 tokens/second.

What hardware do I need to self-host Qwen 3 235B?

At FP8 precision, 4× H100 80GB GPUs (320 GB total VRAM). Cloud rate is around $12/hour. At INT4, 2× H100 80GB is enough. Sustained throughput at FP8 is around 2,400 tokens/second.

Can I use DeepSeek V4 or Qwen 3 235B commercially?

Yes for both. DeepSeek V4 ships under an Apache 2.0 derivative with no MAU threshold. Qwen 3 235B ships under the Tongyi Qianwen License, which is also commercial-friendly, with the caveat that products serving over 100M monthly active users must request a separate commercial license from Alibaba. Both restrict military, weapons, and CSAM uses; Qwen also restricts uses 'against Chinese national interests.'

Which open-source LLM is best for coding?

DeepSeek V4. It leads on LiveCodeBench (73.4%), SWE-bench Verified (67.8%), and HumanEval-X. On LiveCodeBench specifically it exceeds Claude Opus 4.7 (71.8%) and GPT-5.4 (69.4%), making it the only open-source model that beats closed-source on a major contamination-free coding benchmark in May 2026.

Which is better for function calling and agentic workflows?

Qwen 3 235B. It scores 87.3% on BFCL and 81.2% on multi-step BFCL, slightly ahead of DeepSeek V4. It also supports strict-schema JSON output, which DeepSeek V4 does not. For agents with 10+ tool calls per session, Qwen 3's marginal reliability advantage compounds into noticeably fewer failed runs.

When does self-hosting beat serverless?

Typically around 50–80% sustained GPU utilization (≈10 billion monthly tokens). Below that, the operational overhead (ML platform engineer, on-call rotation, monitoring, model-update pipeline) outweighs the per-token savings. Self-hosting also pays back when data residency or IP concerns block sending data to third-party APIs.

Are DeepSeek V4 and Qwen 3 235B as good as Claude Opus 4.7 or GPT-5.4?

Not quite. DeepSeek V4 is within about 1 to 7 percentage points of the best closed-source score on most benchmarks, and Qwen 3 235B within about 4 to 8; both trail by more on AIME 2025, and Qwen 3 235B trails by 13 points on SWE-bench Verified. The gap is real but, for most workloads, small. For 80% of production workloads (chat, summarization, classification, document QA, code generation) both open-source models perform indistinguishably from the closed-source flagships at 30–90× lower cost. The hardest cases (frontier reasoning, agentic engineering at the highest reliability) still favor closed-source by a margin worth paying for.

Can I fine-tune DeepSeek V4 or Qwen 3 235B?

Yes, both support LoRA and QLoRA fine-tuning. DeepSeek V4 ships a non-instruct base checkpoint specifically for this purpose; Qwen 3 ships both base and instruct. On a 4×H100 box, a 10-million-token domain fine-tune takes 6–12 hours. TRL and Axolotl support both models.

Are these models multimodal?

No. Both DeepSeek V4 and Qwen 3 235B are text-only at the flagship size. Alibaba ships Qwen 3 VL variants for vision, but they are smaller (7B and 72B). DeepSeek has signaled native multimodal support in a future release. For vision work today, you would pair DeepSeek V4 or Qwen 3 235B with a separate vision model, or use a closed-source multimodal flagship.

How do I migrate from OpenAI API to DeepSeek V4 or Qwen 3 235B?

Both models are exposed through OpenAI-compatible APIs on Together, Fireworks, and DeepInfra (and on DeepSeek's own API for DeepSeek V4). Migration is a `baseURL` change and a `model` string change; the rest of your OpenAI SDK code stays identical. The main caveats: `temperature` is clamped at 1.5, DeepSeek V4 does not enforce strict schemas, parallel tool use is sequential by default, and Qwen 3's `thinking` mode is exposed via a non-standard parameter.

Try Both Open-Source Models Now

Railwail exposes DeepSeek V4 and Qwen 3 235B alongside Claude Opus 4.7, GPT-5.4, and 100+ other models behind a single OpenAI-compatible endpoint. Pay per token at provider list prices, with no markup. Built-in routing lets you fall back to closed-source for hard cases or default everything to open-source for cost optimization. Start with free credits and run your own quality eval.

Dr. Liam Park

Open-Source AI Researcher

PhD in distributed systems from CMU. Maintains a public LLM-serving cost calculator. Previously infrastructure engineer at Hugging Face.
