DeepSeek V4 vs Qwen 3 235B: The 2026 Open-Source Reasoning Comparison

TL;DR: Open-source reasoning, the 2026 state of play

DeepSeek V4 (671B MoE, 37B active) leads on coding (LiveCodeBench 73.4%, SWE-bench Verified 67.8%) and on most reasoning benchmarks, though it costs more than Qwen 3 235B on every serverless API.
Qwen 3 235B (235B MoE, 22B active) leads on multilingual tasks and function calling and matches DeepSeek V4 on MMLU-Pro (89.2% vs 89.7%) while being cheaper to self-host (4×H100 vs 8×H100 at FP8).
Serverless API prices (per 1M tokens, May 2026): Together AI lists DeepSeek V4 at $0.50/$1.20 (input/output), Fireworks at $0.45/$1.10, DeepInfra at $0.40/$1.10. Qwen 3 235B is $0.30/$0.85 on Together, $0.27/$0.79 on Fireworks, and $0.25/$0.72 on DeepInfra.
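Self-hosting break-even: DeepSeek V4 needs 8×H100 (≈$240k capex or $24/hr cloud); Qwen 3 235B fits on 4×H100 (≈$120k or $12/hr). At May 2026 list prices, serverless stays cheaper per token than rented GPUs; self-hosting pays back only at high sustained volume (≈10B tokens/month once operational overhead is counted) or when data residency requires it.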
Licenses: DeepSeek V4 ships under a permissive Apache-2.0 derivative; Qwen 3 under the Tongyi Qianwen license (commercial OK below 100M MAU). Both are usable in commercial products with caveats.
Recommended default: Qwen 3 235B for multilingual products and cost-sensitive self-hosting; DeepSeek V4 for code-heavy workloads and English-first production.

Open-source reasoning models closed the gap with frontier closed-source models faster than almost anyone predicted. In May 2026, the stronger of DeepSeek V4 and Qwen 3 235B scores within single-digit percentage points of Claude Opus 4.7 and GPT-5.4 on every benchmark in this guide, and ahead of both on two of them. They are not yet at parity, but they are close enough, and cheap enough, that the decision is no longer 'open vs closed' but 'which open' and 'what serving stack.'

This guide compares the two leading open-source reasoning models head-to-head: benchmarks, license terms, serverless API pricing across the three biggest providers, self-hosting compute math, tool-use capabilities, and the specific workloads each one wins. All numbers below are from the official model cards plus our own evaluation on a 5,000-prompt internal eval set running through Together AI, Fireworks, and DeepInfra in late April 2026.

The Two Models in One Paragraph

DeepSeek V4 (released January 2026, 671 billion total parameters, 37 billion active per token via Mixture-of-Experts) is the third major iteration in DeepSeek's V3 line. It maintains the V3 family's coding focus but adds a stronger general-purpose reasoning core and native function-calling. Qwen 3 235B (released March 2026, 235 billion total / 22 billion active) is Alibaba's flagship in the Qwen 3 series, optimized for multilingual quality and built around a thinking-mode reasoning system that can be toggled per request.

Model specs — DeepSeek V4 vs Qwen 3 235B (May 2026)

Spec	DeepSeek V4	Qwen 3 235B
Total parameters	671B (MoE)	235B (MoE)
Active parameters per token	37B	22B
Experts (total / active)	256 / 8	128 / 4
Architecture	DeepSeekMoE + MLA attention	Qwen3MoE + Grouped Query Attention
Context window	128k tokens	128k tokens (256k with YaRN)
Tokenizer	DeepSeek BPE (102k vocab)	Qwen tiktoken-compatible (152k vocab)
Training corpus size	~14.8 trillion tokens	~36 trillion tokens
Multilingual coverage	~12 languages strong	~119 languages claimed
Released under	Apache-2.0 derivative	Tongyi Qianwen License

Two architectural notes are worth flagging. First, both models are MoE: only a small fraction of weights is active per token (37B of 671B for DeepSeek V4, 22B of 235B for Qwen 3), so per-token compute is closer to that of a 20–40B dense model, even though all weights must still be resident in GPU memory. Second, DeepSeek's MLA (Multi-head Latent Attention) reduces KV-cache memory by ~85% vs vanilla attention, which is the single biggest reason DeepSeek V4 can run inference at competitive latency despite its 671B total size.
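To make "only a small fraction of weights is active per token" concrete, here is a minimal sketch of generic softmax top-k expert routing for a single token. It is illustrative only: DeepSeek's and Qwen's actual routers differ in details (scoring function, shared experts, load balancing), and the router logits below are made up.

```typescript
// Generic softmax top-k routing for one token (not either model's exact router).
// Given one router logit per expert, keep the k highest-scoring experts and weight
// them by a softmax over the selected logits, which equals renormalizing the full
// softmax over just those k experts. Only the selected experts' FFNs would run.
function routeTopK(
  logits: number[],
  k: number,
): { experts: number[]; weights: number[] } {
  if (k < 1 || k > logits.length) {
    throw new RangeError(`k must be in [1, ${logits.length}]`);
  }
  // Rank expert indices by logit, highest first, and keep the top k.
  const ranked = logits
    .map((logit, index) => ({ logit, index }))
    .sort((a, b) => b.logit - a.logit)
    .slice(0, k);
  // Softmax over the selected logits (subtract the max for numerical stability).
  const maxLogit = ranked[0].logit;
  const exps = ranked.map((r) => Math.exp(r.logit - maxLogit));
  const total = exps.reduce((sum, e) => sum + e, 0);
  return {
    experts: ranked.map((r) => r.index),
    weights: exps.map((e) => e / total),
  };
}

// Hypothetical router logits for 8 experts; route the token to the top 2.
const { experts, weights } = routeTopK([0.1, 2.0, -1.0, 1.0, 0.5, -0.3, 0.0, 1.5], 2);
console.log(experts, weights.map((w) => w.toFixed(3)));
```

The same mechanism at 8 of 256 experts (DeepSeek V4 in the spec table) is why per-token compute tracks active rather than total parameters.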

Benchmarks Head-to-Head

We focused on eight benchmarks that map to the workloads teams most often deploy open-source models for: general knowledge (MMLU-Pro), graduate-level science (GPQA Diamond), competition math (AIME 2025), coding (LiveCodeBench, SWE-bench Verified, HumanEval-X), tool use (BFCL, the Berkeley Function-Calling Leaderboard), and multilingual math (MGSM). Numbers come from the official model cards, replicated against our own eval where possible.

Benchmark scores (higher is better, May 2026)

Benchmark	DeepSeek V4	Qwen 3 235B	Claude Opus 4.7	GPT-5.4	Winner (OSS)
MMLU-Pro	89.7%	89.2%	92.1%	93.8%	DeepSeek V4 (narrow)
GPQA Diamond	78.4%	76.8%	84.5%	82.1%	DeepSeek V4
AIME 2025	86.7%	84.2%	92.8%	96.1%	DeepSeek V4
LiveCodeBench (Aug 2025–Apr 2026)	73.4%	68.2%	71.8%	69.4%	DeepSeek V4
SWE-bench Verified	67.8%	61.5%	74.5%	68.2%	DeepSeek V4
HumanEval-X (avg over 6 langs)	92.4%	90.1%	93.6%	92.1%	DeepSeek V4
BFCL (function-calling)	85.6%	87.3%	91.2%	92.4%	Qwen 3
MGSM (multilingual math)	84.2%	88.7%	86.1%	87.4%	Qwen 3

DeepSeek V4 wins six of eight benchmarks; Qwen 3 235B wins two, but the two it wins are meaningful. BFCL (function-calling reliability) and MGSM (multilingual math) are exactly the capabilities you need for agentic and non-English production workloads. The half-point gap on MMLU-Pro is statistical noise; the 5.2-point gap on LiveCodeBench is real and reflects DeepSeek's continued specialization in code.

How close are these to Claude Opus 4.7 and GPT-5.4?

The stronger of the two open-source models trails the best closed-source flagship by at most 6.7 points on every benchmark except AIME 2025 (9.4 points). On LiveCodeBench, DeepSeek V4 (73.4%) actually exceeds both Claude Opus 4.7 (71.8%) and GPT-5.4 (69.4%), making it the only contamination-free coding benchmark here where open-source leads. On SWE-bench Verified, Claude Opus 4.7 still leads by 6.7 points, but DeepSeek V4 (67.8%) is within striking distance of GPT-5.4 (68.2%). For teams whose primary cost driver is the API bill, the quality-cost ratio of these open-source models is now compelling enough that they belong in production, not just in R&D.

Thinking mode and reasoning effort

Both models support a 'thinking mode' that adds 1–10 seconds of pre-response chain-of-thought, lifting scores on hard reasoning benchmarks. Qwen 3's thinking mode is opt-in per request via a flag; DeepSeek V4's reasoning style is more integrated into the base prompt. Scores with and without it:

Default vs thinking mode (HLE, AIME 2025)

Model + mode	HLE	AIME 2025
DeepSeek V4 (default)	17.2%	86.7%
DeepSeek V4 (extended reasoning)	24.1%	92.4%
Qwen 3 235B (default)	15.8%	84.2%
Qwen 3 235B (thinking mode on)	22.6%	90.8%

Thinking mode adds roughly 7 percentage points on HLE (Humanity's Last Exam) and about 6 points on AIME 2025 for both models. For high-stakes reasoning tasks, the latency cost (3–8 seconds per request) is usually worth it. For chat workloads where the response should arrive within 2 seconds, leave it off.
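The "on for hard reasoning, off for chat" policy can live in the request builder. A minimal sketch, assuming a provider that forwards vLLM-style `chat_template_kwargs` to Qwen 3's chat template, whose switch is `enable_thinking`. The task categories and the model ID are hypothetical placeholders; check your provider's docs for the parameter it actually exposes.

```typescript
// Hypothetical task categories; map them to your own workload taxonomy.
type TaskKind = "chat" | "classification" | "math" | "multi_step_reasoning";

// JSON body for an OpenAI-compatible chat endpoint serving Qwen 3 235B.
interface QwenChatRequest {
  model: string;
  messages: { role: "system" | "user" | "assistant"; content: string }[];
  // Non-standard field: passed through to the chat template by vLLM-style servers.
  chat_template_kwargs: { enable_thinking: boolean };
}

// Thinking mode costs seconds of latency, so reserve it for hard reasoning tasks.
const THINKING_TASKS: ReadonlySet<TaskKind> = new Set<TaskKind>(["math", "multi_step_reasoning"]);

function buildQwenRequest(kind: TaskKind, prompt: string): QwenChatRequest {
  return {
    model: "qwen3-235b-a22b", // placeholder; use your provider's model ID
    messages: [{ role: "user", content: prompt }],
    chat_template_kwargs: { enable_thinking: THINKING_TASKS.has(kind) },
  };
}

console.log(JSON.stringify(buildQwenRequest("chat", "Hi!")));
console.log(JSON.stringify(buildQwenRequest("math", "Solve x^2 = 2.")));
```

Because `chat_template_kwargs` is non-standard, typed SDK clients may not model it, so posting the JSON body directly is the least surprising route.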

Source: DeepSeek V4 official GitHub (weights, code, and model card)

Source: Qwen 3 235B Hugging Face model card

Serverless API Pricing — Together, Fireworks, DeepInfra

The three biggest serverless hosts for open-source models — Together AI, Fireworks AI, DeepInfra — all serve both DeepSeek V4 and Qwen 3 235B. Pricing changes monthly. Below is the May 2026 snapshot.

Serverless API pricing — per 1M tokens (USD, May 2026)

Provider	Model	Input	Output	Avg latency (p50)	Throughput
Together AI	DeepSeek V4	$0.50	$1.20	320 ms	84 tok/s
Together AI	Qwen 3 235B	$0.30	$0.85	290 ms	92 tok/s
Fireworks AI	DeepSeek V4	$0.45	$1.10	280 ms	98 tok/s
Fireworks AI	Qwen 3 235B	$0.27	$0.79	260 ms	106 tok/s
DeepInfra	DeepSeek V4	$0.40	$1.10	410 ms	76 tok/s
DeepInfra	Qwen 3 235B	$0.25	$0.72	380 ms	82 tok/s

DeepInfra is consistently the cheapest and Fireworks the fastest; Together is the most expensive of the three and sits between them on latency and throughput. For most production workloads, the right choice is provider-level rather than model-level: pick Fireworks if latency matters, DeepInfra if cost matters, and Together if you want a single-vendor relationship that covers both models plus image generation and embeddings.
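The provider-level choice can be encoded directly. A small sketch using the Qwen 3 235B rows of the May 2026 pricing table; the blended-cost metric and the 70/30 input/output token mix are assumptions, so plug in your own traffic profile.

```typescript
interface ProviderQuote {
  provider: string;
  inputPerM: number;  // USD per 1M input tokens
  outputPerM: number; // USD per 1M output tokens
  p50LatencyMs: number;
}

// Qwen 3 235B rows from the May 2026 pricing table.
const qwenQuotes: ProviderQuote[] = [
  { provider: "Together AI", inputPerM: 0.3, outputPerM: 0.85, p50LatencyMs: 290 },
  { provider: "Fireworks AI", inputPerM: 0.27, outputPerM: 0.79, p50LatencyMs: 260 },
  { provider: "DeepInfra", inputPerM: 0.25, outputPerM: 0.72, p50LatencyMs: 380 },
];

// Blended USD per 1M tokens, given the share of traffic that is input tokens (0..1).
function blendedPerM(q: ProviderQuote, inputShare: number): number {
  return inputShare * q.inputPerM + (1 - inputShare) * q.outputPerM;
}

// Return the provider that minimizes the chosen objective.
function pickProvider(
  quotes: ProviderQuote[],
  objective: "cost" | "latency",
  inputShare = 0.7, // assumed traffic mix: 70% input tokens
): ProviderQuote {
  const score = (q: ProviderQuote) =>
    objective === "cost" ? blendedPerM(q, inputShare) : q.p50LatencyMs;
  return quotes.reduce((best, q) => (score(q) < score(best) ? q : best));
}

console.log(pickProvider(qwenQuotes, "cost").provider);
console.log(pickProvider(qwenQuotes, "latency").provider);
```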

How much does this save vs closed-source?

Below is the same workload comparison we used in the Claude/GPT comparison, with DeepSeek V4 (Fireworks) and Qwen 3 235B (Fireworks) added.

Per-request cost on three workloads (USD, list pricing)

Workload	Claude Opus 4.7	GPT-5.4	DeepSeek V4	Qwen 3 235B
Chat turn (200 in / 400 out tokens)	$0.0330	$0.0144	$0.00053	$0.00037
Research agent (40k in / 2k out)	$0.7500	$0.3840	$0.0202	$0.0124
Long-doc QA (250k in / 1.5k out)	$3.8625	$2.0480	$0.1142	$0.0687

DeepSeek V4 is 62× cheaper than Claude Opus 4.7 on the chat turn and 34× cheaper on long-doc QA. Qwen 3 235B is 89× cheaper than Claude on chat and 56× on long-doc. Even versus GPT-5.4, the cost-competitive closed-source flagship, DeepSeek V4 is 27× cheaper on chat and Qwen 3 235B is 39× cheaper. Across these workloads, a quality gap of a few points on most benchmarks buys a 30–90× cost reduction versus Claude Opus 4.7 and roughly 18–39× versus GPT-5.4; for many production workloads that trade is a no-brainer. One practical caveat: 250k input tokens exceeds DeepSeek V4's 128k context window (and Qwen 3 235B's native 128k, which YaRN extends to 256k), so on DeepSeek V4 the long-doc workload needs chunking or retrieval.
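Each cell is (input tokens × input price + output tokens × output price) / 1,000,000. A short sketch of that arithmetic using the Fireworks list prices from the pricing table; it ignores prompt-cache discounts and any long-context surcharge a provider may apply.

```typescript
// List prices in USD per 1M tokens (Fireworks, May 2026 pricing table).
interface Price { inputPerM: number; outputPerM: number }

const FIREWORKS_DEEPSEEK_V4: Price = { inputPerM: 0.45, outputPerM: 1.1 };
const FIREWORKS_QWEN3_235B: Price = { inputPerM: 0.27, outputPerM: 0.79 };

// Cost of one request in USD: each direction's tokens times its per-million price.
function requestCostUSD(p: Price, inputTokens: number, outputTokens: number): number {
  return (inputTokens * p.inputPerM + outputTokens * p.outputPerM) / 1_000_000;
}

const workloads = [
  { name: "Chat turn", input: 200, output: 400 },
  { name: "Research agent", input: 40_000, output: 2_000 },
  { name: "Long-doc QA", input: 250_000, output: 1_500 },
];

for (const w of workloads) {
  const ds = requestCostUSD(FIREWORKS_DEEPSEEK_V4, w.input, w.output);
  const qw = requestCostUSD(FIREWORKS_QWEN3_235B, w.input, w.output);
  console.log(`${w.name}: DeepSeek V4 $${ds.toFixed(5)}, Qwen 3 235B $${qw.toFixed(5)}`);
}
```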

Open-Source Models Through the Same API as GPT and Claude

Access DeepSeek V4, Qwen 3 235B, Llama 3.3, and 40+ other open-source models through Railwail's OpenAI-compatible endpoint. One API key, no vendor lock-in.


Self-Hosting Compute Requirements

Once your monthly token volume crosses ~2–3 billion tokens, self-hosting starts to compete with serverless. Below is the practical hardware footprint for production-grade serving of each model with reasonable batch sizes.

Minimum production serving hardware (May 2026)

Model	GPU config	VRAM total	Hourly cloud rate (Lambda/RunPod)	Capex est. (own hardware)
DeepSeek V4 (FP8, 8-bit experts)	8× H100 80GB	640 GB	$24.00/hr	≈$240,000
DeepSeek V4 (INT4 quantized)	4× H100 80GB	320 GB	$12.00/hr	≈$120,000
Qwen 3 235B (FP8)	4× H100 80GB	320 GB	$12.00/hr	≈$120,000
Qwen 3 235B (INT4)	2× H100 80GB	160 GB	$6.00/hr	≈$60,000
Qwen 3 30B-A3B (cheaper option)	1× H100 80GB	80 GB	$3.00/hr	≈$30,000

Qwen 3 235B fits on 4 H100s at FP8 precision; DeepSeek V4 needs 8 H100s. At INT4 quantization (3–5% quality loss on most benchmarks), the footprint halves — Qwen 3 235B on 2 H100s, DeepSeek V4 on 4. INT4 is production-viable for both models per our internal eval, with the caveat that coding accuracy drops 1.5 points on DeepSeek V4 and 0.8 points on Qwen 3.
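The "footprint halves" rule follows from weight size scaling with bits per parameter. A rough weight-only check, assuming decimal gigabytes and ignoring KV cache, activations, and quantization-scale overhead, so treat every figure as a lower bound:

```typescript
// Weight-only memory: 1B parameters at 8 bits is 1 GB (decimal), so GB = billions × bits / 8.
function weightMemoryGB(paramsBillions: number, bitsPerParam: number): number {
  return (paramsBillions * bitsPerParam) / 8;
}

const H100_GB = 80;

const configs = [
  { name: "DeepSeek V4 FP8", params: 671, bits: 8, gpus: 8 },
  { name: "DeepSeek V4 INT4", params: 671, bits: 4, gpus: 4 },
  { name: "Qwen 3 235B FP8", params: 235, bits: 8, gpus: 4 },
  { name: "Qwen 3 235B INT4", params: 235, bits: 4, gpus: 2 },
];

for (const c of configs) {
  const weights = weightMemoryGB(c.params, c.bits);
  const vram = c.gpus * H100_GB;
  // Headroom left for KV cache and activations; negative means weights alone do not fit.
  console.log(`${c.name}: weights ~${weights.toFixed(1)} GB of ${vram} GB, headroom ${(vram - weights).toFixed(1)} GB`);
}
```

On this arithmetic Qwen 3 235B leaves about 85 GB (FP8) and 42 GB (INT4) of headroom for KV cache, while DeepSeek V4's 671B parameters come to about 671 GB at 8 bits and 336 GB at 4 bits, slightly more than the 640 GB and 320 GB in the table. The DeepSeek rows therefore imply extra compression or offloading beyond plain 8-bit or 4-bit weights; confirm the fit on your own serving stack before buying hardware.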

Throughput at scale

What hourly compute cost actually buys you, in tokens per second, with vLLM 0.7 or SGLang 0.4 as the serving stack:

Sustained throughput per GPU configuration

Config	Sustained throughput	Max concurrent requests	Cost per 1M output tokens (at 100% utilization)
DeepSeek V4, 8×H100 FP8	≈3,200 tok/s	256	$2.08
DeepSeek V4, 4×H100 INT4	≈1,400 tok/s	128	$2.38
Qwen 3 235B, 4×H100 FP8	≈2,400 tok/s	192	$1.39
Qwen 3 235B, 2×H100 INT4	≈1,000 tok/s	96	$1.67
Qwen 3 30B-A3B, 1×H100	≈1,800 tok/s	128	$0.46

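Self-hosted at full utilization, Qwen 3 235B at FP8 lands at roughly $1.39 per million output tokens, about 75% more than Fireworks list ($0.79), in exchange for full control over the deployment. DeepSeek V4 at FP8 ($2.08) costs nearly twice Fireworks list ($1.10). These figures assume 100% utilization, and per-token cost rises in inverse proportion as utilization falls (at 50% it doubles), so at the rented-GPU rates above serverless list stays cheaper per output token. Where self-hosting does win on unit cost is with owned hardware amortized over its service life; there, the break-even where self-hosting beats serverless is typically around 50–80% sustained utilization.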
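The cost column of the throughput table is the hourly rate divided by hourly output at full load. A minimal sketch that adds a utilization factor and an owned-hardware option; the 3-year amortization period is an assumption, and the owned figure excludes power, hosting, and the operational overhead covered below.

```typescript
// USD per 1M output tokens for a self-hosted config.
// hourlyUSD: rented rate, or owned capex amortized per hour.
// tokensPerSec: sustained throughput at full load; utilization: fraction of time at that load.
function costPerMOutput(hourlyUSD: number, tokensPerSec: number, utilization = 1): number {
  const millionTokensPerHour = (tokensPerSec * 3600 * utilization) / 1e6;
  return hourlyUSD / millionTokensPerHour;
}

// Spread hardware capex evenly over an assumed service life (power and staff excluded).
function amortizedHourly(capexUSD: number, years: number): number {
  return capexUSD / (years * 365 * 24);
}

// Utilization at which self-hosting matches a serverless per-token price.
// A result above 1 means it cannot match that price at any utilization.
function breakEvenUtilization(hourlyUSD: number, tokensPerSec: number, serverlessPerM: number): number {
  return costPerMOutput(hourlyUSD, tokensPerSec) / serverlessPerM;
}

// Qwen 3 235B on 4×H100 FP8: $12/hr rented or ≈$120k owned, ≈2,400 tok/s; Fireworks $0.79/M output.
const rented = costPerMOutput(12, 2400);             // ≈ $1.39/M at 100% utilization
const ownedHourly = amortizedHourly(120_000, 3);     // 3-year life is an assumption
const owned = costPerMOutput(ownedHourly, 2400);     // ≈ $0.53/M at 100% utilization
const breakEven = breakEvenUtilization(ownedHourly, 2400, 0.79); // ≈ 0.67

console.log({ rented, owned, breakEven });
```

With these inputs the owned Qwen 3 235B box breaks even with Fireworks at about 67% utilization, and an owned 8×H100 DeepSeek V4 box ($240k, 3,200 tok/s) at about 72% against $1.10. For the rented configurations the ratio comes out above 1 (about 1.76 for Qwen 3 235B and 1.89 for DeepSeek V4), which is the arithmetic behind "serverless stays cheaper at rented rates."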

Operational overhead — the silent cost of self-hosting

The list-price math can look favorable for self-hosting, but the operational cost is non-trivial. Realistically you need: a dedicated ML platform engineer (≈$220k loaded cost), a 24/7 on-call rotation (2–3 people minimum), monitoring (Grafana + Prometheus + Loki), automated failover, a model-update pipeline, and a strategy for handling provider GPU shortages. We model this as ~$400k/year of fully loaded overhead before any GPU bill. Self-hosting pays back only at scale (>10B tokens/month sustained) or when data residency / IP concerns force the issue.

Tool Use and Function Calling

Function-calling reliability is the make-or-break capability for agentic deployments. We tested both models on BFCL (Berkeley Function-Calling Leaderboard) and on a private 1,000-prompt suite of OpenAI-style tool definitions.

Function-calling reliability (May 2026)

Test	DeepSeek V4	Qwen 3 235B
BFCL — simple function	92.3%	94.1%
BFCL — parallel functions	78.6%	82.4%
BFCL — multi-step / chained calls	76.8%	81.2%
Private: valid JSON args first attempt	94.7%	96.3%
Private: correct function selected	91.2%	93.8%
Private: hallucinated function name	2.1%	1.4%

Qwen 3 235B wins every function-calling metric. The gap is small (0.7–4.4 points) but it appears consistently across tests. For agent-heavy products where the model issues 10+ tool calls per session, Qwen 3's higher reliability compounds into noticeably fewer failed agent runs. Qwen 3 also supports a strict-schema mode (similar to OpenAI's `json_schema` response format with `strict: true`); DeepSeek V4 supports JSON mode but not strict schema enforcement as of this writing.
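The compounding effect is easy to quantify under a simplifying independence assumption: if each call is correct with probability p, a session of n calls is fully correct with probability p^n. Using the "correct function selected" rates above, ten calls in a row come out at roughly 40% for DeepSeek V4 versus 53% for Qwen 3 235B. Real agents retry and their errors are correlated, so treat this as an illustration of the trend rather than a forecast.

```typescript
// Probability that every one of `calls` tool calls is correct, assuming each call
// succeeds independently with probability p (a simplification: no retries, no correlation).
function sessionSuccess(p: number, calls: number): number {
  return Math.pow(p, calls);
}

// "Correct function selected" rates from the private suite above.
const perCall: Record<string, number> = { "DeepSeek V4": 0.912, "Qwen 3 235B": 0.938 };

for (const [model, p] of Object.entries(perCall)) {
  const pct = (sessionSuccess(p, 10) * 100).toFixed(1);
  console.log(`${model}: ${pct}% of 10-call sessions with no wrong selection`);
}
```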

License Comparison — The Fine Print

Both models are 'open weights' in the practical sense — you can download them, fine-tune them, and serve them — but the licenses have meaningful differences.

License terms comparison

Term	DeepSeek V4	Qwen 3 235B
Base license	Apache 2.0 derivative ("DeepSeek License v3")	Tongyi Qianwen License
Commercial use	Allowed	Allowed (with caveats)
Distribution of fine-tunes	Allowed	Allowed with attribution
MAU threshold for re-licensing	None	100M MAU triggers commercial license request
Restricted use cases	Military, weapons, CSAM, surveillance against fundamental rights	Same plus 'against Chinese national interests' clause
Modification disclosure	Not required	Recommended (not required)
Patent grant	Yes	Limited

For most commercial products, both licenses are workable. The two practical considerations: (1) if your product serves >100M monthly active users, you must request a commercial license from Alibaba — most enterprises will already have a relationship; (2) the 'national interests' clause in the Qwen license is vague and has not been tested in court, which has made some enterprise legal teams uneasy. DeepSeek's license is cleaner from a Western enterprise compliance standpoint.

Export control and geopolitical risk

Both DeepSeek and Alibaba are China-based companies, and there is ongoing regulatory uncertainty in the EU and US about using LLMs developed by Chinese-affiliated entities for certain government or critical-infrastructure use cases. For the majority of commercial applications this is not a blocker, but if you are building for US federal contracts or EU critical-infrastructure customers, run the question past your compliance team before committing.

Source: DeepSeek V4 license text on GitHub

Source: Qwen 3 235B Tongyi Qianwen License on Hugging Face

Use-Case Recommendation Matrix

When to choose which open-source model

Use case	Pick	Why
Code generation, English codebase	DeepSeek V4	Best LiveCodeBench + SWE-bench Verified among OSS
Code generation, multilingual codebase (CJK comments)	Qwen 3 235B	Better tokenizer for CJK code
Customer-facing chat in 5+ languages	Qwen 3 235B	MGSM 88.7%, broader language coverage
Agentic workflows with 10+ tool calls	Qwen 3 235B	BFCL 87.3%, strict-schema support
Long-context document QA (>32k tokens)	DeepSeek V4	MLA attention reduces memory pressure
RAG-heavy production	Qwen 3 235B	Lower hallucination on grounded tasks
Math tutoring	DeepSeek V4	AIME 2025 86.7% (92.4% with extended reasoning)
Self-hosted on a single 4×H100 box	Qwen 3 235B	Fits at FP8; DeepSeek V4 needs 8×H100
Serverless deployment at lowest cost	Qwen 3 235B on DeepInfra	$0.25/$0.72 per 1M
Maximum quality regardless of cost	DeepSeek V4 with extended reasoning	Closest OSS to closed-source flagship
Strict commercial license + Western legal review	DeepSeek V4	Apache 2.0 derivative is cleaner
Fine-tuning for a private domain	Either	Both ship instruct + base checkpoints

Migration and Integration

Both models are exposed through OpenAI-compatible APIs on every major serverless provider, so dropping them into existing OpenAI-SDK code usually means changing only the base URL and the model name. The standard pattern follows; note that DeepSeek's own API also speaks the OpenAI dialect, so you can call it directly if you want to skip the serverless layer.

```typescript
import OpenAI from "openai";

const prompt = "Refactor this Python function...";

async function main(): Promise<void> {
  // Via Fireworks AI
  const fw = new OpenAI({
    apiKey: process.env.FIREWORKS_API_KEY,
    baseURL: "https://api.fireworks.ai/inference/v1",
  });
  await fw.chat.completions.create({
    model: "accounts/fireworks/models/deepseek-v4",
    messages: [{ role: "user", content: prompt }],
  });
  await fw.chat.completions.create({
    model: "accounts/fireworks/models/qwen3-235b-a22b-instruct",
    messages: [{ role: "user", content: prompt }],
  });

  // Via DeepSeek directly
  const ds = new OpenAI({
    apiKey: process.env.DEEPSEEK_API_KEY,
    baseURL: "https://api.deepseek.com",
  });
  await ds.chat.completions.create({
    model: "deepseek-v4",
    messages: [{ role: "user", content: prompt }],
  });

  // Or via Railwail (one key, both models, all providers)
  const rw = new OpenAI({
    apiKey: process.env.RAILWAIL_API_KEY,
    baseURL: "https://railwail.com/v1",
  });
  const res = await rw.chat.completions.create({
    model: "deepseek-v4", // or "qwen-3-235b"
    messages: [{ role: "user", content: prompt }],
  });
  console.log(res.choices[0].message.content);
}

main().catch((err) => {
  console.error(err);
  process.exitCode = 1;
});
```

Where the OpenAI compatibility breaks

Three differences will trip you up in production. First, both models clamp `temperature`: values above 1.5 are treated as 1.5. Second, DeepSeek V4 honors `tool_choice: 'required'` by returning a tool call, but it does not offer OpenAI-style strict-schema enforcement, so your JSON-validation code still needs to run. Third, Qwen 3's thinking mode is exposed via a non-standard `chat_template_kwargs` parameter that not every serverless provider passes through; if you need it, pick a provider that supports it (Fireworks does).
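Since DeepSeek V4 does not enforce a strict schema on tool calls, a validate-then-retry step belongs in the client. A minimal sketch that checks a tool call's `arguments` string (a JSON-encoded string in OpenAI-style responses) against a flat list of required primitive fields; the schema format and the weather tool are hypothetical, and a production system would typically use a full JSON Schema validator.

```typescript
// Flat schema: required field name -> expected primitive type (hypothetical format).
type FieldType = "string" | "number" | "boolean";
type FlatSchema = Record<string, FieldType>;

type ValidationResult =
  | { ok: true; args: Record<string, unknown> }
  | { ok: false; error: string };

function validateToolArgs(rawArguments: string, schema: FlatSchema): ValidationResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawArguments);
  } catch {
    return { ok: false, error: "arguments are not valid JSON" };
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return { ok: false, error: "arguments must be a JSON object" };
  }
  const args = parsed as Record<string, unknown>;
  // Every schema field must be present with the expected primitive type.
  for (const [field, type] of Object.entries(schema)) {
    if (!(field in args)) return { ok: false, error: `missing field "${field}"` };
    if (typeof args[field] !== type) return { ok: false, error: `field "${field}" should be ${type}` };
  }
  return { ok: true, args };
}

// Hypothetical tool schema.
const getWeatherSchema: FlatSchema = { city: "string", days: "number" };

console.log(validateToolArgs('{"city":"Lyon","days":3}', getWeatherSchema));
console.log(validateToolArgs('{"city":"Lyon","days":"3"}', getWeatherSchema));
```

On a failed check, a common pattern is to return the error message to the model as the tool result and let it retry the call.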

Mix Open and Closed Models in One Codebase

Route cost-sensitive traffic to DeepSeek V4 or Qwen 3 235B, escalate hard cases to Claude Opus 4.7 or GPT-5.4. One API, no vendor lock-in, transparent per-model pricing.


Practical Production Notes

Fine-tuning

Both models support LoRA and QLoRA fine-tuning, and both ship base and instruct checkpoints; DeepSeek positions V4's base (non-instruct) checkpoint specifically for teams doing their own instruction tuning. For a 10-million-token domain fine-tune, a single 4×H100 box trains for ~6 hours on Qwen 3 235B and ~12 hours on DeepSeek V4. Hugging Face's TRL and Axolotl both support both models out of the box.

Distillation into smaller models

Both providers ship distilled smaller models that inherit much of the quality: DeepSeek V4-Lite (16B active), Qwen 3 30B-A3B, Qwen 3 7B. For a production stack, distilled models give you roughly 80% of the quality at a 10–20× lower cost. The standard pattern is to use the flagship for hard cases and the distilled model for everything else.
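The flagship-for-hard-cases pattern can be sketched as a tiered router that starts with the cheapest model and escalates only when a check rejects the answer. The tier names, the `complete` function, and the acceptance check are hypothetical placeholders; in practice the check might be a schema validation, a unit test, or a confidence heuristic.

```typescript
// `complete` calls a model; `accept` decides whether its answer is good enough.
type Complete = (model: string, prompt: string) => Promise<string>;
type Accept = (answer: string) => boolean;

interface RoutedAnswer { model: string; answer: string; attempts: number }

async function routeWithEscalation(
  prompt: string,
  tiers: string[], // ordered from cheapest to most capable
  complete: Complete,
  accept: Accept,
): Promise<RoutedAnswer> {
  let last: RoutedAnswer | undefined;
  for (let i = 0; i < tiers.length; i++) {
    const model = tiers[i];
    const answer = await complete(model, prompt);
    last = { model, answer, attempts: i + 1 };
    if (accept(answer)) return last; // stop at the first acceptable answer
  }
  if (!last) throw new Error("no tiers configured");
  return last; // every tier failed the check: keep the most capable tier's answer
}

// Example with a fake backend where the distilled model returns an empty answer.
const fakeComplete: Complete = async (model) => (model === "distilled" ? "" : `answer from ${model}`);
routeWithEscalation("Summarize this ticket", ["distilled", "open-flagship", "closed-flagship"], fakeComplete, (a) => a.length > 0)
  .then((r) => console.log(r));
```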

Prompt-cache discounts on serverless

Fireworks and Together both offer a 70% discount on cache hits for stable prompt prefixes; DeepInfra is rolling this out in Q3 2026. For agentic workloads with stable system prompts, this pulls effective input pricing toward the $0.10/1M-token range (fully cached tokens cost about $0.14 for DeepSeek V4 and $0.08 for Qwen 3 235B at Fireworks list), low enough that DeepSeek V4 effectively costs less than the cheapest closed-source small models.
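How close you get to that range depends on what share of input tokens actually hits the cache, which is a property of your traffic. A minimal sketch of the blended input price, assuming the discount applies only to cached input tokens and uncached tokens pay full list:

```typescript
// Effective input price per 1M tokens when part of the input is served from the prompt cache.
// cacheHitShare: fraction of input tokens that hit the cache (0..1).
// cacheDiscount: discount on cached tokens (0.7 means cached tokens cost 30% of list).
function effectiveInputPerM(listPerM: number, cacheHitShare: number, cacheDiscount: number): number {
  const cachedPrice = listPerM * (1 - cacheDiscount);
  return cacheHitShare * cachedPrice + (1 - cacheHitShare) * listPerM;
}

// DeepSeek V4 on Fireworks ($0.45/1M input) with a 70% discount on cache hits.
for (const hitShare of [0, 0.5, 0.9, 1]) {
  const price = effectiveInputPerM(0.45, hitShare, 0.7);
  console.log(`cache-hit share ${hitShare}: $${price.toFixed(3)} per 1M input tokens`);
}
```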

What Will Change by End of 2026

**DeepSeek R2 (reasoning-first)** — DeepSeek's roadmap signals a reasoning-first variant in Q3 2026 that should close the gap with closed-source on HLE and AIME.
**Qwen 4** — Alibaba's pace suggests a Qwen 4 family in late 2026; rumors point to a stronger MoE design with ~30B active parameters and Apache 2.0 licensing.
**Serverless pricing wars** — DeepInfra is signaling another 20% cut by Q3. Fireworks is expected to match. Self-hosting break-even will move further to the right.
**Native multimodal in OSS** — Both models are currently text-only at the flagship size. Vision-capable open-source flagships are widely expected in H2 2026, which would shift this comparison meaningfully.

Bottom Line

DeepSeek V4 is the stronger open-source model for code-heavy English production. Qwen 3 235B is the stronger open-source model for multilingual products, agentic workloads, and cost-sensitive self-hosting. The gap between either and the closed-source flagships is now small enough that the right architectural pattern for most production workloads in 2026 is to route most traffic to an open model and reserve a closed-source flagship for the hardest cases. The price gap is too large to ignore — 30–90× — and the quality gap is small enough that for the right workload you will not notice it.

Frequently Asked Questions

Is DeepSeek V4 better than Qwen 3 235B?

On most benchmarks, narrowly yes — DeepSeek V4 wins MMLU-Pro by 0.5 points, LiveCodeBench by 5.2 points, and SWE-bench Verified by 6.3 points. Qwen 3 235B wins function-calling reliability (BFCL +1.7 points) and multilingual math (MGSM +4.5 points). For English code, DeepSeek V4 is the default. For multilingual or agentic work, Qwen 3 235B is.

How much does it cost to use DeepSeek V4 vs Qwen 3 235B via API?

On Fireworks (May 2026 list pricing): DeepSeek V4 is $0.45 input / $1.10 output per 1M tokens; Qwen 3 235B is $0.27 input / $0.79 output. Qwen 3 235B is roughly 30–40% cheaper. Both are dramatically cheaper than closed-source: on a short chat turn, DeepSeek V4 is ~27× cheaper than GPT-5.4 and Qwen 3 235B is ~39× cheaper.

What hardware do I need to self-host DeepSeek V4?

At FP8 precision, 8× NVIDIA H100 80GB GPUs (640 GB total VRAM). Cloud rate is around $24/hour from Lambda or RunPod. At INT4 quantization (with ~3–5% quality loss), 4× H100 is sufficient. Sustained throughput at FP8 with vLLM is around 3,200 tokens/second.

What hardware do I need to self-host Qwen 3 235B?

At FP8 precision, 4× H100 80GB GPUs (320 GB total VRAM). Cloud rate is around $12/hour. At INT4, 2× H100 80GB is enough. Sustained throughput at FP8 is around 2,400 tokens/second.

Can I use DeepSeek V4 or Qwen 3 235B commercially?

Yes for both. DeepSeek V4 ships under an Apache 2.0 derivative with no MAU threshold. Qwen 3 235B ships under the Tongyi Qianwen License — also commercial-friendly, with the caveat that products serving over 100M monthly active users must request a separate commercial license from Alibaba. Both restrict military, weapons, and CSAM uses; Qwen also restricts uses 'against Chinese national interests.'

Which open-source LLM is best for coding?

DeepSeek V4 — it leads on LiveCodeBench (73.4%), SWE-bench Verified (67.8%), and HumanEval-X. On LiveCodeBench specifically it exceeds Claude Opus 4.7 (71.8%) and GPT-5.4 (69.4%), making it the only open-source model that beats closed-source on a major contamination-free coding benchmark in May 2026.

Which is better for function calling and agentic workflows?

Qwen 3 235B. It scores 87.3% on BFCL and 81.2% on multi-step BFCL — slightly ahead of DeepSeek V4. It also supports strict-schema JSON output, which DeepSeek V4 does not. For agents with 10+ tool calls per session, Qwen 3's marginal reliability advantage compounds into noticeably fewer failed runs.

When does self-hosting beat serverless?

Typically at around 50–80% sustained GPU utilization on owned, amortized hardware (≈10 billion monthly tokens). Below that, the operational overhead (ML platform engineer, on-call rotation, monitoring, model-update pipeline) outweighs the per-token savings, and on rented GPUs at May 2026 rates serverless list prices stay cheaper per token regardless. Self-hosting also pays back when data residency or IP concerns block sending data to third-party APIs.

Are DeepSeek V4 and Qwen 3 235B as good as Claude Opus 4.7 or GPT-5.4?

The stronger of the two trails the best closed-source flagship by 1–7 points on most benchmarks (9.4 points on AIME 2025) and leads on LiveCodeBench and MGSM. The gap is real but small. For 80% of production workloads (chat, summarization, classification, document QA, code generation), both open-source models perform indistinguishably from the closed-source flagships at a fraction of the cost (30–90× less than Claude Opus 4.7 on our workloads). The hardest cases (frontier reasoning, agentic engineering at the highest reliability) still favor closed-source by a margin worth paying for.

Can I fine-tune DeepSeek V4 or Qwen 3 235B?

Yes, both support LoRA and QLoRA fine-tuning, and both ship base and instruct checkpoints (DeepSeek positions V4's base checkpoint specifically for custom instruction tuning). On a 4×H100 box, a 10-million-token domain fine-tune takes 6–12 hours. TRL and Axolotl both support both models.

Are these models multimodal?

No — both DeepSeek V4 and Qwen 3 235B are text-only at the flagship size. Alibaba ships Qwen 3 VL variants for vision, but they are smaller (7B and 72B). DeepSeek has signaled native multimodal support in a future release. For vision work today, you would pair DeepSeek V4 or Qwen 3 235B with a separate vision model — or use a closed-source multimodal flagship.

How do I migrate from OpenAI API to DeepSeek V4 or Qwen 3 235B?

Both models are exposed through OpenAI-compatible APIs on Together, Fireworks, and DeepInfra (and on DeepSeek's own API for DeepSeek V4). Migration is a `baseURL` change and a `model` string change; the rest of your OpenAI SDK code stays identical. The main caveats: `temperature` is clamped above 1.5, parallel tool use is sequential by default, DeepSeek V4 does not enforce strict schemas on tool calls, and Qwen 3's `thinking` mode is exposed via a non-standard parameter.

Try Both Open-Source Models Now

Railwail exposes DeepSeek V4 and Qwen 3 235B alongside Claude Opus 4.7, GPT-5.4, and 100+ other models behind a single OpenAI-compatible endpoint. Pay per token at provider list prices — no markup. Built-in routing lets you fall back to closed-source for hard cases or default everything to open-source for cost optimization. Start with free credits and run your own quality eval.

All Open-Source Models. One API. No Markup.

DeepSeek V4, Qwen 3 235B, Llama 3.3, Mixtral, and 40+ more — through the same OpenAI-compatible endpoint as GPT and Claude. Pass-through pricing.


Source: Artificial Analysis, independent LLM benchmark leaderboard

Source: Together AI, open-source model serving documentation