Microsoft Phi-3.5 MoE Instruct
Mixture-of-experts Phi-3.5: 42B total / 6.6B active params. 128k context, multilingual.
Microsoft Phi-3.5 MoE Instruct is a text & chat AI model from Microsoft, priced at €0.000 per 1M input tokens with a 131.1K-token context window.
API Integration
Use our OpenAI-compatible API to integrate Microsoft Phi-3.5 MoE Instruct into your application.
npm install railwail

import railwail from "railwail";
const rw = railwail("YOUR_API_KEY");
// Simple: just pass a string
const reply = await rw.run("phi-3-5-moe-instruct", "Hello! What can you do?");
console.log(reply);
// With message history
const reply2 = await rw.run("phi-3-5-moe-instruct", [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "Explain quantum computing simply." },
]);
console.log(reply2);
// Full response with usage info
const res = await rw.chat("phi-3-5-moe-instruct", [
{ role: "user", content: "Hello!" },
], { temperature: 0.7, max_tokens: 500 });
console.log(res.choices[0].message.content);
console.log(res.usage);

Deep dive: Microsoft Research's Microsoft Phi-3.5 MoE Instruct
Microsoft Research's Machine Learning Foundations group, led by Sébastien Bubeck and Ronen Eldan, drove the Phi series of small-but-capable language models. The Phi thesis is that synthetic 'textbook-quality' training data can produce small models that punch far above their weight on reasoning benchmarks. The series runs from Phi-1 (1.3B, code, 2023) through Phi-1.5 (general reasoning, 2023), Phi-2 (2.7B, 2023) and Phi-3 (Mini, Small and Medium dense models, April 2024) to Phi-3.5 (Mini, Vision, MoE, August 2024). Phi-3.5 MoE was Microsoft's first Mixture-of-Experts Phi variant: 16 experts of 3.8B parameters each with top-2 routing. Microsoft Research itself was founded in 1991 and remains one of the largest industrial AI research organisations in the world; Phi is one of its flagship open-weights AI projects.
Phi-3.5 MoE Instruct is a 16x3.8B Mixture-of-Experts decoder transformer: 16 experts, each approximately the size of Phi-3-Mini, with top-2 routing yielding 6.6B active parameters out of 41.9B total. The architecture uses 32 layers, 4,096 hidden size, 32-head grouped-query attention with 8 KV heads, RoPE positional embeddings (theta=10000, extended for 128K context), SwiGLU activations, and a 32,064-token Llama-derived BPE tokeniser. Routing uses a sparse mixer with auxiliary loss for expert balancing. The model was pretrained on 4.9 trillion tokens of heavily curated data, with the Phi recipe emphasising synthetic 'textbook-quality' data generated from larger models and explicitly oversampling reasoning-dense content over breadth. Training used 512 H100 GPUs for 23 days. Post-training is supervised fine-tuning plus Direct Preference Optimisation (DPO) with explicit safety post-training. Released August 2024 under the MIT license.
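To make "top-2 routing" concrete, here is a minimal sketch of how a router might pick two of 16 experts for a single token. It uses plain softmax gating with renormalisation, which is a simplification chosen for illustration and not the model's actual sparse mixer; the function names and the logits are hypothetical.

```javascript
// Illustrative top-2 gating for one token.
// ASSUMPTION: plain softmax gating, not the model's real sparse mixer.
function softmax(logits) {
  const max = Math.max(...logits); // subtract the max for numerical stability
  const exps = logits.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

function top2Route(routerLogits) {
  const probs = softmax(routerLogits);
  // Rank expert indices by probability and keep the best two.
  const ranked = probs
    .map((p, expert) => ({ expert, p }))
    .sort((a, b) => b.p - a.p)
    .slice(0, 2);
  // Renormalise so the two selected weights sum to 1.
  const mass = ranked[0].p + ranked[1].p;
  return ranked.map(({ expert, p }) => ({ expert, weight: p / mass }));
}

// Hypothetical router logits for the 16 experts of one layer.
const logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0];
const routes = top2Route(logits);
console.log(routes);
```

Only the two selected experts run their feed-forward block for that token, and their outputs would be combined using the returned weights. The other 14 experts are skipped, which is where the gap between total and active parameters comes from.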
- Parameters: 41.9B total, 6.6B active per token (16 experts of ~3.8B each, top-2 routing)
- Context: 131.1K tokens
- 16-expert MoE: Microsoft's first MoE Phi variant
- Only 6.6B active parameters per token, so inference compute is low for a model of this total size
- Punches above weight: matches Mixtral 8x7B (12.9B active) and Llama 3.1 8B on many benchmarks
- Strong math and reasoning for active-param size (MMLU 78.9, GSM8K 88.7)
- 128K context window
- Multilingual support for 22 languages
- Open weights under permissive MIT license
- Best for: cost-efficient reasoning, on-device inference (INT4 weights for all ~42B parameters come to roughly 21GB), education and tutoring applications.
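The "16 experts of ~3.8B each" label does not multiply out to 41.9B, because in an MoE of this kind only the feed-forward blocks are replicated per expert while attention and embeddings are shared. The sketch below reproduces the total and active counts from the architecture figures quoted above. The feed-forward width (`ffnSize: 6400`) and the untied embeddings are assumptions not stated on this page, and norms and biases are ignored.

```javascript
// Back-of-the-envelope parameter count for a top-2, 16-expert MoE.
const cfg = {
  layers: 32,
  hidden: 4096,
  heads: 32,
  kvHeads: 8,
  vocab: 32064,
  experts: 16,
  activeExperts: 2,
  ffnSize: 6400, // ASSUMPTION: not given in the text above
};

function countParams(c, expertsUsed) {
  const headDim = c.hidden / c.heads;
  const kvDim = c.kvHeads * headDim;
  // Q and output projections are hidden x hidden; K and V are hidden x kvDim (GQA).
  const attention = 2 * c.hidden * c.hidden + 2 * c.hidden * kvDim;
  // SwiGLU feed-forward: gate, up and down projections.
  const expert = 3 * c.hidden * c.ffnSize;
  // One router logit per expert.
  const router = c.hidden * c.experts;
  // ASSUMPTION: untied input and output embeddings.
  const embeddings = 2 * c.vocab * c.hidden;
  return c.layers * (attention + router + expertsUsed * expert) + embeddings;
}

const total = countParams(cfg, cfg.experts);
const active = countParams(cfg, cfg.activeExperts);
console.log((total / 1e9).toFixed(1), (active / 1e9).toFixed(1));
```

Under these assumptions the count comes to about 41.9B total and 6.6B active, matching the quoted figures. Treat it as a sanity check on the numbers, not as the official accounting.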
Pretrained on 4.9 trillion tokens. The mix is heavily curated and includes filtered web data, synthetic 'textbook-quality' data generated from larger models, code, math and 22-language multilingual sources. Knowledge cutoff October 2023. Training used 512 NVIDIA H100 GPUs for 23 days. Post-training is supervised fine-tuning plus DPO with explicit safety post-training and red-team feedback.
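For a sense of scale, the training figures above can be turned into GPU-hours and an average throughput. This is a rough sketch that assumes all 512 GPUs ran for the full 23 days and that the 4.9 trillion tokens were each processed once; neither assumption is confirmed by the page.

```javascript
// Rough training-compute arithmetic from the quoted figures.
const gpuCount = 512;
const trainingDays = 23;
const trainingTokens = 4.9e12;

const gpuHours = gpuCount * trainingDays * 24;
// ASSUMPTION: full utilisation and a single pass over the data.
const tokensPerGpuSecond = trainingTokens / (gpuHours * 3600);

console.log({ gpuHours, tokensPerGpuSecond: Math.round(tokensPerGpuSecond) });
```

That works out to 282,624 GPU-hours, or a few thousand tokens per GPU per second on average.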
License: MIT License for the open weights. Commercial use, redistribution and modification are permitted, subject only to keeping the copyright and license notice, which makes it one of the most permissive licenses among major open-weight LLMs.
Known limitations
- All ~42B parameters must sit in memory (~84GB in FP16), far heavier than the 6.6B active count suggests
- MoE routing can cause latency spikes on imbalanced batches
- Knowledge breadth is narrower than larger dense models; Phi trades breadth for reasoning
- Behind frontier models on coding benchmarks despite strong math
- Synthetic-data-heavy training can produce 'textbook-like' answers that don't match real-world tone
- No vision modality (use Phi-3.5-Vision instead)
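The memory limitation follows directly from the total parameter count. A minimal sketch of the weight-only footprint at common precisions is shown below; it ignores the KV cache, activations and quantisation overhead such as scales and zero points, so real usage will be higher.

```javascript
// Weight-only memory estimate; every expert must be resident even though
// only two are used per token.
const TOTAL_PARAMS = 41.9e9;
const BYTES_PER_PARAM = { fp16: 2, int8: 1, int4: 0.5 };

function weightGB(params, bytesPerParam) {
  return (params * bytesPerParam) / 1e9; // decimal gigabytes
}

const footprint = Object.fromEntries(
  Object.entries(BYTES_PER_PARAM).map(([fmt, bytes]) => [fmt, weightGB(TOTAL_PARAMS, bytes)])
);
console.log(footprint);
```

This gives roughly 84GB in FP16, 42GB in INT8 and 21GB in INT4, which is why the model needs far more memory than a 6.6B dense model despite similar per-token compute.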
Related Models
Claude Opus 4
Anthropic's most powerful model. Exceptional at complex analysis, agentic tasks, and extended reasoning.
Claude Sonnet 4
Anthropic's most capable model. Excellent for complex analysis, coding, math, and creative writing.
DeepSeek V3.1
DeepSeek's refreshed V3.1 release. 671B MoE / 37B active. Tops open-weights leaderboards on coding and reasoning.
DeepSeek V4 Pro
DeepSeek's April 2026 flagship. 1.6T MoE / 49B active params, 1M context, rivals top closed-source models on STEM and coding at a fraction of the price.
Start using Microsoft Phi-3.5 MoE Instruct today
Get started with free credits. No credit card required. Access Microsoft Phi-3.5 MoE Instruct and 100+ other models through a single API.