ElevenLabs Multilingual V2
ElevenLabs' most natural-sounding TTS model. Supports 29 languages with emotional range.
ElevenLabs Multilingual V2 is a text-to-speech AI model from ElevenLabs, priced at €0.000 per 1M input tokens with an unknown context window.
Examples
See what ElevenLabs Multilingual V2 can generate
Narration
Input text:
"Welcome to the future of artificial intelligence. In this episode, we explore how large language models are reshaping industries from healthcare to creative arts, and what it means for the next decade of human progress."
Podcast Intro
Input text:
"Hey everyone, welcome back to another episode of Tech Unfiltered! I'm your host, and today we have an incredible guest who just shipped one of the most downloaded apps of the year. Grab your coffee, because this conversation is going to be a wild ride."
API Integration
Use our OpenAI-compatible API to integrate ElevenLabs Multilingual V2 into your application.
npm install railwail
import railwail from "railwail";
const rw = railwail("YOUR_API_KEY");
// Simple — just pass a string
const reply = await rw.run("elevenlabs-multilingual-v2", "Hello! What can you do?");
console.log(reply);
// With message history
const reply2 = await rw.run("elevenlabs-multilingual-v2", [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "Explain quantum computing simply." },
]);
console.log(reply2);
// Full response with usage info
const res = await rw.chat("elevenlabs-multilingual-v2", [
{ role: "user", content: "Hello!" },
], { temperature: 0.7, max_tokens: 500 });
console.log(res.choices[0].message.content);
console.log(res.usage);
Deep dive: ElevenLabs Multilingual V2
ElevenLabs was founded in 2022 by Piotr Dabkowski and Mati Staniszewski, two Polish friends with backgrounds at Google and Palantir respectively. The product mission was to fix the poor dubbing experience for non-English films by producing AI voices with realistic intonation and emotion across many languages. The company is headquartered in London and New York with engineering centres in Warsaw and the Bay Area, and has raised over $280M across rounds led by Andreessen Horowitz and ICONIQ at valuations rising from $100M (Series A, 2023) to $3.3B (Series C, 2025). Multilingual V2, released in August 2023, became the default model for production audiobook and dubbing workflows at companies like Storytel, TheSoul Publishing and many indie publishers, and remained the flagship until v3 was previewed in 2025.
ElevenLabs Multilingual V2 is a hosted autoregressive Transformer text-to-speech model that predicts neural-codec audio tokens conditioned on text and a speaker embedding. The speaker embedding comes from one of three sources: a stock voice from the curated Voice Library, an Instant Voice Clone built from about 1 minute of reference audio, or a Professional Voice Clone fine-tuned on about 30 minutes of clean recording. Multilingual V2 supports 29 languages and handles code-switching between them within a single paragraph. Output is MP3 or PCM at 24 kHz or 44.1 kHz, delivered through the hosted API and the ElevenLabs Studio editor. The model is the production workhorse for ElevenLabs' Dubbing Studio (automatic translation plus voice matching) and the AI Audiobook product. ElevenLabs has not published a technical paper, but the system architecturally resembles published neural-codec language-model TTS systems such as VALL-E and Voicebox.
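The hosted API mentioned above can also be called directly over HTTPS. The following is a minimal sketch, not the vendor's reference client: it assumes the publicly documented ElevenLabs REST endpoint (POST /v1/text-to-speech/{voice_id}), the xi-api-key header, the model ID eleven_multilingual_v2, and the output_format query parameter. The helper name buildTtsRequest and the default slider values are illustrative assumptions; check the current ElevenLabs API reference before relying on any field name.

```javascript
// Sketch only: endpoint path, header name, and JSON field names are
// assumptions based on the public ElevenLabs REST API; verify before use.
const ELEVENLABS_BASE_URL = "https://api.elevenlabs.io/v1";

// Builds the URL and fetch() options for one text-to-speech request.
// Nothing is sent over the network here; the caller decides when to fetch.
function buildTtsRequest(apiKey, voiceId, text, options = {}) {
  const {
    stability = 0.5,         // illustrative default for the stability slider
    similarityBoost = 0.75,  // illustrative default for the similarity slider
    outputFormat = "mp3_44100_128", // assumed format identifier
  } = options;

  const url =
    `${ELEVENLABS_BASE_URL}/text-to-speech/${encodeURIComponent(voiceId)}` +
    `?output_format=${encodeURIComponent(outputFormat)}`;

  return {
    url,
    init: {
      method: "POST",
      headers: {
        "xi-api-key": apiKey,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        text,
        model_id: "eleven_multilingual_v2",
        voice_settings: { stability, similarity_boost: similarityBoost },
      }),
    },
  };
}
```

To use it, destructure the result and pass it to fetch, for example: const { url, init } = buildTtsRequest(key, "YOUR_VOICE_ID", "Hola, hello!"), then await fetch(url, init) and read the audio bytes with response.arrayBuffer(). Keeping request construction separate from the network call makes the payload easy to inspect and test.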
- Parameters: Undisclosed
- Context: 5K tokens
- 29 languages with seamless code-switching
- Voice Library with thousands of community and stock voices
- Instant Voice Clone (~1 min audio) and Professional Voice Clone (~30 min)
- Stability and similarity sliders for per-request prosody control
- Studio editor for multi-paragraph long-form projects
- Dubbing Studio with automatic translation and voice matching
- Up to ~5,000 characters per request
- Best for: production audiobooks, multilingual dubbing, content localisation
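Because each request is limited to roughly 5,000 characters, long-form text such as an audiobook chapter has to be split before synthesis. The sketch below shows one way to do that, assuming sentence boundaries are good split points for prosody. The helper chunkText is hypothetical and not part of any SDK, and the 5,000 default simply mirrors the approximate limit listed above.

```javascript
// Hypothetical helper: split long text into chunks of at most maxChars,
// preferring sentence boundaries so speech is not cut mid-sentence.
function chunkText(text, maxChars = 5000) {
  // Split after sentence-ending punctuation that is followed by whitespace.
  const sentences = text.split(/(?<=[.!?])\s+/).filter((s) => s.length > 0);
  const chunks = [];
  let current = "";

  for (const sentence of sentences) {
    let remaining = sentence;

    // A single sentence longer than the limit is hard-split by length.
    while (remaining.length > maxChars) {
      if (current) {
        chunks.push(current);
        current = "";
      }
      chunks.push(remaining.slice(0, maxChars));
      remaining = remaining.slice(maxChars);
    }
    if (remaining.length === 0) continue;

    // Append the sentence to the current chunk if it still fits.
    const candidate = current ? `${current} ${remaining}` : remaining;
    if (candidate.length <= maxChars) {
      current = candidate;
    } else {
      chunks.push(current);
      current = remaining;
    }
  }

  if (current) chunks.push(current);
  return chunks;
}
```

Each returned chunk can then be sent as its own request and the resulting audio segments concatenated in order. Splitting on sentences rather than at a fixed offset is a design choice aimed at keeping intonation natural at the joins; it does not guarantee identical voice delivery across chunks.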
Training data: Not disclosed. Believed to be a mix of licensed professional voice talent, public-domain audiobooks and opt-in user voice contributions.
License: Proprietary commercial SaaS. Commercial use permitted on paid plans; customer voice clones remain customer property.
Known limitations
- Higher per-character price than OpenAI TTS-1 or Cartesia Sonic
- Latency in the 300-600 ms range; slower than Sonic for real-time use
- Limited SSML support
- No on-premises deployment
- 29-language list smaller than v3's 70+
Related Models
AudioCraft
Meta's AudioCraft framework wrapping MusicGen, AudioGen and EnCodec. Unified text-to-audio research toolkit for music and sound effects.
AudioLDM 2
Latent-diffusion model for general-purpose text-to-audio. Generates speech, music, and sound effects with a unified prior.
Cartesia Sonic
Cartesia's ultra-low-latency TTS (~90ms TTFB). State-space model with voice cloning support.
Edge TTS
Microsoft Edge neural voices accessed via the open-source edge-tts wrapper. 400+ voices across 100+ locales, suitable for batch generation.
Start using ElevenLabs Multilingual V2 today
Get started with free credits. No credit card required. Access ElevenLabs Multilingual V2 and 100+ other models through a single API.