ElevenLabs v3 (alpha)
ElevenLabs' v3 alpha TTS. Most expressive voice model with audio tags and laughter, higher latency.
ElevenLabs v3 (alpha) is a text-to-speech AI model from ElevenLabs, priced at €0.300 per 1M input tokens with an unknown context window.
1x
Pricing
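As a rough illustration of the listed rate, the sketch below estimates cost from an input-token count. It assumes billing is strictly linear at the €0.300 per 1M input tokens quoted on this page; `estimateCostEur` is a hypothetical helper for illustration, not part of any SDK, and actual invoices may differ.

```javascript
// Listed price on this page: EUR 0.300 per 1,000,000 input tokens.
const PRICE_EUR_PER_MILLION_TOKENS = 0.3;

// Hypothetical helper (not an SDK function): linear cost estimate in EUR.
function estimateCostEur(inputTokens) {
  if (!Number.isFinite(inputTokens) || inputTokens < 0) {
    throw new RangeError("inputTokens must be a non-negative number");
  }
  return (inputTokens / 1_000_000) * PRICE_EUR_PER_MILLION_TOKENS;
}

// 250,000 input tokens at the listed rate.
console.log(estimateCostEur(250_000).toFixed(3)); // "0.075"
```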
API Integration
Use our OpenAI-compatible API to integrate ElevenLabs v3 (alpha) into your application.
npm install railwail

import railwail from "railwail";
const rw = railwail("YOUR_API_KEY");
// Simple: just pass a string
const reply = await rw.run("eleven-v3", "Hello! What can you do?");
console.log(reply);
// With message history
const reply2 = await rw.run("eleven-v3", [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Explain quantum computing simply." },
]);
console.log(reply2);
// Full response with usage info
const res = await rw.chat("eleven-v3", [
  { role: "user", content: "Hello!" },
], { temperature: 0.7, max_tokens: 500 });
console.log(res.choices[0].message.content);
console.log(res.usage);

Deep dive: ElevenLabs v3 (alpha)
ElevenLabs was founded in 2022 by Piotr Dabkowski (CTO, ex-Google ML engineer) and Mati Staniszewski (CEO, ex-Palantir), two Polish high-school friends frustrated with the poor quality of TV-show dubbing in Polish. The company set out to build voice AI that captures intonation and emotion across languages. Headquartered in London and New York with engineering hubs in Warsaw and the Bay Area, ElevenLabs raised a $19M Series A in June 2023 led by Andreessen Horowitz, an $80M Series B in January 2024 also led by a16z at a $1.1B valuation, and a $180M Series C in January 2025 at a $3.3B valuation co-led by a16z and ICONIQ. ElevenLabs v3 (alpha) was previewed in 2025 as the next-generation flagship model with expressive emotion tags, longer context and more languages, succeeding the Multilingual V2 family that became the de facto standard for AI dubbing.
Visit ElevenLabs →
ElevenLabs v3 (alpha) is the company's 2025 flagship text-to-speech model and the first ElevenLabs system to expose explicit emotion and event tags inside text input ([whispers], [laughs], [angry], [sighs]). It is a proprietary Transformer-based autoregressive model that predicts neural-codec audio tokens conditioned on a text prompt and a speaker embedding obtained from a few seconds of reference audio (Instant Voice Clone) or a fully fine-tuned voice (Professional Voice Clone, requires ~30 minutes of clean audio). v3 expands language coverage from 29 (v2) to 70+ languages, lengthens the input window to roughly 10,000 characters per request, and adds dialogue mode for multi-speaker scenes. ElevenLabs has not published a technical paper; product blog posts describe internal improvements in speaker disentanglement, code-switching and emotional range. v3 is offered through the same hosted API and Studio UI as Multilingual V2 but at higher latency and price.
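To make the inline tags and the per-request input window concrete, here is a minimal sketch of client-side text preparation. It makes no API calls. `withTag` and `splitForTts` are hypothetical helpers, and the 10,000-character limit is the approximate figure quoted above, treated here as an assumption and not as a documented API constant.

```javascript
// Approximate per-request limit quoted on this page (assumption, not an API constant).
const MAX_CHARS = 10000;

// Hypothetical helper: prefix text with a bracketed tag such as [whispers].
function withTag(tag, text) {
  return `[${tag}] ${text}`;
}

// Hypothetical helper: split long text into chunks of at most maxChars,
// breaking at sentence boundaries where possible.
function splitForTts(text, maxChars = MAX_CHARS) {
  const sentences = text.split(/(?<=[.!?])\s+/).filter((s) => s.length > 0);
  const chunks = [];
  let current = "";
  for (let sentence of sentences) {
    // Hard-split a single sentence that exceeds the limit on its own.
    while (sentence.length > maxChars) {
      if (current) {
        chunks.push(current);
        current = "";
      }
      chunks.push(sentence.slice(0, maxChars));
      sentence = sentence.slice(maxChars);
    }
    const candidate = current ? `${current} ${sentence}` : sentence;
    if (candidate.length > maxChars) {
      // Adding this sentence would overflow: close the current chunk first.
      chunks.push(current);
      current = sentence;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

console.log(withTag("whispers", "Come closer."));
console.log(splitForTts("One. Two. Three.", 10));
```

Each returned chunk could then be sent as its own request. Splitting at sentence boundaries is a design choice meant to avoid cutting a tag or a word in half; only a sentence longer than the limit is cut mid-text.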
- Parameters: Undisclosed
- Context: 10K tokens
- Expressive emotion and event tags ([laughs], [whispers], [angry], [crying])
- 70+ languages with high-quality code-switching
- Multi-speaker dialogue mode for podcast and audiobook generation
- Instant Voice Clone from ~1 minute of audio and Professional Voice Clone from ~30 minutes
- Long-form input up to ~10,000 characters per request
- Studio editor for multi-paragraph projects with per-line speaker control
- Best for: audiobooks, dubbing, narrative podcasts, character voices for games
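For the multi-speaker and per-line speaker control items above, the following sketch shows one possible way to hold a scene in code before synthesis. The data shape, the `Speaker: [tag] text` layout, and the helper names are all hypothetical; the input format that dialogue mode actually expects is not described on this page.

```javascript
// Hypothetical data shape: one entry per spoken line, with an optional tag.
const script = [
  { speaker: "Narrator", text: "The door creaked open." },
  { speaker: "Ada", tag: "whispers", text: "Is anyone there?" },
  { speaker: "Ben", tag: "laughs", text: "Only me." },
];

// Render the script as plain text, one "Speaker: [tag] text" line per turn.
function renderDialogue(lines) {
  return lines
    .map(({ speaker, tag, text }) => {
      const prefix = tag ? `[${tag}] ` : "";
      return `${speaker}: ${prefix}${text}`;
    })
    .join("\n");
}

// List each speaker once, in order of first appearance,
// e.g. to assign one voice per speaker.
function uniqueSpeakers(lines) {
  return [...new Set(lines.map((line) => line.speaker))];
}

console.log(renderDialogue(script));
console.log(uniqueSpeakers(script));
```

Keeping speaker, tag, and text as separate fields until the last step makes it easy to re-render the same scene in whatever format a given endpoint requires.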
Training data: Not disclosed. ElevenLabs licenses professional voice talent, uses public-domain audiobooks and crowd-sourced opt-in voice contributions; commercial recordings are excluded per their public statements.
License: Proprietary commercial SaaS. Commercial use of generated audio is permitted on paid plans; voice clones remain customer property.
Known limitations
- Higher latency than v2 Turbo or Cartesia Sonic
- Tag interpretation occasionally inconsistent in alpha
- Hard refusal for likeness of named public figures without verified consent
- Closed weights, no on-premise deployment
- Pricing per character is among the highest in the market
Related Models
ElevenLabs Multilingual V2
ElevenLabs' most natural-sounding TTS model. Supports 29 languages with emotional range.
AudioCraft
Meta's AudioCraft framework wrapping MusicGen, AudioGen and EnCodec. Unified text-to-audio research toolkit for music and sound effects.
AudioLDM 2
Latent-diffusion model for general-purpose text-to-audio. Generates speech, music, and sound effects with a unified prior.
Cartesia Sonic
Cartesia's ultra-low-latency TTS (~90ms TTFB). State-space model with voice cloning support.
Start using ElevenLabs v3 (alpha) today
Get started with free credits. No credit card required. Access ElevenLabs v3 (alpha) and 100+ other models through a single API.