ElevenLabs Scribe v1
ElevenLabs' STT. 99 languages, word-level timestamps, speaker diarization, audio-event tagging.
ElevenLabs Scribe v1 is a speech-to-text AI model from ElevenLabs, priced at €0.000 per 1M input tokens with an unknown context window.
Accepted upload formats: MP3, WAV, M4A, FLAC (max 25 MB)
Pricing
API Integration
Use our OpenAI-compatible API to integrate ElevenLabs Scribe v1 into your application.
npm install railwail
import railwail from "railwail";
const rw = railwail("YOUR_API_KEY");
// Simple — just pass a string
const reply = await rw.run("scribe-v1", "Hello! What can you do?");
console.log(reply);
// With message history
const reply2 = await rw.run("scribe-v1", [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "Explain quantum computing simply." },
]);
console.log(reply2);
// Full response with usage info
const res = await rw.chat("scribe-v1", [
{ role: "user", content: "Hello!" },
], { temperature: 0.7, max_tokens: 500 });
console.log(res.choices[0].message.content);
console.log(res.usage);
Deep dive: ElevenLabs Scribe v1
ElevenLabs was founded in 2022 by Piotr Dabkowski and Mati Staniszewski, two Polish technologists who had worked at Google and Palantir. The company became famous for high-quality multilingual text-to-speech and AI dubbing, and in February 2025 expanded into the inverse problem with Scribe v1, the company's first dedicated automatic speech recognition model. Scribe was developed in part to power ElevenLabs' Dubbing Studio (transcribe source audio, translate, then re-synthesise in the target language), and is offered as a standalone API to enterprise customers who want a single vendor for the full STT-to-TTS pipeline. ElevenLabs has raised over $280M to date, with a Series C in January 2025 at a $3.3B valuation.
ElevenLabs Scribe v1 is a hosted automatic speech recognition model launched in February 2025. ElevenLabs has not published a technical report, but the launch blog describes a Transformer encoder-decoder ASR architecture trained on a large multilingual speech corpus covering 99 languages, with particular emphasis on accuracy in long-tail languages where Whisper Large v3 underperforms. In the company's published FLEURS and Common Voice evaluations, Scribe outperformed Whisper Large v3 and Deepgram Nova-2 across many languages, and it ranked first overall in a head-to-head benchmark on Hindi, Mandarin, German and Italian. The model supports speaker diarisation for up to 32 speakers, word-level timestamps with sub-100 ms precision, character-level confidence scores, and automatic detection and tagging of non-speech audio events ([applause], [laughter], [music]). Maximum file size is 1 GB and maximum audio length is 2 hours per request.
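Word-level timestamps combined with diarisation labels are usually folded into speaker turns before display. The sketch below shows one way to do that. The word shape (`text`, `start`, `end`, `speaker`) and the function name `groupBySpeaker` are assumptions made for illustration, not a confirmed response schema for Scribe or for the railwail client.

```javascript
// Hypothetical word shape (an assumption, not a confirmed response schema):
//   { text: string, start: number, end: number, speaker: string }
// start and end are seconds from the beginning of the audio.
function groupBySpeaker(words) {
  const turns = [];
  for (const word of words) {
    const last = turns[turns.length - 1];
    if (last && last.speaker === word.speaker) {
      // Same speaker is still talking: extend the current turn.
      last.text += " " + word.text;
      last.end = word.end;
    } else {
      // First word, or the speaker changed: start a new turn.
      turns.push({
        speaker: word.speaker,
        start: word.start,
        end: word.end,
        text: word.text,
      });
    }
  }
  return turns;
}

// Made-up sample data, for illustration only.
const sampleWords = [
  { text: "Hello", start: 0.0, end: 0.4, speaker: "speaker_0" },
  { text: "there.", start: 0.45, end: 0.8, speaker: "speaker_0" },
  { text: "Hi!", start: 1.1, end: 1.3, speaker: "speaker_1" },
];

for (const turn of groupBySpeaker(sampleWords)) {
  console.log(`${turn.speaker} [${turn.start}-${turn.end}s]: ${turn.text}`);
}
```

Merging only on consecutive identical labels keeps the original ordering intact, so interruptions and short interjections stay visible as separate turns.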
- Parameters: Undisclosed
- Context: 7.2K tokens
- 99-language multilingual ASR including many low-resource languages
- Speaker diarisation up to 32 speakers
- Word-level timestamps with sub-100 ms precision
- Non-speech event detection ([applause], [laughter], [music])
- Character-level confidence scores
- Up to 2 hours per request, 1 GB file limit
- Direct integration with ElevenLabs Dubbing Studio (STT to translate to TTS)
- Best for: dubbing pipelines, multilingual transcription, podcast indexing, media analytics
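For indexing or search, it can help to separate spoken text from non-speech event tags. The following is a minimal sketch that assumes events appear inline in the transcript text as bracketed tags such as [applause]; that inline format and the helper name `splitEvents` are assumptions, and the real response may expose events as separate fields instead.

```javascript
// Assumption: non-speech events appear inline as bracketed tags, e.g. "[applause]".
const EVENT_TAG = /\[([^\[\]]+)\]/g;

// Returns the spoken text with tags removed, plus the list of event labels in order.
function splitEvents(transcript) {
  const events = [];
  const speech = transcript
    .replace(EVENT_TAG, (_match, label) => {
      events.push(label.trim().toLowerCase());
      return " "; // leave a space so neighbouring words do not fuse
    })
    .replace(/\s+/g, " ") // collapse the gaps left by removed tags
    .trim();
  return { speech, events };
}

// Made-up sample transcript, for illustration only.
const { speech, events } = splitEvents(
  "Thank you all. [applause] Let's begin. [music]"
);
console.log(speech);
console.log(events);
```

Note that this treats any bracketed span as an event, so a transcript that legitimately contains square brackets would need a stricter pattern, for example an allow-list of known labels.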
Training data: Not disclosed. ElevenLabs reports training on a 'large multilingual corpus' with curation for long-tail languages; data is described as a mix of licensed and crowd-sourced opt-in audio.
License: Proprietary commercial API. Commercial use permitted on paid plans.
Known limitations
- No streaming mode at launch (file-based only)
- Hard cap of 2 hours per request
- Pricing per minute higher than Deepgram Nova-3 for English
- Closed weights, hosted only
- Diarisation accuracy degrades in noisy cross-talk
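Because requests are file-based and capped at 2 hours and 1 GB, longer recordings have to be checked and split before upload. The sketch below does a pre-flight check and plans fixed-length windows. The helper names (`fitsInOneRequest`, `planChunks`) are hypothetical, and reading 1 GB as 2^30 bytes is an assumption, since the source does not say which definition applies.

```javascript
// Limits quoted in this section: 2 hours and 1 GB per request.
const MAX_SECONDS = 2 * 60 * 60;
const MAX_BYTES = 1024 ** 3; // assumption: 1 GB read as 2^30 bytes

// True when a recording can be sent as a single request.
function fitsInOneRequest(durationSeconds, sizeBytes) {
  return durationSeconds <= MAX_SECONDS && sizeBytes <= MAX_BYTES;
}

// Split a recording into consecutive [start, end) windows of at most maxSeconds.
function planChunks(durationSeconds, maxSeconds = MAX_SECONDS) {
  if (!(durationSeconds > 0) || !(maxSeconds > 0)) return [];
  const chunks = [];
  for (let start = 0; start < durationSeconds; start += maxSeconds) {
    chunks.push({ start, end: Math.min(start + maxSeconds, durationSeconds) });
  }
  return chunks;
}

// A 90-minute, 300 MiB file fits in one request.
console.log(fitsInOneRequest(5400, 300 * 1024 ** 2));
// A 5-hour recording is planned as three windows.
console.log(planChunks(5 * 60 * 60));
```

Fixed boundaries can cut a word in half, so a real pipeline would likely move each cut to a nearby silence. Timestamps returned for each chunk would also need the chunk's `start` added back to align with the original recording, and each exported chunk should be re-checked against the size limit.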
Frequently asked questions
Related Models
Whisper Large v3
OpenAI's Whisper model. State-of-the-art speech recognition supporting 99+ languages.
Whisper Large v3 Turbo
OpenAI's distilled Whisper Large v3. ~216x realtime, 99+ languages, MIT-licensed weights.
Deepgram Nova-3
Deepgram's flagship STT. First to offer realtime multilingual transcription with self-serve customization.
SeamlessM4T v2 Large (Speech)
Meta SeamlessM4T v2 Large speech mode. Speech-to-speech, speech-to-text, and text-to-speech translation across 100+ languages in a single unified model.