Speech-to-Text
Transcribe and understand audio with AI
Speech-to-text models for transcription, meetings, and search
Speech-to-text (STT) models convert spoken audio into written text. The category covers everything from podcast transcripts to real-time captioning pipelines to voice-command interfaces inside mobile apps. Reach for STT when you need to search inside audio, build dictation, summarize meetings, or generate captions for accessibility.
5 models available
Whisper Large V3
OpenAI's Whisper model. State-of-the-art speech recognition supporting 99+ languages.
Whisper Large v3 Turbo
OpenAI's distilled Whisper Large v3. ~216x realtime, 99+ languages, MIT-licensed weights.
Deepgram Nova-3
Deepgram's flagship STT. First to offer realtime multilingual transcription with self-serve customization.
ElevenLabs Scribe v1
ElevenLabs' STT. 99 languages, word-level timestamps, speaker diarization, audio-event tagging.
SeamlessM4T v2 Large (Speech)
Meta's SeamlessM4T v2 Large in speech mode. Speech-to-speech, speech-to-text, and text-to-speech translation across 100+ languages in a single unified model.
Top speech-to-text picks
Hand-picked across four common criteria — resolved against the live catalog so the picks track price and performance changes.
OpenAI's Whisper model. State-of-the-art speech recognition supporting 99+ languages.
ElevenLabs' STT. 99 languages, word-level timestamps, speaker diarization, audio-event tagging.
Pricing is almost always per minute of audio. Flagship models (Whisper Large V3, Deepgram Nova-3, ElevenLabs Scribe) cost roughly €0.005-€0.015 per minute, so a one-hour podcast transcript costs €0.30-€0.90 depending on the tier. Some providers charge extra for premium features like speaker diarization, word-level timestamps, summaries, or translation, so do the math with the features you actually need turned on.
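The arithmetic above can be sketched as a small estimator. This is a minimal illustration, not any provider's billing logic: the function name is hypothetical, and it assumes add-on features are also billed per minute, which may not hold for every provider. The €0.002 diarization surcharge is an invented placeholder, not a quoted price.

```python
def transcription_cost_eur(duration_minutes, rate_per_minute, addon_rates=()):
    """Estimate cost as minutes x (base rate + per-minute add-on rates).

    Assumption: add-ons (diarization, timestamps, ...) are billed per minute.
    """
    if duration_minutes < 0:
        raise ValueError("duration_minutes must be non-negative")
    return duration_minutes * (rate_per_minute + sum(addon_rates))


# One-hour podcast at the low and high ends of the range quoted above.
low = transcription_cost_eur(60, 0.005)
high = transcription_cost_eur(60, 0.015)
print(f"One-hour episode: €{low:.2f} to €{high:.2f}")

# Same episode with a hypothetical €0.002/min diarization surcharge.
with_addon = transcription_cost_eur(60, 0.005, addon_rates=[0.002])
print(f"With placeholder add-on: €{with_addon:.2f}")
```

Running the numbers with every feature you plan to enable, rather than the headline rate alone, is what keeps the estimate honest.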
The trade-off is between accuracy, latency, and feature richness. Whisper Large V3 leads on raw word error rate (WER) in benchmark evaluations and has open weights, so you can self-host it. Deepgram Nova-3 and AssemblyAI Universal lead on streaming latency (sub-300 ms to first token) and diarization quality. ElevenLabs Scribe leads on multilingual coverage and code-switching (when speakers swap languages mid-sentence). For batch transcription, Whisper usually wins on the combination of cost and accuracy. For realtime call transcription, a streaming-first provider wins.
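If an application routes different workloads to different models, the guidance above could be encoded as a simple rule. The sketch below is only one reading of that paragraph: the function name and flags are hypothetical, and giving realtime priority over code-switching is an assumption, since the text does not say which should win when both apply.

```python
def suggest_stt_model(realtime: bool, code_switching: bool = False) -> str:
    """Map workload traits to a model family, following the trade-offs above."""
    if realtime:
        # Streaming-first providers lead on latency (assumed to take priority).
        return "Deepgram Nova-3"
    if code_switching:
        # Speakers swapping languages mid-sentence.
        return "ElevenLabs Scribe v1"
    # Batch transcription: usually the best mix of cost and accuracy.
    return "Whisper Large V3"


print(suggest_stt_model(realtime=False))                       # batch job
print(suggest_stt_model(realtime=False, code_switching=True))  # mixed-language audio
print(suggest_stt_model(realtime=True))                        # live call
```

A rule like this is a starting point only; benchmark on your own audio before committing to a provider.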
Watch out for noisy audio: word error rate roughly doubles below 20 dB SNR on every model, and overlapping speakers degrade diarization even on the flagships. Pre-process with a noise-suppression tool (RNNoise, Krisp) if your source quality is unpredictable. Also watch out for proper nouns: every model still mistranscribes uncommon names, technical terms, and brand names. Most providers accept a `keywords` hint list to bias the decoder (the exact parameter name varies by provider), so use it.
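To decide whether a recording needs noise suppression before transcription, one rough option is to estimate its SNR and compare it to the 20 dB mark mentioned above. The sketch below assumes you can isolate a noise-only stretch (for example, a pause before anyone speaks) and a speech-bearing stretch as lists of float samples; the function names are hypothetical and this is a crude estimate, not a calibrated measurement.

```python
import math

SNR_THRESHOLD_DB = 20.0  # threshold taken from the guidance above


def mean_power(samples):
    """Average squared amplitude of a list of samples."""
    if not samples:
        raise ValueError("need at least one sample")
    return sum(s * s for s in samples) / len(samples)


def estimate_snr_db(speech_samples, noise_samples):
    """Rough SNR in dB from a speech-bearing segment and a noise-only segment.

    The speech segment also contains noise, so the noise power is
    subtracted before forming the ratio.
    """
    noise_power = mean_power(noise_samples)
    if noise_power == 0:
        return math.inf
    signal_power = max(mean_power(speech_samples) - noise_power, 0.0)
    if signal_power == 0:
        return -math.inf
    return 10.0 * math.log10(signal_power / noise_power)


def needs_denoising(speech_samples, noise_samples, threshold_db=SNR_THRESHOLD_DB):
    """True when the estimated SNR falls below the threshold."""
    return estimate_snr_db(speech_samples, noise_samples) < threshold_db


# Synthetic example: unit-amplitude "speech" against a half-amplitude noise floor.
speech = [1.0, -1.0] * 100
noise = [0.5, -0.5] * 100
print(f"{estimate_snr_db(speech, noise):.1f} dB, denoise: {needs_denoising(speech, noise)}")
```

Recordings flagged this way can be sent through a noise-suppression step first, while clean audio goes straight to the transcription model.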
The top picks above cover the most accurate model, the cheapest workhorse, the model that handles the longest audio, and the fastest streaming option.