Speech-to-Text

Transcribe and understand audio with AI

Speech-to-text models for transcription, meetings, and search

Speech-to-text (STT) models convert spoken audio into written text. The category covers everything from podcast transcripts to real-time captioning pipelines to voice-command interfaces inside mobile apps. Reach for STT when you need to search inside audio, build dictation, summarize meetings, or generate captions for accessibility.

Top speech-to-text picks

Hand-picked across four common criteria and resolved against the live catalog, so the picks track price and performance changes.

Best overall
Whisper Large V3

OpenAI's Whisper model. State-of-the-art speech recognition supporting 99+ languages.

Cheapest
ElevenLabs Scribe v1

ElevenLabs' STT. 99 languages, word-level timestamps, speaker diarization, audio-event tagging.

Longest audio
Whisper Large V3

OpenAI's Whisper model. State-of-the-art speech recognition supporting 99+ languages.

Fastest
Whisper Large V3

OpenAI's Whisper model. State-of-the-art speech recognition supporting 99+ languages.


Pricing is almost always per minute of audio. Flagship models (Whisper Large V3, Deepgram Nova-3, ElevenLabs Scribe) cost roughly €0.005-€0.015 per minute, so a one-hour podcast transcript costs €0.30-€0.90 depending on the tier. Some providers charge extra for premium features like speaker diarization, word-level timestamps, summaries, or translation, so do the math with the features you actually need turned on.
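The per-minute arithmetic above can be sketched as a small estimator. This is an illustrative sketch only: the `SttPricing` class, the `estimate_cost` function, and the diarization surcharge of €0.002 per minute are assumptions made up for the example, not any provider's real price sheet. Only the €0.005 and €0.015 base rates come from the range quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SttPricing:
    """Hypothetical price sheet; all rates are in EUR per minute of audio."""
    base_per_minute: float
    # Per-minute surcharges for optional features (illustrative names and values).
    feature_surcharges: dict[str, float]


def estimate_cost(pricing: SttPricing, audio_minutes: float,
                  features: tuple[str, ...] = ()) -> float:
    """Return the estimated cost in EUR for one transcription job."""
    if audio_minutes < 0:
        raise ValueError("audio_minutes must be non-negative")
    per_minute = pricing.base_per_minute
    for name in features:
        # Raises KeyError if the feature is not on the price sheet.
        per_minute += pricing.feature_surcharges[name]
    return round(per_minute * audio_minutes, 4)


# Base rates from the range quoted above; the surcharge is a made-up figure.
low = SttPricing(base_per_minute=0.005, feature_surcharges={"diarization": 0.002})
high = SttPricing(base_per_minute=0.015, feature_surcharges={"diarization": 0.002})

print(estimate_cost(low, 60))                    # one-hour podcast, low end
print(estimate_cost(high, 60))                   # one-hour podcast, high end
print(estimate_cost(low, 60, ("diarization",)))  # low end plus the assumed surcharge
```

Running the estimate with every feature you plan to enable is what makes two providers comparable; the base rate alone can understate the bill.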

The trade-off is between accuracy, latency, and feature richness. Whisper Large V3 leads on raw word-error-rate in benchmark evaluations and is open-weight, so you can self-host. Deepgram Nova-3 and AssemblyAI Universal lead on streaming latency (under 300 ms to first token) and diarization quality. ElevenLabs Scribe leads on multilingual coverage and code-switching (when speakers swap languages mid-sentence). For batch transcription, Whisper usually wins on cost and accuracy. For real-time call transcription, a streaming-first provider wins.
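Word-error-rate is the metric behind the accuracy comparison above, and it is easy to measure on your own audio. A minimal sketch, assuming the usual definition (word-level edit distance divided by the number of reference words) and a naive normalization of lowercasing and whitespace splitting. Published benchmarks typically normalize punctuation and numerals more carefully, so numbers from this function will not match them exactly. The function name is an illustrative choice.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words (rolling-row Levenshtein).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("quack") and one deletion ("fox") over four reference words.
print(word_error_rate("the quick brown fox", "the quack brown"))
```

Scoring a few dozen representative clips from your own domain with a function like this tends to be more informative than a leaderboard figure, because accents, microphones, and vocabulary shift the ranking.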

Watch out for noisy audio: word-error-rate roughly doubles below 20 dB SNR on every model, and overlapping speakers degrade diarization even on the flagships. Pre-process with a noise-suppression model (RNNoise, Krisp) if your source is unpredictable. Also watch out for proper nouns: every model still mistranscribes uncommon names, technical terms, and brand names. Most providers accept a hint list to bias the decoder (often a `keywords` parameter, though the name varies by provider), so use it.
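To decide whether a clip falls below the 20 dB threshold mentioned above, you can estimate SNR before sending audio for transcription. A rough sketch, assuming you already have two lists of samples: one from a stretch containing speech and one from a stretch containing only background noise (how you find those stretches, for example with a voice-activity detector, is outside this example). The function names are illustrative, and subtracting noise power from the speech segment is a simplification that assumes speech and noise are uncorrelated.

```python
import math
from typing import Sequence

NOISY_THRESHOLD_DB = 20.0  # threshold taken from the guidance above


def mean_power(samples: Sequence[float]) -> float:
    """Average squared amplitude of a block of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    return sum(s * s for s in samples) / len(samples)


def estimate_snr_db(speech: Sequence[float], noise: Sequence[float]) -> float:
    """Estimate SNR in dB from a speech segment and a noise-only segment."""
    noise_power = mean_power(noise)
    if noise_power == 0.0:
        return math.inf
    # The speech segment also contains noise, so subtract the noise power.
    signal_power = max(mean_power(speech) - noise_power, 0.0)
    if signal_power == 0.0:
        return -math.inf
    return 10.0 * math.log10(signal_power / noise_power)


def needs_denoising(speech: Sequence[float], noise: Sequence[float],
                    threshold_db: float = NOISY_THRESHOLD_DB) -> bool:
    """True when the estimated SNR falls below the threshold."""
    return estimate_snr_db(speech, noise) < threshold_db


# Synthetic example: full-scale "speech" against quiet and loud noise floors.
speech = [1.0, -1.0] * 100
quiet_noise = [0.01, -0.01] * 100
loud_noise = [0.5, -0.5] * 100

print(needs_denoising(speech, quiet_noise))  # well above 20 dB
print(needs_denoising(speech, loud_noise))   # well below 20 dB
```

A gate like this lets you run noise suppression only on the clips that need it, which matters because denoising clean audio can itself introduce artifacts.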

Top picks above cover the most accurate model, the cheapest workhorse, the longest-audio supporter, and the fastest streaming option.

