Question 1

Which STT model is the most accurate?

Accepted Answer

It depends on the audio. Whisper Large V3 leads on word error rate (WER) in independent benchmarks across most languages. Deepgram Nova-3 leads on English, with low-latency streaming. AssemblyAI Universal leads on call-center and meeting audio. Run a sample of your own audio on the model detail page before committing.
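
To compare models on your own sample, you can score each transcript against a hand-corrected reference. Below is a minimal sketch of word error rate, assuming simple lowercase and whitespace tokenization. Published benchmarks usually apply heavier text normalization (punctuation, numerals), so these numbers will not match them exactly. The function name is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A lower value is better, and a value above 1.0 is possible when the hypothesis contains many inserted words.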

Question 2

Is realtime streaming supported?

Accepted Answer

Yes. Deepgram, AssemblyAI, ElevenLabs Scribe, and OpenAI Realtime all stream transcripts with first-token latency under 300 ms. Batch-only providers (including some Whisper deployments) return nothing until the whole file has been processed. For captioning and voice agents, always pick a streaming-capable model.

Question 3

How is STT billed?

Accepted Answer

STT is billed per minute of audio. Flagship rates run €0.005-€0.015 per minute. Premium features (diarization, timestamps, translation) sometimes carry surcharges. At those base rates, a typical one-hour interview (60 minutes) costs €0.30-€0.90.
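
The arithmetic behind that estimate can be sketched as follows. The function name and the optional surcharge parameter are illustrative, and the rates are the ones quoted above rather than any provider's actual price list.

```python
def estimate_cost_eur(
    audio_minutes: float,
    rate_per_minute: float,
    surcharge_per_minute: float = 0.0,
) -> float:
    """Estimated cost in euros: minutes times (base rate plus any surcharge)."""
    if audio_minutes < 0 or rate_per_minute < 0 or surcharge_per_minute < 0:
        raise ValueError("minutes and rates must be non-negative")
    return audio_minutes * (rate_per_minute + surcharge_per_minute)


# A one-hour interview at the low and high ends of the quoted range.
low = estimate_cost_eur(60, 0.005)
high = estimate_cost_eur(60, 0.015)
print(f"€{low:.2f} to €{high:.2f}")  # prints "€0.30 to €0.90"
```

If a provider charges extra for diarization or translation, pass that per-minute amount as `surcharge_per_minute` to see the effect on the total.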

Question 4

What languages are supported?

Accepted Answer

Whisper Large V3 supports 99 languages. ElevenLabs Scribe covers 100+ with strong code-switching. Deepgram Nova-3 currently covers 40+, with English its strongest. For lower-resource languages, run a sample first, because accuracy varies widely.

Question 5

Can it identify different speakers (diarization)?

Accepted Answer

Yes, on most flagship models. Speaker diarization labels each segment with 'Speaker 1', 'Speaker 2', and so on. Accuracy depends on audio quality and how often speakers overlap. Some providers also accept enrollment audio to identify specific named speakers.
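
Diarized output usually arrives as many short labeled segments, which read better once consecutive segments from the same speaker are merged into turns. Below is a minimal sketch of that post-processing step. The `Segment` shape is an assumption for illustration, since every provider uses its own response schema.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    # Hypothetical shape; real provider responses use their own field names.
    speaker: str
    start: float  # seconds
    end: float    # seconds
    text: str


def merge_turns(segments: list[Segment]) -> list[Segment]:
    """Merge consecutive segments that share a speaker label into one turn."""
    turns: list[Segment] = []
    for seg in sorted(segments, key=lambda s: s.start):
        if turns and turns[-1].speaker == seg.speaker:
            last = turns[-1]
            turns[-1] = Segment(
                last.speaker,
                last.start,
                max(last.end, seg.end),
                f"{last.text} {seg.text}",
            )
        else:
            turns.append(Segment(seg.speaker, seg.start, seg.end, seg.text))
    return turns


segments = [
    Segment("Speaker 1", 0.0, 1.2, "Hi"),
    Segment("Speaker 1", 1.3, 2.0, "there."),
    Segment("Speaker 2", 2.1, 3.0, "Hello."),
]
for turn in merge_turns(segments):
    print(f"{turn.speaker}: {turn.text}")
```

The sketch sorts by start time, so overlapping speech is not handled specially. Segments where speakers talk over each other are simply interleaved in start order.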

Question 6

Are timestamps provided?

Accepted Answer

Yes — word-level or segment-level timestamps are standard on flagship tiers. Use word-level for video captioning and karaoke-style highlighting; segment-level is enough for transcript search and meeting summaries.
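
For captioning, word-level timestamps are typically grouped into short cues. Below is a minimal sketch that turns a word list into SubRip (SRT) text. The word dictionary shape (`word`, `start`, `end`, with times in seconds) and the `max_words` grouping rule are assumptions for illustration. Provider responses differ, and production captioning usually also breaks on pauses and line length.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: list[dict], max_words: int = 7) -> str:
    """Group words into cues of at most max_words and render them as SRT."""
    cues = []
    for index in range(0, len(words), max_words):
        chunk = words[index:index + max_words]
        start = srt_timestamp(chunk[0]["start"])  # first word's start time
        end = srt_timestamp(chunk[-1]["end"])     # last word's end time
        text = " ".join(w["word"] for w in chunk)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


words = [
    {"word": "Welcome", "start": 0.0, "end": 0.4},
    {"word": "to", "start": 0.45, "end": 0.55},
    {"word": "the", "start": 0.6, "end": 0.7},
    {"word": "show.", "start": 0.75, "end": 1.2},
]
print(words_to_srt(words, max_words=2))
```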

Question 7

What audio formats are accepted?

Accepted Answer

Most providers accept MP3, WAV, M4A, FLAC, and OGG, plus most browser-native streaming formats, at sample rates from 8 kHz (telephony) up to 48 kHz (studio). Maximum file size varies by provider: typically 25 MB on managed APIs, with no upload limit for self-hosted Whisper.
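
A quick pre-upload check can catch files that fall outside those ranges. Below is a minimal sketch for WAV files using only the Python standard library. The 25 MB default (taken here as 25 × 1024 × 1024 bytes) and the 8-48 kHz window come from the figures above and are assumptions, so check the limits of the provider you actually use. The function name is illustrative.

```python
import os
import wave

# Assumed default; providers define their own upload limits.
MAX_UPLOAD_BYTES = 25 * 1024 * 1024


def check_wav(path: str, max_bytes: int = MAX_UPLOAD_BYTES) -> list[str]:
    """Return a list of problems found; an empty list means the file looks fine."""
    problems = []

    size = os.path.getsize(path)
    if size > max_bytes:
        problems.append(f"file is {size} bytes, over the {max_bytes}-byte limit")

    # The wave module reads uncompressed PCM WAV headers only.
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
    if not 8_000 <= rate <= 48_000:
        problems.append(f"sample rate {rate} Hz is outside 8-48 kHz")

    return problems
```

Calling `check_wav("interview.wav")` on a hypothetical 16 kHz recording under the size limit returns an empty list. Compressed formats such as MP3 or M4A need a different reader, because `wave` does not parse them.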

Question 8

Can it translate while transcribing?

Accepted Answer

Yes — Whisper has a built-in translate mode that produces English transcripts from any of its 99 supported source languages. ElevenLabs Scribe and a few other providers support translation to a broader target set. Translation accuracy is lower than dedicated translation models — fine for search but not for publication.

Speech-to-Text

Speech-to-text models for transcription, meetings, and search

Incredibly Fast Whisper

Whisper

Whisper Large V3

Whisper Large v3 Turbo

Deepgram Nova-3

SeamlessM4T

SeamlessM4T v2 Large (Speech)

Whisper Diarization

WhisperX


Related comparisons

Whisper Large V3 vs Deepgram Nova-3
