Whisper Large V3
OpenAI's Whisper model. State-of-the-art speech recognition supporting 99+ languages.
Whisper Large V3 is a speech-to-text AI model from OpenAI, priced at €0.000 per 1M input tokens, with a 30-second audio context window.
Supported upload formats: MP3, WAV, M4A, FLAC (max 25 MB)
Examples
See what Whisper Large V3 can generate
Meeting Notes
Good morning everyone. Let's start with the sprint review. The authentication module shipped on Friday and we've had zero critical bugs reported so far. Sarah, can you walk us through the performance metrics? We're seeing a forty percent reduction in login latency which is well above our target.
Interview Transcript
So tell me about your experience with distributed systems. I spent three years at a fintech startup building event-driven microservices using Kafka and Kubernetes. The biggest challenge was handling exactly-once delivery semantics across our payment processing pipeline. We ended up implementing an idempotency layer that reduced duplicate transactions by ninety-nine point nine percent.
API Integration
Use our OpenAI-compatible API to integrate Whisper Large V3 into your application.
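Because the API is described as OpenAI-compatible, an audio upload would presumably follow the shape of OpenAI's transcription endpoint (a multipart POST to /audio/transcriptions with "file" and "model" fields). The sketch below is an assumption along those lines, not documented behaviour of this service: the base URL is a placeholder, the helper names validateAudioFile and buildTranscriptionRequest are hypothetical, and "25MB" is read as 25 MiB. It checks a file against the upload limits listed earlier on this page and builds the request without sending it.

```javascript
// Upload limits taken from this page; reading "25MB" as 25 MiB is an assumption.
const ALLOWED_EXTENSIONS = ["mp3", "wav", "m4a", "flac"];
const MAX_BYTES = 25 * 1024 * 1024;

// Hypothetical helper: checks the file extension and size before uploading.
function validateAudioFile(filename, sizeInBytes) {
  const ext = filename.split(".").pop().toLowerCase();
  if (!filename.includes(".") || !ALLOWED_EXTENSIONS.includes(ext)) {
    return { ok: false, reason: `unsupported format: ${filename}` };
  }
  if (sizeInBytes > MAX_BYTES) {
    return { ok: false, reason: "file exceeds 25 MB" };
  }
  return { ok: true };
}

// Hypothetical helper: builds (but does not send) a multipart request in the
// shape of OpenAI's transcription endpoint. Whether this service exposes the
// same path and field names is an assumption.
function buildTranscriptionRequest(audioBytes, filename, apiKey, baseUrl) {
  const check = validateAudioFile(filename, audioBytes.byteLength);
  if (!check.ok) throw new Error(check.reason);

  const form = new FormData();
  form.append("file", new Blob([audioBytes]), filename);
  form.append("model", "whisper-large-v3");

  return {
    url: `${baseUrl}/audio/transcriptions`,
    init: {
      method: "POST",
      headers: { Authorization: `Bearer ${apiKey}` },
      body: form,
    },
  };
}
```

The returned pair can be passed straight to fetch(url, init). The Content-Type header is left unset on purpose so that fetch can add the multipart boundary itself. The SDK snippet that follows is the page's own example.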
npm install railwail

import railwail from "railwail";
const rw = railwail("YOUR_API_KEY");
// Simple — just pass a string
const reply = await rw.run("whisper-large-v3", "Hello! What can you do?");
console.log(reply);
// With message history
const reply2 = await rw.run("whisper-large-v3", [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "Explain quantum computing simply." },
]);
console.log(reply2);
// Full response with usage info
const res = await rw.chat("whisper-large-v3", [
{ role: "user", content: "Hello!" },
], { temperature: 0.7, max_tokens: 500 });
console.log(res.choices[0].message.content);
console.log(res.usage);

Deep dive: OpenAI's Whisper Large V3
OpenAI was founded in December 2015 by Sam Altman, Elon Musk, Greg Brockman, Ilya Sutskever, Wojciech Zaremba and John Schulman, and restructured as the capped-profit OpenAI LP in 2019. Whisper was first released as an open-weights speech recognition model in September 2022 by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey and Ilya Sutskever, and quickly became the de facto open ASR baseline thanks to its zero-shot multilingual quality. Whisper Large v3 was released in November 2023 alongside the GPT-4 Turbo launch, as the last full-size open checkpoint in the original Whisper line. OpenAI later released Whisper Large v3 Turbo (October 2024), a faster distilled variant. All Whisper weights remain available under the MIT licence on GitHub and Hugging Face and are widely used in production, including by competing providers.
Whisper Large v3 is a 1.55B-parameter encoder-decoder Transformer trained for multitask audio understanding. Its input is a 30-second log-mel spectrogram with 128 mel bins (up from 80 in v2), processed by a 32-layer audio encoder; the 32-layer Transformer decoder emits special task tokens for language identification, transcription, translation to English and timestamp prediction. Large v3 was trained on about 5 million hours of audio: roughly 1 million hours of weakly supervised multilingual web audio (up from the 680k hours used for v1/v2) plus 4 million hours of pseudo-labelled audio generated by Whisper Large v2. The model covers 99 languages, with very strong English performance (~5% WER on LibriSpeech test-clean) and significantly improved low-resource coverage. Audio longer than 30 seconds is processed with a sliding window. The model is released under the MIT licence.
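To make the input dimensions concrete, the arithmetic behind one 30-second window can be sketched as follows. The 16 kHz sample rate, the 160-sample hop and the stride-2 downsampling in the encoder are taken from the open-source Whisper reference code rather than from this page, so treat them as assumptions here; encoderInputShape is a hypothetical helper name.

```javascript
// Constants from the open-source Whisper reference code (assumptions here,
// not stated on this page), except the window length and mel-bin count.
const SAMPLE_RATE = 16000;  // audio is resampled to 16 kHz
const HOP_LENGTH = 160;     // samples between spectrogram frames (10 ms)
const WINDOW_SECONDS = 30;  // fixed input window
const N_MELS = 128;         // Large v3; earlier checkpoints use 80
const ENCODER_STRIDE = 2;   // convolutional downsampling before the encoder layers

// Hypothetical helper: derives the tensor sizes for one input window.
function encoderInputShape() {
  const samples = SAMPLE_RATE * WINDOW_SECONDS; // raw samples per window
  const frames = samples / HOP_LENGTH;          // spectrogram frames per window
  const positions = frames / ENCODER_STRIDE;    // sequence length seen by the encoder
  return {
    samples,
    frames,
    melBins: N_MELS,
    positions,
    msPerPosition: (WINDOW_SECONDS * 1000) / positions,
  };
}

console.log(encoderInputShape());
```

Under these assumptions each window is a 128 x 3000 spectrogram that the encoder sees as 1500 positions of 20 ms each, regardless of how much of the window actually contains speech.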
- Parameters: 1.55B (Large v3)
- Context: 30 seconds of audio
- 99-language transcription with automatic language detection
- Translation of any supported language directly to English text
- Word-level and segment-level timestamps
- 30-second context window with sliding window for long-form audio
- Open weights under MIT licence, runs on a single 16 GB GPU in fp16
- Strong robustness to accents, background noise and technical jargon
- Best for: open-source ASR pipelines, research baselines, on-premise transcription
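As an illustration of what segment-level timestamps are typically used for, the sketch below turns a list of segments into SubRip (SRT) subtitles. The segment shape { start, end, text }, with times in seconds, mirrors the output of the open-source Whisper package but is an assumption about what any particular API returns; formatSrtTime and segmentsToSrt are hypothetical helper names.

```javascript
// Formats a time in seconds as an SRT timestamp: HH:MM:SS,mmm
function formatSrtTime(seconds) {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  const pad = (n, width) => String(n).padStart(width, "0");
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms, 3)}`;
}

// Converts segments (assumed shape: { start, end, text }, times in seconds)
// into numbered SRT cues separated by blank lines.
function segmentsToSrt(segments) {
  return segments
    .map((seg, i) =>
      `${i + 1}\n${formatSrtTime(seg.start)} --> ${formatSrtTime(seg.end)}\n${seg.text.trim()}\n`
    )
    .join("\n");
}

// Example segments are illustrative, based on the Meeting Notes sample above.
const demoSegments = [
  { start: 0, end: 2.5, text: " Good morning everyone." },
  { start: 2.5, end: 5, text: " Let's start with the sprint review." },
];
console.log(segmentsToSrt(demoSegments));
```

Rounding to whole milliseconds before splitting into hours, minutes and seconds avoids cues such as 00:00:59,1000 when a timestamp sits just below a second boundary.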
Training data: about 5 million hours in total, made up of ~1 million hours of weakly supervised multilingual audio scraped from the public web (subtitles aligned with audio) and ~4 million hours of pseudo-labelled audio generated by Whisper Large v2. In the original 680k-hour supervised set used for v1/v2, 17% was non-English speech.
License: MIT licence for code and weights; commercial use permitted.
Known limitations
- 30-second hard window requires chunking for long audio
- Known to hallucinate transcripts on silent / music-only segments
- WER on low-resource languages still much higher than English
- No native diarisation (must be combined with a separate tool such as pyannote)
- Slow on CPU; a GPU is needed for real-time transcription
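Two of these limitations are usually handled in a preprocessing step: splitting long audio into windows of at most 30 seconds, and skipping windows that are effectively silent so the model has less opportunity to hallucinate text. The sketch below shows one simple way to do both. It uses a fixed overlap between windows, which is a simplification and not the timestamp-driven window shifting of the reference Whisper implementation; the 5-second overlap and the RMS threshold are arbitrary illustrative values, and planChunks and isLikelySilent are hypothetical helper names.

```javascript
// Plans overlapping windows (in seconds) that cover a recording.
// The fixed overlap is an illustrative simplification.
function planChunks(totalSeconds, windowSeconds = 30, overlapSeconds = 5) {
  if (overlapSeconds >= windowSeconds) {
    throw new Error("overlap must be smaller than the window");
  }
  const chunks = [];
  const step = windowSeconds - overlapSeconds;
  for (let start = 0; start < totalSeconds; start += step) {
    const end = Math.min(start + windowSeconds, totalSeconds);
    chunks.push({ start, end });
    if (end >= totalSeconds) break; // last window reached the end of the audio
  }
  return chunks;
}

// Crude silence gate: true when the RMS level of the samples (floats in
// [-1, 1]) falls below a threshold. The default threshold is an assumption
// and would need tuning per recording setup.
function isLikelySilent(samples, rmsThreshold = 0.01) {
  if (samples.length === 0) return true;
  let sumSquares = 0;
  for (const x of samples) sumSquares += x * x;
  return Math.sqrt(sumSquares / samples.length) < rmsThreshold;
}

console.log(planChunks(70));
```

An energy gate like this only catches near-silence. Music-only segments have plenty of energy, so they would need a proper voice activity detector instead.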
Related Models
Whisper Large v3 Turbo
OpenAI's distilled Whisper Large v3. ~216x realtime, 99+ languages, MIT-licensed weights.
Deepgram Nova-3
Deepgram's flagship STT. First to offer realtime multilingual transcription with self-serve customization.
ElevenLabs Scribe v1
ElevenLabs' STT. 99 languages, word-level timestamps, speaker diarization, audio-event tagging.
SeamlessM4T v2 Large (Speech)
Meta SeamlessM4T v2 Large speech mode. Speech-to-speech, speech-to-text, and text-to-speech translation across 100+ languages in a single unified model.
Start using Whisper Large V3 today
Get started with free credits. No credit card required. Access Whisper Large V3 and 100+ other models through a single API.