F5-TTS
Open-source flow-matching TTS with strong zero-shot voice cloning. Code MIT, weights CC-BY-NC.
F5-TTS is a text-to-speech AI model served through Replicate, priced at €0.000 per 1M input tokens with an unknown context window.
Pricing
API Integration
Use our OpenAI-compatible API to integrate F5-TTS into your application.
npm install railwail
import railwail from "railwail";
const rw = railwail("YOUR_API_KEY");
// Simple — just pass a string
const reply = await rw.run("f5-tts", "Hello! What can you do?");
console.log(reply);
// With message history
const reply2 = await rw.run("f5-tts", [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "Explain quantum computing simply." },
]);
console.log(reply2);
// Full response with usage info
const res = await rw.chat("f5-tts", [
{ role: "user", content: "Hello!" },
], { temperature: 0.7, max_tokens: 500 });
console.log(res.choices[0].message.content);
console.log(res.usage);
Deep dive: F5-TTS by SWivid (Shanghai Jiao Tong University et al.)
F5-TTS was released in October 2024 under the SWivid GitHub account by a research group anchored at Shanghai Jiao Tong University's X-LANCE Lab, with co-authors from the University of Cambridge and the Geely Automobile Research Institute. The authors, Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, Jian Zhao, Kai Yu and Xie Chen, published the code on GitHub under the MIT licence, together with CC-BY-NC model weights and a Hugging Face demo. The work builds on Microsoft's earlier E2 TTS paper and quickly became one of the most popular open-weights TTS models of 2024-2025, frequently cited as a leading open alternative to ElevenLabs for English and Chinese voice cloning.
F5-TTS (the name comes from the paper title, "A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching") is a non-autoregressive text-to-speech system that replaces the encoder-decoder pipeline with a single Diffusion Transformer (DiT) trained with conditional flow matching. Text is first converted to character tokens and padded with filler tokens to the target mel-spectrogram length, refined by ConvNeXt V2 blocks, then concatenated with the noisy mel input and the mel reference for the target voice. The DiT backbone predicts the velocity field that maps Gaussian noise to a clean mel-spectrogram, which a Vocos vocoder then converts to a waveform.
Training data is the 100k-hour open Emilia dataset (multilingual long-form audio scraped from podcasts and audiobooks). Because there is no autoregressive decoder, F5-TTS achieves a real-time factor below 0.2 on a single A10 GPU while keeping WER competitive with VALL-E and NaturalSpeech 3. The model is best known for high-quality zero-shot voice cloning from a single 5-15 second reference clip.
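The sampling step described above (following a velocity field from Gaussian noise to a clean mel-spectrogram) can be sketched as a fixed-step Euler ODE solve. This is only a toy illustration: `makeToyVelocity` is a hypothetical stand-in for the trained DiT, the four-element `target` array stands in for a mel-spectrogram, and none of these names come from the F5-TTS codebase.

```javascript
// Toy illustration of flow-matching inference: integrate dx/dt = v(x, t)
// from Gaussian noise (t = 0) towards a mel-like target (t = 1).

// Box-Muller sample from a standard normal distribution.
function gaussian() {
  const u = 1 - Math.random(); // in (0, 1], avoids log(0)
  const v = Math.random();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

// Hypothetical stand-in for the trained model. For the straight-line path
// x_t = (1 - t) * x0 + t * x1, the ideal velocity is (x1 - x_t) / (1 - t).
// A real model would predict this from text and reference audio instead.
function makeToyVelocity(target) {
  return (x, t) => x.map((xi, i) => (target[i] - xi) / (1 - t));
}

// Fixed-step Euler solver: x <- x + dt * v(x, t), repeated `steps` times.
function eulerSample(velocity, dim, steps) {
  let x = Array.from({ length: dim }, gaussian);
  const dt = 1 / steps;
  for (let k = 0; k < steps; k++) {
    const t = k * dt; // t stays below 1, so the division above is safe
    const v = velocity(x, t);
    x = x.map((xi, i) => xi + dt * v[i]);
  }
  return x;
}

const target = [0.2, -1.0, 0.5, 0.8];
const sample = eulerSample(makeToyVelocity(target), target.length, 16);
console.log(sample);
```

With this idealised velocity the final Euler step lands on the target regardless of the starting noise. A learned velocity field is only approximate, which is why the number of solver steps trades quality against speed in practice.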
- Parameters: ~330M (Base) and ~1.4B (Large)
- Context: about 30 s of audio per generation (the model has no token-based context window)
- Zero-shot voice cloning from ~10 s of reference audio with no fine-tuning
- Non-autoregressive flow matching with real-time-factor < 0.2 on a single GPU
- Open weights with multiple checkpoints (Base, Small, Multilingual); code under MIT licence, weights under CC-BY-NC
- English and Chinese out of the box; community fine-tunes for German, French, Japanese, Spanish
- Up to 30 s of generated audio per inference
- Speed control by adjusting the duration prompt
- Best for: research, on-device TTS and voice cloning prototypes; the CC-BY-NC weights rule out commercial products unless the model is retrained on suitably licensed data
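Two of the points above, the real-time factor and speed control through the requested duration, come down to simple arithmetic. The sketch below is an assumption about how one might compute them: `estimateDurationSeconds` is a hypothetical heuristic that assumes the new text is spoken at the same characters-per-second rate as the reference clip, and it is not taken from this page or from the railwail client.

```javascript
// Real-time factor: seconds of compute per second of audio produced.
// Values below 1 mean synthesis runs faster than playback.
function realTimeFactor(synthesisSeconds, audioSeconds) {
  if (audioSeconds <= 0) throw new RangeError("audioSeconds must be positive");
  return synthesisSeconds / audioSeconds;
}

// Hypothetical duration heuristic: reuse the reference clip's speaking rate
// (seconds per character), then divide by `speed` to talk faster or slower.
function estimateDurationSeconds(refSeconds, refText, genText, speed = 1.0) {
  if (refText.length === 0) throw new RangeError("refText must not be empty");
  if (speed <= 0) throw new RangeError("speed must be positive");
  const secondsPerChar = refSeconds / refText.length;
  return (secondsPerChar * genText.length) / speed;
}

// A 6 s reference clip saying "Good morning", used to size a longer sentence.
const normal = estimateDurationSeconds(6, "Good morning", "Good morning, everyone");
const faster = estimateDurationSeconds(6, "Good morning", "Good morning, everyone", 1.25);

// Example: 1.5 s of compute for 10 s of audio.
const rtf = realTimeFactor(1.5, 10);

console.log({ normal, faster, rtf });
```

Requesting a shorter duration for the same text makes the speech faster, and a longer one makes it slower. Character count is a crude proxy for speaking time, so the estimate would need tuning per language.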
Pretrained on the 100,000-hour open Emilia dataset of multilingual long-form audio with weakly supervised transcripts. Additional community fine-tunes use LibriSpeech, AISHELL-3 and bespoke audiobook collections.
License: Code under MIT licence; pretrained weights under CC-BY-NC, so commercial use of the released checkpoints is not permitted.
Known limitations
- No formal SSML / emotion tags
- Quality degrades on noisy reference audio
- Multilingual coverage outside English/Chinese depends on community checkpoints
- 30-second hard cap per generation
- Voice cloning quality slightly below ElevenLabs Multilingual V2 on emotional acting
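A common way to work around the 30-second cap is to split long text into sentence-sized chunks, synthesise each one separately, and join the audio afterwards. The sketch below covers only the text-splitting part. `chunkText` and its `maxChars` budget are hypothetical helpers for illustration, and the right budget depends on speaking rate and reference length.

```javascript
// Split text into sentence-like pieces, then greedily pack them into chunks
// of at most `maxChars` characters. A single sentence longer than the limit
// is kept whole rather than cut mid-sentence.
function chunkText(text, maxChars) {
  const sentences = text.match(/[^.!?]+[.!?]*\s*/g) || [];
  const chunks = [];
  let current = "";
  for (const raw of sentences) {
    const sentence = raw.trim();
    if (!sentence) continue;
    const candidate = current ? current + " " + sentence : sentence;
    if (candidate.length <= maxChars || !current) {
      current = candidate; // still fits, or first sentence of a new chunk
    } else {
      chunks.push(current); // close the full chunk and start another
      current = sentence;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

const longText = "One two. Three four. Five six.";
console.log(chunkText(longText, 20));
console.log(chunkText(longText, 18));
```

Splitting on sentence boundaries keeps prosody natural within each chunk. Each chunk would then go through its own generation call, with the resulting clips concatenated or cross-faded.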
Related Models
ElevenLabs Multilingual V2
ElevenLabs' most natural-sounding TTS model. Supports 29 languages with emotional range.
AudioCraft
Meta's AudioCraft framework wrapping MusicGen, AudioGen and EnCodec. Unified text-to-audio research toolkit for music and sound effects.
AudioLDM 2
Latent-diffusion model for general-purpose text-to-audio. Generates speech, music, and sound effects with a unified prior.
Cartesia Sonic
Cartesia's ultra-low-latency TTS (~90ms TTFB). State-space model with voice cloning support.