F5-TTS

Replicate
Text-to-Speech

Open-source flow-matching TTS with strong zero-shot voice cloning. Code MIT, weights CC-BY-NC.

TL;DR · Last updated May 16, 2026

F5-TTS is a text-to-speech AI model hosted on Replicate, priced at €0.000 per 1M input tokens, with an unknown context window.


Pricing

Price per Generation
Per generation: Free

API Integration

Use our OpenAI-compatible API to integrate F5-TTS into your application.

Install
npm install railwail
JavaScript / TypeScript
import railwail from "railwail";

const rw = railwail("YOUR_API_KEY");

// Simple: just pass a string
const reply = await rw.run("f5-tts", "Hello! What can you do?");
console.log(reply);

// With message history
const reply2 = await rw.run("f5-tts", [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Explain quantum computing simply." },
]);
console.log(reply2);

// Full response with usage info
const res = await rw.chat("f5-tts", [
  { role: "user", content: "Hello!" },
], { temperature: 0.7, max_tokens: 500 });
console.log(res.choices[0].message.content);
console.log(res.usage);

Specifications

Developer: Replicate
Category: Text-to-Speech
Supported Formats: text, audio
Tags: f5, tts, open-weights, voice-cloning, research, pricing-tbd

Deep dive: F5-TTS by SWivid (Shanghai Jiao Tong University et al.)

About SWivid (Shanghai Jiao Tong University et al.)
Founded 2024 · Shanghai, China

F5-TTS was released in October 2024 by SWivid, an open-source research group anchored at Shanghai Jiao Tong University (X-LANCE Lab), with co-authors from the University of Cambridge and the Geely Automobile Research Institute. The authors, Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, Jian Zhao, Kai Yu and Xie Chen, published the code on GitHub under the MIT licence, together with model weights (CC-BY-NC) and a Hugging Face demo. The work builds on Microsoft's earlier E2 TTS paper and quickly became one of the most popular open-weights TTS models of 2024-2025, often cited as a strong open alternative to ElevenLabs for English and Chinese voice cloning.

Architecture
Flow-matching non-autoregressive TTS with Diffusion Transformer (DiT)

F5-TTS (named after the paper title "A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching") is a non-autoregressive text-to-speech system. It drops the usual duration model, phoneme aligner and separate text encoder in favour of a single Diffusion Transformer (DiT) trained with conditional flow matching. Text is converted to character tokens, padded with filler tokens to the length of the target mel-spectrogram, and refined by ConvNeXt V2 blocks. The result is concatenated with the noisy mel-spectrogram and the masked reference mel that carries the target voice. The DiT backbone predicts the velocity field that transports Gaussian noise to a clean mel-spectrogram, and the Vocos vocoder converts that mel-spectrogram to a waveform. The training data is the roughly 100k-hour open Emilia dataset (multilingual long-form audio drawn from sources such as podcasts and audiobooks). Because there is no autoregressive decoder, F5-TTS reaches a real-time factor below 0.2 on a single GPU while keeping word error rates competitive with VALL-E and NaturalSpeech 3. The model is best known for high-quality zero-shot voice cloning from a single 5-15 second reference clip.
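
As a rough illustration of the two ideas in this paragraph (filler-token padding and sampling by integrating a velocity field), the sketch below pads a character sequence to a target frame count and runs fixed-step Euler integration from a starting point to a target. It is not the F5-TTS implementation: the `FILLER` symbol, the function names and the closed-form `toyVelocity` (which stands in for the trained DiT) are all assumptions made for illustration.

```typescript
// Hypothetical filler symbol; the real model uses its own filler token id.
const FILLER = "<F>";

// Pad a character sequence with filler tokens up to the number of mel frames.
function padText(text: string, melFrames: number): string[] {
  const chars = Array.from(text);
  if (chars.length > melFrames) {
    throw new Error("text is longer than the target mel length");
  }
  const padding = new Array<string>(melFrames - chars.length).fill(FILLER);
  return chars.concat(padding);
}

// A velocity field maps the current state x and time t to dx/dt.
type Velocity = (x: number[], t: number) => number[];

// Toy stand-in for the trained network: on the straight-line path
// x_t = (1 - t) * x0 + t * x1, the velocity that carries x_t to a known
// target x1 is (x1 - x_t) / (1 - t).
function toyVelocity(target: number[]): Velocity {
  return (x, t) => x.map((xi, i) => (target[i] - xi) / (1 - t));
}

// Fixed-step Euler integration of dx/dt = v(x, t) from t = 0 to t = 1.
function eulerSample(x0: number[], v: Velocity, steps: number): number[] {
  let x = x0.slice();
  const dt = 1 / steps;
  for (let k = 0; k < steps; k++) {
    const t = k * dt; // t never reaches 1, so (1 - t) stays positive
    const dx = v(x, t);
    x = x.map((xi, i) => xi + dt * dx[i]);
  }
  return x;
}
```

In a real system the starting point would be Gaussian noise shaped like the mel-spectrogram and the velocity would come from the network conditioned on the padded text and the reference audio; fewer Euler steps trade quality for speed.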

Parameters
~336M (Base)
Context
About 30 seconds of audio per generation (a duration limit, not a token count)
What it can do
  • Zero-shot voice cloning from ~10 s of reference audio with no fine-tuning
  • Non-autoregressive flow matching with real-time-factor < 0.2 on a single GPU
  • Open weights under CC-BY-NC (code under MIT licence), multiple checkpoints (Base, Small, Multilingual)
  • English and Chinese out of the box; community fine-tunes for German, French, Japanese, Spanish
  • Up to 30 s of generated audio per inference
  • Speed control by adjusting the target duration of the output
  • Best for: research, on-device TTS, voice cloning prototypes, non-commercial projects built on the open weights
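
The real-time factor and duration-based speed control mentioned in the list can be made concrete with a little arithmetic. The sketch below is an assumption-laden illustration, not the project's code: it assumes output length scales with the ratio of text lengths at the reference clip's speaking rate, and the function names are hypothetical.

```typescript
// Estimate how long the generated audio should be, in seconds.
// Assumption: the new text is spoken at the same characters-per-second rate
// as the reference clip, then scaled by the requested speed.
function estimateDurationSec(
  refAudioSec: number,
  refText: string,
  genText: string,
  speed: number = 1.0
): number {
  if (refText.length === 0 || refAudioSec <= 0 || speed <= 0) {
    throw new Error("reference text, reference duration and speed must be positive");
  }
  const secPerChar = refAudioSec / refText.length;
  return (secPerChar * genText.length) / speed;
}

// Real-time factor: wall-clock synthesis time divided by audio duration.
// Values below 1 mean synthesis runs faster than playback.
function realTimeFactor(synthesisSec: number, audioSec: number): number {
  if (audioSec <= 0) throw new Error("audio duration must be positive");
  return synthesisSec / audioSec;
}
```

Under these assumptions, a 10-second reference with a 100-character transcript implies about 0.1 s per character, so doubling `speed` halves the requested duration, and synthesising 10 s of audio in 1.5 s corresponds to a real-time factor of 0.15.
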
Training & License

Pretrained on the 100,000-hour open Emilia dataset of multilingual long-form audio with weakly supervised transcripts. Additional community fine-tunes use LibriSpeech, AISHELL-3 and bespoke audiobook collections.

License: Code under MIT licence; pretrained weights under CC-BY-NC, so the released weights are not licensed for commercial use.

Known limitations
  • No formal SSML / emotion tags
  • Quality degrades on noisy reference audio
  • Multilingual coverage outside English/Chinese depends on community checkpoints
  • 30-second hard cap per generation
  • Voice cloning quality slightly below ElevenLabs Multilingual V2 on emotional acting
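
One common way to work around a per-generation duration cap is to split long text into sentence-sized chunks and synthesise them one at a time. The sketch below shows that idea only; the function names, the characters-per-second estimate and the sentence-splitting rule are assumptions, not part of F5-TTS or any hosting API.

```typescript
// Rough character budget for one generation, assuming the new text is spoken
// at the same characters-per-second rate as the reference clip.
function maxCharsForCap(refAudioSec: number, refText: string, capSec: number = 30): number {
  if (refAudioSec <= 0 || refText.length === 0) {
    throw new Error("reference duration and text must be non-empty");
  }
  return Math.floor((capSec * refText.length) / refAudioSec);
}

// Greedily pack whole sentences into chunks of at most maxChars characters.
// A single sentence longer than maxChars is kept whole rather than cut mid-word.
function chunkText(text: string, maxChars: number): string[] {
  // Match runs of text followed by Latin or CJK sentence-ending punctuation.
  const sentences = text.match(/[^.!?。！？]+[.!?。！？]*/g) ?? [];
  const chunks: string[] = [];
  let current = "";
  for (const raw of sentences) {
    const sentence = raw.trim();
    if (sentence.length === 0) continue;
    if (current.length > 0 && current.length + 1 + sentence.length > maxChars) {
      chunks.push(current);
      current = sentence;
    } else {
      current = current.length > 0 ? current + " " + sentence : sentence;
    }
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

Each chunk would then be sent as its own generation request with the same reference clip, and the resulting audio segments concatenated in order.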
