How much does F5-TTS cost via Railwail?

F5-TTS is listed as free per generation on Railwail. There is no monthly minimum and no subscription, and you can start with €5 in free credits.

What is the context window of F5-TTS?

Railwail does not list a context window for F5-TTS. As a text-to-speech model it is bounded by output length instead: up to 30 s of generated audio per inference.

How fast is F5-TTS?

Latency depends on prompt length and load, typically 200 ms to 2 s for short prompts. We measure p50/p95 in real time on /rankings.
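As a rough illustration of what p50 and p95 mean, here is a minimal sketch that computes nearest-rank percentiles over a list of latency samples. The `percentile` helper and the sample values are made up for this example; how /rankings actually computes its figures is not documented here.

```typescript
// Nearest-rank percentile over latency samples in milliseconds.
// Hypothetical helper for illustration only.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  // Smallest value with at least p% of the samples at or below it.
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[rank - 1];
}

// Made-up measurements, not real F5-TTS numbers.
const latenciesMs = [210, 250, 300, 340, 420, 480, 650, 900, 1400, 1900];
const p50 = percentile(latenciesMs, 50); // typical request
const p95 = percentile(latenciesMs, 95); // tail latency
```

p50 describes the typical request, while p95 shows how slow the worst 5% of requests are, which usually matters more for interactive voice applications.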

Is F5-TTS better than ElevenLabs Multilingual V2?

It depends on your use case. F5-TTS (Replicate) and ElevenLabs Multilingual V2 (ElevenLabs) are both strong choices in text-to-speech. Compare them side-by-side at /compare/f5-tts-vs-elevenlabs-multilingual-v2.

Does F5-TTS support audio input?

Yes. F5-TTS accepts audio input as the reference clip for voice cloning, alongside the text to speak. Supported input formats: text, audio. Use the standard Railwail API endpoint with audio content blocks.

F5-TTS


Replicate

Text-to-Speech

Open-source flow-matching TTS with strong zero-shot voice cloning. Code MIT, weights CC-BY-NC.


TL;DR · Last updated June 24, 2026

F5-TTS is a text-to-speech AI model served via Replicate, listed at €0.000 per 1M input tokens (free per generation), with no context window listed.


Pricing

Price per generation: Free

API Integration

Use our OpenAI-compatible API to integrate F5-TTS into your application.

Install

npm install railwail

JavaScript / TypeScript

import railwail from "railwail";

const rw = railwail("YOUR_API_KEY");

// Simple — just pass a string
const reply = await rw.run("f5-tts", "Hello! What can you do?");
console.log(reply);

// With message history
const reply2 = await rw.run("f5-tts", [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Explain quantum computing simply." },
]);
console.log(reply2);

// Full response with usage info
const res = await rw.chat("f5-tts", [
  { role: "user", content: "Hello!" },
], { temperature: 0.7, max_tokens: 500 });
console.log(res.choices[0].message.content);
console.log(res.usage);

Specifications

Developer: SWivid (served via Replicate)

Deep dive — SWivid (Shanghai Jiao Tong University et al.)'s F5-TTS

About SWivid (Shanghai Jiao Tong University et al.)

Founded 2024 · Shanghai, China

F5-TTS was released in October 2024 by SWivid, an open-source group anchored at Shanghai Jiao Tong University's X-LANCE Lab, with contributors from the University of Cambridge and Geely Automobile Research Institute. Authors Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, Jian Zhao, Kai Yu and Xie Chen released the F5-TTS code under MIT licence on GitHub, together with model weights and a Hugging Face demo. The work followed the earlier E2 TTS paper from Microsoft and quickly became one of the most popular open-weights TTS models of 2024-2025, frequently cited as a strong open-weights alternative to ElevenLabs for English and Chinese voice cloning.


Architecture

Flow-matching non-autoregressive TTS with Diffusion Transformer (DiT)

F5-TTS ("A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching") is a non-autoregressive text-to-speech system that replaces the usual pipeline of text encoder, duration model and phoneme alignment with a single Diffusion Transformer (DiT) trained with conditional flow matching. Text is converted to character tokens and padded with filler tokens to the length of the target mel-spectrogram, refined by ConvNeXt V2 blocks, and then combined with the noisy mel input and the reference audio for the target voice. The DiT backbone predicts the velocity field that maps Gaussian noise to a clean mel-spectrogram, which a Vocos vocoder then converts to a waveform. Training data is the 100k-hour open Emilia dataset (multilingual long-form audio scraped from podcasts and audiobooks). Because there is no autoregressive decoder, F5-TTS reaches a real-time factor below 0.2 on a single A10 GPU while keeping WER competitive with VALL-E and NaturalSpeech 3. The model is best known for high-quality zero-shot voice cloning from a single 5-15 second reference clip.
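The sampling idea described above can be sketched in a few lines. This is a toy illustration, not the F5-TTS implementation: `padToLength`, `sampleEuler` and `toyVelocity` are hypothetical names, the `"<F>"` filler symbol is made up, a plain Euler solver is assumed, and the closed-form `toyVelocity` stands in for the trained DiT (which would also be conditioned on the padded text and the reference audio).

```typescript
// Pad character tokens with a filler token up to the number of mel frames.
function padToLength(chars: string[], frames: number, filler = "<F>"): string[] {
  if (chars.length > frames) throw new Error("text longer than target length");
  return [...chars, ...Array<string>(frames - chars.length).fill(filler)];
}

// A velocity field maps the current sample x and time t in [0, 1) to dx/dt.
type Velocity = (x: number[], t: number) => number[];

// Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 (data) with Euler steps.
function sampleEuler(velocity: Velocity, noise: number[], steps: number): number[] {
  let x = [...noise];
  const dt = 1 / steps;
  for (let k = 0; k < steps; k++) {
    const t = k * dt;
    const v = velocity(x, t);
    x = x.map((xi, i) => xi + dt * v[i]);
  }
  return x;
}

// Toy stand-in for the trained network: the ideal velocity that moves x
// along a straight line towards a known target "mel" vector.
const targetMel = [0.2, -1.0, 0.7, 0.05];
const toyVelocity: Velocity = (x, t) =>
  x.map((xi, i) => (targetMel[i] - xi) / (1 - t));

const padded = padToLength(["h", "i", "!"], 6); // 3 characters + 3 fillers
const noise = [1.3, -0.4, 0.9, -2.1]; // fixed values standing in for Gaussian noise
const mel = sampleEuler(toyVelocity, noise, 32); // ends at targetMel
```

Because every frame is produced in the same handful of solver steps rather than one token at a time, the cost of synthesis depends on the number of steps, not on a long autoregressive loop.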

Parameters: ~330M (Base) and ~1.4B (Large)
Context: up to 30 s of generated audio per inference (no token context window is listed)

What it can do

Zero-shot voice cloning from ~10 s of reference audio with no fine-tuning
Non-autoregressive flow matching with real-time-factor < 0.2 on a single GPU
Open weights (code under MIT licence, weights under CC-BY-NC), multiple checkpoints (Base, Small, Multilingual)
English and Chinese out of the box; community fine-tunes for German, French, Japanese, Spanish
Up to 30 s of generated audio per inference
Speed control by adjusting the duration prompt
Best for: research, on-device TTS, voice cloning prototypes, and products whose licensing is compatible with the CC-BY-NC weights
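Two items in the list above, speed control via the duration prompt and the real-time factor, come down to simple arithmetic. The sketch below assumes a character-rate heuristic (output length scales with the reference clip's seconds per character, divided by a speed factor); the function names and the heuristic itself are assumptions for illustration, not a documented Railwail or F5-TTS API.

```typescript
// Estimate how many seconds of audio to request for the new text.
// Assumed heuristic: the reference clip's seconds-per-character rate
// is applied to the new text, then divided by the speed factor.
function estimateGenSeconds(
  refSeconds: number,
  refTextLen: number,
  genTextLen: number,
  speed = 1.0,
): number {
  if (refTextLen <= 0) throw new Error("reference text must not be empty");
  if (speed <= 0) throw new Error("speed must be positive");
  return ((refSeconds / refTextLen) * genTextLen) / speed;
}

// Real-time factor: compute time divided by the duration of the audio produced.
// Values below 1 mean synthesis runs faster than playback.
function realTimeFactor(synthesisSeconds: number, audioSeconds: number): number {
  if (audioSeconds <= 0) throw new Error("audio duration must be positive");
  return synthesisSeconds / audioSeconds;
}

// A 10 s reference with 100 characters, and 200 characters of new text.
const normal = estimateGenSeconds(10, 100, 200);    // about 20 s
const faster = estimateGenSeconds(10, 100, 200, 2); // about 10 s
const rtf = realTimeFactor(3, 20);                  // 0.15, under the 0.2 mark
```

Requesting a shorter duration for the same text makes the speech faster, and a longer one slows it down, which is why a single speed factor is enough as a control.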

Training & License

Pretrained on the 100,000-hour open Emilia dataset of multilingual long-form audio with weakly supervised transcripts. Additional community fine-tunes use LibriSpeech, AISHELL-3 and bespoke audiobook collections.

License: Code under MIT licence; pretrained weights under CC-BY-NC, so commercial use of the released checkpoints is restricted.

Known limitations

No formal SSML / emotion tags
Quality degrades on noisy reference audio
Multilingual coverage outside English/Chinese depends on community checkpoints
30-second hard cap per generation
Voice cloning quality slightly below ElevenLabs Multilingual V2 on emotional acting
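One common way to work around a per-generation cap like the 30-second limit above is to split long text at sentence boundaries and synthesize each piece separately. The sketch below is a minimal version of that idea; `splitForCap` is a hypothetical helper, and the default rate of 15 characters per second is an assumed speaking rate that should be tuned per voice and language.

```typescript
// Greedily pack sentences into chunks whose estimated duration stays under the cap.
// Note: a single sentence longer than the cap is kept whole, not split further.
function splitForCap(text: string, maxSeconds = 30, charsPerSecond = 15): string[] {
  const maxChars = Math.floor(maxSeconds * charsPerSecond);
  // Sentence = run of non-terminators, optional terminators, trailing whitespace.
  const sentences = text.match(/[^.!?]+[.!?]*\s*/g) ?? [];
  const chunks: string[] = [];
  let current = "";
  for (const raw of sentences) {
    const s = raw.trim();
    if (s.length === 0) continue;
    if (current.length > 0 && current.length + 1 + s.length > maxChars) {
      chunks.push(current); // adding s would exceed the cap: start a new chunk
      current = s;
    } else {
      current = current.length > 0 ? current + " " + s : s;
    }
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}

// With a tiny cap (1 s at 10 chars/s) every sentence lands in its own chunk.
const pieces = splitForCap("One two. Three four. Five six.", 1, 10);
```

Each chunk can then be sent as its own generation and the resulting audio concatenated; reusing the same reference clip for every chunk helps keep the voice consistent.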


Related Models


ElevenLabs Multilingual V2

ElevenLabs

ElevenLabs' most natural-sounding TTS model. Supports 29 languages with emotional range.

€1.00

AudioLDM 2

AudioLDM

Latent-diffusion model for general-purpose text-to-audio. Generates speech, music, and sound effects with a unified prior.

€0.01

Cartesia Sonic

Custom

Cartesia's ultra-low-latency TTS (~90ms TTFB). State-space model with voice cloning support.

Free

Chatterbox

Replicate

Resemble AI's open Chatterbox TTS. Zero-shot voice cloning from a short audio prompt with an exaggeration control for emotion intensity, plus CFG weight to balance pacing and fidelity.

€2.00

Start using F5-TTS today

Get started with free credits. No credit card required. Access F5-TTS and 100+ other models through a single API.

