Text-to-Speech

Convert text to natural-sounding speech

Text-to-speech models for voice apps, audiobooks, and IVR

Text-to-speech (TTS) turns written text into natural-sounding spoken audio. The category covers everything from plain IVR voice-overs and expressive audiobook narration to real-time voice agents that can hold a phone call. You reach for a TTS model when software needs a voice: for accessibility, for content production at scale, or for conversational AI.

27 models available

ElevenLabs Multilingual V2

TTS · ElevenLabs
Popular

ElevenLabs' most natural-sounding TTS model. Supports 29 languages with emotional range.

€1.00 · 3.0s
natural, multilingual, popular

AudioCraft

TTS · Replicate

Meta's AudioCraft framework wrapping MusicGen, AudioGen and EnCodec. Unified text-to-audio research toolkit for music and sound effects.

€0.01
meta, music-generation, sound-effects

AudioLDM 2

TTS · AudioLDM

Latent-diffusion model for general-purpose text-to-audio. Generates speech, music, and sound effects with a unified prior.

€0.01
audioldm, music-generation, diffusion

Cartesia Sonic

TTS · Custom

Cartesia's ultra-low-latency TTS (~90 ms TTFB). State-space model with voice cloning support.

Free
cartesia, tts, low-latency

Edge TTS

TTS · Custom

Microsoft Edge neural voices accessed via the open-source edge-tts wrapper. 400+ voices across 100+ locales, suitable for batch generation.

Free
microsoft, tts, multilingual

ElevenLabs v3 (alpha)

TTS · ElevenLabs

ElevenLabs' v3 alpha TTS. Most expressive voice model, with audio tags and laughter, at the cost of higher latency.

Free
elevenlabs, tts, expressive

F5-TTS

TTS · Replicate

Open-source flow-matching TTS with strong zero-shot voice cloning. Code MIT, weights CC-BY-NC.

Free
f5, tts, open-weights

Kokoro TTS 82M

TTS · Replicate

Open-weights 82M-parameter TTS. Punches above its size class on naturalness benchmarks at a fraction of the inference cost of larger models.

€0.002
kokoro, tts, open-weights

MAGNeT MusicGen

TTS · Replicate

Meta MAGNeT non-autoregressive music generator. Up to 7x faster than MusicGen with comparable quality via masked generative transformers.

€0.007
meta, music-generation, magnet

MusicGen Large

TTS · Meta

Meta's 3.3B-parameter MusicGen Large. Text-conditioned music generation with a single-stage autoregressive transformer; supports melody conditioning.

€0.02
meta, music-generation, open-weights

MusicGen Medium

TTS · Meta

Meta MusicGen Medium (1.5B params). Strong quality-to-speed tradeoff for text-to-music with optional melody guidance.

€0.01
meta, music-generation, open-weights

MusicGen Small

TTS · Meta

Meta MusicGen Small (300M params). Fast text-to-music generation suitable for prototyping and low-latency demos.

€0.006
meta, music-generation, open-weights

OpenAI TTS-1

TTS · OpenAI

OpenAI's text-to-speech model. Six built-in voices with natural intonation.

€0.60 · 2.0s
fast, affordable

OpenAI TTS-1 HD

TTS · OpenAI

OpenAI's high-definition TTS model. Better quality for production use cases.

€1.20 · 4.0s
high-quality

OpenVoice v1

TTS · Replicate

MyShell OpenVoice v1. Cross-lingual voice cloning with flexible style control: emotion, accent, rhythm, pauses, and intonation.

€0.004
myshell, tts, voice-cloning

OpenVoice v2

TTS · Replicate

MyShell OpenVoice v2. Multilingual zero-shot voice cloning with accurate tone-color reproduction and style/emotion control.

€0.004
myshell, tts, voice-cloning

Parler-TTS

TTS · Replicate

Hugging Face Parler-TTS Mini. Lightweight TTS conditioned on a natural-language style description for fine-grained control over voice characteristics.

€0.003
parler, tts, huggingface

Parler-TTS Large

TTS · Replicate

Parler-TTS Large v1. 2.2B parameters, natural-language style prompting and improved prosody over the Mini variant.

€0.005
parler, tts, huggingface

PlayHT 2.0

TTS · Custom

PlayHT's 2.0 generative voice model. Multilingual expressive speech synthesis with sub-second latency and high-fidelity voice cloning.

Free
playht, tts, voice-cloning

Riffusion

TTS · Riffusion

Stable-Diffusion-based real-time music generator. Operates on spectrogram images, then resynthesizes audio, which enables seamless transitions and looping.

€0.008
riffusion, music-generation, open-weights

RVC Voice Conversion

TTS · Community

Retrieval-based Voice Conversion. Converts a source recording into a target speaker's voice, preserving pitch, prosody and rhythm.

€0.006
rvc, voice-conversion, voice-cloning

Spark TTS

TTS · Replicate

Spark efficient TTS with disentangled control over speaker, content and style. Strong cross-lingual zero-shot performance.

€0.004
spark, tts, voice-cloning

Stable Audio 2

TTS · Udio

Stability AI's Stable Audio 2.0. Text-to-music that generates full-length, structured tracks of up to 3 minutes at 44.1 kHz.

Free
stability, music-generation, pricing-tbd

StyleTTS 2

TTS · Replicate

Style-based TTS using diffusion and adversarial training. Human-level naturalness in zero-shot voice synthesis from a 3-5s reference clip.

€0.004
styletts, tts, voice-cloning

Suno Bark

TTS · Suno

Suno's text-prompted generative audio model. Speech, music, ambient sound and effects with non-verbal cues like laughter or sighs.

€0.01
suno, bark, music-generation

Tortoise TTS

TTS · Community

Multi-voice expressive TTS. Slow but high-quality with strong prosody and natural intonation. Trained for long-form narration use cases.

€0.01
tortoise, tts, expressive

XTTS v2

TTS · Community

Coqui's XTTS v2 multilingual TTS with voice cloning from 6 seconds of reference audio. Supports 17 languages and emotion transfer.

€0.005
coqui, tts, voice-cloning

Top text-to-speech picks

Hand-picked across four common criteria and resolved against the live catalog, so the picks track price and performance changes.

Best overall
ElevenLabs Multilingual V2

ElevenLabs' most natural-sounding TTS model. Supports 29 languages with emotional range.

Cheapest
Cartesia Sonic

Cartesia's ultra-low-latency TTS (~90 ms TTFB). State-space model with voice cloning support.

Longest input
ElevenLabs Multilingual V2

ElevenLabs' most natural-sounding TTS model. Supports 29 languages with emotional range.

Fastest
OpenAI TTS-1

OpenAI's text-to-speech model. Six built-in voices with natural intonation.

Billing is almost always per character or per thousand characters. Flagship voices (ElevenLabs V3, Cartesia Sonic, OpenAI TTS-HD) cost around €0.15 to €0.30 per thousand characters; budget tiers come in under €0.02 per thousand. A typical short audiobook chapter (3,000 words, about 18,000 characters) costs roughly €0.30 to €5.00 depending on the tier. Some providers also charge for voice cloning: a one-time fee to set up a custom voice, plus the usual per-character rate for synthesis.
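The per-character arithmetic behind the chapter estimate can be sketched in a few lines of Python. This is a minimal illustration, not any provider's billing logic: the function name `synthesis_cost` and the tier labels are made up for the example, and the rates are simply the boundary values of the ranges quoted in the text (€0.02, €0.15 and €0.30 per thousand characters).

```python
from decimal import Decimal, ROUND_HALF_UP


def synthesis_cost(num_chars: int, rate_per_1k: str) -> Decimal:
    """Euro cost of synthesizing num_chars characters at a per-1,000-character rate."""
    cost = Decimal(num_chars) / Decimal(1000) * Decimal(rate_per_1k)
    # Round to whole cents; real invoices may round differently (assumption).
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# The short audiobook chapter from the text: 3,000 words, about 18,000 characters.
CHAPTER_CHARS = 18_000

# Illustrative rates taken from the quoted ranges, not from a price list.
TIER_RATES = {
    "budget (upper bound)": "0.02",
    "flagship (low end)": "0.15",
    "flagship (high end)": "0.30",
}

for tier, rate in TIER_RATES.items():
    print(f"{tier}: EUR {synthesis_cost(CHAPTER_CHARS, rate)}")
```

At these rates the chapter lands between about €0.36 and €5.40, in line with the rough range given above. A one-time voice-cloning fee, where a provider charges one, would be added on top of the per-character total.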

The trade-off triangle is naturalness, latency, and cost. Flagship voices are hard to tell apart from human narration but typically have a first-byte latency of 200 to 600 ms, which is fine for batch synthesis but sluggish in real-time chat. Streaming TTS (Cartesia, OpenAI Realtime, ElevenLabs Turbo) keeps first-byte latency under 100 ms by emitting audio as soon as the first phoneme is decoded. Budget tiers run fast but produce audible robotic artifacts on long sentences.
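First-byte latency is easy to measure yourself before committing to a model. The sketch below times any iterable of audio chunks and makes no assumptions about a specific SDK. `measure_stream` and `fake_tts_stream` are hypothetical names, and the fake stream only simulates a provider with `time.sleep`. With a real client you would pass in whatever chunk iterator its streaming call returns, and make sure the request itself is issued inside the timed region.

```python
import time
from typing import Iterable, Iterator, NamedTuple


class StreamTiming(NamedTuple):
    first_byte_s: float  # time until the first non-empty chunk arrived
    total_s: float       # time until the stream was exhausted
    total_bytes: int


def measure_stream(chunks: Iterable[bytes]) -> StreamTiming:
    """Consume a stream of audio chunks and record first-byte and total latency."""
    start = time.perf_counter()
    first_byte_s = None
    total_bytes = 0
    for chunk in chunks:
        if first_byte_s is None and chunk:
            first_byte_s = time.perf_counter() - start
        total_bytes += len(chunk)
    total_s = time.perf_counter() - start
    if first_byte_s is None:
        raise ValueError("stream produced no audio")
    return StreamTiming(first_byte_s, total_s, total_bytes)


def fake_tts_stream(first_delay_s: float, chunk_delay_s: float,
                    n_chunks: int, chunk_size: int = 320) -> Iterator[bytes]:
    """Stand-in for a provider stream: waits, then yields silent chunks (simulation only)."""
    time.sleep(first_delay_s)
    for i in range(n_chunks):
        if i:
            time.sleep(chunk_delay_s)
        yield b"\x00" * chunk_size


# Generators are lazy, so the simulated delay happens inside measure_stream.
timing = measure_stream(fake_tts_stream(0.05, 0.01, 5))
print(f"first byte after {timing.first_byte_s * 1000:.0f} ms, "
      f"{timing.total_bytes} bytes in {timing.total_s * 1000:.0f} ms")
```

For a conversational agent, the first-byte figure is the one that matters; for batch narration, total time and cost usually dominate.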

Pay attention to prosody control: even the best models occasionally stress a proper noun wrongly, mispronounce an acronym, or lose the emotional intent over a long sentence. Use SSML tags (where supported), or break long passages into shorter pieces with explicit phrase boundaries. For multilingual content, check the pronunciation for each language pair before you ship; some voices speak flawless English and heavily accented German.
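Breaking a long passage into shorter pieces can be done with a simple sentence-boundary splitter. The following is a minimal sketch under stated assumptions: `split_into_chunks` is a hypothetical helper, the 300-character default is arbitrary, and the regular expression treats every `.`, `!` or `?` followed by whitespace as a sentence end, so abbreviations such as "Dr." will be split incorrectly. A production pipeline would likely use a proper sentence tokenizer.

```python
import re
from typing import List


def split_into_chunks(text: str, max_chars: int = 300) -> List[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept intact rather than cut mid-phrase.
    """
    # Naive boundary rule: split after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


passage = "The line is busy. Please hold! Would you like a callback? Press one."
for chunk in split_into_chunks(passage, max_chars=40):
    print(repr(chunk))
```

Each chunk can then be sent as its own synthesis request, which also keeps any single mispronunciation cheap to regenerate.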

The top picks above cover the most natural voice, the cheapest workhorse, the model with the longest supported input, and the fastest streaming option.
