Text-to-Speech

Convert text to natural-sounding speech

Text-to-speech models for voice apps, audiobooks, and IVR

Text-to-speech (TTS) models convert written text into natural-sounding spoken audio. The category ranges from flat IVR prompts to expressive audiobook narration and real-time conversational agents that can hold a phone call. You reach for a TTS model when you want to give software a voice: for accessibility, content production at scale, or conversational AI.

27 models available

ElevenLabs Multilingual V2

TTS · ElevenLabs
Popular

ElevenLabs' most natural-sounding TTS model. Supports 29 languages with emotional range.

€1.00 · 3.0s
natural, multilingual, popular

AudioCraft

TTS · Replicate

Meta's AudioCraft framework wrapping MusicGen, AudioGen and EnCodec. Unified text-to-audio research toolkit for music and sound effects.

€0.01
meta, music-generation, sound-effects

AudioLDM 2

TTS · AudioLDM

Latent-diffusion model for general-purpose text-to-audio. Generates speech, music, and sound effects with a unified prior.

€0.01
audioldm, music-generation, diffusion

Cartesia Sonic

TTS · Custom

Cartesia's ultra-low-latency TTS (~90ms TTFB). State-space model with voice cloning support.

Free
cartesia, tts, low-latency

Edge TTS

TTS · Custom

Microsoft Edge neural voices accessed via the open-source edge-tts wrapper. 400+ voices across 100+ locales, suitable for batch generation.

Free
microsoft, tts, multilingual

ElevenLabs v3 (alpha)

TTS · ElevenLabs

ElevenLabs' v3 alpha TTS. Most expressive voice model with audio tags and laughter, higher latency.

Free
elevenlabs, tts, expressive

F5-TTS

TTS · Replicate

Open-source flow-matching TTS with strong zero-shot voice cloning. Code MIT, weights CC-BY-NC.

Free
f5, tts, open-weights

Kokoro TTS 82M

TTS · Replicate

Open-weights 82M-parameter TTS. Punches above its size class on naturalness benchmarks at a fraction of the inference cost of larger models.

€0.002
kokoro, tts, open-weights

MAGNeT MusicGen

TTS · Replicate

Meta MAGNeT non-autoregressive music generator. Up to 7x faster than MusicGen with comparable quality via masked generative transformers.

€0.007
meta, music-generation, magnet

MusicGen Large

TTS · Meta

Meta's 3.3B-parameter MusicGen Large. Text-conditioned music generation with single-stage autoregressive transformer, supports melody conditioning.

€0.02
meta, music-generation, open-weights

MusicGen Medium

TTS · Meta

Meta MusicGen Medium (1.5B params). Strong quality-to-speed tradeoff for text-to-music with optional melody guidance.

€0.01
meta, music-generation, open-weights

MusicGen Small

TTS · Meta

Meta MusicGen Small (300M params). Fast text-to-music generation suitable for prototyping and low-latency demos.

€0.006
meta, music-generation, open-weights

OpenAI TTS-1

TTS · OpenAI

OpenAI's text-to-speech model. Six built-in voices with natural intonation.

€0.60 · 2.0s
fast, affordable

OpenAI TTS-1 HD

TTS · OpenAI

OpenAI's high-definition TTS model. Better quality for production use cases.

€1.20 · 4.0s
high-quality

OpenVoice v1

TTS · Replicate

MyShell OpenVoice v1. Cross-lingual voice cloning with flexible style control: emotion, accent, rhythm, pauses, and intonation.

€0.004
myshell, tts, voice-cloning

OpenVoice v2

TTS · Replicate

MyShell OpenVoice v2. Multilingual zero-shot voice cloning with accurate tone-color reproduction and style/emotion control.

€0.004
myshell, tts, voice-cloning

Parler-TTS

TTS · Replicate

Hugging Face Parler-TTS Mini. Lightweight TTS conditioned on a natural-language style description for fine-grained control over voice characteristics.

€0.003
parler, tts, huggingface

Parler-TTS Large

TTS · Replicate

Parler-TTS Large v1. 2.2B parameters, natural-language style prompting and improved prosody over the Mini variant.

€0.005
parler, tts, huggingface

PlayHT 2.0

TTS · Custom

PlayHT's 2.0 generative voice model. Multi-lingual expressive speech synthesis with sub-second latency and high-fidelity voice cloning.

Free
playht, tts, voice-cloning

Riffusion

TTS · Riffusion

Stable-Diffusion-based real-time music generator. Operates on spectrogram images then resynthesizes audio, enables seamless transitions and looping.

€0.008
riffusion, music-generation, open-weights

RVC Voice Conversion

TTS · Community

Retrieval-based Voice Conversion. Converts a source recording into a target speaker's voice, preserving pitch, prosody and rhythm.

€0.006
rvc, voice-conversion, voice-cloning

Spark TTS

TTS · Replicate

Spark efficient TTS with disentangled control over speaker, content and style. Strong cross-lingual zero-shot performance.

€0.004
spark, tts, voice-cloning

Stable Audio 2

TTS · Udio

Stability AI's Stable Audio 2.0. Text-to-music generation of full-length, structured tracks up to 3 minutes long at 44.1 kHz.

Free
stability, music-generation, pricing-tbd

StyleTTS 2

TTS · Replicate

Style-based TTS using diffusion and adversarial training. Human-level naturalness in zero-shot voice synthesis from a 3-5s reference clip.

€0.004
styletts, tts, voice-cloning

Suno Bark

TTS · Suno

Suno's text-prompted generative audio model. Speech, music, ambient sound and effects with non-verbal cues like laughter or sighs.

€0.01
suno, bark, music-generation

Tortoise TTS

TTS · Community

Multi-voice expressive TTS. Slow but high-quality with strong prosody and natural intonation. Trained for long-form narration use cases.

€0.01
tortoise, tts, expressive

XTTS v2

TTS · Community

Coqui's XTTS v2 multilingual TTS with voice cloning from 6 seconds of reference audio. Supports 17 languages and emotion transfer.

€0.005
coqui, tts, voice-cloning

Top text-to-speech picks

Hand-picked across four common criteria and resolved against the live catalog, so the picks track price and performance changes.

Best overall
ElevenLabs Multilingual V2

ElevenLabs' most natural-sounding TTS model. Supports 29 languages with emotional range.

Cheapest
Cartesia Sonic

Cartesia's ultra-low-latency TTS (~90ms TTFB). State-space model with voice cloning support.

Longest input
ElevenLabs Multilingual V2

ElevenLabs' most natural-sounding TTS model. Supports 29 languages with emotional range.

Fastest
OpenAI TTS-1

OpenAI's text-to-speech model. Six built-in voices with natural intonation.


Pricing is almost always per character or per thousand characters. Frontier neural voices (ElevenLabs V3, Cartesia Sonic, OpenAI TTS-HD) cost roughly €0.15-0.30 per thousand characters; budget tiers come in under €0.02 per thousand. A typical short audiobook chapter (3,000 words, about 18,000 characters) costs €0.30 to €5.00 depending on the tier you choose. Some providers also charge for voice cloning: a one-time fee to set up a custom voice, plus the standard per-character rate at synthesis time.
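The per-thousand-character arithmetic can be sketched as follows. This is a minimal illustration, not any provider's billing logic: the function name `estimate_cost_eur` and the optional `one_time_fee` parameter are hypothetical, and the rates are the approximate figures quoted in the text.

```python
def estimate_cost_eur(char_count: int, price_per_1k_chars: float,
                      one_time_fee: float = 0.0) -> float:
    """Estimate synthesis cost: characters billed per thousand, plus an
    optional one-time fee (e.g. a hypothetical voice-cloning setup charge)."""
    return char_count / 1000 * price_per_1k_chars + one_time_fee


# Assumed input: a short chapter of about 18,000 characters (3,000 words).
chapter_chars = 18_000

# Illustrative rates taken from the ranges quoted in the text.
for label, rate in [("budget", 0.02), ("frontier, low end", 0.15),
                    ("frontier, high end", 0.30)]:
    print(f"{label}: EUR {estimate_cost_eur(chapter_chars, rate):.2f}")
```

At these rates the chapter works out to about EUR 0.36, EUR 2.70, and EUR 5.40, which is in line with the range quoted for a short chapter.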

The tradeoff triangle is naturalness, latency, and cost. Frontier voices are nearly indistinguishable from human narration but typically have a time to first byte of 200-600 ms, which is fine for batch synthesis but feels slow in real-time chat. Streaming TTS (Cartesia, OpenAI Realtime, ElevenLabs Turbo) keeps time to first byte under 100 ms by emitting audio as soon as the first phoneme is decoded. Budget tiers run at frontier speed but with audible robotic artifacts on long sentences.
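One way to check time to first byte for yourself is to time the arrival of the first audio chunk from a streaming response. The sketch below assumes the provider's client exposes audio as an iterable of byte chunks; `fake_stream` is a hypothetical stand-in for that response, not a real API.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_first_byte(chunks: Iterable[bytes]) -> Tuple[float, float, int]:
    """Return (seconds until the first non-empty chunk, total seconds, total bytes)."""
    start = time.perf_counter()
    first = None
    total_bytes = 0
    for chunk in chunks:
        if chunk and first is None:
            first = time.perf_counter() - start
        total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    # If no audio ever arrived, report the full elapsed time as the first-byte time.
    return (first if first is not None else elapsed, elapsed, total_bytes)


def fake_stream(first_delay_s: float, n_chunks: int, gap_s: float) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS response."""
    time.sleep(first_delay_s)      # simulated model warm-up before the first audio
    for _ in range(n_chunks):
        yield b"\x00" * 1024       # 1 KiB of placeholder audio
        time.sleep(gap_s)          # simulated decode time between chunks


ttfb, total, size = measure_first_byte(fake_stream(0.05, 5, 0.01))
print(f"first byte after {ttfb * 1000:.0f} ms; {size} bytes in {total * 1000:.0f} ms")
```

Swapping `fake_stream` for a real chunk iterator gives a number you can compare against the 100 ms and 200-600 ms figures above, bearing in mind that network round-trip time is included in the measurement.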

Watch out for prosody control: even the best models occasionally stress a proper noun incorrectly, mispronounce an acronym, or lose the emotional intent in long sentences. Use SSML tags (where supported) or split long passages into shorter chunks with explicit sentence boundaries. For multilingual content, verify pronunciation in every language pair before launch; some voices speak flawless English and German with a marked accent.
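Splitting on sentence boundaries can be sketched as below. This is a simple illustration, not a production splitter: `chunk_by_sentence` and its `max_chars` default are hypothetical, and the regex treats every `.`, `!`, or `?` followed by whitespace as a sentence end, so abbreviations such as "Dr. Smith" will be split incorrectly.

```python
import re
from typing import List


def chunk_by_sentence(text: str, max_chars: int = 300) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept whole rather than
    being cut mid-sentence.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = [s.strip() for s in parts if s.strip()]

    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # close the chunk at a sentence boundary
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


for chunk in chunk_by_sentence("One. Two! Three? Four.", max_chars=10):
    print(repr(chunk))
```

Each chunk can then be sent as its own synthesis request, so every request starts and ends on a sentence boundary.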

The top picks above cover the most natural voice, the cheapest workhorse, the longest-input model, and the fastest streaming option.
