Cartesia Sonic

Custom
Text-to-Speech

Cartesia's ultra-low-latency TTS (~90ms TTFB). State-space model with voice cloning support.

TL;DR·Last updated May 16, 2026

Cartesia Sonic is a text-to-speech AI model from Custom, priced at €0.030 per 1M input tokens, with an unknown context window.

Direct API access coming soon

Pricing

Price per Generation
Per generation: Free

API Integration

Use our OpenAI-compatible API to integrate Cartesia Sonic into your application.

Install
npm install railwail
JavaScript / TypeScript
import railwail from "railwail";

const rw = railwail("YOUR_API_KEY");

// Simple — just pass a string
const reply = await rw.run("cartesia-sonic", "Hello! What can you do?");
console.log(reply);

// With message history
const reply2 = await rw.run("cartesia-sonic", [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Explain quantum computing simply." },
]);
console.log(reply2);

// Full response with usage info
const res = await rw.chat("cartesia-sonic", [
  { role: "user", content: "Hello!" },
], { temperature: 0.7, max_tokens: 500 });
console.log(res.choices[0].message.content);
console.log(res.usage);
Specifications
Developer: Custom
Category: Text-to-Speech
Supported Formats: text
Tags: cartesia, tts, low-latency, voice-cloning, realtime

Deep dive — Cartesia AI's Cartesia Sonic

About Cartesia AI
Founded 2023 · San Francisco, California, USA

Cartesia AI was founded in 2023 by Karan Goel and Albert Gu, members of the academic team behind the influential state-space model line of research at Stanford and CMU (S4, H3, Mamba, Mamba-2). Co-founders also include Arjun Desai, Brandon Yang and Chris Ré (Stanford advisor). The company set out to build voice and multimodal foundation models on a state-space backbone rather than Transformers, aiming for sub-100 ms latency and stream-friendly inference. Cartesia raised a $27M seed in March 2024 led by Index Ventures with participation from Conviction, A* and Lightspeed, followed by a $64M Series A in October 2024 led by Kleiner Perkins at a reported $325M valuation. Sonic, the company's first product, launched in May 2024 and quickly became one of the lowest-latency commercial TTS systems on the market.

Architecture
State-space model (Mamba family) text-to-speech with neural codec output

Cartesia Sonic is a streaming text-to-speech model built on the structured state-space model (SSM) architecture pioneered by Cartesia's founders (S4, H3, Mamba, Mamba-2). Unlike Transformer-based TTS systems, whose attention cost grows quadratically with sequence length, Sonic uses linear-time SSMs with selective scan. This lets the model keep a small, fixed-size recurrent state and generate audio chunks as text streams in. Cartesia reports a model first-byte latency of around 75-90 ms on its hosted API, which is faster than ElevenLabs Turbo and OpenAI TTS-1. Sonic outputs 24 kHz PCM via a neural codec decoder, supports streaming text input (so it can start speaking before an upstream LLM finishes its sentence) and offers voice cloning from short reference samples (3-30 seconds). Sonic 2 added improved prosody, multilingual coverage (15+ languages) and a reduced word error rate (WER). The system is offered exclusively as a hosted API; weights are not released.
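To make the "linear time, constant-size state" idea concrete, here is a toy sketch of a diagonal linear state-space recurrence. It is not Cartesia's implementation: the names DiagonalSsm and ssmScan are hypothetical, the parameters are fixed scalars per state dimension, and real selective-scan models such as Mamba make those parameters depend on the input. The point is only that each step touches a fixed-size state, so cost grows linearly with the number of steps and an output is available as soon as each input arrives.

```typescript
// Toy diagonal linear state-space recurrence (illustrative only).
//   h_t[i] = a[i] * h_{t-1}[i] + b[i] * x_t
//   y_t    = sum over i of c[i] * h_t[i]
// DiagonalSsm and ssmScan are hypothetical names, not part of any Cartesia API.
interface DiagonalSsm {
  a: number[]; // per-dimension state decay
  b: number[]; // per-dimension input weight
  c: number[]; // per-dimension output weight
}

function ssmScan(params: DiagonalSsm, inputs: number[]): number[] {
  const { a, b, c } = params;
  const n = a.length;
  if (b.length !== n || c.length !== n) {
    throw new Error("a, b and c must have the same length");
  }
  // The recurrent state has a fixed size, independent of the input length.
  const state = new Array<number>(n).fill(0);
  const outputs: number[] = [];
  for (const x of inputs) {
    let y = 0;
    for (let i = 0; i < n; i++) {
      state[i] = a[i] * state[i] + b[i] * x; // update the state in place
      y += c[i] * state[i];
    }
    outputs.push(y); // one output per input step, usable immediately
  }
  return outputs;
}
```

Because the loop never looks back at earlier inputs, memory use stays flat however long the sequence gets, which is the property that makes this family of models a natural fit for streaming.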

Parameters: Undisclosed
Context: 24K tokens
What it can do
  • Sub-100 ms first-byte latency suitable for real-time voice agents
  • State-space (Mamba-family) backbone with linear-time inference
  • Streaming text input and streaming audio output via WebSocket or gRPC
  • Instant voice cloning from short reference audio
  • Multilingual: English, Spanish, French, German, Portuguese, Mandarin, Japanese and more
  • Emotion and pace controls via inline tags
  • 24 kHz PCM, MP3 and Opus output formats
  • Best for: real-time voice agents, IVR systems, conversational AI, low-latency phone bots
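When handling raw PCM output, it helps to convert between byte counts and playback time. The sketch below shows the arithmetic for 24 kHz audio. The 16-bit mono sample layout is an assumption (the page states only the sample rate), and the function names are hypothetical helpers, not part of any SDK.

```typescript
// PCM size arithmetic for streamed audio.
// Assumption: 16-bit (2 bytes per sample) mono, which this page does not confirm.
// Only the 24 kHz sample rate comes from the text above.
function pcmDurationSeconds(
  byteLength: number,
  sampleRate = 24000,
  bytesPerSample = 2,
  channels = 1,
): number {
  return byteLength / (sampleRate * bytesPerSample * channels);
}

// Number of bytes needed to hold `ms` milliseconds of audio,
// rounded to a whole number of sample frames.
function pcmBytesForMs(
  ms: number,
  sampleRate = 24000,
  bytesPerSample = 2,
  channels = 1,
): number {
  const frames = Math.round((sampleRate * ms) / 1000);
  return frames * bytesPerSample * channels;
}
```

Under those assumptions, one second of audio is 48,000 bytes and a 20 ms frame is 960 bytes, which is a useful size to know when buffering audio for telephony or WebRTC pipelines.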
Training & License

Cartesia has not disclosed the training corpus. Public statements describe a 'diverse multilingual speech dataset' with permissioned voice talent for the stock voice library.

License: Proprietary commercial API. Voice clones produced from customer audio remain customer property under the Terms of Service; commercial use is permitted on paid tiers.

Known limitations
  • Closed weights, hosted-only
  • Voice clone quality below ElevenLabs Multilingual V2 for nuanced emotional acting
  • Limited SSML / fine-grained prosody control
  • Hard cap of around 24,000 input characters per request
  • Mandarin and Japanese still less polished than English/Spanish
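Given the per-request input cap listed above, long texts need to be split before synthesis. The following is a minimal sketch of one way to do that, preferring sentence boundaries. The function name splitForTts is hypothetical, the 24,000 default is taken from the approximate figure above rather than from an official limit, and JavaScript's string length counts UTF-16 code units, which can differ from how a provider counts characters.

```typescript
// Split text into chunks no longer than maxChars, preferring sentence ends,
// then word boundaries, then a hard cut as a last resort.
// Assumption: the default of 24000 mirrors the approximate cap quoted above.
function splitForTts(text: string, maxChars = 24000): string[] {
  if (maxChars <= 0) {
    throw new Error("maxChars must be positive");
  }
  const chunks: string[] = [];
  let rest = text.trim();
  while (rest.length > maxChars) {
    const head = rest.slice(0, maxChars);
    // Last sentence-ending punctuation followed by a space inside the window.
    const sentenceEnd = Math.max(
      head.lastIndexOf(". "),
      head.lastIndexOf("! "),
      head.lastIndexOf("? "),
    );
    let cut: number;
    if (sentenceEnd >= 0) {
      cut = sentenceEnd + 1; // keep the punctuation with the chunk
    } else {
      const space = head.lastIndexOf(" ");
      // Hard cut if there is no space; this may split a surrogate pair.
      cut = space > 0 ? space : maxChars;
    }
    chunks.push(rest.slice(0, cut).trim());
    rest = rest.slice(cut).trim();
  }
  if (rest.length > 0) {
    chunks.push(rest);
  }
  return chunks;
}
```

Each chunk can then be sent as its own request and the resulting audio concatenated in order. Splitting at sentence ends tends to keep prosody natural across chunk boundaries.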
