Cartesia Sonic

Custom
Text-to-Speech

Cartesia's ultra-low-latency TTS (~90ms TTFB). State-space model with voice cloning support.

TL;DR · Last updated May 16, 2026

Cartesia Sonic is a text-to-speech AI model from Custom, priced at €0.030 per 1M input tokens, with an unknown context window.

Direct API access coming soon

Pricing

Price per Generation
Per generation: Free

API Integration

Use our OpenAI-compatible API to integrate Cartesia Sonic into your application.

Install
npm install railwail
JavaScript / TypeScript
import railwail from "railwail";

const rw = railwail("YOUR_API_KEY");

// Simple: just pass a string
const reply = await rw.run("cartesia-sonic", "Hello! What can you do?");
console.log(reply);

// With message history
const reply2 = await rw.run("cartesia-sonic", [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Explain quantum computing simply." },
]);
console.log(reply2);

// Full response with usage info
const res = await rw.chat("cartesia-sonic", [
  { role: "user", content: "Hello!" },
], { temperature: 0.7, max_tokens: 500 });
console.log(res.choices[0].message.content);
console.log(res.usage);

Specifications
Developer: Custom
Category: Text-to-Speech
Supported Formats: text
Tags: cartesia, tts, low-latency, voice-cloning, realtime

Deep dive: Cartesia AI's Cartesia Sonic

About Cartesia AI
Founded 2023 · San Francisco, California, USA

Cartesia AI was founded in 2023 by Karan Goel and Albert Gu, the academic team behind the influential state-space model line of research at Stanford and CMU (S4, H3, Mamba, Mamba-2). The co-founding team also includes Arjun Desai, Brandon Yang and Chris Ré (Stanford advisor). The company set out to build voice and multimodal foundation models on a state-space backbone instead of Transformers, aiming for sub-100 ms latency and stream-friendly inference. Cartesia raised a $27M seed in March 2024 led by Index Ventures with participation from Conviction, A* and Lightspeed, followed by a $64M Series A in October 2024 led by Kleiner Perkins at a reported $325M valuation. Sonic, the company's first product, launched in May 2024 and quickly became one of the lowest-latency commercial TTS systems on the market.

Architecture
State-space model (Mamba family) text-to-speech with neural codec output

Cartesia Sonic is a streaming text-to-speech model built on the structured state-space model (SSM) architecture pioneered by Cartesia's founders (S4, H3, Mamba, Mamba-2). Unlike Transformer-based TTS systems, whose self-attention cost grows quadratically with sequence length, Sonic uses linear-time SSMs with a selective scan, so the model keeps a small, fixed-size recurrent state and can generate audio chunks as text streams in. Cartesia reports a model first-byte latency of around 75-90 ms on its hosted API, which it positions as faster than ElevenLabs Turbo and OpenAI TTS-1. Sonic outputs 24 kHz PCM via a neural codec decoder, supports streaming text input (so it can start speaking before the LLM finishes its sentence) and offers voice cloning from short reference samples (3-30 seconds). Sonic 2 added improved prosody, multilingual coverage (15+ languages) and a lower word error rate (WER). The system is offered exclusively as a hosted API; weights are not released.
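
To make the "fixed-size recurrent state" idea concrete, here is a toy sketch of a discrete linear state-space recurrence, h_t = a * h_(t-1) + b * x_t with output y_t = c · h_t. It is illustrative only and is not Cartesia's model: the diagonal, time-invariant parameters and the names ToySsm, ssmStep and ssmRun are assumptions made for this example, and Mamba-style selective scans additionally make the parameters depend on the input.

```typescript
// Toy diagonal linear state-space layer (illustrative; not Cartesia's architecture).
//   h_t = a * h_{t-1} + b * x_t   (elementwise, per state channel)
//   y_t = sum_i c_i * h_t[i]
interface ToySsm {
  a: number[]; // per-channel state decay (diagonal state matrix)
  b: number[]; // input projection
  c: number[]; // output projection
}

// Consume one input sample, updating the fixed-size state in place.
function ssmStep(model: ToySsm, state: number[], x: number): number {
  let y = 0;
  for (let i = 0; i < state.length; i++) {
    state[i] = model.a[i] * state[i] + model.b[i] * x;
    y += model.c[i] * state[i];
  }
  return y;
}

// Stream a whole sequence through the layer.
// Time grows linearly with the input length; memory stays at the state size.
function ssmRun(model: ToySsm, inputs: number[]): number[] {
  const state = model.a.map(() => 0);
  return inputs.map((x) => ssmStep(model, state, x));
}

const toy: ToySsm = { a: [0.5, 0.25], b: [1, 1], c: [1, 2] };
console.log(ssmRun(toy, [1, 0, 0])); // [3, 1, 0.375]
```

Each step touches only the current input and the state vector, which is why a recurrence of this kind can emit output incrementally instead of re-reading the full history the way attention does.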

Parameters: Undisclosed
Context: 24K tokens
What it can do
  • Sub-100 ms first-byte latency suitable for real-time voice agents
  • State-space (Mamba-family) backbone with linear-time inference
  • Streaming text input and streaming audio output via WebSocket or gRPC
  • Instant voice cloning from short reference audio
  • Multilingual: English, Spanish, French, German, Portuguese, Mandarin, Japanese and more
  • Emotion and pace controls via inline tags
  • 24 kHz PCM, MP3 and Opus output formats
  • Best for: real-time voice agents, IVR systems, conversational AI, low-latency phone bots
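
Streaming text input is typically paired with a small client-side buffer that turns LLM token fragments into sentence-sized chunks before they are forwarded to the TTS connection. The sketch below shows only that buffering step. The SentenceBuffer class is a hypothetical helper, not part of any Cartesia or railwail SDK, and the transport (WebSocket or gRPC message format) is left out because the source does not document it.

```typescript
// Buffers streamed text fragments and emits complete sentences as they close.
// Hypothetical helper for illustration; the boundary rule is deliberately simple.
class SentenceBuffer {
  private pending = "";

  // Add a fragment; return any sentences it completed.
  push(fragment: string): string[] {
    this.pending += fragment;
    const sentences: string[] = [];
    // A sentence ends at ".", "!" or "?" followed by whitespace.
    const boundary = /[.!?]\s+/g;
    let start = 0;
    let match: RegExpExecArray | null;
    while ((match = boundary.exec(this.pending)) !== null) {
      const end = match.index + 1; // keep the punctuation mark
      sentences.push(this.pending.slice(start, end).trim());
      start = boundary.lastIndex;
    }
    this.pending = this.pending.slice(start);
    return sentences;
  }

  // Return whatever is left once the upstream text stream has ended.
  flush(): string {
    const rest = this.pending.trim();
    this.pending = "";
    return rest;
  }
}

// Usage: feed fragments as they arrive, forward each finished sentence.
const buffer = new SentenceBuffer();
for (const fragment of ["Hello wor", "ld. How are", " you?"]) {
  for (const sentence of buffer.push(fragment)) console.log(sentence);
}
console.log(buffer.flush());
```

The boundary rule will split after abbreviations such as "Dr. ", so a production agent would likely want a smarter segmenter. The point here is only that speech can start on the first finished sentence while later text is still arriving.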
Training & License

Cartesia has not disclosed the training corpus. Public statements describe a 'diverse multilingual speech dataset' with permissioned voice talent for the stock voice library.

License: Proprietary commercial API. Voice clones produced from customer audio remain customer property under the Terms of Service; commercial use is permitted on paid tiers.

Known limitations
  • Closed weights, hosted-only
  • Voice clone quality below ElevenLabs Multilingual V2 for nuanced emotional acting
  • Limited SSML / fine-grained prosody control
  • Hard cap of around 24,000 input characters per request
  • Mandarin and Japanese still less polished than English/Spanish
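
One way to work within the per-request character cap listed above is to split long text on the client before sending it. This is a minimal sketch, assuming the limit is counted in characters and that each chunk is sent as its own request. The splitForTts function and the 24000 default are illustrative, taken from the approximate figure above rather than from an official SDK.

```typescript
// Split text into chunks of at most maxChars, preferring sentence boundaries.
// Hypothetical helper; the default mirrors the approximate cap quoted above.
function splitForTts(text: string, maxChars = 24000): string[] {
  if (maxChars <= 0) throw new RangeError("maxChars must be positive");
  const chunks: string[] = [];
  let rest = text.trim();
  while (rest.length > maxChars) {
    const head = rest.slice(0, maxChars);
    // Prefer the last sentence end in the head, then the last space, then a hard cut.
    const sentenceEnd = Math.max(
      head.lastIndexOf(". "),
      head.lastIndexOf("! "),
      head.lastIndexOf("? ")
    );
    const space = head.lastIndexOf(" ");
    const cut = sentenceEnd >= 0 ? sentenceEnd + 1 : space > 0 ? space : maxChars;
    chunks.push(rest.slice(0, cut).trim());
    rest = rest.slice(cut).trim();
  }
  if (rest.length > 0) chunks.push(rest);
  return chunks;
}

console.log(splitForTts("One. Two. Three.", 10)); // ["One. Two.", "Three."]
```

Cutting at sentence ends keeps each request prosodically self-contained, which matters more for speech than for plain text, since a mid-sentence cut would produce an audible seam between clips.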
