Cartesia Sonic
Cartesia's ultra-low-latency TTS (~90ms TTFB). State-space model with voice cloning support.
Cartesia Sonic is a text-to-speech AI model from Custom, priced at €0.030 per 1M input tokens with an unknown context window.
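As a rough illustration of the listed rate, the sketch below turns a token count into an estimated input cost. The helper name `estimateInputCostEur` is hypothetical and not part of any SDK; only the €0.030 per 1M input tokens figure comes from the listing, and actual billing may include other components.

```javascript
// Rate taken from the listing above: EUR 0.030 per 1,000,000 input tokens.
const EUR_PER_MILLION_INPUT_TOKENS = 0.03;

// Hypothetical helper (not part of any SDK): estimate the input cost in euros.
function estimateInputCostEur(inputTokens) {
  if (!Number.isFinite(inputTokens) || inputTokens < 0) {
    throw new RangeError("inputTokens must be a non-negative number");
  }
  return (inputTokens / 1_000_000) * EUR_PER_MILLION_INPUT_TOKENS;
}

// 2.5 million input tokens at the listed rate.
console.log(estimateInputCostEur(2_500_000).toFixed(4)); // "0.0750"
```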
API Integration
Use our OpenAI-compatible API to integrate Cartesia Sonic into your application.
npm install railwail

import railwail from "railwail";
const rw = railwail("YOUR_API_KEY");
// Simple: just pass a string
const reply = await rw.run("cartesia-sonic", "Hello! What can you do?");
console.log(reply);
// With message history
const reply2 = await rw.run("cartesia-sonic", [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "Explain quantum computing simply." },
]);
console.log(reply2);
// Full response with usage info
const res = await rw.chat("cartesia-sonic", [
{ role: "user", content: "Hello!" },
], { temperature: 0.7, max_tokens: 500 });
console.log(res.choices[0].message.content);
console.log(res.usage);

Deep dive: Cartesia AI's Cartesia Sonic
Cartesia AI was founded in 2023 by Karan Goel and Albert Gu, the academic team behind the influential state-space model line of research at Stanford and CMU (S4, H3, Mamba, Mamba-2). Other co-founders include Arjun Desai, Brandon Yang and Chris Ré (Stanford advisor). The company set out to build voice and multimodal foundation models on a state-space backbone instead of Transformers in order to achieve sub-100 ms latency and stream-friendly inference. Cartesia raised a $27M seed round in March 2024 led by Index Ventures with participation from Conviction, A* and Lightspeed, followed by a $64M Series A in October 2024 led by Kleiner Perkins at a reported $325M valuation. Sonic, the company's first product, launched in May 2024 and quickly became one of the lowest-latency commercial TTS systems on the market.
Cartesia Sonic is a streaming text-to-speech model built on the structured state-space model (SSM) architecture pioneered by Cartesia's founders (S4, H3, Mamba, Mamba-2). Unlike Transformer-based TTS systems, whose attention cost scales quadratically with sequence length, Sonic uses linear-time SSMs with selective scan, which lets the model maintain a small constant-memory recurrent state and generate audio chunks as text streams in.

Cartesia reports a model first-byte latency of around 75-90 ms on its hosted API, which is faster than ElevenLabs Turbo and OpenAI TTS-1. Sonic outputs 24 kHz PCM via a neural codec decoder, supports streaming text input (so it can start speaking before the LLM finishes its sentence) and offers voice cloning from short reference samples (3-30 seconds). Sonic 2 added improved prosody, multilingual coverage (15+ languages) and a reduced word error rate (WER). The system is offered exclusively as a hosted API; weights are not released.
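To make the "constant-memory recurrent state" idea concrete, here is a minimal sketch of a toy diagonal linear state-space recurrence. It is not Cartesia's model: the parameters are made-up numbers, and `a` and `b` are fixed, whereas a Mamba-style selective scan makes them depend on the input. The sketch only shows that the state has a fixed size no matter how much input has been seen, and that feeding the input in chunks gives the same output as feeding it all at once.

```javascript
// Toy diagonal linear state-space layer. All parameter values are made up.
// Recurrence per step: h_t = a * h_{t-1} + b * x_t, output y_t = c . h_t
function createToySsm(a, b, c) {
  const state = new Float64Array(a.length); // fixed-size recurrent state
  return {
    // Consume one chunk of input values and return one output per input.
    step(chunk) {
      const out = [];
      for (const x of chunk) {
        let y = 0;
        for (let i = 0; i < state.length; i++) {
          state[i] = a[i] * state[i] + b[i] * x; // update state in place
          y += c[i] * state[i];                  // read out from the state
        }
        out.push(y);
      }
      return out;
    },
  };
}

const a = [0.9, 0.5];
const b = [1.0, 0.3];
const c = [0.7, -0.2];
const input = [1, 0, 0, 2, -1, 0.5];

// Process the whole sequence in one call.
const whole = createToySsm(a, b, c).step(input);

// Process the same sequence as three streamed chunks.
const streamed = createToySsm(a, b, c);
const chunked = [
  ...streamed.step(input.slice(0, 2)),
  ...streamed.step(input.slice(2, 3)),
  ...streamed.step(input.slice(3)),
];

console.log(whole.every((y, i) => y === chunked[i])); // true
```

Each input costs a fixed amount of work (one pass over the state), so total time grows linearly with sequence length, and nothing from earlier steps needs to be kept apart from the state itself.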
- Parameters: Undisclosed
- Context: 24K tokens
- Sub-100 ms first-byte latency suitable for real-time voice agents
- State-space (Mamba-family) backbone with linear-time inference
- Streaming text input and streaming audio output via WebSocket or gRPC
- Instant voice cloning from short reference audio
- Multilingual: English, Spanish, French, German, Portuguese, Mandarin, Japanese and more
- Emotion and pace controls via inline tags
- 24 kHz PCM, MP3 and Opus output formats
- Best for: real-time voice agents, IVR systems, conversational AI, low-latency phone bots
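Streaming text input is usually paired with some client-side buffering, so that text arriving in small fragments from an LLM is forwarded to the speech model in speakable units. The sketch below is one possible way to do that, assuming sentence-level units; `createSentenceBuffer` is a hypothetical helper, not part of the Cartesia or railwail APIs, and each returned sentence would be handed to whatever TTS call the application uses.

```javascript
// Hypothetical helper: buffers streamed text fragments and releases
// complete sentences as soon as they are available.
function createSentenceBuffer() {
  let pending = "";
  return {
    // Add a fragment; returns the sentences that are now complete.
    push(fragment) {
      pending += fragment;
      const ready = [];
      // Naive rule: a sentence ends at ., ! or ? followed by whitespace.
      const boundary = /[.!?]+\s+/g;
      let start = 0;
      let match;
      while ((match = boundary.exec(pending)) !== null) {
        const end = match.index + match[0].length;
        ready.push(pending.slice(start, end).trim());
        start = end;
      }
      pending = pending.slice(start); // keep the unfinished tail
      return ready;
    },
    // Call once the upstream text is finished to release the tail.
    flush() {
      const rest = pending.trim();
      pending = "";
      return rest ? [rest] : [];
    },
  };
}

const buffer = createSentenceBuffer();
const spoken = [];
for (const fragment of ["Hel", "lo there! How", " are you", " today? Fine"]) {
  spoken.push(...buffer.push(fragment));
}
spoken.push(...buffer.flush());
console.log(spoken); // [ 'Hello there!', 'How are you today?', 'Fine' ]
```

The boundary rule is deliberately simple and will split after abbreviations such as "Dr."; a production pipeline would likely need a more careful segmenter.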
Cartesia has not disclosed the training corpus. Public statements describe a 'diverse multilingual speech dataset' with permissioned voice talent for the stock voice library.
License: Proprietary commercial API. Voice clones produced from customer audio remain customer property under the Terms of Service; commercial use is permitted on paid tiers.
Known limitations
- Closed weights, hosted-only
- Voice clone quality below ElevenLabs Multilingual V2 for nuanced emotional acting
- Limited SSML / fine-grained prosody control
- Hard cap of around 24,000 input characters per request
- Mandarin and Japanese still less polished than English/Spanish
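One way to work within the per-request input cap is to split long text before sending it. The sketch below assumes the limit is 24,000 characters, taken from the approximate figure in the list above (the exact value should be confirmed against the provider's documentation), and `splitForTts` is a hypothetical helper rather than part of any SDK. It prefers to break after sentence-ending punctuation, falls back to whitespace, and only hard-cuts when neither is available.

```javascript
// Assumed limit, based on the "around 24,000 characters" figure above.
const MAX_CHARS = 24_000;

// Hypothetical helper: split text into pieces of at most maxChars.
// Lengths are counted in UTF-16 code units, as String.prototype.length does.
function splitForTts(text, maxChars = MAX_CHARS) {
  if (!Number.isInteger(maxChars) || maxChars < 1) {
    throw new RangeError("maxChars must be a positive integer");
  }
  const pieces = [];
  let rest = text.trim();
  while (rest.length > maxChars) {
    const head = rest.slice(0, maxChars);
    // Prefer the last sentence end inside the allowed window.
    let cut = Math.max(
      head.lastIndexOf(". "),
      head.lastIndexOf("! "),
      head.lastIndexOf("? ")
    ) + 1;
    if (cut <= 0) cut = head.lastIndexOf(" "); // fall back to whitespace
    if (cut <= 0) cut = maxChars;              // last resort: hard cut
    pieces.push(rest.slice(0, cut).trim());
    rest = rest.slice(cut).trim();
  }
  if (rest) pieces.push(rest);
  return pieces;
}

// Small limit used here only to make the behaviour visible.
console.log(splitForTts("One. Two. Three.", 10)); // [ 'One. Two.', 'Three.' ]
```

Each piece can then be synthesized in its own request and the resulting audio concatenated in order.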
Related Models
ElevenLabs Multilingual V2
ElevenLabs' most natural-sounding TTS model. Supports 29 languages with emotional range.
AudioCraft
Meta's AudioCraft framework wrapping MusicGen, AudioGen and EnCodec. Unified text-to-audio research toolkit for music and sound effects.
AudioLDM 2
Latent-diffusion model for general-purpose text-to-audio. Generates speech, music, and sound effects with a unified prior.
Edge TTS
Microsoft Edge neural voices accessed via the open-source edge-tts wrapper. 400+ voices across 100+ locales, suitable for batch generation.