Overview
This API wraps the Chatterbox-Turbo model behind an OpenAI-compatible /v1/audio/speech interface. It is a drop-in replacement for openai.audio.speech.create() that adds voice cloning, emotion control, and paralinguistic tags.
| Feature | Details |
|---|---|
| Model | ResembleAI Chatterbox-Turbo (350M params) |
| Architecture | Streaming encoder-decoder transformer |
| License | MIT — free commercial use |
| Voice Cloning | Zero-shot from 5+ seconds of audio |
| Emotion control | Continuous exaggeration slider (0.0 → 1.0) |
| Paralinguistic tags | [laugh], [sigh], [cough] and 6 more |
| Pre-made voices | 20 (11 male, 9 female) |
| Output formats | WAV, MP3, FLAC, Opus, PCM |
| Sample rate | 24,000 Hz |
| Max input | 4,096 characters |
Quick Start
The fastest way to get audio from text is to use the OpenAI Python SDK as a drop-in client.
from openai import OpenAI

client = OpenAI(
    base_url="https://naimulislam864-chatterbox-tts.hf.space/v1",
    api_key="YOUR_TTS_API_KEY"
)
response = client.audio.speech.create(
    model="chatterbox-turbo",
    voice="andy",
    input="Hello! This is Chatterbox Turbo speaking."
)
response.stream_to_file("output.wav")
base_url is your HuggingFace Space URL + /v1. The api_key is the TTS_API_KEY secret you set in Space settings.
Authentication
All endpoints except /health require a Bearer token in the Authorization header.
Authorization: Bearer YOUR_TTS_API_KEY
The API key is the value you set as TTS_API_KEY in HuggingFace Space → Settings → Variables and secrets.
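As a minimal sketch, the header can be built from an environment variable instead of being hard-coded. The helper name `auth_headers` is hypothetical, and reading `TTS_API_KEY` from the local environment is an assumption of this example (the source only says the secret is set in the Space settings).

```python
import os


def auth_headers(api_key=None):
    """Build request headers carrying the Bearer token.

    Falls back to a local TTS_API_KEY environment variable (an assumption
    of this sketch) when no key is passed explicitly.
    """
    key = api_key or os.environ.get("TTS_API_KEY")
    if not key:
        raise ValueError("No API key provided and TTS_API_KEY is not set")
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }
```

The returned dictionary can be passed as `headers=` to any HTTP client call against the authenticated endpoints.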
Endpoints
POST /v1/audio/speech returns raw binary audio. The Content-Type depends on the response_format parameter.
Response Headers
| Header | Example | Description |
|---|---|---|
| Content-Type | audio/wav | MIME type of audio |
| X-Sample-Rate | 24000 | Sample rate in Hz |
| X-Audio-Format | wav | Format name |
| Content-Disposition | attachment; filename="speech.wav" | Suggested filename |
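The headers in the table can be read on the client to decide how to store the audio. The following is a rough sketch; the function name `describe_audio_response` is hypothetical, and the fallback values (`wav`, 24000) are assumptions taken from the defaults documented here rather than guaranteed server behaviour.

```python
def describe_audio_response(headers):
    """Summarise the audio-related response headers.

    `headers` is any mapping of header names to values, for example
    `response.headers` from the requests library.
    """
    # HTTP header names are case-insensitive, so normalise the keys first.
    lowered = {name.lower(): value for name, value in headers.items()}
    audio_format = lowered.get("x-audio-format", "wav")      # assumed fallback
    sample_rate = int(lowered.get("x-sample-rate", 24000))   # assumed fallback
    return {
        "mime_type": lowered.get("content-type"),
        "format": audio_format,
        "sample_rate": sample_rate,
        "filename": f"speech.{audio_format}",
    }
```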
Model list response:
{
  "object": "list",
  "data": [{
    "id": "chatterbox-turbo",
    "object": "model",
    "owned_by": "resemble-ai",
    "capabilities": {
      "tts": true, "voice_cloning": true,
      "sample_rate": 24000, "max_chars": 4096
    }
  }]
}
Health check response (GET /health):
{ "status": "ok", "model": "chatterbox-turbo", "device": "cpu", "sample_rate": 24000 }
Request Parameters
Standard (OpenAI-compatible)
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| model | string | ✅ | — | Must be "chatterbox-turbo" |
| input | string | ✅ | — | Text to synthesize. Max 4,096 chars |
| voice | string | ✅ | "default" | Voice name — see Voice Names section |
| response_format | string | ❌ | "wav" | wav / mp3 / flac / opus / pcm |
| speed | float | ❌ | 1.0 | 0.25–4.0. Accepted for compatibility |
Extended (Chatterbox-specific)
| Parameter | Type | Default | Description |
|---|---|---|---|
| exaggeration | float | 0.5 | Emotion intensity. Range: 0.0–1.0 |
| cfg_weight | float | 0.5 | Guidance weight. Range: 0.0–1.0 |
| voice_sample_b64 | string | null | Base64 WAV for zero-shot voice cloning |
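To catch out-of-range values before they reach the server, the documented limits can be checked on the client. This is only a sketch under the ranges listed in the two tables above; the helper name `build_speech_payload` is hypothetical, and the server remains the authority on what it accepts.

```python
ALLOWED_FORMATS = {"wav", "mp3", "flac", "opus", "pcm"}
MAX_CHARS = 4096


def build_speech_payload(text, voice="default", response_format="wav",
                         speed=1.0, exaggeration=0.5, cfg_weight=0.5):
    """Validate parameters against the documented ranges and build the JSON body."""
    if not text:
        raise ValueError("input must not be empty")
    if len(text) > MAX_CHARS:
        raise ValueError(f"input exceeds {MAX_CHARS} characters")
    if response_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    for name, value in (("exaggeration", exaggeration), ("cfg_weight", cfg_weight)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0")
    return {
        "model": "chatterbox-turbo",
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
        "exaggeration": exaggeration,
        "cfg_weight": cfg_weight,
    }
```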
Output Formats
| Format | response_format | MIME Type | Best For |
|---|---|---|---|
| WAV | wav | audio/wav | Lossless. Best quality. Default |
| MP3 | mp3 | audio/mpeg | Compressed. Smaller file size |
| FLAC | flac | audio/flac | Lossless compression |
| Opus | opus | audio/ogg | Best compression for streaming |
| PCM | pcm | audio/pcm | Raw 16-bit signed integer samples |
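Because the `pcm` format is headerless, most players cannot open it directly. One way to handle this, sketched below with Python's standard `wave` module, is to wrap the raw samples in a WAV container. The 16-bit sample width and 24,000 Hz rate come from this document; mono output and little-endian byte order are assumptions of this sketch, and `pcm_to_wav_bytes` is a hypothetical helper name.

```python
import io
import wave


def pcm_to_wav_bytes(pcm_bytes, sample_rate=24000, channels=1):
    """Wrap raw 16-bit signed PCM samples in a WAV container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)   # assumed mono
        wav_file.setsampwidth(2)          # 16-bit samples are 2 bytes each
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm_bytes)
    return buffer.getvalue()
```

The result can be written to a `.wav` file as-is.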
Voice Names
Pass the voice name (lowercase) in the voice field. If voice_sample_b64 is set, it overrides the voice name.
♂ Male Voices
♀ Female Voices
Paralinguistic Tags
Insert tags directly into the input text to trigger natural vocal sounds. The sounds are generated in the selected voice, with no audio splicing.
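A small sketch for spotting tags in a string before sending it. The `KNOWN_TAGS` set contains only the tags that appear somewhere in this document, so it is incomplete by construction; the names `find_tags` and `unknown_tags` are hypothetical, and the assumption that tags are lowercase words in square brackets is inferred from the examples here.

```python
import re

# Only the tags visible in this document; the full supported list is longer.
KNOWN_TAGS = {"laugh", "sigh", "cough", "gasp", "clear throat", "groan", "chuckle"}

# Assumed tag shape: lowercase words (spaces allowed) inside square brackets.
TAG_PATTERN = re.compile(r"\[([a-z ]+)\]")


def find_tags(text):
    """Return every bracketed tag in the text, in order of appearance."""
    return TAG_PATTERN.findall(text)


def unknown_tags(text, known=KNOWN_TAGS):
    """Return tags that are not in the known set, e.g. to flag likely typos."""
    return [tag for tag in find_tags(text) if tag not in known]
```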
Examples
# Surprise
"And then she opened the box... [gasp] I could not believe what was inside."
# Emotional narration
"[sigh] It had been a long journey. But standing at the top, every step was worth it."
# Professional opening
"[clear throat] Good morning, everyone. Today I'd like to share some findings."
# Comedy reaction
"He showed up three hours late? [groan] And then asked why everyone looked annoyed? [chuckle]"
Emotion & CFG Control
exaggeration — Emotion Intensity
Controls how dramatically the voice delivers the text. Same voice, same text — completely different feel.
| Range | Delivery | Best For |
|---|---|---|
0.0 – 0.2 | Flat, monotone | IVR, notifications, clinical readouts |
0.3 – 0.5 | Neutral, measured | News, technical documentation |
0.5 – 0.7 | Natural, conversational | General narration, tutorials (default) |
0.7 – 0.9 | Expressive, engaging | Podcasts, audiobooks, marketing |
0.9 – 1.0 | Theatrical, dramatic | Audio dramas, trailers, characters |
cfg_weight — Guidance Weight
Controls how strictly the model follows voice reference or style. Lower = looser pacing. Higher = tighter accuracy.
Recommended Presets
| Use Case | exaggeration | cfg_weight |
|---|---|---|
| Audiobook narration | 0.65 | 0.50 |
| Corporate / professional | 0.25 | 0.60 |
| Character voice / drama | 0.90 | 0.45 |
| Meditation / wellness | 0.30 | 0.55 |
| Podcast / conversational | 0.60 | 0.50 |
| Game character | 0.85 | 0.40 |
| IVR / phone system | 0.15 | 0.65 |
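The presets in the table can be kept in a lookup so that callers pick a use case instead of raw numbers. This is a sketch only: the short keys (`audiobook`, `ivr`, and so on) and the helper name `preset_payload` are hypothetical, while the values are copied from the table above.

```python
PRESETS = {
    "audiobook":  {"exaggeration": 0.65, "cfg_weight": 0.50},
    "corporate":  {"exaggeration": 0.25, "cfg_weight": 0.60},
    "character":  {"exaggeration": 0.90, "cfg_weight": 0.45},
    "meditation": {"exaggeration": 0.30, "cfg_weight": 0.55},
    "podcast":    {"exaggeration": 0.60, "cfg_weight": 0.50},
    "game":       {"exaggeration": 0.85, "cfg_weight": 0.40},
    "ivr":        {"exaggeration": 0.15, "cfg_weight": 0.65},
}


def preset_payload(preset, text, voice):
    """Build a request body using one of the recommended presets."""
    if preset not in PRESETS:
        raise KeyError(f"unknown preset: {preset!r}")
    return {
        "model": "chatterbox-turbo",
        "input": text,
        "voice": voice,
        **PRESETS[preset],
    }
```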
Voice Cloning
Zero-shot voice cloning from a reference audio clip. No training or fine-tuning required.
import base64, requests

# Read the reference clip and base64-encode it
with open("my_voice.wav", "rb") as f:
    voice_b64 = base64.b64encode(f.read()).decode("utf-8")

response = requests.post(
    "https://naimulislam864-chatterbox-tts.hf.space/v1/audio/speech",
    headers={"Authorization": "Bearer YOUR_TTS_API_KEY", "Content-Type": "application/json"},
    json={
        "model": "chatterbox-turbo",
        "input": "This is my cloned voice.",
        "response_format": "wav",
        "voice_sample_b64": voice_b64,
        "exaggeration": 0.6,
    },
    timeout=120
)
response.raise_for_status()  # avoid writing an error body to the .wav file
with open("cloned.wav", "wb") as f:
    f.write(response.content)
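Since cloning expects at least 5 seconds of reference audio, it may help to check the clip length locally before encoding it. The sketch below uses the standard `wave` module, which reads only uncompressed PCM WAV files; the helper names `wav_duration_seconds` and `encode_voice_sample` are hypothetical, and whether the server enforces the 5-second minimum is not stated in this document.

```python
import base64
import io
import wave

MIN_SECONDS = 5.0  # from "5+ seconds of audio" in the feature table


def wav_duration_seconds(wav_bytes):
    """Return the duration of an uncompressed PCM WAV clip in seconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def encode_voice_sample(wav_bytes, min_seconds=MIN_SECONDS):
    """Base64-encode a reference clip after checking its length."""
    duration = wav_duration_seconds(wav_bytes)
    if duration < min_seconds:
        raise ValueError(
            f"reference clip is {duration:.1f}s; at least {min_seconds:.0f}s is recommended"
        )
    return base64.b64encode(wav_bytes).decode("utf-8")
```

The returned string is what goes into the `voice_sample_b64` field.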
Code Examples
# Python (OpenAI SDK)
from openai import OpenAI

client = OpenAI(
    base_url="https://naimulislam864-chatterbox-tts.hf.space/v1",
    api_key="YOUR_TTS_API_KEY"
)

# Basic
response = client.audio.speech.create(
    model="chatterbox-turbo", voice="dylan",
    input="Welcome to Chapter One.", response_format="wav"
)
response.stream_to_file("chapter1.wav")

# With emotion + tags
response = client.audio.speech.create(
    model="chatterbox-turbo", voice="anaya",
    input="[gasp] She couldn't believe her eyes. [sigh] After all this time...",
    response_format="mp3",
    extra_body={"exaggeration": 0.8, "cfg_weight": 0.45}
)
response.stream_to_file("scene.mp3")
# Python (requests)
import requests

response = requests.post(
    "https://naimulislam864-chatterbox-tts.hf.space/v1/audio/speech",
    headers={
        "Authorization": "Bearer YOUR_TTS_API_KEY",
        "Content-Type": "application/json"
    },
    json={
        "model": "chatterbox-turbo",
        "input": "Hello! This is Chatterbox Turbo.",
        "voice": "emily",
        "response_format": "wav",
        "exaggeration": 0.6,
        "cfg_weight": 0.5
    },
    timeout=120
)
with open("output.wav", "wb") as f:
    f.write(response.content)
print(f"Saved {len(response.content):,} bytes")
// JavaScript (fetch)
const response = await fetch(
  "https://naimulislam864-chatterbox-tts.hf.space/v1/audio/speech", {
    method: "POST",
    headers: {
      "Authorization": "Bearer YOUR_TTS_API_KEY",
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "chatterbox-turbo", input: "Hello from JavaScript!",
      voice: "andy", response_format: "wav",
      exaggeration: 0.6, cfg_weight: 0.5,
    }),
  }
);

// A response body can only be read once, so use ONE of the two variants below.

// Browser: play directly
const blob = new Blob([await response.arrayBuffer()], { type: "audio/wav" });
new Audio(URL.createObjectURL(blob)).play();

// Node.js: save to file
import fs from "fs";
fs.writeFileSync("output.wav", Buffer.from(await response.arrayBuffer()));
# Basic
curl -X POST "https://naimulislam864-chatterbox-tts.hf.space/v1/audio/speech" \
-H "Authorization: Bearer YOUR_TTS_API_KEY" \
-H "Content-Type: application/json" \
-d '{"model":"chatterbox-turbo","input":"Hello!","voice":"gordon","response_format":"wav"}' \
--output output.wav
# Health check (no auth)
curl "https://naimulislam864-chatterbox-tts.hf.space/health"
Advanced Usage
Waking a sleeping Space
HuggingFace free Spaces sleep after 48h of inactivity. Poll /health until it responds.
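import time, requests

def wait_for_space(base_url, timeout=120):
    # Remove a trailing "/v1" suffix. str.rstrip("/v1") strips characters, not a
    # suffix, so it can eat into the hostname. removesuffix needs Python 3.9+.
    root = base_url.rstrip("/").removesuffix("/v1")
    start = time.time()
    while time.time() - start < timeout:
        try:
            r = requests.get(f"{root}/health", timeout=10)
            if r.ok and r.json().get("status") == "ok":
                print("Space is online.")
                return True
        except (requests.RequestException, ValueError):
            pass  # Space not reachable yet, or the response was not JSON
        print("Waking up... retrying in 10s")
        time.sleep(10)
    raise TimeoutError("Space did not wake up in time.")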
Batch generation
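import requests

URL = "https://naimulislam864-chatterbox-tts.hf.space/v1/audio/speech"
HEADERS = {"Authorization": "Bearer YOUR_TTS_API_KEY", "Content-Type": "application/json"}

# (label, voice, exaggeration, cfg_weight)
lines = [
    ("Corporate", "aaron", 0.25, 0.65),
    ("Audiobook", "dylan", 0.65, 0.50),
    ("Dramatic", "emmanuel", 0.90, 0.40),
    ("Meditation", "laura", 0.30, 0.55),
]
TEXT = "The results exceeded all expectations. This changes everything."

for name, voice, exag, cfg in lines:
    r = requests.post(URL, headers=HEADERS, json={
        "model": "chatterbox-turbo", "input": TEXT,
        "voice": voice, "exaggeration": exag, "cfg_weight": cfg
    }, timeout=120)
    r.raise_for_status()  # stop rather than save an error body as audio
    with open(f"{name.lower()}.wav", "wb") as f:
        f.write(r.content)
    print(f"✓ {name}")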
Error Codes
| Code | Meaning | Fix |
|---|---|---|
| 401 | Unauthorized | Check TTS_API_KEY is correct and sent as Bearer token |
| 400 | Bad request | input is empty, >4096 chars, or invalid base64 |
| 422 | Validation error | Parameter out of range (e.g. exaggeration: 5.0) |
| 500 | Server error | Inference failed. Check /health and retry |
| 503 | Service unavailable | Space is sleeping. Wait 30–60s and retry |
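The table suggests that 500 and 503 are worth retrying while 400, 401, and 422 need a fix on the caller's side. A minimal retry sketch along those lines follows. The helper name `post_with_retry`, the attempt count, and the fixed 30-second delay are all assumptions of this example; `send` is any zero-argument callable that performs the request and returns an object with a `status_code` attribute, such as a `requests` response.

```python
import time

RETRYABLE = {500, 503}  # per the table: retry these, fix the request for the rest


def post_with_retry(send, max_attempts=5, delay_seconds=30, sleep=time.sleep):
    """Call `send()` until it returns a non-retryable status or attempts run out.

    Returns the last response either way, so the caller can inspect it.
    """
    response = None
    for attempt in range(1, max_attempts + 1):
        response = send()
        if response.status_code not in RETRYABLE:
            return response
        if attempt < max_attempts:
            sleep(delay_seconds)  # give a sleeping Space time to wake up
    return response
```

For example, `send` could be `lambda: requests.post(url, headers=headers, json=payload, timeout=120)`, where `url`, `headers`, and `payload` are the caller's own values.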
Limits & Performance
| Metric | Value |
|---|---|
| Max input characters | 4,096 |
| Max concurrent requests | 1 (queued) |
| Output sample rate | 24,000 Hz |
| Typical generation time | 10–60s (CPU, depends on length) |
| Cold start (first boot) | 5–10 minutes (3 GB model download) |
| Warm start | 5–15 seconds |
| Space sleep after | 48h of inactivity (free tier) |
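Text longer than the 4,096-character limit has to be split across several requests. One possible approach, sketched here, is to break at sentence boundaries and fall back to a hard cut only when a single sentence exceeds the limit. The function name `split_text` and the sentence-splitting rule (whitespace after `.`, `!`, or `?`) are assumptions of this sketch, not part of the API.

```python
import re


def split_text(text, max_chars=4096):
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip() if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # Adding this sentence would overflow, so start a new chunk.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as its own request. Because only one request is processed at a time, sending them sequentially is the natural fit.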