Chatterbox-Turbo · TTS API Docs

◈ Overview

This API wraps the Chatterbox-Turbo model behind an OpenAI-compatible /v1/audio/speech interface. It works as a drop-in replacement for openai.audio.speech.create() and adds emotion control, paralinguistic tags, and zero-shot voice cloning.

Feature	Details
Model	ResembleAI Chatterbox-Turbo (350M params)
Architecture	Streaming encoder-decoder transformer
License	MIT — free commercial use
Voice Cloning	Zero-shot from 5+ seconds of audio
Emotion control	Continuous `exaggeration` slider (0.0 → 1.0)
Paralinguistic tags	`[laugh]`, `[sigh]`, `[cough]` and 6 more
Pre-made voices	20 (11 male, 9 female)
Output formats	WAV, MP3, FLAC, Opus, PCM
Sample rate	24,000 Hz
Max input	4,096 characters

▶ Quick Start

The fastest way to get audio from text. Uses the OpenAI Python SDK as a drop-in.

Python

from openai import OpenAI

client = OpenAI(
    base_url="https://naimulislam864-chatterbox-tts.hf.space/v1",
    api_key="YOUR_TTS_API_KEY"
)

response = client.audio.speech.create(
    model="chatterbox-turbo",
    voice="andy",
    input="Hello! This is Chatterbox Turbo speaking."
)

response.stream_to_file("output.wav")

Tip: The base_url is your HuggingFace Space URL + /v1. The api_key is the TTS_API_KEY secret you set in Space settings.

◉ Authentication

All endpoints except /health require a Bearer token in the Authorization header.

HTTP Header

Authorization: Bearer YOUR_TTS_API_KEY

The API key is the value you set as TTS_API_KEY in HuggingFace Space → Settings → Variables and secrets.
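
A minimal sketch of building this header in Python so the key stays out of source code. It assumes the client machine exports the same value in an environment variable named TTS_API_KEY; the helper name auth_headers is hypothetical.

```python
import os


def auth_headers(api_key=None):
    """Return request headers carrying the Bearer token.

    Falls back to the TTS_API_KEY environment variable (an assumption:
    the client exports the same value as the Space secret).
    """
    key = api_key or os.environ.get("TTS_API_KEY", "")
    if not key:
        raise ValueError("No API key given and TTS_API_KEY is not set")
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }
```

The returned dictionary can be passed as the headers argument of any HTTP client.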

⬡ Endpoints

POST /v1/audio/speech Synthesize speech from text

Returns raw binary audio. Content-Type depends on the response_format parameter.

Response Headers

Header	Example	Description
Content-Type	`audio/wav`	MIME type of audio
X-Sample-Rate	`24000`	Sample rate in Hz
X-Audio-Format	`wav`	Format name
Content-Disposition	`attachment; filename="speech.wav"`	Suggested filename
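
These headers let a client name the output file without guessing the format. A minimal sketch, assuming the headers arrive as a plain name-to-value mapping; the helper name filename_from_headers is hypothetical.

```python
import os
import re


def filename_from_headers(headers, fallback_stem="speech"):
    """Pick an output filename from the response headers.

    Prefers the Content-Disposition filename; otherwise builds one
    from X-Audio-Format, defaulting to "wav".
    """
    # Header names are case-insensitive, so normalize them first.
    lowered = {k.lower(): v for k, v in headers.items()}
    disposition = lowered.get("content-disposition", "")
    match = re.search(r'filename="?([^";]+)"?', disposition)
    if match:
        # basename() drops any directory part a server might send.
        return os.path.basename(match.group(1))
    return f"{fallback_stem}.{lowered.get('x-audio-format', 'wav')}"
```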

GET /v1/models List available models

Response

{
  "object": "list",
  "data": [{
    "id": "chatterbox-turbo",
    "object": "model",
    "owned_by": "resemble-ai",
    "capabilities": {
      "tts": true, "voice_cloning": true,
      "sample_rate": 24000, "max_chars": 4096
    }
  }]
}
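
A client can read limits from this response instead of hard-coding them. A minimal sketch, assuming the response has exactly the shape shown above; the helper name max_chars_for is hypothetical.

```python
def max_chars_for(models_response, model_id="chatterbox-turbo"):
    """Return the advertised max_chars for a model, or None if absent."""
    for model in models_response.get("data", []):
        if model.get("id") == model_id:
            return model.get("capabilities", {}).get("max_chars")
    return None
```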

GET /health Health check — no auth required

Response

{ "status": "ok", "model": "chatterbox-turbo", "device": "cpu", "sample_rate": 24000 }

⊞ Request Parameters

Standard (OpenAI-compatible)

Parameter	Type	Required	Default	Description
model	string	✅	—	Must be `"chatterbox-turbo"`
input	string	✅	—	Text to synthesize. Max 4,096 chars
voice	string	❌	`"default"`	Voice name (see the Voice Names section). Falls back to the default when omitted
response_format	string	❌	`"wav"`	`wav` / `mp3` / `flac` / `opus` / `pcm`
speed	float	❌	`1.0`	0.25–4.0. Accepted for compatibility

Extended (Chatterbox-specific)

Parameter	Type	Default	Description
exaggeration	float	`0.5`	Emotion intensity. Range: 0.0–1.0
cfg_weight	float	`0.5`	Guidance weight. Range: 0.0–1.0
voice_sample_b64	string	`null`	Base64 WAV for zero-shot voice cloning
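
Checking the documented ranges on the client avoids a round trip that would end in a 400 or 422. The sketch below builds a request body from the parameters in the two tables above. The helper name build_payload and the choice of "andy" as the fallback voice are assumptions for illustration.

```python
ALLOWED_FORMATS = {"wav", "mp3", "flac", "opus", "pcm"}
MAX_CHARS = 4096


def build_payload(text, voice="andy", response_format="wav",
                  exaggeration=0.5, cfg_weight=0.5, voice_sample_b64=None):
    """Validate the documented limits and return the JSON request body."""
    if not text or len(text) > MAX_CHARS:
        raise ValueError(f"input must be 1-{MAX_CHARS} characters")
    if response_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    for name, value in (("exaggeration", exaggeration), ("cfg_weight", cfg_weight)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")
    payload = {
        "model": "chatterbox-turbo",
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "exaggeration": exaggeration,
        "cfg_weight": cfg_weight,
    }
    # Only send the cloning sample when one was supplied.
    if voice_sample_b64 is not None:
        payload["voice_sample_b64"] = voice_sample_b64
    return payload
```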

◫ Output Formats

Format	response_format	MIME Type	Best For
WAV	`wav`	`audio/wav`	Lossless. Best quality. Default
MP3	`mp3`	`audio/mpeg`	Compressed. Smaller file size
FLAC	`flac`	`audio/flac`	Lossless compression
Opus	`opus`	`audio/ogg`	Best compression for streaming
PCM	`pcm`	`audio/pcm`	Raw 16-bit signed integer samples
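
Raw PCM carries no header, so most players cannot open it until it is wrapped in a container. A minimal sketch using Python's standard wave module follows. The 24,000 Hz rate and 16-bit width come from this page; mono audio and little-endian byte order are assumptions, and the helper name pcm_to_wav_bytes is hypothetical.

```python
import io
import wave


def pcm_to_wav_bytes(pcm, sample_rate=24000, channels=1):
    """Wrap raw 16-bit PCM samples in a WAV container.

    Assumes little-endian samples and, by default, a single channel.
    """
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 16-bit samples are 2 bytes each
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buffer.getvalue()
```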

◎ Voice Names

Pass the voice name (lowercase) in the voice field. If voice_sample_b64 is set, it overrides the named voice.

♂ Male Voices

Voice	Character	Best For
`aaron`	Professional · Confident	Corporate content, presentations
`andy`	Friendly · Versatile	Default voice. General-purpose TTS
`archer`	Authoritative · Deep	Documentaries, serious narration
`brian`	Casual · Laid-back	Podcasts, informal content
`dylan`	Smooth · Narrative	Audiobooks, long-form content
`emmanuel`	Deep · Resonant	Trailers, promos, drama
`ethan`	Conversational · Natural	Tutorials, explainers
`gavin`	Energetic · Dynamic	Gaming, sports, high-energy ads
`gordon`	Mature · Seasoned	Documentaries, history, education
`ivan`	Commanding · Powerful	Announcements, authority
`walter`	Classic · Announcer	Commercials, intros, voiceovers

♀ Female Voices

Voice	Character	Best For
`abigail`	Warm · Approachable	Narration, e-learning
`anaya`	Expressive · Dynamic	Storytelling, creative content
`chloe`	Bright · Energetic	Social media, upbeat ads
`evelyn`	Elegant · Sophisticated	Luxury brands, premium content
`laura`	Soothing · Calm	Meditation, wellness, ASMR
`lucy`	Lively · Charismatic	Entertainment, ads
`madison`	Clear · Articulate	Education, instructions
`marisol`	Vibrant · Passionate	Creative projects, storytelling
`meera`	Thoughtful · Measured	Tech, science, analysis

◈ Paralinguistic Tags

Insert tags directly into the text to trigger natural vocal sounds. The sounds are generated in the speaker's voice rather than spliced in as separate audio.

Tag	Description
`[laugh]`	A natural, spontaneous laugh
`[chuckle]`	A soft, quiet laugh
`[sigh]`	Exhaled breath expressing emotion
`[cough]`	A realistic cough sound
`[gasp]`	Sharp intake of breath (surprise)
`[groan]`	Low vocal expression of discomfort
`[sniff]`	A nasal inhalation (sadness)
`[clear throat]`	Throat-clearing before speaking
`[shush]`	A hushing or shushing sound

Rules: Tags can appear anywhere in the text, and a request may contain several. They work with both pre-made and cloned voices. Tags are case-sensitive: always write the tag name in lowercase inside the brackets.
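
Because tags are case-sensitive, text from users or other tools may need normalizing before it is sent. The sketch below lowercases every bracketed run of letters and spaces. It assumes the text contains no other bracketed words whose capitals matter; the helper name lowercase_tags is hypothetical.

```python
import re

# Matches a bracketed run of letters and spaces, e.g. "[Gasp]" or "[clear throat]".
_TAG_PATTERN = re.compile(r"\[[A-Za-z ]+\]")


def lowercase_tags(text):
    """Lowercase every bracketed tag so it matches the case-sensitive form."""
    return _TAG_PATTERN.sub(lambda m: m.group(0).lower(), text)
```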

Examples

Text Examples

# Surprise
"And then she opened the box... [gasp] I could not believe what was inside."

# Emotional narration
"[sigh] It had been a long journey. But standing at the top, every step was worth it."

# Professional opening
"[clear throat] Good morning, everyone. Today I'd like to share some findings."

# Comedy reaction
"He showed up three hours late? [groan] And then asked why everyone looked annoyed? [chuckle]"

◑ Emotion & CFG Control

exaggeration — Emotion Intensity

Controls how dramatically the voice delivers the text. Same voice, same text — completely different feel.

Range	Delivery	Best For
`0.0 – 0.2`	Flat, monotone	IVR, notifications, clinical readouts
`0.3 – 0.5`	Neutral, measured	News, technical documentation
`0.5 – 0.7`	Natural, conversational	General narration, tutorials (default)
`0.7 – 0.9`	Expressive, engaging	Podcasts, audiobooks, marketing
`0.9 – 1.0`	Theatrical, dramatic	Audio dramas, trailers, characters

cfg_weight — Guidance Weight

Controls how strictly the model follows the voice reference or style. Lower values give looser pacing; higher values follow the reference more tightly.

Recommended Presets

Use Case	exaggeration	cfg_weight
Audiobook narration	`0.65`	`0.50`
Corporate / professional	`0.25`	`0.60`
Character voice / drama	`0.90`	`0.45`
Meditation / wellness	`0.30`	`0.55`
Podcast / conversational	`0.60`	`0.50`
Game character	`0.85`	`0.40`
IVR / phone system	`0.15`	`0.65`
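
The presets above can be kept in one lookup table so call sites stay short. A minimal sketch, with values copied from the table; the short key names and the helper name preset_params are hypothetical.

```python
# (exaggeration, cfg_weight) pairs copied from the presets table.
PRESETS = {
    "audiobook": (0.65, 0.50),
    "corporate": (0.25, 0.60),
    "drama": (0.90, 0.45),
    "meditation": (0.30, 0.55),
    "podcast": (0.60, 0.50),
    "game": (0.85, 0.40),
    "ivr": (0.15, 0.65),
}


def preset_params(use_case):
    """Return the extended parameters for a named preset."""
    exaggeration, cfg_weight = PRESETS[use_case]
    return {"exaggeration": exaggeration, "cfg_weight": cfg_weight}
```

With the OpenAI SDK, the result can be passed as extra_body, for example extra_body=preset_params("audiobook").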

⬡ Voice Cloning

Zero-shot voice cloning from a reference audio clip. No training or fine-tuning required.

Reference clip requirements: WAV format · 5–30 seconds (10s ideal) · Clear speech, minimal background noise · Any spoken content works
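
The duration requirement can be checked locally before uploading. The sketch below uses Python's standard wave module, which reads uncompressed PCM WAV files; clips in other encodings would need a different reader. The helper name encode_reference_clip is hypothetical.

```python
import base64
import wave


def encode_reference_clip(path, min_seconds=5.0, max_seconds=30.0):
    """Check a WAV clip's duration, then return it base64-encoded."""
    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
    if not min_seconds <= duration <= max_seconds:
        raise ValueError(
            f"Clip is {duration:.1f}s; expected {min_seconds}-{max_seconds}s"
        )
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")
```

The returned string is what the voice_sample_b64 field expects.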

Python

import base64
import requests

# Read the reference clip and base64-encode it for the JSON body.
with open("my_voice.wav", "rb") as f:
    voice_b64 = base64.b64encode(f.read()).decode("utf-8")

response = requests.post(
    "https://naimulislam864-chatterbox-tts.hf.space/v1/audio/speech",
    headers={"Authorization": "Bearer YOUR_TTS_API_KEY", "Content-Type": "application/json"},
    json={
        "model":            "chatterbox-turbo",
        "input":            "This is my cloned voice.",
        "response_format":  "wav",
        "voice_sample_b64": voice_b64,
        "exaggeration":     0.6,
    },
    timeout=120,
)
response.raise_for_status()  # fail loudly instead of saving an error body as audio

with open("cloned.wav", "wb") as f:
    f.write(response.content)

{ } Code Examples

Python · openai SDK

from openai import OpenAI

client = OpenAI(
    base_url="https://naimulislam864-chatterbox-tts.hf.space/v1",
    api_key="YOUR_TTS_API_KEY"
)

# Basic
response = client.audio.speech.create(
    model="chatterbox-turbo", voice="dylan",
    input="Welcome to Chapter One.", response_format="wav"
)
response.stream_to_file("chapter1.wav")

# With emotion + tags
response = client.audio.speech.create(
    model="chatterbox-turbo", voice="anaya",
    input="[gasp] She couldn't believe her eyes. [sigh] After all this time...",
    response_format="mp3",
    extra_body={"exaggeration": 0.8, "cfg_weight": 0.45}
)
response.stream_to_file("scene.mp3")

Python · requests

import requests

response = requests.post(
    "https://naimulislam864-chatterbox-tts.hf.space/v1/audio/speech",
    headers={
        "Authorization": "Bearer YOUR_TTS_API_KEY",
        "Content-Type": "application/json"
    },
    json={
        "model":           "chatterbox-turbo",
        "input":           "Hello! This is Chatterbox Turbo.",
        "voice":           "evelyn",
        "response_format": "wav",
        "exaggeration":    0.6,
        "cfg_weight":      0.5
    },
    timeout=120
)

with open("output.wav", "wb") as f:
    f.write(response.content)
print(f"Saved {len(response.content):,} bytes")

JavaScript / Node.js

const response = await fetch(
  "https://naimulislam864-chatterbox-tts.hf.space/v1/audio/speech", {
    method: "POST",
    headers: {
      "Authorization": "Bearer YOUR_TTS_API_KEY",
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "chatterbox-turbo", input: "Hello from JavaScript!",
      voice: "andy", response_format: "wav",
      exaggeration: 0.6, cfg_weight: 0.5,
    }),
  }
);

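// The response body can be read only once, so use ONE of the two options below.

// Browser: play directly
const blob = new Blob([await response.arrayBuffer()], { type: "audio/wav" });
new Audio(URL.createObjectURL(blob)).play();

// Node.js: save to file (in an ES module, keep this import at the top of the file)
import fs from "fs";
fs.writeFileSync("output.wav", Buffer.from(await response.arrayBuffer()));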

cURL

# Basic
curl -X POST "https://naimulislam864-chatterbox-tts.hf.space/v1/audio/speech" \
  -H "Authorization: Bearer YOUR_TTS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"chatterbox-turbo","input":"Hello!","voice":"gordon","response_format":"wav"}' \
  --output output.wav

# Health check (no auth)
curl "https://naimulislam864-chatterbox-tts.hf.space/health"

⬡ Advanced Usage

Waking a sleeping Space

HuggingFace free Spaces sleep after 48h of inactivity. Poll /health until it responds.

Python

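import time
import requests

def wait_for_space(base_url, timeout=120):
    """Poll /health until the Space reports status "ok" or the timeout expires."""
    # Remove a trailing "/v1" suffix (Python 3.9+). str.rstrip("/v1") strips
    # individual characters, not the suffix, so it can eat part of the host name.
    root = base_url.rstrip("/").removesuffix("/v1")
    start = time.time()
    while time.time() - start < timeout:
        try:
            r = requests.get(f"{root}/health", timeout=10)
            if r.ok and r.json().get("status") == "ok":
                print("Space is online.")
                return True
        except (requests.RequestException, ValueError):
            pass  # still booting, or the reply was not JSON; keep polling
        print("Waking up... retrying in 10s")
        time.sleep(10)
    raise TimeoutError("Space did not wake up in time.")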

Batch generation

Python

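import requests

# Endpoint and auth headers, matching the earlier examples.
URL = "https://naimulislam864-chatterbox-tts.hf.space/v1/audio/speech"
HEADERS = {"Authorization": "Bearer YOUR_TTS_API_KEY", "Content-Type": "application/json"}

# (label, voice, exaggeration, cfg_weight)
lines = [
    ("Corporate",   "aaron",    0.25, 0.65),
    ("Audiobook",   "dylan",    0.65, 0.50),
    ("Dramatic",    "emmanuel", 0.90, 0.40),
    ("Meditation",  "laura",    0.30, 0.55),
]

TEXT = "The results exceeded all expectations. This changes everything."

for name, voice, exag, cfg in lines:
    r = requests.post(URL, headers=HEADERS, json={
        "model": "chatterbox-turbo", "input": TEXT,
        "voice": voice, "exaggeration": exag, "cfg_weight": cfg
    }, timeout=120)
    r.raise_for_status()  # stop on the first failed request
    with open(f"{name.lower()}.wav", "wb") as f:
        f.write(r.content)
    print(f"✓ {name}")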

△ Error Codes

Code	Meaning	Fix
`401`	Unauthorized	Check `TTS_API_KEY` is correct and sent as Bearer token
`400`	Bad request	`input` is empty, >4096 chars, or invalid base64
`422`	Validation error	Parameter out of range (e.g. `exaggeration: 5.0`)
`500`	Server error	Inference failed. Check `/health` and retry
`503`	Service unavailable	Space is sleeping. Wait 30–60s and retry
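
Of these codes, only 500 and 503 are worth retrying; 400, 401, and 422 will fail the same way until the request itself changes. A minimal retry sketch under that reading of the table follows. The helper name call_with_retry, the attempt count, and the fixed 30-second delay are assumptions.

```python
import time

RETRYABLE = {500, 503}


def call_with_retry(send, attempts=4, delay=30.0, sleep=time.sleep):
    """Call send() until it returns a non-retryable status.

    `send` is any zero-argument callable returning an object with a
    `status_code` attribute (for example a requests.Response).
    Returns the last response, whether or not it succeeded.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        response = send()
        if response.status_code not in RETRYABLE:
            return response
        if attempt < attempts:
            sleep(delay)  # give a sleeping Space time to wake up
    return response
```

Wrapping the request in a lambda keeps the helper independent of any particular HTTP library.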

◫ Limits & Performance

Metric	Value
Max input characters	4,096
Max concurrent requests	1 (queued)
Output sample rate	24,000 Hz
Typical generation time	10–60s (CPU, depends on length)
Cold start (first boot)	5–10 minutes (3GB model download)
Warm start	5–15 seconds
Space sleep after	48h of inactivity (free tier)

Free tier note: HuggingFace free Spaces use 2 vCPUs and ~16GB RAM. Generation is CPU-only and slower than GPU. For real-time use, upgrade to a GPU Space or self-host.
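
Text longer than 4,096 characters has to be split across several requests. The sketch below breaks text at sentence boundaries where it can and hard-wraps any single sentence that exceeds the limit. The splitting rule and the helper name split_text are assumptions, and joining the resulting audio files is left to the caller.

```python
import re


def split_text(text, max_chars=4096):
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-wrap any single sentence longer than the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # Adding this sentence would overflow, so start a new chunk.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Since the Space handles one request at a time, sending the chunks sequentially is the natural fit.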
