API Documentation

ChatterboxTurbo

OpenAI-compatible Text-to-Speech API with voice cloning, emotion control, and paralinguistic tags. Self-hosted on HuggingFace Spaces.

MIT License
20 Pre-made Voices
Zero-shot Voice Cloning
9 Paralinguistic Tags
OpenAI-compatible
350M Parameters

Overview

This API wraps the Chatterbox-Turbo model behind an OpenAI-compatible /v1/audio/speech interface. It works as a drop-in replacement for openai.audio.speech.create() and adds voice cloning, emotion control, and paralinguistic tags.

Feature | Details
--- | ---
Model | ResembleAI Chatterbox-Turbo (350M params)
Architecture | Streaming encoder-decoder transformer
License | MIT, free commercial use
Voice cloning | Zero-shot from 5+ seconds of audio
Emotion control | Continuous exaggeration slider (0.0 to 1.0)
Paralinguistic tags | [laugh], [sigh], [cough] and 6 more
Pre-made voices | 20 (11 male, 9 female)
Output formats | WAV, MP3, FLAC, Opus, PCM
Sample rate | 24,000 Hz
Max input | 4,096 characters

Quick Start

The fastest way to get audio from text. Uses the OpenAI Python SDK as a drop-in.

Python
from openai import OpenAI

client = OpenAI(
    base_url="https://naimulislam864-chatterbox-tts.hf.space/v1",
    api_key="YOUR_TTS_API_KEY"
)

response = client.audio.speech.create(
    model="chatterbox-turbo",
    voice="andy",
    input="Hello! This is Chatterbox Turbo speaking."
)

response.stream_to_file("output.wav")

Tip: The base_url is your HuggingFace Space URL followed by /v1. The api_key is the TTS_API_KEY secret you set in the Space settings.

Authentication

All endpoints except /health require a Bearer token in the Authorization header.

HTTP Header
Authorization: Bearer YOUR_TTS_API_KEY

The API key is the value you set as TTS_API_KEY in HuggingFace Space → Settings → Variables and secrets.
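
As a minimal sketch of building the header above in client code: the helper name auth_headers and the idea of reading the key from a client-side TTS_API_KEY environment variable are illustrative assumptions, not part of the API.

```python
import os


def auth_headers(api_key=None):
    """Build request headers carrying the Bearer token.

    Falls back to a TTS_API_KEY environment variable on the client
    (an assumed convention) when no key is passed explicitly.
    """
    key = api_key or os.environ.get("TTS_API_KEY")
    if not key:
        raise ValueError("No API key provided and TTS_API_KEY is not set")
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }
```

The returned dict can be passed as the headers argument of requests.get or requests.post for any endpoint except /health.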

Endpoints

POST /v1/audio/speech Synthesize speech from text

Returns raw binary audio. Content-Type depends on the response_format parameter.

Response Headers

Header | Example | Description
--- | --- | ---
Content-Type | audio/wav | MIME type of audio
X-Sample-Rate | 24000 | Sample rate in Hz
X-Audio-Format | wav | Format name
Content-Disposition | attachment; filename="speech.wav" | Suggested filename
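
The headers above can be read back on the client to decide how to store the audio. The sketch below assumes the header values look like the examples in the table; describe_audio is a hypothetical helper name, and the filename parsing is deliberately simple.

```python
def describe_audio(headers):
    """Summarize the audio metadata headers of a speech response.

    `headers` is any mapping of header names to values, for example
    `response.headers` from the requests library.
    """
    # Normalize keys so the lookup also works with a plain dict.
    lowered = {k.lower(): v for k, v in headers.items()}

    # Pull the suggested filename out of Content-Disposition, if present.
    disposition = lowered.get("content-disposition", "")
    filename = None
    if "filename=" in disposition:
        filename = disposition.split("filename=", 1)[1].strip().strip('"')

    rate = lowered.get("x-sample-rate")
    return {
        "mime_type": lowered.get("content-type"),
        "sample_rate": int(rate) if rate is not None else None,
        "format": lowered.get("x-audio-format"),
        "filename": filename,
    }
```
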
GET /v1/models List available models
Response
{
  "object": "list",
  "data": [{
    "id": "chatterbox-turbo",
    "object": "model",
    "owned_by": "resemble-ai",
    "capabilities": {
      "tts": true, "voice_cloning": true,
      "sample_rate": 24000, "max_chars": 4096
    }
  }]
}
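
A client can read limits from this response instead of hard-coding them. The following is a sketch under the assumption that the response has exactly the shape shown above; model_capabilities is a hypothetical helper.

```python
def model_capabilities(models_response, model_id="chatterbox-turbo"):
    """Return the capabilities dict for one model, or None if it is absent."""
    for entry in models_response.get("data", []):
        if entry.get("id") == model_id:
            return entry.get("capabilities", {})
    return None


# The response body shown above, written as a Python dict.
sample = {
    "object": "list",
    "data": [{
        "id": "chatterbox-turbo",
        "object": "model",
        "owned_by": "resemble-ai",
        "capabilities": {
            "tts": True, "voice_cloning": True,
            "sample_rate": 24000, "max_chars": 4096,
        },
    }],
}

caps = model_capabilities(sample)
print(caps["max_chars"])  # 4096
```
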
GET /health Health check — no auth required
Response
{ "status": "ok", "model": "chatterbox-turbo", "device": "cpu", "sample_rate": 24000 }

Request Parameters

Standard (OpenAI-compatible)

Parameter | Type | Required | Default | Description
--- | --- | --- | --- | ---
model | string | Yes | (none) | Must be "chatterbox-turbo"
input | string | Yes | (none) | Text to synthesize. Max 4,096 chars
voice | string | No | "default" | Voice name (see the Voice Names section)
response_format | string | No | "wav" | wav / mp3 / flac / opus / pcm
speed | float | No | 1.0 | Range: 0.25–4.0. Accepted for compatibility

Extended (Chatterbox-specific)

Parameter | Type | Default | Description
--- | --- | --- | ---
exaggeration | float | 0.5 | Emotion intensity. Range: 0.0–1.0
cfg_weight | float | 0.5 | Guidance weight. Range: 0.0–1.0
voice_sample_b64 | string | null | Base64 WAV for zero-shot voice cloning
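
The ranges in the two tables can be checked on the client before a request is sent, which avoids waiting on a slow round trip for a 400 or 422. This is a sketch only: build_speech_payload is a hypothetical helper, and the server remains the authority on what it accepts.

```python
FORMATS = {"wav", "mp3", "flac", "opus", "pcm"}


def build_speech_payload(text, voice="default", response_format="wav",
                         speed=1.0, exaggeration=0.5, cfg_weight=0.5,
                         voice_sample_b64=None):
    """Validate parameters locally and return a JSON-ready request body."""
    if not text:
        raise ValueError("input must not be empty")
    if len(text) > 4096:
        raise ValueError("input exceeds 4,096 characters")
    if response_format not in FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be within 0.25-4.0")
    for name, value in (("exaggeration", exaggeration),
                        ("cfg_weight", cfg_weight)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be within 0.0-1.0")

    payload = {
        "model": "chatterbox-turbo",
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
        "exaggeration": exaggeration,
        "cfg_weight": cfg_weight,
    }
    # Only send the cloning sample when one was provided.
    if voice_sample_b64 is not None:
        payload["voice_sample_b64"] = voice_sample_b64
    return payload
```

The result can be passed directly as the json argument of requests.post.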

Output Formats

Format | response_format | MIME Type | Best For
--- | --- | --- | ---
WAV | wav | audio/wav | Lossless. Best quality. Default
MP3 | mp3 | audio/mpeg | Compressed. Smaller file size
FLAC | flac | audio/flac | Lossless compression
Opus | opus | audio/ogg | Best compression for streaming
PCM | pcm | audio/pcm | Raw 16-bit signed integer samples
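
PCM output has no header, so most players cannot open it directly. A minimal sketch of wrapping it in a WAV container with the standard library follows. The 16-bit sample width and 24,000 Hz rate come from this page; mono audio and little-endian byte order are assumptions that should be verified against real output.

```python
import io
import wave


def pcm_to_wav(pcm_bytes, sample_rate=24000, channels=1):
    """Wrap raw 16-bit PCM samples in a WAV container and return the bytes.

    Assumes little-endian, mono samples (not stated in the table above).
    """
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)            # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm_bytes)
    return buffer.getvalue()
```
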

Voice Names

Pass the voice name (lowercase) in the voice field. Overridden by voice_sample_b64 if set.

♂ Male Voices

Voice | Gender | Style | Best for
--- | --- | --- | ---
aaron | male | Professional · Confident | Corporate content, presentations
andy | male | Friendly · Versatile | Default voice. General-purpose TTS
archer | male | Authoritative · Deep | Documentaries, serious narration
brian | male | Casual · Laid-back | Podcasts, informal content
dylan | male | Smooth · Narrative | Audiobooks, long-form content
emmanuel | male | Deep · Resonant | Trailers, promos, drama
ethan | male | Conversational · Natural | Tutorials, explainers
gavin | male | Energetic · Dynamic | Gaming, sports, high-energy ads
gordon | male | Mature · Seasoned | Documentaries, history, education
ivan | male | Commanding · Powerful | Announcements, authority
walter | male | Classic · Announcer | Commercials, intros, voiceovers

♀ Female Voices

Voice | Gender | Style | Best for
--- | --- | --- | ---
abigail | female | Warm · Approachable | Narration, e-learning
anaya | female | Expressive · Dynamic | Storytelling, creative content
chloe | female | Bright · Energetic | Social media, upbeat ads
evelyn | female | Elegant · Sophisticated | Luxury brands, premium content
laura | female | Soothing · Calm | Meditation, wellness, ASMR
lucy | female | Lively · Charismatic | Entertainment, ads
madison | female | Clear · Articulate | Education, instructions
marisol | female | Vibrant · Passionate | Creative projects, storytelling
meera | female | Thoughtful · Measured | Tech, science, analysis
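
Because voice names must be lowercase, a small client-side check can catch typos before a request is sent. The sketch below copies the 20 names from the lists above; normalize_voice is a hypothetical helper, and it does not treat the "default" value from the parameter table as a named voice.

```python
MALE_VOICES = {"aaron", "andy", "archer", "brian", "dylan", "emmanuel",
               "ethan", "gavin", "gordon", "ivan", "walter"}
FEMALE_VOICES = {"abigail", "anaya", "chloe", "evelyn", "laura", "lucy",
                 "madison", "marisol", "meera"}
ALL_VOICES = MALE_VOICES | FEMALE_VOICES


def normalize_voice(name):
    """Lowercase a voice name and confirm it is one of the listed voices."""
    voice = name.strip().lower()
    if voice not in ALL_VOICES:
        raise ValueError(
            f"Unknown voice {name!r}; expected one of {sorted(ALL_VOICES)}"
        )
    return voice
```
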

Paralinguistic Tags

Insert tags directly into text to trigger natural vocal sounds. Generated in the actual voice — no audio splicing.

Tag | Description
--- | ---
[laugh] | A natural, spontaneous laugh
[chuckle] | A soft, quiet laugh
[sigh] | Exhaled breath expressing emotion
[cough] | A realistic cough sound
[gasp] | Sharp intake of breath (surprise)
[groan] | Low vocal expression of discomfort
[sniff] | A nasal inhalation (sadness)
[clear throat] | Throat-clearing before speaking
[shush] | A hushing or shushing sound

Rules: Tags can appear anywhere in the text. A request may contain multiple tags. Tags work with both pre-made and cloned voices. Tags are case-sensitive, so always write them in lowercase.
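
Since tags are case-sensitive, it can help to scan text for bracketed tags before sending it. The sketch below is a client-side check only; check_tags is a hypothetical helper, the set of supported tags is passed in by the caller, and how the server treats an unrecognized tag is not documented here.

```python
import re

# Matches any bracketed token such as [laugh] or [clear throat].
TAG_PATTERN = re.compile(r"\[[^\[\]]+\]")


def check_tags(text, supported):
    """Sort the bracketed tags in `text` into valid, wrong-case, and unknown."""
    result = {"valid": [], "wrong_case": [], "unknown": []}
    for tag in TAG_PATTERN.findall(text):
        if tag in supported:
            result["valid"].append(tag)
        elif tag.lower() in supported:
            result["wrong_case"].append(tag)
        else:
            result["unknown"].append(tag)
    return result


# A subset of the documented tags, for illustration only.
SOME_TAGS = {"[laugh]", "[chuckle]", "[sigh]", "[gasp]", "[groan]"}

report = check_tags("Well [Sigh] that was close. [gasp] [yawn]", SOME_TAGS)
print(report)
# {'valid': ['[gasp]'], 'wrong_case': ['[Sigh]'], 'unknown': ['[yawn]']}
```
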

Examples

Text Examples
# Surprise
"And then she opened the box... [gasp] I could not believe what was inside."

# Emotional narration
"[sigh] It had been a long journey. But standing at the top, every step was worth it."

# Professional opening
"[clear throat] Good morning, everyone. Today I'd like to share some findings."

# Comedy reaction
"He showed up three hours late? [groan] And then asked why everyone looked annoyed? [chuckle]"

Emotion & CFG Control

exaggeration — Emotion Intensity

Controls how dramatically the voice delivers the text. Same voice, same text — completely different feel.

Range | Delivery | Best For
--- | --- | ---
0.0 – 0.2 | Flat, monotone | IVR, notifications, clinical readouts
0.3 – 0.5 | Neutral, measured | News, technical documentation
0.5 – 0.7 | Natural, conversational | General narration, tutorials (default)
0.7 – 0.9 | Expressive, engaging | Podcasts, audiobooks, marketing
0.9 – 1.0 | Theatrical, dramatic | Audio dramas, trailers, characters

cfg_weight — Guidance Weight

Controls how strictly the model follows the voice reference or style. Lower values give looser pacing; higher values keep the output closer to the reference.

Recommended Presets

Use Case | exaggeration | cfg_weight
--- | --- | ---
Audiobook narration | 0.65 | 0.50
Corporate / professional | 0.25 | 0.60
Character voice / drama | 0.90 | 0.45
Meditation / wellness | 0.30 | 0.55
Podcast / conversational | 0.60 | 0.50
Game character | 0.85 | 0.40
IVR / phone system | 0.15 | 0.65
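
The presets above can be kept in a lookup table and merged into a request body. This is a sketch: the short preset keys and the with_preset helper are illustrative names, while the numbers are taken from the table.

```python
PRESETS = {
    "audiobook":  {"exaggeration": 0.65, "cfg_weight": 0.50},
    "corporate":  {"exaggeration": 0.25, "cfg_weight": 0.60},
    "character":  {"exaggeration": 0.90, "cfg_weight": 0.45},
    "meditation": {"exaggeration": 0.30, "cfg_weight": 0.55},
    "podcast":    {"exaggeration": 0.60, "cfg_weight": 0.50},
    "game":       {"exaggeration": 0.85, "cfg_weight": 0.40},
    "ivr":        {"exaggeration": 0.15, "cfg_weight": 0.65},
}


def with_preset(payload, preset):
    """Return a copy of `payload` with the preset's two parameters applied."""
    if preset not in PRESETS:
        raise KeyError(f"Unknown preset {preset!r}; choose from {sorted(PRESETS)}")
    return {**payload, **PRESETS[preset]}
```

With the OpenAI SDK, the same dict can be passed as extra_body, for example extra_body=PRESETS["audiobook"].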

Voice Cloning

Zero-shot voice cloning from a reference audio clip. No training or fine-tuning required.

Reference clip requirements: WAV format · 5–30 seconds (10s ideal) · Clear speech, minimal background noise · Any spoken content works
Python
import base64, requests

# Read the reference clip and encode it as base64 text for the JSON body.
with open("my_voice.wav", "rb") as f:
    voice_b64 = base64.b64encode(f.read()).decode("utf-8")

response = requests.post(
    "https://naimulislam864-chatterbox-tts.hf.space/v1/audio/speech",
    headers={"Authorization": "Bearer YOUR_TTS_API_KEY", "Content-Type": "application/json"},
    json={
        "model":            "chatterbox-turbo",
        "input":            "This is my cloned voice.",
        "response_format":  "wav",
        "voice_sample_b64": voice_b64,
        "exaggeration":     0.6,
    },
    timeout=120,  # CPU generation can take up to a minute
)
response.raise_for_status()  # do not save an error body as audio

with open("cloned.wav", "wb") as f:
    f.write(response.content)
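
The 5–30 second requirement can be checked locally before a clip is uploaded. The sketch below uses the standard-library wave module, which reads uncompressed PCM WAV files only; check_reference_clip is a hypothetical helper, and it does not assess noise or speech clarity.

```python
import wave


def clip_duration_seconds(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def check_reference_clip(path, min_seconds=5.0, max_seconds=30.0):
    """Raise ValueError if the clip falls outside the 5-30 second window."""
    duration = clip_duration_seconds(path)
    if not min_seconds <= duration <= max_seconds:
        raise ValueError(
            f"Reference clip is {duration:.1f}s; expected "
            f"{min_seconds:.0f}-{max_seconds:.0f}s"
        )
    return duration
```
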

Code Examples

Python · openai SDK
from openai import OpenAI

client = OpenAI(
    base_url="https://naimulislam864-chatterbox-tts.hf.space/v1",
    api_key="YOUR_TTS_API_KEY"
)

# Basic
response = client.audio.speech.create(
    model="chatterbox-turbo", voice="dylan",
    input="Welcome to Chapter One.", response_format="wav"
)
response.stream_to_file("chapter1.wav")

# With emotion + tags
response = client.audio.speech.create(
    model="chatterbox-turbo", voice="anaya",
    input="[gasp] She couldn't believe her eyes. [sigh] After all this time...",
    response_format="mp3",
    extra_body={"exaggeration": 0.8, "cfg_weight": 0.45}
)
response.stream_to_file("scene.mp3")
Python · requests
import requests

response = requests.post(
    "https://naimulislam864-chatterbox-tts.hf.space/v1/audio/speech",
    headers={
        "Authorization": "Bearer YOUR_TTS_API_KEY",
        "Content-Type": "application/json"
    },
    json={
        "model":           "chatterbox-turbo",
        "input":           "Hello! This is Chatterbox Turbo.",
        "voice":           "evelyn",
        "response_format": "wav",
        "exaggeration":    0.6,
        "cfg_weight":      0.5
    },
    timeout=120
)

response.raise_for_status()  # surface 4xx/5xx errors instead of saving them as audio

with open("output.wav", "wb") as f:
    f.write(response.content)
print(f"Saved {len(response.content):,} bytes")
JavaScript / Node.js
const response = await fetch(
  "https://naimulislam864-chatterbox-tts.hf.space/v1/audio/speech", {
    method: "POST",
    headers: {
      "Authorization": "Bearer YOUR_TTS_API_KEY",
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "chatterbox-turbo", input: "Hello from JavaScript!",
      voice: "andy", response_format: "wav",
      exaggeration: 0.6, cfg_weight: 0.5,
    }),
  }
);

// Read the body once; a Response body can only be consumed a single time.
const audio = await response.arrayBuffer();

// Browser: play directly
const blob = new Blob([audio], { type: "audio/wav" });
new Audio(URL.createObjectURL(blob)).play();

// Node.js: save to file instead (use one of the two options, not both)
import fs from "fs";
fs.writeFileSync("output.wav", Buffer.from(audio));
cURL
# Basic
curl -X POST "https://naimulislam864-chatterbox-tts.hf.space/v1/audio/speech" \
  -H "Authorization: Bearer YOUR_TTS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"chatterbox-turbo","input":"Hello!","voice":"gordon","response_format":"wav"}' \
  --output output.wav

# Health check (no auth)
curl "https://naimulislam864-chatterbox-tts.hf.space/health"

Advanced Usage

Waking a sleeping Space

HuggingFace free Spaces sleep after 48h of inactivity. Poll /health until it responds.

Python
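import time, requests

def wait_for_space(base_url, timeout=120):
    """Poll /health until the Space reports status "ok" or the timeout expires."""
    # removesuffix (Python 3.9+) drops only a literal trailing "/v1";
    # rstrip("/v1") would strip any run of the characters "/", "v" and "1".
    root = base_url.rstrip("/").removesuffix("/v1")
    start = time.time()
    while time.time() - start < timeout:
        try:
            r = requests.get(f"{root}/health", timeout=10)
            if r.ok and r.json().get("status") == "ok":
                print("Space is online.")
                return True
        except (requests.RequestException, ValueError):
            pass  # Space still booting, or it returned a non-JSON page
        print("Waking up... retrying in 10s")
        time.sleep(10)
    raise TimeoutError("Space did not wake up in time.")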

Batch generation

Python
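import requests

URL = "https://naimulislam864-chatterbox-tts.hf.space/v1/audio/speech"
HEADERS = {"Authorization": "Bearer YOUR_TTS_API_KEY"}

# (label, voice, exaggeration, cfg_weight)
lines = [
    ("Corporate",   "aaron",    0.25, 0.65),
    ("Audiobook",   "dylan",    0.65, 0.50),
    ("Dramatic",    "emmanuel", 0.90, 0.40),
    ("Meditation",  "laura",    0.30, 0.55),
]

TEXT = "The results exceeded all expectations. This changes everything."

# Requests run one after another; the Space handles one request at a time.
for name, voice, exag, cfg in lines:
    r = requests.post(URL, headers=HEADERS, json={
        "model": "chatterbox-turbo", "input": TEXT,
        "voice": voice, "exaggeration": exag, "cfg_weight": cfg
    }, timeout=120)
    r.raise_for_status()
    with open(f"{name.lower()}.wav", "wb") as f:
        f.write(r.content)
    print(f"✓ {name}")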

Error Codes

Code | Meaning | Cause / Fix
--- | --- | ---
401 | Unauthorized | Check TTS_API_KEY is correct and sent as Bearer token
400 | Bad request | input is empty, >4096 chars, or invalid base64
422 | Validation error | Parameter out of range (e.g. exaggeration: 5.0)
500 | Server error | Inference failed. Check /health and retry
503 | Service unavailable | Space is sleeping. Wait 30–60s and retry
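
The table suggests retrying on 500 and 503 but not on 400, 401, or 422, which indicate a problem with the request itself. One way to sketch that policy is a small retry wrapper with exponential backoff. The helper name call_with_retry, the attempt count, and the 15-second base delay are illustrative choices, not values defined by the API.

```python
import time

RETRYABLE = {500, 503}  # transient failures according to the table above


def call_with_retry(send, attempts=5, base_delay=15.0, sleep=time.sleep):
    """Call `send()` until it returns a non-retryable status or attempts run out.

    `send` is any zero-argument callable returning an object with a
    `status_code` attribute, such as a lambda wrapping requests.post.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        response = send()
        if response.status_code not in RETRYABLE:
            return response
        if attempt < attempts - 1:
            # Exponential backoff: 15s, 30s, 60s, ... with the default base.
            sleep(base_delay * 2 ** attempt)
    return response  # last response, still a retryable error
```

The sleep parameter exists so the delay can be replaced in tests; in normal use it can be left at its default.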

Limits & Performance

Metric | Value
--- | ---
Max input characters | 4,096
Max concurrent requests | 1 (queued)
Output sample rate | 24,000 Hz
Typical generation time | 10–60s (CPU, depends on length)
Cold start (first boot) | 5–10 minutes (3 GB model download)
Warm start | 5–15 seconds
Space sleep after | 48h of inactivity (free tier)

Free tier note: HuggingFace free Spaces use 2 vCPUs and ~16GB RAM. Generation is CPU-only and slower than GPU. For real-time use, upgrade to a GPU Space or self-host.
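
Text longer than the 4,096-character limit has to be split across several requests. The following is a minimal sketch of one way to do that, breaking at sentence ends where possible; split_text is a hypothetical helper, and joining the resulting audio files is left to the caller.

```python
import re


def split_text(text, max_chars=4096):
    """Split text into chunks of at most `max_chars`, preferring sentence ends."""
    # Break after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split any single sentence that is longer than the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Because the Space processes one request at a time, sending the chunks sequentially is likely the simplest approach.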