higgs-audio-v3-tts-4b

Audio · by Boson AI · Model page

higgs-audio-v3-tts-4b is a 4.65B-parameter multilingual text-to-speech model by Boson AI supporting expressive and controllable speech synthesis across 100+ languages.

Model Card

Higgs Audio v3 TTS is built for voice chat: it speaks, not just reads. It turns model responses into expressive conversational speech across 100+ languages, with zero-shot voice cloning and inline control over emotion, style, prosody, pauses, and sound effects.

[!TIP] Released for research and non-commercial use under the Boson Higgs Audio v3 Research and Non-Commercial License. Production, hosted APIs, or revenue-generating use requires a separate commercial license. Prohibited: voice cloning without consent, impersonation, fraud, election deception, biometric surveillance, or any unlawful use.

Higgs Audio v3 TTS Architecture

The Higgs autoregressive decoder consumes interleaved text and audio tokens. Audio is encoded by the Higgs Tokenizer into 8 codebooks at 25 fps, staggered via a delay pattern, then mapped to backbone hidden states through a multi-codebook fused embedding. Output codes pass through a multi-codebook fused head, are de-delayed, and are decoded back to a waveform.
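The delay pattern can be illustrated with a small sketch. It assumes the common scheme in which codebook k is shifted right by k frames and the gaps are filled with a padding id. The card does not state the exact offsets or padding codes used by Higgs Audio v3, so `PAD`, `apply_delay`, and `remove_delay` are hypothetical names chosen for illustration.

```python
PAD = -1  # hypothetical padding id; the real model uses its own special codes


def apply_delay(codes):
    """Stagger codebooks: row k is shifted right by k frames (assumed offsets).

    codes: list of K rows, each a list of T code ids (K codebooks, T frames).
    Returns K rows of length T + K - 1.
    """
    num_codebooks = len(codes)
    num_frames = len(codes[0])
    out_len = num_frames + num_codebooks - 1
    delayed = []
    for k, row in enumerate(codes):
        right_pad = out_len - k - num_frames
        delayed.append([PAD] * k + list(row) + [PAD] * right_pad)
    return delayed


def remove_delay(delayed):
    """Invert apply_delay so every codebook is aligned to the same frame again."""
    num_codebooks = len(delayed)
    num_frames = len(delayed[0]) - num_codebooks + 1
    return [row[k:k + num_frames] for k, row in enumerate(delayed)]


# Toy example: 3 codebooks x 4 frames (the model itself uses 8 codebooks).
codes = [[10 * k + t for t in range(4)] for k in range(3)]
delayed = apply_delay(codes)
for row in delayed:
    print(row)
assert remove_delay(delayed) == codes
```

Under this assumption, a clip of T frames occupies T + K - 1 decoding steps, so the stagger costs only K - 1 extra steps per clip.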

| Component | Spec |
| --- | --- |
| Backbone | ~4B autoregressive decoder (36 L, hidden=2560, GQA 32/8) |
| Multi-codebook embedding / head | Fused single-tensor, tied with text embedding |
| Context length | 8,192 tokens (training sequence length) |
| Audio tokens | 8 codebooks × 1026 vocab, delay pattern |
| Sample rate | 24 kHz |
| Frame rate | 25 fps (40 ms / frame) |
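The figures above imply a few useful budgets, worked out in the sketch below. It assumes that one backbone position corresponds to one 40 ms audio frame, which the fused multi-codebook embedding suggests but the card does not state outright. It also ignores text tokens and the few extra steps added by the delay pattern, so the durations are upper bounds.

```python
SAMPLE_RATE_HZ = 24_000
FRAME_RATE_FPS = 25
NUM_CODEBOOKS = 8
CONTEXT_TOKENS = 8_192

# Waveform samples covered by a single audio frame.
samples_per_frame = SAMPLE_RATE_HZ // FRAME_RATE_FPS

# Duration of one frame in milliseconds.
frame_ms = 1000 / FRAME_RATE_FPS

# Discrete codes emitted per second of audio across all codebooks.
codes_per_second = FRAME_RATE_FPS * NUM_CODEBOOKS


def frames_to_seconds(num_frames):
    """Convert a frame count to audio duration in seconds."""
    return num_frames / FRAME_RATE_FPS


# Assumption: one backbone position per frame.
max_context_audio_s = frames_to_seconds(CONTEXT_TOKENS)

# The voice-cloning example in this card sets max_new_tokens to 1024.
max_new_audio_s = frames_to_seconds(1024)

print(samples_per_frame, frame_ms, codes_per_second)
print(max_context_audio_s, max_new_audio_s)
```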

Supported Languages

The model reaches single-digit WER/CER on 102 languages, which split into two tiers.

WER/CER under 5%: polished, production-quality (85 languages)

🇿🇦 Afrikaans · 🇸🇦🇪🇬 Arabic · 🇦🇲 Armenian · 🇮🇳 Assamese · 🇪🇸 Asturian · 🇦🇿 Azerbaijani · 🇷🇺 Bashkir · 🇪🇸 Basque · 🇧🇾 Belarusian · 🇧🇩🇮🇳 Bengali · 🇧🇦 Bosnian · 🇧🇬 Bulgarian · 🇪🇸 Catalan · 🇵🇭 Cebuano · 🇮🇶 Central Kurdish · 🇨🇳 Chinese · 🇭🇷 Croatian · 🇨🇿 Czech · 🇩🇰 Danish · 🇳🇱🇧🇪 Dutch · 🇷🇺 Eastern Mari · 🇺🇸🇬🇧🇦🇺 English · 🌐 Esperanto · 🇪🇪 Estonian · 🇫🇮 Finnish · 🇫🇷🇨🇦 French · 🇪🇸 Galician · 🇬🇪 Georgian · 🇩🇪🇦🇹 German · 🇬🇷 Greek · 🇮🇳 Gujarati · 🇭🇹 Haitian Creole · 🇳🇬 Hausa · 🇮🇱 Hebrew · 🇮🇳 Hindi · 🇭🇺 Hungarian · 🇮🇩 Indonesian · 🇮🇹 Italian · 🇯🇵 Japanese · 🇮🇩 Javanese · 🇮🇳 Kannada · 🇰🇿 Kazakh · 🇰🇷 Korean · 🇷🇼 Kinyarwanda · 🇰🇬 Kyrgyz · 🇱🇻 Latvian · 🇨🇩 Lingala · 🇱🇹 Lithuanian · 🇰🇪 Luo · 🇲🇰 Macedonian · 🇲🇾🇮🇩 Malay · 🇮🇳 Malayalam · 🇲🇹 Maltese · 🇳🇿 Māori · 🇮🇳 Marathi · 🇲🇳 Mongolian · 🇳🇵 Nepali · 🇳🇴 Norwegian · 🇫🇷 Occitan · 🇮🇷🇦🇫 Persian · 🇵🇱 Polish · 🇵🇹🇧🇷 Portuguese · 🇷🇴 Romanian · 🇷🇺 Russian · 🇿🇦 Sepedi · 🇷🇸 Serbian · 🇿🇼 Shona · 🇸🇰 Slovak · 🇸🇮 Slovene · 🇪🇸🇲🇽 Spanish · 🇹🇿🇰🇪 Swahili · 🇸🇪 Swedish · 🇵🇭 Tagalog · 🇹🇯 Tajik · 🇮🇳🇱🇰 Tamil · 🇮🇳 Telugu · 🇹🇭 Thai · 🇹🇷 Turkish · 🇺🇦 Ukrainian · 🇵🇰🇮🇳 Urdu · 🇨🇳 Uyghur · 🇺🇿 Uzbek · 🇻🇳 Vietnamese · 🇿🇦 Xhosa · 🇿🇦 Zulu

WER/CER between 5% and 10%: usable, but less polished (17 languages)

🇦🇱 Albanian · 🇲🇼🇿🇲 Chichewa/Nyanja · 🇮🇳🇵🇰 Eastern Punjabi · 🇺🇬 Ganda · 🇮🇸 Icelandic · 🇮🇪 Irish · 🇩🇿 Kabyle · 🇨🇻 Kabuverdianu · 🇰🇪 Kamba · 🇻🇦 Latin · 🇱🇺 Luxembourgish · 🇪🇹🇰🇪 Oromo · 🇦🇫🇵🇰 Pashto · 🇵🇰🇮🇳 Sindhi · 🇸🇴 Somali · 🇦🇴 Umbundu · 🇬🇧 Welsh

Control Tokens

All tags follow <|category:value|> syntax and can be inserted mid-utterance.

For how to place these tags when writing the target text (sentence-level vs. inline, sfx formatting, stacking, worked examples), see PROMPTING.md.

  • Emotion: elation, amusement, enthusiasm, determination, pride, contentment, affection, relief, contemplation, confusion, surprise, awe, longing, arousal, anger, fear, disgust, bitterness, sadness, shame, helplessness

  • Style: singing, shouting, whispering

  • Sound effects: cough, laughter, crying, screaming, burping, humming, sigh, sniff, sneeze

  • Prosody

    • Speed: speed_very_slow, speed_slow, speed_fast, speed_very_fast
    • Pauses: pause, long_pause
    • Pitch: pitch_low, pitch_high
    • Delivery: expressive_high, expressive_low
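A small helper can catch misspelled tags before they reach the model. The sketch below is illustrative only: `CONTROL_VOCAB` and `tag` are hypothetical names, the category names (`emotion`, `style`, `sfx`, `prosody`) are taken from the usage section of this card, and placing the speed and pitch values under `prosody` is an assumption based on how the card groups them.

```python
# Vocabulary copied from the lists above; the grouping of speed and pitch
# values under "prosody" is an assumption.
CONTROL_VOCAB = {
    "emotion": {
        "elation", "amusement", "enthusiasm", "determination", "pride",
        "contentment", "affection", "relief", "contemplation", "confusion",
        "surprise", "awe", "longing", "arousal", "anger", "fear", "disgust",
        "bitterness", "sadness", "shame", "helplessness",
    },
    "style": {"singing", "shouting", "whispering"},
    "sfx": {
        "cough", "laughter", "crying", "screaming", "burping", "humming",
        "sigh", "sniff", "sneeze",
    },
    "prosody": {
        "speed_very_slow", "speed_slow", "speed_fast", "speed_very_fast",
        "pause", "long_pause", "pitch_low", "pitch_high",
        "expressive_high", "expressive_low",
    },
}


def tag(category, value):
    """Return '<|category:value|>' after checking it against the known vocabulary."""
    if category not in CONTROL_VOCAB:
        raise ValueError(f"unknown category: {category!r}")
    if value not in CONTROL_VOCAB[category]:
        raise ValueError(f"unknown {category} value: {value!r}")
    return f"<|{category}:{value}|>"


print(tag("emotion", "amusement") + "That is great. " + tag("prosody", "pause") + "Really.")
```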

Evaluation Benchmarks

Multilingual Voice Clone

We evaluate Higgs Audio v3 TTS on public multilingual TTS suites and our internal 111-language Higgs-Multilingual set, covering both common and lower-resource languages.

WER / CER (↓, ×100) macro-averaged across each benchmark's language set. Lower is better; bold marks the best per row. All numbers are reproducible end-to-end with original metrics and normalization.
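For readers unfamiliar with the metric, the sketch below shows one way a macro-averaged error rate can be computed: an edit-distance error rate per language, then an unweighted mean across languages. It is an assumption-laden illustration, not the benchmark code. The text normalization that each benchmark applies is not modeled, the per-language rate is computed at corpus level, and the function names and sample sentences are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def error_rate(pairs, unit="word"):
    """Corpus-level WER (unit='word') or CER (unit='char'), scaled by 100."""
    errors = total = 0
    for ref, hyp in pairs:
        ref_tokens = ref.split() if unit == "word" else list(ref)
        hyp_tokens = hyp.split() if unit == "word" else list(hyp)
        errors += edit_distance(ref_tokens, hyp_tokens)
        total += len(ref_tokens)
    return 100 * errors / total


def macro_average(per_language):
    """Unweighted mean over languages, so each language counts equally."""
    return sum(per_language.values()) / len(per_language)


# Made-up (reference, ASR transcript) pairs for illustration.
scores = {
    "en": error_rate([("the cat sat", "the cat sat"), ("hello world", "hello word")]),
    "zh": error_rate([("你好世界", "你好世间")], unit="char"),
}
print(scores, macro_average(scores))
```

Macro-averaging keeps a high-resource language with many test utterances from dominating the headline number.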

Emergent TTS

Win-rate (↑) per category — judge preference vs the BASELINE row; bold marks the highest win-rate per column. For a fair comparison, every model shares the same reference audio per prompt, and we run the benchmark text verbatim — no inline control tags inserted.

Usage

SGLang Usage

Pair the weights in this repo with SGLang-Omni, a production serving stack with continuous batching for multi-codebook decoding and the same inline tag controls. The Higgs TTS cookbook covers installation, server launch, request examples, and the full API reference.

Install and Serve

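# Pull the development image and open an interactive shell in the container.
docker pull lmsysorg/sglang-omni:dev
docker run -it --gpus all --shm-size 32g --ipc host --network host --privileged \
  lmsysorg/sglang-omni:dev /bin/zsh

# Inside the container: install SGLang-Omni from source in a Python 3.12 virtual environment.
git clone git@github.com:sgl-project/sglang-omni.git && cd sglang-omni
uv venv .venv -p 3.12 && source .venv/bin/activate
uv pip install -v -e .

# Set your Hugging Face token (replace the placeholder) and download the weights.
export HF_TOKEN=hf_xxxxxxxxxxxxxxxx
hf download bosonai/higgs-audio-v3-tts-4b

# Launch the server on port 8000.
sgl-omni serve \
  --model-path bosonai/higgs-audio-v3-tts-4b \
  --port 8000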

Zero-shot synthesis

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello, how are you?"}' \
  --output output.wav

Voice cloning

Supplying the reference transcript (text) materially improves cloning fidelity.

import requests

# "text" is the transcript of the reference audio in "audio_path".
resp = requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={
        "input": "Have a nice day and enjoy south california sunshine.",
        "references": [{
            "audio_path": "ref.wav",
            "text": "Hey, Adam here. Let's create something that feels real, sounds human, and connects every time.",
        }],
        "temperature": 0.8, "top_k": 50, "max_new_tokens": 1024,
    },
)
resp.raise_for_status()  # avoid writing an error response into output.wav
with open("output.wav", "wb") as f:
    f.write(resp.content)

Streaming (Server-Sent Events)

Set "stream": true to receive base64-encoded WAV chunks as the vocoder emits them — sub-second time-to-first-audio. Each event carries audio.data (base64 WAV bytes); the terminal event has finish_reason: "stop" plus usage metadata.

import base64
import json

import requests

with requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={"input": "Get the trust fund to the bank early.", "stream": True},
    stream=True,
) as resp, open("output.wav", "wb") as f:
    resp.raise_for_status()
    for line in resp.iter_lines():
        # Skip blank keep-alive lines, non-data lines, and the [DONE] sentinel.
        if not line or not line.startswith(b"data: ") or line == b"data: [DONE]":
            continue
        event = json.loads(line[6:])
        # The terminal event carries finish_reason and usage metadata.
        if event.get("finish_reason") == "stop":
            break
        audio = event.get("audio") or {}
        if audio.get("data"):
            f.write(base64.b64decode(audio["data"]))
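The event handling can be factored into a function that is testable without a running server. The sketch below is a hypothetical refactoring, not part of the official client: `decode_event` is an invented name, and the sample lines are fabricated to match the event shape described above (an `audio.data` field with base64 bytes, then a terminal event with `finish_reason: "stop"`).

```python
import base64
import json

PREFIX = b"data: "


def decode_event(line):
    """Return (audio_bytes, done) for one raw SSE line.

    Mirrors the loop above: blank and non-data lines yield no audio,
    and the terminal event or the [DONE] sentinel ends the stream.
    """
    if not line or not line.startswith(PREFIX):
        return b"", False
    payload = line[len(PREFIX):]
    if payload == b"[DONE]":
        return b"", True
    event = json.loads(payload)
    if event.get("finish_reason") == "stop":
        return b"", True
    data = (event.get("audio") or {}).get("data")
    return (base64.b64decode(data) if data else b""), False


def make_line(chunk):
    """Build a fabricated SSE data line that carries one audio chunk."""
    body = {"audio": {"data": base64.b64encode(chunk).decode("ascii")}}
    return PREFIX + json.dumps(body).encode("utf-8")


# Fabricated stream; real chunks would be WAV bytes from the server.
sample_lines = [
    b"",
    make_line(b"chunk-1"),
    make_line(b"chunk-2"),
    b'data: {"finish_reason": "stop", "usage": {}}',
    b"data: [DONE]",
]

chunks = []
for raw in sample_lines:
    audio_bytes, done = decode_event(raw)
    if done:
        break
    if audio_bytes:
        chunks.append(audio_bytes)

print(b"".join(chunks))
```

With a live server, the same function could be applied to each line from `resp.iter_lines()`.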

Inline control tokens

Embed <|emotion:…|>, <|style:…|>, <|prosody:…|>, and <|sfx:…|> tokens directly in input. Two rules:

  1. Delivery tokens first. Emotion, style, and the prosody speed / pitch / expressive tokens shape the whole turn — put them at the start of input. Positional tokens (<|prosody:pause|>, <|prosody:long_pause|>, <|sfx:…|>) go inline exactly where they fire.
  2. Pair every <|sfx:…|> with its onomatopoeia. E.g. <|sfx:laughter|>Haha, <|sfx:sigh|>Uh, <|sfx:sneeze|>Achoo. The written sound gives the model the acoustic cue to realize the effect.

Example — amusement + laughter:

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "<|emotion:amusement|><|prosody:expressive_high|>Wait, wait, that was kind of hilarious. <|sfx:laughter|>Hehe, no, seriously, I was not ready for that."}' \
  --output output.wav
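The two placement rules can also be enforced when the input string is assembled in code. The sketch below is one possible approach under stated assumptions: `sfx` and `build_input` are hypothetical helpers, not part of any Boson AI or SGLang-Omni API. It rebuilds the amusement and laughter example from this section and prints the JSON body that would be sent to `/v1/audio/speech`.

```python
import json

# Tokens that fire at a position rather than shaping the whole turn.
POSITIONAL_PROSODY = {"pause", "long_pause"}


def sfx(name, onomatopoeia):
    """Rule 2: every sound-effect tag is followed by its written sound."""
    if not onomatopoeia.strip():
        raise ValueError("an sfx tag needs an onomatopoeia right after it")
    return f"<|sfx:{name}|>{onomatopoeia}"


def build_input(delivery, *segments):
    """Rule 1: delivery tokens go first, then text with inline positional tokens.

    delivery: iterable of (category, value) pairs such as ("emotion", "amusement").
    segments: plain text and already formatted inline tokens, in speaking order.
    """
    head = []
    for category, value in delivery:
        if category == "sfx" or (category == "prosody" and value in POSITIONAL_PROSODY):
            raise ValueError(f"{category}:{value} is positional; place it inline")
        head.append(f"<|{category}:{value}|>")
    return "".join(head) + "".join(segments)


text = build_input(
    [("emotion", "amusement"), ("prosody", "expressive_high")],
    "Wait, wait, that was kind of hilarious. ",
    sfx("laughter", "Hehe"),
    ", no, seriously, I was not ready for that.",
)
print(json.dumps({"input": text}))
```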

Throughput

Throughput on Seed-TTS EN (full set, N=1088 per run). Client --max-concurrency sweep against a Higgs server (max_running_requests=16, bf16, CUDA Graph on). Each row is the mean of 3 runs. Hardware: 1× H100.

  • Concurrency — Maximum number of in-flight client requests (--max-concurrency).
  • Throughput (req/s) — Completed requests divided by total benchmark wall-clock time.
  • Mean latency — Average end-to-end time per request (send to full response received).
  • RTF (per-req) — Average ratio of processing time to generated audio duration per request (<1 is faster than real time).
  • audio_s/s — Total seconds of audio produced divided by total benchmark wall-clock time.
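The definitions above translate directly into code. The sketch below computes the four derived metrics from per-request records. It is not the benchmark script: `RequestRecord` and `summarize` are hypothetical names, the sample numbers are made up, and it assumes that "processing time" in the RTF definition equals end-to-end request latency.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    latency_s: float  # send to full response received
    audio_s: float    # duration of the generated audio


def summarize(records, wall_clock_s):
    """Aggregate per-request records into the metrics defined above."""
    n = len(records)
    return {
        "throughput_req_s": n / wall_clock_s,
        "mean_latency_s": sum(r.latency_s for r in records) / n,
        # Assumption: processing time is taken to be end-to-end latency.
        "rtf_per_req": sum(r.latency_s / r.audio_s for r in records) / n,
        "audio_s_per_s": sum(r.audio_s for r in records) / wall_clock_s,
    }


# Made-up records: three requests completed within a 4-second benchmark window.
records = [
    RequestRecord(latency_s=2.0, audio_s=4.0),
    RequestRecord(latency_s=3.0, audio_s=6.0),
    RequestRecord(latency_s=1.0, audio_s=4.0),
]
summary = summarize(records, wall_clock_s=4.0)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Note that per-request RTF and audio_s/s measure different things: with concurrent requests, each request can be slower in RTF terms while the server as a whole produces more seconds of audio per second.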

To reproduce the results, follow the instructions in this script.

vLLM-Omni Usage

You can also serve these weights with vLLM-Omni, which exposes the same OpenAI-compatible /v1/audio/speech API with zero-shot voice cloning.

hf download bosonai/higgs-audio-v3-tts-4b

vllm-omni serve bosonai/higgs-audio-v3-tts-4b \
  --host 0.0.0.0 --port 8095 \
  --trust-remote-code --omni

curl -X POST http://localhost:8095/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "bosonai/higgs-audio-v3-tts-4b", "input": "Hello, how are you?"}' \
  --output output.wav

Plain-text TTS, voice-clone, and benchmark recipes are in the vLLM-Omni Higgs Audio v3 recipe.

API Usage

For zero-ops deployment, use the Boson AI API.

Citation

@misc{bosonai_higgs_audio_tts_v3_2026,
  title  = {Higgs Audio v3 TTS: Conversational Speech for Voice AI from Boson AI},
  author = {Boson AI},
  year   = {2026},
  howpublished = {https://huggingface.co/bosonai/higgs-audio-v3-tts-4b},
}

License

Boson Higgs Audio v3 Research and Non-Commercial License — see LICENSE.

Author: Boson AI (organization, verified; bosonai)

Details

  • Downloads: 57.4K
  • Likes: 486
  • Access: Open Source
  • Task: text-to-speech
  • Parameters: 4.7B
  • Trending: 130
  • License: other
  • Library: transformers
  • Created: Jun 4, 2026
  • Updated: Jun 16, 2026
Languages: af, ar, as, ast, az, ba, be, bg, bn, bs, ca, ceb, ckb, cs, cy, da, de, el, en, eo, es, et, eu, fa, fi, fr, ga, gl, gu, ha