How many parameters does ZONOS2 have?

Parameter count for ZONOS2 is not available. See the Hugging Face model page for full specifications.

ZONOS2 was published by Zyphra on Hugging Face.

ZONOS2

Name: ZONOS2
Author: Zyphra

ZONOS2 is Zyphra's text-to-speech model for generating natural-sounding speech audio from text input.

Model Description

ZONOS2 is our latest text-to-speech model, trained on more than 6 million hours of varied multilingual speech. Its mixture-of-experts (MoE) architecture delivers expressiveness and quality on par with, or even surpassing, top TTS providers at low latency. ZONOS2 excels at high-fidelity, naturalistic voice cloning.

During inference, we use the UTF-8 bytes of text normalized with NeMo text normalization (TN), together with an ECAPA-TDNN speaker embedding, to generate DAC tokens with our MoE backbone. An inference overview can be seen below.
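To make the byte-level text input concrete, here is a minimal sketch of the UTF-8 encoding step only. It does not perform NeMo normalization, speaker embedding, or token generation. `text_to_byte_ids` is a hypothetical helper, not part of the ZONOS2 codebase, and the model's real input vocabulary may add offsets or special tokens on top of the raw byte values.

```python
def text_to_byte_ids(text: str) -> list[int]:
    """Encode already-normalized text as a list of UTF-8 byte values (0-255)."""
    return list(text.encode("utf-8"))


# ASCII characters map to one byte each.
ascii_ids = text_to_byte_ids("Hi")
# Non-ASCII characters expand to several bytes (here U+00E9, "e" with acute accent).
accented_ids = text_to_byte_ids("\u00e9")

print(ascii_ids)     # [72, 105]
print(accented_ids)  # [195, 169]
```

Working on bytes means every language in the tiers below shares one small, fixed input alphabet, at the cost of longer sequences for non-Latin scripts.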

Language support is as follows.

Tier	Languages
Tier 1	English, Mandarin Chinese, Japanese
Tier 2	Korean, Russian, Italian, Portuguese, French, Spanish, Vietnamese, German, Hebrew, Dutch
Tier 3	Swedish, Hindi, Tamil, Telugu, Thai, Norwegian, Bengali, Tagalog, Arabic, Danish, Indonesian, Polish, Ukrainian, Romanian, Finnish, Hungarian, Lithuanian, Estonian, Slovak, Croatian, Latvian

For local inference we provide a high-performance TTS inference server built on Mini-SGLang.

For more details and speech samples, check out our blog.

We also have a hosted version available at cloud.zyphra.com/audio-playground.

Quick Start

Platform Support: Linux only (x86_64). Requires an NVIDIA GPU and a CUDA toolkit that matches your driver version (run nvidia-smi to check).

1. Installation

Requires uv.

git clone https://github.com/Zyphra/ZONOS2.git
cd ZONOS2
uv sync

2. Launch the TTS Server

uv run python -m minisgl --model-path Zyphra/ZONOS2 --tts-default-voices-dir ./default_voices/

uv run always uses the project environment, so no venv activation is needed.

The server starts on http://localhost:1919 by default. TTS mode is auto-detected for zonos2 models. --tts-default-voices-dir <folder> pre-populates the web UI with voice-clone speakers from disk; the folder is scanned recursively for speaker audio (.wav, .mp3, .flac, .m4a, .ogg, .opus, .aac, .webm) and saved embeddings (.npy, .npz). The newest voice is selected automatically on startup.
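The recursive voice scan described above can be approximated as follows. This is a sketch and not the server's actual implementation. `find_voice_files` is a hypothetical helper, and two behaviors are assumptions: extensions are matched case-insensitively, and "newest" is taken to mean the most recent file modification time.

```python
from pathlib import Path

# Extensions listed in the server description above.
AUDIO_EXTS = {".wav", ".mp3", ".flac", ".m4a", ".ogg", ".opus", ".aac", ".webm"}
EMBEDDING_EXTS = {".npy", ".npz"}


def find_voice_files(root):
    """Recursively collect speaker audio and saved-embedding files, newest first."""
    root = Path(root)
    if not root.is_dir():
        return []
    wanted = AUDIO_EXTS | EMBEDDING_EXTS
    files = [
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in wanted  # assumption: case-insensitive match
    ]
    # Assumption: "newest" means the latest modification time.
    return sorted(files, key=lambda p: p.stat().st_mtime, reverse=True)


voices = find_voice_files("./default_voices/")
if voices:
    print(f"{len(voices)} voice file(s) found; newest: {voices[0]}")
else:
    print("no voice files found")
```

Running this against your voices folder before launching the server is a quick way to check which files are likely to be picked up.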

3. Generate Speech

curl:

curl -X POST http://localhost:1919/tts/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "stream": true}' \
  --output output.pcm

# Convert to WAV
ffmpeg -f f32le -ar 44100 -ac 1 -i output.pcm output.wav
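The same two steps can be sketched in Python using only the standard library. This assumes, as the curl and ffmpeg commands imply, that the response body is raw mono 32-bit float little-endian PCM at 44100 Hz. `build_tts_request`, `pcm_f32le_to_wav`, and `synthesize` are hypothetical helper names, not part of the ZONOS2 codebase. The WAV is written as 16-bit PCM, which is also ffmpeg's default WAV encoding.

```python
import json
import struct
import urllib.request
import wave

TTS_URL = "http://localhost:1919/tts/generate"  # default server address from this guide


def build_tts_request(text, url=TTS_URL):
    """Build the same POST request as the curl command (JSON body, stream enabled)."""
    body = json.dumps({"text": text, "stream": True}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def pcm_f32le_to_wav(pcm, wav_path, sample_rate=44100):
    """Write mono float32 little-endian PCM bytes as a 16-bit WAV; return the sample count."""
    n = len(pcm) // 4  # 4 bytes per float32 sample; a trailing partial sample is ignored
    samples = struct.unpack(f"<{n}f", pcm[: n * 4])
    # Clip to [-1.0, 1.0], then scale to the signed 16-bit range.
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    with wave.open(str(wav_path), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{n}h", *ints))
    return n


def synthesize(text, wav_path="output.wav"):
    """Send text to a running server and save the returned audio as WAV."""
    # Assumption: the streamed response body is the raw PCM that curl writes to output.pcm.
    with urllib.request.urlopen(build_tts_request(text)) as resp:
        pcm = resp.read()
    return pcm_f32le_to_wav(pcm, wav_path)
```

With the server running, `synthesize("Hello world")` should produce `output.wav`. For real-time playback you would read the response in chunks rather than calling `read()` once.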

Web UI: Open http://localhost:1919/ in your browser.

Python API (offline inference)

You can also run the engine directly in a Python script, without starting a server, via TTSLLM:

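from minisgl.message import TTSSamplingParams
from minisgl.tts import TTSLLM

# Create the engine in-process; no server is started.
tts = TTSLLM(model_path="Zyphra/ZONOS2")

# Generate audio for a batch of prompts, passing a fixed sampling seed.
results = tts.generate(
    ["Hello from the offline Python API.", "Batched prompts work too."],
    TTSSamplingParams(seed=42),
)

# Report each result's audio-token frame count and EOS frame, then save its audio.
for i, result in enumerate(results):
    print(f"frames={len(result['audio_tokens'])}, eos_frame={result['eos_frame']}")
    tts.save_audio(result["audio"], f"output_{i}.wav")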

Citation

If you find this model useful in an academic context please cite as:

@misc{zyphra2025zonos,
  title     = {Zonos V2 Technical Report},
  author    = {Gabriel Clark and Sofian Mejjoute and Mohamed Osman and George Close and Beren Millidge},
  year      = {2026},
}

Author: Zyphra
Organization: Zyphra

Details

Downloads: 813
Likes: 119
Access: Open Source
Task: text-to-speech
Trending: 54
License: apache-2.0
Library: ZONOS2
Created: Jun 11, 2026
Updated: Jun 13, 2026
