Voxtral-4B-TTS-2603
Mistral AI's text-to-speech model supporting 9 languages including English, French, Spanish, Arabic, and Hindi.
Base model
Model Card
Voxtral TTS is a frontier, open-weights text-to-speech model that’s fast, instantly adaptable, and produces lifelike speech for voice agents. The model is released with BF16 weights and a set of reference voices. These voices are licensed under CC BY-NC 4, which is the license that the model inherits.
For more details, see our:
Key Features
Voxtral TTS delivers enterprise-grade text-to-speech for production voice agents, with the following capabilities:
- Realistic, expressive speech with natural prosody and emotional range across 9 major languages, with support for diverse dialects
- Text-to-Speech generation with 20 preset voices and easy adaptation to new voices
- Multilingual support: English, French, Spanish, German, Italian, Portuguese, Dutch, Arabic, and Hindi
- Very low latency with fast time-to-first-audio, plus streaming and batch inference support
- 24 kHz audio output in WAV, PCM, FLAC, MP3, AAC, and Opus formats
- Production-ready performance for high-throughput, real-time voice agent workflows
[!Tip] For voice customization, visit our AI Studio.
Use Cases
- Customer support and call center infrastructure.
- Financial services. -- with video demo on banking KYC voice agents.
- Manufacturing and industrial operations.
- Public services and government.
- Compliance and risk.
- Supply chain and logistics.
- Automotive and in-vehicle systems.
- Sales and marketing.
- Real-time translation.
[!Warning] Responsible Use - You are responsible for complying with applicable laws and avoiding misuse.
Benchmark Results
- Measured using vllm_omni/examples/offline_inference/voxtral_tts/end2end.py.
- Input: 500-character text with a 10-second audio reference.
- Hardware: single NVIDIA H200.
- vllm version: v0.18.0.
Note: The RTF in end2end.py uses an inverted formula (higher = better). The table below converts it back to the standard RTF convention (lower = better)
| Concurrency | Latency | RTF | Throughput (char/s/GPU) |
|---|---|---|---|
| 1 | 70 ms | 0.103 | 119.14 |
| 16 | 331 ms | 0.237 | 879.11 |
| 32 | 552 ms | 0.302 | 1430.78 |
Usage
The model can also be deployed with the following libraries:
vLLM Omni (recommended)
[!Tip] We've worked hand-in-hand with the vLLM-Omni team to have production-grade support for Voxtral 4B TTS 2603 with vLLM-Omni. Special thanks goes out to Han Gao, Hongsheng Liu, Roger Wang, and Yueqian Lin from the vLLM-Omni team.
Installation
Make sure to install vllm from the latest (>= 0.18.0) pypi package. See here for a full installation guide.
uv pip install -U vllm
Next, you should install vllm-omni with vllm-omni >= 0.18.0.
uv pip install vllm-omni --upgrade # make sure to have >= 0.18.0
Alternatively, you can also make use of a ready-to-go docker image on the docker hub.
Installing vllm >= 0.18.0 should automatically install mistral_common >= 1.10.0 which you can verify by running:
python3 -c "import mistral_common; print(mistral_common.__version__)" # should print >= 1.10.0
Serve
Due to size and the BF16 format of the weights - Voxtral-4B-TTS-2603 can run on a single GPU with >= 16GB memory.
vllm serve mistralai/Voxtral-4B-TTS-2603 --omni
Client
import io
import httpx
import soundfile as sf
BASE_URL = "http://<your-server-url>:8000/v1"
payload = {
"input": "Paris is a beautiful city!",
"model": "mistralai/Voxtral-4B-TTS-2603",
"response_format": "wav",
"voice": "casual_male",
}
response = httpx.post(f"{BASE_URL}/audio/speech", json=payload, timeout=120.0)
response.raise_for_status()
audio_array, sr = sf.read(io.BytesIO(response.content), dtype="float32")
print(f"Got audio: {len(audio_array)} samples at {sr} Hz")
# you can play the audio with a library like `sounddevice.play` for example
Demo
To run it:
git clone https://github.com/vllm-project/vllm-omni.git && \
cd vllm-omni && \
uv pip install gradio==5.50 && \
python examples/online_serving/voxtral_tts/gradio_demo.py \
--host <your-server-url> \
--port 8000
Alternatively you can also try it out live here ➡️ HF Space.
License
The provided voice-references compatible with this model are licensed under CC BY-NC 4, e.g. from EARS, CML-TTS, IndicVoices-R and Arabic Natural Audio datasets. Thus, this model inherits the same license.
You must not use this model in a manner that infringes, misappropriates, or otherwise violates any third party’s rights, including intellectual property rights.
Sign up to read complete case studies, access detailed metrics, and unlock all use cases.
Sign up to read complete case studies, access detailed metrics, and unlock all use cases.