OS

Inflect-Nano-v1

Audioby Owen Song·Model page

owensong's ultra-small experimental English TTS model designed for efficient local speech synthesis.

Share:

Model Card

Inflect-Nano-v1

Edit 06/17/2026 -- I'm really happy to see that this model is doing decently! If more people find it useful, I might consider training a v2 with a much larger budget. But for now, if you would like to see a v2, just like/favourite this model to get more people see it! Thank you for everyone for checking out this model.

Inflect-Nano-v1 is a tiny English text-to-speech model with 4.63M total inference parameters, including its vocoder.

It is not trying to beat large TTS models. It is a small, local, complete text-to-waveform stack built to test how far ultra-lightweight speech synthesis can go.

Highlights

  • 4.63M parameters total
  • Includes the vocoder
  • 24 kHz audio
  • Single English male voice
  • Runs locally with PyTorch
  • Built for tiny-model experiments, local assistants, embedded demos, and efficient inference research

Listen

Text Audio
"Did the timing change?" she answered. "Then why did Logan leave?"
Who puts a parking meter next to an ER label?
Please say neighborhood, statistics, and anesthesiologist clearly, without rushing through the middle syllables.
I said 91, not 306, which is a very different number.
The inference path looked natural, but the decoder still needed a smoother transition before Marcus approved the final test.
The appointment moved to 1:25, the invoice was $674.96, and the archive was labeled 1998.
If Logan sounded uneasy, then it happened near Long Beach, and the pause has to carry that.
The word aluminum should not steal attention from the softer ending after entrepreneur.

Install

git clone https://huggingface.co/owensong/Inflect-Nano-v1
cd Inflect-Nano-v1
pip install -r requirements.txt

Generate Speech

python inference.py --text "Wait, are you actually being for real now?" --out sample.wav

CPU:

python inference.py --device cpu --text "Please say neighborhood clearly." --out sample_cpu.wav

With simple controls:

python inference.py \
  --text "The appointment moved to 1:25." \
  --length-scale 1.03 \
  --pitch-scale 1.00 \
  --energy-scale 1.00 \
  --out sample_controlled.wav

Local Gradio demo:

python app.py

Model Size

Part Parameters
Acoustic model 3.465M
Vocoder generator 1.167M
Total inference stack 4.632M

The model files are:

weights/inflect_nano_v1_acoustic.pt
weights/inflect_nano_v1_vocoder.pt

Repo Layout

weights/                         model weights
examples/                        audio examples
assets/                          README banner
inflect_nano/                    runtime model code
third_party/tiny_tts_frontend/   vendored text frontend used for English G2P/token IDs
inference.py                     simple CLI inference
app.py                           local Gradio demo

The model itself is in weights/. The vendored frontend is included only so the released model can reproduce the same text normalization and tokenization path.

What Makes It Different

Many small TTS projects depend on a separate larger vocoder. Inflect-Nano-v1 includes the vocoder in the published inference stack, so the full text-to-waveform path stays under 5M parameters.

Pipeline:

text
-> English text frontend
-> compact FastSpeech-style acoustic model
-> 80-bin mel spectrogram
-> small Snake HiFi-GAN-style vocoder
-> 24 kHz waveform

Architecture

The acoustic model is a compact non-autoregressive FastSpeech-style network. It predicts duration, pitch, energy, and brightness, then decodes an 80-bin mel spectrogram.

The vocoder is a small Snake-activation HiFi-GAN-style generator trained for 24 kHz waveform reconstruction.

Main settings:

Setting Value
Sample rate 24 kHz
Mel bins 80
Acoustic hidden size 168
Encoder layers 5
Decoder layers 6
Vocoder upsample rates 8, 8, 2, 2

Good For

  • Tiny local TTS experiments
  • Offline assistant prototypes
  • Efficient inference research
  • Embedded speech demos
  • Browser/WASM-style exploration
  • A baseline for sub-5M TTS work

Not Good For

  • Production narration
  • Accessibility-critical output
  • Voice cloning
  • Multilingual speech
  • High-fidelity audiobook generation
  • Matching large modern TTS systems

Limitations

This is a very small experimental model. It can sound robotic, buzzy, or unstable, especially on difficult unseen text. Long prompts and unusual phrasing are less reliable. The vocoder is also a clear quality bottleneck.

Use it as a tiny-model research/demo release, not as a production TTS engine.

License

Apache-2.0.

This repository includes a small third-party English text frontend for tokenization/G2P compatibility. Its license is included at third_party/tiny_tts_frontend/LICENSE.

Author
OS
Owen Song
User
owensong
Details
Downloads0
Likes68
AccessOpen Source
Tasktext-to-speech
Trending68
Licenseapache-2.0
Librarypytorch
CreatedJun 16, 2026
UpdatedJun 18, 2026
View on Hugging Face
Languages
en
Get the full context.

Sign up to read complete case studies, access detailed metrics, and unlock all use cases.

Inflect-Nano-v1 — AI Model Details | Applied