How many parameters does Inflect-Nano-v1 have?

Inflect-Nano-v1 has 4.63M total inference parameters: 3.465M in the acoustic model and 1.167M in the vocoder generator. See the Hugging Face model page for full specifications.

Who created Inflect-Nano-v1?

Inflect-Nano-v1 was published by Owen Song on Hugging Face.

Inflect-Nano-v1

Name: Inflect-Nano-v1
Author: Owen Song

owensong's ultra-small experimental English TTS model designed for efficient local speech synthesis.

Inflect-Nano-v1

Edit 06/19/2026 -- #1 trending on the TTS leaderboard... seeing this is incredibly rewarding as a solo developer with little budget and low expectations for this model. I have decided that I will do an Inflect-Nano-v2! It will be larger, though I am planning on releasing 2 model variants: one at 10M parameters and one at 4M parameters. V2 will be noticeably better in all aspects, and the budget for v2 is much larger than for v1, so I am excited to see how it turns out! V2 will also be much easier to finetune for things like other languages!

Inflect-Nano-v1 is a tiny English text-to-speech model with 4.63M total inference parameters, including its vocoder.

It is not trying to beat large TTS models. It is a small, local, complete text-to-waveform stack built to test how far ultra-lightweight speech synthesis can go.

Highlights

4.63M parameters total
Includes the vocoder
24 kHz audio
Single English male voice
Runs locally with PyTorch
Built for tiny-model experiments, local assistants, embedded demos, and efficient inference research

Listen

Text	Audio
"Did the timing change?" she answered. "Then why did Logan leave?"
Who puts a parking meter next to an ER label?
Please say neighborhood, statistics, and anesthesiologist clearly, without rushing through the middle syllables.
I said 91, not 306, which is a very different number.
The inference path looked natural, but the decoder still needed a smoother transition before Marcus approved the final test.
The appointment moved to 1:25, the invoice was $674.96, and the archive was labeled 1998.
If Logan sounded uneasy, then it happened near Long Beach, and the pause has to carry that.
The word aluminum should not steal attention from the softer ending after entrepreneur.

Install

git clone https://huggingface.co/owensong/Inflect-Nano-v1
cd Inflect-Nano-v1
pip install -r requirements.txt

Generate Speech

python inference.py --text "Wait, are you actually being for real now?" --out sample.wav

CPU:

python inference.py --device cpu --text "Please say neighborhood clearly." --out sample_cpu.wav

With simple controls:

python inference.py \
  --text "The appointment moved to 1:25." \
  --length-scale 1.03 \
  --pitch-scale 1.00 \
  --energy-scale 1.00 \
  --out sample_controlled.wav

Local Gradio demo:

python app.py
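
The CLI calls above can also be scripted from Python, for example to generate several clips in a loop. The following is a minimal sketch that only assembles the command line; it uses the flags shown above (--text, --device, --length-scale, --pitch-scale, --energy-scale, --out) and nothing else. The helper name build_inference_command is hypothetical and is not part of the repository.

```python
import shlex
import sys


def build_inference_command(text, out_path, device=None,
                            length_scale=None, pitch_scale=None,
                            energy_scale=None):
    """Build an argv list for inference.py (hypothetical helper).

    Only flags that appear in the README commands are used.
    """
    cmd = [sys.executable, "inference.py", "--text", text]
    if device is not None:
        cmd += ["--device", device]
    # Optional controls: a flag is added only when a value is given,
    # so omitted controls fall back to whatever the script defaults to.
    for flag, value in (("--length-scale", length_scale),
                        ("--pitch-scale", pitch_scale),
                        ("--energy-scale", energy_scale)):
        if value is not None:
            cmd += [flag, f"{value:.2f}"]
    cmd += ["--out", out_path]
    return cmd


example = build_inference_command(
    "The appointment moved to 1:25.",
    "sample_controlled.wav",
    length_scale=1.03,
)
# Print a copy-pasteable shell command instead of running it.
print(shlex.join(example))
```

To actually synthesize audio, the returned list could be passed to subprocess.run(cmd, check=True) from the repository root, after the install steps have been completed.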

Model Size

Part	Parameters
Acoustic model	3.465M
Vocoder generator	1.167M
Total inference stack	4.632M

The model files are:

weights/inflect_nano_v1_acoustic.pt
weights/inflect_nano_v1_vocoder.pt
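
As a quick check of the table above, the two component counts can be summed and turned into a rough size estimate. This is only arithmetic on the published numbers. The bytes-per-parameter values are assumptions (the README does not say which precision the checkpoints use), and real .pt files carry some extra overhead.

```python
# Parameter counts from the Model Size table, in millions.
PARAMS_M = {"acoustic_model": 3.465, "vocoder_generator": 1.167}

total_m = sum(PARAMS_M.values())


def weights_size_mb(params_m, bytes_per_param):
    """Approximate raw weight size in MB (10**6 bytes), ignoring file overhead."""
    return params_m * bytes_per_param


print(f"total: {total_m:.3f}M parameters")
for name, count in PARAMS_M.items():
    print(f"{name}: {count / total_m:.1%} of the inference stack")

# Assumed storage precisions, for illustration only.
print(f"32-bit floats: about {weights_size_mb(total_m, 4):.1f} MB")
print(f"16-bit floats: about {weights_size_mb(total_m, 2):.1f} MB")
```

The acoustic model accounts for roughly three quarters of the stack, and the vocoder generator for the remaining quarter.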

Repo Layout

weights/                         model weights
examples/                        audio examples
assets/                          README banner
inflect_nano/                    runtime model code
third_party/tiny_tts_frontend/   vendored text frontend used for English G2P/token IDs
inference.py                     simple CLI inference
app.py                           local Gradio demo

The model itself is in weights/. The vendored frontend is included only so the released model can reproduce the same text normalization and tokenization path.

What Makes It Different

Many small TTS projects depend on a separate larger vocoder. Inflect-Nano-v1 includes the vocoder in the published inference stack, so the full text-to-waveform path stays under 5M parameters.

Pipeline:

text
-> English text frontend
-> compact FastSpeech-style acoustic model
-> 80-bin mel spectrogram
-> small Snake HiFi-GAN-style vocoder
-> 24 kHz waveform

Architecture

The acoustic model is a compact non-autoregressive FastSpeech-style network. It predicts duration, pitch, energy, and brightness, then decodes an 80-bin mel spectrogram.
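
In FastSpeech-style models, predicted durations are typically applied by a length regulator that repeats each token's hidden state for a number of mel frames. The sketch below illustrates that general idea in plain Python. It is not the code in inflect_nano/, and the assumption that a length scale above 1.0 slows speech (as --length-scale suggests) has not been checked against the repository.

```python
def length_regulate(token_states, durations, length_scale=1.0):
    """Expand per-token states to per-frame states (illustrative sketch).

    token_states: one item per input token (any Python object here).
    durations:    predicted duration of each token, in mel frames.
    length_scale: assumed global tempo factor; > 1.0 means more frames.
    """
    if len(token_states) != len(durations):
        raise ValueError("need exactly one duration per token")
    frames = []
    for state, duration in zip(token_states, durations):
        # Round half up to whole frames; keep at least one frame per token
        # so no token disappears from the output.
        n_frames = max(1, int(duration * length_scale + 0.5))
        frames.extend([state] * n_frames)
    return frames


tokens = ["HH", "AH", "L", "OW"]   # placeholder phoneme labels
durations = [3, 5, 4, 8]           # made-up frame counts

normal = length_regulate(tokens, durations)
slower = length_regulate(tokens, durations, length_scale=1.5)
print(len(normal), "frames at scale 1.0")
print(len(slower), "frames at scale 1.5")
```

Because the expansion is a simple repeat, the whole mel spectrogram can be decoded in one pass, which is what makes this family of models non-autoregressive.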

The vocoder is a small Snake-activation HiFi-GAN-style generator trained for 24 kHz waveform reconstruction.

Main settings:

Setting	Value
Sample rate	24 kHz
Mel bins	80
Acoustic hidden size	168
Encoder layers	5
Decoder layers	6
Vocoder upsample rates	8, 8, 2, 2
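
The settings above imply a frame rate for the mel spectrogram. In HiFi-GAN-style generators, the product of the upsample rates is the number of audio samples produced per mel frame. Assuming that holds here and that there is no other resampling step, the hop length works out to 256 samples. The sketch below derives the related numbers; the function name is illustrative.

```python
import math

SAMPLE_RATE = 24_000           # from the settings table
UPSAMPLE_RATES = (8, 8, 2, 2)  # from the settings table

# Assumption: the upsampling stages are the vocoder's only time expansion,
# so one mel frame becomes prod(rates) waveform samples.
hop_length = math.prod(UPSAMPLE_RATES)
frames_per_second = SAMPLE_RATE / hop_length
frame_ms = 1000 * hop_length / SAMPLE_RATE


def frames_to_seconds(n_frames):
    """Duration of the waveform generated from n_frames mel frames."""
    return n_frames * hop_length / SAMPLE_RATE


print(f"hop length: {hop_length} samples")
print(f"frame rate: {frames_per_second} frames/s")
print(f"frame duration: {frame_ms:.2f} ms")
print(f"375 frames -> {frames_to_seconds(375)} s of audio")
```

Under these assumptions, each mel frame covers a little under 11 ms of audio, so a four-second clip corresponds to 375 frames of 80 mel bins.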

Good For

Tiny local TTS experiments
Offline assistant prototypes
Efficient inference research
Embedded speech demos
Browser/WASM-style exploration
A baseline for sub-5M TTS work

Not Good For

Production narration
Accessibility-critical output
Voice cloning
Multilingual speech
High-fidelity audiobook generation
Matching large modern TTS systems

Limitations

This is a very small experimental model. It can sound robotic, buzzy, or unstable, especially on difficult unseen text. Long prompts and unusual phrasing are less reliable. The vocoder is also a clear quality bottleneck.

Use it as a tiny-model research/demo release, not as a production TTS engine.
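
Since long prompts are less reliable, one possible workaround is to split long text into shorter sentence-sized chunks and synthesize each chunk separately. The sketch below shows such a splitter. It is an untested idea rather than something the repository provides or recommends, and both the function name split_for_tts and the default max_chars value are arbitrary choices made for illustration.

```python
import re


def split_for_tts(text, max_chars=120):
    """Group sentences into chunks of at most about max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut
    mid-sentence.
    """
    # Split after '.', '!' or '?' when followed by whitespace, so decimals
    # such as 674.96 and times such as 1:25 stay intact.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = ("I said 91, not 306. The invoice was $674.96. "
        "The archive was labeled 1998.")
for i, chunk in enumerate(split_for_tts(text, max_chars=40)):
    print(i, chunk)
```

Each chunk could then be passed to inference.py with its own --out file and the resulting clips concatenated. Whether this improves stability for this model would need to be checked by listening.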

License

Apache-2.0.

This repository includes a small third-party English text frontend for tokenization/G2P compatibility. Its license is included at third_party/tiny_tts_frontend/LICENSE.

Author: Owen Song
User: owensong

Details

Downloads: 0
Likes: 207
Access: Open Source
Task: text-to-speech
Trending: 55
License: apache-2.0
Library: pytorch
Created: Jun 16, 2026
Updated: Jun 24, 2026
