Show HN: Three new Kitten TTS models – smallest less than 25MB
Kitten TTS (https://github.com/KittenML/KittenTTS) is an open-source series of tiny, expressive text-to-speech models for on-device applications. We had a thread here last year: https://news.ycombinator.com/item?id=44807868

Today we're releasing three new models with 80M, 40M, and 14M parameters. The largest model (80M) has the highest quality. The 14M variant reaches a new SOTA in expressivity among similar-sized models, despite being <25MB in size. This release is a major upgrade from the previous one and supports English text-to-speech applications in eight voices: four male and four female.

Here's a short demo: https://www.youtube.com/watch?v=ge3u5qblqZA

Most models are quantized to int8 + fp16, and they use ONNX for the runtime. Our models are designed to run anywhere, e.g. Raspberry Pi, low-end smartphones, wearables, browsers, etc. No GPU required! This release aims to bridge the gap between on-device and cloud models for TTS applications. A multilingual model release is coming soon.

On-device AI is bottlenecked by one thing: a lack of tiny models that actually perform. Our goal is to open-source more models so that production-ready voice agents and apps can run entirely on-device.

We would love your feedback!
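As a rough sanity check on the headline size claim, a 14M-parameter model stored mostly in int8 (1 byte/param) with some layers kept in fp16 (2 bytes/param) does land under 25MB. The 80/20 split below is a hypothetical illustration, not the model's actual layer breakdown:

```python
# Back-of-envelope size estimate for a 14M-parameter model quantized
# to a mix of int8 and fp16. The int8/fp16 split is an assumption
# for illustration, not the real layout.
params = 14_000_000
int8_share, fp16_share = 0.8, 0.2  # hypothetical split

size_bytes = params * (int8_share * 1 + fp16_share * 2)
size_mb = size_bytes / (1024 * 1024)
print(f"~{size_mb:.1f} MB")  # comfortably under the stated 25 MB
```

Even an all-fp16 copy of 14M parameters would be ~26.7 MB, so the int8 quantization is what brings the on-disk size below the 25MB mark.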
173 points by rohan_joshi - 57 comments
I'm impressed with the quality given the size. I don't love the voices, but it's not bad. Running on an intel 9700 CPU, it's about 1.5x realtime using the 80M model. It wasn't any faster running on a 3080 GPU though.
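For concreteness, "1.5x realtime" means synthesis runs faster than playback: the realtime factor is audio duration divided by wall-clock synthesis time. A quick calculation (clip length hypothetical, the 1.5x figure from the comment above):

```python
# Realtime factor = seconds of audio produced / seconds of wall-clock
# compute. At 1.5x realtime, a 30-second clip takes 20 seconds to
# synthesize on the CPU described above.
rtf = 1.5
audio_seconds = 30.0               # hypothetical clip length
synth_seconds = audio_seconds / rtf
print(synth_seconds)
```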
Regarding running on the 3080 GPU, can you share more details on GitHub issues, Discord, or email? It should be blazing fast on that. I'll add an example for running the model on GPU too.
I couldn't locate how to run it on a GPU anywhere in the repo.
Kokoro TTS, for example, has a very good Norwegian voice, but the rhythm and emphasis are often so out of whack that the generated speech is almost incomprehensible.
Haven't had time to check this model out yet; how does it fare here? What's needed to improve models in this area, now that the voice part is more or less solved?
Either in the form of an API with pitch/speed/volume controls, for more deterministic control.
Or in expressive tags such as [coughs], [urgently], or [laughs in melodic ascending and descending arpeggiated gibberish babbles].
The 25MB model is amazingly good for being 25MB. How does it handle expressive tags?
A stretch goal is 'arbitrary tags' from [singing] [sung to the tune of {x}] [pausing for emphasis] [slowly decreasing speed for emphasis] [emphasizing the object of this sentence] [clapping] [car crash in the distance] [laser's pew pew].
But yeah: instruction/control via [tags] is the deciding feature for me, provided prompt adherence is strong enough.
Also: a thought...
Everyone in this space is using [] for different kinds of tags, which is very simple. Maybe it makes sense to differentiate kinds of tags? E.g. [tags for modifying how text is spoken] vs. {tags for creating sounds that aren't specifically speech: not modifying anything, but instead their own 'sound/word'}.
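A minimal sketch of the distinction proposed above, assuming a hypothetical front-end that separates square-bracket speech modifiers from curly-brace sound events before handing text to the synthesizer (the function and tag grammar are illustrative, not part of Kitten TTS):

```python
import re

# Hypothetical pre-processor for the two proposed tag kinds:
# [modifier] tags change how surrounding text is spoken;
# {sound} tags insert standalone non-speech events.
TAG_RE = re.compile(r"\[([^\]]+)\]|\{([^}]+)\}")

def split_tags(text):
    modifiers, sounds = [], []
    def strip(match):
        if match.group(1) is not None:
            modifiers.append(match.group(1))
        else:
            sounds.append(match.group(2))
        return ""
    plain = TAG_RE.sub(strip, text)
    # Collapse whitespace left behind by removed tags.
    return " ".join(plain.split()), modifiers, sounds

plain, mods, sounds = split_tags(
    "[urgently] Watch out! {car crash in the distance}"
)
print(plain)   # Watch out!
print(mods)    # ['urgently']
print(sounds)  # ['car crash in the distance']
```

Keeping the two kinds in separate buckets would let a synthesizer apply modifiers as conditioning on the adjacent text while treating sound events as insertable clips or generated effects.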
The new 14M model is way better than the previous 80M model (v0.1). So we're able to predictably improve quality, which is very encouraging.
(That's using the example as-is. If you switch it to the smaller model, the downloads from HuggingFace change by +57 MiB, for a total of 727 MiB.)
If the authors don't describe any details about the data, training, a novel architecture, etc., I just assume they took another model, did a little fine-tuning, and repackaged it as a new product.
Also:
https://github.com/sparkaudio/spark-tts
I want to be my own personal assistant...
EDIT: I can provide it an RTX 3080 Ti.
Qwen 3 TTS is good for voice cloning but requires a GPU of some sort.
The iOS version is Swift-based.
Is there any way to get those running on iPhone ? I would love to have the ability for it to read articles to me like a podcast.
Is there any way to do a custom voice as a DIY? Or we need to go through you? If so, would you consider making a pricing page for purchasing a license/alternative voice? All but one of the voices are unusable in a business context.
This is a mind numbing task that requires workers to make hundreds of calls each day with only minor variations, sometimes navigating phone trees, half the time leaving almost the exact same message.
Anyway, I believe almost all such businesses will be automated within months. Human labour just cannot compete on cost.
Tl;dr: generate a human-like voice based on animal sounds. Anyway, maybe it doesn't make sense.