Some progress on the state of speech detection in Platypush (powered by Picovoice)

Many ambitious voice projects have gone bust in the past couple of years, but one seems to be more promising than it was a while ago.

Published by Fabio Manganiello on Apr 07, 2024

I've recently picked up some development on Picovoice again, as I'm rewriting some Platypush integrations that haven't been touched in a long time (and Picovoice is among those).

I originally worked with their APIs about 4-5 years ago, when I did some research on STT engines for Platypush.

Back then I kind of overlooked Picovoice. It wasn't very well documented, the APIs were a bit clunky, and their business model was based on a weird "send us an email with your use-case and we'll get back to you" approach (definitely not the kind of thing you'd want other users to reuse with their own accounts and keys).

Eventually I did just enough to get the basics working, and then both my first and second articles on voice assistants focused more on other solutions - namely Google Assistant, Alexa, Snowboy, Mozilla DeepSpeech and Mycroft's models.

A couple of years down the line:

Snowboy is dead
Mycroft is dead
Mozilla DeepSpeech isn't officially dead, but it hasn't seen a commit in 3 years
Amazon's AVS APIs have become clunky and it's basically impossible to run any logic outside of Amazon's cloud
The Google Assistant library has been deprecated without a replacement. It still works on Platypush after I hammered it a lot (especially when it comes to its dependencies from 5-6 years ago), but it only works on x86_64 and Raspberry Pi 3/4 (not aarch64).

So I was like "ok, let's give Picovoice another try". And I must say that I'm impressed by what I've seen. The documentation has improved a lot. The APIs are much more polished. They also have a Web console that you can use to train your hotword models and intent logic - no coding involved, similar to what Snowboy used to have. The business model is still a bit weird, but at least now you can sign up from a Web form (you still have to explain what you want to use Picovoice products for), and you immediately get an access key to start playing on any platform. The product isn't fully open-source either (only the API bindings are). But at first glance it seems that most of the processing (if not all, with the exception of authentication) happens on-device - and that's a big selling point.

Most of all, the hotword models are really good. After a bit of plumbing with sounddevice, I've managed to implement real-time hotword detection on Platypush that works really well.
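For readers wondering what that kind of plumbing can look like, here is a minimal sketch. It is not the actual Platypush code. The function names (`frames_from_chunks`, `detect_hotwords`, `microphone_chunks`, `listen`) are made up for illustration, and it assumes the `sounddevice` and `pvporcupine` packages are installed, a valid Picovoice access key is available, and the audio is 16-bit mono PCM. The idea is simply to regroup the audio coming from the microphone into frames of the size the engine expects, and to feed them to the detector one at a time.

```python
import struct
from typing import Callable, Iterable, Iterator, List, Sequence


def frames_from_chunks(chunks: Iterable[bytes], frame_length: int) -> Iterator[List[int]]:
    """Regroup arbitrarily sized chunks of 16-bit mono PCM into fixed-size frames."""
    frame_bytes = frame_length * 2  # 2 bytes per int16 sample
    pending = b""
    for chunk in chunks:
        pending += chunk
        while len(pending) >= frame_bytes:
            frame, pending = pending[:frame_bytes], pending[frame_bytes:]
            # Native byte order, which is what the sound card delivers
            yield list(struct.unpack(f"{frame_length}h", frame))


def detect_hotwords(
    chunks: Iterable[bytes],
    process: Callable[[Sequence[int]], int],
    frame_length: int,
) -> Iterator[int]:
    """Yield the index of each detected keyword.

    `process` is any callable that returns a keyword index, or a negative
    number when nothing was detected (the convention used by Porcupine).
    """
    for frame in frames_from_chunks(chunks, frame_length):
        index = process(frame)
        if index >= 0:
            yield index


def microphone_chunks(sample_rate: int, frame_length: int) -> Iterator[bytes]:
    """Read raw 16-bit mono audio from the default input device, forever."""
    import sounddevice as sd  # imported lazily: needs PortAudio and a microphone

    with sd.RawInputStream(
        samplerate=sample_rate, blocksize=frame_length, channels=1, dtype="int16"
    ) as stream:
        while True:
            data, _overflowed = stream.read(frame_length)
            yield bytes(data)


def listen(access_key: str, keywords: Sequence[str]) -> None:
    """Print a message every time one of the keywords is heard."""
    import pvporcupine  # imported lazily: needs a Picovoice access key

    porcupine = pvporcupine.create(access_key=access_key, keywords=list(keywords))
    try:
        audio = microphone_chunks(porcupine.sample_rate, porcupine.frame_length)
        for index in detect_hotwords(audio, porcupine.process, porcupine.frame_length):
            print(f"Hotword detected: {keywords[index]}")
    finally:
        porcupine.delete()
```

Something like `listen("YOUR_ACCESS_KEY", ["porcupine"])` would then start listening (the key is a placeholder). Keeping the framing logic separate from the audio and engine imports also means it can be tested without a microphone or an account.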

The accuracy is comparable to Google Assistant's, while supporting many more hotwords and being completely offline. Latency is very low, and the CPU usage is minimal even on a Raspberry Pi 4.

I also like the modular architecture of the project. You can use single components (Porcupine for hotword detection, Cheetah for real-time speech-to-text from an audio stream, Leopard for transcription of recorded speech, Rhino for intent parsing...) in order to customize your assistant with the features that you want.
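To give an idea of how independent components like these can be combined, here is a small, purely illustrative sketch. The `Assistant` class and its two callables are hypothetical: they stand in for a hotword engine and an intent engine, and they don't mirror the exact Picovoice or Platypush APIs. Audio frames go to the hotword detector until it fires, then to the intent parser until it returns a result.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence

Frame = Sequence[int]


@dataclass
class Assistant:
    # Hypothetical hotword engine: returns a keyword index, or -1 if none
    detect_hotword: Callable[[Frame], int]
    # Hypothetical intent engine: returns an intent name once it has
    # a final result, or None while it still needs more audio
    parse_intent: Callable[[Frame], Optional[str]]
    listening: bool = False
    events: List[str] = field(default_factory=list)

    def feed(self, frame: Frame) -> None:
        """Route one audio frame to whichever component is currently active."""
        if not self.listening:
            if self.detect_hotword(frame) >= 0:
                self.listening = True
                self.events.append("hotword")
        else:
            intent = self.parse_intent(frame)
            if intent is not None:
                self.events.append(f"intent:{intent}")
                self.listening = False  # go back to waiting for the hotword
```

Since each stage is just a callable, swapping the intent parser for a speech-to-text engine (or leaving a stage out) doesn't require touching the others.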

I'm now putting together a new Picovoice integration for Platypush that, rather than having separate integrations for hotword detection and STT, wires everything together, enables intent detection and provides TTS rendering too (depending on the current state of Picovoice's TTS products).

I'll write a new blog article when ready. In the meantime, you can follow the progress on the Picovoice branch.