Hopes and promises for open-source voice assistants
At the end of 2022, Paulus Schoutsen declared 2023 "the year of voice" for Home Assistant, the popular open-source home-automation project that he founded nine years ago. The project's goal this year is to let users control their home with voice commands in their own language, using offline processing instead of sending data to the cloud. Offline voice control has been the holy grail of open-source home-automation systems for years. Several projects have tried and failed. But with Rhasspy's developer Mike Hansen spearheading Home Assistant's voice efforts, this time things could be different.
Science fiction shows and movies have sold us on the idea of spaceships and homes we can talk to. In recent years, voice control at home has become possible thanks to the so-called "smart speakers" from Google, Amazon, and Apple. However, there's nothing smart about these devices: their intelligence is almost completely in the cloud, where the user's voice recordings are processed and translated into sentences and meaning.
This is a complex and computationally intensive task, and these companies would have us believe that their cloud services are required for voice control. Of course this comes with downsides: users have no control over what happens with their voice recordings, which is a big privacy risk. But the problem lies even deeper: it makes no sense for users' voices to make a long detour through the internet just to turn on a light in the same room.
The challenges of offline voice control
Luckily, there have been some projects working on offline voice control for years now, some of them partially or even fully open source. Of course, a voice assistant running on a home server will not have the same performance as those general-purpose smart speakers making use of servers in the cloud. However, it's possible to have a reasonably working voice-control system, even on a Raspberry Pi, if its purpose is limited to a specific domain: opening and closing blinds, turning lights on and off, or answering questions such as what time it is or whether the door is closed.
A voice-control software stack consists of many parts. It all starts with wake-word detection: the voice assistant listens to an audio stream from a microphone and activates when it recognizes a wake word or phrase, such as "Hey Rhasspy". After activation, it records audio until it detects that the user has stopped talking.
After that, a speech-to-text module transcribes what the user said into text, such as "What's the temperature outside?". This text is processed by an intent parser, which figures out what the user means: the intent. The result is then processed by an intent handler, which reads the temperature from a sensor at home or gets it using a web API, then returns text like "It's 20 degrees outside". A text-to-speech module then converts this text into audio using a synthesized voice, which is played on the speaker to reply to the user's request.
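To make the flow concrete, here is a minimal Python sketch of such a pipeline. Everything in it is hypothetical: the object and method names (detect_wake_word(), record_until_silence(), transcribe(), and so on) are placeholders for whichever engines a real stack would plug in, not the API of any particular project.

    # Hypothetical voice-assistant pipeline; every stage would be backed by a real
    # wake-word, speech-to-text, intent-recognition, and text-to-speech engine.

    def handle_intent(intent: dict) -> str:
        """Intent handler: act on the recognized intent and compose a reply."""
        if intent.get("name") == "GetTemperature":
            temperature = 20  # would come from a sensor or a web API
            return f"It's {temperature} degrees outside"
        return "Sorry, I didn't understand that"

    def assistant_loop(mic, speaker, stt, intent_parser, tts):
        """Run the voice-control loop with pluggable components."""
        while True:
            if not mic.detect_wake_word("hey rhasspy"):   # wake-word detection
                continue
            audio = mic.record_until_silence()            # record until silence
            text = stt.transcribe(audio)                  # speech-to-text
            intent = intent_parser.parse(text)            # intent recognition
            reply = handle_intent(intent)                 # intent handling
            speaker.play(tts.synthesize(reply))           # text-to-speech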
It's a challenge to bring all of these parts together to create a working and user-friendly voice-control system. It is even more difficult if the system needs to be completely open source and work offline. This article takes a look at some projects that were promising in the past, but failed. It also examines the currently most promising one: Rhasspy as part of Home Assistant.
Broken promises
A few years ago, Snips was a promising startup in the voice domain. The French company was founded in 2013 by three applied mathematicians with a mission to put artificial intelligence (AI) into every device, while respecting privacy. In 2016 they decided to focus on creating a voice assistant that processes the user's voice recordings offline, thereby offering privacy by design. The Snips voice assistant was able to run on a Raspberry Pi; much of the software was open source, under the permissive Apache 2.0 license.
Snips managed to attract a small but dedicated community of people that created Snips skills, which are intent-handler scripts (often written in Python) that react to intents recognized by Snips. I was one of those people, believing that Snips would help us reach the vision of an open-source offline voice assistant. There was also a Home Assistant integration to trigger actions based on the user's voice commands.
Snips had (and still has) a FAQ on its web site that explained that the proprietary part of its software would become open source "soon", and that's why I decided to become active in the community. However, that FAQ item kept saying "soon" for a long time. In early 2019, I interviewed the Snips CTO for a magazine and asked about the company's plans to open the rest of its source code. His answers were vague and didn't give me much confidence that it would actually happen. That's when I decided to throw in the towel and leave the Snips community. My feeling turned out to be correct: at the end of 2019, Snips was acquired by Sonos. The company's web-based Snips Console, which was needed to train the voice assistant, was shut down a few months later.
Death by patents
Another promising project was Mycroft. The company started around the same time as the Snips voice assistant. The developers worked on free software for most parts of the voice stack and released their software under the GPL license in May 2016 (and relicensed under Apache 2.0 in October 2017). The company created its own smart-speaker hardware, the Mark 1 and later the Mark II. Both devices were funded through successful Kickstarter campaigns, although the delivery of the Mark II was severely delayed after problems with the company's hardware partner.
While Snips had the problem that essential parts of its voice stack weren't open source, Mycroft had another problem: by default, audio was sent to Google's speech-to-text service for speech recognition. Mycroft acted as a proxy, so Google only saw requests coming from Mycroft's servers, not from individual users; of course, the Mycroft servers still saw the original requests. Enterprising users could always swap the cloud-based speech-to-text service for an open-source offline solution such as Mozilla DeepSpeech or Kaldi.
Mycroft started to have some real problems in 2020 when Voice Tech Corporation filed a patent-infringement lawsuit. Eventually, all of Voice Tech's claims were invalidated, but earlier this year Mycroft CEO Michael Lewis published "The End of the Campaign" on Kickstarter with some bad news: the company didn't have the funds to continue meaningful operations. The future looked bleak:
Since starting here in early 2020 I’ve had to make some of the toughest decisions I’ve ever faced, and none more so than at the end of last year. At the end of November, just after the Mark II entered production, I was faced with the reality that I had to lay off most of the Mycroft staff. At present, our staff is two developers, one customer service agent and one attorney. Moreover, without immediate new investment, we will have to cease development by the end of the month.

[...]

So what went wrong? The single most expensive item that I could not predict was our ongoing litigation against the non-practicing patent entity that has never stopped trying to destroy us. If we had that million dollars we would be in a very different state right now.
Lewis also posted to the company blog at the end of January, with much of the same text as the Kickstarter article, but ending with: "There is much more to be said and many other topics that I will cover in future posts over the coming days." However, there has been no update to the blog at the time of this writing.
So is Mycroft dead now? It may live on in OpenVoiceOS, which started as a Linux-based operating system to run Mycroft. Over time, it forked Mycroft's core to add functionality that wasn't accepted upstream. A few weeks ago, the developers announced a plan to become a non-profit association under Dutch law and started a GoFundMe campaign to raise money to support their initiative. The OpenVoiceOS developers have been active in the Mycroft community for years; they have kept their fork compatible with Mycroft's skills, so a lot of users will probably move to OpenVoiceOS.
Rhasspy
Another project, and the one that I started contributing to after leaving the Snips community, is the MIT-licensed Rhasspy. It's an open-source, modular set of voice-assistant services that can function completely disconnected from the internet and works well with home-automation software. Moreover, it's not limited to English, but supports many human languages.
Rhasspy went through a couple of architectural rewrites, and has benefited a lot from the shutdown of Snips. Around 2019-2020, there was an influx of people from the Snips community searching for an alternative voice assistant. Mike Hansen, Rhasspy's developer, saw an opportunity and broke the monolithic Rhasspy Python application into multiple services. The result was Rhasspy 2.5, with services communicating over Message Queuing Telemetry Transport (MQTT) using an extended version of the Hermes protocol from Snips. This modular approach allowed plugging in different implementations for wake-word detection, speech-to-text, intent recognition, and text-to-speech.
The use of the Hermes protocol allowed contributors to write Rhasspy skills in any programming language that can speak MQTT and JSON. For example, I wrote a helper library to create voice apps for Rhasspy in Python. Hansen also created a Rhasspy add-on for Home Assistant. Another popular project in the community is ESP32-Rhasspy-Satellite, which lets users run a Rhasspy "satellite" on an ESP32 microcontroller board for audio input and output that is then streamed over MQTT to and from the Raspberry Pi or other computer running Rhasspy's core services.
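As an illustration of how small such a skill can be, here is a sketch of a "what time is it" handler in Python using the paho-mqtt library (version 1.x). It assumes a Rhasspy 2.5 setup with an intent named GetTime; the Hermes topic names and payload fields shown here should be double-checked against the Rhasspy documentation.

    import json
    from datetime import datetime

    import paho.mqtt.client as mqtt

    def on_connect(client, userdata, flags, rc):
        # Rhasspy publishes recognized intents on hermes/intent/<intentName>
        client.subscribe("hermes/intent/GetTime")

    def on_message(client, userdata, msg):
        payload = json.loads(msg.payload)
        reply = "It is " + datetime.now().strftime("%H:%M")
        # End the dialogue session; Rhasspy speaks the text on the site's speaker
        client.publish("hermes/dialogueManager/endSession",
                       json.dumps({"sessionId": payload["sessionId"],
                                   "text": reply}))

    client = mqtt.Client()
    client.on_connect = on_connect
    client.on_message = on_message
    client.connect("localhost", 1883)  # Rhasspy's (or an external) MQTT broker
    client.loop_forever()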
However, when Hansen joined Mycroft in November 2021, it looked like another open-source voice-assistant project might die a slow death. Mycroft hired Hansen to help get its Mark II smart speaker over the finish line; he explained on the Rhasspy forum that he would be working on reducing Mycroft's dependence on a cloud server. He also said that he wouldn't abandon Rhasspy, and that he would like to merge both communities at some point.
Understandably, Rhasspy's development slowed down after this. But a year later, Hansen made another surprise announcement: he would join Nabu Casa, the company that is developing Home Assistant, as Voice Engineering Lead to work on Rhasspy full time.
The year of voice
Rhasspy's revival as part of Home Assistant means that it's getting another architectural rewrite. Hansen is now working on Rhasspy 3, which he calls "a very early developer preview". The main goals are still the same: it works completely offline, has broad language support, and is completely customizable. There's a tutorial on how to set up Rhasspy 3, but many of the manual steps will be replaced with something more user-friendly in the future.
Instead of using MQTT, Rhasspy 3 has all of its services communicate over the new Wyoming protocol using standard input and output, which lowers the barrier for programs to talk to Rhasspy. Essentially, the Wyoming protocol is JSON Lines (JSONL) with an optional binary payload.
So, if Rhasspy needs to send chunks of audio data to a speech-to-text program, it sends a single line of JSON with a type field telling the receiving program that an audio chunk follows, an optional data field with parameters such as the sample rate and the number of channels, and a payload length. After this header, it sends the audio chunk of the given length. Of course, existing programs don't speak the Wyoming protocol, but small Python programs can be written as adapters. For example, a speech-to-text program that accepts raw audio on stdin and writes the recognized text to stdout can be used with the asr_adapter_raw2text.py adapter.
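As a rough sketch of what that looks like on the wire, the following Python snippet writes a single audio-chunk event to standard output. The field and event names (type, data, payload_length, audio-chunk, rate, width, channels) follow the description above but are assumptions here; the Rhasspy 3 source is the authoritative reference.

    import json
    import sys

    def write_event(event_type: str, data: dict, payload: bytes = b"") -> None:
        """Write a Wyoming-style event: one JSON header line, then the raw payload."""
        header = {
            "type": event_type,              # what kind of event this is
            "data": data,                    # parameters such as the sample rate
            "payload_length": len(payload),  # number of binary bytes that follow
        }
        sys.stdout.buffer.write(json.dumps(header).encode("utf-8") + b"\n")
        sys.stdout.buffer.write(payload)
        sys.stdout.buffer.flush()

    # Send one second of silence as 16-bit mono audio at 16kHz
    write_event("audio-chunk",
                {"rate": 16000, "width": 2, "channels": 1},
                payload=bytes(16000 * 2))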
Voice commands at home
Rhasspy 3 is focused on Home Assistant's use case: voice commands to control devices. For intent handling, Rhasspy 3 uses the Assist feature introduced in Home Assistant 2023.2. Intent recognition is powered by HassIL, which matches the user's input against sentence templates. For example, a template such as:

(turn | switch) on [the] {area} lights

matches inputs like "turn on kitchen lights" and "switch on the kitchen lights".
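To make the template syntax concrete, here is a small, self-contained Python sketch that expands such a template into a regular expression and extracts the {area} slot. It is only an illustration of the idea, not how HassIL is actually implemented.

    import re

    def template_to_regex(template: str) -> str:
        """Convert a simplified sentence template into a regular expression:
        (a | b) becomes alternatives, [word] becomes optional, {slot} a named group."""
        regex = re.sub(r"\(([^)]*)\)",
                       lambda m: "(?:" + "|".join(w.strip() for w in m.group(1).split("|")) + ")",
                       template)
        regex = re.sub(r"\[([^\]]*)\] ", r"(?:\1 )?", regex)   # optional words
        regex = re.sub(r"\{(\w+)\}", r"(?P<\1>.+?)", regex)    # slots
        return "^" + regex + "$"

    pattern = re.compile(template_to_regex("(turn | switch) on [the] {area} lights"))

    for sentence in ("turn on kitchen lights", "switch on the kitchen lights"):
        match = pattern.match(sentence)
        print(sentence, "->", match.group("area") if match else "no match")

Both example sentences match and yield "kitchen" as the area slot.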
Home Assistant supports a small list of built-in intents. For example, the HassTurnOn intent turns on a device. If the template above is defined for the HassTurnOn intent, Assist's intent handler will turn on the kitchen lights, provided the user's Home Assistant configuration has lights defined and assigned to the kitchen area.
The Home Assistant Intents project aims to create sentence templates for all possible home-automation intents in every possible language, released under the CC-BY-4.0 license. At the moment, 156 people have contributed sentences for 52 languages. The project started with support for six intents to keep the work manageable; the goal is to increase this number gradually.
Conclusion
An offline, open-source voice assistant in one's own language is important to a lot of people. However, the technical challenges of building one are substantial. Snips and Mycroft were able to attract a community, but failed to build a successful business. Rhasspy was quite successful among the small crowd of people who like to tinker and build their own voice assistant around the flexible services that the project offered, but the core was developed mainly by one person and the project wasn't backed financially. The good news is that Rhasspy will be tightly integrated with Home Assistant, which is one of the most active projects on GitHub, and its development is funded by Nabu Casa. So there's hope that we will finally reach those science-fiction dreams: controlling our homes with a user-friendly voice assistant that is both privacy-respecting and made from open-source software.