
Mycroft: an open-source voice assistant

By John Coggeshall
July 24, 2020

Mycroft is a free and open-source software project aimed at providing voice-assistant technology, licensed under the Apache 2.0 license. It is an interesting alternative to closed-source commercial offerings such as Amazon Alexa, Google Home, or Apple Siri. Use of voice assistants has become common among consumers, but the privacy concerns surrounding them are far-reaching. There have been multiple instances of law enforcement's interest in the data these devices produce for use against their owners. Mycroft claims to offer a privacy-respecting, open-source alternative, giving users a choice on how much of their personal data is shared and with whom.

The Mycroft project is backed by the Mycroft AI company. The company was originally funded by a successful one-million-dollar crowdfunding campaign involving over 1,500 supporters. In recent years, it has developed two consumer-focused "smart speaker" devices: the Mark 1 and Mark 2. Both devices were funded through successful Kickstarter campaigns, with the most recent Mark 2 raising $394,572 against a $50,000 goal.

In the press, the company has indicated that it intends to focus on the enterprise market for its commercial offerings, while keeping the project free for individual users and developers. On the subject of developers, contributors are expected to sign a contributor license agreement (CLA) to participate in the project. The actual CLA was unavailable at the time of publication, but the project says that it grants Mycroft a license to the contributed code while the developer retains ownership of the contribution.

Voice-assistant technology is complicated, with many different components that must come together to form a usable product. There is far too much to cover in a single article, so this is the first in a series on the project; it provides a high-level understanding of the project's components.

Mycroft is broken down into multiple modules: core, wake word, speech to text, intent parsing, and text to speech. Except for core, the functionality provided by each module supports a variety of implementations, with the Mycroft project providing its own implementation(s) for each. Estimating the number of contributors to the project as a whole is a challenge, as each sub-project attracts its own set of contributors. Looking at core, GitHub reports that the project has had 128 releases from its 139 contributors, with the latest release occurring at the end of May 2020.

The modular architecture of the project gives maximum flexibility to the end user, who can choose to change the text-to-speech provider (for example) if the default doesn't meet their needs. In a way similar to commercial products like Alexa, Mycroft core exposes an API for the development of "skills". These skills are plugins that can integrate with third-party services to do everything from playing music to turning on light bulbs. Anyone can write a skill and release it for others to use, with a reasonable collection of skills provided in the Mycroft Marketplace.
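
To give a sense of the skills API, here is a minimal sketch of a skill; the skill name, the canned response, and the "weather.intent" file of example phrases are invented for the example:

    from mycroft import MycroftSkill, intent_file_handler

    class WeatherSkill(MycroftSkill):
        # Matches utterances resembling the examples listed in the
        # skill's weather.intent file (e.g. "what is the weather")
        @intent_file_handler('weather.intent')
        def handle_weather(self, message):
            # A real skill would query a weather service here
            self.speak('It is sunny and 72 degrees.')

    def create_skill():
        # mycroft-core calls this factory function when loading the skill
        return WeatherSkill()

A skill is simply a directory dropped into the device's skills directory (typically /opt/mycroft/skills), from which core discovers and loads it.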

To best understand how Mycroft works, it may be easiest to simply walk through the various actions of a single command: "Hey Mycroft, tell me about the weather".

At the heart is the Mycroft core, which provides the fundamental mechanisms that power the entire architecture. This includes audio output services, a message bus, and the skills-integration API. It works with three modules to fulfill the request: wake word, speech to text, and the intent parser.
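
Everything in this pipeline communicates over the message bus, a WebSocket server (port 8181, path /core by default) that carries JSON messages with a "type" and a "data" payload. As a minimal sketch of eavesdropping on the bus from Python, assuming a device reachable as mycroft.local and the websocket-client package:

    import json
    from websocket import create_connection

    # Connect to the (unauthenticated) Mycroft message bus and print
    # every "speak" message that crosses it.
    bus = create_connection('ws://mycroft.local:8181/core')
    while True:
        message = json.loads(bus.recv())
        if message.get('type') == 'speak':
            print('Mycroft says:', message['data']['utterance'])

Note that the bus is unauthenticated, so it should not be exposed beyond a trusted network.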

Wake words

The first step in the processing of any command from a voice assistant is the "wake word". This is the word or phrase that triggers Mycroft to start recording audio to process as a command. This detection happens on the local device. Mycroft offers two options for wake-word detection: PocketSphinx, and the project's own Precise implementation (the default).

The differences between PocketSphinx and Precise have to do with how the wake word itself is detected. PocketSphinx is based on English speech and takes a speech-to-text (STT) approach to identifying the wake word. Precise, on the other hand, is a neural network implementation that is trained to recognize the sound of the wake word rather than a specific word (making it suitable for more than English). The Mycroft project provides pre-trained Precise models for "Hey Mycroft", "Christopher", "Hey Ezra", and "Hey Jarvis" as wake words. Creating your own wake word is certainly possible, but requires extensive audio samples in order to train the model.
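
Precise can also be used standalone via the project's precise-runner Python package. A minimal sketch, assuming the precise-engine binary and a trained hey-mycroft.pb model are present locally:

    from time import sleep
    from precise_runner import PreciseEngine, PreciseRunner

    def on_wake():
        # Mycroft would start recording the command that follows here
        print('Wake word detected')

    engine = PreciseEngine('precise-engine/precise-engine',  # engine binary
                           'hey-mycroft.pb')                 # trained model
    runner = PreciseRunner(engine, on_activation=on_wake)
    runner.start()  # listens to the microphone in a background thread

    while True:
        sleep(10)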

Speech to text

Once the wake word has activated Mycroft, the microphone records the audio that follows until it detects that the user has stopped talking. This audio is then handed to the speech-to-text module, which transcribes it into text for further processing. This aspect of voice assistants is by far the most contentious from a privacy perspective, as these audio files can (and often do) include private data. Since this audio data is typically sent to a third-party server to be processed, both it and the corresponding transcription are ripe for abuse, compromise, or other uses undesired by the user.

Unfortunately, Mycroft does not currently provide an ideal option in this regard. In fact, Mycroft itself does not provide a speech-to-text solution at all. Instead, by default, Mycroft uses Google's speech-to-text services proxied through Mycroft servers. Per the documentation on this module, Mycroft acts as a proxy for Google in part to provide a layer of privacy to end users:

In order to provide an additional layer of privacy for our users, we proxy all STT requests through Mycroft's servers. This prevents Google's service from profiling Mycroft users or connecting voice recordings to their identities. Only the voice recording is sent to Google, no other identifying information is included in the request. Therefore Google's STT service does not know if an individual person is making thousands of requests, or if thousands of people are making a small number of requests each.

While anonymizing requests to Google to avoid providing the user's IP address is a step in the right direction, it is far from a perfect solution; the audio of a user's voice alone could be enough to make an identification. According to the project, there simply is not a reliable open STT alternative that is on par with the closed-source options; hopefully that won't remain the case for long. As it turns out, privacy is not the only reason Mycroft proxies these requests: Mycroft is working with Mozilla's DeepSpeech project to build an open-source alternative to Google's STT service. Users who opt in can contribute their audio to help train the DeepSpeech AI model. To the project's credit, this is clearly spelled out for the user during device registration:

The data we collect from those that opt in to our open dataset is not sold to anyone. [...] Mycroft's voices and services can only improve with your help. By joining our open dataset, you agree to allow Mycroft AI to collect data related to your interactions with devices running Mycroft's voice assistant software. We pledge to use this contribution in a responsible way. [...] Your data will also be made available to other researchers in the voice AI space with values that align with our own, like Mozilla Common Voice. As part of their agreement with Mycroft AI to access this data, they will be required to honor your request to remove any trace of your contributions if you decide to opt out.

If a user agrees, their audio data is added to Mycroft's open dataset and is sent to the DeepSpeech project, in addition to being sent to the selected STT service. According to Mycroft AI founder Joshua Montgomery in a Reddit AMA: "[...] we only use data from people who explicitly opt-in, this represents about 15% of our customers." Since Mycroft is used both by individual users (for free) and by paying enterprise customers, it is unclear whether the 15% opt-in figure cited by Montgomery refers only to paying customers or to the entire user base.

It is noteworthy that, in the same AMA, Montgomery was explicitly asked: "[...] do you have a 'warrant canary' for any 'National Security letters/warrants' you may have received instructing you to turn over private user data to a 3-letter agency under the PATRIOT Act?" He declined to respond. Since Mycroft AI is a US-based company, this may be of importance to privacy-conscious users.

While the project doesn't feel that DeepSpeech is ready, using DeepSpeech is supported. DeepSpeech isn't the only alternative to Google's STT available, either. There is support for many options, starting with Kaldi, which is open source, along with a handful of proprietary providers: GoVivace, Houndify, IBM Cloud, Microsoft Azure, and Wit.ai.
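
The STT backend is selected through the "module" setting in the "stt" section of the mycroft.conf configuration file. For the curious, here is a minimal sketch of what local transcription with DeepSpeech itself looks like, using the deepspeech Python package; the model and audio file names are assumptions, and DeepSpeech expects 16-bit, 16kHz mono audio:

    import wave
    import numpy as np
    import deepspeech

    # Load a pre-trained DeepSpeech model (path is an assumption)
    model = deepspeech.Model('deepspeech-0.7.4-models.pbmm')

    # Read raw 16-bit samples from a recorded command
    with wave.open('command.wav', 'rb') as wav:
        frames = wav.readframes(wav.getnframes())
    audio = np.frombuffer(frames, dtype=np.int16)

    print(model.stt(audio))  # prints the transcription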

Intent parser

Once audio has been transcribed into text, it must then be processed for intent. That is to say, Mycroft now has to process the command you gave it and figure out what you want it to do. This can be complicated in its own right, as even with a perfect transcription humans have many different ways of expressing an intent: "Hey Mycroft, tell me about the weather", "Hey Mycroft, what's the weather?", "Hey Mycroft, what is it like outside?"

To parse intents, Mycroft has two separate projects. The default, Adapt, provides a more programmatic approach to intent parsing, while Padatious is a neural-network-based parser. The intention of the Mycroft project is to eventually replace Adapt with Padatious, which is reported to be under active development; it is noteworthy, however, that the project has had only six commits this year.

While both Adapt and Padatious accomplish the same goal, they do so in very different ways. Adapt (implemented in Python) is designed to run on embedded systems with limited resources, meaning it can run directly on the local device. It is a keyword-based intent parser, so its capabilities are somewhat limited — though it can be useful to implement Mycroft skills that have a small number of commands (in Mycroft terms, "Utterances"). Padatious, on the other hand, takes an AI-based approach to intent parsing where the entire sentence is examined to ascertain the intent. In this approach, a neural network is trained on whole phrases and the generated model is used to parse the intent. This involves more up-front work training the model for a skill but, according to the documentation, Padatious models "require a relatively small amount of data" to train. Like Adapt, Padatious is implemented in Python.
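
To give a flavor of the difference, here are minimal sketches of each parser using its published API; the vocabulary and example phrases are invented for illustration. Adapt matches registered keyword vocabularies:

    from adapt.intent import IntentBuilder
    from adapt.engine import IntentDeterminationEngine

    engine = IntentDeterminationEngine()

    # Register vocabulary, then an intent that requires it
    engine.register_entity('weather', 'WeatherKeyword')
    engine.register_intent_parser(
        IntentBuilder('WeatherIntent').require('WeatherKeyword').build())

    for intent in engine.determine_intent('what is the weather like'):
        if intent and intent.get('confidence') > 0:
            print(intent['intent_type'])  # -> WeatherIntent

Padatious, by contrast, trains a small neural network on whole example sentences (it requires the FANN neural-network library to be installed):

    from padatious import IntentContainer

    container = IntentContainer('intent_cache')
    container.add_intent('weather', ['what is the weather',
                                     'tell me about the weather',
                                     'what is it like outside'])
    container.train()  # trains the model, cached on disk
    print(container.calc_intent('how is the weather today').name)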

Once an intent is parsed, regardless of the technology used, it is then mapped (hopefully) to a skill that knows how to execute the intended action — such as fetching and responding with the latest weather for the region.

Text to speech

The text-to-speech module provided by Mycroft is the final key swappable implementation of the Mycroft project. This module, as its name implies, takes text and converts it into audio to be played. Generally this would be in response to a command generated by the intent parser but, unlike other commercial offerings (e.g. Amazon Alexa), it can also be used to "push" audio to the voice assistant without it being strictly in response to a wake-word command. This can be useful when integrating Mycroft with a project like Home Assistant, as a way to provide audio notifications.
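
Because the message bus is writable as well as readable, pushing speech to a device amounts to emitting a "speak" message on the bus. A minimal sketch using the websocket-client package; the host name and utterance are assumptions:

    import json
    from websocket import create_connection

    # Assumes a Mycroft device reachable as mycroft.local; the message
    # bus listens on port 8181 at the /core path by default.
    ws = create_connection('ws://mycroft.local:8181/core')
    ws.send(json.dumps({
        'type': 'speak',
        'data': {'utterance': 'The laundry is done.'},
    }))
    ws.close()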

Mycroft offers multiple options for TTS; specifically, it has the Mimic and Mimic2 projects. The default Mimic is a TTS engine based on Flite that runs directly on the device to synthesize a voice from text by concatenating sounds together to form words. This produces a less-than-natural voice, but in testing it wasn't bad. Mimic2, on the other hand, is a cloud-based TTS engine based on Tacotron that uses a neural network to produce a much higher-quality synthesized voice. Mycroft does support using Google's TTS offering, though it is not enabled by default. Using Google's TTS service results in a voice synthesis very similar to that found in Google Home devices.

More to come

Mycroft's expansive portfolio of projects is impressive, but perhaps even more impressive is how easy it is to get started. The company has developed two consumer-focused devices (one on pre-order) that are built on the project, but it also provides the ability to build your own Raspberry Pi or Linux-based equivalent. In our next installment, we will take a closer look at the project in action to see how well it stacks up against its proprietary competition.




Mycroft: an open-source voice assistant

Posted Jul 25, 2020 11:45 UTC (Sat) by jonquark (guest, #45554) (1 response)

As someone who has taken part in Mycroft's crowdfunding rounds by preordering[1] a Mark II device, I think this article could be clearer.

Their second consumer device has been very significantly delayed and whilst early prototypes exist - those of us who took part in the crowd funding for the Mark II won't get a device in 2020 and (from memory) I think the original estimated delivery for the device was in 2018. So the grammatical tense in "In recent years, it has developed two consumer-focused "smart speaker" devices" doesn't read quite correctly to me - it is developing the second device.

Despite the delays, I'm really hopeful that Mycroft can develop into a sustainable business that can spearhead the development of a cutting-edge open-source home assistant.

[1] (or whatever the word is in a crowdfunding campaign with the level of risk involved)

Mycroft: an open-source voice assistant

Posted Jul 27, 2020 5:11 UTC (Mon) by da4089 (subscriber, #1195)

The company has also replaced its CEO and significantly changed its focus: from being essentially an "open Alexa device" to the current focus on enterprise software sales.

The Mark II device has evolved from being a custom-built platform, building on their experience with the Mark I, to the current plans where it is essentially a Raspberry Pi case with a custom microphone array board.

I'm increasingly dubious that they'll deliver anything approaching what was originally claimed, let alone a device that's competitive with commercial products that have enjoyed several additional years of development while the Mycroft hardware has stagnated.

Mycroft: an open-source voice assistant

Posted Jul 25, 2020 14:42 UTC (Sat) by FloatingBoater (subscriber, #67237)

To me, the key difference between the Mycroft platform and other voice assistants is the open-source nature of the whole outfit.

To connect a Mycroft Mark One (a Raspberry Pi with an Arduino I/O board for a face) to openHAB home automation takes an MQTT server on the local LAN, not a cloud server thousands of miles away.

To fix an issue with a skill (such as when the core software added support for account-based config), I could SSH into the RPi and edit the Python code directly with 'vim __init__.py'.

The original Mark One was designed for hardware hacking with multiple I/O pins broken out on the back of the unit.

Is the result as polished as a Google Nest Max or Alexa? Nope.
Which platform do I get the most use out of? Mycroft (outside of watching YouTube content on a kitchen counter device).

The still-under-development Mark Two seems to have had a few stumbles, apparently with sub-contractors burning time and money on custom embedded uC and LCD screen drivers that didn't pan out. The latest iteration is back to an RPi with a series of custom PCBs.

Surprisingly, as well as open-sourcing the software stack (build your own PiCroft), Joshua Montgomery has started publishing engineering team meetings on YouTube. The commitment to openness is stellar, as the three sessions I've watched are warts-and-all. I've been engineering software for 30 years, so watching how someone else manages complex issues with scarce resources is fascinating - even if occasionally you'd love to come off mute and interject! :-)

Mycroft: an open-source voice assistant

Posted Jul 25, 2020 20:19 UTC (Sat) by Cyberax (✭ supporter ✭, #52523)

I've been following Mycroft development for a few years and I have a Mycroft device (and pre-ordered the second one). It's definitely not polished at all, but it works fine for simple commands like "open/close window blinds".

One area where I'm looking for improvements is multi-language support. Right now it's in shambles :(

Mycroft: an open-source voice assistant

Posted Jul 26, 2020 19:01 UTC (Sun) by willy (subscriber, #9762)

Kathy Reid had a great talk on Mycroft at LCA 2019 called "Open Source AI and speech recognition". You can find it on YouTube and probably a few other places.

speech-to-text privacy

Posted Jul 26, 2020 19:18 UTC (Sun) by sumanah (guest, #59891)

Thanks so much for this article. I keep meaning to look into automated virtual assistants like Mycroft and Almond (there was a GUADEC talk about Almond this year) and this article helps me understand the privacy tradeoffs involved with Mycroft's speech-to-text functionality. When I think about the privacy concerns involved in using a microphone-capable virtual assistant, I ask: do I trust that, by default, the device is only listening for the wakeword? who gets to hear the raw audio of my requests/instructions, and read/analyze the parsed text of them and what data/response is elicited? So I appreciate the details in this article.

Mycroft: an open-source voice assistant

Posted Jul 27, 2020 3:27 UTC (Mon) by k8to (guest, #15413) (7 responses)

I don't really understand why voice recognition can't be done locally. Is it really that hard?

Mycroft: an open-source voice assistant

Posted Jul 27, 2020 4:03 UTC (Mon) by gdt (subscriber, #6284)

Voice recognition via AI techniques has a number of "embarrassingly parallel" steps. So if you can offload the problem to a large number of servers then responses are timely. As the parameters of the STT problem become better understood, it becomes economical for GPU designers (who are increasingly "parallel processor" designers) to offer tuned hardware for local processing. At the moment research groups are getting good results from high-end NVIDIA GPUs, which have instruction sets for highly-parallel tensor processing.

Mycroft: an open-source voice assistant

Posted Jul 27, 2020 5:50 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) (3 responses)

Yes, it's very hard. For anything realistic and near real-time you need a powerful GPU and probably several of them. And then you need to do command extraction.

Mycroft: an open-source voice assistant

Posted Jul 29, 2020 11:39 UTC (Wed) by nilsmeyer (guest, #122604) (2 responses)

How is this economical? Looking at the price of commercial devices (some of which are even free), it does not seem very profitable considering the (stated) compute requirements.

Mycroft: an open-source voice assistant

Posted Jul 29, 2020 12:45 UTC (Wed) by excors (subscriber, #95769)

With cloud processing, the cost of the hardware is amortised over a large number of users. Maybe it requires multiple GPUs to process an utterance and compute a response before the user's patience runs out; but the average user is only doing that a few times per day, so many thousands of users can share that set of GPUs, which is a big advantage over local processing. And those GPUs can remain permanently powered on, with gigabytes of data in RAM for instant access, which is infeasible in a local device that is mostly idle and needs low idle power consumption.

Also, why assume it's economical? Amazon, Google and Apple can afford to spend literally billions of dollars to gain market share, based on a prediction that the market may become highly profitable a decade later (after further hardware/software advancements to reduce costs, and new business models to increase revenues). Better to risk wasting a billion becoming the leader of a new market that ultimately fails, than to risk missing out on a new market that ends up being wildly successful.

And the voice assistants never need to be profitable by themselves, they just need to not lose too much money while driving users towards those companies' offerings in other markets. Amazon will sell you a smart speaker for $0.99 with a subscription to their $10/month music subscription service, and it also integrates with their e-commerce site ("Alexa, buy a television"), and other companies can easily implement Alexa skills on AWS servers, etc. Google does similar with YouTube Music, Apple wants to encourage you to buy iPhones and apps, etc. Voice assistants might never make economical sense as a standalone service, because the technology is so complicated that no user would be willing to pay what it really costs; they might only make sense as an interface for other high-profit-margin services.

Mycroft: an open-source voice assistant

Posted Jul 29, 2020 16:43 UTC (Wed) by Cyberax (✭ supporter ✭, #52523)

You might use your voice assistant maybe 10 times a day, resulting in maybe a minute or so of total CPU/GPU time. So a single Alexa server can probably serve at least a thousand or so users.

And if it costs $10,000 then that works out to a mere $10 per user, which can easily be built into the cost of the hardware or recouped by selling your data to the highest ad bidder.

Mycroft: an open-source voice assistant

Posted Jul 30, 2020 14:15 UTC (Thu) by robbe (guest, #16131) (1 response)

Mozilla DeepSpeech claims to run faster than real time on an RPi 4.
https://hacks.mozilla.org/2019/12/deepspeech-0-6-mozillas...

If I understand correctly, this does *not* use the Raspi’s GPU core.

Mycroft: an open-source voice assistant

Posted Aug 6, 2020 5:31 UTC (Thu) by donbarry (guest, #10485)

Not just faster than real time, but on one core, leaving the other three available for making sense of the data.

I haven't played with it yet, but am looking forward to it in the coming year. Mycroft would probably find more interested developers if it ditched the cloud, as it could really distinguish itself as a privacy-respecting local provider from start to finish.

Mycroft: an open-source voice assistant

Posted Jul 30, 2020 13:25 UTC (Thu) by kpfleming (subscriber, #23250)

"The actual CLA was unavailable at the time of publication."

One of the best ways to discourage new contributors to any open source project is to have a high-friction process for making contributions, but this really goes far beyond 'high friction'.


Copyright © 2020, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds