Mycroft: an open-source voice assistant
Mycroft is a free and open-source software project aimed at providing voice-assistant technology, licensed under the Apache 2.0 license. It is an interesting alternative to closed-source commercial offerings such as Amazon Alexa, Google Home, or Apple Siri. Use of voice assistants has become common among consumers, but the privacy concerns surrounding them are far-reaching. There have been multiple instances of law enforcement's interest in the data these devices produce for use against their owners. Mycroft claims to offer a privacy-respecting, open-source alternative, giving users a choice on how much of their personal data is shared and with whom.
The Mycroft project is backed by the Mycroft AI company. The company was originally funded by a successful one-million-dollar crowdfunding campaign involving over 1,500 supporters. In recent years, it has developed two consumer-focused "smart speaker" devices: the Mark 1 and Mark 2. Both devices were funded through successful Kickstarter campaigns, with the most recent Mark 2 raising $394,572 against a $50,000 goal.
In the press, the company has indicated that it intends to focus on the enterprise market for its commercial offerings, while keeping the project free for individual users and developers. On the subject of developers, contributors are expected to sign a contributor license agreement (CLA) in order to participate in the project. The actual CLA was unavailable at the time of publication, but the project claims that it grants the project a license to the contributed code while the developer retains ownership of the contribution.
Voice-assistant technology is complicated, with many different components that must come together to form a usable product. There is far too much to cover in a single article, so this one, the first in a series on the project, provides a high-level understanding of the project's components.
Mycroft is broken down into multiple modules: core, wake word, speech to text, intent parsing, and text to speech. Except for core, the functionality provided by each module can come from a variety of implementations, with the Mycroft project providing its own implementation(s) for each. Estimating the number of contributors to the project as a whole is a challenge, since each sub-project attracts its own set of contributors. Looking at core, GitHub reports that the project has had 128 releases from its 139 contributors, with the latest release occurring at the end of May 2020.
The modular architecture of the project gives the end user maximum flexibility; users can change the text-to-speech provider (for example) if the default doesn't meet their needs. In a way similar to commercial products like Alexa, Mycroft core exposes an API for the development of "skills". These skills are plugins that can integrate with third-party services to do everything from playing music to turning on light bulbs. Anyone can write a skill and release it for others to use, and a reasonable collection of skills is provided in the Mycroft Marketplace.
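To give a sense of what that API looks like, below is a minimal, hypothetical skill sketched against the mycroft-core skill interface (in Python, the language the project itself uses); the skill name and the weather.intent and weather.dialog resource files are illustrative, not anything shipped by the project:

    from mycroft import MycroftSkill, intent_file_handler

    class WeatherDemoSkill(MycroftSkill):
        """Toy skill: answer weather questions with a canned response."""

        @intent_file_handler('weather.intent')  # example phrases shipped with the skill
        def handle_weather(self, message):
            # 'weather.dialog' would hold the phrases Mycroft may speak in reply
            self.speak_dialog('weather', data={'condition': 'sunny'})

    def create_skill():
        # Entry point that mycroft-core uses to instantiate the skill
        return WeatherDemoSkill()

Roughly speaking, a skill is just a directory containing this Python file plus its intent and dialog resources; the skills service loads it and wires the intent up to the parser described below.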
To best understand how Mycroft works, it may be easiest to simply walk through the various steps in processing a single command: "Hey Mycroft, tell me about the weather".
At the heart is the Mycroft core, which provides the fundamental mechanisms that power the entire architecture. This includes audio output services, a message bus, and the skills-integration API. It works with three modules to fulfill the request: wake word, speech to text, and the intent parser.
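The message bus deserves special mention, since the other components communicate through it. It is a WebSocket server (ws://localhost:8181/core by default) and the project publishes a small Python client for it; assuming the mycroft-messagebus-client package is installed, a rough sketch of eavesdropping on the transcriptions flowing toward the intent parser might look like this:

    from time import sleep
    from mycroft_bus_client import MessageBusClient

    def show(message):
        # Each transcribed command arrives as a 'recognizer_loop:utterance' message
        print('Heard:', message.data.get('utterances'))

    client = MessageBusClient()     # connects to ws://localhost:8181/core by default
    client.run_in_thread()          # handle bus traffic in a background thread
    client.on('recognizer_loop:utterance', show)

    while True:                     # keep the script alive while listening
        sleep(1)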
Wake words
The first step in the processing of any command by a voice assistant is the "wake word". This is the word or phrase that, when detected, triggers Mycroft to start recording audio to process as a command. The detection happens on the local device. Mycroft offers two options for wake-word detection: PocketSphinx and the project's own Precise implementation (the default).
The differences between PocketSphinx and Precise have to do with how the wake word itself is detected. PocketSphinx is based on English speech and takes a speech-to-text (STT) approach to identifying the wake word. Precise, on the other hand, is a neural network implementation that is trained to recognize the sound of the wake word rather than a specific word (making it suitable for more than English). The Mycroft project provides pre-trained Precise models for "Hey Mycroft", "Christopher", "Hey Ezra", and "Hey Jarvis" as wake words. Creating your own wake word is certainly possible, but requires extensive audio samples in order to train the model.
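Precise can also be run on its own, outside of Mycroft, by way of the project's precise-runner Python package. A rough sketch along the lines of its documentation follows; the path to the precise-engine binary and the "hey mycroft" model file are assumptions about a local installation:

    from time import sleep
    from precise_runner import PreciseEngine, PreciseRunner

    # Both paths are assumptions: a locally unpacked precise-engine binary and a
    # downloaded model for the "hey mycroft" wake word.
    engine = PreciseEngine('precise-engine/precise-engine', 'hey-mycroft.pb')
    runner = PreciseRunner(engine, on_activation=lambda: print('Wake word heard'))
    runner.start()

    while True:          # the runner listens to the microphone in the background
        sleep(10)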
Speech to text
Once the wake word has activated Mycroft, the microphone records the audio that follows until it detects that the user has stopped talking. This audio is then processed by the speech-to-text module, which transcribes it into text to be processed further. This aspect of voice assistants is by far the most contentious from a privacy perspective, as these audio recordings can (and often do) include private data. Since the audio is typically sent to a third-party server for processing, both it and the corresponding transcription are ripe for abuse, compromise, or other uses undesired by the user.
Unfortunately, Mycroft does not currently provide an ideal option in this regard. In fact, Mycroft itself does not provide a speech-to-text solution at all. Instead, by default, Mycroft uses Google's speech-to-text service, proxied through Mycroft's servers. Per the documentation on this module, Mycroft acts as a proxy for Google in part to provide a layer of privacy to end users:
In order to provide an additional layer of privacy for our users, we proxy all STT requests through Mycroft's servers. This prevents Google's service from profiling Mycroft users or connecting voice recordings to their identities. Only the voice recording is sent to Google, no other identifying information is included in the request. Therefore Google's STT service does not know if an individual person is making thousands of requests, or if thousands of people are making a small number of requests each.
While anonymizing requests to Google by withholding the user's IP address is a step in the right direction, it is far from a perfect solution; the audio of a user's voice alone could be enough to make an identification. According to the project, there simply is not a reliable open STT alternative that is on par with the closed-source offerings; hopefully that will not remain the case for long. As it turns out, privacy is not the only reason Mycroft proxies these requests: Mycroft is working with Mozilla's DeepSpeech project to build an open-source alternative to Google's STT service. Users who opt in can contribute their audio to help train the DeepSpeech AI model. To the project's credit, this is clearly spelled out for the user during device registration:
The data we collect from those that opt in to our open dataset is not sold to anyone. [...] Mycroft's voices and services can only improve with your help. By joining our open dataset, you agree to allow Mycroft AI to collect data related to your interactions with devices running Mycroft's voice assistant software. We pledge to use this contribution in a responsible way. [...] Your data will also be made available to other researchers in the voice AI space with values that align with our own, like Mozilla Common Voice. As part of their agreement with Mycroft AI to access this data, they will be required to honor your request to remove any trace of your contributions if you decide to opt out.
If a user agrees, their audio data is added to Mycroft's open dataset and sent to the DeepSpeech project, in addition to being sent to the selected STT service. According to Mycroft AI founder Joshua Montgomery in a Reddit AMA: "[...] we only use data from people who explicitly opt-in, this represents about 15% of our customers." Since Mycroft is used by both individual users (for free) and paying enterprise customers, it is unclear whether the 15% opt-in figure cited by Montgomery refers only to actual customers or to the entire user base.

It is noteworthy that, when Montgomery was explicitly asked in the same AMA: "[...] do you have a 'warrant canary' for any 'National Security letters/warrants' you may have received instructing you to turn over private user data to a 3-letter agency under the PATRIOT Act?", he declined to respond. Since Mycroft AI is a US-based company, this may be of importance to privacy-conscious users.
While the project doesn't feel that DeepSpeech is ready, using DeepSpeech is supported. DeepSpeech isn't the only alternative to Google's STT available, either. There is support for many options, starting with Kaldi, which is open source, along with a handful of proprietary providers: GoVivace, Houndify, IBM Cloud, Microsoft Azure, and Wit.ai.
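For those who would rather experiment with a completely local STT path today, DeepSpeech ships a Python package of its own. A minimal sketch of transcribing a single 16kHz, 16-bit mono WAV file follows; the model and scorer file names are examples taken from the 0.7.x releases:

    import wave
    import numpy as np
    from deepspeech import Model

    # File names below are examples from the DeepSpeech 0.7.x releases
    ds = Model('deepspeech-0.7.4-models.pbmm')
    ds.enableExternalScorer('deepspeech-0.7.4-models.scorer')

    # DeepSpeech expects 16-bit, 16kHz, mono PCM samples
    with wave.open('utterance.wav', 'rb') as wav:
        audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

    print(ds.stt(audio))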
Intent parser
Once audio has been transcribed into text, it must then be processed for intent. That is to say, Mycroft now has to process the command you gave it and figure out what you want it to do. This can be complicated in its own right, as even with a perfect transcription humans have many different ways of expressing an intent: "Hey Mycroft, tell me about the weather", "Hey Mycroft, what's the weather?", "Hey Mycroft, what is it like outside?"
To determine the intent, Mycroft has two separate projects. The default, Adapt, provides a more programmatic approach to intent parsing, while Padatious is a neural-network-based parser. The intention of the Mycroft project is to eventually replace Adapt with Padatious, which is reported to be under active development. It is noteworthy, however, that Padatious has had only six commits this year.
While both Adapt and Padatious accomplish the same goal, they do so in very different ways. Adapt (implemented in Python) is designed to run on embedded systems with limited resources, meaning it can run directly on the local device. It is a keyword-based intent parser, so its capabilities are somewhat limited, though it can be useful for implementing Mycroft skills that have a small number of commands (in Mycroft terms, "Utterances"). Padatious, on the other hand, takes an AI-based approach to intent parsing, where the entire sentence is examined to ascertain the intent. In this approach, a neural network is trained on whole phrases and the generated model is used to parse the intent. This involves more up-front work training the model for a skill but, according to the documentation, Padatious models "require a relatively small amount of data." Like Adapt, Padatious is implemented in Python.
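To make the difference concrete, here is a rough sketch of each, loosely modeled on the examples in the two projects' documentation; the keywords and example phrases are purely illustrative. Adapt is driven by registered keywords:

    from adapt.engine import IntentDeterminationEngine
    from adapt.intent import IntentBuilder

    engine = IntentDeterminationEngine()

    # Adapt matches against registered keywords ("entities")
    engine.register_entity("weather", "WeatherKeyword")
    for city in ("Seattle", "Tokyo"):
        engine.register_entity(city, "Location")

    weather_intent = IntentBuilder("WeatherIntent") \
        .require("WeatherKeyword") \
        .optionally("Location") \
        .build()
    engine.register_intent_parser(weather_intent)

    for intent in engine.determine_intent("what is the weather like in Seattle"):
        if intent.get("confidence") > 0:
            print(intent)   # intent type plus any matched entities

Padatious, by contrast, is trained on whole example sentences (it relies on the libfann-based fann2 package):

    from padatious import IntentContainer

    container = IntentContainer('intent_cache')   # directory for the trained models
    container.add_intent('weather', [
        'what is the weather like',
        'tell me about the weather',
        'how is it outside',
    ])
    container.train()

    match = container.calc_intent('what is it like outside today')
    print(match.name, match.conf)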
Once an intent is parsed, regardless of the technology used, it is then mapped (hopefully) to a skill that knows how to execute the intended action — such as fetching and responding with the latest weather for the region.
Text to speech
The text-to-speech (TTS) module is the final key swappable implementation of the Mycroft project. This module, as its name implies, takes text and converts it into audio to be played. Generally this happens in response to a command handled by the intent parser but, unlike other commercial offerings (e.g. Amazon Alexa), the module can also be used to "push" audio to the voice assistant without it being strictly in response to a wake-word command. This can be useful when integrating Mycroft with a project like Home Assistant, as a way to provide audio notifications.
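As a concrete illustration, pushing a spoken notification amounts to emitting a "speak" message on the bus described earlier; a minimal sketch, again assuming the mycroft-messagebus-client package is installed, might look like:

    from time import sleep
    from mycroft_bus_client import MessageBusClient, Message

    client = MessageBusClient()   # ws://localhost:8181/core by default
    client.run_in_thread()
    sleep(1)                      # crude: give the WebSocket a moment to connect

    # Have the TTS module announce something outside of any wake-word interaction
    client.emit(Message('speak', {'utterance': 'The washing machine has finished'}))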
Mycroft offers multiple options for TTS. Specifically, Mycroft has the Mimic and Mimic2 projects. The default, Mimic, is a TTS engine based on Flite that runs directly on the device; it synthesizes a voice from text by concatenating sounds together to form words. This produces a less-than-natural voice, but in testing it wasn't bad. Mimic2, on the other hand, is a cloud-based TTS engine based on Tacotron that uses a neural network to produce a much higher-quality synthesized voice. Mycroft also supports using Google's TTS offering, though it is not enabled by default; doing so results in voice synthesis very similar to that found in Google Home devices.
More to come
Mycroft's expansive portfolio of projects is impressive, but perhaps even more impressive is how easy it is to get started. The company has developed two consumer-focused devices (one on pre-order) that are built on the project, but it also provides the ability to build your own equivalent based on a Raspberry Pi or other Linux system. In our next installment, we will take a closer look at the project in action to see how well it stacks up against its proprietary competition.
Posted Jul 25, 2020 11:45 UTC (Sat)
by jonquark (guest, #45554)
[Link] (1 responses)
Their second consumer device has been very significantly delayed and whilst early prototypes exist - those of us who took part in the crowd funding for the Mark II won't get a device in 2020 and (from memory) I think the original estimated delivery for the device was in 2018. So the grammatical tense in "In recent years, it has developed two consumer-focused "smart speaker" devices" doesn't read quite correctly to me - it is developing the second device.
Despite the delays, I'm really hopeful that Mycroft can develop into a sustainable business that can spearhead the development of a cutting-edge open-source home assistant.
[1] (or whatever the word is in a crowdfunding campaign with the level of risk involved)
Posted Jul 27, 2020 5:11 UTC (Mon)
by da4089 (subscriber, #1195)
[Link]
The Mark II device has evolved from being a custom-built platform, building on their experience with the Mark I, to the current plans where it is essentially a Raspberry Pi case with a custom microphone array board.
I'm increasingly dubious that they'll deliver anything approaching what was originally claimed, let alone a device that's competitive with commercial products that have enjoyed several additional years of development while the Mycroft hardware has stagnated.
Posted Jul 25, 2020 14:42 UTC (Sat)
by FloatingBoater (subscriber, #67237)
[Link]
To connect a Mycroft Mark One (a Raspberry Pi with an Arduino I/O board for a face) to openHAB home automation takes an MQTT server on the local LAN, and not a cloud server thousands of miles away.
To fix an issue with a skill (such as when the core software added support for account-based config), I could SSH into the RPi and edit the Python code directly with 'vim __init__.py'.
The original Mark One was designed for hardware hacking with multiple I/O pins broken out on the back of the unit.
Is the result as polished as a Google Nest Max or Alexa? Nope.
The still-under-development Mark Two seems to have had a few stumbles, apparently with sub-contractors burning time and money on custom embedded uC and LCD-screen drivers that didn't pan out. The latest iteration is back to an RPi with a series of custom PCBs.
Surprisingly, as well as open-sourcing the software stack (build your own PiCroft), Joshua Montgomery has started publishing engineering team meetings on YouTube. The commitment to openness is stellar, as the three sessions I've watched are warts and all. I've been engineering software for 30 years, so watching how someone else manages complex issues with scarce resources is fascinating - even if occasionally you'd love to come off mute and interject! :-)
Posted Jul 25, 2020 20:19 UTC (Sat)
by Cyberax (✭ supporter ✭, #52523)
[Link]
One area where I'm looking for improvements is multi-language support. Right now it's in shambles :(
Posted Jul 29, 2020 12:45 UTC (Wed)
by excors (subscriber, #95769)
[Link]
Also, why assume it's economical? Amazon, Google and Apple can afford to spend literally billions of dollars to gain market share, based on a prediction that the market may become highly profitable a decade later (after further hardware/software advancements to reduce costs, and new business models to increase revenues). Better to risk wasting a billion becoming the leader of a new market that ultimately fails, than to risk missing out on a new market that ends up being wildly successful.
And the voice assistants never need to be profitable by themselves, they just need to not lose too much money while driving users towards those companies' offerings in other markets. Amazon will sell you a smart speaker for $0.99 with a subscription to their $10/month music subscription service, and it also integrates with their e-commerce site ("Alexa, buy a television"), and other companies can easily implement Alexa skills on AWS servers, etc. Google does similar with YouTube Music, Apple wants to encourage you to buy iPhones and apps, etc. Voice assistants might never make economical sense as a standalone service, because the technology is so complicated that no user would be willing to pay what it really costs; they might only make sense as an interface for other high-profit-margin services.
Posted Jul 29, 2020 16:43 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link]
And if its cost is $10,000 then it works out to a mere $10 per user, which can easily be built into the cost of the hardware or recouped by selling your data to the highest ad bidder.
Posted Jul 30, 2020 14:15 UTC (Thu)
by robbe (guest, #16131)
[Link] (1 responses)
If I understand correctly, this does *not* use the Raspi’s GPU core.
Posted Aug 6, 2020 5:31 UTC (Thu)
by donbarry (guest, #10485)
[Link]
I haven't played with it yet but am looking forward to doing so in the coming year. Mycroft would probably find more interested developers if it ditched the cloud, as it could really distinguish itself as a privacy-respecting local provider from start to finish.
Posted Jul 30, 2020 13:25 UTC (Thu)
by kpfleming (subscriber, #23250)
[Link]
One of the best ways to discourage new contributors to any open source project is to have a high-friction process for making contributions, but this really goes far beyond 'high friction'.
Which platform do I get the most use out of? Mycroft (outside of watching YouTube content on a kitchen counter device).
Thanks so much for this article. I keep meaning to look into automated virtual assistants like Mycroft and Almond (there was a GUADEC talk about Almond this year) and this article helps me understand the privacy tradeoffs involved with Mycroft's speech-to-text functionality. When I think about the privacy concerns involved in using a microphone-capable virtual assistant, I ask: do I trust that, by default, the device is only listening for the wakeword? who gets to hear the raw audio of my requests/instructions, and read/analyze the parsed text of them and what data/response is elicited? So I appreciate the details in this article.
speech-to-text privacy
https://hacks.mozilla.org/2019/12/deepspeech-0-6-mozillas...