Voice computing has long been a staple of science fiction, but it has only relatively recently made its way into mainstream use. Gadgets like mobile phones and "smart" home assistant devices (e.g. Amazon Echo, Google Home) have brought voice-based user interfaces to the masses. The voice processing for those gadgets relies on various proprietary services "in the cloud", which generally leaves the free-software world out in the cold. There have been FOSS speech-recognition efforts over the years, but Mozilla's recent announcement of the release of its voice-recognition code and voice data set should help further the goal of FOSS voice interfaces.
There are two parts to the release: DeepSpeech, which is a speech-to-text (STT) engine and model, and Common Voice, which is a set of voice data that can be used to train voice-recognition systems. While DeepSpeech is available for those who simply want to do some kind of STT task, Common Voice is meant for those who want to create their own voice-recognition system, potentially one that does even better (or better for certain types of applications) than DeepSpeech.
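For those who just want to experiment, the release includes Python bindings and a pre-trained English model. Below is a minimal sketch of transcribing a WAV file with those bindings; the file names and the numeric constructor parameters (feature count, context window, beam width) follow the examples from the initial release and may differ in other versions, so the project's documentation is the authoritative reference.

    # Minimal sketch: transcribe a WAV file with the DeepSpeech Python
    # bindings (pip install deepspeech).  The file names and numeric
    # parameters below are assumptions based on the 0.1-era examples.
    import scipy.io.wavfile as wav
    from deepspeech.model import Model

    MODEL = 'output_graph.pb'   # exported TensorFlow graph from the release
    ALPHABET = 'alphabet.txt'   # character set the model was trained on
    N_FEATURES = 26             # MFCC features per time step
    N_CONTEXT = 9               # context frames on either side
    BEAM_WIDTH = 500            # beam width for the CTC decoder

    ds = Model(MODEL, N_FEATURES, N_CONTEXT, ALPHABET, BEAM_WIDTH)

    sample_rate, audio = wav.read('recording.wav')  # 16kHz, 16-bit mono expected
    print(ds.stt(audio, sample_rate))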
The DeepSpeech project is based on two papers from Chinese web-services company Baidu; it uses a neural network implemented with Google's TensorFlow. As detailed in a blog post by Reuben Morais, who works in the Machine Learning Group at Mozilla Research, several data sets were used to train DeepSpeech, including transcriptions of TED talks, LibriVox audio books from the LibriSpeech corpus, and data from Common Voice; two proprietary data sets were also mentioned, but it is not clear how much of that data ended up in the final DeepSpeech model. The goal was a word error rate of less than 10%, which was met: "Our word error rate on LibriSpeech's test-clean set is 6.5%, which not only achieves our initial goal, but gets us close to human level performance."
The blog post goes into a fair amount of detail that will be of interest to those who are curious about machine learning. It is clear that doing this kind of training is not for the faint of heart (or those with small wallets). It is a computationally intensive task that takes a fairly sizable amount of time even using specialized hardware:
We started with a single machine running four Titan X Pascal GPUs, and then bought another two servers with 8 Titan XPs each. We run the two 8 GPU machines as a cluster, and the older 4 GPU machine is left independent to run smaller experiments and test code changes that require more compute power than our development machines have. This setup is fairly efficient, and for our larger training runs we can go from zero to a good model in about a week.
A "human level" word error rate is 5.83%, according to the Baidu papers, Morais said, so 6.5% is fairly impressive. Running the model has reasonable performance as well, though getting it to the point where it can run on a Raspberry Pi or mobile device is desired.
Because the machine-learning group had trouble finding quality data sets for training DeepSpeech, Mozilla started the Common Voice project to help create one. The first release of data from the project is the subject of a blog post from Michael Henretty. The data, which was collected from volunteers and has been released into the public domain, is quite expansive: "This collection contains nearly 400,000 recordings from 20,000 different people, resulting in around 500 hours of speech." In fact, it is the second-largest publicly available data set; it is also growing daily as people add and validate new speech samples.
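The release is distributed as audio clips accompanied by per-clip metadata. As a rough illustration only, assuming a CSV manifest with "filename" and "text" columns (the actual file names and column headers may differ; the README that ships with the download is authoritative), walking the validated training split might look like this:

    # Sketch of iterating over a Common Voice metadata file.  The CSV name
    # and the "filename"/"text" column headers are assumptions.
    import csv

    total_clips = 0
    with open('cv-valid-train.csv', newline='') as f:
        for row in csv.DictReader(f):
            total_clips += 1
            if total_clips <= 3:
                print(row['filename'], '->', row['text'])

    print('%d validated clips in this split' % total_clips)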
The initial release is only for the English language, but there are plans to support adding speech in other languages. The announcement noted that a diversity of voices is important for Common Voice:
To this end, while we've started with English, we are working hard to ensure that Common Voice will support voice donations in multiple languages beginning in the first half of 2018.
The Common Voice site has links to other voice data sets (also all in English, so far). There is also a validation application on the home page, which allows visitors to listen to a sentence to determine if the speaker accurately pronounced the words. There are no real guidelines for how forgiving one should be (and just simple "Yes" and "No" buttons), but crowdsourcing the validation should help lead to a better data set. In addition, those interested can record their own samples on the web site.
A blog post announcing the Common Voice project (but not the data set, yet) back in July outlines some of the barriers to entry for those wanting to create STT applications. Each of the major browsers has its own API for supporting STT applications; as might be guessed, Mozilla is hoping that browser makers will instead rally around the W3C Web Speech API. That post also envisions a wide array of uses for STT technology:
It's fun to think about where this work might lead. For instance, how might we use silent speech interfaces to keep conversations private? If your phone could read your lips, you could share personal information without the person sitting next to you at a café or on the bus overhearing. Now that's a perk for speakers and listeners alike.
While applications for voice interfaces abound (even if only rarely used by increasingly Luddite types such as myself), there are, of course, other problems to be solved before we can throw away our keyboards and mice. Turning speech into text is useful, but there is still a need to derive meaning from the words. Certain applications will be better suited than others to absorb voice input, and Mozilla's projects will help them do so. Text-to-speech has been around for some time, and there are free-software options for it, but full-on, general-purpose voice interfaces will probably need a boost from artificial intelligence; that is likely still a ways out.
Mozilla releases tools and data for speech recognition
Posted Dec 7, 2017 4:52 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]
Now I'm looking at automating music playing. The direct approach doesn't really work well with lots of albums. For example, "Imaginerum" by Nightwish can't be recognized since it's not a word in its corpus.
So there needs to be a way to plug your own matchers into the lex model; that seems fairly straightforward...
Mozilla releases tools and data for speech recognition
Posted Dec 7, 2017 17:19 UTC (Thu) by drag (guest, #31333) [Link]
Fantastic stuff.
programming with STT
Posted Dec 7, 2017 10:05 UTC (Thu) by meskio (subscriber, #100774) [Link]
programming with STT
Posted Dec 7, 2017 12:36 UTC (Thu) by AndreiG (subscriber, #90359) [Link]
/s/tr[1-9][0-9]+_[bcek]/g
😂
programming with STT
Posted Dec 7, 2017 13:10 UTC (Thu) by nix (subscriber, #2304) [Link]
This is laborious as hell but essential if you're doing significant amounts of programming by voice. (Why would you do that? Bad enough RSI. A number of leading lights in the community have been forced onto that sort of thing over the years, usually using Dragon Naturally Speaking, but that's non-free, so a free alternative is hugely significant for people stuck with that.)
programming with STT
Posted Dec 12, 2017 18:23 UTC (Tue) by metasequoia (subscriber, #119065) [Link]
***There needs to be a way of protecting the huge public domain of health knowledge.***
Either buy it in capsule form and empty the capsules out, or buy it as 50% trans-resveratrol powder. You don't need to take a lot of it at all after the beginning. It may take a while.
Inosine may also help regenerate damaged axons in your nerves. But there you need to keep the levels up in your bloodstream by taking small amounts of it frequently. Inosine monohydrate, I think, is what I used. It's a purine.
Just put it in your drinking water, like they do with lab rats. But you have to stir it (or shake it) when you drink, because it doesn't dissolve much in water. It has a bland chalky taste. Don't take it if you have gout. It could make gout worse.
PubMed has info on this, but you need to look under its Russian name for most of it, which I now am forgetting!
programming with STT
Posted Dec 14, 2017 12:58 UTC (Thu) by nix (subscriber, #2304) [Link]
(FYI, my mum tried it for arthritis and mouse-induced RSI and got nausea but no joint improvements. She switched chairs and switched to a trackball, and the RSI went away. Again, a mere anecdote...)
I'll stick with my ergonomic keyboard. It works, and it's obvious *why* it works (the shape). Equally, for voice recognition, it's obvious that not using your hands to type for a while will probably do them good if you've been overusing them before.
programming with STT
Posted Dec 7, 2017 19:41 UTC (Thu) by iabervon (subscriber, #722) [Link]
programming with STT
Posted Dec 8, 2017 11:41 UTC (Fri) by epa (subscriber, #39769) [Link]
programming with STT
Posted Dec 8, 2017 18:45 UTC (Fri) by iabervon (subscriber, #722) [Link]
"Surely anyone speaking lisp is able to pause for exactly the desired number of close parentheses after lowering their pitch on the preceding syllable to indicate the end of a statement, right?"
I think Perl, on the other hand, would be a tonal language, where you say your variable names in different tones to put different sigils on them.
Mozilla releases tools and data for speech recognition
Posted Dec 8, 2017 11:40 UTC (Fri) by epa (subscriber, #39769) [Link]
From the rest of the paragraph I think it means that the real-time factor is the computing time divided by the length of the recording. So a lower number is better; "0.5x means that you can transcribe 1 second of audio in 0.5 seconds".
Are accents allowed or denied?
Posted Dec 9, 2017 16:12 UTC (Sat) by sasha (subscriber, #16070) [Link]
My mother tongue is not English, and I speak English with an accent. I do not "accurately pronounce" English words; I can't. I wonder which part of the claims is really true...
Are accents allowed or denied?
Posted Dec 9, 2017 17:01 UTC (Sat) by rahulsundaram (subscriber, #21946) [Link]
Everybody speaks with an accent. Some accents are just more common than others; that shouldn't determine, at a broad level, whether words are pronounced accurately.
Are accents allowed or denied?
Posted Dec 9, 2017 18:51 UTC (Sat) by Felix (subscriber, #36445) [Link]
For further reference:
- https://discourse.mozilla.org/t/instructions-for-validati...
- https://github.com/mozilla/voice-web/issues/273
Are accents allowed or denied?
Posted Dec 10, 2017 23:10 UTC (Sun) by giraffedata (subscriber, #1954) [Link]
And I wonder, if they can get 93.5% accuracy with a broad range of accents, couldn't they get much better if they had a narrower training set that covers one target user? Since there was no mention of it, I assume it doesn't actually recognize accents and use prior pronunciation to inform current interpretations. But that would be great.
I don't think I personally get 93.5% in real time, and with foreign accents, I'm probably more like 75%. I could probably use a machine with this technology to subtitle live speech for me.
Are accents allowed or denied?
Posted Dec 10, 2017 23:50 UTC (Sun) by roblucid (subscriber, #48964) [Link]
It also seems the "training" uses up a huge amount of resources, so it's not something you want to do on your neat little mobile device or Raspberry Pi.
Are accents allowed or denied?
Posted Dec 11, 2017 0:11 UTC (Mon) by giraffedata (subscriber, #1954) [Link]
Just to be clear, I'm not suggesting a voice recognition system be trained to a particular user's voice - just that it selectively use parts of the giant database these guys created. E.g. you could choose "midwestern United States" from a menu and it would then never be influenced by its training on people from Australia or Brooklyn or non-native speakers. Or alternatively, it could sense from your first few sentences that you're using midwestern United States pronunciation and then do the same thing.
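A purely hypothetical sketch of that idea, assuming the clip metadata carried an accent label (the column name and its values here are invented for illustration), would be to filter the training manifest before training:

    # Hypothetical: keep only clips whose metadata matches the accent the
    # user selected, then train on that subset.  The 'accent' column and
    # the 'us-midwest' value are invented for this example.
    import csv

    def select_clips(metadata_csv, wanted_accent):
        with open(metadata_csv, newline='') as f:
            return [row['filename']
                    for row in csv.DictReader(f)
                    if row.get('accent') == wanted_accent]

    midwest_clips = select_clips('cv-valid-train.csv', 'us-midwest')
    print('training on %d clips' % len(midwest_clips))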
Are accents allowed or denied?
Posted Dec 11, 2017 2:42 UTC (Mon) by roblucid (subscriber, #48964) [Link]
Are accents allowed or denied?
Posted Dec 11, 2017 3:15 UTC (Mon) by giraffedata (subscriber, #1954) [Link]
I don't follow. What does traveling have to do with it?
Are accents allowed or denied?
Posted Dec 11, 2017 4:24 UTC (Mon) by pabs (subscriber, #43278) [Link]
Are accents allowed or denied?
Posted Dec 11, 2017 16:53 UTC (Mon) by giraffedata (subscriber, #1954) [Link]
OK, well my comments are limited to the (quite substantial) use case where a single user speaks to the VR system, wondering if a system in that application could take advantage of knowing what pronunciation that user uses to recognize his words more accurately.
Copyright © 2017, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds