
Mozilla releases tools and data for speech recognition


By Jake Edge
December 6, 2017

Voice computing has long been a staple of science fiction, but it has only relatively recently made its way into fairly common mainstream use. Gadgets like mobile phones and "smart" home assistant devices (e.g. Amazon Echo, Google Home) have brought voice-based user interfaces to the masses. The voice processing for those gadgets relies on various proprietary services "in the cloud", which generally leaves the free-software world out in the cold. There have been FOSS speech-recognition efforts over the years, but Mozilla's recent announcement of the release of its voice-recognition code and voice data set should help further the goal of FOSS voice interfaces.

There are two parts to the release: DeepSpeech, which is a speech-to-text (STT) engine and model, and Common Voice, which is a set of voice data that can be used to train voice-recognition systems. While DeepSpeech is available for those who simply want to do some kind of STT task, Common Voice is meant for those who want to create their own voice-recognition system—potentially one that does even better (or better for certain types of applications) than DeepSpeech.

DeepSpeech

The DeepSpeech project is based on two papers from Chinese web-services company Baidu; it uses a neural network implemented using Google's TensorFlow. As detailed in a blog post by Reuben Morais, who works in the Machine Learning Group at Mozilla Research, several data sets were used to train DeepSpeech, including transcriptions of TED talks, LibriVox audio books from the LibriSpeech corpus, and data from Common Voice; two proprietary data sets were also mentioned, but it is not clear how much of that was used in the final DeepSpeech model. The goal was to have a word error rate of less than 10%, which was met; "Our word error rate on LibriSpeech's test-clean set is 6.5%, which not only achieves our initial goal, but gets us close to human level performance."
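
The word error rate used here is the standard metric: the number of word substitutions, deletions, and insertions needed to turn the recognizer's output into the reference transcript, divided by the number of words in the reference. A minimal sketch of that calculation (illustrative only, not Mozilla's evaluation code):

    def word_error_rate(reference, hypothesis):
        """WER = (substitutions + deletions + insertions) / reference length,
        computed here with a word-level edit distance."""
        ref, hyp = reference.split(), hypothesis.split()
        # dist[i][j] = edit distance between ref[:i] and hyp[:j]
        dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dist[i][0] = i
        for j in range(len(hyp) + 1):
            dist[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                                 dist[i][j - 1] + 1,        # insertion
                                 dist[i - 1][j - 1] + cost) # substitution
        return dist[len(ref)][len(hyp)] / len(ref)

    # one substitution in a two-word reference: 50% WER
    print(word_error_rate("our word", "hour word"))   # 0.5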

The blog post goes into a fair amount of detail that will be of interest to those who are curious about machine learning. It is clear that doing this kind of training is not for the faint of heart (or those with small wallets). It is a computationally intensive task that takes a fairly sizable amount of time even using specialized hardware:

Deep Speech has over 120 million parameters, and training a model this large is a very computationally expensive task: you need lots of GPUs if you don't want to wait forever for results. We looked into training on the cloud, but it doesn't work financially: dedicated hardware pays for itself quite quickly if you do a lot of training. The cloud is a good way to do fast hyperparameter explorations though, so keep that in mind.

We started with a single machine running four Titan X Pascal GPUs, and then bought another two servers with 8 Titan XPs each. We run the two 8 GPU machines as a cluster, and the older 4 GPU machine is left independent to run smaller experiments and test code changes that require more compute power than our development machines have. This setup is fairly efficient, and for our larger training runs we can go from zero to a good model in about a week.

A "human level" word error rate is 5.83%, according to the Baidu papers, Morais said, so 6.5% is fairly impressive. Running the model has reasonable performance as well, though getting it to the point where it can run on a Raspberry Pi or mobile device is desired.

On a MacBook Pro, using the GPU, the model can do inference at a real-time factor of around 0.3x, and around 1.4x on the CPU alone. (A real-time factor of 1x means you can transcribe 1 second of audio in 1 second.)
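
For those who just want to try the released model, the project ships Python bindings alongside the pre-trained graph. The following sketch assumes the 0.1-era package layout, default acoustic-model parameters, and the file names from the release; all of those are assumptions and may differ in other versions:

    # Minimal transcription sketch with the DeepSpeech Python bindings; the
    # import path, constructor arguments, and constants below are assumptions
    # about the 0.1 release, not a stable API.
    import wave
    import numpy as np
    from deepspeech.model import Model

    N_FEATURES = 26      # MFCC features per time step (assumed default)
    N_CONTEXT = 9        # context frames on each side (assumed default)
    BEAM_WIDTH = 500     # decoder beam width (assumed default)

    ds = Model('output_graph.pb', N_FEATURES, N_CONTEXT, 'alphabet.txt', BEAM_WIDTH)

    with wave.open('speech.wav', 'rb') as w:     # expects 16-bit, 16 kHz mono audio
        rate = w.getframerate()
        audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

    print(ds.stt(audio, rate))                   # the recognized text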

Common Voice

Because the machine-learning group had trouble finding quality data sets for training DeepSpeech, Mozilla started the Common Voice project to help create one. The first release of data from the project is the subject of a blog post from Michael Henretty. The data, which was collected from volunteers and has been released into the public domain, is quite expansive: "This collection contains nearly 400,000 recordings from 20,000 different people, resulting in around 500 hours of speech." In fact, it is the second-largest publicly available voice data set; it is also growing daily as people add and validate new speech samples.

The initial release is only for the English language, but there are plans to support adding speech in other languages. The announcement noted that a diversity of voices is important for Common Voice:

Too often existing speech recognition services can't understand people with different accents, and many are better at understanding men than women — this is a result of biases within the data on which they are trained. Our hope is that the number of speakers and their different backgrounds and accents will create a globally representative dataset, resulting in more inclusive technologies.

To this end, while we've started with English, we are working hard to ensure that Common Voice will support voice donations in multiple languages beginning in the first half of 2018.

The Common Voice site has links to other voice data sets (also all in English, so far). There is also a validation application on the home page, which allows visitors to listen to a sentence and determine whether the speaker accurately pronounced the words. There are no real guidelines on how forgiving one should be (just simple "Yes" and "No" buttons), but crowdsourcing the validation should help lead to a better data set. In addition, those interested can record their own samples on the web site.

A blog post announcing the Common Voice project (but not the data set, yet) back in July outlines some of the barriers to entry for those wanting to create STT applications. Each of the major browsers has its own API for supporting STT applications; as might be guessed, Mozilla is hoping that browser makers will instead rally around the W3C Web Speech API. That post also envisions a wide array of uses for STT technology:

Voice-activated computing could do a lot of good. Home hubs could be used to provide safety and health monitoring for ill or elderly folks who want to stay in their homes. Adding Siri-like functionality to cars could make our roads safer, giving drivers hands-free access to a wide variety of services, like direction requests and chat, so eyes stay on the road ahead. Speech interfaces for the web could enhance browsing experiences for people with visual and physical limitations, giving them the option to talk to applications instead of having to type, read or move a mouse.

It's fun to think about where this work might lead. For instance, how might we use silent speech interfaces to keep conversations private? If your phone could read your lips, you could share personal information without the person sitting next to you at a café or on the bus overhearing. Now that's a perk for speakers and listeners alike.

While applications for voice interfaces abound (even if only rarely used by ever-increasing Luddites such as myself), there are, of course, other problems to be solved before we can throw away our keyboard and mouse. Turning speech into text is useful, but there is still a need to derive meaning from the words. Certain applications will be better suited than others to absorb voice input, and Mozilla's projects will help them do so. Text to speech has been around for some time, and there are free-software options for that, but full-on, general purpose voice interfaces will probably need a boost from artificial intelligence—that is likely still a ways out.



Mozilla releases tools and data for speech recognition

Posted Dec 7, 2017 4:52 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

I added it to my self-made, offline-only home automation and the results are truly impressive. I can use my intercom with its fairly useless mic to reliably switch lights, set the temperature, and look up the state of various objects ("Housekeeper, is the garage door open?"). I had the trigger-word detection built already (don't ask) and was planning to plug in the AWS Lex service, but this is so much better.

Now I'm looking at automating music playing. The direct approach doesn't really work well with lots of albums. For example, "Imaginerum" by Nightwish can't be recognized since it's not a word in its corpus.

So there needs to be a way to plug your own matchers into the lex model; it seems fairly straightforward...
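
One way to get that effect without touching the speech model itself is to fuzzy-match whatever the engine transcribed against the local music catalog afterward; a rough sketch (the catalog, names, and threshold are all invented for illustration):

    # Hypothetical post-processing matcher: map whatever the STT engine heard
    # to the closest known album title in a local catalog.
    import difflib

    ALBUMS = ["Imaginaerum", "Once", "Dark Passion Play", "Endless Forms Most Beautiful"]
    BY_KEY = {a.lower(): a for a in ALBUMS}

    def match_album(transcribed, cutoff=0.5):
        """Return the best-matching album title, or None if nothing is close."""
        hits = difflib.get_close_matches(transcribed.lower(), list(BY_KEY), n=1, cutoff=cutoff)
        return BY_KEY[hits[0]] if hits else None

    print(match_album("imagine a rum"))   # "Imaginaerum"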

Mozilla releases tools and data for speech recognition

Posted Dec 7, 2017 17:19 UTC (Thu) by drag (guest, #31333) [Link]

This is extremely awesome. There have been a couple notable attempts to create AI systems like 'Mycroft AI', but they always depended on these proprietary services to get basic voice functionality.

Fantastic stuff.

programming with STT

Posted Dec 7, 2017 10:05 UTC (Thu) by meskio (subscriber, #100774) [Link]

Will be fun to hook it into text editors to be able to program like:
https://www.youtube.com/watch?v=8SkdfdXWYaI

programming with STT

Posted Dec 7, 2017 12:36 UTC (Thu) by AndreiG (subscriber, #90359) [Link]

Whaaa ?
Do you hear yourself yapping out loud stuff like:

/s/tr[1-9][0-9]+_[bcek]/g

😂

programming with STT

Posted Dec 7, 2017 13:10 UTC (Thu) by nix (subscriber, #2304) [Link]

Yes. You do it with words for punctuation (or, more generally, words that mean single letters) and (generally) two "escape words", one for saying "the next word is a name for a letter" and the other for saying "the next run of words until we say the escape word again are all names for letters". (You add the escape word itself to a document by saying it twice.)
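
A toy decoder for that scheme might look like the following; the escape words ("letter" and "spell") and the letter names are invented for illustration, not taken from any real dictation package:

    # "letter" escapes a single letter name, "spell" toggles a run of letter
    # names, and saying "spell" twice inserts the word itself.
    LETTERS = {"alpha": "a", "bravo": "b", "underscore": "_", "one": "1", "dot": "."}

    def decode(words):
        out, buf, spelling = [], [], False
        i = 0
        while i < len(words):
            w = words[i]
            if w == "letter" and i + 1 < len(words):
                (buf if spelling else out).append(LETTERS.get(words[i + 1], words[i + 1]))
                i += 2
            elif w == "spell":
                if words[i + 1:i + 2] == ["spell"]:
                    (buf if spelling else out).append("spell")
                    i += 2
                else:
                    spelling = not spelling
                    if not spelling:              # run ended: flush as one token
                        out.append("".join(buf))
                        buf = []
                    i += 1
            elif spelling:
                buf.append(LETTERS.get(w, w))
                i += 1
            else:
                out.append(w)
                i += 1
        if buf:
            out.append("".join(buf))
        return " ".join(out)

    # -> "the variable a_1 is set"
    print(decode("the variable spell alpha underscore one spell is set".split()))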

This is laborious as hell but essential if you're doing significant amounts of programming by voice. (Why would you do that? Bad enough RSI. A number of leading lights in the community have been forced onto that sort of thing over the years, usually using Dragon Naturally Speaking, but that's non-free, so a free alternative is hugely significant for people stuck with that.)

programming with STT

Posted Dec 12, 2017 18:23 UTC (Tue) by metasequoia (subscriber, #119065) [Link]

Try resveratrol buccally (let it absorb into the membranes in your mouth) for RSI. I've been doing this for years, just at modest levels; I found this out many years ago, and it works like a charm. I've told hundreds of people and have heard back lots of positive things. I have no RSI at all now and I spend many hours a day on the computer. It helps *repair* many joint issues and helps improve bone density in older people, and it also seems likely to help astronauts in microgravity maintain bone mass. It's also good for osteoporosis in older people. Note that these are all slightly different. It also may be helpful in dental implantation. It's really good for a number of different kinds of bone, connective tissue, spine and cartilage issues. It's very good and it's also quite cheap. It's the subject of a great deal of research in dozens of different areas right now.

***There needs to be a way of protecting the huge public domain of health knowledge.***

Either buy it in capsule form and empty the capsules out, or buy it as 50% trans-resveratrol powder. You don't need to take a lot of it at all after the beginning. It may take a while.

Inosine also may help regenerate damaged axons in your nerves. But there you need to keep the levels up in your bloodstream by taking small amounts of it frequently. Inosine monohydrate, I think, is what I used. It's a purine.

Just put it in your drinking water, like they do with lab rats. But you have to stir it (or shake it) when you drink, because it doesn't dissolve much in water. It has a bland chalky taste. Don't take it if you have gout. It could make gout worse.

PubMed has info on this, but you need to look under its Russian name for most of it, which I now am forgetting!

programming with STT

Posted Dec 14, 2017 12:58 UTC (Thu) by nix (subscriber, #2304) [Link]

The Wikipedia article on resveratrol is not encouraging, with fairly unpleasant side effects and no positive effects noted: no effects on osteoporosis, no effects on joint function, no effects on bone density. Those PubMed searches I have done suggest that the only studies that have shown any positive effects at all have had tiny sample sizes. To me, this says "selection bias": things often get better on their own, and if you're taking something at the same time, you'll attribute the improvement to the thing you're taking.

(FYI, my mum tried it for arthritis and mouse-induced RSI and got nausea but no joint improvements. She switched chairs and switched to a trackball, and the RSI went away. Again, a mere anecdote...)

I'll stick with my ergonomic keyboard. It works, and it's obvious *why* it works (the shape). Equally, for voice recognition, it's obvious that not using your hands to type for a while will probably do them good if you've been overusing them before.

programming with STT

Posted Dec 7, 2017 19:41 UTC (Thu) by iabervon (subscriber, #722) [Link]

Actually, hooking it into emacs allows for the possibility that all the things you say are commands that interact with the state of the buffer and are based on context, rather than having you run self-insert-command once per character in the text you want. It also seems like there would be an opportunity for using voice commands for everything other than self-insert (C-x C-s isn't easier to type than "save" is to say). And I could imagine really nice features like: when I say "page up", scroll the window I'm looking at (based on eye tracking), not the one with keyboard focus.
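
A small sketch of that dispatch idea, independent of any particular editor (the editor object and its methods here are hypothetical stand-ins):

    # Hypothetical dispatch layer: recognized phrases become editor commands,
    # and anything unrecognized falls back to being inserted as text.
    class StubEditor:
        """Minimal stand-in for a real editor, just for demonstration."""
        def __init__(self):
            self.buffer = ""
        def save_buffer(self):
            print("saved")
        def scroll(self, direction):
            print("scrolling", direction)
        def kill_line(self):
            print("kill-line")
        def insert(self, text):
            self.buffer += text

    COMMANDS = {
        "save": lambda ed: ed.save_buffer(),
        "page up": lambda ed: ed.scroll(-1),
        "page down": lambda ed: ed.scroll(+1),
        "kill line": lambda ed: ed.kill_line(),
    }

    def handle_utterance(editor, text):
        action = COMMANDS.get(text.strip().lower())
        if action is not None:
            action(editor)             # a command acts on editor state
        else:
            editor.insert(text + " ")  # dictation: insert the words themselves

    ed = StubEditor()
    handle_utterance(ed, "hello world")   # inserted into the buffer
    handle_utterance(ed, "save")          # runs the save command instead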

programming with STT

Posted Dec 8, 2017 11:41 UTC (Fri) by epa (subscriber, #39769) [Link]

Perhaps you would invent a tonal language for speaking Lisp.

programming with STT

Posted Dec 8, 2017 18:45 UTC (Fri) by iabervon (subscriber, #722) [Link]

I think the distinctive thing about lisp as compared to other languages is that its syntax is extremely uniform. So it wouldn't be spoken as a tonal language, but probably with a lot of significance to prosody and stress. (In tonal languages, the pitch contour matters to determining which words you're using; in non-tonal languages like English, it often matters to the sentence structure instead, since it's available for that purpose.)

"Surely anyone speaking lisp is able to pause for exactly the desired number of close parentheses after lowering their pitch on the preceding syllable to indicate the end of a statement, right?"

I think Perl, on the other hand, would be a tonal language, where you say your variable names in different tones to put different sigils on them.

programming with STT

Posted Dec 8, 2017 3:29 UTC (Fri) by bokr (subscriber, #58369) [Link]

Here is some prior art ;-)

https://www.youtube.com/watch?v=Qf_TDuhk3No

Mozilla releases tools and data for speech recognition

Posted Dec 8, 2017 11:40 UTC (Fri) by epa (subscriber, #39769) [Link]

That explanation of a real-time factor (from the quoted blog post) isn't very helpful. "1x means you can transcribe 1 second of audio in 1 second." That could mean the factor is the length of the audio recording divided by how long it takes to process it -- or it could mean the other way round!

From the rest of the paragraph I think it means that the real-time factor is the computing time divided by the length of the recording. So a lower number is better; "0.5x means that you can transcribe 1 second of audio in 0.5 seconds".
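
Under that reading the arithmetic is simply processing time divided by audio duration, so lower is better:

    def real_time_factor(processing_seconds, audio_seconds):
        """Lower is better: 0.3x means 10 s of audio is transcribed in 3 s."""
        return processing_seconds / audio_seconds

    print(real_time_factor(3.0, 10.0))    # 0.3, the quoted MacBook Pro GPU figure
    print(real_time_factor(14.0, 10.0))   # 1.4, CPU only: slower than real time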

Are accents allowed or denied?

Posted Dec 9, 2017 16:12 UTC (Sat) by sasha (subscriber, #16070) [Link]

They start with "the number of speakers and their different backgrounds and accents will create a globally representative dataset". But then they "allow visitors to listen to a sentence to determine if the speaker accurately pronounced the words".

My mother tongue is not English, and I speak English with an accent. I do not "accurately pronounce" English words; I can't. I wonder which part of the claims is really true...

Are accents allowed or denied?

Posted Dec 9, 2017 17:01 UTC (Sat) by rahulsundaram (subscriber, #21946) [Link]

> My mother tongue is not English, and I speak English with an accent.

Everybody speaks with an accent. Some accents are just more common than others. That shouldn't determine the accuracy of pronouncing words at a broad level.

Are accents allowed or denied?

Posted Dec 9, 2017 18:51 UTC (Sat) by Felix (subscriber, #36445) [Link]

Yes, the web site really lacks review guidelines. But some speakers repeated the sentence a few times or stopped the recording too early. These are clearly invalid.

For further reference:
- https://discourse.mozilla.org/t/instructions-for-validati...
- https://github.com/mozilla/voice-web/issues/273

Are accents allowed or denied?

Posted Dec 10, 2017 23:10 UTC (Sun) by giraffedata (subscriber, #1954) [Link]

And I wonder, if they can get 93.5% accuracy with a broad range of accents, couldn't they get much better if they had a narrower training set that covers one target user?

Since there was no mention of it, I assume it doesn't actually recognize accents and use prior pronunciation to inform current interpretations. But that would be great.

I don't think I personally get 93.5% in real time, and with foreign accents, I'm probably more like 75%. I could probably use a machine with this technology to subtitle live speech for me.

Are accents allowed or denied?

Posted Dec 10, 2017 23:50 UTC (Sun) by roblucid (subscriber, #48964) [Link]

It would be better, but that would mean your phone / PC would do much worse when you first get it and have not broken the VR in! I'm currently studying a new foreign language and practising one I used to be fluent in, and it's quite nice when the VR handles those tongues as well as it does when I dictate in English. The VR could improve with time, but what if I am a student speaking incorrectly? With a system specially adapted to my faulty speech, correcting those faults could result in worse comprehension.

It also seems the "training" uses up a huge amount of resources, therefore not what you want on your neat little mobile device or Raspberry Pi.

Are accents allowed or denied?

Posted Dec 11, 2017 0:11 UTC (Mon) by giraffedata (subscriber, #1954) [Link]

Just to be clear, I'm not suggesting a voice recognition system be trained to a particular user's voice - just that it selectively use parts of the giant database these guys created. E.g. you could choose "midwestern United States" from a menu and it would then never be influenced by its training on people from Australia or Brooklyn or non-native speakers. Or alternatively, it could sense from your first few sentences that you're using midwestern United States pronunciation and then do the same thing.
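
Assuming the per-clip metadata in the Common Voice release carries an accent label (the CSV file name, column name, and "us" label below are assumptions and may not match the actual release), building such a subset would just be a filter over the metadata:

    # Sketch of selecting an accent-restricted training subset from per-clip
    # metadata; the file name, "accent" column, and label are assumptions.
    import csv

    def clips_for_accent(metadata_csv, wanted="us"):
        """Yield (audio_path, transcript) pairs for a single accent label."""
        with open(metadata_csv, newline="") as f:
            for row in csv.DictReader(f):
                if row.get("accent") == wanted:
                    yield row["filename"], row["text"]

    subset = list(clips_for_accent("cv-valid-train.csv", wanted="us"))
    print(len(subset), "clips in the accent-restricted subset")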

Are accents allowed or denied?

Posted Dec 11, 2017 2:42 UTC (Mon) by roblucid (subscriber, #48964) [Link]

Well that's a really bad idea, haven't you noticed how many people travel around?

Are accents allowed or denied?

Posted Dec 11, 2017 3:15 UTC (Mon) by giraffedata (subscriber, #1954) [Link]

I don't follow. What does traveling have to do with it?

Are accents allowed or denied?

Posted Dec 11, 2017 4:24 UTC (Mon) by pabs (subscriber, #43278) [Link]

If you are a traveller using voice recognition, language translation and voice synthesis to hold a conversation with someone else despite not knowing their language, you need to also recognise the voice of your conversation partner, in their language and accent. So the VR system needs to differentiate between multiple speakers to choose the right model to train rather than just training the one model using data from all speakers providing input to the VR system.

Are accents allowed or denied?

Posted Dec 11, 2017 16:53 UTC (Mon) by giraffedata (subscriber, #1954) [Link]

OK, well my comments are limited to the (quite substantial) use case where a single user speaks to the VR system, wondering if a system in that application could take advantage of knowing what pronunciation that user uses to recognize his words more accurately.


Copyright © 2017, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds