
Mozilla's initiatives for non-creepy deep learning

By Jonathan Corbet
February 6, 2019

Jack Moffitt started off his 2019 linux.conf.au talk by calling attention to Facebook's "Portal" device. It is, he said, a cool product, but raises an important question: why would anybody in their right mind put a surveillance device made by Facebook in their kitchen? There are a lot of devices out there — including the Portal — using deep-learning techniques; they offer useful functionality, but also bring a lot of problems. We as a community need to figure out a way to solve those problems; he was there to highlight a set of Mozilla projects working toward that goal.

He defined machine learning as the process of making decisions and/or predictions by modeling from input data. Systems using these techniques can perform all kinds of tasks, including language detection and (bad) poetry generation. The classic machine-learning task is spam filtering, based on the idea that certain words tend to appear more often in spam and can be used to detect unwanted email. With more modern neural networks, though, there is no need to do that sort of feature engineering; the net itself can figure out what the interesting features are. It is, he said, "pretty magical".
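The word-frequency idea behind classic spam filtering can be sketched in a few lines. What follows is a minimal naive Bayes illustration, not anything shown in the talk; the function names and the four-message training set are invented for the example.

```python
import math
from collections import Counter


def train(messages):
    """Count words per class from (text, is_spam) pairs."""
    word_counts = {True: Counter(), False: Counter()}
    class_counts = Counter()
    for text, is_spam in messages:
        class_counts[is_spam] += 1
        word_counts[is_spam].update(text.lower().split())
    return word_counts, class_counts


def spam_score(text, word_counts, class_counts):
    """Return the log-odds that text is spam; positive means 'more likely spam'."""
    vocabulary = set(word_counts[True]) | set(word_counts[False])
    spam_total = sum(word_counts[True].values())
    ham_total = sum(word_counts[False].values())
    # Start from the prior odds of spam versus non-spam messages.
    score = math.log(class_counts[True] / class_counts[False])
    for word in text.lower().split():
        if word not in vocabulary:
            continue  # words never seen in training carry no evidence
        # Add-one (Laplace) smoothing avoids taking the log of zero.
        p_spam = (word_counts[True][word] + 1) / (spam_total + len(vocabulary))
        p_ham = (word_counts[False][word] + 1) / (ham_total + len(vocabulary))
        score += math.log(p_spam / p_ham)
    return score


# Invented toy data, purely for illustration.
TOY_DATA = [
    ("win free money now", True),
    ("free prize claim now", True),
    ("meeting agenda for monday", False),
    ("patch review for the kernel", False),
]
```

Here the "features" are chosen by hand (individual words), which is exactly the feature engineering that, according to Moffitt, neural networks make unnecessary.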

Moffitt gave a quick overview of some of the structures used for contemporary deep learning, including neural networks, convolutional networks, and recurrent networks. The last of those are useful for speech recognition and synthesis tasks; they are used a lot at Mozilla. See the video (linked at the bottom) for more details about how these different types of networks perform their magic. Regardless of the architecture, the overall technique used to train these networks is the same: present them with input data, then tweak the network's parameters to bring the output closer to what is desired. Do that enough times with enough data, and the network should get good at performing the intended task.
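That training recipe can be illustrated with the smallest possible "network": a single linear unit fitted by gradient descent. This is only a sketch of the general idea; the model, learning rate, and sample data are assumptions made for the example, and real networks have millions of parameters rather than two.

```python
def train_linear(samples, learning_rate=0.05, epochs=2000):
    """Fit output = w * x + b to (x, target) pairs by gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, target in samples:
            output = w * x + b       # present the input to the model
            error = output - target  # how far the output is from what is desired
            # Gradient of 0.5 * error**2 is error * x for w and error for b;
            # step each parameter a little in the direction that reduces it.
            w -= learning_rate * error * x
            b -= learning_rate * error
    return w, b


# Invented samples drawn from the line y = 2x + 1.
SAMPLES = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
```

Calling train_linear(SAMPLES) should recover a slope close to 2 and an offset close to 1; the same loop of "predict, measure the error, adjust" underlies the far larger systems described in the talk.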

One nice feature of these networks is that it is possible to take a trained model and use it for purposes other than the intended one. A network that has been trained to recognize objects in general, for example, can be pressed into service as the starting point for a face detector. This approach is especially useful in settings where there isn't a vast amount of data available to train the network with. Another useful technique is "generative adversarial networks", where two independent networks are trained against each other. If one network generates fake images and another one detects fakes, both can be improved by pitting one against the other.

The dark side

There are many interesting applications of deep learning, he said, but also a dark side. Open-source software can, in general, be used for any purpose regardless of whether the author approves; it can be used to create weapons, for example. Deep-learning applications have their own set of uses that we should all be concerned about, he said.

For example, neural networks have an infinite appetite for data; the more data you can train a system with, the better it will learn its task. That gives huge companies an incentive to acquire as much data as they possibly can. Ostensibly this is done to create better products, but we have to trust these companies that they are not using this data for other purposes. As an example, smart assistants like Alexa will get better at speech recognition as they are trained with more data, so they save a copy of everything that is ever said to them (and sometimes things that are not). That is, he said, "scary".

Deep-learning systems are computationally expensive; it typically takes a huge farm of GPUs to perform the training. Running them is cheaper, but they still don't really fit onto edge devices, with the result that processing moves to the cloud — and all of that input data moves with it. Efficiency does not really appear to be a concern for the people who are designing and building these systems.

There are introspection issues; how does one diagnose problems with a deep-learning system when one doesn't really understand how it comes to its conclusions in the first place? Mistakes are bound to happen, and some of them may have severe consequences. Many of these issues can be solved with more input data, of course, but training data can have unknown biases in it. It will always be possible to get "weird results" from deep-learning systems, and there is no easy way to figure out why when that happens.

Then there is the issue of bias in general. He called out the famous case of Google Photos labeling black faces as belonging to gorillas. Such errors are the result of poor training data and a lack of comprehensive testing; he suggested that perhaps this case shows that Google does not have enough black employees. Word embedding is a useful technique for language processing that tracks the "distance" between related words. A word-embedding system trained on web text is much more likely to associate the word "doctor" with "man" than "woman". Some biases, such as gender-related problems, can be corrected with a technique called "reprojection", but others, such as race, are harder to deal with.
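The talk did not go into how "reprojection" works. One plausible reading, sketched here purely as an assumption, is that a bias direction is estimated from a pair of words and then projected out of other word vectors. The three-dimensional vectors below are made up; real embeddings have hundreds of dimensions and are learned from text.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: 1.0 for parallel vectors, 0.0 for orthogonal ones."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def remove_direction(vector, direction):
    """Subtract the component of vector that lies along direction."""
    scale = dot(vector, direction) / dot(direction, direction)
    return [v - scale * d for v, d in zip(vector, direction)]


# Made-up embeddings in which "doctor" leans toward "man".
EMBEDDINGS = {
    "man": [1.0, 0.2, 0.1],
    "woman": [-1.0, 0.2, 0.1],
    "doctor": [0.4, 0.9, 0.3],
}

# Estimate the bias direction as the difference between the two words.
gender_direction = [
    m - w for m, w in zip(EMBEDDINGS["man"], EMBEDDINGS["woman"])
]
doctor_debiased = remove_direction(EMBEDDINGS["doctor"], gender_direction)
```

Before the projection, "doctor" is closer to "man" than to "woman" by cosine similarity; afterward it is equally close to both. This works only when the bias can be captured by a single direction, which may be part of why other kinds of bias are harder to correct.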

Deep learning at Mozilla

Mozilla has the desire to use these technologies and to make them available to others. But, at the same time, there is a strong desire to avoid the above problems. Moffitt listed a number of projects that, Mozilla hopes, will meet those goals.

The DeepSpeech project is building a speech-to-text system, focused on both recognition and data collection. Existing applications in this space are all owned by big companies; using them involves paying money and sending data to the cloud. DeepSpeech is meant to allow more people to play around in this space. To that end, DeepSpeech has been implemented using TensorFlow. It is able to run in real time on mobile devices, so there is no need to send data to some cloud server. With an error rate of 6.48%, it is the highest-quality open engine available and is close to the natural human error rate of 5.83%.
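The article does not say how the error rate is measured; speech-recognition results are usually reported as a word error rate, so the sketch below assumes that metric. It counts the minimum number of word substitutions, insertions, and deletions needed to turn the recognizer's output into the reference transcript, divided by the length of the reference. The function name is invented for this example.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between the texts, divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # previous[j] holds the edit distance between the reference words
    # processed so far (minus the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]  # turning i reference words into nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

For example, recognizing "the cat sat on the mat" as "the cat sat on a mat" is one substitution in six words, an error rate of about 16.7%.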

DeepSpeech currently has models for the English language, mostly because there is a great wealth of suitable data available (free audio books, for example, which allow the speech-to-text output to be compared against the original). Other languages are harder to support, but Mozilla wants to try. The Common Voice project is working to get sample text in other languages, with 20 languages targeted at the outset. It has collected about 1,800 hours of data so far. (See also: LWN's coverage of DeepSpeech and Common Voice from late 2017.)

Another experimental system is called "deepproof", which is a spelling and grammar checker for Firefox. The Grammarly extension for Firefox will do that now, but there is a little problem: it is essentially a key logger, sending everything the user types into the browser to a central server. That's not the kind of extension one might want to install, but Grammarly has a huge number of users, which is scary, he said.

Mozilla has set out to create a replacement that can run entirely within the browser on the user's device. It learns its corrections by example rather than through lots of rules, which is more scalable and requires less language-specific tweaking. The core technique used is to take text from Wikipedia, mutate it in some fashion, then set the system to correcting it; that allows it to learn without the need for language-specific experts. The result "seems to work" but needs more time before it will be production-ready. There are plans for a federated learning system that allows learning from everybody's mistakes but which doesn't require actually sharing everybody's text.
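The "mutate, then correct" approach amounts to manufacturing training pairs from clean text. The sketch below shows one way that could look; the specific mutations (dropped, doubled, and swapped letters), the error rate, and the function names are assumptions made for illustration, since the talk did not describe which mutations are actually used.

```python
import random


def mutate(sentence, rng, error_rate=0.1):
    """Return a copy of sentence with random character-level typos added."""
    chars = list(sentence)
    out = []
    i = 0
    while i < len(chars):
        if chars[i].isalpha() and rng.random() < error_rate:
            kind = rng.choice(["drop", "double", "swap"])
            if kind == "drop":
                i += 1  # skip the letter entirely
                continue
            if kind == "double":
                out.extend([chars[i], chars[i]])  # repeat the letter
                i += 1
                continue
            if kind == "swap" and i + 1 < len(chars):
                out.extend([chars[i + 1], chars[i]])  # transpose with the next character
                i += 2
                continue
        out.append(chars[i])  # no mutation applied: copy the character as-is
        i += 1
    return "".join(out)


def make_training_pairs(sentences, seed=0):
    """Build (corrupted, original) pairs for a model to learn corrections from."""
    rng = random.Random(seed)
    return [(mutate(sentence, rng), sentence) for sentence in sentences]
```

Because the original sentence is always known, no human has to label anything; any sufficiently large body of well-edited text in a given language can serve as the source.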

Finally, there is LPCNet, which is a text-to-speech system. These systems tend to be written as end-to-end applications, converting characters to audio spectrograms which are then converted to audio. A lot of systems use an algorithm called Griffin-Lim, but the results don't sound all that great. The WaveNet neural network produces better output, but requires "tens of gigaflops" of computing power to run; WaveRNN is faster than WaveNet, but it is still too expensive to run on a mobile device. Something much more efficient is needed if the objective is to run on end-user systems.

LPCNet works by performing a digital signal-processing pass over the data before feeding it to the neural net; this pass can predict a lot of the resulting output. That allows the network itself to be much smaller, to the point that it can run on a mobile device. Large-network systems like WaveRNN are probably performing a similar sort of filtering, he said, but nobody can know for sure since it's all coded into the network itself. The result "works really well" on mobile hardware and turns out to be useful for a number of other tasks, including speech compression, noise suppression, time stretching, and packet-loss concealment.
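The article does not name the signal-processing step, but the "LPC" in LPCNet suggests linear prediction, in which each audio sample is estimated as a weighted sum of the samples before it. The sketch below illustrates that idea on a pure tone, where a two-coefficient predictor is exact; it is not LPCNet's code, and real speech would need more coefficients that are re-estimated frame by frame.

```python
import math


def prediction_residual(signal, coefficients):
    """Return what is left after predicting each sample from the preceding ones."""
    order = len(coefficients)
    residual = []
    for n in range(order, len(signal)):
        # coefficients[0] weights the most recent sample, coefficients[1] the one before.
        prediction = sum(c * signal[n - 1 - k] for k, c in enumerate(coefficients))
        residual.append(signal[n] - prediction)
    return residual


def energy(samples):
    return sum(s * s for s in samples)


OMEGA = 0.3  # arbitrary tone frequency in radians per sample
tone = [math.sin(OMEGA * n) for n in range(200)]

# A sinusoid satisfies s[n] = 2*cos(OMEGA)*s[n-1] - s[n-2] exactly.
coefficients = [2 * math.cos(OMEGA), -1.0]
residual = prediction_residual(tone, coefficients)
```

Here the residual energy is negligible compared to that of the tone itself. For speech the prediction is less perfect, but the principle is the same: a neural network that only has to model the leftover residual has far less to learn, which is consistent with Moffitt's explanation of why the network can be so much smaller.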

At that point Moffitt concluded his talk. For those wanting all of the details, a video of the talk is available; it can be seen on YouTube as well.

[Thanks to linux.conf.au and the Linux Foundation for supporting my travel to the event.]

Index entries for this article
Conference: linux.conf.au/2019



Mozilla's initiatives for non-creepy deep learning

Posted Feb 6, 2019 19:25 UTC (Wed) by flussence (guest, #85566) [Link] (6 responses)

>The WaveNet neural network produces better output, but requires "tens of gigaflops" of computing power to run; WaveRNN is faster than WaveNet, but it is still too expensive to run on a mobile device. Something much more efficient is needed if the objective is to run on end-user systems.

I have a > 1 TFLOPs mid-range GPU in my desktop PC. Why aren't Mozilla showing any initiative to allow people to self-host this stuff instead of having ever-more-overpowered dumb clients beholden to a single vendor?

Mozilla's initiatives for non-creepy deep learning

Posted Feb 7, 2019 11:35 UTC (Thu) by jezuch (subscriber, #52988) [Link]

Mobile devices were mentioned a couple of times in the talk. I think it's reasonable for them to target the more ambitious goal of running on them; what works on a mobile, works on a desktop.

Besides, I think it's also a reasonable goal to reduce the wastefulness of these tasks. Megacorps might not care, but billions of GPUs do make a difference.

All IMO, of course.

Mozilla's initiatives for non-creepy deep learning

Posted Feb 7, 2019 12:06 UTC (Thu) by amck (subscriber, #7270) [Link] (3 responses)

I think you misunderstand. WaveRNN is expected to be run on end users' mobiles: hence the problem with "tens of Gigaflops" of power. The problem is to get the code down to lower power requirements.

Mozilla's initiatives for non-creepy deep learning

Posted Feb 7, 2019 12:42 UTC (Thu) by pj (subscriber, #4506) [Link] (2 responses)

I think he understands just fine and is just suggesting that if the power requirements can't be lowered enough for on-mobile use to be feasible, an alternative might be to allow self-hosting, since the described power requirement is within range of a single gpu-enabled compute server.

Mozilla's initiatives for non-creepy deep learning

Posted Feb 7, 2019 20:19 UTC (Thu) by t-v (guest, #112111) [Link] (1 responses)

Doesn't the WaveRNN paper specifically discuss Snapdragon performance with an eye on mobile phones?

Mozilla's initiatives for non-creepy deep learning

Posted Feb 9, 2019 21:45 UTC (Sat) by jmspeex (guest, #51639) [Link]

LPCNet (which the presentation refers to) requires 20% of an x86 core to run in real-time. We got it real-time on a single core of an iPhone 6. So real-time on mobile devices is definitely possible.

Mozilla's initiatives for non-creepy deep learning

Posted Feb 14, 2019 12:07 UTC (Thu) by mips (guest, #105013) [Link]

FWIW Mozilla's DeepSpeech stuff at least does allow you to use a local GPU: https://github.com/mozilla/DeepSpeech/#cuda-dependency

Mozilla's initiatives for non-creepy deep learning

Posted Feb 6, 2019 20:56 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

> The Common Voice project is working to get sample text in other languages, with 20 languages targeted at the outset. It has collected about 1,800 hours of data so far. (See also: LWN's coverage of DeepSpeech and Common Voice from late 2017.)
OK, about this.

It seems that Common Voice is a huge management failure. It's been two years that I've been waiting for Russian language to appear in the list.

It seems this is not going to happen. Let's recap what happened:
1) For a language to be supported everything has to be localized.
2) The project took too long to add localization support (more than a year!).
3) Then they SWITCHED the localization engine.
4) Yet volunteers still wasted a lot of time and translated the strings, so now Common Voice UI is fully localized in Russian.

Should be good to go, right?

Not quite. The only missing part is the sentences to read. One heroic volunteer did a lot of work and created a list of 500 sentences that were proofread and corrected by another heroic volunteer.

Should be good to go NOW, right? We have sentences, localization, everything.

Nope. Common Voice in its eternal wisdom decided to require 5000 sentences for a language to be supported. As a result, Russian is not supported and nobody is working on it (heroic volunteers can move you only so far).

And it's not only Russian, there's no Hindi, Arabic or Mandarin. You know, the languages of 2/3 of the world.

So the project spent a huge amount of developer time during the last 3 years on various redesigns, achieving almost NOTHING at all. Can somebody at Mozilla look into this?

Mozilla's initiatives for non-creepy deep learning

Posted Feb 14, 2019 22:19 UTC (Thu) by ftyers (guest, #130437) [Link]

Hi,

1) Common Voice is great, and it is easy to get new languages into. I have helped to add: Chuvash, Breton, Turkish, Tatar, Kyrgyz, Hakha Chin, among others.
2) Localising the interface takes about 1 day of work, or a week maximum. There really aren't that many strings.
3) Getting sentences for Russian is easy. You just need to find public domain sentences that are not too long without too many numbers in. Finding such sentences for Russian will be trivial, it has a long literary tradition with a lot of public domain text.
4) Have you considered that the reason the "big languages" are not supported is that they already have very good speech recognition technology available "in the cloud", which gives less incentive for "heroic volunteers" to participate?
5) If you really want to get Russian into Common Voice, you should get active on Discourse and IRC: #machinelearning on irc.mozilla.org. I don't imagine it would take more than a week to get it done. I'm around on IRC most days and will be happy to help out.
6) For me, a researcher working on lesser-resourced languages, Common Voice is a massive success. I was able to collect data and build the first speech recognition systems for Chuvash and Tatar in comparatively little time. Certainly less than it would have taken if I had to design and implement all the machinery myself.

Three cheers to Mozilla and the Common Voice and DeepSpeech teams for making it ridiculously easy to collect speech data and build speech recognition systems using only free/open-source software.

Всего хорошего!

Mozilla's initiatives for non-creepy deep learning

Posted Feb 7, 2019 9:06 UTC (Thu) by NAR (subscriber, #1313) [Link] (3 responses)

"why would anybody in their right mind put a surveillance device made by Facebook in their kitchen?"

Come on. I guess most of us (probably including the speaker) carry a surveillance device with us all day... Those might not be made by Facebook, but every business wants to know as much as it can about its customers. My phone (and I think my previous phone too) could be controlled remotely.

Mozilla's initiatives for non-creepy deep learning

Posted Feb 8, 2019 8:57 UTC (Fri) by nilsmeyer (guest, #122604) [Link] (2 responses)

It's a question of incentive. What incentive does Apple have to spy on me? A lot less than Facebook I bet.

Mozilla's initiatives for non-creepy deep learning

Posted Feb 8, 2019 10:31 UTC (Fri) by NAR (subscriber, #1313) [Link]

It's not just incentives. See this FaceTime bug. Apple itself intentionally used to collect data. Of course, the OS of most spy-enabled phones is made by Google, who have just as much incentive to spy on you as Facebook.

Mozilla's initiatives for non-creepy deep learning

Posted Feb 12, 2019 20:43 UTC (Tue) by ssmith32 (subscriber, #72404) [Link]

There are many more Androids than Apples ..

Mozilla's initiatives for non-creepy deep learning

Posted Feb 14, 2019 22:18 UTC (Thu) by robbe (guest, #16131) [Link]

> Other languages are harder to support, but Mozilla wants to try. The Common Voice
> project is working to get sample text in other languages, with 20 languages targeted
> at the outset. It has collected about 1,800 hours of data so far.

LibriVox (.org) has four more languages (de, fr, es, it) with more than 100 audiobooks each. The diversity of speakers may be a problem for "smaller" languages, though.

I cannot get the Common Voice website to work for me, so no idea where they stand right now.


Copyright © 2019, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds