Mozilla's initiatives for non-creepy deep learning
Moffitt defined machine learning as the process of making decisions and/or predictions by modeling from input data. Systems using these techniques can perform all kinds of tasks, including language detection and (bad) poetry generation. The classic machine-learning task is spam filtering, based on the idea that certain words tend to appear more often in spam and can be used to detect unwanted email. With more modern neural networks, though, there is no need to do that sort of feature engineering by hand; the net itself can figure out what the interesting features are. It is, he said, "pretty magical".
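The word-frequency idea behind a classic spam filter can be sketched in a few lines of Python. This is a minimal naive-Bayes-style illustration rather than anything shown in the talk; the training messages, the `train()` and `spam_score()` names, and the add-one smoothing are all assumptions made for the example.

```python
import math
from collections import Counter

def train(messages):
    """Count how often each word appears in spam and in non-spam ("ham")."""
    counts = {"spam": Counter(), "ham": Counter()}
    for label, text in messages:
        counts[label].update(text.lower().split())
    return counts

def spam_score(counts, text):
    """Sum per-word log-likelihood ratios; a positive score leans toward spam."""
    vocab = set(counts["spam"]) | set(counts["ham"])
    spam_total = sum(counts["spam"].values())
    ham_total = sum(counts["ham"].values())
    score = 0.0
    for word in text.lower().split():
        # Add-one smoothing so that unseen words never produce log(0).
        p_spam = (counts["spam"][word] + 1) / (spam_total + len(vocab))
        p_ham = (counts["ham"][word] + 1) / (ham_total + len(vocab))
        score += math.log(p_spam / p_ham)
    return score

# Made-up training data, purely for illustration.
TRAINING = [
    ("spam", "win free money now"),
    ("spam", "free prize claim now"),
    ("ham", "meeting notes for the kernel patch"),
    ("ham", "please review the patch before the meeting"),
]
```

Here the "features" are simply the words themselves, chosen by the programmer; that choice is the hand-built feature engineering that a neural network is meant to make unnecessary.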
Moffitt gave a quick overview of some of the structures used for
contemporary deep learning, including neural networks, convolutional
networks, and recurrent networks. The last of those are useful for speech
recognition and synthesis tasks; they are used a lot at Mozilla. See the
video (linked at the bottom) for more details about how these different
types of networks perform their magic. Regardless of the
architecture, the overall technique used to train these networks is the
same: present them with input data, then tweak the network's parameters to
bring the output closer to what is desired. Do that enough times with
enough data, and the network should get good at performing the intended
task.
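That training recipe can be shown at its smallest scale with a model that has just two parameters. The sketch below is an assumption-laden toy, not a real neural network: `train_linear()` is a hypothetical name, the data is invented, and the update rule is plain stochastic gradient descent on a squared error.

```python
def train_linear(samples, learning_rate=0.05, epochs=500):
    """Fit y = w*x + b by repeatedly nudging w and b to shrink the error."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, target in samples:
            output = w * x + b        # present the input to the model
            error = output - target   # distance from the desired output
            # Gradient of 0.5 * error**2 with respect to w and b.
            w -= learning_rate * error * x
            b -= learning_rate * error
    return w, b

# Toy data generated from y = 2x + 1 (an assumption for the example).
SAMPLES = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
```

Calling `train_linear(SAMPLES)` returns values close to `w = 2` and `b = 1`. A deep network follows the same loop, only with millions of parameters and gradients computed by backpropagation.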
One nice feature of these networks is that it is possible to take a trained model and use it for purposes other than the intended one. A network that has been trained to recognize objects in general, for example, can be pressed into service as the starting point for a face detector. This approach is especially useful in settings where there aren't vast amounts of data available for training. Another useful technique is the "generative adversarial network", where two independent networks are trained against each other. If one network generates fake images and another one detects fakes, both can be improved by pitting one against the other.
The dark side
There are many interesting applications of deep learning, he said, but also a dark side. Open-source software can, in general, be used for any purpose regardless of whether the author approves; it can be used to create weapons, for example. Deep-learning applications have their own set of uses that we should all be concerned about, he said.
For example, neural networks have an infinite appetite for data; the more data you can train a system with, the better it will learn its task. That gives huge companies an incentive to acquire as much data as they possibly can. Ostensibly this is done to create better products, but we have to trust that these companies are not using this data for other purposes. As an example, smart assistants like Alexa will get better at speech recognition as they are trained with more data, so they save a copy of everything that is ever said to them (and sometimes things that are not). That is, he said, "scary".
Deep-learning systems are computationally expensive; it typically takes a huge farm of GPUs to perform the training. Running them is cheaper, but they still don't really fit onto edge devices, with the result that processing moves to the cloud — and all of that input data moves with it. Efficiency does not really appear to be a concern for the people who are designing and building these systems.
There are also introspection issues: how does one diagnose problems with a deep-learning system when one doesn't really understand how it comes to its conclusions in the first place? Mistakes are bound to happen, and some of them may have severe consequences. Many of these issues can be solved with more input data, of course, but training data can contain unknown biases. It will always be possible to get "weird results" from deep-learning systems, and there is no easy way to figure out why when that happens.
Then there is the issue of bias in general. He called out the famous case of Google Photos labeling black faces as belonging to gorillas. Such errors are the result of poor training data and a lack of comprehensive testing; he suggested that perhaps this case shows that Google does not have enough black employees. Word embedding is a useful technique for language processing that tracks the "distance" between related words. A word-embedding system trained on web text is much more likely to associate the word "doctor" with "man" than "woman". Some biases, such as gender-related problems, can be corrected with a technique called "reprojection", but others, such as race, are harder to deal with.
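The notion of "distance" between word vectors, and one plausible reading of what "reprojection" does, can be illustrated with a toy example. Everything here is an assumption: the three-dimensional vectors are invented (real embeddings have hundreds of dimensions learned from text), and `remove_direction()` is a hypothetical helper that projects out a single "gender axis"; the talk did not spell out the exact method.

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, -1.0 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def remove_direction(vec, direction):
    """Subtract the component of vec that lies along direction."""
    scale = (sum(v * d for v, d in zip(vec, direction))
             / sum(d * d for d in direction))
    return [v - scale * d for v, d in zip(vec, direction)]

# Invented vectors; the first coordinate plays the role of "gender".
EMBEDDINGS = {
    "man":    [1.0, 0.2, 0.0],
    "woman":  [-1.0, 0.2, 0.0],
    "doctor": [0.4, 0.9, 0.1],
}

# Treat the difference between "man" and "woman" as the gender axis.
GENDER_AXIS = [m - w for m, w in zip(EMBEDDINGS["man"], EMBEDDINGS["woman"])]
NEUTRAL_DOCTOR = remove_direction(EMBEDDINGS["doctor"], GENDER_AXIS)
```

With these made-up numbers, "doctor" starts out closer to "man" than to "woman"; after the projection it is equally similar to both. Biases that do not line up with a single, easily identified axis cannot be removed this simply.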
Deep learning at Mozilla
Mozilla wants to use these technologies and to make them available to others while, at the same time, avoiding the problems described above. Moffitt listed a number of projects that, Mozilla hopes, will meet those goals.
The DeepSpeech project is building a speech-to-text system, focused on both recognition and data collection. Existing applications in this space are all owned by big companies; using them involves paying money and sending data to the cloud. DeepSpeech is meant to allow more people to play around in this space. To that end, DeepSpeech has been implemented using TensorFlow. It is able to run in real time on mobile devices, so there is no need to send data to some cloud server. With an error rate of 6.48%, it is the highest-quality open engine available and is close to the natural human error rate of 5.83%.
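The article does not say how those error rates are measured; speech recognizers are commonly scored by word error rate, so the sketch below assumes that metric. It counts the minimum number of word substitutions, deletions, and insertions needed to turn the recognizer's output into the reference transcript, divided by the length of the reference. The function name `word_error_rate()` is an assumption for the example.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("the cat sat on the mat", "the cat sat on a mat")` is 1/6, or about 16.7%: one wrong word out of six.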
DeepSpeech currently has models for the English language, mostly because there is a great wealth of suitable data available (free audio books, for example, which allow the speech-to-text output to be compared against the original). Other languages are harder to support, but Mozilla wants to try. The Common Voice project is working to get sample text in other languages, with 20 languages targeted at the outset. It has collected about 1,800 hours of data so far. (See also: LWN's coverage of DeepSpeech and Common Voice from late 2017.)
Another experimental system is called "deepproof", which is a spelling and grammar checker for Firefox. The Grammarly extension for Firefox will do that now, but there is a little problem: it is essentially a key logger, sending everything the user types into the browser to a central server. That's not the kind of extension one might want to install, but Grammarly has a huge number of users, which is scary, he said.
Mozilla has set out to create a replacement that can run entirely within the browser on the user's device. It learns its corrections by example rather than through lots of rules, which is more scalable and requires less language-specific tweaking. The core technique used is to take text from Wikipedia, mutate it in some fashion, then set the system to correcting it; that allows it to learn without the need for language-specific experts. The result "seems to work" but needs more time before it will be production-ready. There are plans for a federated learning system that allows learning from everybody's mistakes but which doesn't require actually sharing everybody's text.
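The "mutate, then correct" approach amounts to manufacturing training pairs from text that is already known to be good. The following sketch shows one way that might look; the three mutation types and the names `mutate()` and `make_training_pairs()` are assumptions for illustration, and nothing here reflects how deepproof actually generates its errors.

```python
import random

def mutate(sentence, rng):
    """Return a copy of sentence with one random character-level error.

    Expects at least two characters. Swapping two identical neighbors
    leaves the text unchanged, which a real pipeline might filter out.
    """
    chars = list(sentence)
    pos = rng.randrange(len(chars) - 1)
    kind = rng.choice(["swap", "drop", "double"])
    if kind == "swap":
        chars[pos], chars[pos + 1] = chars[pos + 1], chars[pos]
    elif kind == "drop":
        del chars[pos]
    else:
        chars.insert(pos, chars[pos])
    return "".join(chars)

def make_training_pairs(sentences, seed=0):
    """Build (corrupted input, desired output) pairs from clean sentences."""
    rng = random.Random(seed)
    return [(mutate(s, rng), s) for s in sentences]
```

Each pair gives a model a damaged sentence as input and the original as the target. Since the mutations are mechanical, the same code works on text in any language, which is the point of avoiding hand-written rules.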
Finally, there is LPCNet, which is a text-to-speech system. These systems tend to be written as end-to-end applications, converting characters to audio spectrograms which are then converted to audio. A lot of systems use an algorithm called Griffin-Lim, but the results don't sound all that great. The WaveNet neural network produces better output, but requires "tens of gigaflops" of computing power to run; WaveRNN is faster than WaveNet, but it is still too expensive to run on a mobile device. Something much more efficient is needed if the objective is to run on end-user systems.
LPCNet works by performing a digital signal-processing pass over the data before feeding it to the neural net; this pass can predict a lot of the resulting output. That allows the network itself to be much smaller, to the point that it can run on a mobile device. Large-network systems like WaveRNN are probably performing a similar sort of filtering, he said, but nobody can know for sure since it's all coded into the network itself. The result "works really well" on mobile hardware and turns out to be useful for a number of other tasks, including speech compression, noise suppression, time stretching, and packet-loss concealment.
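The kind of work a signal-processing pass can take off the network's hands is easy to see with linear prediction, the technique that the "LPC" in the name suggests. This is a toy sketch under stated assumptions, not LPCNet's code: the predictor coefficients are fixed by hand for a pure tone, whereas a real system would derive them from the speech itself, and `linear_prediction_residual()` is a hypothetical name.

```python
import math

def linear_prediction_residual(signal, coeffs):
    """Return what is left after predicting each sample from earlier ones."""
    order = len(coeffs)
    residual = []
    for n in range(order, len(signal)):
        # coeffs[0] weights the previous sample, coeffs[1] the one before.
        predicted = sum(c * signal[n - 1 - k] for k, c in enumerate(coeffs))
        residual.append(signal[n] - predicted)
    return residual

# A pure tone obeys x[n] = 2*cos(w)*x[n-1] - x[n-2] exactly, so a
# two-tap predictor accounts for essentially all of the signal.
OMEGA = 0.3
TONE = [math.sin(OMEGA * n) for n in range(200)]
COEFFS = [2 * math.cos(OMEGA), -1.0]
RESIDUAL = linear_prediction_residual(TONE, COEFFS)
```

The tone swings between nearly -1 and 1, while the residual is zero apart from floating-point rounding. Speech is not that predictable, but the principle is the same: whatever the cheap predictor explains is something the neural network no longer has to model.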
At that point Moffitt concluded his talk. For those wanting all of the details, a video of the talk is available; it can be seen on YouTube as well.
[Thanks to linux.conf.au and the Linux Foundation for supporting my travel
to the event.]
| Index entries for this article | |
|---|---|
| Conference | linux.conf.au/2019 |
