FLOSS speech recognition
Peter Grasch spoke on the second day of Akademy about the status of FLOSS speech recognition. The subtitle of his talk asked "why aren't we there yet?" and Grasch provided some reasons for that, while also showing that we aren't all that far behind the proprietary alternatives. Some focused work from more contributors could make substantial progress on the problem.
![Peter Grasch](https://static.lwn.net/images/2013/akad-grasch-sm.jpg)
"Speech recognition" is not really a useful term in isolation, he said. It is like saying you want to develop an "interaction method" without specifying what the user would be interacting with. In order to make a "speech recognition" system, you need to know what it will be used for.
Grasch presented three separate scenarios for speech recognition, which were the subject of a poll that he ran on his blog. The poll was to choose which of the three he would focus his efforts on for his Akademy talk. There are no open source solutions for any of the scenarios, which makes them interesting to attack. He promised to spend a week—which turned into two—getting a demo ready for the conference.
The first option (and the winning choice) was "dictation", turning spoken words into text. Second was "virtual assistant" technology, like Apple's Siri. The last option was "simultaneous translation" from one spoken language to another.
Basics
Speech recognition starts with a microphone that picks up some speech, such as "this is a test". That speech gets recorded by the computer and turned into a representation of its waveform. Then, recognition is done on the waveform ("some magic happens there") to turn it into phonemes, which are, essentially, the individual sounds that make up words in a language. Of course, each speaker says things differently, and accents play a role, which makes phoneme detection a fairly complex, probabilistic calculation with "a lot of wiggle room".
From the phonemes, there can be multiple interpretations, "this is a test" or "this is attest", for example. Turning waveforms into phonemes uses an "acoustic model", but a "language model" is required to differentiate multiple interpretations of the phonemes. The language model uses surrounding words to choose the more likely collection of words based on the context. The final step is to get the recognized text into a text editor, word processor, or other program, which is part of what his project, Simon, does. (LWN reviewed Simon 0.4.0 back in January.)
In free software, there are pretty good algorithms available for handling the models, from CMU Sphinx for example, but there are few free models. Grasch said that open source developers do not seem interested in creating acoustic or language models. For his work, he used CMU Sphinx for the algorithms, though there are other free choices. For the models, he wanted open source options because proprietary models are "no fun"—and expensive.
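To make the pipeline concrete, here is a minimal decoding sketch using the CMU Pocketsphinx Python bindings (the older SWIG-based API). It is not Grasch's Simon setup; the model paths and the audio file name are placeholders. The three set_string() calls correspond to the pieces described above: the acoustic model, the pronunciation dictionary, and the language model.

```python
# Minimal Pocketsphinx decoding sketch; model paths are placeholders.
from pocketsphinx.pocketsphinx import Decoder

config = Decoder.default_config()
config.set_string('-hmm', 'models/acoustic')         # acoustic model: audio -> phonemes
config.set_string('-dict', 'models/words.dict')      # pronunciation dictionary: phonemes -> words
config.set_string('-lm', 'models/language.lm.bin')   # language model: picks the likely word sequence
decoder = Decoder(config)

# Feed raw 16kHz, 16-bit mono audio to the decoder in chunks.
decoder.start_utt()
with open('test.raw', 'rb') as audio:
    while True:
        chunk = audio.read(4096)
        if not chunk:
            break
        decoder.process_raw(chunk, False, False)
decoder.end_utt()

hypothesis = decoder.hyp()
if hypothesis is not None:
    print(hypothesis.hypstr)   # e.g. "this is a test" rather than "this is attest"
```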
For the acoustic model, he used VoxForge. Creating an acoustic model requires recordings of "lots and lots" of spoken language. It requires a learning system that gets fed all of this training data in order to build the acoustic model. VoxForge has a corpus of training data, which Grasch used for his model.
The language model is similar, in that it requires a large corpus of training data, but it is all text. He built a corpus of text by combining data from Wikipedia, newsgroups (that contained a lot of spam, which biased his model some), US Congress transcriptions (which don't really match his use case), and so on. It ended up as around 15G of text that was used to build the model.
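Grasch's real language model is a statistical n-gram model built with dedicated tools from that 15G corpus; the toy sketch below is only an illustration of the idea, not his toolchain. It counts bigrams in a tiny corpus and uses them to score the two competing transcriptions from the earlier example:

```python
# Toy bigram language model: counts pairs of adjacent words and scores
# candidate transcriptions by how plausible their word pairs are.
from collections import Counter
import math

def train_bigrams(corpus_text):
    words = corpus_text.lower().split()
    return Counter(words), Counter(zip(words, words[1:]))

def score(sentence, unigrams, bigrams):
    """Average log-probability per word pair, add-one smoothed (higher is better)."""
    words = sentence.lower().split()
    vocab_size = len(unigrams) + 1
    logs = [math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
            for prev, cur in zip(words, words[1:])]
    return sum(logs) / len(logs)

unigrams, bigrams = train_bigrams(
    "this is a test . that was a hard test . witnesses attest to the result .")
for candidate in ("this is a test", "this is attest"):
    print(candidate, score(candidate, unigrams, bigrams))
# "this is a test" scores higher because all of its word pairs occur in the corpus.
```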
Demos
Before getting to the demo of his code, Grasch showed two older videos of users trying out the speech recognition in Windows Vista, with one showing predictably comical results; the other worked well. When Vista was introduced six years ago, its speech recognition was roughly where Simon is today, Grasch said. It is popular to bash Microsoft, but its system is pretty good, he said; live demos of speech recognition often show "wonky" behavior.
To prove that point, Grasch moved on to his demos. The system he was showing is not something that he recommends for general use. It is, to some extent, demo-ware that he put together in a short time to try to get people excited about speech recognition so they would "come to the BoF later in the week."
He started by using the KNotes notepad application and said "Hello KDE developers period" into the microphone plugged into his laptop. After a one second or so pause, "Hi Leo kde developers." showed up in the notepad. He then said "new paragraph this actually works better than Microsoft's system I am realizing period" which showed up nearly perfectly (just missing the possessive in "Microsoft's") to loud applause—applause that resulted in a gibberish "sentence" appearing a moment or two later.
Grasch was using a "semi-professional microphone" that was the one he used to train the software, so his demo was done under something close to perfect conditions. Under those conditions, he gets a word error rate (WER) of 13.3%, "which is pretty good". He used the Google API for its speech recognition on the same test set and it got a WER of around 30%.
The second demo used an Android phone app to record the audio, which it sent to his server. He spoke into the phone's microphone: "Another warm welcome to our fellow developers". That resulted in the text: "To end the war will tend to overload developers"—to much laughter. But Grasch was unfazed: "Yes! It recognized 'developers'", he said with a grin. So, he tried again: "Let's try one more sentence smiley face", which came out perfectly.
Improvements
To sum up, his system works pretty well in optimal conditions and it "even kinda works on stage, but not really", Grasch said. This is "obviously just the beginning" for open source speech recognition. He pointed to the "80/20 rule": the first 80% of the work is relatively easy, he said, but even that 80% has not yet been reached. That means there is a lot of "payoff for little work" at this point. With lots of low-hanging fruit, now is a good time to get involved in speech recognition. Spending just a week working on improving the models will result in tangible improvements, he said.
Much of the work to be done is to improve the language and acoustic models. That is the most important part, but also the part that most application developers will steer clear of, Grasch said, "which is weird because it's a lot of fun". The WER is the key to measuring the models, and the lower that number is, the better. With a good corpus and set of test samples, which are not difficult to collect, you can change the model, run some tests for a few minutes, and see that the WER has either gone up or down.
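WER itself is a simple quantity: the word-level edit distance (substitutions, insertions, and deletions) between the recognizer's output and a reference transcript, divided by the number of words in the reference. A small illustrative implementation, not the scoring tool Grasch used:

```python
# Word error rate: word-level edit distance divided by reference length.
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution or match
    return dist[len(ref)][len(hyp)] / len(ref)

# "hello kde developers" misrecognized as "hi leo kde developers":
# one substitution plus one insertion over three reference words.
print(word_error_rate("hello kde developers", "hi leo kde developers"))  # about 0.67
```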
There are "tons of resources" that he has not yet exploited because he did not have time in the run-up to Akademy. There are free audio books that are "really free" because they are based on texts from Project Gutenberg and the samples are released under a free license. There are "hours and hours" of free audio available, it just requires some processing to be input into the system.
Another way to acquire samples is to use the prototype for some task. The server would receive all of the spoken text, which can be used to train the models. This is what the "big guys" are doing, he said. Google's speech recognition service may look like it is free, but the company is using it to collect voice samples. This is "not evil" as long as it is all stored anonymously, he said.
There are also recordings of conference talks that could be used. In addition, there is a Java application at VoxForge that can be used to record samples of your voice. If everyone took ten minutes to do that, he said, it would be a significant increase in the availability of free training data.
For the language model, more text is needed. Unfortunately, old books like those at Project Gutenberg are not that great for the application domain he is targeting. Text from sources like blogs would be better, he said. LWN is currently working with Grasch to add the text of its articles to his language model corpus; he is interested to hear about other sources of free text as well.
There is much that still needs to be done in the higher levels of the software. For example, corrections (e.g. "delete that") are rudimentary at the moment. Also, the performance of the recognition could be improved. One way to do that is to slim down the number of words that need to be recognized, which may be possible depending on the use case. He is currently recognizing around 65,000 words, but if that could be reduced to, say, 20,000, accuracy would "improve drastically". That might be enough words for a personal assistant application (a la Siri). Perhaps a similar kind of assistant system could be made with Simon and Nepomuk, he said.
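One plausible way to shrink the active vocabulary, sketched below with hypothetical file names, is to keep only the most frequent words from the training corpus and drop everything else from the pronunciation dictionary; the language model would then be rebuilt around the same reduced word list.

```python
# Sketch: restrict the vocabulary to the N most frequent corpus words and
# trim the pronunciation dictionary accordingly. File names are placeholders.
from collections import Counter

def top_words(corpus_path, n=20000):
    counts = Counter()
    with open(corpus_path, encoding='utf-8') as corpus:
        for line in corpus:
            counts.update(line.lower().split())
    return {word for word, _ in counts.most_common(n)}

def trim_dictionary(dict_in, dict_out, keep):
    # Each dictionary line maps a word to its phonemes, e.g. "test T EH S T".
    with open(dict_in, encoding='utf-8') as src, \
         open(dict_out, 'w', encoding='utf-8') as dst:
        for entry in src:
            fields = entry.split()
            if fields and fields[0].lower() in keep:
                dst.write(entry)

keep = top_words('corpus.txt', n=20000)
trim_dictionary('full.dict', 'assistant.dict', keep)
```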
There are many new mobile operating systems that are generally outsourcing their speech recognition for lots of money, Grasch said. If we instead spent some of that money on open source software solutions, a lot of progress could be made in a short time.
Building our own speech models has advantages beyond just being "free and open source and nice and fluffy", he said. For example, if there was interest in being able to program via speech, a different language model could be created that included JavaScript, C++, Qt function names, and the like. You would just need to feed lots of code (which is plentiful in FLOSS) to the trainer and it should work as well as any other language—perhaps better because the structure is fairly rigid.
Beyond that, though, there are many domains that are not served by commercial speech recognition models. Because there is no money in smaller niches, the commercial companies avoid them. FLOSS solutions might not. Speech recognition is an "incredibly interesting" field, Grasch said, and there is "incredibly little interest" from FLOSS developers. He is the only active Simon developer, for example.
A fairly lively question-and-answer session followed the talk. Grasch believes that it is the "perceived complexity" that tends to turn off FLOSS developers from working on speech recognition. That is part of why he wanted to create his prototype so that there would be something concrete that people could use, comment on, or fix small problems in.
Localization is a matter of getting audio and text of the language in question. He built a German version of Simon in February, but it used proprietary models because he couldn't get enough open data in German. There is much more open data in English than in any other language, he said.
He plans to publish the models that he used right after the conference. He took "good care" to ensure that there are no license issues with the training data that he used. Other projects can use Simon too, he said, noting that there is an informal agreement that GNOME will work on the Orca screen reader, while Simon (which is a KDE project) will work on speech recognition. It doesn't make sense for there to be another project doing open source speech recognition, he said.
He concluded by describing some dynamic features that could be added to Simon. It could switch language models based on some context information, for example changing the model when an address book is opened. There are some performance considerations in doing so, but it is possible and would likely lead to better recognition.
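Pocketsphinx, for example, already lets several named language models be registered with a single decoder and switched at runtime, which is one plausible way to implement such context switching. A rough sketch, with hypothetical model paths, search names, and application name (the general model is loaded twice here purely to keep the sketch short):

```python
# Rough sketch of context-dependent language model switching with Pocketsphinx.
from pocketsphinx.pocketsphinx import Decoder

config = Decoder.default_config()
config.set_string('-hmm', 'models/acoustic')
config.set_string('-dict', 'models/words.dict')
config.set_string('-lm', 'models/general.lm.bin')
decoder = Decoder(config)

# Register the available language models under names we can switch between.
decoder.set_lm_file('general', 'models/general.lm.bin')
decoder.set_lm_file('addressbook', 'models/addressbook.lm.bin')
decoder.set_search('general')

def on_application_focus(app_name):
    # Activate the model that best matches what the user is likely to say.
    if app_name == 'kaddressbook':
        decoder.set_search('addressbook')
    else:
        decoder.set_search('general')
```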
[Thanks to KDE e.V. for travel assistance to Bilbao for Akademy.]
Index entries for this article
Conference: Akademy/2013
Comments

FLOSS speech recognition
Posted Jul 25, 2013 10:07 UTC (Thu) by stijn (subscriber, #570)

> "LWN is currently working with Grasch to add the text of its articles to his language model corpus"

And the comments on the articles? I wouldn't mind myself.

FLOSS speech recognition
Posted Jul 25, 2013 15:01 UTC (Thu) by stijn (subscriber, #570)

You make me sed.

FLOSS speech recognition
Posted Jul 25, 2013 18:19 UTC (Thu) by jimparis (guest, #38647)

I think fat could bee a bad idea. Sum people might try two purposely mess up duh model just too sea what happens.

FLOSS speech recognition
Posted Aug 1, 2013 23:21 UTC (Thu) by douglasbagnall (subscriber, #62736)

Written English is grammatically quite different from spoken English, but conversational comments like your one are (I speculate) less so than the articles. So using just the comments might lead to a better model of spoken English than including the articles would.

Following similar logic for a slightly different purpose, I have previously used Wikipedia talk pages for language models. You get good informal constructs there, particularly on pop culture topics.

data for acoustic models
Posted Aug 1, 2013 23:07 UTC (Thu) by douglasbagnall (subscriber, #62736)

A part of the problem (reasonably) not mentioned by Peter Grasch is the pronunciation model—a dictionary mapping phoneme sequences to words. Creating a new pronunciation model is neither trivial nor interesting. The only truly free one in English is the General American CMU dictionary. Wiktionary is a bit too messy and full of holes. The various text-to-speech engines can spit out phonemic transcriptions, but checking the output is a problem.

Nevertheless, Peter Grasch demonstrates that a voxforge model tuned to a single non-US speaker can do surprisingly well. It is the multiple speaker models that really suffer.

While I'm about it, I might as well mention another problem with Voxforge for general purpose models: its voices almost all belong to non-elderly adult males—the usual free software demographic. Its models won't perform well for children, women, or elderly men.

Anyway, great projects, Simon, Sphinx, and Voxforge; but great problems also, and they aren't software problems.

data for acoustic models
Posted Aug 2, 2013 20:48 UTC (Fri) by jgd (guest, #19925)

I prefer speaker-dependent models for accuracy and control