FLOSS speech recognition
Peter Grasch spoke on the second day of Akademy about the status of FLOSS speech recognition. The subtitle of his talk asked "why aren't we there yet?" and Grasch provided some reasons for that, while also showing that we aren't all that far behind the proprietary alternatives. Some focused work from more contributors could make substantial progress on the problem.
![Peter Grasch](https://static.lwn.net/images/2013/akad-grasch-sm.jpg)
"Speech recognition" is not really a useful term in isolation, he said. It is like saying you want to develop an "interaction method" without specifying what the user would be interacting with. In order to make a "speech recognition" system, you need to know what it will be used for.
Grasch presented three separate scenarios for speech recognition, which were the subject of a poll that he ran on his blog. The poll was to choose which of the three he would focus his efforts on for his Akademy talk. There are no open source solutions for any of the scenarios, which makes them interesting to attack. He promised to spend a week—which turned into two—getting a demo ready for the conference.
The first option (and the winning choice) was "dictation", turning spoken words into text. Second was "virtual assistant" technology, like Apple's Siri. The last option was "simultaneous translation" from one spoken language to another.
Basics
Speech recognition starts with a microphone that picks up some speech, such as "this is a test". That speech gets recorded by the computer and turned into a representation of its waveform. Then, recognition is done on the waveform ("some magic happens there") to turn it into phonemes, which are, essentially, the individual sounds that make up words in a language. Of course, each speaker says things differently, and accents play a role, which makes phoneme detection a fairly complex, probabilistic calculation with "a lot of wiggle room".
From the phonemes, there can be multiple interpretations, "this is a test" or "this is attest", for example. Turning waveforms into phonemes uses an "acoustic model", but a "language model" is required to differentiate multiple interpretations of the phonemes. The language model uses surrounding words to choose the more likely collection of words based on the context. The final step is to get the recognized text into a text editor, word processor, or other program, which is part of what his project, Simon, does. (LWN reviewed Simon 0.4.0 back in January.)
In free software, there are pretty good algorithms available for handling the models, from CMU Sphinx for example, but there are few free models. Grasch said that open source developers do not seem interested in creating acoustic or language models. For his work, he used CMU Sphinx for the algorithms, though there are other free choices. For the models, he wanted open source options because proprietary models are "no fun"—and expensive.
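To make the pipeline concrete, here is a minimal decoding sketch using the CMU Pocketsphinx Python bindings (the older SWIG-based API). It is not Grasch's Simon setup; the model paths and the audio file name are placeholders. The three set_string() calls correspond to the pieces described above: the acoustic model, the pronunciation dictionary, and the language model.

```python
# Minimal Pocketsphinx decoding sketch; model paths are placeholders.
from pocketsphinx.pocketsphinx import Decoder

config = Decoder.default_config()
config.set_string('-hmm', 'models/acoustic')         # acoustic model: audio -> phonemes
config.set_string('-dict', 'models/words.dict')      # pronunciation dictionary: phonemes -> words
config.set_string('-lm', 'models/language.lm.bin')   # language model: picks the likely word sequence
decoder = Decoder(config)

# Feed raw 16kHz, 16-bit mono audio to the decoder in chunks.
decoder.start_utt()
with open('test.raw', 'rb') as audio:
    while True:
        chunk = audio.read(4096)
        if not chunk:
            break
        decoder.process_raw(chunk, False, False)
decoder.end_utt()

hypothesis = decoder.hyp()
if hypothesis is not None:
    print(hypothesis.hypstr)   # e.g. "this is a test" rather than "this is attest"
```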
For the acoustic model, he used VoxForge. Creating an acoustic model requires recordings of "lots and lots" of spoken language. It requires a learning system that gets fed all of this training data in order to build the acoustic model. VoxForge has a corpus of training data, which Grasch used for his model.
The language model is similar, in that it requires a large corpus of training data, but it is all text. He built a corpus of text by combining data from Wikipedia, newsgroups (that contained a lot of spam, which biased his model some), US Congress transcriptions (which don't really match his use case), and so on. It ended up as around 15G of text that was used to build the model.
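Grasch's real language model is a statistical n-gram model built with dedicated tools from that 15G corpus; the toy sketch below is only an illustration of the idea, not his toolchain. It counts bigrams in a tiny corpus and uses them to score the two competing transcriptions from the earlier example:

```python
# Toy bigram language model: counts pairs of adjacent words and scores
# candidate transcriptions by how plausible their word pairs are.
from collections import Counter
import math

def train_bigrams(corpus_text):
    words = corpus_text.lower().split()
    return Counter(words), Counter(zip(words, words[1:]))

def score(sentence, unigrams, bigrams):
    """Average log-probability per word pair, add-one smoothed (higher is better)."""
    words = sentence.lower().split()
    vocab_size = len(unigrams) + 1
    logs = [math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
            for prev, cur in zip(words, words[1:])]
    return sum(logs) / len(logs)

unigrams, bigrams = train_bigrams(
    "this is a test . that was a hard test . witnesses attest to the result .")
for candidate in ("this is a test", "this is attest"):
    print(candidate, score(candidate, unigrams, bigrams))
# "this is a test" scores higher because all of its word pairs occur in the corpus.
```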
Demos
Before getting to the demo of his code, Grasch showed two older videos of users trying out the speech recognition in Windows Vista, with one showing predictably comical results; the other worked well. When Vista was introduced six years ago, its speech recognition was roughly where Simon is today, Grasch said. It is popular to bash Microsoft, but its system is pretty good, he said; live demos of speech recognition often show "wonky" behavior.
To prove that point, Grasch moved on to his demos. The system he was showing is not something that he recommends for general use. It is, to some extent, demo-ware that he put together in a short time to try to get people excited about speech recognition so they would "come to the BoF later in the week."
He started by using the KNotes notepad application and said "Hello KDE developers period" into the microphone plugged into his laptop. After a one second or so pause, "Hi Leo kde developers." showed up in the notepad. He then said "new paragraph this actually works better than Microsoft's system I am realizing period" which showed up nearly perfectly (just missing the possessive in "Microsoft's") to loud applause—applause that resulted in a gibberish "sentence" appearing a moment or two later.
Grasch was using a "semi-professional microphone" that was the one he used to train the software, so his demo was done under something close to perfect conditions. Under those conditions, he gets a word error rate (WER) of 13.3%, "which is pretty good". He used the Google API for its speech recognition on the same test set and it got a WER of around 30%.
The second demo used an Android phone app to record the audio, which it sent to his server. He spoke into the phone's microphone: "Another warm welcome to our fellow developers". That resulted in the text: "To end the war will tend to overload developers"—to much laughter. But Grasch was unfazed: "Yes! It recognized 'developers'", he said with a grin. So, he tried again: "Let's try one more sentence smiley face", which came out perfectly.
Improvements
To sum up, his system works pretty well in optimal conditions and it "even kinda works on stage, but not really", Grasch said. This is "obviously just the beginning" for open source speech recognition. He pointed to the "80/20 rule": the first 80% of the work is relatively easy, he said, but even that 80% has not yet been reached. That means there is a lot of "payoff for little work" at this point. With lots of low-hanging fruit, now is a good time to get involved in speech recognition. Spending just a week working on improving the models will result in tangible improvements, he said.
Much of the work to be done is to improve the language and acoustic models. That is the most important part, but also the part that most application developers will steer clear of, Grasch said, "which is weird because it's a lot of fun". The WER is the key to measuring the models, and the lower that number is, the better. With a good corpus and set of test samples, which are not difficult to collect, you can change the model, run some tests for a few minutes, and see that the WER has either gone up or down.
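WER itself is a simple quantity: the word-level edit distance (substitutions, insertions, and deletions) between the recognizer's output and a reference transcript, divided by the number of words in the reference. A small illustrative implementation, not the scoring tool Grasch used:

```python
# Word error rate: word-level edit distance divided by reference length.
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution or match
    return dist[len(ref)][len(hyp)] / len(ref)

# "hello kde developers" misrecognized as "hi leo kde developers":
# one substitution plus one insertion over three reference words.
print(word_error_rate("hello kde developers", "hi leo kde developers"))  # about 0.67
```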
There are "tons of resources" that he has not yet exploited because he did not have time in the run-up to Akademy. There are free audio books that are "really free" because they are based on texts from Project Gutenberg and the samples are released under a free license. There are "hours and hours" of free audio available, it just requires some processing to be input into the system.
Another way to acquire samples is to use the prototype for some task. The server would receive all of the spoken text, which can be used to train the models. This is what the "big guys" are doing, he said. Google's speech recognition service may look like it is free, but the company is using it to collect voice samples. This is "not evil" as long as it is all stored anonymously, he said.
There are also recordings of conference talks that could be used. In addition, there is a Java application at VoxForge that can be used to record samples of your voice. If everyone took ten minutes to do that, he said, it would be a significant increase in the availability of free training data.
For the language model, more text is needed. Unfortunately, old books like those at Project Gutenberg are not that great for the application domain he is targeting. Text from sources like blogs would be better, he said. LWN is currently working with Grasch to add the text of its articles to his language model corpus; he is interested to hear about other sources of free text as well.
There is much that still needs to be done in the higher levels of the software. For example, corrections (e.g. "delete that") are rudimentary at the moment. Also, the performance of the recognition could be improved. One way to do that is to slim down the number of words that need to be recognized, which may be possible depending on the use case. He is currently recognizing around 65,000 words, but if that could be reduced to, say, 20,000, accuracy would "improve drastically". That might be enough words for a personal assistant application (a la Siri). Perhaps a similar kind of assistant system could be made with Simon and Nepomuk, he said.
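One plausible way to shrink the active vocabulary, sketched below with hypothetical file names, is to keep only the most frequent words from the training corpus and drop everything else from the pronunciation dictionary; the language model would then be rebuilt around the same reduced word list.

```python
# Sketch: restrict the vocabulary to the N most frequent corpus words and
# trim the pronunciation dictionary accordingly. File names are placeholders.
from collections import Counter

def top_words(corpus_path, n=20000):
    counts = Counter()
    with open(corpus_path, encoding='utf-8') as corpus:
        for line in corpus:
            counts.update(line.lower().split())
    return {word for word, _ in counts.most_common(n)}

def trim_dictionary(dict_in, dict_out, keep):
    # Each dictionary line maps a word to its phonemes, e.g. "test T EH S T".
    with open(dict_in, encoding='utf-8') as src, \
         open(dict_out, 'w', encoding='utf-8') as dst:
        for entry in src:
            fields = entry.split()
            if fields and fields[0].lower() in keep:
                dst.write(entry)

keep = top_words('corpus.txt', n=20000)
trim_dictionary('full.dict', 'assistant.dict', keep)
```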
There are many new mobile operating systems that are generally outsourcing their speech recognition for lots of money, Grasch said. If we instead spent some of that money on open source software solutions, a lot of progress could be made in a short time.
Building our own speech models has advantages beyond just being "free and open source and nice and fluffy", he said. For example, if there was interest in being able to program via speech, a different language model could be created that included JavaScript, C++, Qt function names, and the like. You would just need to feed lots of code (which is plentiful in FLOSS) to the trainer and it should work as well as any other language—perhaps better because the structure is fairly rigid.
Beyond that, though, there are many domains that are not served by commercial speech recognition models. Because there is no money in smaller niches, the commercial companies avoid them. FLOSS solutions might not. Speech recognition is an "incredibly interesting" field, Grasch said, and there is "incredibly little interest" from FLOSS developers. He is the only active Simon developer, for example.
A fairly lively question-and-answer session followed the talk. Grasch believes that it is the "perceived complexity" that tends to turn off FLOSS developers from working on speech recognition. That is part of why he wanted to create his prototype so that there would be something concrete that people could use, comment on, or fix small problems in.
Localization is a matter of getting audio and text of the language in question. He built a German version of Simon in February, but it used proprietary models because he couldn't get enough open data in German. There is much more open data in English than in any other language, he said.
He plans to publish the models that he used right after the conference. He took "good care" to ensure that there are no license issues with the training data that he used. Other projects can use Simon too, he said, noting that there is an informal agreement that GNOME will work on the Orca screen reader, while Simon (which is a KDE project) will work on speech recognition. It doesn't make sense for there to be another project doing open source speech recognition, he said.
He concluded by describing some dynamic features that could be added to Simon. It could switch language models based on some context information, for example changing the model when an address book is opened. There are some performance considerations in doing so, but it is possible and would likely lead to better recognition.
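Pocketsphinx, for example, already lets several named language models be registered with a single decoder and switched at runtime, which is one plausible way to implement such context switching. A rough sketch, with hypothetical model paths, search names, and application name (the general model is loaded twice here purely to keep the sketch short):

```python
# Rough sketch of context-dependent language model switching with Pocketsphinx.
from pocketsphinx.pocketsphinx import Decoder

config = Decoder.default_config()
config.set_string('-hmm', 'models/acoustic')
config.set_string('-dict', 'models/words.dict')
config.set_string('-lm', 'models/general.lm.bin')
decoder = Decoder(config)

# Register the available language models under names we can switch between.
decoder.set_lm_file('general', 'models/general.lm.bin')
decoder.set_lm_file('addressbook', 'models/addressbook.lm.bin')
decoder.set_search('general')

def on_application_focus(app_name):
    # Activate the model that best matches what the user is likely to say.
    if app_name == 'kaddressbook':
        decoder.set_search('addressbook')
    else:
        decoder.set_search('general')
```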
[Thanks to KDE e.V. for travel assistance to Bilbao for Akademy.]
Index entries for this article
Conference: Akademy/2013
Comments

FLOSS speech recognition
Posted Jul 25, 2013 10:07 UTC (Thu) by stijn (subscriber, #570)

> "LWN is currently working with Grasch to add the text of its articles to his language model corpus"

And the comments on the articles? I wouldn't mind myself.

FLOSS speech recognition
Posted Jul 25, 2013 15:01 UTC (Thu) by stijn (subscriber, #570)

You make me sed.

FLOSS speech recognition
Posted Jul 25, 2013 18:19 UTC (Thu) by jimparis (guest, #38647)

I think fat could bee a bad idea. Sum people might try two purposely mess up duh model just too sea what happens.

FLOSS speech recognition
Posted Aug 1, 2013 23:21 UTC (Thu) by douglasbagnall (subscriber, #62736)

Written English is grammatically quite different from spoken English, but conversational comments like your one are (I speculate) less so than the articles. So using just the comments might lead to a better model of spoken English than including the articles would.

Following similar logic for a slightly different purpose, I have previously used Wikipedia talk pages for language models. You get good informal constructs there, particularly on pop culture topics.

data for acoustic models
Posted Aug 1, 2013 23:07 UTC (Thu) by douglasbagnall (subscriber, #62736)

A part of the problem (reasonably) not mentioned by Peter Grasch is the pronunciation model—a dictionary mapping phoneme sequences to words. Creating a new pronunciation model is neither trivial nor interesting. The only truly free one in English is the General American CMU dictionary. Wiktionary is a bit too messy and full of holes. The various text-to-speech engines can spit out phonemic transcriptions, but checking the output is a problem.

Nevertheless, Peter Grasch demonstrates that a voxforge model tuned to a single non-US speaker can do surprisingly well. It is the multiple speaker models that really suffer.

While I'm about it, I might as well mention another problem with Voxforge for general purpose models: its voices almost all belong to non-elderly adult males—the usual free software demographic. Its models won't perform well for children, women, or elderly men.

Anyway, great projects, Simon, Sphinx, and Voxforge; but great problems also, and they aren't software problems.

data for acoustic models
Posted Aug 2, 2013 20:48 UTC (Fri) by jgd (guest, #19925)

I prefer speaker-dependent models for accuracy and control