Peter Grasch spoke on the second day of Akademy about the status of FLOSS
speech recognition. The subtitle of his talk asked "why aren't we there
yet?" and Grasch provided some reasons for that, while also showing that we
aren't all that far behind the proprietary alternatives. Some focused work
from more contributors could make substantial progress on the problem.
"Speech recognition" is not really a useful term in isolation, he said. It
is like saying you want to develop an "interaction method" without
specifying what the user would be interacting with. In order to make a
"speech recognition" system, you need to know what it will be used for.
Grasch presented three separate scenarios for speech recognition, which
were the subject of a poll that he
ran on his blog. The poll was to choose which of the three he would focus
his efforts on for his Akademy talk. There are no open source solutions for
any of the scenarios, which makes them interesting to attack. He promised
to spend a week—which turned into two—getting a demo ready for the conference.
The first option (and the winning
choice) was "dictation", turning spoken words into text. Second was
"virtual assistant" technology, like Apple's Siri. The last option
was "simultaneous translation" from one spoken language to another.
Basics
Speech recognition starts with a microphone that picks up some speech, such
as "this is a test". That speech gets recorded by the computer and turned
into a representation of its waveform. Then, recognition is done on the
waveform ("some magic
happens there") to turn it into
phonemes, which are,
essentially, the
individual sounds that make up words in a language. Of course, each
speaker says things differently, and accents play a role, which makes
phoneme detection a fairly complex, probabilistic calculation with "a lot
of wiggle room".
From the phonemes, there can be multiple interpretations, "this is a test"
or "this is attest", for example. Turning waveforms into phonemes uses an
"acoustic model",
but a "language model" is required to differentiate multiple
interpretations of the phonemes. The language model uses surrounding words
to choose the more likely collection of words based on the context. The
final step is to get the recognized text into a text editor, word
processor, or other program, which is part of what his project, Simon, does. (LWN reviewed Simon 0.4.0 back in January.)
In free software, there are pretty good algorithms available for handling
the models, from CMU Sphinx for
example, but there are few free models available. Grasch said that open
source developers do not seem interested in creating acoustic or language
models. For his work, he used CMU Sphinx for the algorithms, though there
are other free choices available. For the models, he wanted open source
options because proprietary models are "no fun"—and expensive.
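As a rough illustration of how those pieces fit together, here is a minimal
decoding sketch using the classic Python bindings for CMU Sphinx's
pocketsphinx decoder. The model and audio paths are placeholders rather than
Grasch's actual setup, and the exact API varies between pocketsphinx
releases.

    from pocketsphinx import Decoder

    # Point the decoder at an acoustic model, a language model, and a
    # pronunciation dictionary -- placeholder paths, not Grasch's models.
    config = Decoder.default_config()
    config.set_string('-hmm', 'models/acoustic')     # acoustic model directory
    config.set_string('-lm', 'models/language.lm')   # n-gram language model
    config.set_string('-dict', 'models/words.dict')  # word-to-phoneme dictionary
    decoder = Decoder(config)

    # Feed raw 16-bit, 16kHz mono PCM audio to the decoder in small chunks.
    decoder.start_utt()
    with open('this_is_a_test.raw', 'rb') as audio:
        while True:
            chunk = audio.read(1024)
            if not chunk:
                break
            decoder.process_raw(chunk, False, False)
    decoder.end_utt()

    # The hypothesis is the word sequence the models consider most likely.
    hypothesis = decoder.hyp()
    print(hypothesis.hypstr if hypothesis else '(nothing recognized)')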
For the acoustic model, he used VoxForge. Creating an acoustic model
requires recordings of "lots and lots" of spoken
language. It requires a learning system that gets fed all of this training
data in order to build the acoustic model. VoxForge has a corpus of
training data,
which Grasch used for his model.
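To give a sense of what feeding that training data involves, the sketch below
turns a VoxForge-style prompts file into the per-utterance bookkeeping files
that CMU Sphinx's acoustic-model training tools read. The file names and
manifest format follow the SphinxTrain tutorial and are assumptions here, not
a description of Grasch's actual pipeline.

    import os

    def write_training_manifests(wav_dir, prompts_path, out_prefix):
        """Generate .fileids and .transcription manifests for acoustic
        model training from a prompts file with one line per recording:
            rec_0001 this is a test
        """
        with open(prompts_path, encoding='utf-8') as prompts, \
             open(out_prefix + '.fileids', 'w', encoding='utf-8') as fileids, \
             open(out_prefix + '.transcription', 'w', encoding='utf-8') as trans:
            for line in prompts:
                parts = line.strip().split(None, 1)
                if len(parts) != 2:
                    continue              # skip blank or malformed lines
                rec_id, text = parts
                if not os.path.exists(os.path.join(wav_dir, rec_id + '.wav')):
                    continue              # prompt with no matching recording
                fileids.write(rec_id + '\n')
                trans.write('<s> %s </s> (%s)\n' % (text.lower(), rec_id))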
The language model is similar, in that it requires a large corpus of
training data, but it is all text. He built a corpus of text by combining
data from Wikipedia, newsgroups (that contained a lot of spam, which
biased his model somewhat), US Congress transcriptions (which don't really
match his use case), and so on. It ended up as around 15GB of text that was
used to build the model.
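A production language model built from such a corpus is an n-gram model with
smoothing and back-off, generated by tools like the CMU-Cambridge LM toolkit
or SRILM rather than by hand, but the statistic it starts from is simple to
sketch. The snippet below assumes a tokenized corpus with one sentence per
line.

    from collections import Counter

    def bigram_counts(corpus_path):
        """Count unigrams and bigrams in a one-sentence-per-line corpus."""
        unigrams, bigrams = Counter(), Counter()
        with open(corpus_path, encoding='utf-8') as corpus:
            for line in corpus:
                words = ['<s>'] + line.lower().split() + ['</s>']
                unigrams.update(words)
                bigrams.update(zip(words, words[1:]))
        return unigrams, bigrams

    def bigram_probability(previous, word, unigrams, bigrams):
        """Maximum-likelihood estimate of P(word | previous) -- the kind
        of number that lets a recognizer prefer "a test" over "attest"
        when the surrounding words make "test" more plausible."""
        if unigrams[previous] == 0:
            return 0.0
        return bigrams[(previous, word)] / unigrams[previous]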
Demos
Before getting to the demo of his code, Grasch showed two older videos of users
trying out the speech recognition in Windows Vista, with one showing predictably
comical results; the other worked well. When Vista was introduced six years
ago, its speech
recognition was
roughly where Simon is today, Grasch said. It is popular to bash
Microsoft, but its system is pretty good, he said; live demos of speech
recognition often show "wonky" behavior.
To prove that point, Grasch moved on to his demos. The system he was
showing is not something that he recommends for general use. It is, to
some extent, demo-ware that he put together in a short time to try to get
people excited about speech recognition so they would "come to the BoF
later in the week."
He started by using the KNotes notepad application and said "Hello KDE
developers period" into the microphone plugged into his laptop. After a
one second or so pause, "Hi Leo kde developers." showed up in the notepad.
He then said "new paragraph this actually works better than Microsoft's
system I am realizing period" which showed up nearly perfectly (just
missing the possessive in "Microsoft's") to loud applause—applause that
resulted in a
gibberish "sentence" appearing a moment or two later.
Grasch was using a "semi-professional microphone", the same one he used
to train the software, so his demo was done under something close to perfect
conditions.
Under those conditions, he gets a word error rate (WER) of 13.3%, "which is
pretty good". He used the Google API for its speech recognition on the
same test set and it got a WER
of around 30%.
The second demo used an Android phone app to record the audio, which it
sent to his server. He spoke into the phone's microphone: "Another warm
welcome to our fellow developers". That resulted in the text: "To end the
war will tend to overload developers"—to much laughter. But Grasch was
unfazed: "Yes! It recognized 'developers'", he said with a grin. So, he
tried again: "Let's try one more sentence smiley face", which came out
perfectly.
Improvements
To sum up, his system works pretty well in optimal conditions and it "even kinda
works on stage, but not really", Grasch said. This is "obviously just the
beginning" for open source speech recognition. He pointed to the "80/20 rule",
which says that the first 80% is relatively easy, but that 80% has
not yet been reached. That means there is a lot of "payoff for little
work" at this point. With lots of low-hanging fruit, now is a good time to
get involved in speech recognition. Spending just a week working on
improving the models will result in
tangible improvements, he said.
Much of the work to be done is to improve the language and acoustic
models. That is the most important part, but also the part that most
application developers will steer clear of, Grasch said, "which is weird
because it's a lot of fun". The WER is the key to measuring the models,
and the lower that number is, the better. With a good corpus and set of
test samples, which are not difficult to collect, you can change the model,
run some tests for a few minutes, and see whether the WER has gone up
or down.
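The WER itself is just the word-level edit distance between a reference
transcript and the recognizer's output, divided by the number of words in the
reference. A small sketch of that computation (not Grasch's actual test
harness):

    def word_error_rate(reference, hypothesis):
        """WER = (substitutions + deletions + insertions) / reference length,
        computed with the usual edit-distance dynamic program over words."""
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j]: edit distance between the first i reference words and
        # the first j hypothesis words
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,          # deletion
                              d[i][j - 1] + 1,          # insertion
                              d[i - 1][j - 1] + cost)   # substitution or match
        return d[len(ref)][len(hyp)] / len(ref)

    # word_error_rate("this is a test", "this is attest") == 0.5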
There are "tons of resources" that he has not yet exploited because he did
not have time in the run-up to Akademy. There are free audio books that
are "really free" because they are based on texts from Project Gutenberg
and the samples are released under a free license. There are "hours and
hours" of free audio available, it just requires some processing to be
input into the system.
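Much of that processing is mundane format conversion. Assuming the audio has
already been decoded to an uncompressed WAV file (free audio books usually
arrive as MP3 or Ogg and need a decoding step first), resampling it to the
16kHz, 16-bit mono format that Sphinx-style training typically expects can
be sketched with the Python standard library alone (the audioop module used
here has since been deprecated in recent Python releases):

    import audioop
    import wave

    def to_16khz_mono(src_path, dst_path):
        """Convert a PCM WAV file to 16kHz, 16-bit mono."""
        with wave.open(src_path, 'rb') as src:
            params = src.getparams()
            frames = src.readframes(params.nframes)
        width = params.sampwidth
        if params.nchannels == 2:            # mix stereo down to mono
            frames = audioop.tomono(frames, width, 0.5, 0.5)
        if width != 2:                       # force 16-bit samples
            frames = audioop.lin2lin(frames, width, 2)
            width = 2
        frames, _ = audioop.ratecv(frames, width, 1, params.framerate, 16000, None)
        with wave.open(dst_path, 'wb') as dst:
            dst.setnchannels(1)
            dst.setsampwidth(2)
            dst.setframerate(16000)
            dst.writeframes(frames)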
Another way to acquire samples is to use the prototype for some task. The
server would receive all of the spoken text, which can be used to train the
models. This is what the "big guys" are doing, he said. You may think
that Google's speech recognition service is free, but the company is using
it to gather voice samples. This is "not evil" as long as it is all anonymously
stored, he said.
There are also recordings of conference talks that could be used. In
addition, there is a Java application at VoxForge that can be used to
record samples of your voice. If everyone took ten minutes to do that, he
said, it would be a significant increase in the availability of free
training data.
For the language model, more text is needed. Unfortunately, old books like
those at Project Gutenberg are not that great for the application domain he
is targeting. Text from sources like blogs would be better, he said. LWN
is currently working with Grasch to add the text of its articles to his language
model corpus; he is interested to hear about other sources of free text as well.
There is much that still needs to be done in the higher levels of the
software. For example, corrections (e.g. "delete that") are rudimentary at
the moment. Recognition performance could also be improved.
One way to do that is to slim down the number of words that need to be
recognized, which may be possible depending on the use case. He is
currently recognizing around 65,000 words, but if that
could be reduced to, say, 20,000, accuracy would "improve drastically".
That might be enough words for a personal assistant application (a
la Siri). Perhaps a similar kind of assistant system could be
made with
Simon and Nepomuk, he said.
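Slimming down the word list is conceptually simple: keep only the
pronunciation-dictionary entries for the words that actually occur frequently
in the target domain's text. The sketch below assumes the CMU-style dictionary
format (one "WORD PH1 PH2 ..." entry per line, with alternates written as
"WORD(2)"); a real setup would also rebuild the language model over the
reduced vocabulary.

    from collections import Counter

    def restrict_vocabulary(corpus_path, dict_path, out_path, top_n=20000):
        """Keep only dictionary entries for the top_n most frequent words."""
        counts = Counter()
        with open(corpus_path, encoding='utf-8') as corpus:
            for line in corpus:
                counts.update(line.lower().split())
        keep = {word for word, _ in counts.most_common(top_n)}

        with open(dict_path, encoding='utf-8') as src, \
             open(out_path, 'w', encoding='utf-8') as dst:
            for entry in src:
                if not entry.strip():
                    continue
                headword = entry.split()[0].lower()
                headword = headword.split('(')[0]    # strip "(2)" alternates
                if headword in keep:
                    dst.write(entry)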
Many of the new mobile operating systems are outsourcing their speech
recognition, generally for lots of money, Grasch said. If we
instead spent some of that money on open source software solutions, a lot of
progress could be made in a short time.
Building our own speech models has advantages beyond just being "free and
open source and nice and fluffy", he said. For example, if there were
interest in being able to program via speech, a different language model
could be created that included JavaScript, C++, Qt function names, and the
like. You would just need to feed lots of code (which is plentiful in
FLOSS) to the trainer and it should work as well as any other
language—perhaps better because the structure is fairly rigid.
Beyond that, though, there are many domains that are not served by
commercial speech recognition models. Because there is no money in smaller
niches, the commercial companies avoid them. FLOSS solutions might not.
Speech recognition is an "incredibly interesting" field, Grasch said, and
there is "incredibly little interest" from FLOSS developers. He is the
only active Simon developer, for example.
A fairly lively question-and-answer session followed the talk. Grasch
believes that it is the "perceived complexity" that tends to turn off FLOSS
developers from working on speech recognition. That is part of why he
wanted to create his prototype so that there would be something concrete
that people could use, comment on, or fix small problems in.
Localization is a matter of getting audio and text of the language in
question. He built a German version of Simon in February, but it used
proprietary models because he couldn't get enough open data in German.
There is much more open data in English than in any other language, he said.
He plans to publish the models that he used right after the conference. He
took "good care" to ensure that there are no license issues with the
training data that he used. Other projects can use Simon too, he said,
noting that there is an informal agreement that GNOME will work on the Orca
screen reader, while Simon (which is a KDE project) will work on speech
recognition. It doesn't
make sense for there to be another project doing open source speech
recognition, he said.
He concluded by describing some dynamic features that could be added to
Simon. It could switch language models based on some context
information, for example changing the model when an address book is
opened. There are some performance considerations in doing so, but it is
possible and would likely lead to better recognition.
[Thanks to KDE e.V. for travel assistance to Bilbao for Akademy.]