By Nathan Willis
January 9, 2013
The open source speech recognition project Simon unveiled version 0.4.0 on December 30, 2012, after two years of development. The
new release boasts some significant architectural changes, so the
project advises users not to replace existing versions on production
systems. But the changes make Simon noticeably easier to work with,
which will please new users. Conversing freely with one's Linux PC is
still a ways off, but speech recognition with free software is no
longer the exclusive domain of laboratory research.
"Speech recognition" can encompass a range of different projects,
such as dictation (e.g., transcribing audio content) or detecting
stress in a human voice. Simon is designed to function as a voice
interface to the desktop computer; it listens to live audio input,
picks out keywords intended as commands, and pipes them to other
applications.
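As a rough illustration of that keyword-to-command idea, here is a minimal Python sketch. It is not Simon's actual code; the `dispatch_commands` function, the command names, and the returned strings are all hypothetical, and the recognition engine is assumed to have already turned the audio into a list of words.

```python
from typing import Callable, Dict, List


def dispatch_commands(recognized_words: List[str],
                      commands: Dict[str, Callable[[], str]]) -> List[str]:
    """Run the action for every recognized word that is a known command.

    Words that are not registered as commands are silently ignored,
    which is the "picks out keywords" part of the description.
    """
    results = []
    for word in recognized_words:
        action = commands.get(word.lower())
        if action is not None:
            results.append(action())
    return results


# Hypothetical command table; a real setup would trigger application actions.
commands = {
    "screenshot": lambda: "take-screenshot",
    "scroll": lambda: "scroll-down",
}

heard = ["please", "Scroll", "then", "screenshot"]
print(dispatch_commands(heard, commands))  # ['scroll-down', 'take-screenshot']
```

The point of the sketch is only that unrecognized words fall through harmlessly, while known keywords are mapped to actions in the order they were spoken.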
Categorical imperatives
Beginning with the 0.3.0 series released in 2010, Simon has based its
command-recognition framework on the idea of separate "scenarios" for
each application or use case. Scenarios can be as specific as the
developer wishes to make them; a general web-browsing scenario for
Firefox may be designed to handle only opening links and scrolling
through pages, but another could be tailored to work with Gmail
functionality and keyboard shortcuts. Simon 0.4.0 builds on this
approach by adding context awareness: it will activate and deactivate
different scenarios depending on which applications the user has open
and which have focus. The scenarios still need to be manually
installed beforehand, though, so there is little risk Simon will start
erasing your hard drives if you happen to walk by and utter the word
"partition."
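The context-awareness rule described above can be sketched in a few lines of Python. This is an illustrative model only, not Simon's implementation: the `Scenario` class, its `needs_focus` flag, and the scenario names are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class Scenario:
    name: str
    application: Optional[str]  # None means a desktop-wide scenario
    needs_focus: bool = False   # hypothetical flag: require window focus


def active_scenarios(installed: List[Scenario],
                     open_apps: Set[str],
                     focused_app: Optional[str]) -> List[str]:
    """Return the names of installed scenarios that should be listening."""
    active = []
    for scenario in installed:
        if scenario.application is None:
            # Desktop-wide scenarios are always active.
            active.append(scenario.name)
        elif scenario.needs_focus:
            # Focus-bound scenarios need their application in the foreground.
            if scenario.application == focused_app:
                active.append(scenario.name)
        elif scenario.application in open_apps:
            # Otherwise it is enough for the application to be running.
            active.append(scenario.name)
    return active


installed = [
    Scenario("window-management", None),
    Scenario("firefox-browsing", "firefox", needs_focus=True),
    Scenario("amarok-playback", "amarok"),
]

print(active_scenarios(installed, {"firefox", "amarok"}, "amarok"))
# ['window-management', 'amarok-playback']
```

Only scenarios in the `installed` list are ever considered, which mirrors the point that nothing is activated unless the user installed it beforehand.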
Simon can use any of several back-ends to perform the
speech-recognition part of the puzzle. Earlier releases relied on
either the BSD-licensed Julius or the
better — but non-free licensed — Hidden Markov Model Toolkit
(HTK). Version 0.4.0 adds support for another free software
recognition toolkit, CMU
Sphinx.
The Sphinx engine is highly regarded for its quality, and
provides functions that Julius does not, such as the ability to create
one's own acoustic speech
model. An acoustic model is the statistical representation of the
sounds that correspond to the units of speech that the engine is
trying to recognize; it depends on both a "corpus" of audio samples of
the speaker or speakers and on a grammar model for the language being
spoken. Free sources for acoustic speech models have historically
been hard to come by, because most were created by proprietary
projects or had no clear licensing at all.
Luckily this situation is changing; the Voxforge project collects GPL-licensed
speech models and enables users to create and upload their own. Like
a lot of less-well-known free data projects, it could always use more
contributions, but it is possible to download decent base models for a
variety of languages. Simon 0.4.0 introduces a new internal format for
its speech base models, but it is Voxforge compatible, and the English
Voxforge model is included in the download. Simon 0.4.0
also includes tools allowing users to create and upload their own
speech models to Voxforge.
Say what?
Despite being voice controlled, Simon comes with a graphical
front-end for setting up the framework, managing scenarios, and
working with speech models. The front-end is KDE-based, and building
Simon pulls in a lot of KDE package dependencies. Packages for 0.4.0
have yet to appear, but compiling from source is straightforward. It
is important to have CMU Sphinx installed beforehand in order to build
a completely free Simon framework, though. Simon's modularity means
the build script will simply compile Simon without Sphinx support if
the engine is not found.
At first run, the Simon setup window will walk users through the
process of installing speech models and scenarios, as well as testing
microphone input settings and related details. Speech models and
scenarios are tracked using the Get Hot New Stuff (GHNS)
system, so the available options can be searched through and installed
directly within Simon itself. The scenarios currently available
include general desktop utilities like window management and cursor
control, applications like Firefox, Marble, and Amarok, and a smattering of
individual tasks like taking a screenshot. Installing them is easy,
and Simon's interface allows each to be activated or deactivated with
a single click.
Arguably the biggest hurdle is finding the scenario one wants.
Scenarios are language-dependent, and only English, Dutch, and German
ones appear to be published; in addition, there are frequently several
options for each application with essentially the same description. Some
descriptions are detailed enough to indicate that they were built with
a specific acoustic model (Voxforge or HTK), but some are clearly old
enough that they may have compatibility problems (such as the
OpenOffice.org scenarios that come from the Simon 0.3.0 days). Some,
like the Firefox scenario, also require installing other software
(e.g., a Firefox add-on).
The main Simon window shows which scenarios are active and which
acoustic speech models are loaded, and it displays the microphone volume
level and the most recently recognized spoken words. The latter two
items are useful for debugging. By default, the setup wizard steers
the user toward a generic Voxforge speech model, but to really get
good results the user needs to devote some time to training Simon.
Most of the scenarios come with a bundled "training text" for this
purpose: a list of words that the scenario is listening for. At any
time, the user can click on Simon's "Start training" button and record
new samples of the important words. These recordings are ingested by
the speech recognition engine and added to a user-specific speech
model. Simon layers this user-specific model over the base model,
hopefully improving the results.
Word to the wise
The training interface is painless and provides a lot of
hand-holding for new users. This is good news, since it is clear that
at least a few training sessions are to be expected before Simon 0.4.0
is usable for daily tasks — even for those of us with perfect
elocution. There are simply a lot of variables in human speech, and
even more when one throws in the vagaries of cheap PC sound cards and
microphones. The trainer prompts the user to speak each of the
keywords, reports instantly whether the speaker's voice is too loud or
too soft to be useful, and does the rest of the computation in the
background.
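The "too loud or too soft" feedback is the kind of check that can be done on raw samples before any recognition takes place. The following Python sketch shows one simple way to do it, using the peak amplitude of a recording; the `check_level` function and both thresholds are hypothetical and are not taken from Simon.

```python
from typing import Sequence


def check_level(samples: Sequence[float],
                too_soft: float = 0.1,
                too_loud: float = 0.95) -> str:
    """Classify a recording by its peak amplitude.

    Samples are assumed to be normalized to the range -1.0 to 1.0.
    The thresholds are illustrative guesses, not values used by Simon.
    """
    if not samples:
        return "too soft"  # silence: nothing was recorded
    peak = max(abs(sample) for sample in samples)
    if peak >= too_loud:
        return "too loud"  # likely clipping
    if peak < too_soft:
        return "too soft"  # too quiet to be a useful sample
    return "ok"


print(check_level([0.01, -0.02, 0.015]))  # too soft
print(check_level([0.5, -0.4, 0.3]))      # ok
print(check_level([1.0, -0.98, 0.2]))     # too loud
```

A check like this is cheap enough to run as soon as a sample is recorded, which fits with the heavier model computation being deferred to the background.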
The nicest thing about Simon 0.4.0, though, is that it moves speech
control out of the "theoretical only" realm, where experienced
researchers and laboratory conditions are required, and at least makes
it possible for everyday users to get started. There is still a long
way to go before speech control can offer a constant user interface
option as it is depicted in Star Trek or (perhaps more troublingly)
in 2001: A Space Odyssey. But the scenario-specific set of commands makes Simon more
usable than other open source speech recognition tools, and Simon's
built-in training interface makes the necessary grunt work (no pun
intended) of tailoring the speech model to one's actual voice about as
painless as it can be.
The research into speech recognition will continue, of course. But
Simon's new-found modularity will make it easier to incorporate
theoretical advances into the desktop application without rewriting
from scratch. For users, the next important stage is some development
work on new scenarios to hook more applications into Simon. The
trickiest part of the stack, though, is likely to remain training the
speech recognition engine to recognize the specific user's voice. No
amount of software will eliminate that step; it takes a good microphone
and some patience.