
A few words about Simon 0.4.0

By Nathan Willis
January 9, 2013

The open source speech recognition project Simon unveiled version 0.4.0 on December 30, 2012, after two years of development. The new release boasts some significant architectural changes, so the project advises users not to replace existing versions on production systems. But the changes make Simon noticeably easier to work with, which will please new users. Conversing freely with one's Linux PC is still a ways off, but speech recognition with free software is no longer the exclusive domain of laboratory research.

"Speech recognition" can encompass a range of different projects, such as dictation (e.g., transcribing audio content) or detecting stress in a human voice. Simon is designed to function as a voice interface to the desktop computer; it listens to live audio input, picks out keywords intended as commands, and pipes them to other applications.

Categorical imperatives

Beginning with the 0.3.0 series released in 2010, Simon has based its command-recognition framework on the idea of separate "scenarios" for each application or use case. Scenarios can be as specific as the developer wishes to make them; a general web-browsing scenario for Firefox may be designed to handle only opening links and scrolling through pages, but another could be tailored to work with GMail functionality and keyboard shortcuts. Simon 0.4.0 builds on this approach by adding context awareness: it will activate and deactivate different scenarios depending on which applications the user has open and which have focus. The scenarios still need to be manually installed beforehand, though, so there is little risk Simon will start erasing your hard drives if you happen to walk by and utter the word "partition."
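To get a feel for the approach, consider a hypothetical sketch in Python of what a scenario reduces to conceptually. This is not Simon's actual scenario format or code; the xdotool key-presses merely stand in for the keyboard shortcuts, D-Bus calls, and other mechanisms real scenarios use:

    import subprocess

    # Imaginary scenarios: each maps the trigger phrases it listens
    # for to a desktop action, here faked with xdotool key-presses.
    SCENARIOS = {
        "firefox": {
            "scroll down": ["xdotool", "key", "Page_Down"],
            "go back":     ["xdotool", "key", "alt+Left"],
        },
        "gwenview": {
            "next image":  ["xdotool", "key", "space"],
        },
    }

    def dispatch(focused_app, phrase):
        # Context awareness: consult only the scenario belonging to
        # the application that currently has focus.
        scenario = SCENARIOS.get(focused_app, {})
        if phrase in scenario:
            subprocess.call(scenario[phrase])

    dispatch("firefox", "scroll down")   # acts on the browser
    dispatch("gwenview", "scroll down")  # ignored: wrong context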

Simon can use any of several back-ends to perform the speech-recognition part of the puzzle. Earlier releases relied on either the BSD-licensed Julius or the higher-quality, but non-free, Hidden Markov Model Toolkit (HTK). Version 0.4.0 adds support for another free software recognition toolkit, CMU Sphinx.

The Sphinx engine is highly regarded for its quality, and provides functions that Julius does not, such as the ability to create one's own acoustic speech model. An acoustic model is the statistical representation of the sounds that correspond to the units of speech the engine is trying to recognize; it depends both on a "corpus" of audio samples from the speaker or speakers and on a grammar model for the language being spoken. Free sources for acoustic speech models have historically been hard to come by, because most were created by proprietary projects or had no clear licensing at all.
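For the curious, here is a rough sketch of how those pieces plug into Sphinx, using the pocketsphinx Python bindings; the model paths are placeholders, and Simon wires this machinery up internally rather than expecting any such code from the user:

    from pocketsphinx import Decoder

    config = Decoder.default_config()
    # The acoustic model: statistics mapping audio to speech sounds.
    config.set_string('-hmm', '/path/to/acoustic-model')
    # The language model and pronunciation dictionary complete the
    # picture, describing which word sequences to expect.
    config.set_string('-lm', '/path/to/language-model.lm')
    config.set_string('-dict', '/path/to/pronunciations.dict')
    decoder = Decoder(config)

    # Feed raw 16-bit, 16kHz mono audio through the decoder.
    decoder.start_utt()
    with open('utterance.raw', 'rb') as audio:
        while True:
            buf = audio.read(1024)
            if not buf:
                break
            decoder.process_raw(buf, False, False)
    decoder.end_utt()

    if decoder.hyp() is not None:
        print(decoder.hyp().hypstr)   # the best-guess transcription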

Luckily, this situation is changing; the Voxforge project collects GPL-licensed speech models and enables users to create and upload their own. Like a lot of less-well-known free data projects, it could always use more contributions, but it is possible to download decent base models for a variety of languages. Simon 0.4.0 introduces a new internal format for its speech base models, but it is Voxforge-compatible, and the English Voxforge model is included in the download. Simon 0.4.0 also includes tools allowing users to create and upload their own speech models to Voxforge.

Say what?

Despite being voice controlled, Simon comes with a graphical front-end for setting up the framework, managing scenarios, and working with speech models. The front-end is KDE-based, and building Simon pulls in a lot of KDE package dependencies. Packages for 0.4.0 have yet to appear, but compiling from source is straightforward. It is important to have CMU Sphinx installed beforehand in order to build a completely free Simon framework, though. Simon's modularity means the build script will simply compile Simon without Sphinx support if the engine is not found.

[Simon 0.4.0 scenarios]

At first run, the Simon setup window will walk users through the process of installing speech models and scenarios, as well as testing microphone input settings and related details. Speech models and scenarios are tracked using the Get Hot New Stuff (GHNS) system, so the available options can be searched through and installed directly within Simon itself. The scenarios currently available include general desktop utilities like window management and cursor control, applications like Firefox, Marble, and Amarok, and a smattering of individual tasks like taking a screenshot. Installing them is easy, and Simon's interface allows each to be activated or deactivated with a single click.

Arguably the biggest hurdle is finding the scenario one wants; scenarios are language-dependent and only English, Dutch, and German ones appear to be published, plus there are frequently several options for each application with essentially the same description. Some descriptions are detailed enough to indicate that they were built with a specific acoustic model (Voxforge or HTK), but some are clearly old enough that they may have compatibility problems (such as the OpenOffice.org scenarios that date from the Simon 0.3.0 days). Some, like the Firefox scenario, also require installing other software (e.g., a Firefox add-on).

[Simon 0.4.0 running]

The main Simon window shows which scenarios are active and which acoustic speech models are loaded, and it displays the microphone volume level and the most recently recognized spoken words. The latter two items are useful for debugging. By default, the setup wizard steers the user toward a generic Voxforge speech model, but to really get good results the user needs to devote some time to training Simon. Most of the scenarios come with a bundled "training text" for this purpose: a list of words that the scenario is listening for. At any time, the user can click on Simon's "Start training" button and record new samples of the important words. These recordings are ingested by the speech recognition engine and added to a user-specific speech model. Simon layers this user-specific model over the base model, hopefully improving the results.
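Conceptually, that layering is statistical adaptation: the user's recordings nudge the base model's parameters rather than replace them. As a toy illustration only (plain Python, not Sphinx's or Simon's actual adaptation code), maximum a posteriori (MAP) adaptation of a single Gaussian mean behaves like this:

    def map_adapt_mean(base_mean, user_samples, tau=10.0):
        # tau controls how strongly the base model is trusted: with
        # few user samples the estimate stays near base_mean; as
        # samples accumulate, the user's own voice dominates.
        n = len(user_samples)
        user_mean = sum(user_samples) / n
        return (tau * base_mean + n * user_mean) / (tau + n)

    # Two samples barely move the estimate; twenty move it far more.
    print(map_adapt_mean(0.0, [1.0, 1.2]))   # about 0.18
    print(map_adapt_mean(0.0, [1.1] * 20))   # about 0.73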

Word to the wise

[Simon 0.4.0 training]

The training interface is painless and provides a lot of hand-holding for new users. This is good news, since it is clear that at least a few training sessions are to be expected before Simon 0.4.0 is usable for daily tasks — even for those of us with perfect elocution. There are simply a lot of variables in human speech, and even more when one throws in the vagaries of cheap PC sound cards and microphones. The trainer prompts the user to speak each of the keywords, reports instantly whether the speaker's voice is too loud or too soft to be useful, and does the rest of the computation in the background.
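The instant loud/soft feedback is, at bottom, simple level metering. A minimal sketch of the idea in Python follows; the thresholds are invented for illustration, and Simon's own checks may well differ:

    import array

    def check_level(raw_bytes, quiet=500, loud=30000):
        # Interpret the recording as 16-bit little-endian mono PCM.
        samples = array.array('h', raw_bytes)
        if not samples:
            return "no audio"
        rms = (sum(s * s for s in samples) / len(samples)) ** 0.5
        peak = max(abs(s) for s in samples)
        if peak >= 32767:      # hit the limit of 16-bit samples
            return "too loud (clipping)"
        if rms < quiet:
            return "too soft"
        if rms > loud:
            return "too loud"
        return "ok"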

The nicest thing about Simon 0.4.0, though, is that it moves speech control out of the "theoretical only" realm, where experienced researchers and laboratory conditions are required, and at least makes it possible for everyday users to get started. There is still a long way to go before speech control can offer the kind of ever-present user interface depicted in Star Trek or (perhaps more troublingly) in 2001. But the scenario-specific sets of commands make Simon more usable than other open source speech recognition tools, and Simon's built-in training interface makes the necessary grunt work (no pun intended) of tailoring the speech model to one's actual voice about as painless as it can be.

The research into speech recognition will continue, of course. But Simon's new-found modularity will make it easier to incorporate theoretical advances into the desktop application without rewriting it from scratch. For users, the next important stage is development work on new scenarios to hook more applications into Simon. The trickiest part of the stack, though, is likely to remain training the speech recognition engine to recognize the specific user's voice. No amount of software will eliminate that; it simply takes a good microphone and some patience.




Theoretical only?

Posted Jan 10, 2013 10:37 UTC (Thu) by NAR (subscriber, #1313) [Link] (6 responses)

I don't know how useful it was at the time, but IBM OS/2 Warp included speech recognition on the desktop some 17 years ago. I think that was definitely outside laboratory research.

Theoretical only?

Posted Jan 10, 2013 13:23 UTC (Thu) by sorpigal (guest, #36106) [Link] (5 responses)

Speech control, not speech recognition. ViaVoice and similar products have allowed speech-to-text for a long time, with varying levels of accuracy, and while there have been some attempts to use this ability to invoke commands, there hasn't been a great deal of effort put into making something that's good for more than "run this program"--certainly not something that also runs on Linux and is open source.

Theoretical only?

Posted Jan 10, 2013 14:11 UTC (Thu) by n8willis (subscriber, #43041) [Link]

I still have an old RPM copy of ViaVoice for Linux, from Mandrake 8.0. The dictation worked okay after a dedicated marathon of training, but I could never get XVoice to do anything useful ~12 years ago ...or whenever it was.... Shortly thereafter, IBM sold ViaVoice off to some other proprietary speech software company, and there was never another update.

Nate

Theoretical only?

Posted Jan 10, 2013 21:36 UTC (Thu) by man_ls (guest, #15091) [Link] (3 responses)

My Mac did this circa 1996 with PlainTalk. You spoke one of several commands, the computer recognized which one and ran it. I never could make it work reliably, but I'm not a native English speaker. Frankly, it looked cool at the start but really sucked a lot.

It is a pity that Free software is still at this level when puny mobile phones can recognize with great accuracy the name of a street and take you to it. Granted, they send a signature to Google servers and get back the result, but a desktop machine should be comparable. Still, progress is welcome.

Theoretical only?

Posted Jan 11, 2013 9:00 UTC (Fri) by keeperofdakeys (guest, #82635) [Link] (1 responses)

Remote speech recognition works well because you can have a massive database of samples, so you don't have to do much work. Local speech recognition doesn't have this, so you have to get by with a much smaller subset of that data.

Theoretical only?

Posted Jan 17, 2013 14:10 UTC (Thu) by redden0t8 (guest, #72783) [Link]

To elaborate on what you said, the Terms of Service for Google, Siri, etc... make me think that they actually keep incoming voice searches as samples. They can even automatically guess whether they were successful or not based on whether you re-send a similar query.

Theoretical only?

Posted Jan 17, 2013 14:15 UTC (Thu) by redden0t8 (guest, #72783) [Link]

As a native English speaker, I found it remarkably reliable.

That being said, I never found it *useful*. Unless you can perform complex actions à la Siri, it's faster just to use your mouse+keyboard. This kind of relegates simple spoken commands to being useful only as an accessibility aid or in niche situations (i.e., operating the computer when your hands are otherwise occupied).

