By Nathan Willis
January 9, 2013
The open source speech recognition project Simon unveiled version 0.4.0 on December 30, 2012, after two years of development. The
new release boasts some significant architectural changes, so the
project advises users not to replace existing versions on production
systems. But the changes make Simon noticeably easier to work with,
which will please new users. Conversing freely with one's Linux PC is
still a ways off, but speech recognition with free software is no
longer the exclusive domain of laboratory research.
"Speech recognition" can encompass a range of different projects,
such as dictation (e.g., transcribing audio content) or detecting
stress in a human voice. Simon is designed to function as a voice
interface to the desktop computer; it listens to live audio input,
picks out keywords intended as commands, and pipes them to other
applications.
Categorical imperatives
Beginning with the 0.3.0 series released in 2010, Simon has based its
command-recognition framework on the idea of separate "scenarios" for
each application or use case. Scenarios can be as specific as the
developer wishes to make them; a general web-browsing scenario for
Firefox may be designed to handle only opening links and scrolling
through pages, but another could be tailored to work with GMail
functionality and keyboard shortcuts. Simon 0.4.0 builds on this
approach by adding context awareness: it will activate and deactivate
different scenarios depending on which applications the user has open
and which have focus. The scenarios still need to be manually
installed beforehand, though, so there is little risk Simon will start
erasing your hard drives if you happen to walk by and utter the word
"partition."
Simon can use any of several back-ends to perform the
speech-recognition part of the puzzle. Earlier releases relied on
either the BSD-licensed Julius or the better (but non-free) Hidden
Markov Model Toolkit (HTK). Version 0.4.0 adds support for another
free software recognition toolkit, CMU Sphinx.
The Sphinx engine is highly regarded for its quality, and
provides functions that Julius does not, such as the ability to create
one's own acoustic speech
model. An acoustic model is the statistical representation of the
sounds that correspond to the parts of speech that the engine is
trying to recognize; it depends on both a "corpus" of audio samples of
the speaker or speakers and on a grammar model for the language being
spoken. Free sources for acoustic speech models have historically
been hard to come by, because most were created by proprietary
projects or had no clear licensing at all.
Luckily this situation is changing; the Voxforge project collects GPL-licensed
speech models and enables users to create and upload their own. Like
a lot of less-well-known free data projects, it could always use more
contributions, but it is possible to download decent base models for a
variety of languages. Simon 0.4.0 introduces a new internal format for
its speech base models, but it is Voxforge compatible, and the English
Voxforge model is included in the download. Simon 0.4.0
also includes tools allowing users to create and upload their own
speech models to Voxforge.
Say what?
Despite being voice controlled, Simon comes with a graphical
front-end for setting up the framework, managing scenarios, and
working with speech models. The front-end is KDE-based, and building
Simon pulls in a lot of KDE package dependencies. Packages for 0.4.0
have yet to appear, but compiling from source is straightforward. It
is important to have CMU Sphinx installed beforehand in order to build
a completely free Simon framework, though. Simon's modularity means
the build script will simply compile Simon without Sphinx support if
the engine is not found.
At first run, the Simon setup window will walk users through the
process of installing speech models and scenarios, as well as testing
microphone input settings and related details. Speech models and
scenarios are tracked using the Get Hot New Stuff (GHNS)
system, so the available options can be searched through and installed
directly within Simon itself. The scenarios currently available
include general desktop utilities like window management and cursor
control, applications like Firefox, Marble, and Amarok, and a smattering of
individual tasks like taking a screenshot. Installing them is easy,
and Simon's interface allows each to be activated or deactivated with
a single click.
Arguably the biggest hurdle is finding the scenario one wants.
Scenarios are language-dependent, and only English, Dutch, and German
ones appear to be published; in addition, there are frequently several
options for each application with essentially the same description. Some
descriptions are detailed enough to indicate that they were built with
a specific acoustic model (Voxforge or HTK), but some are clearly old
enough that they may have compatibility problems (such as the
OpenOffice.org scenarios that come from the Simon 0.3.0 days). Some,
like the Firefox scenario, also require installing other software
(e.g., a Firefox add-on).
The main Simon window shows which scenarios are active and which
acoustic speech models are loaded, and it displays the microphone volume
level and the most recently recognized spoken words. The latter two
items are useful for debugging. By default, the setup wizard steers
the user toward a generic Voxforge speech model, but to really get
good results the user needs to devote some time to training Simon.
Most of the scenarios come with a bundled "training text" for this
purpose: a list of words that the scenario is listening for. At any
time, the user can click on Simon's "Start training" button and record
new samples of the important words. These recordings are ingested by
the speech recognition engine and added to a user-specific speech
model. Simon layers this user-specific model over the base model,
hopefully improving the results.
Word to the wise
The training interface is painless and provides a lot of
hand-holding for new users. This is good news, since it is clear that
at least a few training sessions are to be expected before Simon 0.4.0
is usable for daily tasks — even for those of us with perfect
elocution. There are simply a lot of variables in human speech, and
even more when one throws in the vagaries of cheap PC sound cards and
microphones. The trainer prompts the user to speak each of the
keywords, reports instantly whether the speaker's voice is too loud or
too soft to be useful, and does the rest of the computation in the
background.
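A level check of the sort the trainer performs might look something
like the following Python sketch. The article does not describe how
Simon actually judges a recording, so the use of root-mean-square
amplitude, the clipping test, and every threshold value here are
assumptions chosen purely for illustration.

```python
import math


def rms_level(samples):
    """Root-mean-square amplitude of samples normalized to [-1.0, 1.0]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def classify_level(samples, too_soft=0.05, too_loud=0.5, clip_at=0.99):
    """Label a recording as "too soft", "too loud", or "ok".

    All three thresholds are hypothetical defaults, not values taken
    from Simon.
    """
    # Any sample at (or very near) full scale suggests clipping.
    if any(abs(s) >= clip_at for s in samples):
        return "too loud"
    level = rms_level(samples)
    if level < too_soft:
        return "too soft"
    if level > too_loud:
        return "too loud"
    return "ok"


whisper = [0.01, -0.01] * 100
normal = [0.3, -0.3] * 100
shout = [1.0, -1.0] * 100

for label, recording in (("whisper", whisper), ("normal", normal), ("shout", shout)):
    print(label, classify_level(recording))
```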
The nicest thing about Simon 0.4.0, though, is that it moves speech
control out of the "theoretical only" realm, where experienced
researchers and laboratory conditions are required, and at least makes
it possible for everyday users to get started. There is still a long
way to go before speech control can offer a constant user interface
option as it is depicted in Star Trek or (perhaps more troublingly)
in 2001. But the scenario-specific set of commands makes Simon more
usable than other open source speech recognition tools, and Simon's
built-in training interface makes the necessary grunt work (no pun
intended) of tailoring the speech model to one's actual voice about as
painless as it can be.
The research into speech recognition will continue, of course. But
Simon's new-found modularity will make it easier to incorporate
theoretical advances into the desktop application without rewriting
from scratch. For users, the next important stage is some development
work on new scenarios to hook more applications into Simon. The
trickiest part of the stack, though, is likely to remain training the
speech recognition engine to recognize the specific user's voice. No
amount of software will eliminate that step; it takes a good
microphone and some patience.
By Nathan Willis
January 9, 2013
In some circles, installing custom or aftermarket firmware like
CyanogenMod on a $200 phone is enough to garner street cred, while in
others, such minor trifles are fit only to be scoffed at. For those
who do not flinch at danger, there is Magic Lantern, a GPL-licensed
replacement firmware for high-end Canon digital SLR cameras. The current
release is version 2.3, which offers a wealth of
improvements for shooting video, plus a growing list of
enhancements for still photographers.
Magic Lantern regularly makes releases for a fixed list of Canon models,
at the moment including most of the models from the EOS 600D and up. The
supported list focuses on cameras using Canon's DIGIC 4 chip and newer
models. Recent DIGIC chips include an embedded ARM core which makes
writing custom software possible, and the cameras can load and run firmware
from an inserted memory card without overwriting the existing firmware.
Consequently, projects like Magic Lantern and CHDK (which targets point-and-shoot
models) can provide firmware that adds new functionality with minimal risk
of bricking the camera — or of voiding the warranty and losing out on
Canon's much-loved hardware service offerings. There is still
risk involved, however, particularly for new camera models.
Magic Lantern was initially focused on improving video
recording functionality. The first model supported by the project was
the EOS 5D Mark II, a camera which started a minor revolution by
allowing high-quality HD recording in a compact form. But for some
budding filmmakers, the stock firmware simply left out too much.
Magic Lantern added usability features like crop marks in the preview
window, more precise control over ISO speed, white balance, and
shutter speed, and a number of miscellaneous add-ons like on-screen
sound meters for the audio input.
The current development work is focused on the EOS 5D Mark III, for
which the third alpha release was unveiled
on January 6. Installation
requires unpacking the build onto a supported CompactFlash or SD
card, making the card bootable, and loading it into the camera. The
download package includes the firmware image plus several folders full
of auxiliary files such as the focusing-screen overlays.
Normally, the card can be set to automatically boot the camera into
Magic Lantern, but this feature has not been enabled in the
pre-release builds for the EOS 5D Mark III.
The 5D Mark III release is still incomplete in other areas as well; a
good portion of the features enabled for other camera models are still
unimplemented for the 5D Mark III. The issue is that some Magic
Lantern features (for example, changes to live preview and information
display) can work without touching any of the camera's persistent
settings, but others require altering properties saved in onboard
memory. The team has simply encountered too many unsolved problems
with accessing and setting the 5D Mark III's stored settings.
Developer a1ex reported
that the stability test froze the camera and required a cold reboot
and clearing all of the camera settings to restore functionality. For
a piece of hardware with a four-digit price tag, some caution is
understandable.
Still, there is a long list of features which are enabled in the
5D Mark III builds of 2.3. As is to be expected in light of the
project's emphasis on digital film-making, most are related to video, but not
all of them are so esoteric that a semester of cinematography class is
required. The gradual exposure function, for example, allows the user
to switch from one exposure setting to another while still filming;
Magic Lantern will smoothly transition through the intermediate
shutter and ISO speed settings, so that the change fades in (so to
speak), instead of hitting all at once.
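To make the idea of "intermediate settings" concrete, here is a small
Python sketch that ramps ISO speed from one value to another. The
article does not say how Magic Lantern chooses its intermediate
values; this version assumes the steps are evenly spaced in
photographic stops (that is, geometrically), and the exposure_ramp()
function is a hypothetical name.

```python
def exposure_ramp(start_iso, end_iso, steps):
    """Return `steps` ISO values from start_iso to end_iso, inclusive.

    Values are spaced geometrically, so each step changes the exposure
    by the same fraction of a stop. This spacing is an assumption made
    for illustration.
    """
    if steps < 2:
        raise ValueError("a ramp needs at least a start and an end value")
    if start_iso <= 0 or end_iso <= 0:
        raise ValueError("ISO values must be positive")
    ratio = end_iso / start_iso
    return [start_iso * ratio ** (i / (steps - 1)) for i in range(steps)]


# Four stops (ISO 100 to ISO 1600) spread over five values.
ramp = [round(value) for value in exposure_ramp(100, 1600, 5)]
print(ramp)
```

Spreading the same ramp over more steps gives smaller jumps between
consecutive frames, which is what makes the change appear to fade in.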
But there are more unusual features, too. The HDR video mode, for
example, shoots twice as many frames as normal, alternating the
exposure of each: one set to properly expose the highlights, and one
set to properly expose the shadows. Combining the results into a
single video stream is not easy, though, and needs to be done in
post-production software. So far no tool exists for Linux users,
although there is a script
using the open source VirtualDub and Enfuse applications.
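The first step of that post-production work, separating the
interleaved frames into two streams and pairing them up, can be
sketched in Python as follows. Frames are represented by plain strings
here, the function names are hypothetical, and the assumption that the
highlight exposure comes first is ours; the actual blending of each
pair (the job done by a tool like Enfuse) is not shown.

```python
def split_alternating(frames):
    """Split an interleaved sequence into its two exposure streams."""
    first_exposure = frames[0::2]   # frames 0, 2, 4, ...
    second_exposure = frames[1::2]  # frames 1, 3, 5, ...
    return first_exposure, second_exposure


def pair_exposures(frames):
    """Pair each frame with its neighbor from the other exposure.

    A trailing frame with no partner is dropped, since it cannot be
    merged into an HDR frame.
    """
    first_exposure, second_exposure = split_alternating(frames)
    return list(zip(first_exposure, second_exposure))


# "h" marks a highlight exposure and "s" a shadow exposure.
recorded = ["h0", "s0", "h1", "s1", "h2"]
pairs = pair_exposures(recorded)
print(pairs)
```

Each pair yields one output frame, so the merged video ends up at half
the recorded frame rate.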
The majority of the Magic Lantern features enabled for the 5D Mark III at
the moment are of the display or composition aid variety, though. But this is not to
say that they are merely cosmetic; some offer important enhancements.
For instance, the "display gain" feature brightens the live preview
window so that items in frame are visible even if it is pitch black
outside. That allows the user to compose a decent-looking foreground
when doing night shooting or astrophotography, which is a nearly impossible
task otherwise.
As a still photographer, I am more interested in some of Magic
Lantern 2.3's features that are not yet available on the 5D Mark III.
To be honest, though, there are so many features these days that
nearly every user would find something useful even in a random subset
of them. That is a testament to the development team's
creativity. More important, of course, is that such aftermarket
firmware allows camera owners to do more (and better) creative work.
To Canon's credit, the company has not cracked down on Magic Lantern
or CHDK; in fact, it adeptly steps around the issue of whether using
either project is a warranty violation. Those users with camera
models supported by stable
builds of 2.3 should consider giving Magic Lantern a try — but should
do so with open eyes. With a well-tested model, there is relatively
little risk of doing damage to one's camera, but there is virtually no
recourse should something go horribly wrong. Perhaps the best advice
is to say cowboy up, but do your reading first.
By Nathan Willis
January 9, 2013
Version 12 of the XBMC
media-playback application is currently in
the final stages of development; release candidate 3 was released
on January 3. There are multiple enhancements to the codebase, but
one of the biggest stories is that XBMC v12 will officially add
support for Android. An Android port naturally makes XBMC available
on tablets and handsets, but, just as importantly, it enables running
on numerous set-top boxes, "smart TVs," and the increasingly-popular
smart TV dongle — device classes currently dominated by
proprietary applications produced by entertainment companies.
Binary builds of RC3 of XBMC v12 are available for download from
xbmc.org. The Android build is an .apk package that is installable on
any device on which the user has enabled installation of non-Play
Store software. The project site says that XBMC will eventually come
to the Play Store, but not during the pre-release phase. The XBMC
wiki has an Android
hardware page outlining which devices have tested well with which
media types — as one might expect, there is a significantly
higher hardware threshold required to enable 1080p video playback.
The target platform for the initial Android release is set-top
boxes, in particular the Pivos XIOS DS, which is a compact ARM
Cortex-A9 device that the team used as the reference development
platform. The project offers
a few guidelines for assessing the suitability of other devices,
including a note that practically speaking, any Android device that
does not have the NEON-compatible coprocessor (or does not have it enabled) will
probably be unable to play back HD video. Nevertheless, there are
unsupported NEON-free builds linked to from the Android hardware wiki
page. The final caveat is that thus far the porting effort has not
addressed power consumption, so users of battery-powered mobile devices
may find XBMC to be quite draining — although the project
assures users that this, too, will be addressed in the future.
Users of wall-powered set-top boxes, of course, may not find high
power consumption as problematic.
Functionality
I tested the new release on a Nook Tablet running CyanogenMod 7 (CM7),
and the battery-draining issue is indeed no joke. The device boasts a
4000 mAh battery, which XBMC managed to drain completely in a little
over 3 hours, even though video playback only accounted for a small
portion of the time. Granted, CM7 is an unofficial port for this
particular device and comes with its own share of power consumption
problems. Still, it is clear that there is considerable room for
improvement. Nevertheless, even on year-old hardware and a
less-than-up-to-date version of Android, XBMC runs remarkably well.
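For a sense of scale, the figures above translate into an average
current draw with a single division. The short Python sketch below
assumes the battery started fully charged and delivered its full rated
capacity, neither of which was measured; because the actual runtime
was a little over three hours, the result is best read as an upper
bound.

```python
def average_draw_ma(capacity_mah, hours):
    """Average current in milliamps needed to empty a battery in `hours`."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return capacity_mah / hours


# 4000 mAh drained in (slightly more than) 3 hours.
upper_bound = average_draw_ma(4000, 3.0)
print(f"average draw: at most about {upper_bound:.0f} mA")
```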
Feature-wise, the good news is that the Android port is nothing
short of the full XBMC experience — this is not a "light" or
"mobile" version of the software. All of the media formats, network
protocols, and add-ons supported in desktop XBMC are available in the
Android edition. NFS access was missing from some of the early betas
of XBMC v12, but as of now, there are no major gaps in player
functionality. Video playback from standard-definition web sources
was smooth, and a significantly better experience than accessing the
same sites through either the stock Android browser or Firefox. Audio
playback rarely stress-tests modern devices, so it gets less
attention in reviews, but all of the audio add-ons tested worked like
a charm as well.
There are, however, still hiccups to be encountered in individual
plug-ins. To some degree this is unavoidable; a huge subset of the
video playback add-ons, for example, are "screen scraper"-style hacks
to retrieve content from specific Web-based video services, such as
the many cable and broadcast TV channels that offer a subset of their
programming online. The authors of these add-ons must rewrite their
page parsing code every time the target site alters its layout, but
one of XBMC's strengths is that add-ons are installable from within
the XBMC interface, and updates to restore service can be pushed
out quickly.
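The fragility of the screen-scraping approach is easy to see in a
small example. The Python sketch below uses only the standard
library's html.parser module and does not touch any XBMC add-on API;
the page markup, the "video-link" class name, and the helper names are
all invented for illustration.

```python
from html.parser import HTMLParser


class VideoLinkParser(HTMLParser):
    """Collect the href of every <a> tag carrying a given CSS class."""

    def __init__(self, css_class="video-link"):
        super().__init__()
        self.css_class = css_class  # hypothetical class used by the site
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        href = attributes.get("href")
        if self.css_class in classes and href:
            self.links.append(href)


def extract_video_links(html, css_class="video-link"):
    parser = VideoLinkParser(css_class)
    parser.feed(html)
    parser.close()
    return parser.links


SAMPLE_PAGE = """
<ul>
  <li><a class="video-link" href="/watch/101">Episode 1</a></li>
  <li><a class="nav" href="/about">About</a></li>
  <li><a class="video-link featured" href="/watch/102">Episode 2</a></li>
</ul>
"""

links = extract_video_links(SAMPLE_PAGE)
print(links)
```

If the site renames that class in a redesign, the function quietly
returns an empty list, which is exactly the kind of breakage that
forces add-on authors to push out an update.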
But reliance on third-party add-on developers has its downside;
there are other add-ons available for desktop Linux XBMC
that do not seem to work for the Android build, such as the D-Bus
based notifications, some of which may never work because of platform
limitations. Still others offer functionality that depends on
external factors, such as the MythBox add-on, which allows XBMC to
play back content from a MythTV back-end. That add-on, however, only
supports MythTV 0.24, which is two releases out of date.
Experience
A far more significant problem with XBMC v12 on Android is
navigating the user interface. XBMC has long had navigation
"trapdoors": spots where it is possible to navigate into a menu or tool,
but it is either impossible to navigate back out, or it is only
possible to navigate back out through different means (for example,
menus where the left-arrow key allows you to enter a screen, but the
screen can only be exited by hitting Escape). These
trapdoors are usability warts under the best of circumstances, but on
an Android device they can literally leave the user stranded if the
device does not have a hardware keyboard. Android phones
might have a keyboard; tablets will not. Some set-top boxes
come with wireless keyboards, although they are largely looked down
on, and there is always the possibility of pairing Bluetooth
keyboards. But users seem to loathe putting down the directional
remote with its single-thumb driveability.
Trapdoors are not the only interface difficulty, however. Many of
XBMC's screens and onscreen controls assume the presence of either a
traditional pointer or a touchscreen. Jumping directly to a specific
point in the timeline of a song or video, for instance, requires a
pointing device to be at least marginally accurate. There may not be a
one-size-fits-all solution, considering the variety of content types
XBMC plays (and the variety of caching/streaming challenges that
accompany them), but some more work will probably be required to
optimize for the Android set-top box, which is often touch-free (and
may be pointer-free as well).
But the bigger question that XBMC needs to answer for potential
Android users is how it offers an improvement over getting at the
same content through other applications. Quite simply, the answer it
gives is "it depends" — entirely on the type of content.
Consuming Internet-delivered video and audio is significantly better
through XBMC than it is through a browser. The difference is not
quite as stark when compared to a dedicated Android application for a
particular service (such as Grooveshark). And XBMC is far less
compelling for content that requires more manual searching and
browsing.
Take podcasts, for example. XBMC supports managing podcasts, but its
interface for subscribing and listening to them is no better than any
other on the market. In fact, when coupled with the difficulties of
using the UI without a keyboard, it may actually be slightly worse.
The same is true for watching or listening to files from local storage
— there is no compelling advantage to using XBMC for this task
over the stock Android tools, and in some places the interface makes
the task more difficult.
As a result, XBMC for Android works well as an Internet content
front-end, where a set-top box must compete against the rapidly
growing stack of commercial streaming boxes from Roku, Netgear, and
everyone else at the big consumer electronics shows. Some of these
commercial products also offer an interface into the owner's local
music and video collection (typically through UPnP/DLNA).
XBMC can match that experience, although with a large enough
collection no DLNA solution is particularly pleasant — all
eventually fall back on scrolling through page after page of track
titles.
Where XBMC has a clear advantage is that it will always be able to
offer access to more online content than these proprietary
competitors, because the community writes its own add-ons and updates
them without the need to call in lawyers and negotiate complex
multi-year distribution deals. This is probably where XBMC will make
the biggest splash, if and when users of commercial Android set-top
boxes can install XBMC through the Google Play store. The
do-it-yourself crowd will probably find a desktop Linux-based XBMC
set-top box both easier to build and more flexible — but the
average consumer may very well discover a new world through seeing
XBMC available as a one-click installation option.
The application may also end up being a handy option on handheld
Android devices (once the power-consumption issues are fixed). There
will probably be more and better options for podcasts and locally
stored content, but XBMC's unified front-end to a wealth of
Internet-delivered services is likely to be a hit even on phones. If
nothing else, it saves users the trouble of scrolling through dozens
and dozens of extra application launchers.
Page editor: Jonathan Corbet