At the beginning of his GUADEC
2012 talk, developer Joaquim Rocha showed an image from Steven
Spielberg's oft-cited 2002 film Minority Report. When
the movie came out, it attracted considerable attention for its
gesture-driven computing. But, Rocha said, we have already surpassed
the film's technology, because the special gloves it depicted are no
longer needed. Rocha's Skeltrack library can leverage Microsoft Kinect or similar
depth-mapping hardware to find users and recognize their positions and
movements. Skeltrack is not an all-in-one hands-free user interface,
but it solves a critical problem in such an application stack.
Rocha's presentation fell on Sunday, July 29, the last "talk" day of
the week-long event. Although GUADEC is a GNOME project event, the
Skeltrack library's primary dependency is GLib, so it should be
useful on non-GNOME platforms as well. Rocha launched Skeltrack in
March, and has released a few updates since. The current version is
0.1.4 from June, and is available on
GitHub. For those who don't follow Microsoft hardware,
the Kinect uses an infrared illuminator to project a dot pattern onto
the scene in front of it, and an infrared sensor reads the distortion
in the pattern to map out a "depth buffer" of the objects or people in
the field of view.
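In code terms, a depth buffer is nothing exotic: a flat array of
per-pixel distance samples. The short C sketch below illustrates the
idea; the 640x480 resolution and 16-bit millimeter samples match what
Kinect-class hardware typically delivers, but they are assumptions
here, not details from the talk.

    /* Illustrative only: a depth buffer is a per-pixel distance map.
     * The 640x480 size and 16-bit millimeter samples are assumptions
     * matching typical Kinect output. */
    #include <stdint.h>
    #include <stdio.h>

    #define FRAME_WIDTH  640
    #define FRAME_HEIGHT 480

    static uint16_t
    depth_at (const uint16_t *buffer, int x, int y)
    {
      /* Each sample is the distance from the camera to whatever object
       * covers that pixel, in millimeters; 0 means "no reading". */
      return buffer[y * FRAME_WIDTH + x];
    }

    int
    main (void)
    {
      /* A real frame would come from the sensor; this one is empty. */
      static uint16_t frame[FRAME_WIDTH * FRAME_HEIGHT];

      printf ("center pixel depth: %u mm\n",
              depth_at (frame, FRAME_WIDTH / 2, FRAME_HEIGHT / 2));
      return 0;
    }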
How it works
As the name suggests, Skeltrack is a library for "skeleton tracking."
It is built to take data from a depth buffer like the one provided by
the Kinect device, locate the image of a (single) human being in the buffer,
and identify the "joints." Currently Skeltrack picks out seven: one
head, two shoulders, two elbows, and two hands. Those joints can then
be used directly by the application (to let the user manipulate
on-screen objects, for example) or passed on for further processing
(such as gesture recognition). The Kinect is the primary
hardware device used with Skeltrack, Rocha said (because of its low
price point and simple, hackable USB interface), but the library is hardware
independent. Skeltrack builds on the existing libfreenect library for device
control, and includes GFreenect, a GObject
wrapper library around libfreenect (because, as Rocha quipped, "we
really like our APIs in GNOME").
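From the application's point of view, the end product of tracking is
a list of joints that can be queried by ID. The sketch below assumes
the joint-ID constants and the skeltrack_joint_list_get_joint()
accessor as they appear in the 0.1.x API; the exact names and struct
fields should be verified against the library's headers.

    /* Sketch: reading joints out of a tracked frame. Names follow the
     * 0.1.x Skeltrack API; check skeltrack-joint.h on your version. */
    #include <skeltrack.h>

    static void
    print_upper_body (SkeltrackJointList joints)
    {
      /* The seven joints Skeltrack currently reports. */
      SkeltrackJointId ids[] = {
        SKELTRACK_JOINT_ID_HEAD,
        SKELTRACK_JOINT_ID_LEFT_SHOULDER,
        SKELTRACK_JOINT_ID_RIGHT_SHOULDER,
        SKELTRACK_JOINT_ID_LEFT_ELBOW,
        SKELTRACK_JOINT_ID_RIGHT_ELBOW,
        SKELTRACK_JOINT_ID_LEFT_HAND,
        SKELTRACK_JOINT_ID_RIGHT_HAND
      };
      guint i;

      for (i = 0; i < G_N_ELEMENTS (ids); i++)
        {
          SkeltrackJoint *joint =
            skeltrack_joint_list_get_joint (joints, ids[i]);

          if (joint == NULL)
            continue; /* this joint was not found in the frame */

          g_print ("joint %u at (%d, %d) px\n",
                   ids[i], joint->screen_x, joint->screen_y);
        }
    }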
One might be tempted to think that acquiring the 3D depth information
is the tricky part of the process, and that picking a human being out
of the image is not that complicated. But such is not the case.
Libfreenect, Rocha said, cannot tell you whether the depth information
depicts a human being, or a cow, or a monkey, much less identify
joints and poses. There are three proprietary ways to get skeleton
information out of Kinect depth data: the commercial OpenNI framework,
Microsoft's Kinect SDK, and Microsoft's Kinect for Windows. Despite
its name, OpenNI includes many non-free components, including the
skeleton-tracking module. The Kinect SDK is licensed for
non-commercial use only, while Kinect for Windows is a commercial
offering that works only with the desktop version of the Kinect.
Moreover, the proprietary solutions generally rely on a database of
"poses" against which the depth buffer is compared, in an attempt to
match the image against known patterns. That approach is slow and has
difficulty picking out people of different body shapes, so Rocha
looked for another approach. He found Andreas Baak's paper A
Data-Driven Approach for Real-Time Full Body Pose Reconstruction from
a Depth Camera [PDF]. Baak's algorithm uses pattern matching, too, but it
provided a valuable starting point: locating the mathematical extrema
in the body shape detected, then proceeding to deduce the skeleton.
Heuristics are used to determine which three extrema are most likely
to be the head and shoulders (with the head being in the middle), and
which are hands. Subsequently, a graph is built connecting the points
found, and analyzed to determine which shoulder each hand belongs to
(based on proximity). Elbows are inferred as being roughly halfway
along the path connecting each hand to its shoulder. The result is a
skeleton detected without any "computer vision" techniques, and
without any prior calibration steps. The downside of this approach
is that, for the moment, it only works for upper-body recognition,
although Rocha said full-body detection is planned.
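As a toy illustration of those heuristics (and emphatically not
Skeltrack's actual code), the C sketch below pairs a hand extremum
with its nearer shoulder and places the elbow at the midpoint of the
pair. Every name and type here is hypothetical, and the real library
measures distances along a graph built over the depth buffer rather
than along straight line segments.

    /* Hypothetical illustration of the elbow-inference heuristic. */
    #include <math.h>

    typedef struct { double x, y, z; } Point3D;

    static double
    distance (Point3D a, Point3D b)
    {
      return sqrt ((a.x - b.x) * (a.x - b.x) +
                   (a.y - b.y) * (a.y - b.y) +
                   (a.z - b.z) * (a.z - b.z));
    }

    /* Attach a hand extremum to whichever shoulder lies closer to it,
     * then infer the elbow as the midpoint of the hand-shoulder pair.
     * (The head is found earlier, as the extremum sitting between the
     * two shoulder candidates.) */
    static Point3D
    infer_elbow (Point3D hand, Point3D left_shoulder,
                 Point3D right_shoulder)
    {
      Point3D shoulder =
        (distance (hand, left_shoulder) < distance (hand, right_shoulder))
          ? left_shoulder : right_shoulder;
      Point3D elbow = {
        (hand.x + shoulder.x) / 2.0,
        (hand.y + shoulder.y) / 2.0,
        (hand.z + shoulder.z) / 2.0
      };
      return elbow;
    }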
How to use it
Skeltrack's tracking is done through a SkeltrackSkeleton object, which
has tweakable parameters for expected shoulder and hand distances, plus
other measurements to modify the algorithm. One of the more important parameters
is smoothing, which helps cope with the jitter often found in skeleton
detection. For starters, Kinect depth data can be quite noisy,
and on top of that, the heuristics the library uses to find joints
produce rapid, tiny changes in the reported positions. Rocha showed a
live demo of Skeltrack on stage; with the smoothing function
deactivated, the result was entertaining to watch, but would not be
pleasant when interacting with one's computer. The downside is that
running the smoothing formula costs CPU cycles; one can maximize
smoothing, but the result is higher latency, which might hamper
interactive use.
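Tuning is done through GObject properties on the skeleton object. A
minimal sketch follows; the "enable-smoothing" and "smoothing-factor"
property names are taken from the 0.1.x headers but should be
verified against skeltrack-skeleton.h, and the 0.5 value is merely a
plausible starting point, not a recommendation from the talk.

    /* Sketch: trading jitter against latency via smoothing.
     * Property names follow the 0.1.x API; verify against your headers. */
    #include <skeltrack.h>

    static SkeltrackSkeleton *
    make_tracker (void)
    {
      SkeltrackSkeleton *skeleton = skeltrack_skeleton_new ();

      g_object_set (skeleton,
                    "enable-smoothing", TRUE,
                    "smoothing-factor", 0.5, /* higher = steadier, laggier */
                    NULL);
      return skeleton;
    }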
Rocha also demonstrated a few poses that can confuse Skeltrack's
algorithm. For example, when standing hands-on-hips, there are no
"hand" extrema to be found, leading the algorithm to conclude
that the elbows are hands. With one hand raised to head height and the
corresponding elbow held at shoulder height (as one might do while
waving), the algorithm cannot find the shoulder, and thus cannot
figure out which of the extrema is the head and which is the hand.
Nevertheless, Skeltrack is quite good at interpreting common motions.
Rocha demonstrated it with a sample program that simply drew the
skeleton on screen, and also with a GNOME 3 desktop control application.
The desktop application supports a hardcoded handful (pun semi-intended)
of actions rather than offering a general gesture-input framework. There was
also a demo set up at the Igalia (Rocha's employer) expo booth.
Skeltrack provides both an asynchronous and a synchronous API, and it
reports the locations of joints in both "real world" and screen
coordinates — measured in millimeters in the original scene and
pixels in the webcam image. Currently the code is limited to
identifying one person in the
buffer, but there are evidently ways to work around the limitation.
Rocha said that a company in Greece was using OpenCV to recognize
multiple people in the depth buffer, then running Skeltrack separately
on each part of the frame that contained a person. However, the
project in question was not doing the skeleton recognition in real time.
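For reference, here is a sketch of how Skeltrack's asynchronous entry
point is driven; the function names and signatures reflect the 0.1.x
headers as best your author can reconstruct them, so treat them as
assumptions and check skeltrack-skeleton.h before relying on them.

    /* Sketch of the asynchronous tracking call, 0.1.x-style API. */
    #include <skeltrack.h>

    static void
    on_joints_tracked (GObject *source, GAsyncResult *result,
                       gpointer user_data)
    {
      GError *error = NULL;
      SkeltrackJointList joints =
        skeltrack_skeleton_track_joints_finish (SKELTRACK_SKELETON (source),
                                                result, &error);

      if (error != NULL)
        {
          g_warning ("tracking failed: %s", error->message);
          g_error_free (error);
          return;
        }
      /* ... use the joints, then free the list ... */
      skeltrack_joint_list_free (joints);
    }

    static void
    track_frame (SkeltrackSkeleton *skeleton, guint16 *depth,
                 guint width, guint height)
    {
      /* Hand one depth frame to Skeltrack; the callback fires when the
       * skeleton for that frame has been computed. */
      skeltrack_skeleton_track_joints (skeleton, depth, width, height,
                                       NULL, on_joints_tracked, NULL);
    }

A synchronous variant is also available for callers that would rather
block until the frame has been processed.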
Libfreenect (and thus Skeltrack) is not tied into the XInput input system,
nor is Skeltrack itself bound to a multi-touch application framework.
That is one possible direction for the code to head in the future;
hooking Skeltrack into the same touch event and gesture recognition
libraries as multi-touch pads and touch-screens would make Kinect-style
hardware more accessible to application developers. But that
cannot be the endpoint — depth buffers offer richer information
than 2D touch devices; developers can and will find more (and more
unusual) things to do with this new interface method. Skeltrack is
ahead of the competition (libfreenect lacks
skeleton tracking, but its developers recognize the need for it), and that is a
win not just for GNOME, but for open source software in general.
[The author would like to thank the GNOME Foundation for travel assistance to A Coruña for GUADEC.]