GUADEC: Motion tracking with Skeltrack

By Nathan Willis
August 1, 2012

At the beginning of his GUADEC 2012 talk, developer Joaquim Rocha showed an image from Steven Spielberg's oft-cited 2002 film Minority Report. When the movie came out, it attracted considerable attention for its gesture-driven computing. But, Rocha said, we have already surpassed the film's technology, because the special gloves it depicted are no longer needed. Rocha's Skeltrack library can leverage Microsoft Kinect or similar depth-mapping hardware to find users and recognize their positions and movements. Skeltrack is not an all-in-one hands-free user interface, but it solves a critical problem in such an application stack.

[Rocha at GUADEC]

Rocha's presentation fell on Sunday, July 29, the last "talk" day of the week-long event. Although GUADEC is a GNOME project event, the Skeltrack library's primary dependency is GLib, so it should be useful on non-GNOME platforms as well. Rocha launched Skeltrack in March, and has released a few updates since. The current version is 0.1.4 from June, and is available on GitHub. For those who don't follow Microsoft hardware, the Kinect uses an infrared illuminator to project a dot pattern onto the scene in front of it, and an infrared sensor reads the distortion in the pattern to map out a "depth buffer" of the objects or people in the field of view.

How it works

As the name suggests, Skeltrack is a library for "skeleton tracking." It is built to take data from a depth buffer like the one provided by the Kinect device, locate the image of a (single) human being in the buffer, and identify the "joints." Currently Skeltrack picks out seven: one head, two shoulders, two elbows, and two hands. Those joints can then be used by the application, letting the user manipulate objects, or for further processing (such as gesture recognition). The Kinect is the primary hardware device used with Skeltrack, Rocha said (because of its low price point and simple, hackable USB interface), but the library is hardware independent. Skeltrack builds on the existing libfreenect library for device control, and includes GFreenect, a GObject wrapper library around libfreenect (because, as Rocha quipped, "we really like our APIs in GNOME").

One might be tempted to think that acquiring the 3D depth information is the tricky part of the process, and that picking a human being out of the image is not that complicated. But such is not the case. Libfreenect, Rocha said, cannot tell you whether the depth information depicts a human being, or a cow, or a monkey, much less identify joints and poses. There are three proprietary ways to get skeleton information out of libfreenect depth buffers: the commercial OpenNI framework, Microsoft's Kinect SDK, and Microsoft's Kinect For Windows. Despite its name, OpenNI includes many non-free components, the skeleton-tracking module included. The Kinect SDK is licensed for non-commercial use only, while Kinect for Windows is a commercial offering, and only works with the desktop version of the Kinect.

Moreover, the proprietary solutions generally rely on a database of "poses" against which the depth buffer is compared, in an attempt to match the image against known patterns. That approach is slow and has difficulty picking out people of different body shapes, so Rocha looked for an alternative. He found Andreas Baak's paper "A Data-Driven Approach for Real-Time Full Body Pose Reconstruction from a Depth Camera." Baak's algorithm uses pattern matching, too, but it provided a valuable starting point: locating the mathematical extrema in the body shape detected, then proceeding to deduce the skeleton.
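The talk summary does not spell out how the extrema are located. As a rough illustration only, the sketch below assumes they are "geodesic" extrema: the points farthest from a starting point near the body's center, with distance measured along a graph of neighboring depth points rather than in a straight line. The function names, the brute-force graph construction, and the `max_edge` threshold are all hypothetical and are not taken from Skeltrack's source.

```python
import heapq
import math


def build_graph(points, max_edge):
    """Connect every pair of 3D points closer than max_edge (brute force)."""
    graph = {i: [] for i in range(len(points))}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            if d <= max_edge:
                graph[i].append((j, d))
                graph[j].append((i, d))
    return graph


def geodesic_distances(graph, sources):
    """Multi-source Dijkstra: along-the-graph distance to the nearest source."""
    dist = {node: math.inf for node in graph}
    heap = []
    for s in sources:
        dist[s] = 0.0
        heap.append((0.0, s))
    heapq.heapify(heap)
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale queue entry
        for neighbor, weight in graph[node]:
            nd = d + weight
            if nd < dist[neighbor]:
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist


def find_extrema(points, start, count, max_edge):
    """Return indices of `count` points that are successively farthest
    (along the graph) from the start point and from earlier extrema."""
    graph = build_graph(points, max_edge)
    sources = [start]
    extrema = []
    for _ in range(count):
        dist = geodesic_distances(graph, sources)
        reachable = {n: d for n, d in dist.items() if d < math.inf}
        farthest = max(reachable, key=reachable.get)
        extrema.append(farthest)
        # Treat the new extremum as a source so the next pass looks elsewhere.
        sources.append(farthest)
    return extrema


# A toy "T"-shaped figure: a center point, a short head column, two arms.
points = (
    [(0, 0, 0)]
    + [(0, y, 0) for y in (1, 2, 3)]        # head column, tip at index 3
    + [(-x, 0, 0) for x in (1, 2, 3, 4)]    # one arm, tip at index 7
    + [(x, 0, 0) for x in (1, 2, 3, 4, 5)]  # other arm, tip at index 12
)
print(find_extrema(points, start=0, count=3, max_edge=1.1))  # [12, 7, 3]
```

On this toy figure the three extrema are the tips of the two arms and the top of the head column, which is the kind of candidate set the heuristics would then have to label.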

Heuristics are used to determine which three extrema are most likely to be the head and shoulders (with the head being in the middle), and which are hands. Subsequently, a graph is built connecting the points found, and analyzed to determine which shoulder each hand belongs to (based on proximity). Elbows are inferred as being roughly halfway along the path connecting each hand to its shoulder. The result is a skeleton detected without any "computer vision" techniques, and without any prior calibration steps. The downside of this approach is that for the moment it only works for upper-body recognition, although Rocha said full-body detection is yet to come.
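The last two steps, attaching a hand to the nearer shoulder and placing the elbow about halfway along the connecting path, can be sketched in a few lines. This is a minimal illustration that assumes the paths through the graph have already been computed and are given as lists of 3D points in millimeters; the function names and the input layout are hypothetical, not Skeltrack's.

```python
import math


def path_length(path):
    """Total length of a polyline given as a list of 3D points."""
    return sum(math.dist(a, b) for a, b in zip(path, path[1:]))


def point_at_fraction(path, fraction):
    """Walk along the polyline and return the point at the given
    fraction (0.0 to 1.0) of its total length."""
    target = path_length(path) * fraction
    walked = 0.0
    for a, b in zip(path, path[1:]):
        segment = math.dist(a, b)
        if segment > 0 and walked + segment >= target:
            t = (target - walked) / segment
            return tuple(pa + t * (pb - pa) for pa, pb in zip(a, b))
        walked += segment
    return path[-1]


def attach_hand(paths_to_shoulders):
    """Pick the shoulder with the shortest path from the hand, and place
    the elbow halfway along that path.

    paths_to_shoulders maps a shoulder label to the path (hand first,
    shoulder last) that connects the hand to that shoulder.
    """
    shoulder = min(paths_to_shoulders,
                   key=lambda s: path_length(paths_to_shoulders[s]))
    elbow = point_at_fraction(paths_to_shoulders[shoulder], 0.5)
    return shoulder, elbow


# Made-up paths for one hand, in millimeters.
paths = {
    "left": [(0, 0, 0), (0, 400, 0), (200, 400, 0)],
    "right": [(0, 0, 0), (0, 400, 0), (200, 400, 0), (600, 400, 0)],
}
print(attach_hand(paths))  # ('left', (0.0, 300.0, 0.0))
```

Measuring "halfway" along the path rather than along a straight line between hand and shoulder is what lets the inferred elbow follow a bent arm.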

How to use it

Skeltrack's SkeltrackSkeleton object has tweakable parameters for expected shoulder and hand distances, plus other measurements to modify the algorithm. One of the more important parameters is smoothing, which helps cope with the jitter often found in skeleton detection. For starters, Kinect depth data can be quite noisy, and on top of that, the heuristics used to find joints in the library result in rapid, tiny changes. Rocha showed a live demo of Skeltrack on stage, and with the smoothing function deactivated, the result was entertaining to watch, but would not be pleasant to use when interacting with one's computer. The downside is that running the smoothing formula costs CPU cycles; one can maximize smoothing, but the result is higher latency, which might hamper interactive applications.
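The trade-off between jitter and latency is easy to see with a toy filter. The sketch below uses plain exponential smoothing as a stand-in; it is an assumption chosen for illustration, not the formula Skeltrack uses, and the `JointSmoother` class and its `factor` parameter are hypothetical.

```python
class JointSmoother:
    """Exponentially smooth one joint's 3D position.

    factor = 0.0 passes every new reading straight through (no smoothing);
    values closer to 1.0 suppress more jitter but react more slowly.
    """

    def __init__(self, factor):
        if not 0.0 <= factor < 1.0:
            raise ValueError("factor must be in [0.0, 1.0)")
        self.factor = factor
        self.state = None

    def update(self, position):
        if self.state is None:
            # First reading: nothing to blend with yet.
            self.state = tuple(position)
        else:
            self.state = tuple(
                self.factor * old + (1.0 - self.factor) * new
                for old, new in zip(self.state, position)
            )
        return self.state


# A hand that jumps 100 mm to the right in a single frame.
readings = [(0, 0, 0), (100, 0, 0), (100, 0, 0), (100, 0, 0)]
for factor in (0.0, 0.5):
    smoother = JointSmoother(factor)
    xs = [smoother.update(r)[0] for r in readings]
    print(factor, xs)
```

With a factor of 0.0 the reported position follows the jump immediately; with 0.5 it takes several frames to approach the new position, which is the latency that heavier smoothing introduces.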

Rocha also demonstrated a few poses that can confuse Skeltrack's algorithm. For example, when standing hands-on-hips, there are no "hand" extrema to be found, leading the algorithm to conclude that the elbows are hands. With one hand raised to head height and the corresponding elbow held at shoulder height (as one might do while waving), the algorithm cannot find the shoulder, and thus cannot figure out which of the extrema is the head and which is the hand. Nevertheless, Skeltrack is quite good at interpreting common motions. Rocha demonstrated it with a sample program that simply drew the skeleton on screen, and also with a GNOME 3 desktop control application. The desktop application is hardcoded to a handful (pun semi-intended) of actions, rather than built on a general gesture input framework. There was also a demo set up at the Igalia (Rocha's employer) expo booth.

Skeltrack provides both an asynchronous and a synchronous API, and it reports the locations of joints in both "real world" and screen coordinates: millimeters in the original scene and pixels in the webcam image, respectively. Currently the code is limited to identifying one person in the buffer, but there are evidently ways to work around the limitation. Rocha said that a company in Greece was using OpenCV to recognize multiple people in the depth buffer, then running Skeltrack separately on each part of the frame that contained a person. However, the project in question was not doing the skeleton recognition in real time.
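The multi-person workaround amounts to handing the tracker one person at a time. The sketch below shows only the splitting step, under several assumptions: the depth buffer is a list of rows, a separate detector (not shown) has already produced one bounding box per person, and a value of 0 means "no data." The function name and the box layout are hypothetical.

```python
def split_by_person(depth, boxes, empty=0):
    """Return one depth buffer per bounding box, blanking every pixel
    outside that box.

    Each box is (left, top, right, bottom) in pixels, with right and
    bottom exclusive. Each returned buffer has the same dimensions as
    the input, so joint coordinates stay in the original frame.
    """
    buffers = []
    for left, top, right, bottom in boxes:
        masked = [
            [value if left <= x < right and top <= y < bottom else empty
             for x, value in enumerate(row)]
            for y, row in enumerate(depth)
        ]
        buffers.append(masked)
    return buffers


# A tiny 4x2 "frame" with two people side by side.
depth = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
]
for person in split_by_person(depth, [(0, 0, 2, 2), (2, 0, 4, 2)]):
    print(person)
```

Each masked buffer could then be passed to a single-person tracker in turn. The cost grows with the number of people detected, which may be one reason the project Rocha described did not run in real time.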

Libfreenect (and thus Skeltrack) is not tied into the XInput input system, nor is Skeltrack itself bound to a multi-touch application framework. That is one possible direction for the code to head in the future; hooking Skeltrack into the same touch event and gesture recognition libraries as multi-touch pads and touch-screens would make Kinect-style hardware more accessible to application developers. But that cannot be the endpoint — depth buffers offer richer information than 2D touch devices; developers can and will find more (and more unusual) things to do with this new interface method. Skeltrack is ahead of the competition (libfreenect lacks skeleton tracking, but its developers recognize the need for it), and that is a win not just for GNOME, but for open source software in general.

[The author would like to thank the GNOME Foundation for travel assistance to A Coruña for GUADEC.]


Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds