
A uTouch architecture introduction

May 22, 2012

This article was contributed by Chase Douglas

As the Linux desktop grows in popularity, the user interface experience becomes ever more important. For example, most laptops today have multitouch capabilities that have yet to be fully exposed and exploited in the free software ecosystem. Soon we will be carrying around multitouch tablets with a traditional Linux desktop or similar foundation. In order to provide a high-quality, rich experience, we must make full use of multitouch gestures. The uTouch stack developed by Canonical aims to provide a foundation for gestures on the Linux desktop.

uTouch capabilities

The new X.org multitouch features allow applications to receive multitouch input. We now have a software stack, uTouch, built on top of this support, which can handle practically any gesture scenario imaginable.

A "gesture" is normally thought of as a two-dimensional movement made by the user on some sort of input device—a two-finger pinch, for example, or a three-finger downward drag. Teaching a computer to recognize these movements requires a lower-level description, though; in uTouch, this description consists of values like the number of touches, movement thresholds, and timeout values. An application may register a "gesture subscription" describing a specific gesture and be notified when that gesture is recognized by the uTouch subsystem. Those notifications take the form of a sequence of events describing the gesture motion over time.

Key to understanding how uTouch works is knowledge of all the typical gesture use cases. First, we have the concept of gesture primitives: drag, pinch (including both "pinch" and "spread"), rotate, and tap. These primitives make up the foundation of all intuitive gestures. They can be strung together as needed for more complex gestures, such as a double tap. Stroke gestures, such as drawing an ‘M’ to open the mail client, may be recognized as a specific long gesture sequence, or as a sequence of drag gestures. Note, however, that uTouch does not have stroke gesture detection facilities built-in.

Second, there are two fundamental object interaction types: single-motion, single-interpretation gestures and direct object manipulation. The former involves gestures like a two-touch swipe to go backward and forward through browser history, while the latter involves gestures like a three-touch drag to move an application window around the desktop.

The single-motion, single-interpretation gestures require thresholds and/or timeouts. For example, the commonly understood difference between a swipe and a drag is that a swipe must be a quick motion in a given direction, whereas a drag may be any motion that results in a displacement in space. To put it in uTouch gesture subscription terms, a swipe is a drag primitive gesture with a displacement threshold that must be crossed within a specific amount of time. When implementing browser history gestures, for example, a two-touch swipe might use a threshold of 100 pixels crossed within half a second. In contrast, direct object manipulation usually implies a zero threshold: as soon as three touches begin on a window, the window should be movable.
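
To make the swipe criterion concrete, here is a small, self-contained C sketch (not uTouch code) that applies the numbers from the browser-history example: a drag counts as a swipe only if it covers 100 pixels within half a second.

    /* Sketch only: classify a drag as a "swipe" using the example numbers
     * above (100 px of displacement within 500 ms). Not part of uTouch. */
    #include <math.h>
    #include <stdbool.h>
    #include <stdio.h>

    static bool is_swipe(double dx, double dy, double elapsed_ms)
    {
        const double threshold_px = 100.0;
        const double timeout_ms   = 500.0;

        return elapsed_ms <= timeout_ms && hypot(dx, dy) >= threshold_px;
    }

    int main(void)
    {
        /* A quick 120 px flick in 200 ms is a swipe... */
        printf("%d\n", is_swipe(120.0, 0.0, 200.0));   /* prints 1 */
        /* ...but the same displacement over 2 s is just a drag. */
        printf("%d\n", is_swipe(120.0, 0.0, 2000.0));  /* prints 0 */
        return 0;
    }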

Most simple gesture interactions may be handled through gesture subscriptions consisting of the required gesture primitives and the object interaction types. However, there are times when an application needs further control over gesture recognition. For example, a bezel drag gesture occurs when the user begins a drag from the bezel of the screen and moves inward. This gesture must be distinguished from the user initiating a touch at the edge of the screen. The problem is that the two are indistinguishable at the beginning of the gesture. The distinguishing aspect is that the bezel drag is perpendicular to the bezel and has a non-zero initial velocity as seen by the touchscreen, whereas a direct touch near the edge of the screen will likely have no initial velocity or will not be moving perpendicular to the bezel. To cater for a client that cares about one of these gestures but not the other, uTouch requires the client to accept or reject every gesture. When a gesture is rejected, the touches may be replayed to the X server, which allows for the mixing of gestures and raw multitouch in the same application.
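
The kind of test a client might apply before accepting or rejecting such a bezel drag can be sketched as follows. This is illustrative only; the edge zone and velocity threshold are assumptions, not values taken from uTouch.

    /* Illustrative only: decide whether a drag that began at the left screen
     * edge looks like a bezel drag (moving inward with a non-zero initial
     * velocity) or an ordinary touch near the edge. The numbers below are
     * assumptions, not uTouch's actual heuristics. */
    #include <stdbool.h>
    #include <stdio.h>

    static bool looks_like_left_bezel_drag(double start_x,
                                           double initial_vx_px_per_ms)
    {
        const double edge_zone_px  = 5.0;   /* touch began at the very edge */
        const double min_inward_vx = 0.2;   /* px/ms moving away from bezel */

        return start_x <= edge_zone_px && initial_vx_px_per_ms >= min_inward_vx;
    }

    int main(void)
    {
        printf("%d\n", looks_like_left_bezel_drag(1.0, 0.5));  /* 1: accept */
        printf("%d\n", looks_like_left_bezel_drag(1.0, 0.0));  /* 0: reject */
        return 0;
    }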

Another facet of uTouch, as hinted above, is that, by default, it operates through "touch grab" semantics. When used on top of X.org, uTouch gestures are recognized from touches received through touch grabs. One benefit of this approach is the ability to mix gestures and raw multitouch in the same application. However, it also allows for priority handling of gestures. For example, system gestures may be handled by a client listening to touches through a grab on the root window. When gestures are not recognized or are rejected by the uTouch client, the touches are replayed to the next touch grab or selecting client. Thus, global gestures, application gestures, and raw multitouch events are all possible when using uTouch.

The last major feature of uTouch is the ability to recognize multiple simultaneous gestures in the same area. For example, imagine a game where the user pinches bugs on the screen to squash them. The screen is one large gesture input area, but the user may use both hands to pinch bugs. In order to facilitate this interaction mode, whenever new touches begin within the gesture area they are combinatorially matched with other touches that begin within a "glue" time period. In our game example there is a two-touch pinch gesture subscription. If four touches begin in the game area within the glue time period, six combinations of potential gestures will be matched. As touch events are delivered, the state of each matched gesture will be updated and then checked against the threshold and timeout for the gesture subscription. If a gesture meets the threshold and timeout criteria, it will be delivered to the client. The client can then attempt to match up the touches of the gesture against its context to determine whether to accept or reject each gesture. In the example below, there will be four pinch gestures sent to the client:

[Bugs example]

(Bug icons licensed under LGPL)

There will be potential pinch gestures for: AB, CD, AD, and BC (AC and BD, by virtue of moving in the same direction, are not considered to be potential pinches). The application must determine which gestures make sense. One method would be to hit test the initial centroid of each gesture against the bugs on the screen. All gestures that hit a bug are accepted. Note that uTouch automatically rejects overlapping gestures, so as soon as AB and CD are accepted, AD and BC will be implicitly rejected.
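
One way the game could implement that hit test is sketched below; the structs are hypothetical stand-ins rather than uTouch types. Each candidate gesture's initial centroid is computed, and the gesture is accepted only if the centroid lands on a bug.

    /* Sketch of the accept/reject logic described above. The types are
     * hypothetical stand-ins, not the uTouch API. */
    #include <stdbool.h>
    #include <stdio.h>

    struct point { double x, y; };
    struct rect  { double x, y, w, h; };   /* a bug's bounding box */

    static struct point centroid(const struct point *touches, int n)
    {
        struct point c = { 0.0, 0.0 };
        for (int i = 0; i < n; i++) {
            c.x += touches[i].x;
            c.y += touches[i].y;
        }
        c.x /= n;
        c.y /= n;
        return c;
    }

    static bool hits(struct point p, struct rect r)
    {
        return p.x >= r.x && p.x <= r.x + r.w &&
               p.y >= r.y && p.y <= r.y + r.h;
    }

    int main(void)
    {
        struct rect bug = { 100.0, 100.0, 40.0, 40.0 };

        /* Candidate gesture AB: both touches begin on the bug -> accept. */
        struct point ab[2] = { { 105.0, 110.0 }, { 130.0, 125.0 } };
        /* Candidate gesture AD: touches straddle empty space -> reject. */
        struct point ad[2] = { { 105.0, 110.0 }, { 400.0, 300.0 } };

        printf("AB: %s\n", hits(centroid(ab, 2), bug) ? "accept" : "reject");
        printf("AD: %s\n", hits(centroid(ad, 2), bug) ? "accept" : "reject");
        return 0;
    }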

There is a twist to this complex logic, however. Gesture events are received serially, and the client may need to know whether more gestures are possible for a set of touches. For example, if both one-touch and two-touch drag gestures are subscribed, a two-touch drag will cause two one-touch drag gestures and a two-touch drag gesture. If the uTouch client receives a one-touch drag first, it may not realize that a two-touch drag is coming for the same touch as well. To handle this issue, a gesture property is provided to denote the finish of gesture construction for all of its touches. When a gesture has finished construction, the client knows that it has received all possible gestures containing the same touches. Thus, in the one- and two-touch drag example, the one-touch gesture will not have its construction-finished property set until at least the begin event for the two-touch gesture has been sent to the client.
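
The resulting client-side pattern might look like the following sketch. The event struct is hypothetical, not the uTouch API; the point is simply that the client defers its decision until the construction-finished flag arrives.

    /* Illustrative pattern only; the event struct below is hypothetical and
     * not the uTouch API. The idea: don't act on a one-touch drag until the
     * construction-finished flag says no larger gesture (e.g. a two-touch
     * drag) can still arrive for the same touches. */
    #include <stdbool.h>
    #include <stdio.h>

    struct gesture_event {
        int  gesture_id;
        int  num_touches;
        bool construction_finished;
    };

    static void handle(const struct gesture_event *ev)
    {
        if (!ev->construction_finished) {
            printf("gesture %d (%d touches): hold decision, more may follow\n",
                   ev->gesture_id, ev->num_touches);
            return;
        }
        printf("gesture %d (%d touches): all candidates seen, safe to decide\n",
               ev->gesture_id, ev->num_touches);
    }

    int main(void)
    {
        struct gesture_event one_touch = { 1, 1, false };
        struct gesture_event two_touch = { 2, 2, false };
        struct gesture_event one_final = { 1, 1, true };

        handle(&one_touch);   /* a two-touch drag may still be constructed */
        handle(&two_touch);
        handle(&one_final);   /* now the client can pick between them */
        return 0;
    }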

The uTouch stack was designed to be flexible and provide for all possible gesture use cases. However, it is recognized that not all clients will care about multiple simultaneous gestures. There are plans to create a gesture subscription option that precludes the ability to have multiple simultaneous gestures. This will effectively push some policy into the recognizer, such as a preference for gestures with more touches. This will be particularly useful when subscribing to gestures on an indirect device, like a touchpad, where multiple simultaneous gestures are likely not wanted.

Lastly, uTouch is a complete gesture stack that surpasses the functionality of all available consumer platforms. uTouch works well with both touchscreens and touchpads, and supports both gestures and raw touch events in the same window or region of an application. In contrast, Windows only supports touchscreens, and allows either gestures or raw touch events, but not both, in a given window. OS X supports touchpads but not touchscreens. Mobile platforms are limited to touchscreen support and gestures in a single application at a time due to their modal task design. Unlike each of these platforms, uTouch has been designed from the ground up to support all device types and all known use cases, including multiple applications and windows at the same time.

The technical architecture of uTouch

uTouch consists primarily of three components: uTouch-Frame, uTouch-Grail, and uTouch-Geis. Each of these will be described briefly below.

uTouch-Frame groups touches into units that are easier for uTouch-Grail to operate on. Gestures are recognized per-device and per-window, so touches are grouped into units representing pairs of devices and windows. This is also where the backends for each window system are implemented; uTouch-Frame events themselves are platform independent.

[uTouch-Frame block diagram]
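
As a rough illustration of the grouping (with hypothetical types, not uTouch-Frame's real ones), each unit can be thought of as a frame keyed by the device and window that produced its touches:

    /* Hypothetical sketch (not uTouch-Frame's real types): touches are
     * bucketed by the (device, window) pair they belong to, and each bucket
     * is what the recognizer sees as one unit. */
    #include <stdio.h>

    struct touch { int id; double x, y; };

    struct touch_frame {
        int           device_id;    /* which input device produced the touches */
        unsigned long window_id;    /* which window the touches were delivered to */
        struct touch  touches[32];  /* active touches for this device/window pair */
        int           num_touches;
    };

    int main(void)
    {
        struct touch_frame frame = { 3, 0x2a00001, { { 7, 120.0, 80.0 } }, 1 };
        printf("device %d, window 0x%lx: %d active touch(es)\n",
               frame.device_id, frame.window_id, frame.num_touches);
        return 0;
    }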

Some window systems, like X11, also have the concept of touch sequence acceptance and rejection. This functionality is provided through uTouch-Frame as well.

Touch sequence acceptance and rejection is a core aspect of the uTouch stack when used for system-level gestures. Imagine a finger-painting application that listens for raw touch events (not gestures), running in a desktop environment where three-touch swipes are used to switch between applications. When the user performs such a swipe, uTouch accepts the touch sequences on behalf of the window manager and switches applications. This prevents the painting application from handling (or even seeing) the touches. In contrast, when the user performs a three-touch tap, uTouch rejects the touch sequences because they do not match a known gesture. The painting application then receives the rejected touch sequences.

uTouch-Grail is the gesture recognizer of the uTouch project. It takes the per-device, per-window touch frames from uTouch-Frame and analyzes them for potential gestures.

[uTouch-Grail block diagram]

Grail events are generated from frame events. Rather than duplicate the uTouch-Frame data, a Grail event contains gesture data and a reference to the frame event that generated it. This allows uTouch clients to see the full touch data comprising a gesture.

Grail gesture events consist of a set of touches, a uniform set of gesture properties, and a list of recognized gesture primitives. Again, the supported primitives are: drag, pinch, rotate, and tap. The gesture properties are:

  1. Gesture ID
  2. Gesture state (begin, update, end)
  3. A list of touch IDs for the touches comprising the gesture
  4. The uTouch-Frame event that generated the Grail event
  5. The original and current centroid position of the touches
  6. The original and current average radius, or distance from the centroid, of the touches
  7. A best-fit 2D affine transformation of the touches from their original positions
  8. A best-fit 2D affine transformation of the touches from their previous positions
  9. A flag denoting whether the gesture construction has finished

Drag, pinch, and rotate properties are encapsulated by the affine transformations. For more detail on how to use 2D affine transformations, see the Wikipedia article on transformation matrices.
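
As a worked example of how the gesture primitives fall out of such a transformation, the sketch below assumes the best-fit transform is a similarity transform (translation, rotation, and uniform scale), which is the usual choice for touch gestures; this is not uTouch code.

    /* Sketch only (not uTouch code): decompose a 2D transform of the
     * similarity form
     *     | a  -b  tx |
     *     | b   a  ty |
     * into its gesture meaning. A general affine fit would also include
     * shear; the similarity form is assumed here for simplicity. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* Example: touches rotated 10 degrees, spread apart by 20%, and
         * dragged 15 px right and 5 px down since their original positions. */
        double angle = 10.0 * M_PI / 180.0;
        double scale = 1.2;
        double a = scale * cos(angle), b = scale * sin(angle);
        double tx = 15.0, ty = 5.0;

        /* Recover the gesture components from the matrix entries. */
        double drag_x = tx;
        double drag_y = ty;
        double pinch  = hypot(a, b);   /* uniform scale factor */
        double rotate = atan2(b, a);   /* radians, CCW positive */

        printf("drag   : (%.1f, %.1f) px\n", drag_x, drag_y);
        printf("pinch  : %.2fx\n", pinch);
        printf("rotate : %.1f degrees\n", rotate * 180.0 / M_PI);
        return 0;
    }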

During operation, a pool of recently-begun touches is maintained. In the current implementation this pool includes any touches that have begun within the past 60 milliseconds of "glue" time. When a new touch begins, it is combined in all possible combinations with touches in this pool in order to create potential gestures matching any active subscriptions.
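
A rough sketch of that pooling step, using the 60 millisecond figure from the text (the structures are hypothetical): a newly begun touch is paired with every pooled touch that began within the glue window.

    /* Illustrative only (hypothetical structures, not uTouch-Grail's): pair a
     * newly-begun touch with every touch that started within the 60 ms "glue"
     * window, producing the candidate two-touch gestures. */
    #include <stdio.h>

    struct touch_begin { int id; double t_ms; };

    int main(void)
    {
        const double glue_ms = 60.0;

        /* Touches already in the pool, with their begin timestamps. */
        struct touch_begin pool[] = { { 1, 0.0 }, { 2, 25.0 }, { 3, 90.0 } };
        struct touch_begin fresh  = { 4, 100.0 };

        for (unsigned i = 0; i < sizeof(pool) / sizeof(pool[0]); i++) {
            if (fresh.t_ms - pool[i].t_ms <= glue_ms)
                printf("candidate gesture: touches %d + %d\n",
                       pool[i].id, fresh.id);
            /* touches 1 and 2 began too long ago, so they are not paired */
        }
        return 0;
    }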

A new gesture instance is created for each combination of touches. Each instance has an event queue, and new instances have one begin event describing the original state of the touches. The events are queued until any gesture primitive is recognized. When frame events are processed, any changes to touches in a gesture instance generate a new Grail event. The new touch state is analyzed, and the subscription thresholds and timeouts are checked to determine whether any of the subscription's gesture primitives have been recognized. For example, the default rotate threshold is 1/50th of a revolution, and the default rotate timeout is half a second. If the threshold is met before the timeout expires, the rotate gesture primitive is recognized.
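
Using the rotate numbers quoted above, the threshold-and-timeout check can be illustrated with a short, self-contained sketch (again, not uTouch-Grail's actual implementation):

    /* Sketch only: check whether the accumulated rotation crosses the default
     * threshold quoted above (1/50th of a revolution) before the half-second
     * timeout. Not uTouch-Grail code. */
    #include <math.h>
    #include <stdbool.h>
    #include <stdio.h>

    static bool rotate_recognized(double rotation_rad, double elapsed_ms)
    {
        const double threshold_rad = 2.0 * M_PI / 50.0;  /* 1/50 of a revolution */
        const double timeout_ms    = 500.0;

        return elapsed_ms <= timeout_ms && fabs(rotation_rad) >= threshold_rad;
    }

    int main(void)
    {
        /* 10 degrees of twist in 300 ms comfortably exceeds 7.2 degrees. */
        printf("%d\n", rotate_recognized(10.0 * M_PI / 180.0, 300.0));  /* 1 */
        /* The same twist after 800 ms arrives too late. */
        printf("%d\n", rotate_recognized(10.0 * M_PI / 180.0, 800.0));  /* 0 */
        return 0;
    }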

When a gesture primitive has been recognized, the grail event queue is flushed to the client. The client must process the gesture events and make a decision on whether to accept or reject each gesture.

uTouch-Geis is the C API layer for the uTouch implementation. uTouch originally began as a private X.org server extension. It has since been updated, bringing it out of the X.org server and into the client side of the X11 system. This required a complete rewrite of uTouch-Frame and uTouch-Grail. However, we have managed to maintain API and ABI compatibility through uTouch-Geis, albeit with a few behavioral changes. uTouch-Geis has two API versions: version 1, a simpler interface, and version 2, an advanced interface. Although both are currently supported, the first version is deprecated in favor of the more flexible second version.

uTouch-Geis also makes gesture event control simpler by wrapping much of the X.org interaction behind an event loop abstraction. The uTouch stack requires careful management of touch grabs and timers. Any client may use uTouch-Frame and uTouch-Grail directly, but uTouch-Geis vastly simplifies incorporating gestures into an application. See the uTouch-Geis API documentation for more information.
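
For orientation, here is a minimal subscription sketch against the GEIS v2 C API. The call and constant names below are recalled from the GEIS v2 specification and may not match the installed headers exactly, so treat this as the shape of the flow (instance, subscription, filter, event loop) and consult the uTouch-Geis documentation for the authoritative signatures.

    /* Approximate GEIS v2 usage; names recalled from the GEIS v2 spec and
     * possibly inexact. Check the uTouch-Geis API documentation before
     * relying on any of them. */
    #include <geis/geis.h>
    #include <stdio.h>

    int main(void)
    {
        /* Create a GEIS instance and a continuous subscription. */
        Geis geis = geis_new(GEIS_INIT_TRACK_DEVICES, NULL);
        GeisSubscription sub =
            geis_subscription_new(geis, "example", GEIS_SUBSCRIPTION_CONT);

        /* Filter for drag gestures performed with exactly two touches. */
        GeisFilter filter = geis_filter_new(geis, "two-touch-drag");
        geis_filter_add_term(filter, GEIS_FILTER_CLASS,
                             GEIS_CLASS_ATTRIBUTE_NAME, GEIS_FILTER_OP_EQ,
                             GEIS_GESTURE_DRAG,
                             GEIS_GESTURE_ATTRIBUTE_TOUCHES, GEIS_FILTER_OP_EQ, 2,
                             NULL);
        geis_subscription_add_filter(sub, filter);
        geis_subscription_activate(sub);

        /* Pump gesture events; a real client would integrate the GEIS event
         * descriptor into its own main loop instead of spinning. */
        for (;;) {
            GeisEvent event;
            GeisStatus status;
            geis_dispatch_events(geis);
            do {
                status = geis_next_event(geis, &event);
                if (status == GEIS_STATUS_CONTINUE ||
                    status == GEIS_STATUS_SUCCESS) {
                    printf("got a gesture event\n");
                    geis_event_delete(event);
                }
            } while (status == GEIS_STATUS_CONTINUE);
        }
        return 0;
    }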

Toolkit and application development

uTouch-Geis is nice, but its C API is still a bit cumbersome in certain scenarios. The uTouch team has created a QML plugin called uTouch-QML in order to make gesture integration in QML applications easier. This plugin provides native QML elements for subscribing to and handling gestures. It currently uses a legacy gesture handling system in the uTouch stack that does not provide for gesture accept/reject semantics or simultaneous gestures, but we plan to update it to include those features over the next six months.

We also have begun work on a gesture recognition system for the Chromium web browser. There are many potential gesture interactions that we hope to leverage in the browser. An initial implementation was proposed, but a rearchitecture of the gesture plumbing in Chromium required us to refactor it. We hope to merge an implementation into Chromium in the next few months.

Conclusion

Over the past two years the uTouch team has been working hard to bring multitouch gestures to the Linux desktop. We now have a complete stack that rivals, and in many ways surpasses, what is possible on other platforms. We look forward to further integration of uTouch gestures in desktop environments and applications, and we encourage everyone to take a look at what our stack has to offer.


Hardware

Posted May 23, 2012 21:38 UTC (Wed) by krakensden (subscriber, #72039) [Link] (7 responses)

So... if someone wanted to get some hardware to be able to play around with multitouch on X11, what would you recommend? Apple's standalone trackpad? Those laptop/tablet hybrids Lenovo sells?

Hardware

Posted May 24, 2012 3:35 UTC (Thu) by whot (subscriber, #50317) [Link] (3 responses)

The x220t supports 2 fingers only, as do other Wacom touch-enabled devices (Bamboo, Intuos 5) and many built-in serial tablets in other tablet computers. The Apple trackpad supports more touchpoints (can't remember, 10 maybe?). Note that behaviours aren't the same: the x220t and other built-in devices are direct-touch, while touchpads are by their very nature indirect-touch devices. Be aware that plenty of tech specs say "multitouch" when they only support dual-touch.

Hardware

Posted Jun 3, 2012 7:31 UTC (Sun) by halla (subscriber, #14185) [Link] (2 responses)

Wacom touch tablets support 16 tracking points.

Hardware

Posted Jun 3, 2012 13:29 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link]

16 points? Are they going for the "two octopuses" mode? :)

Hardware

Posted Jun 4, 2012 1:37 UTC (Mon) by whot (subscriber, #50317) [Link]

The newest ones do. Anything not last generation is 2 fingers only.

Hardware

Posted May 24, 2012 5:13 UTC (Thu) by tajyrink (subscriber, #2750) [Link]

My Asus Zenbook's Elantech touchpad for example supports multi-touch properly - two finger scrolling, three finger window moving and maximize/restore via pinch, and four fingers brings up Unity's Dash.

Note though that there is another variant used in some Zenbooks from Sentelic, and multi-touch support for that is only in some git repositories, whereas Elantech support lives, for example, in the Ubuntu 12.04 LTS kernel.

Hardware

Posted May 24, 2012 12:07 UTC (Thu) by bats999 (guest, #70285) [Link]

How 'bout this?

http://www.ideum.com/products/multitouch/platform/features/

Curiously, "Linux version available in March 2012". I have no idea what that means exactly.

Hardware

Posted May 24, 2012 14:35 UTC (Thu) by cnd (guest, #50542) [Link]

The Apple Magic Trackpads are great because you can add one to any existing computer setup. It is my recommendation for a multitouch trackpad.

Multitouch touchscreens are a different story. Unfortunately, I wouldn't "recommend" any touchscreens you can find on the market today on traditional laptops or monitors. I have heard of monitors with a newer eGalax touchscreen that supports five touches, but I haven't played with one to be sure. The best touchscreens are actually found on Android tablets. They tend to have the Atmel maXTouch chip, which recognizes and tracks many simultaneous touches. However, you have to figure out how to get a Linux desktop running on them first, and they aren't really designed for that.

I personally develop with a laptop with an N-Trig touchscreen. It does a passable job for development purposes, but it often drops touches or emits touches that don't exist.

A uTouch architecture introduction

Posted May 24, 2012 13:08 UTC (Thu) by tshow (subscriber, #6411) [Link] (3 responses)

I may be misreading this, but it seems to me that the major problem of gesture recognition is being punted to the client.

Having spent a fair while doing touch-based work (DS, iOS, effectively similar problems like wii remote gestures and game controller combo recognizers), the main problem beyond gesture detection is determining which gesture the user intended. The classic example is double tap; if you want to accept double tap as a gesture, you cannot process a single tap until you're sure it *isn't* a double tap.

What this means in practice is that unless the double tap's action is compatible with the single tap's action, you have to delay accepting the single tap. In general, this winds up meaning that your interface is laggy; it has to wait for the longest applicable potential gesture before confirming an action.

Unless I'm totally misreading this, uTouch seems to "fix" that by simply shoveling a giant pile of events at the client and saying "meh, you figure it out".

I know this is a hard nut to crack, but surely we can do better than this?

The main solution I've worked with to date on this is in iOS (everything else I've worked on was handrolled gesture recognition built on raw data streams), and the iOS solution isn't great. They let you register gesture recognizers and chain them in priority order, so you can say things like "if you get a tap and it isn't a double tap, give me a single tap" and it will handle the delay internally. They also have a "cancel" mechanism where it will occasionally say "Oops, that touch I told you about a moment ago? It was part of a gesture, you should totally disregard it...".

That system is... ok to work with, I guess. Passable. Usable. That's about as far as I'll go, though.

uTouch doesn't sound to me like an improvement, which is unfortunate.

A uTouch architecture introduction

Posted May 24, 2012 14:27 UTC (Thu) by cnd (guest, #50542) [Link] (1 responses)

You are correct in that uTouch gives you the events and you have to decide what to do with them. However, it is much easier to deal with a "tap" event than to try to detect a tap yourself. Single-touch tap events are fairly simple to recognize, but having uTouch as an abstraction helps. Multitouch tap events, on the other hand, are much more involved.

The other aspect is that a key design goal is leaving total control to the client. You say that iOS may come back at you some time in the future and tell you that a touch was part of a gesture and you should ignore it. At what point do you know that won't happen? When can you commit to an irreversible action based on a touch point?

In order to have that level of control, you have to tell the client a bunch of information and let them decide. There's not much else you can do, unless you only want to cater to trivially simple gesture handling.

A uTouch architecture introduction

Posted May 24, 2012 21:41 UTC (Thu) by tshow (subscriber, #6411) [Link]

Believe me, I'm not trying to defend the iOS model here. It gets a C- at best. Some days it gets a hard F.

My complaint here is that it really *isn't* that hard to detect a tap. Touch point appears, touch point remains within some epsilon of the point at which it first appeared, disappears before some specific time has elapsed. The basic gestures (tap, double tap, pinch/rotate, swipe, move) are all really easy to detect; the scariest thing you need to call is atan2().

I've put a basic touch gesture recognition package together in less than an hour, and I think anyone who isn't terrified of basic vector algebra could do the same; it's not a hard problem. The hard problem is the one this package isn't (?) solving, which is trying to winnow things down to the gestures that the client cares about.

OSX has a weak solution to the problem, which is that gesture recognition "objects" get attached to the touch input stream, and will consume touch events and produce gesture events. It has some basic priority handling, where you can say things like "don't tell me about a single tap unless the double tap fails", but that falls afoul of the input lag problem. uTouch seems to suffer from the same problem, holding off on finalizing gestures until all possible matches are in.

Of course, it's quite possible that the input lag problem is intractable in the general case. The problem always comes down to the fact that gestures seem more expressive than they actually are, and the machine can't sense intent. One fundamentally cannot, for instance, tell two simultaneous pinches from two simultaneous overlapping stretches if the two are close to parallel.

If anything, what would be really useful (at least to me) is a substantially more robust gesture system; something with some fuzzy logic in it or somesuch. There was a commercial tool for the wiimote for a while (it may still be out there somewhere) which you could use to do data capture. You would perform a gesture with the wiimote repeatedly, and it would use that data to generate the outline of a gesture; within this T value, the parameters must be within this box/sphere/whatever. You could adjust the slop on it a bit, or play with the bounding curves, the spacial and temporal tolerance, and the result was an arbitrary gesture recognition engine generator.

There's no real difference between accelerometer gesture recognition, mouse position gesture recognition, multitouch gesture recognition, or gamepad combo catchers; it's only a question of the number of parameters.

If I could feed in data saying "here's the signature of an input gesture which I care about, and here's a set of rectangular regions in which I care if it happens", and do so for different gestures in (potentially) overlapping regions, I'd be a happy man.

A uTouch architecture introduction

Posted May 25, 2012 0:49 UTC (Fri) by mathstuf (subscriber, #69389) [Link]

If you look at the event stream for, say, a triple click in GTK, you get all of the events (single, double, and triple). We have this issue in uzbl because if we wire up triple-click events, the double-click event handler has already been fired. Of course, if there is documentation anywhere on how to handle this situation, directions would be greatly appreciated (I've found GTK's docs could use some examples with some complexity in them; as it is, I'm usually forced to trudge through GTK apps which have behavior that I want for examples).

