Believe me, I'm not trying to defend the iOS model here. It gets a C- at best. Some days it gets a hard F.
My complaint here is that it really *isn't* that hard to detect a tap. Touch point appears, touch point remains within some epsilon of the point at which it first appeared, disappears before some specific time has elapsed. The basic gestures (tap, double tap, pinch/rotate, swipe, move) are all really easy to detect; the scariest thing you need to call is atan2().
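To make the point concrete, here's roughly what that tap test looks like. This is a sketch, not anyone's actual API: the event format (`(kind, x, y, t)` tuples) and the threshold values are invented for illustration.

```python
import math

EPSILON = 10.0      # max drift in points before it stops being a tap (assumed value)
TAP_TIMEOUT = 0.3   # max duration in seconds (assumed value)

def is_tap(events):
    """Return True if a touch event stream is a single tap.

    events: list of (kind, x, y, t) tuples, kind in {"down", "move", "up"}.
    """
    kind, x0, y0, t0 = events[0]
    if kind != "down":
        return False
    for kind, x, y, t in events[1:]:
        # Touch point must stay within epsilon of where it first appeared.
        if math.hypot(x - x0, y - y0) > EPSILON:
            return False
        # ...and lift before the timeout elapses.
        if kind == "up":
            return (t - t0) <= TAP_TIMEOUT
    return False  # touch never lifted
```

That's the whole test: appears, stays put, lifts in time.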
I've put a basic touch gesture recognition package together in less than an hour, and I think anyone who isn't terrified of basic vector algebra could do the same; it's not a hard problem. The hard problem is the one this package isn't (?) solving, which is trying to winnow things down to the gestures that the client cares about.
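And the scariest call really is atan2(). Given two touch points before and after a frame of movement, pinch scale is the ratio of the distances between them and rotation is the change in the angle of the connecting segment. Function and parameter names here are mine, not from any toolkit:

```python
import math

def pinch_rotate(a0, b0, a1, b1):
    """Return (scale, rotation_radians) for two touches moving from
    positions (a0, b0) to (a1, b1). Each point is an (x, y) tuple."""
    # Distance between the two touches, before and after.
    d0 = math.hypot(b0[0] - a0[0], b0[1] - a0[1])
    d1 = math.hypot(b1[0] - a1[0], b1[1] - a1[1])
    # Angle of the segment between the touches, before and after.
    ang0 = math.atan2(b0[1] - a0[1], b0[0] - a0[0])
    ang1 = math.atan2(b1[1] - a1[1], b1[0] - a1[0])
    return d1 / d0, ang1 - ang0
```

That's basic vector algebra, start to finish.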
OSX has a weak solution to the problem: gesture recognition "objects" get attached to the touch input stream, consuming touch events and producing gesture events. It has some basic priority handling, where you can say things like "don't tell me about a single tap unless the double tap fails", but that falls afoul of the input lag problem. uTouch seems to suffer from the same problem, holding off on finalizing gestures until all possible matches are in.
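To see where the lag comes from, here's a toy version of that priority rule (my own sketch, with an invented timing constant): a single tap can't be reported until the double-tap window has lapsed with no second tap. The wait *is* the input lag.

```python
DOUBLE_TAP_WINDOW = 0.25  # seconds a second tap may still arrive (assumed value)

def resolve_taps(tap_times, now):
    """Given timestamps of completed taps, emit finalized gestures.

    A single tap is only reported once the double-tap window has
    expired without a second tap; until then we must hold off."""
    gestures = []
    i = 0
    while i < len(tap_times):
        if i + 1 < len(tap_times) and tap_times[i + 1] - tap_times[i] <= DOUBLE_TAP_WINDOW:
            gestures.append(("double_tap", tap_times[i + 1]))
            i += 2
        elif now - tap_times[i] > DOUBLE_TAP_WINDOW:
            gestures.append(("single_tap", tap_times[i]))
            i += 1
        else:
            break  # still waiting to see if a second tap arrives
    return gestures
```

A tap at t=0.0 queried at t=0.1 produces nothing yet; the same tap queried at t=0.5 finally comes back as a single tap, a quarter second late.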
Of course, it's quite possible that the input lag problem is intractable in the general case. The problem always comes down to the fact that gestures seem more expressive than they actually are, and the machine can't sense intent. One fundamentally cannot, for instance, tell two simultaneous pinches from two simultaneous overlapping stretches if the two are close to parallel.
If anything, what would be really useful (at least to me) is a substantially more robust gesture system; something with some fuzzy logic in it or some such. There was a commercial tool for the wiimote for a while (it may still be out there somewhere) which you could use to do data capture. You would perform a gesture with the wiimote repeatedly, and it would use that data to generate the outline of a gesture: within this time window, the parameters must fall within this box/sphere/whatever. You could adjust the slop on it a bit, or play with the bounding curves and the spatial and temporal tolerances, and the result was an arbitrary gesture recognition engine generator.
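The core of that data-capture idea fits in a few lines. This is a toy version under heavy assumptions (recordings are already resampled to equal length and normalized, slop is a flat constant rather than adjustable bounding curves): take several recordings of the same gesture, derive per-timestep bounding boxes, then match new input against those boxes.

```python
def build_template(recordings, slop=0.1):
    """recordings: list of equal-length lists of (x, y) samples from
    repeated performances of one gesture. Returns per-timestep
    bounding boxes [(xmin, xmax, ymin, ymax), ...] padded by slop."""
    template = []
    for samples in zip(*recordings):
        xs = [p[0] for p in samples]
        ys = [p[1] for p in samples]
        template.append((min(xs) - slop, max(xs) + slop,
                         min(ys) - slop, max(ys) + slop))
    return template

def matches(template, trace):
    """True if every sample of the trace falls inside its timestep's box."""
    return len(trace) == len(template) and all(
        xmin <= x <= xmax and ymin <= y <= ymax
        for (x, y), (xmin, xmax, ymin, ymax) in zip(trace, template))
```

Swap the (x, y) pairs for accelerometer triples or stick positions and nothing else changes, which is the point of the next paragraph.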
There's no real difference between accelerometer gesture recognition, mouse position gesture recognition, multitouch gesture recognition, or gamepad combo catchers; it's only a question of the number of parameters.
If I could feed in data saying "here's the signature of an input gesture which I care about, and here's a set of rectangular regions in which I care if it happens", and do so for different gestures in (potentially) overlapping regions, I'd be a happy man.
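The shape of that API is simple enough to sketch. Everything here is hypothetical (the class, the method names, the rectangle format), just to pin down what "gestures in overlapping regions" would mean: each registered handler fires independently whenever its gesture lands inside its region.

```python
class GestureRouter:
    """Routes recognized gestures to region-scoped handlers (a sketch)."""

    def __init__(self):
        self.handlers = []  # (gesture_name, rect, callback) triples

    def register(self, gesture_name, rect, callback):
        """rect is (x, y, w, h); regions may overlap freely."""
        self.handlers.append((gesture_name, rect, callback))

    def dispatch(self, gesture_name, x, y):
        """Fire every handler whose region contains the gesture origin."""
        for name, (rx, ry, rw, rh), cb in self.handlers:
            if name == gesture_name and rx <= x < rx + rw and ry <= y < ry + rh:
                cb(x, y)
```

A tap at a point covered by two registered regions simply fires both callbacks; disambiguation between them stays the client's problem, which is where it belongs.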