Deferrable timers for user space
Deferrable timers are available within the kernel, but they are not provided to user space. So the timer-related system calls (including timerfd_settime(), clock_nanosleep(), and nanosleep()) will all make an effort to notify user space quickly on timer expiration, even if user space could tolerate some delay. That is irritating to developers who are working to improve power efficiency. The good news for those developers is that, after a couple of false starts, it now appears that deferrable timers may finally be made available to user space.
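For reference, here is what arming a timer looks like today with the timerfd interface; this minimal example uses only long-established calls, and every expiration will wake the CPU whether or not the application is in a hurry:

```c
#include <sys/timerfd.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Create a timer file descriptor counting against the monotonic clock. */
    int fd = timerfd_create(CLOCK_MONOTONIC, 0);
    if (fd < 0) {
        perror("timerfd_create");
        return 1;
    }

    /* Fire after one second, then every second thereafter; the kernel
     * will wake the CPU promptly on each expiration, even if a short
     * delay would have been acceptable. */
    struct itimerspec its = {
        .it_value    = { .tv_sec = 1 },
        .it_interval = { .tv_sec = 1 },
    };
    if (timerfd_settime(fd, 0, &its, NULL) < 0) {
        perror("timerfd_settime");
        return 1;
    }

    /* read() blocks until expiry, then yields the expiration count. */
    uint64_t expirations;
    if (read(fd, &expirations, sizeof(expirations)) == sizeof(expirations))
        printf("timer expired %llu time(s)\n",
               (unsigned long long)expirations);
    close(fd);
    return 0;
}
```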
Some readers will certainly be thinking about the timer slack mechanism, which is available in user space. However, it affects all timers; for some applications, some timers may be more deferrable than others. Timers with slack can also only be deferred for so long, meaning that they might still wake a sleeping CPU when they expire. So deferrable timers may well address some use cases that are not well handled by timer slack.
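The timer slack mechanism, by contrast, is a blanket per-thread setting controlled with prctl(). This short example (using the real PR_SET_TIMERSLACK interface, with an arbitrary 50ms value chosen for illustration) shows why it cannot single out individual timers:

```c
#include <sys/prctl.h>
#include <stdio.h>

int main(void)
{
    /* Allow every timer belonging to this thread to be deferred by up
     * to 50ms. This is a blanket setting: there is no way to mark only
     * some of the thread's timers as more deferrable than others. */
    if (prctl(PR_SET_TIMERSLACK, 50UL * 1000 * 1000, 0, 0, 0) < 0)
        perror("prctl(PR_SET_TIMERSLACK)");

    /* Read back the current slack, in nanoseconds. */
    int slack = prctl(PR_GET_TIMERSLACK, 0, 0, 0, 0);
    printf("current timer slack: %d ns\n", slack);
    return 0;
}
```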
Back in 2012, Anton Vorontsov sent out a patch set adding deferrable timer support to the timerfd_settime() system call. In putting together this patch, Anton ran into a problem: like all of the timing-related system calls, the timerfd mechanism uses the kernel's high-resolution timers. But high-resolution timers do not support deferrable operation; that functionality is limited to the older "timer wheel" mechanism (described in this article). The timer wheel is old code with a number of problems; it has been slated for removal for years. But nobody has done that work, so the timer wheel remains in place, and it remains the only facility with the deferrable option.
Anton's response was to split timers in the timerfd subsystem across both mechanisms. Regular timer requests would use the high-resolution timer subsystem as usual, but any request with the TFD_TIMER_DEFERRABLE flag set would, instead, be handled by the timer wheel. Among other things, that limited the resolution of deferrable timers to one jiffy (0.001 to 0.01 seconds, depending on the kernel configuration), but, if the timer is deferrable, low resolution is not necessarily a problem. Still, this patch set did not go very far, and Anton appears to have dropped it fairly quickly.
More recently, Alexey Perevalov has picked up this concept and tried to push it forward. He first posted a patch set in January; this posting generated rather more discussion than its predecessor did. John Stultz was concerned that only timerfd timers gained the new functionality; it would be better, he thought, to push it down to a lower level so that all timer-related system calls would benefit. Doing so, he thought, would likely entail adding the deferrable capability to the high-resolution timer subsystem.
Thomas Gleixner was rather more emphatic, stating that use of the timer wheel "is not going to happen". He suggested that this functionality should instead be added to high-resolution timers in the form of a new set of clock IDs. The clock ID is a parameter provided by user space describing which timekeeping regime should be used. CLOCK_REALTIME, for example, corresponds to the system clock; it can go backward should the administrator change the system time. The CLOCK_MONOTONIC clock, instead, is guaranteed to always move forward. There are a number of other clocks, including CLOCK_TAI, added in 3.10, which corresponds to international atomic time. Thomas tossed out a proof-of-concept patch adding new versions of all of these clocks (CLOCK_MONOTONIC_DEFERRABLE, for example) that would provide deferrable operation.
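These clock IDs are the same values that user space already passes to clock_gettime() and friends; a deferrable variant would simply be one more entry in the list. A quick illustration of the existing clocks:

```c
#include <time.h>
#include <stdio.h>

int main(void)
{
    struct timespec rt, mono;

    /* Wall-clock time; it can jump backward if the administrator
     * resets the system clock. */
    clock_gettime(CLOCK_REALTIME, &rt);

    /* Time since a fixed starting point; guaranteed to move only
     * forward. */
    clock_gettime(CLOCK_MONOTONIC, &mono);

    printf("realtime:  %lld.%09ld\n", (long long)rt.tv_sec, rt.tv_nsec);
    printf("monotonic: %lld.%09ld\n", (long long)mono.tv_sec, mono.tv_nsec);

#ifdef CLOCK_TAI
    /* International atomic time, available since kernel 3.10. */
    struct timespec tai;
    clock_gettime(CLOCK_TAI, &tai);
    printf("tai:       %lld.%09ld\n", (long long)tai.tv_sec, tai.tv_nsec);
#endif
    return 0;
}
```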
John, however, argued that clock IDs were not the right interface to expose to user space. The use of separate clock IDs within the kernel as an implementation detail is fine, he said, but the right way for user space to request the feature is with modifier flags. Fortunately, almost all of the relevant system calls already have flag arguments; the big exception is nanosleep(), which is a call some developers would like to see simply vanish anyway. John's argument, when expressed in this way, prevailed with no real dissent.
Alexey posted a couple more versions of his patch set, but, to put it gently, they did not find approval with Thomas, who eventually posted a deferrable timers patch set of his own, showing how he thinks the problem should be solved. In this patch set, clock_nanosleep() gets a new TIMER_IS_DEFERRABLE flag, while timerfd_settime() gets TFD_TIMER_IS_DEFERRABLE. In either case, setting that flag causes the kernel to use one of the new deferrable internal clock IDs. Timers on those IDs never actually program the hardware, so they never generate interrupts and cannot wake the system. Instead, the expiration functions will be executed when the system is awakened for some other task. In the absence of the new flag, these system calls behave as they always have.
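What using the new flag might look like from user space is sketched below. Note that this is speculative: the patch set is unmerged, and the flag value defined here is an illustrative placeholder rather than anything from a released kernel header.

```c
#include <sys/timerfd.h>
#include <stdio.h>

/* Hypothetical: TFD_TIMER_IS_DEFERRABLE comes from Thomas's unmerged
 * patch set; the value below is a placeholder for illustration, not a
 * constant from any released kernel header. */
#ifndef TFD_TIMER_IS_DEFERRABLE
#define TFD_TIMER_IS_DEFERRABLE (1 << 2)
#endif

int main(void)
{
    int fd = timerfd_create(CLOCK_MONOTONIC, 0);
    if (fd < 0) {
        perror("timerfd_create");
        return 1;
    }

    /* "Wake me in about a minute": with the deferrable flag set, the
     * timer would never program the hardware, so expiration would be
     * delivered only when the system wakes up for some other reason. */
    struct itimerspec its = {
        .it_value = { .tv_sec = 60 },
    };
    if (timerfd_settime(fd, TFD_TIMER_IS_DEFERRABLE, &its, NULL) < 0)
        perror("timerfd_settime");

    return 0;
}
```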
Thomas's patch set has not gotten much in the way of comments beyond low-level implementation details; if that situation persists for long, silence is likely to indicate consent. So, unless some surprise comes along, the kernel will probably offer deferrable timers to user space around 3.15 or so.
Posted Feb 27, 2014 20:11 UTC (Thu) by luto (guest, #39314) [Link] (1 response)
I'd much prefer to have per-timer slack controls than to add a new kind of timer.
Posted Feb 27, 2014 3:52 UTC (Thu) by wahern (subscriber, #37304) [Link] (4 responses)
http://25thandclement.com/~william/projects/timeout.c.html 
In my tests benchmarking my implementation against a min-heap queue, as well as a red-black tree, there's simply no comparison, _even_ when you need to cascade timers. I've done sequential, random, all expirations, all cancel, etc., etc. Just... no comparison whatsoever. (I'll be posting my results in the coming weeks, when work allows.)
The notion of avoiding cascading makes no sense to me, because if a timer is canceled or expired you're still rebalancing the tree (a previous LWN article said high-resolution timers are using a red-black tree). A rebalance of a large tree is more work than the cascading operations needed (usually a fixed, small number: 2-6 linked-list operations per timeout for expiration). The only thing I can think of is an issue with jitter. But timing wheels are just so insanely fast... I just can't see it, and jitter is more a concern with hashed timing wheels, not hierarchical wheels. (Hashed wheels also have horrible worst-case performance.) Plus, cancellations usually happen earlier rather than later. For example, a 10-second timeout on a socket is likely to be canceled long before it needs to be moved to the next wheel. And if jitter really, really, really mattered, you could just install the timeouts in a random slot in a lower wheel to smooth out the workload. This is _especially_ true of high-resolution timers, where no two timeouts could feasibly have the same low-order bits.
Did anybody in the kernel even bother to benchmark stuff?

Posted Feb 27, 2014 14:25 UTC (Thu) by tglx (subscriber, #31301) [Link] (3 responses)
See https://lwn.net/Articles/152436/ and https://lwn.net/Articles/152363/ for details. 
> The notion of avoiding cascading makes no sense to me... 
Then go and watch what recascading thousands of timers in a burst does to your system. That's real world, not some random benchmark. 
> Did anybody in the kernel even bother to benchmark stuff? 
No, we all use crystal balls and rely on our gut feelings.

Posted Feb 27, 2014 19:42 UTC (Thu) by wahern (subscriber, #37304) [Link] (2 responses)
And I have measured cascading. On tens of millions of timers. The issue I saw was that the RB trees just performed so poorly there was simply no comparison. Rebalancing of trees is a cache killer, disregarding the Big-O differences. I usually use RB trees everywhere, preferring them to hashes. But when writing a high-throughput network server, timeouts are just manipulated too darned often. 
I never had any intention of trying to get my timer implementation in the Linux kernel. Why would you think that? I can't get over the Linux reverse naming scheme of go_pick_up_laundry_bob() instead of bob_laundry_pick_up(). That's just not how we roll in user-space. 
But I would like to get an address for the crystal ball emporium. ;) If it told you RB trees had lower latency than hierarchical timing wheels for thousands of timers, then I really have no ground to stand on.

Posted Feb 27, 2014 22:41 UTC (Thu) by wahern (subscriber, #37304) [Link]
Min-heap trees are more common, as most people aren't that familiar with hierarchical timing wheels, and it's easier to code or copy a min-heap. Also, a min-heap has O(1) insertion for random timeout values, which can superficially make it look great in benchmarking. 
A hierarchical timing wheel has similarities to a trie, but it's optimized for timeouts. Every operation is O(1) with a small constant factor, including (in my implementation) minimum lookup. And the algorithm works such that timeouts efficiently bubble up to be readily collected for servicing. Insertion and cancellation are a simple linked-list operation, and in many scenarios timeouts are usually cancelled long before they expire. (Think of a 10-second timeout on an I/O operation which usually completes in a fraction of a second, necessitating a cancellation and a re-scheduling for the next operation. For an RB tree that's just killer in real terms. For high-resolution timers with tiny timeouts, on the order of microseconds, bunching becomes a concern because you're always cascading down a wheel, although in Big-O terms it's nothing. OTOH, if you get that kind of bunching it means you're already rapidly installing and canceling timers on the order of hundreds or thousands, milliseconds or less apart, and for something like an RB tree, as it grows to thousands or tens of thousands of entries, it's not gonna scale in wall-clock terms. At least, that's how it looks to me. But apparently I'm missing something.)
The only potential issue is bunching. In terms of Big-O it doesn't matter, because you're doing the same amount of work whether they're bunched or not (they're not sorted in each slot), but potentially there can be an issue with run-time latency. 
Actually, there are two issues with latency. In traditional wheels there's the latency with the periodic timer, limiting granularity, in addition to latency with cascading bunched timeouts. In my implementation there is no periodic timer necessary because I use word-sized bitmaps to memoize slot residency (and on Haswell something like find-first-set is a mere 3 cycles). 
I operate in user space in a non-real-time setting, so the second issue of latency is of little concern. But, frankly, the implementation is just so much faster than a min-heap (2-3x faster even for random insertion where both are O(1)), let alone an RB tree (ridiculously faster), that I was simply surprised that someone found an RB tree to work better for large collections of timeouts. And if there was an issue, I was curious if some simple smoothing functions or other hacks hadn't been tried. Because I'm still in the middle of comprehensive testing, it's of significant interest to me. 
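For readers unfamiliar with the technique being described, here is a toy, single-level sketch (not wahern's code, and much simplified from a real hierarchical wheel) of how a per-wheel bitmap plus find-first-set makes "next occupied slot" a constant-time operation with no periodic scan:

```c
#include <stdint.h>
#include <stdio.h>

#define WHEEL_BITS 6
#define WHEEL_LEN  (1 << WHEEL_BITS)   /* 64 slots */
#define WHEEL_MASK (WHEEL_LEN - 1)

struct toy_wheel {
    uint64_t pending;            /* bit i set => slot i is non-empty */
    int      counts[WHEEL_LEN];  /* stand-in for per-slot timeout lists */
    uint64_t now;                /* current tick */
};

/* Insertion is O(1): mask the expiry tick down to a slot and set the
 * matching bit. A real hierarchical wheel first picks a level based on
 * how far away the expiry is, then does exactly this. */
static void toy_add(struct toy_wheel *w, uint64_t expires)
{
    unsigned slot = (unsigned)(expires & WHEEL_MASK);
    w->counts[slot]++;
    w->pending |= 1ULL << slot;
}

/* Finding the next occupied slot at or after "now" is a rotate plus a
 * find-first-set: a handful of cycles regardless of how many timeouts
 * are queued. */
static int toy_next_slot(const struct toy_wheel *w)
{
    if (!w->pending)
        return -1;

    unsigned shift = (unsigned)(w->now & WHEEL_MASK);
    /* Rotate so that slot "now" becomes bit 0; masking the left-shift
     * count avoids undefined behavior when shift == 0. */
    uint64_t rotated = (w->pending >> shift) |
                       (w->pending << ((WHEEL_LEN - shift) & WHEEL_MASK));

    /* __builtin_ffsll() returns the 1-based index of the lowest set bit. */
    return (int)((shift + __builtin_ffsll(rotated) - 1) & WHEEL_MASK);
}

int main(void)
{
    struct toy_wheel w = { .now = 10 };
    toy_add(&w, 13);
    toy_add(&w, 42);
    printf("next occupied slot: %d\n", toy_next_slot(&w)); /* prints 13 */
    return 0;
}
```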
 
     
Posted Mar 8, 2014 4:59 UTC (Sat) by gdt (subscriber, #6284) [Link] (1 response)

Perhaps that interface isn't widely useful for applications programming. The moment you have a few timers which can all potentially expire at once, the question of priorities (and priority inversion) raises its nasty head. That's much better dealt with in the kernel (which can arbitrate between all timers) than with a shim over the timer calls in an application.

Posted Mar 9, 2014 6:12 UTC (Sun) by dlang (guest, #313) [Link]
If an app is going to freak out because some timers are going to expire in a different order than expected, it already has a chance of having problems. This does extend the window of the problem, but since timers never expire exactly when intended, and the system can sleep for a long time (after which the timers will have expired all at once anyway), this doesn't look like a new problem to me.