|| ||[ANNOUNCE] ktimers subsystem|
|| ||Mon, 19 Sep 2005 18:48:41 +0200 (CEST)|
|| ||email@example.com, firstname.lastname@example.org, email@example.com,
ktimers seperate the "timer API" from the "timeout API". ktimers are
The following text explains the rationale behind ktimers. It contains
- a general analysis of the current Linux time(r) core system and
patches / projects related to it.
- a detailed description of necessary changes to the Linux time(r)
- detailed explanation of the ktimer subsystem
- a short explanation of possible follow up patches to demonstrate the
further possibilities of the ktimers subsystem implementation
Why do we need ktimers ?
Authors: Thomas Gleixner, Ingo Molnar
A lot of discussion took place about Linux timekeeping and timers
recently. The efforts to integrate the High Resolution Timer patches
into the -rt tree gave a deep insight into the big picture and initiated
the ktimers implementation. This document is an analysis of all related
issues, with the goal of inclusion of ktimers into mainline.
Linux time(rs) status, very short summary
The current upstream timer implementation of Linux is based on a
periodic system tick (jiffy). This periodic tick initially had a period
of 10ms. During the 2.5 development series this was changed for some
architectures to 1ms and recently corrected to 4ms.
The upstream "time of day" (tod) timekeeping code builds on top of the
periodic ticks and takes architecture dependent sub-tick resolution
mechanisms into account to provide finer resolution. It also implements
synchronization with external time sources such as NTP.
Patches and projects related to timers and timekeeping, in history order
- UTIME Usec Resolution Timers
- HRT High Resolution Timers
- VST Variable Scheduling Timeouts
- DTCK Dynamic Ticks
- NEWTOD New timeofday core including reworked NTP
- some related architecture specific code already integrated into the
kernel (mostly s390 related time interpolator code)
- a couple of others - rather odd attempts to change timer
resolution. Mostly single purpose patches.
All of those patches have one thing in common. They are restricted to a
few architectures and address only single problems of timers and
UTIME: Microsecond resolution timers
The patch history goes back to Linux 2.0 and is maintained by Dr.
Douglas Niehaus at Kansas University as part of the KURT (Kansas
University Realtime) project.
Supported platforms: x86, (PPC, ARM partial)
Originally the usec resolution support was available for all users of
the timer core system, but during the course of development it turned
out to be a waste of resources and was restricted to realtime processes.
The implementation is restricted to nanosleep and itimers. Posix timer
support is planned. Initially it was implemented on top of the timer
wheel, but recently converted to a seperate list for high resolution
A related and quite interesting point of activity in this project is the
research on fine granular synchronization of machines in a network.
HRT: High Resolution Timers
George Anzinger forked the UTIME parts of KURT some years ago and
started the High Resolution Timer project. The usage is restricted to
posix timers with clock = CLOCK_REALTIME_HR and CLOCK_MONOTONIC_HR.
Supported platforms: x86, PPC, PPC, PPC64, SH, ARM partial
High resolution timers are kept in the timer wheel until they reach the
jiffy where they expire. On expiry they are moved to a seperate list and
arm the high resolution timer source. The timer function is handled in a
seperate softirq. The high resolution portion of the timers is managed
as a seperate field in the timer structure which holds the fractional
part. Initially implemented as subjiffies with a given resolution it
changed to a variable holding the time source cycles to reduce
conversion overhead. This has impact on the accuracy of cyclic
schedules and also has some limitations for high resolution time sources
with variable frequencies, e.g. TSC on x86.
Possible changes are work in progress.
VST: Variable Scheduling Timeouts
A patch closely related to and depending on HRT, also maintained by
George Anzinger. It provides the suppression of timer ticks during idle
Supported platforms: x86, (One ARM platform supported, a related patch
snippet recently sneaked into the ARM core interrupt handling code)
Whenever the system goes into idle state the timer list(s) are scanned
to find the next timer and, if it is reasonably far away, VST turns off
the periodic 1/HZ timer interrupts and sets up a timer interrupt at the
expiry time of said timer. VST also provides a callback list which is
used to notify about idle enter / leave events. On the next interrupt,
be it the VST timer or some other interrupt, the periodic 1/HZ timer is
restarted and the elapsed time (ticks) is properly accounted for.
DTCK: Dynamic Ticks
An implementation similar to VST, but not depending on other patches.
Maintained by Con Kolivas. It also provides the suppression of timer
ticks during idle periods.
Supported platforms: x86
Similar to VST, but completely jiffy bound. Very actively maintained in
recent months, with good progress. The core implementation is leaner
than VST. It contains some x86 specific bits which have to be seperated
out. It uses the already existing NO_IDLE_HZ code (s390) instead of
introducing new duplicated functionality. The configuration interface is
integrated into sysfs. A generic notification interface is not available
NEWTOD: New timeofday core including reworked NTP
John Stultz maintains a set of patches which are related, but have been
split for easier review and discussion.
- Reworked implementation of NTP synchronization
- Seperation of time of day timekeeping from the timer core
Supported platforms: x86
The time of day code is completely seperated from the periodic tick. The
code provides a runtime configurable time source selection, which is
intended to be of generic (architecture spanning) use. One of the
possible time sources is the periodic tick of course.
The code gets rid of one of the fundamental flaws of Linux time keeping:
the wrong order of deduction. The current upstream code derives almost
everything except jiffies from the wall clock time (xtime). This is
controversial to almost every time reference in the world. Usually time
references are built on a raw hardware clock which provides a more or
less accurate monotonic time source. This time source is corrected vs.
frequency skew. On top of resulting "constant frequency" monotonic time
source the human time conversions are implemented, e.g. timezones.
John's patch addresses this nicely and builds the correct order of clock
source -> frequency correction -> wall clock adjustment. This is one of
the essential preliminaries to implement nonintrusive, simple and
effective high resolution time support.
The lively discussion of the patch is not questioning the general idea.
The main point of criticizm is related to the enforced usage of 64-bit
variables and 64-bit arithmetic in hot execution paths. This is seen as
a penalty for 32-bit architectures and for low computing power CPUs
which are often used in embedded devices, but architectural simplicity
is a strong argument in favor of 64-bit arithmetics and we have not seen
a substantial proof of the overhead. (See also the detailed comparison
of timespec versus 64-bit nsec_t further below.)
Timer related observations
Ticks are a convenient mechanism for a lot of time triggered functions
like scheduler-ticks, timeouts etc. which require limited resolution and
For time of day timekeeping, which requires sub tick resolution
preferrable in human time units, ticks introduce a bunch of ugliness
especially when it comes to time synchronizing with high resolution time
sources. Another astonishing implementation detail of the current time
keeping is the fact that we get the monotonic clock (defined by POSIX as
a continous clock source which can not be set) by subtracting a variable
offset from the real time clock, which can be set by the user and
corrected by NTP or other mechanisms.
Another well-known drawback of the current tick based implementation is
the fact that ticks happen even on a completely idle system. This is an
undesired behaviour for battery powered devices. Resolving this
currently needs a lot of quirks to the upstream time(rs) system.
The current POSIX timer implementation is also quite complex due to its
implementation on top of the tick timers. We are forced to e.g. keep
track of armed CLOCK_REALTIME timers to readjust them when the clock has
been set. The POSIX timer API, as defined by Posix Specification 1003.1,
is inconsistent in the treatment of relative and absolute CLOCK_REALTIME
timers. Absolute timers are influenced by clock_set, relative timers are
not. This also applies to nanosleep. The specification of nanosleep
states on the other hand that the sleeping time must not be less than
the given timeout measured by CLOCK_REALTIME. Setting the clock while a
nanosleep is scheduled leads to an interesting situation:
Process 1 Process2
t2 - t1 =~ 0s
Implementing high resolution timers on top of a the current system also
requires a lot of quirks to keep the timer API usable for both high
resolution and tick based timers.
Such kinds of 'interaction artifacts' between the tick based data
structures and algorithms and the high-resolution data structures, even
if looked at without knowing anything about the time(r) subsystem,
already point in the direction of separating 'high resolution time(r)'
and 'low resolution timeout' APIs and subsystems.
As mentioned earlier, the switch to 1ms ticks during 2.5 development
series turned out to cause certain regressions and was recently
corrected to 4ms. One common type of regression was 'timer
soft-interrupt overrun', i.e. when processing related to a timer tick
did not finish within one jiffy, causing a domino effect on the timing
quality of the system.
What are the reasons? During the work on integrating high resolution
timers into the -rt tree we observed a lot of details related to this
problem. When changing the period of the timer tick (changing HZ) the
size of the primary timer wheel in the core code remains unchanged. This
results in the fact that the primary timer wheel [into which wheel the
secondary wheels are 'cascaded' periodically] becomes capable of
handling a smaller time span than before. The CONFIG_BASE_SMALL option
makes it even worse. Here is a table of the capacity limits of the
HZ CONFIG_BASE_SMALL=n CONFIG_BASE_SMALL=y
100 2560 ms 640 ms
250 1024 ms 256 ms
1000 256 ms 64 ms
So one source of regression is the increased necessity to move
non-expired timers from the outer wheels to the primary wheel.
This alone does not explain all the regressions yet though. We did some
instrumentation and statistics on the timer code related to common use
cases, where the regressions showed up - machines with high networking
and/or disk I/O load.
This revealed a reasonable explanation for this behaviour. Both
networking and disk I/O arm a lot of timeout timers (the maximum number
of armed timers during the tests observed was ~400000). The majority of
those timeouts are in the range of 0.5 seconds. As frequently seen with
timeout timers, most of those timers never expire, but under high load
they get easily into a time line where they have to be moved from the
outer wheel to the primary timer wheel, when HZ=1000. Have a look at the
timer cascading code [cascade() in kernel/timer.c] to see the penalty...
Another source of regression is the fact that quite a lot of timer
functions execute long lasting codepaths. E.g. in the networking code
rt_secret_rebuild() does a loop over rt_hash_mask (1024 in my case),
over entries and over some subsequent variable sized loops inside each
step. On a 300MHZ PPC system this accumulated to a worst case total of
>5ms (!). The networking code contains more of those loops in timer
functions and the worst case szenario is when all of those loops happen
in the same jiffy and block the timely delivery of other timers. There
are other culprits, but those in the networking code are the most
obvious ones. This went almost unnoticed on HZ=100 systems, but on
HZ=1000 based machines those effects surfaced. The change to HZ=250 is
just hiding the problem rather than solving it.
Another weird effect of the changes to the time tick period is the fact
that a lot of places in the code are using HZ incorrectly. Even today,
more than a year after the switchover. Many of those usages still assume
HZ=100 or even have a completely wrong understanding of the mechanism
provided by the Linux kernel timer API (which is unrelated to the HZ
changes of course).
These observations together finally led to the complete seperation of
the high resolution timer data structures from the jiffy wheel, in the
HRT-RT integration work. (to further reduce latencies we also separated
softirq threads, but that is another topic.)
What is solved by the available patches ?
As said before each of the patches addresses a particular part of the
time(r) related problems. Some of them are competing implementations.
We don't want to put down the efforts of the particular projects, but one
outstanding patch is John Stultz's work on the new time of day
subsystem. It really addresses one of the substantial linux time(r)
problems in a very generic and architecture-independent way, upon which
the other efforts can build cleanly.
The other patches mostly relate to tickless systems and high resolution
timers, and are providing great proof of concept implementations but
suffer from the bindings to particular architectures and the
restrictions that the current upstream Linux timer core code imposes
What changes are required?
The conclusions of my recent work on Linux time(rs) related problems and
the analysis of related patches are:
1. The HZ/jiffy based usage of time in the kernel code has to be
converted to human time units.
2. A clean seperation of all related APIs and subsystems is necessary
even if they have interdependencies and shared functionality
| HZ/jiffy conversion
The conversion of users of HZ/jiffy based timing to human time units is
necessary to allow changes to the core timer subsystem without breaking
the users all over the kernel. Looking at the code most HZ/jiffie timers
are using more or less correct conversions from human time units anyway.
A positive side effect of such a cleanup is the necessary auditing for
| API seperation
- time sources
- time synchronization
- time of day API
- timers API
- timeout API
- time sources:
The number and the resolution of available time sources varies a lot
over architectures and particular architecture specific platform
implementations. Some of them are only run time detectable. NEWTOD
provides a excellent code base for time source abstraction, but a couple
of details have to be discussed:
- resolution selection
- resolution and architecture dependend interface
- support for tick bound and tick less systems including a clean
integration into the interrupt handling code.
- usability of timesources for differrent purposes (timekeeping,
high/low res event scheduling)
- 32- vs. 64-bit arithmetic
- time synchronisation:
Time syncronization corrects the frequency skew of the time source. It
must provide a plugable interface for time synchronisation mechanisms to
allow the flexible implementation of time synchronization sources:
- time of day API:
The time of day API makes use of the eventually frequency corrected time
sources to implement the "human readable" interface. It is also
responsible for the translation of the monotonic time source - time
since system (re)booted - to the wall clock - real time - time. (Real
time in this context must not be confused with "real time" in the sense
- timer API:
The timer API provides finegrained precision timers with relation to the
time of day subsystem. It provides the functionality of:
- precise interval scheduling
- precise timeouts
with or without high resolution timing support depending on system
configuration and system capabilities.
- timeout API:
The timeout API provides a coarse resolution interface for timeout
purposes. As pointed out before the majority of timers are related to
timeouts. What's the nature of timeouts ?
- Timeout timers are usually armed to cover an error condition
- Most of those timers never expire (the non timer related good
condition arrives before expiry)
- The demands on resolution are usually quite low. It does not make
any difference if an error condition is detected a few or even a
few hundred milliseconds earlier or later. The relevant point is
that the error is detected at all.
On a heavy loaded web server ~95%+ of all timers (almost all of them
armed by network or I/O kernel code) are removed before expiry. The
remaining timers which really expire are mostly timers requested from
application code. The major usage there is some periodic supervisor
code, which checks program status or other application relevant
information, and delays.
Before inclusion of extensions to the current timer implementation e.g.
dynamic ticks, a API seperation and cleanup has to be done. Integrating
new functionality on top of the current code will just introduce a lot
of quirks and oddities which make the necessary cleanup and rework
John Stultz timeofday patches provide an excellent and solid base to
solve the first 3 of 5 points of the API seperation changes.
Ktimers add the timer API seperation with a clean way to integrate high
resolution time keeping.
The combination of both patches provides the grounds and leads the way
to the cleanup of the timeout API and the implementation of
dyntick/tickless support without introducing additional ugliness.
What is solved by ktimers ?
ktimers seperate the "timer API" from the "timeout API". ktimers are
The implementation was done with following constraints in mind:
- Not bound to jiffies
- Multiple time sources
- Per CPU timer queues
- Simplification of absolute CLOCK_REALTIME posix timers
- High resolution timer aware
- Allows the timeout API to reschedule the next event
(for tickless systems)
Ktimers enqueue the timers into a time sorted list, which is implemented
with a rbtree, which is effiecient and already used in other performance
critical parts of the kernel. This is a bit slower than the timer wheel,
but due to the fact that the vast majority of timers is actually
expiring it has to be waged versus the cascading penalty.
The code supports multiple time sources. Currently implemented are
CLOCK_REALTIME and CLOCK_MONOTONIC. They provide seperate timer queues
and support functions.
The time ordered implementation and storage of the expiry time as the
time of the selected time source removes the hard work of
reprogramming all armed absolute CLOCK_REALTIME posix timers when the
clock was set
During the initial implementation phase the choice of a time storage
format had to be done. Dispite the previous discussion about 64-bit time
storage vs. timespec structures, the decision was made to use plain
64-bit variables. The rationale behind this is:
1. Simple calculations (add, sub, compare), which are the ones used in
the fastpaths are simpler and better to handle than struct
2. The storage size is the same for 32-bit systems, but half the size
on 64-bit machines
3. The resulting binary code size is smaller due to the simpler fast
path operations. The comparison of the resulting binary code size
of a function which resembles parts of the hotpaths in the
enqueue and expiry code make this very clear. All compiled with
AMD64 I386 ARM PPC32 M68K
nsec_t_ops e2 11c fc 1ac ce
timespec_ops 19c 144 1c0 280 156
Smaller binary code usually executes faster.
4. The kludge introduced by timespec arithmetics is horrible to
maintain. The simple and straight forward 64-bit calculation have
much less of surprises and potential error sources hidden.
5. The areas where the 64-bit nsec_t value has to be converted to
timespec / timeval are very restricted and can be optimized
further. Except for one odd POSIX timer related case (cyclic timer
with no signal delivery) the calculation is simple and straight
forward. The most discussed code in John Stultz timeofday patch is
the conversion in gettimeofday(). This can be easily solved by a
low overhead storage in both formats. The POSIX timespec / timeval
interface to userspace for the apparently often used gettimeofday
syscall must not be used as an red herring to clutter the complete
kernel timer and time keeping subsystems with this uneffective
representation of time.
ktimers are available in a patch series for easier review:
The basic implementation of ktimers. The timer queues are called
inside the existing timer softirq. The time ordered queue
implementation allows to remove all the abs_list functionality from
posix-timers.c. Converted interfaces: itimers, nanosleep, posix
timers. There is no change to the current kernel time keeping system
required. The base patch utilizes existing interfaces.
The following add on patches are not provided for ad hoc inclusion as
they contain third party patches. The reason for providing this series
is to demonstrate the future use of ktimers and the simple
extensibility for the impelemtation of high resolution timers.
Especially John Stultz timeofday patch is a complete seperate issue
and just used due to the ability to provide high resolution timers in
a simple and non intrusive way.
The full patch series is available from
Generic extension to the base patch to support high resolution
timers. The high resolution changes are the seperation of the
softirq, the high resolution interrupt function and the timer
- timeofday_b5 patch
Integration of John Stultz timeofday patches to have a clean
abstraction of time sources for a non intrusive implementation of
high resolution timers. This patch will be replaced by the reworked
version which is currently developed by John Stultz.
- timeofday_fixup patch
Fixup clashing inlines
extend the timeofday API and switch ktimers to use the timeofday API
Patch to support high resolution timers on i386 with local APIC
timer used for high resolution events. Proof of concept with the
restriction to local APIC as high resolution event source at the
moment. The main point is to prove that high resolution timers do
not require large and intrusive patches anymore due to the cleanup
of the time system.
The complete patch is tested with the posix timer tests, which all
pass. It survives a couple of stress tests and shows no flaws when
integrated into the -rt tree. Of course this is brand new code, but
it is designed simple and robust from ground and got a thorough
review by a couple of people.
Some notes to the patch size(s):
17 files changed, 1328 insertions(+), 859 deletions(-)
code (-) 642 (+) 1058
comments (-) 217 (+) 270
The added code is mostly the base functionality of the ktimers
itself. The most cleanups happen in posix-timers.c, where all the
code related to the clock_was_set adjustment of absolute
CLOCK_REALTIME timers is removed. Converts itimers, nanosleep and
posixtimers to ktimer users
3 files changed, 330 insertions(+), 7 deletions(-)
code (-) 6 (+) 232
comments (-) 1 (+) 98
Add the generic infrastructure for high resolution timers. No non
POSIX clocks introduced, all ktimer users are automatically converted
73 files changed, 2926 insertions(+), 2675 deletions(-)
code (-) 2100 (+) 2146
comments (-) 575 (+) 780
A balanced patch providing a great improvement of functionality and
13 files changed, 635 insertions(+), 10 deletions(-)
code (-) 9 (+) 386
comments (-) 1 (+) 249
The largest addon is a header file containing scaled math operations, which
needs to be cleaned up.
- Total patch size
97 files changed, 5238 insertions(+), 3544 deletions(-)
code (-) 2751 (+) 3829
comments (-) 793 (+) 1409
13 files changed, 1464 insertions(+), 108 deletions(-)
code (-) 91 (+) 879
comments (-) 17 (+) 585
Most code is added to the
Only posixtimers are supported. Add seperate non POSIX clocks
(CLOCK_REALTIME_HR and CLOCK_MONOTONIC_HR)
- i386-hrt.patch (reduced to apic code)
13 files changed, 1371 insertions(+), 63 deletions(-)
code (-) 51 (+) 910
comments (-) 12 (+) 461
26 files changed, 2835 insertions(+), 171 deletions(-)
code (-) 142 (+) 1789
comments (-) 29 (+) 1046
The ktimer/timeofday/hrt combination adds ~1050 lines of source and
provides a clean API seperation and a lot of code/functionality
The high resolution timer patches add ~1750 lines of code for high
resolution time keeping without further functional improvements or