
A new kernel timer API

John Stultz's new core time subsystem was covered on this page back in January. This patch set, which will be submitted soon for inclusion (into -mm), replaces a mess of architecture-specific time implementations with a cleaner, central time subsystem which can take full advantage of hardware time sources. Nishanth Aravamudan would now like to take advantage of the new low-level time code by replacing the kernel timer implementation. This work, if accepted, will lead to the incorporation of a new timer API to be used by kernel code when a function must be called at some point in the future.

In current Linux kernels, internal time (for most purposes) is measured in "jiffies": a counter which is incremented on each timer interrupt. The new time code supersedes jiffies with an absolute, monotonically increasing count of nanoseconds. References to jiffies thus become a call to:

    nsec_t do_monotonic_clock(void);

Using nanoseconds allows kernel code to work with high-resolution time in real-world units. That, in turn, lets kernel developers forget about the (error-prone) conversions between jiffies and real-world time which are currently necessary.
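
As a rough sketch of the difference (NSEC_PER_MSEC, assumed here to be defined as 1,000,000, is used for illustration):

    /* Current style: convert to jiffies, with HZ-dependent rounding */
    unsigned long timeout = jiffies + msecs_to_jiffies(100);

    /* New style: work directly in nanoseconds */
    nsec_t expires = do_monotonic_clock() + 100 * NSEC_PER_MSEC;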

Nishanth's add-on patch changes the timer subsystem to use nanoseconds as well. The current add_timer() and mod_timer() interfaces remain supported, but are deprecated. The new interfaces for setting (or modifying) a timer are:

    int set_timer_nsecs(struct timer_list *timer, nsec_t expires);
    void set_timer_on_nsecs(struct timer_list *timer, nsec_t expires, 
                            int cpu);

These functions will cause the given timer to go off at expires, which is an absolute nanosecond count; set_timer_on_nsecs() additionally specifies which CPU the timer should run on. Usually, expires will be calculated by adding the desired delay (in nanoseconds) to whatever do_monotonic_clock() returns.
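
A typical use might look something like this sketch (the callback, the helper name, and the 10 ms delay are made up for illustration):

    static struct timer_list my_timer;

    static void my_timer_fn(unsigned long data)
    {
        /* runs in interrupt context when the timer expires */
    }

    static void arm_example_timer(void)
    {
        init_timer(&my_timer);
        my_timer.function = my_timer_fn;
        my_timer.data = 0;

        /* the expiry time is absolute: "now" plus 10 ms */
        set_timer_nsecs(&my_timer, do_monotonic_clock() + 10000000ULL);
    }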

It's worth noting that this patch changes the meaning of the expires field in the timer_list structure. This field is now represented in an internal "timer intervals" unit, rather than in jiffies. If the old add_timer() and mod_timer() interfaces are used, the expires field will be silently converted to the internal format. Code which performs calculations on expires (by increasing the delay and calling mod_timer(), for example) could be in for a surprise.
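
For example, a (hypothetical) fragment like this one, a common way of pushing a timer back by one second, would quietly stop meaning what its author intended:

    /* Broken under the new scheme: expires no longer counts jiffies,
     * so adding HZ does not add one second. */
    mod_timer(&my_timer, my_timer.expires + HZ);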

This patch also deprecates schedule_timeout(), in favor of these functions:

    nsec_t schedule_timeout_nsecs(nsec_t timeout);
    unsigned long schedule_timeout_usecs(unsigned long usecs);
    unsigned int schedule_timeout_msecs(unsigned int msecs);

All three of these functions will set a timer for the given delay (which is a relative value, not absolute), then call schedule().
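
So code which currently does the jiffies conversion dance could be simplified along these lines (a sketch):

    /* Old: sleep for (at least) 100 ms, by way of jiffies */
    set_current_state(TASK_INTERRUPTIBLE);
    schedule_timeout(msecs_to_jiffies(100));

    /* New: say what you mean */
    set_current_state(TASK_INTERRUPTIBLE);
    schedule_timeout_msecs(100);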


A new kernel timer API

Posted May 19, 2005 5:47 UTC (Thu) by brouhaha (subscriber, #1698) [Link] (10 responses)

Using nanoseconds for the unit seems like it might be slightly short-sighted. It's probably fine for now, but will it be too coarse twenty years from now? Wouldn't it be better to use picoseconds as the unit? That still allows for over four seconds to be represented in an unsigned 32-bit integer, or hundreds of years in a 64-bit integer.

A new kernel timer API

Posted May 19, 2005 9:44 UTC (Thu) by kleptog (subscriber, #1183) [Link] (8 responses)

I'm not sure. In one picosecond, light has travelled about one third of a millimetre (a milli-foot?). Electrons not even that far. In a nanosecond, light moves about 30cm (a foot for you non-SI users).

I don't really know how much smaller we can make CPUs and such, but I believe we're reaching the point where higher clock speeds are useless and we have to start doing more per clock cycle: hyperthreading, multicore, grid computing, etc. If this is the case, a higher resolution than nanoseconds does not seem particularly useful.

Have a nice day,

A new kernel timer API

Posted May 19, 2005 17:50 UTC (Thu) by brouhaha (subscriber, #1698) [Link] (7 responses)

I believe we're reaching the point where higher clock speeds are useless and we have to start doing more per clock cycle,
People have been saying that for at least fifteen years, and the clock rates keep going up. As Feynman said, "there's plenty of room at the bottom". Eventually we will hit the physical limits, but we aren't there yet.

My point wasn't so much that we need 1 ps timing precision as that we may well need better than 1 ns. There are three orders of magnitude of difference there. If we don't want to use 1 ps, we could certainly use 10 ps, 100 ps, or even 37.2 ps as the unit. But 1 ps seems somewhat more convenient.

A new kernel timer API

Posted May 19, 2005 18:42 UTC (Thu) by hamjudo (guest, #363) [Link] (3 responses)

But you got your units confused. The smallest power of ten unit that fits is the nanosecond. The smallest power of 2 unit is 2^-32 seconds, which is about 250 picoseconds.

To the parent poster: picosecond timers will be useful, even on chips that are many light-picoseconds across. Note how well the Network Time Protocol can synchronize clocks to better than millisecond precision, even though the systems themselves are many light-milliseconds apart (assuming an appropriate network interconnect).

A new kernel timer API

Posted May 20, 2005 23:51 UTC (Fri) by giraffedata (guest, #1954) [Link] (2 responses)

But you got your units confused. The smallest power of ten unit that fits is the nanosecond. The smallest power of 2 unit is 2^-32 seconds, which is about 250 picoseconds.

You're right that brouhaha confused the units (he said 4 seconds worth of picoseconds fit in 32 bits; it's really 4 seconds worth of nanoseconds). But you're introducing a "fits" that nobody's talking about here -- apparently you're intending for 32 bits to count one second.

Note that the interface provides multiple granularities/ranges from which to choose. You can specify your interval in milliseconds, microseconds, or nanoseconds. That in no way means you can actually get that kind of precision out of the timer. There's really no reason not to throw in picoseconds, if only to save having to answer the question.

Why not units of 2^-32 seconds?

Posted May 23, 2005 16:25 UTC (Mon) by spitzak (guest, #4593) [Link] (1 responses)

That actually sounds like a useful and natural unit to use, and certainly easier for programmers to remember. It would allow the upper 32 bits to be equal to the Unix clock and allow conversion to a floating-point number of seconds without rounding errors.
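
In code, the idea might look something like this (the names here are invented):

    /* 64-bit fixed point: the upper 32 bits are whole seconds
     * (matching the Unix clock), the lower 32 bits are the
     * 2^-32 second fraction. */
    typedef unsigned long long fix32_time_t;

    /* Dividing by a power of two adds no rounding error beyond
     * the double's own 53-bit precision. */
    double fix32_to_seconds(fix32_time_t t)
    {
        return (double)t / 4294967296.0;   /* 2^32 */
    }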

Is there some good reason why a power of ten should be used? Is it because of rounding errors from times specified in decimal numbers of seconds?

Why not units of 2^-32 seconds?

Posted May 24, 2005 2:40 UTC (Tue) by giraffedata (guest, #1954) [Link]

Natural for whom? The computer?

Remember that we're talking about an external interface here -- the question is in what units would a user of the timer facility want to specify a duration? Virtually nobody measures time in binary units; we all think of time in milliseconds, nanoseconds, etc.

The Unix time_t type (which I think is what you're referring to as the Unix clock) doesn't actually figure in anywhere here -- this is a value that specifies a duration, not a point in time; and if it ever gets added to a point in time, that time is in the kernel internal format, which is a count of clock ticks.

A new kernel timer API

Posted May 19, 2005 22:24 UTC (Thu) by kleptog (subscriber, #1183) [Link] (2 responses)

LOL! If we're making devices which are smaller than a third of a millimetre across, I imagine we'd get a much better return from clustering a few thousand of them onto a single chip than from trying to make them all run at a terahertz clock rate.

My personal feeling is that measuring less than a nanosecond is not useful, given that the moment you're accessing something off-chip (like, say, memory) you're going to be delayed by tens of thousands of picoseconds, and memory latency is not falling anywhere near as fast as clock speeds are rising.

But hey, I'm willing to be proved wrong.

A new kernel timer API

Posted May 22, 2005 15:17 UTC (Sun) by haraldt (guest, #961) [Link] (1 responses)

Processors are asynchronous even today, aren't they?
Today's standard is to move information from place A to place B within a manageable number of clock ticks. If a clock tick takes a picosecond, then the only requirement is that places A and B are never more than one third of a millimeter apart.

Instructions may have to run through a lot of clock ticks this way, but it's all a matter of resolution.

I won't promise this is going to happen, but hey, it's an idea.

A new kernel timer API

Posted May 22, 2005 15:39 UTC (Sun) by haraldt (guest, #961) [Link]

Err.. I mean that A and B are some multiple of a third of a millimeter apart.

You'd probably need asynchronous buses, asynchronous memory devices, etc. too.
The distance between processor and main memory, for example, could be a load of clock ticks, often with several signals travelling on the same wire. But as long as the equipment can handle asynchronous signalling (in addition to the speed, of course) it's far from impossible.

A new kernel timer API

Posted May 26, 2005 20:53 UTC (Thu) by j1m+5n0w (guest, #20285) [Link]

Wouldn't it be better to use picoseconds as the unit?

I worked on a project for a while implementing a high-precision timer mechanism in Linux. We used the APIC timer, which gave us an accuracy of about 4 microseconds at best (the worst case was much, much worse, due to non-preemptible kernel sections). Linux is at the point now where non-preemptible sections longer than a couple of milliseconds might happen occasionally, but they're relatively rare, whereas latencies of a couple hundred microseconds happen all the time. That would imply that a wakeup timer with a granularity much less than a few hundred microseconds won't be all that useful, since it can't make any guarantees; so there's currently not much need for timer APIs with a granularity finer than microseconds or nanoseconds.

One big problem is inconsistent interfaces. IIRC, nanosleep() uses timespecs (32 bits for seconds, 32 bits for nanoseconds), select() uses timevals (32 bits for seconds, 32 bits for microseconds), poll() uses a 32-bit millisecond value, itimers use timevals, gettimeofday() uses a timeval, and aio (I think) uses timespecs. Timespec seems to make the most sense, since it can be used for very long or very short timeouts, and doesn't waste many bits (you might as well use the maximum precision you can get for free). Timeval is almost as good, but microseconds are kind of sloppy for gettimeofday(), which might be able to tell what time it is with greater accuracy (though the system call takes about a microsecond to complete, so maybe the point is moot). Poll really shouldn't have used a single 32-bit value - it's too coarse for high-precision timeouts, and can't be used for very long timeouts either.
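
For reference, the two structures look roughly like this in current headers:

    struct timespec {
        time_t tv_sec;         /* seconds */
        long   tv_nsec;        /* nanoseconds, 0..999999999 */
    };

    struct timeval {
        time_t      tv_sec;    /* seconds */
        suseconds_t tv_usec;   /* microseconds, 0..999999 */
    };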

Someone else in this thread suggested using a 64-bit value of 2^-32 second units, which appeals to me but probably not to everyone else. If the system call interface could standardize on timespec for everything time-related, that would be fine with me. Unfortunately, the system call interface is more or less etched in stone, so I don't foresee anyone changing it anytime soon.

Another change I would like to see, though I don't know if anyone else does, is versions of nanosleep, select, poll, etc. that use absolute time for their timeouts, rather than relative time. This ensures that the time lost during system call entry is accounted for properly. It also means that the kernel has to handle the case where the timer has expired before it's even added to the queue.
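
For plain sleeps, at least, POSIX already has a shape for this: clock_nanosleep() can take an absolute deadline via TIMER_ABSTIME, something like this (userland) sketch:

    #include <time.h>

    /* Sleep until one second from now on the monotonic clock; time
     * spent getting into the call doesn't stretch the sleep. */
    static void sleep_until_deadline(void)
    {
        struct timespec deadline;

        clock_gettime(CLOCK_MONOTONIC, &deadline);
        deadline.tv_sec += 1;
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &deadline, NULL);
    }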

A new kernel timer API

Posted May 19, 2005 13:04 UTC (Thu) by davecb (subscriber, #1574) [Link] (1 responses)

Hmmn, does that mean we can have a portable high-resolution timer interface in userland?

I'm a performance engineer and tend to depend on (or port) implementations of the POSIX hrtime().

--dave

A new kernel timer API

Posted May 19, 2005 17:14 UTC (Thu) by jsbarnes (guest, #4096) [Link]

The kernel already supports some POSIX clock routines, like clock_gettime, as well as a few different types of clocks. If your platform supports it (i.e. if you have a good clock source available and someone's written a kernel driver for it), high-resolution timers are available via that interface.

date & duration

Posted May 20, 2005 21:42 UTC (Fri) by xav (guest, #18536) [Link]

There's already a possible confusion between relative and absolute nsecs. Maybe he should create types for both, and some sparse magic to check that only relative nsecs can be added to absolute nsecs.
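
A sketch of what that might look like, along the lines of the kernel's existing gfp_t and endianness annotations (all type and helper names here are invented):

    typedef u64 __bitwise abs_nsec_t;
    typedef u64 __bitwise rel_nsec_t;

    /* The only sanctioned way to combine the two; sparse would warn
     * about any other mixing of abs_nsec_t and rel_nsec_t. */
    static inline abs_nsec_t abs_nsec_add(abs_nsec_t when, rel_nsec_t delta)
    {
        return (__force abs_nsec_t)((__force u64)when + (__force u64)delta);
    }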


Copyright © 2005, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds