Re: [PATCH 0/2][RFC] Potential fix for leapsecond caused futex issue (v2)

[Posted July 2, 2012 by corbet]

From:		John Stultz <johnstul-AT-us.ibm.com>
To:		Linux Kernel <linux-kernel-AT-vger.kernel.org>
Subject:		Re: [PATCH 0/2][RFC] Potential fix for leapsecond caused futex issue (v2)
Date:		Sun, 01 Jul 2012 15:05:08 -0700
Message-ID:		<4FF0C994.2020300@us.ibm.com>
Cc:		Prarit Bhargava <prarit-AT-redhat.com>, stable-AT-vger.kernel.org, Thomas Gleixner <tglx-AT-linutronix.de>, Jan Engelhardt <jengelh-AT-inai.de>

On 07/01/2012 11:29 AM, John Stultz wrote:
> TODOs:
> * Chase down the futex/hrtimer interaction to see if this could
> be triggered in any other way.

Ok, I've figured out a more detailed diagnosis of what is going on:

* Leap second occurs, CLOCK_REALTIME is set back one second.

* As clock_was_set() is not called, the hrtimer base.offset value for 
CLOCK_REALTIME is not updated, thus its sense of wall time is one second 
ahead of the timekeeping core's.

* At interrupt time (T), the hrtimer code expires all CLOCK_REALTIME 
based timers set for T+1s and before, causing early expirations for 
timers between T and T+1s since the hrtimer code's sense of time is one 
second ahead.

* This causes all TIMER_ABSTIME CLOCK_REALTIME timers to expire one 
second early.

* More problematically, all sub-second TIMER_ABSTIME CLOCK_REALTIME 
timers will return immediately.  If any such timer calls are done in a 
loop (as commonly done with futex_wait or other timeouts), this will 
cause load spikes in those applications.

* This state persists until clock_was_set() is called (most easily done 
via settimeofday())
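
The sequence above can be illustrated with a small user-space model. This is only a sketch and not kernel code: the struct and function names are hypothetical, and it assumes, as the diagnosis describes, that the hrtimer code derives its view of CLOCK_REALTIME from a monotonic clock plus a cached offset.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NSEC_PER_SEC 1000000000LL

/* Hypothetical model, not kernel code: two views of CLOCK_REALTIME. */
struct clock_model {
	int64_t mono_ns;           /* monotonic time, never jumps */
	int64_t core_offset_ns;    /* timekeeping core: realtime = mono + offset */
	int64_t hrtimer_offset_ns; /* hrtimer base's cached copy of that offset */
};

/* Wall time as the timekeeping core sees it. */
int64_t core_realtime(const struct clock_model *m)
{
	return m->mono_ns + m->core_offset_ns;
}

/* Wall time as the hrtimer code sees it (uses the cached offset). */
int64_t hrtimer_realtime(const struct clock_model *m)
{
	return m->mono_ns + m->hrtimer_offset_ns;
}

/* Insert a leap second: the core steps wall time back by one second.
 * The cached offset is refreshed only if notify is true, which stands
 * in for a clock_was_set() call. */
void model_insert_leap(struct clock_model *m, bool notify)
{
	m->core_offset_ns -= MODEL_NSEC_PER_SEC;
	if (notify)
		m->hrtimer_offset_ns = m->core_offset_ns;
}

/* An absolute CLOCK_REALTIME timer fires once the hrtimer view of
 * wall time reaches its expiry. */
bool model_timer_expired(const struct clock_model *m, int64_t expiry_ns)
{
	return hrtimer_realtime(m) >= expiry_ns;
}
```

With notify set to false, a timer armed for core_realtime() plus half a second is already expired according to the model's hrtimer view, which matches the immediate returns described above. With notify set to true, the same timer is still pending.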

I've used the attached test case to demonstrate triggering a leap-second 
and its effect on CLOCK_REALTIME hrtimers.

The test sets a leapsecond to trigger in 10 seconds, then in a loop 
sleeps for half a second via clock_nanosleep, printing out the current 
time, and the delta from the target wakeup time for 30 seconds.

When the leap second triggers, on affected machines you'll see the 
output stream by quickly, with negative diff values, as clock_nanosleep is 
returning immediately.

To build:
gcc leaptest-timer.c -o leaptest-timer -lrt

I've reproduced this behaviour in kernel versions:
     v3.5-rc4
     v2.6.37
     v2.6.32.59
(And quite likely all in-between).

I haven't been able to build or boot anything earlier with the distro on 
my current test boxes, but I'm working to get an older distro installed so 
I can do further testing.

This has likely been around since commit 
746976a301ac9c9aa10d7d42454f8d6cdad8ff2b in v2.6.22, as Ben Blum 
and Jan Ceuleers already noted.

With my fix to call clock_was_set when we apply a leapsecond, I no 
longer see the issue.
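
On a kernel without this fix, the diagnosis above suggests that any settimeofday() call should clear the stale state, since that path does call clock_was_set(). A minimal sketch of that idea follows. The function name is hypothetical, the call needs root (CAP_SYS_TIME), and because it writes back the time it has just read, the wall clock should move only by the short interval between the two calls.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical helper: re-set CLOCK_REALTIME to (nearly) its current
 * value so that the kernel runs its "clock was set" path.
 * Returns 0 on success, -1 with errno set on failure (e.g. EPERM). */
int poke_realtime_clock(void)
{
	struct timeval tv;

	if (gettimeofday(&tv, NULL) != 0)
		return -1;
	return settimeofday(&tv, NULL);
}
```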

thanks
-john

/* Leap second timer test
 *              by: john stultz (johnstul@us.ibm.com)
 *              (C) Copyright IBM 2012
 *              Licensed under the GPL
 */

#include <stdio.h>
#include <time.h>
#include <unistd.h>	/* sleep() */
#include <sys/time.h>
#include <sys/timex.h>

#define NSEC_PER_SEC 1000000000LL

struct timespec timespec_add(struct timespec ts, unsigned long long ns)
{
	ts.tv_nsec += ns;
	while(ts.tv_nsec >= NSEC_PER_SEC) {
		ts.tv_nsec -= NSEC_PER_SEC;
		ts.tv_sec++;
	}
	return ts;
}

struct timespec timespec_diff(struct timespec a, struct timespec b)
{
	long long ns;
	int neg = 0;

	ns = a.tv_sec *NSEC_PER_SEC + a.tv_nsec;
	ns -= b.tv_sec *NSEC_PER_SEC + b.tv_nsec;

	if (ns < 0) {
		neg = 1;
		ns = -ns;
	}
	a.tv_sec = ns/NSEC_PER_SEC;
	a.tv_nsec = ns%NSEC_PER_SEC;

	if (neg) {
		a.tv_sec = -a.tv_sec;
		a.tv_nsec = -a.tv_nsec;
	}

	return a;
}

int main(void)
{
	struct timeval tv;
	struct timex tx;
	long now, then;
	struct timespec ts;

	int clock_type 		= CLOCK_REALTIME;
	int flag 		= TIMER_ABSTIME;
	long long sleeptime	= NSEC_PER_SEC/2;

	/* clear TIME_WAIT */
	tx.modes = ADJ_STATUS;
	tx.status = 0;
	adjtimex(&tx);

	sleep(2);

	/* Get the current time */
	gettimeofday(&tv, NULL);

	/* Calculate the next leap second */
	tv.tv_sec += 86400 - tv.tv_sec % 86400;

	/* Set the time to be 10 seconds from that time (requires root) */
	tv.tv_sec -= 10;
	if (settimeofday(&tv, NULL)) {
		perror("settimeofday");
		return 1;
	}

	/* Set the leap second insert flag */
	tx.modes = ADJ_STATUS;
	tx.status = STA_INS;
	adjtimex(&tx);

	clock_gettime(clock_type, &ts);
	now = then = ts.tv_sec;
	while(now - then < 30){
		struct timespec target, diff, rem;
		rem.tv_sec = 0;
		rem.tv_nsec = 0;

		if (flag == TIMER_ABSTIME)
			target = timespec_add(ts, sleeptime);
		else
			target = timespec_add(rem, sleeptime);

		clock_nanosleep(clock_type, flag, &target, &rem);
		clock_gettime(clock_type, &ts);

		diff = timespec_diff(ts, target);
		printf("now: %ld:%ld  diff: %ld:%ld rem: %ld:%ld\n",
				ts.tv_sec, ts.tv_nsec,
				diff.tv_sec, diff.tv_nsec,
				rem.tv_sec, rem.tv_nsec);
		now = ts.tv_sec;
	}

	/* clear TIME_WAIT */
	tx.modes = ADJ_STATUS;
	tx.status = 0;
	adjtimex(&tx);

	return 0;
}

Re: [PATCH 0/2][RFC] Potential fix for leapsecond caused futex issue (v2)

Posted Jul 2, 2012 16:09 UTC (Mon) by Baylink (guest, #755) [Link] (13 responses)

Why would a "clock be set back one second"?

Do people not understand how leap seconds are implemented? Really?

235958
235959
235960
000000
000001

Not, as Red Hat seems to think:

235958
235959
235959
000000
000001

Or, as Google seems to think would be a Good Idea:

235958
235959
000000
000001

with *each second of that day being 1/86,401th of a second longer than on other days* (no, I am not making any part of that up).

Re: [PATCH 0/2][RFC] Potential fix for leapsecond caused futex issue (v2)

Posted Jul 2, 2012 16:24 UTC (Mon) by Baylink (guest, #755) [Link] (1 responses)

To clarify: it appears to me (based on the available evidence at the moment) that the fundamental underlying cause of this failure is that the people who need to understand that 235960 is a valid timestamp do not.

ISO 8601 actually permits 60 as a valid seconds count, for precisely this reason.

https://en.wikipedia.org/wiki/ISO_8601

I had thought that it, for some reason, permitted 61, too, but I was wrong.

Re: [PATCH 0/2][RFC] Potential fix for leapsecond caused futex issue (v2)

Posted Jul 4, 2012 7:12 UTC (Wed) by butlerm (subscriber, #13312) [Link]

The cause is not a failure in understanding, it is a failure in implementation. The kernel does not track time in hours, minutes, and seconds, but it (regrettably) does track time using a standard encoding designed around the assumption that leap seconds can be profitably ignored.

Changing the kernel's internal time base to use something TAI derived instead of UTC derived is probably the only way to fix this problem reliably. The downside is that the kernel would have to maintain a leap second table and convert back and forth between POSIX time and linear time where necessary.
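
To illustrate the table-based conversion described above, here is a minimal sketch. The table holds only the two most recent TAI-UTC steps as of July 2012 (34 s from 2009-01-01 and 35 s from 2012-07-01). The names are hypothetical, and a real implementation would need the full table and a way to update it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical two-entry leap second table: from the given POSIX
 * timestamp onward, TAI is ahead of UTC by tai_minus_utc seconds. */
struct leap_entry {
	int64_t posix_start;
	int tai_minus_utc;
};

static const struct leap_entry leap_table[] = {
	{ 1230768000LL, 34 },	/* 2009-01-01 00:00:00 UTC */
	{ 1341100800LL, 35 },	/* 2012-07-01 00:00:00 UTC */
};

/* Map a POSIX timestamp to a linear, TAI-style count of seconds.
 * Timestamps before the first entry are outside this sketch's range
 * and simply get the first entry's offset. */
int64_t posix_to_linear(int64_t posix)
{
	int offset = leap_table[0].tai_minus_utc;
	size_t i;

	for (i = 0; i < sizeof(leap_table) / sizeof(leap_table[0]); i++) {
		if (posix >= leap_table[i].posix_start)
			offset = leap_table[i].tai_minus_utc;
	}
	return posix + offset;
}
```

Here 1341100799 maps to 1341100833 and 1341100800 maps to 1341100835. The skipped value 1341100834 is the leap second itself, which the linear count can represent and the POSIX count cannot.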

Re: [PATCH 0/2][RFC] Potential fix for leapsecond caused futex issue (v2)

Posted Jul 2, 2012 16:27 UTC (Mon) by lindi (subscriber, #53135) [Link] (1 responses)

The interface between Linux and applications is the gettimeofday system call. I believe it returned

1341100798
1341100799
1341100800
1341100800
1341100801

Isn't this exactly the correct behavior?

Re: [PATCH 0/2][RFC] Potential fix for leapsecond caused futex issue (v2)

Posted Jul 2, 2012 17:24 UTC (Mon) by Jonno (subscriber, #49613) [Link]

Yes, that is the correct behaviour of gettimeofday().

However, there was some internal kernel code that expected to be informed whenever the relationship between time of day and elapsed time wasn't continuous (done by calling clock_was_set()). The code for settimeofday() got this right, but the clock_was_set() call was missing from the leap second insertion code, leading to some trouble I don't really understand.

Re: [PATCH 0/2][RFC] Potential fix for leapsecond caused futex issue (v2)

Posted Jul 2, 2012 18:53 UTC (Mon) by jhhaller (guest, #56103) [Link] (8 responses)

There are actually a couple of ways leap seconds are implemented in GNU/Linux (we can't blame the whole mess on the kernel). There are three parts to leap seconds: ntpd, the kernel, and glibc.

To use glibc with leap seconds, "right" timezone files must be used, e.g. right/US/Central. This allows the ISO version of timestamps to be used, and the clock will indeed tick at 235958, 235959, 235960, 000000. These are not the default timezones, for reasons described below. Note that this requires the count of seconds since January 1, 1970 to include all leap seconds.

To use glibc with the conventional timezones, there is no notion of leap seconds. This is the way ctime has worked since the beginning. It ensures that every year has the same number of seconds, except for leap years, which have an extra day. This obviously causes problems when there is a leap second: a second has to be replayed by the kernel, as the kernel clock can't know about leap seconds since ctime doesn't. There is no way in the current interfaces to report a time and also report that this kernel time represents a leap second. Also, the POSIX definition of time does not account for leap seconds. The only way to do this is to replay the time value for second 59, as ctime has no way to know a leap second is coming and so cannot show the displayed second as 60.
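
The point about POSIX can be made concrete with the formula POSIX gives for "seconds since the Epoch", which treats every day as exactly 86400 seconds long. The sketch below implements that formula directly. The function name is hypothetical, and the arguments follow struct tm conventions (tm_year is years since 1900, tm_yday is zero-based).

```c
/* Seconds since the Epoch as defined by POSIX for a UTC broken-down
 * time. Arguments follow struct tm: tm_year counts from 1900 and
 * tm_yday is the zero-based day of the year. Valid for tm_year >= 70. */
long long posix_seconds(int tm_year, int tm_yday,
			int tm_hour, int tm_min, int tm_sec)
{
	return tm_sec + tm_min * 60LL + tm_hour * 3600LL + tm_yday * 86400LL
		+ (tm_year - 70) * 31536000LL
		+ ((tm_year - 69) / 4) * 86400LL
		- ((tm_year - 1) / 100) * 86400LL
		+ ((tm_year + 299) / 400) * 86400LL;
}
```

With tm_year = 112 (2012) and tm_yday = 181 (June 30), 23:59:60 yields 1341100800, the same value as 00:00:00 on tm_yday = 182 (July 1), so the formula gives the leap second no value of its own.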

Now, to throw NTP into the mix. The NTP protocol reports time in UTC, which is kept close to solar time by means of leap seconds. There is no history of leap seconds in the NTP protocol, just an indication that an upcoming minute at the end of the day will have 59, 60, or 61 seconds. While NTP could in theory run with the "right" timezones, and add or subtract historical leap seconds when setting the system time, that would make the time returned from the time call incorrect according to POSIX.

In short, POSIX is inconsistent with itself, or at least needs a new kernel API to reflect historical leap seconds, both for time and adjtime, although it appears that this was known when the time system call was standardized.

Re: [PATCH 0/2][RFC] Potential fix for leapsecond caused futex issue (v2)

Posted Jul 2, 2012 20:18 UTC (Mon) by chloe_zen (guest, #8258) [Link] (7 responses)

Should ntp not use a gradual clock skew instead of simply slamming the time to its new value?

Re: [PATCH 0/2][RFC] Potential fix for leapsecond caused futex issue (v2)

Posted Jul 2, 2012 20:36 UTC (Mon) by Thue (guest, #14277) [Link] (5 responses)

In any sane standard, the clock (and NTP) should be using TAI ( http://en.wikipedia.org/wiki/International_Atomic_Time ). The same way as the computer's clock doesn't include time zones, it shouldn't include leap seconds.

Time zone and leap second offsets should be added in user space programs, the same way I assume time zones currently are.

Re: [PATCH 0/2][RFC] Potential fix for leapsecond caused futex issue (v2)

Posted Jul 2, 2012 20:45 UTC (Mon) by chloe_zen (guest, #8258) [Link] (4 responses)

I don't disagree, but that's beside the point IMO. Sometimes clock drift happens. When ntp is called on to fix that drift -- whether due to stupid standards or everyday imprecision -- shouldn't it use the system calls that are designed for adjusting the system clock's fundamental speed, instead of just saying "ok your time is different NOW!" ?
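
For reference, a gradual correction of the kind described here can be requested through adjtime(), which asks the kernel to slew the clock rather than step it. The sketch below is illustrative only: the helper names are hypothetical, the call needs root, the same-sign timeval fields are an assumption about what glibc's adjtime() accepts, and nothing here describes what ntpd does internally.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical helper: split a signed microsecond count into a
 * struct timeval whose fields carry the same sign. */
struct timeval usec_to_timeval(long long usec)
{
	struct timeval tv;

	tv.tv_sec = usec / 1000000LL;
	tv.tv_usec = usec % 1000000LL;
	return tv;
}

/* Ask the kernel to slew CLOCK_REALTIME by usec microseconds
 * (negative values slow the clock until the offset is absorbed).
 * Returns 0 on success, -1 with errno set on failure. */
int slew_clock(long long usec)
{
	struct timeval delta = usec_to_timeval(usec);

	return adjtime(&delta, NULL);
}
```

A call such as slew_clock(-1000000) would ask for one second to be absorbed gradually, in contrast to settimeofday(), which steps the clock at once.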

Re: [PATCH 0/2][RFC] Potential fix for leapsecond caused futex issue (v2)

Posted Jul 2, 2012 21:08 UTC (Mon) by Thue (guest, #14277) [Link] (2 responses)

Of course we would still need NTP if the system clock was set to TAI, and of course the same gradual clock adjust as now should be used. The use of NTP is orthogonal to the UTC vs TAI as system clock argument.

Re: [PATCH 0/2][RFC] Potential fix for leapsecond caused futex issue (v2)

Posted Jul 2, 2012 21:40 UTC (Mon) by chloe_zen (guest, #8258) [Link] (1 responses)

Er, "the same gradual clock adjust as now" isn't so gradual, is it? Else this bug wouldn't have hit?

Re: [PATCH 0/2][RFC] Potential fix for leapsecond caused futex issue (v2)

Posted Jul 2, 2012 22:05 UTC (Mon) by Thue (guest, #14277) [Link]

The current clock implementation is only non-gradual at leap seconds.

gradual clock adjust

Posted Jul 2, 2012 23:11 UTC (Mon) by pflugstad (subscriber, #224) [Link]

FWIW, using NTP to gradually account for a leap second is what Google decided to do:

<http://googleblog.blogspot.com/2011/09/time-technology-an...>

Note that they don't explicitly say over what time window they adjust the time, but my impression from the above article is that it's done over a few hours, not over an entire day.
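
As a rough illustration of what gradual accounting means, the sketch below spreads one leap second linearly over a window. This is an assumption made for illustration and not Google's published method: the function name and the linear shape are hypothetical, and the window length is a parameter because, as noted, it isn't stated.

```c
/* Hypothetical linear smear: returns how much of the one-second leap
 * (in seconds, 0.0 to 1.0) has been absorbed after elapsed_s seconds
 * of a window that is window_s seconds long. */
double smear_absorbed(double elapsed_s, double window_s)
{
	if (elapsed_s <= 0.0)
		return 0.0;
	if (elapsed_s >= window_s)
		return 1.0;
	return elapsed_s / window_s;
}
```

Halfway through a two-hour window, smear_absorbed(3600.0, 7200.0) gives 0.5, meaning the served time would lag the unsmeared clock by half a second at that point.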

Re: [PATCH 0/2][RFC] Potential fix for leapsecond caused futex issue (v2)

Posted Jul 3, 2012 14:19 UTC (Tue) by Tobu (subscriber, #24111) [Link]

NTP merely informs the kernel (using adjtimex) that a leap second is being inserted. The kernel implementation makes an internal timestamp jump, but that's because the kernel counts using POSIX timestamps, which is an implementation decision. If the kernel's internal timekeeping used TAI, the kernel would cross-reference the adjtimex notification with some sort of leap seconds table, which it would use whenever it needs to come up with a POSIX timestamp (for much of its ABI including protocols, filesystem formats, and system calls). NTP is agnostic about how the kernel clock is run.
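
The kernel's side of this notification can be observed from user space without privileges: calling adjtimex() with modes set to 0 changes nothing and returns the current clock state. A minimal sketch follows. The helper names are hypothetical, and the short state descriptions in the comments are informal summaries.

```c
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/timex.h>

/* Hypothetical helper: name the clock state returned by adjtimex(). */
const char *clock_state_name(int state)
{
	switch (state) {
	case TIME_OK:    return "TIME_OK";    /* no leap second pending */
	case TIME_INS:   return "TIME_INS";   /* insert at end of the UTC day */
	case TIME_DEL:   return "TIME_DEL";   /* delete at end of the UTC day */
	case TIME_OOP:   return "TIME_OOP";   /* insertion in progress */
	case TIME_WAIT:  return "TIME_WAIT";  /* leap second has occurred */
	case TIME_ERROR: return "TIME_ERROR"; /* clock not synchronized */
	default:         return "unknown";
	}
}

/* Read-only query: print the state and whether STA_INS is set.
 * Returns the state, or -1 with errno set on failure. */
int print_leap_state(void)
{
	struct timex tx;
	int state;

	memset(&tx, 0, sizeof(tx));	/* modes = 0: read, do not modify */
	state = adjtimex(&tx);
	if (state < 0)
		return -1;
	printf("state: %s, STA_INS %s\n", clock_state_name(state),
	       (tx.status & STA_INS) ? "set" : "clear");
	return state;
}
```

Running print_leap_state() shortly before a scheduled leap second, on a machine where the insert flag has been set, would be expected to report TIME_INS with STA_INS set.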