
Short sleeps suffering from slack

By Jonathan Corbet
February 17, 2012
As a general rule, kernel developers will go out of their way to avoid breaking user-space code, even when that code is seen as being wrong and already broken. But there are exceptions; a recent discussion regarding timer behavior may prove to be an example of how such exceptions can come about.

The C-library sleep() function is defined to put the calling process to sleep for at least the number of seconds specified. One might think that calling sleep() with an argument of zero seconds would make relatively little sense; why put a process to sleep for no time? It turns out, though, that some developers put such calls in as a way to relinquish the CPU for a short period of time. The idea is to be nice and allow other processes to run briefly before continuing execution. Applications that perform polling or are otherwise prone to consuming too much CPU are often "fixed" with a zero-second sleep.
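
For illustration (this sketch is not from the article), the pattern typically looks something like the following, where work_is_available() and do_work() are hypothetical stand-ins for whatever the application really does:

#include <unistd.h>

/* Hypothetical polling loop "fixed" with a zero-second sleep. */
extern int work_is_available(void);
extern void do_work(void);

void poll_for_work(void)
{
    for (;;) {
        if (work_is_available())
            do_work();
        sleep(0);    /* "be nice": give the CPU away for a moment */
    }
}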

Once upon a time in Linux, sleep(0) would always put the calling process to sleep for at least one clock tick. When high-resolution timers were added to the kernel, the behavior changed: if a process asked to sleep on an already-expired timer (which is the case for a zero-second sleep), the call simply returned directly back to the calling process. Then came the addition of timer slack, which can extend sleep periods to force multiple processes to wake at the same time. This behavior will cause timers to run a little longer than requested, but the result is fewer processor wakeups and, thus, a savings of power. In the case of a zero-second sleep, the addition of timer slack turns an expired timer into one that is not expired, so the calling process will, once again, be put to sleep.

The default timer slack, at 50µs, is unlikely to cause visible changes to the behavior of most applications. But it seems that, on some systems, the timer slack value is set quite high - on the order of seconds - to get the best power behavior possible. That can extend the length of a zero-second sleep accordingly, leading to misbehaving applications.
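
The slack is a per-thread value that can be queried and changed with the PR_GET_TIMERSLACK and PR_SET_TIMERSLACK prctl() operations; the following sketch (not from the article; error handling is omitted and the two-second figure is only an example of such a large setting) shows how it might be done:

#include <stdio.h>
#include <sys/prctl.h>

int main(void)
{
    /* The kernel's default slack is 50,000ns (50µs). */
    printf("current timer slack: %ld ns\n", (long) prctl(PR_GET_TIMERSLACK));

    /* Trade timer precision for fewer wakeups: two seconds of slack. */
    prctl(PR_SET_TIMERSLACK, 2000000000UL);
    printf("new timer slack:     %ld ns\n", (long) prctl(PR_GET_TIMERSLACK));
    return 0;
}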

Matthew Garrett, working under the notion that breaking applications is bad, submitted a patch making a special case for zero-second sleeps. The idea is simple: if the requested sleep time is zero, timer slack will not be added and the process will not be delayed. The problem with this approach is that the process will still not get the desired result: rather than yielding the processor, it will simply have performed a useless system call and gone right back to whatever it was doing before. Without timer slack, a request to sleep on an expired timer will return directly to user space without going through the scheduler at all.

An alternative would be to transform sleep(0) into a call to sched_yield(). But that idea is not hugely popular with the scheduler developers, who think that calls to sched_yield() are almost always a bad idea. It is better, they say, to fix the applications to stop polling or doing whatever else it is that they do that causes developers to think that explicitly yielding the CPU is the right thing to do.

According to Matthew, the number of affected applications is not tiny:

Checking through an exploded Fedora kernel tree suggests around 125 packages out of 11000 or so, so around 1% of userspace seems to use sleep(0) under certain circumstances. We can probably fix everything in the distribution, but that suggests that there's also going to be a significant amount of code in the outside world that's also broken.

Normal practice in kernel development would be to try to avoid breaking those applications if possible. Even in cases where applications are relying on undefined and undocumented behavior - certainly the case here - it is better if a kernel upgrade doesn't turn working code into broken code. Some participants have suggested that the same approach should be taken in this case.

The situation with sleep(0) is a little different from others, though. Application developers cannot claim a long history of working behavior in this case, since the kernel's response to a zero-second sleep has already changed a few times over the course of the last decade. And, according to Thomas Gleixner, it is hard to know when the special case applies or what should be done:

Dammit, we cannot come up with a reasonable definition for special casing that stuff simply because you cannot draw a clear boundary what to special case and what not. And there is no sensible definition for what to do - return right away or go through schedule() or what ever.

Thomas worries that there may be calls for special cases for similar calls - single-nanosecond calls to nanosleep(), for example - and that the result will be an accumulation of cruft in the core timer code. So, rather than try to define these cases and maintain the result indefinitely, he thinks it is better just to let the affected code break in cases where the timer slack has been set to a large value. And that is where the discussion faded away, suggesting that nothing will be done in the kernel to reduce the effect of timer slack on zero-second sleeps.

Index entries for this article
Kernel: Development model/User-space ABI
Kernel: hrtimer
Kernel: Timers



Short sleeps suffering from slack

Posted Feb 23, 2012 5:24 UTC (Thu) by xorbe (guest, #3165) [Link] (1 responses)

Seems like they could just add 1ns to all incoming values, and be done with it.

Short sleeps suffering from slack

Posted Feb 23, 2012 13:43 UTC (Thu) by vonbrand (subscriber, #4458) [Link]

Or better, use something like a slack of 10% of the incoming value + 1ns.

Short sleeps suffering from slack

Posted Feb 23, 2012 13:55 UTC (Thu) by Ben_P (guest, #74247) [Link] (1 responses)

What types of applications break when sleep(0) just returns? sleep(0) seems like it would be used to "fix" concurrency issues in an even more naive way than sched_yield()?

I've seen so many poorly written programs "fix" concurrency problems with yields that I'm quite cynical whenever I see them in code. Unless it's used in some very primitive concurrency or I/O code, a yield only seems to delay incorrect code from breaking.

Short sleeps suffering from slack

Posted Feb 24, 2012 23:02 UTC (Fri) by giraffedata (guest, #1954) [Link]

What types of applications break when sleep(0) just returns? ...
I've seen so many poorly written programs "fix" concurrency problems with yields ...

It seems like you've answered your own question.

I have no trouble believing that these programs you've seen worked better after sleep(0) was added than before. Maybe it's just within a narrow field of application, but that may be the only field that matters. You might say these programs don't deserve to keep working, even in that narrow application, but you can't deny that making sleep() a no-op would do damage there.

Short sleeps suffering from slack

Posted Feb 24, 2012 0:29 UTC (Fri) by cmccabe (guest, #60281) [Link] (4 responses)

Ideal glibc implementation of sched_yield / sleep(0):

#include <stdio.h>

void sched_yield(void) {
    fprintf(stderr, "You are a bad developer. Go away.\n");
}

Short sleeps suffering from slack

Posted Feb 25, 2012 21:08 UTC (Sat) by nevets (subscriber, #11875) [Link] (2 responses)

I mostly agree with you, but there is a small set of legitimate uses of sched_yield(). The only real use case I can think of is a set of real-time threads simulating their own voluntary scheduler.

If you have a set of threads all at the same priority, running FIFO and pinned to the same CPU, you can use sched_yield() to put yourself behind the other threads with the same priority and let them work. I've been on one project that did this.

The kernel stop_machine mechanism used to do this. It used yield() to let its other threads get scheduled (all running at the highest FIFO priority). It worked that way up until v2.6.26; after that, the algorithm was changed.
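
A rough sketch of that arrangement (not from the comment; the priority of 10 and CPU 0 are arbitrary, error handling is omitted, and SCHED_FIFO normally requires root or CAP_SYS_NICE):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* One of several cooperating workers, all SCHED_FIFO at the same priority. */
static void *worker(void *arg)
{
    (void) arg;
    for (;;) {
        /* ... do one unit of work ... */
        sched_yield();    /* go to the back of the FIFO queue at this priority */
    }
    return NULL;
}

static pthread_t start_worker(void)
{
    pthread_t tid;
    pthread_attr_t attr;
    struct sched_param sp = { .sched_priority = 10 };
    cpu_set_t cpus;

    pthread_attr_init(&attr);
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
    pthread_attr_setschedparam(&attr, &sp);

    CPU_ZERO(&cpus);
    CPU_SET(0, &cpus);    /* pin every worker to the same CPU */
    pthread_attr_setaffinity_np(&attr, sizeof(cpus), &cpus);

    pthread_create(&tid, &attr, worker, NULL);
    return tid;
}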

Short sleeps suffering from slack

Posted Feb 26, 2012 22:48 UTC (Sun) by cmccabe (guest, #60281) [Link] (1 responses)

I'm not too familiar with SCHED_FIFO. But can't you use select(0, NULL, NULL, NULL, 0) to do the same thing? System calls are cancellation points, at least.

Incidentally, I was around for the cooperative multitasking days on Mac OS 6. It was not good. I'm sure there's some rationale for simulating that kind of thing in userspace, but a lot of times it smells like doing something in userspace that you ought to be doing in the kernel.

Short sleeps suffering from slack

Posted Feb 27, 2012 12:21 UTC (Mon) by jengelh (guest, #33263) [Link]

select(timeout=0)? That's just as bad as sleep(0). sched_yield() looks a lot better, and since it is also a system call, there is your cancellation point.

Short sleeps suffering from slack

Posted Mar 1, 2012 13:12 UTC (Thu) by farnz (subscriber, #17727) [Link]

There are two problem cases where that's a bad implementation:

  1. sleep(0) where the 0 is not a hard-coded constant, but the result of a time calculation. The application doesn't actually mind that the sleep takes extra time, as it's aiming to do things like "sleep until the next work item is due to start", and can cope if it's late (e.g. a task scheduler aiming to kick off work every 60 seconds - if the work takes more than 60 seconds unexpectedly, it has to cope anyway). Such an application is probably doing a calculation of the form "next start time - current time"[2] to get the sleep time.
  2. sched_yield in a SCHED_FIFO context. While the POSIX definition of sched_yield doesn't require it to be anything other than a no-op, for the specific case of SCHED_FIFO there's a good reason to implement it as "let any other runnable task of same priority as this task run"; absent such an implementation, there is no way for multiple co-operating SCHED_FIFO tasks to say "I still want the CPU, but if another task of equal priority wants the CPU, let it have it".

Note that the second case is specific to SCHED_FIFO - other scheduling algorithms will preempt a CPU-bound task if something else of same priority needs the CPU. SCHED_FIFO specifically does not allow that to happen, so you need some sensible mechanism for a task to say "I'm still CPU-bound, but this is an appropriate point to preempt me if another task needs to run".
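
As an illustration of the first case, a 60-second task scheduler of the sort described might look roughly like this (run_work_item() is a hypothetical placeholder and error handling is omitted):

#include <time.h>
#include <unistd.h>

extern void run_work_item(void);    /* hypothetical */

/* Kick off work every 60 seconds; if an item overruns, the computed
 * delay comes out zero (or negative) and the loop ends up calling sleep(0). */
void run_every_60s(void)
{
    time_t next = time(NULL);

    for (;;) {
        run_work_item();
        next += 60;
        time_t now = time(NULL);
        sleep(next > now ? (unsigned int) (next - now) : 0);
    }
}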

Are we really chasing the right issue?

Posted Feb 24, 2012 8:16 UTC (Fri) by rvfh (guest, #31018) [Link] (15 responses)

> on some systems, the timer slack value is set quite high - on the order of seconds

So the problem is not just sleep(0) then! sleep(1) might sleep several seconds too... Isn't this the first issue to fix? Who decided my sleep(1) could wait several seconds and not just the 1 I coded?

To me the problem is when the sleep requested period is less than the slack value, and that's what I would fix.

Are we really chasing the right issue?

Posted Feb 24, 2012 8:45 UTC (Fri) by dlang (guest, #313) [Link] (2 responses)

The owner of the system set the slack value; why should the programmer of some random application get to override it?

Are we really chasing the right issue?

Posted Feb 24, 2012 9:29 UTC (Fri) by rvfh (guest, #31018) [Link] (1 responses)

Problem is:
* app dev says it should sleep 1 second
* sys owner says if you sleep, then you may sleep for 5 seconds

What do we do? Either
* sleep for 1 second, as requested, or
* sleep for up to 5 seconds and break the application

I think this calls for a new user-space API, such as:
unsigned int sleep_slack(unsigned int seconds, unsigned int slack);

But sleep's behaviour should not be changed.
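
No such call exists; but, as a purely hypothetical sketch, something with roughly those semantics could be approximated in user space today with the per-thread prctl() knob (here the slack argument is taken in nanoseconds, and error handling is omitted):

#include <time.h>
#include <sys/prctl.h>

/* Hypothetical: not an existing API.  Temporarily apply the caller's slack,
 * sleep, then restore the previous per-thread value.  Note that, as I read
 * PR_SET_TIMERSLACK, a value of 0 resets the thread to its default slack
 * rather than requesting no slack at all. */
unsigned int sleep_slack(unsigned int seconds, unsigned int slack_ns)
{
    long old = prctl(PR_GET_TIMERSLACK);
    struct timespec req = { .tv_sec = seconds, .tv_nsec = 0 };
    struct timespec rem = { 0, 0 };

    prctl(PR_SET_TIMERSLACK, (unsigned long) slack_ns);
    nanosleep(&req, &rem);
    prctl(PR_SET_TIMERSLACK, (unsigned long) old);

    return (unsigned int) rem.tv_sec;    /* unslept seconds, like sleep() */
}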

Are we really chasing the right issue?

Posted Feb 24, 2012 9:55 UTC (Fri) by tglx (subscriber, #31301) [Link]

> But sleep's behaviour should not be changed.

The kernel does not change sleep() behaviour. It's the sysadmin's choice to set slack to something large. The kernel provides the mechanism, but not the policy.

Are we really chasing the right issue?

Posted Feb 24, 2012 10:20 UTC (Fri) by anselm (subscriber, #2796) [Link] (9 responses)

Who decided my sleep(1) could wait several seconds and not just the 1 I coded?

The person who wrote the spec for sleep(), which says, among other things:

The suspension time may be longer than requested due to the scheduling of other activity by the system.

So if you believe that »sleep(1)« will sleep for exactly one second, you are mistaken about how sleep() works.

oversleeping

Posted Feb 24, 2012 22:57 UTC (Fri) by giraffedata (guest, #1954) [Link]

I think it's more basic than the documented function of sleep(). In a non-realtime timeshared OS, the OS can take several seconds from you any time it wants, whether you did a sleep() or not. If you get to run at all, you should be grateful.

Are we really chasing the right issue?

Posted Feb 26, 2012 10:44 UTC (Sun) by IkeTo (subscriber, #2122) [Link] (7 responses)

> So if you believe that »sleep(1)« will sleep for exactly one second, you are mistaken about how sleep() works.

Nobody has any doubt about "sleep(1)" sleeping for 1.01 seconds, or sleeping for two whole days if the user suspended the computer. But that is a different proposition than expecting "sleep(1)" to regularly sleep for 10 seconds on a reasonably loaded system. As a developer, if I know that my program will not behave as it should when it instead sleeps for 10 seconds, what other options do I have?

Are we really chasing the right issue?

Posted Feb 27, 2012 10:50 UTC (Mon) by mpr22 (subscriber, #60784) [Link] (2 responses)

That would depend on whether your program breaking when the delay is 10 seconds instead of 1 second is justifiable. If it is, you'll just have to document that the user needs to turn down the timer slack setting on their system. If it isn't, fix your buggy program.

Are we really chasing the right issue?

Posted Feb 27, 2012 15:51 UTC (Mon) by fuhchee (guest, #40059) [Link] (1 responses)

"If it is, you'll just have to document that the user needs to turn down the timer slack setting on their system."

So a single systemwide knob has to be fixed by the user's sysadmin? That doesn't seem appropriate, just to retain previous capability.

Are we really chasing the right issue?

Posted Feb 27, 2012 19:05 UTC (Mon) by dlang (guest, #313) [Link]

it's not a "single systemwide knob", it's a per-cgroup knob

Are we really chasing the right issue?

Posted Mar 1, 2012 5:24 UTC (Thu) by kevinm (guest, #69913) [Link] (2 responses)

If your program will work correctly when sleeping one second, but not when sleeping 10 seconds, then you either have a buggy program (which is probably already failing on heavily loaded systems) or a program that should be using a real-time scheduling class.

A program that calls sleep(n) must already expect to sleep for at least n seconds. The timer-slack is just making these bugs more visible.

Are we really chasing the right issue?

Posted Mar 3, 2012 2:05 UTC (Sat) by IkeTo (subscriber, #2122) [Link] (1 responses)

Say I want to create a stop-watch application: the user specifies a number of seconds to wait until an alert is shown, and meanwhile the stop watch keeps displaying the amount of time remaining, at one-second intervals. No user cares if the display is updated 0.1 second too late, so the original sleep() works perfectly. There is no need for a "real-time scheduling class" here.

With the timer slack, all of a sudden users will see the timer being updated only once every fifteen seconds, and the final alert will be similarly late. No user will miss such a "bug".

Now what option do I have?

1. I can ask the user to make the program setuid root so that it can use real-time scheduling, hoping that they have root privileges, and making every security-sensitive user raise an eyebrow.

2. I can ask the user to change the cgroup-wide timer slack value, hoping that they have root privileges, and making the whole system waste energy for the entire time before the user/admin remembers to reset the timer slack value, since wakeups are no longer coalesced as effectively as they could be.

3. I can stop sleeping altogether and instead use a busy loop at a very high nice level. That seems very drastic: it wastes a processor, wastes power, and pushes the system load to 1, but in a sense it is the best solution, because it only affects the system for as long as the stop watch runs and does not need root privileges.

How does that sound?
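
A minimal sketch of the countdown loop being described (show_remaining() is a hypothetical display routine):

#include <unistd.h>

extern void show_remaining(unsigned int seconds);    /* hypothetical UI call */

void countdown(unsigned int total_seconds)
{
    for (unsigned int left = total_seconds; left > 0; left--) {
        show_remaining(left);
        sleep(1);    /* 50µs of slack is invisible; seconds of slack are not */
    }
    show_remaining(0);    /* time for the alert */
}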

Are we really chasing the right issue?

Posted Mar 7, 2012 17:22 UTC (Wed) by mpr22 (subscriber, #60784) [Link]

4. Write your program with a client/daemon architecture. The daemon can be activated as root by the system's daemon-managing services, then drop its privileges once it has given itself a real-time scheduling class. The client connects to the daemon via a socket, then sits in a blocking read() waiting for the once-a-second heartbeat packets from the daemon. If the daemon doesn't currently have any clients, it can just sit in a blocking accept() call until one shows up.

Admittedly this stops people on machines they don't administer from installing and using your application. However, if the user isn't trusted to have administrative access to the system, they probably shouldn't be self-installing applications that require policy violations to work as expected anyway.

Are we really chasing the right issue?

Posted Mar 9, 2012 8:41 UTC (Fri) by Thomas (subscriber, #39963) [Link]

Using a timer?

Are we really chasing the right issue?

Posted Feb 24, 2012 10:26 UTC (Fri) by mpr22 (subscriber, #60784) [Link]

Who decided my sleep(1) could wait several seconds and not just the 1 I coded?

Linus, by virtue of deciding in 1991 that his new kernel would be an ordinary preemptively multitasking kernel, rather than something more exotic. sleep() has always had the property on Unix-like OSes that your process might sleep longer than you expect.

Are we really chasing the right issue?

Posted Mar 1, 2012 13:29 UTC (Thu) by slashdot (guest, #22014) [Link]

The timer slack must be at most around 1-10ms if the system is supposed to correctly run arbitrary software.


Copyright © 2012, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds