
The offline scheduler

By Jake Edge
September 2, 2009

One of the primary functions of any kernel is to manage the CPU resources of the hardware that it is running on. A recent patch proposed by Raz Ben-Yehuda would change that by taking one or more CPUs out from under the kernel's control, so that processes could run undisturbed on those processors. The "offline scheduler", as Ben-Yehuda calls his patch, had some rough sailing in the initial reactions to the idea, but as the thread on linux-kernel evolved, kernel hackers stepped back and looked at the problems it is trying to solve—and came up with other potential solutions.

The basic idea behind the offline scheduler is fairly straightforward: use the CPU hot-unplug facility to remove the processor from the system, but instead of halting the processor, allow other code to be run on it. Because the processor would not be participating in the various CPU synchronization schemes (RCU, spinlocks, etc.), nor would it be handling interrupts, it can completely devote its attention to the code that it is running. The idea is that code running on the offline processor would not suffer from any kernel-introduced latencies at all.

The core patch is fairly small. It provides an interface to register a function to be called when a particular CPU is taken offline:

    int register_offsched(void (*offsched_callback)(void), int cpuid);
This registers a callback that will be made when the CPU with the given cpuid is taken offline (i.e. hot unplugged). Typically, a user would load a module that calls register_offsched(), then take the CPU offline, which triggers the callback on the just-offlined CPU. When the processing completes and the callback returns, the processor is halted. At that point, the CPU can be brought back online and returned to the kernel's control.
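
Only the register_offsched() prototype above comes from the patch; the module boilerplate, the choice of CPU 2, and the callback body in the following sketch are illustrative:

    #include <linux/init.h>
    #include <linux/kernel.h>
    #include <linux/module.h>

    /* Provided by the offline-scheduler patch; the header declaring it
     * is not shown in the article, so declare the prototype directly. */
    extern int register_offsched(void (*offsched_callback)(void), int cpuid);

    /* Runs on the offlined CPU, free of interrupts and the scheduler
     * tick.  When this function returns, the CPU is halted. */
    static void my_offsched_work(void)
    {
            /* ... dedicated number crunching, packet polling, etc. ... */
    }

    static int __init offsched_example_init(void)
    {
            /* Ask for my_offsched_work() to run on CPU 2 once that CPU is
             * taken offline, e.g. with:
             *     echo 0 > /sys/devices/system/cpu/cpu2/online
             */
            return register_offsched(my_offsched_work, 2);
    }
    module_init(offsched_example_init);

    MODULE_LICENSE("GPL");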

The interface points to one of the problems that potential users of the offline scheduler have brought up: one can only run kernel-context, and not user-space, code using the facility. Because many of the applications that might benefit from having the full attention of a CPU are existing user-space programs, making the switch to in-kernel code is seen as problematic.

Ben-Yehuda notes that the isolated processor has "access to every piece of memory in the system" and the kernel would still have access to any memory that the isolated processor is using. He sees that as a benefit, but others, particularly Mike Galbraith, see it differently:

I personally find the concept of injecting an RTOS into a general purpose OS with no isolation to be alien. Intriguing, but very very alien.

One of the main problems that some kernel hackers see with the offline scheduler approach is that it bypasses Linux entirely. That is, of course, the entire point of the patch: devoting 100% of a CPU to a particular job. As Christoph Lameter puts it:

OFFSCHED takes the OS noise (interrupts, timers, RCU, cacheline stealing etc etc) out of certain processors. You cannot run an undisturbed piece of software on the OS right now.

Peter Zijlstra, though, sees that as a major negative: "Going around the kernel doesn't benefit anybody, least of all Linux." There are existing ways to do the same thing, so adding one into the kernel adds no benefit, he says:

So its the concept of running stuff on a CPU outside of Linux that I don't like. I mean, if you want that, go ahead and run RTLinux, RTAI, L4-Linux etc.. lots of special non-Linux hypervisor/exo-kernel like things around for you to run things outside Linux with.

But, Ben-Yehuda sees multiple applications for processors dedicated to specific tasks. He envisions a different kind of system, which he calls a Service Oriented System (SOS), where the kernel is just one component, and if the kernel "disturbs" a specific service, it should be moved out of the way:

What i am suggesting is merely a different approach of how to handle multiple core systems. instead of thinking in processes, threads and so on i am thinking in services. Why not take a processor and define this processor to do just firewalling ? encryption ? routing ? transmission ? video processing... and so on...

Moving the kernel out of the way is not particularly popular with many kernel hackers. But the idea of completely dedicating a processor to a specific task is important to some users. In the high performance computing (HPC) world, multiple processors spend most of their time working on a single, typically number-crunching, task. Removing even minimal interruptions, those that perform scheduling and other kernel housekeeping tasks, leads to better overall performance. Essentially, those users want the convenience of Linux running on one CPU, while the rest of the system's CPUs are devoted to their particular application.

After a somewhat heated digression about generally reducing latencies in the kernel, Andrew Morton asked for a problem statement: "All I've seen is 'I want 100% access to a CPU'. That's not a problem statement - it's an implementation." In answer, Chris Friesen described one possible application:

In our case the problem statement was that we had an inherently single-threaded emulator app that we wanted to push as hard as absolutely possible.

We gave it as close to a whole cpu as we could using cpu and irq affinity and we used message queues in shared memory to allow another cpu to handle I/O. In our case we still had kernel threads running on the app cpu, but if we'd had a straightforward way to avoid them we would have used it.
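
The affinity half of that setup uses standard interfaces; interrupt steering is done separately, as root, by writing CPU masks to /proc/irq/<n>/smp_affinity. A minimal sketch of the process-pinning part (the CPU number is arbitrary) might look like:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
            cpu_set_t mask;

            /* Pin this process to CPU 3, the CPU dedicated to the app. */
            CPU_ZERO(&mask);
            CPU_SET(3, &mask);
            if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
                    perror("sched_setaffinity");
                    return 1;
            }

            /*
             * The emulator's main loop would run here, exchanging work
             * with the I/O CPU through message queues in shared memory.
             */
            return 0;
    }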

That led Thomas Gleixner to consider an alternative approach. He restated the problem as: "Run exactly one thread on a dedicated CPU w/o any disturbance by the scheduler tick." Given that definition, he suggested a fairly simple approach:

All you need is a way to tell the kernel that CPUx can switch off the scheduler tick when only one thread is running and that very thread is running in user space. Once another thread arrives on that CPU or the single thread enters the kernel for a blocking syscall the scheduler tick has to be restarted.

Gregory Haskins then suggested modifying the FIFO scheduler class, or creating a new class with a higher priority, so that it disables the scheduler tick. That would incorporate Gleixner's idea into the existing scheduling framework. As might be guessed, there are still some details to work out on running a process without the scheduler tick, but Gleixner and others think it is something that can be done.

The offline scheduler itself kind of fell by the wayside in the discussion. Ben-Yehuda, unsurprisingly, is still pushing his approach, but aside from the distaste expressed about circumventing the kernel, the inability to run user-space code is problematic. Gleixner was fairly blunt about it:

I was talking about the problem that you cannot run an ordinary user space task on your offlined CPU. That's the main point where the design sucks. Having specialized programming environments which impose tight restrictions on the application programmer for no good reason are horrible.

Others are also thinking about the problem, as a similar idea to Gleixner's was recently posted by Josh Triplett in an RFC to linux-kernel. Triplett's tiny patch simply disables the timer tick permanently as a demonstration of the gain in performance that can be achieved for CPU-bound processes. He notes that the overhead for the timer tick can be significant:

On my system, the timer tick takes about 80us, every 1/HZ seconds; that represents a significant overhead. 80us out of every 1ms, for instance, means 8% overhead. Furthermore, the time taken varies, and the timer interrupts lead to jitter in the performance of the number crunching.

Triplett warns that his patch "by no means represents a complete solution" in that it breaks RCU, process accounting, and other things. But it does boot and can run his tests. He has fixes for some of those problems in progress, as well as an overall goal: "I'd like to work towards a patch which really can kill off the timer tick, making the kernel entirely event-driven and removing the polling that occurs in the timer tick. I've reviewed everything the timer tick does, and every last bit of it could occur using an event-driven approach."

It is pretty unlikely that we will see the offline scheduler ever make it into the mainline, but the idea behind it has spawned some interesting discussions that may lead to a solution for those looking to eliminate kernel overhead on some CPUs. In many ways, it is another example of the perils of developing kernel code in isolation. Had Ben-Yehuda been working in the open, and looking for comments from the kernel community, he might have realized that his approach would not be acceptable—at least for the mainline—much sooner.



The offline scheduler

Posted Sep 3, 2009 12:54 UTC (Thu) by xoddam (subscriber, #2322) [Link]

> Had Ben-Yehuda been working in the open, and looking for comments
> from the kernel community, he might have realized that his
> approach would not be acceptable — at least for the mainline —
> much sooner.

He's been posting on this subject on LKML since October of last year

http://lkml.org/lkml/2008/10/17/516

and he came *here* in February.

http://lwn.net/Articles/319911/

He got very little in the way of comments (kudos to the few who engaged) but ploughed on with the technical work regardless. Only now has the discussion reached the point where 'prominent' scheduler hackers are offering much more comment than "why would you want to do that?" and realising that there is a genuine need which this hack is an attempt to address.

Ben-Yehuda is like a CPU-bound Con Kolivas with an extra language barrier.

Dynamic scheduler tick

Posted Sep 3, 2009 13:29 UTC (Thu) by mingo (subscriber, #31122) [Link]

> Only now has the discussion reached the point where 'prominent' scheduler hackers are offering much more comment than "why would you want to do that?" and realising that there is a genuine need which this hack is an attempt to address.

As the article mentioned, the crux of the issue is a dynamic (not HZ driven) scheduler tick.

If you followed scheduler development you might have noticed that this (a dynamic scheduler tick) was implemented 1.5 years ago by Peter Zijlstra (who is the other scheduler maintainer in addition to myself).

For details, see this upstream commit:
    commit 8f4d37ec073c17e2d4aa8851df5837d798606d6f
    Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Date:   Fri Jan 25 21:08:29 2008 +0100

        sched: high-res preemption tick

It was released in the v2.6.26 kernel iirc.

Nobody was really interested in it though, and it had stability problems, so it's currently disabled. It's a nice feature, and completing it would speed up _all_ applications which are currently interrupted HZ times a second.

So not only did the scheduler maintainers recognize this problem years ago, they also implemented a rough prototype solution and tried to productize it. Given enough interest in this topic, it could be finished - most of the code is still there.

So i'm with Thomas on this one: the 'offline scheduler' is on the wrong track in its current form and we can do better than that. The scheduler maintainers (have to) insist on things to be implemented correctly and cleanly so that as many Linux applications can benefit from the end result as possible - not just the proprietary code Ben-Yehuda claimed the 'offline scheduler' was designed for.

Dynamic scheduler tick

Posted Sep 4, 2009 14:33 UTC (Fri) by razb (guest, #43424) [Link]

Hello Mingo

1. The offline scheduler is about treating a processor as a device; this is why I am offloading it. I have compared several partitioning systems in my essay: CPU sets, INtime, and IBM partitions. I did not compare it to dynticks because dynticks is simply a different matter.

2. The offline scheduler has other features that monitor (RTOP) and protect the kernel (offline firewall) when it is not possible.

http://sos-linux.svn.sourceforge.net/viewvc/sos-linux/off...

thank you
raz

Dynamic scheduler tick

Posted Sep 4, 2009 15:30 UTC (Fri) by mingo (subscriber, #31122) [Link]

> Hello Mingo
>
> 1. The offline scheduler is about treating a processor as a device; this is why I am offloading it. I have compared several partitioning systems in my essay: CPU sets, INtime, and IBM partitions. I did not compare it to dynticks because dynticks is simply a different matter.

The "offline scheduler" is, as you say, a CPU partitioning scheme.

Our (oft repeated) point is that Linux already has a CPU partitioning scheme: cpusets. It can be configured dynamically and will isolate one (or more) CPUs just fine.

This cpusets scheduler feature was added to the Linux kernel 4.5 years ago, in 2005, and was released as part of the v2.6.12 kernel. It has been part of Linux ever since - continuously fixed/updated/enhanced.
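
Roughly, dedicating a CPU with cpusets looks like this - a sketch only, assuming the cpuset filesystem is mounted at /dev/cpuset and using CPU 3 and memory node 0 as examples:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>

    static void write_str(const char *path, const char *val)
    {
            FILE *f = fopen(path, "w");

            if (!f || fputs(val, f) == EOF) {
                    perror(path);
                    exit(1);
            }
            fclose(f);
    }

    int main(int argc, char **argv)
    {
            if (argc < 2) {
                    fprintf(stderr, "usage: %s <pid>\n", argv[0]);
                    return 1;
            }

            mkdir("/dev/cpuset/dedicated", 0755);
            write_str("/dev/cpuset/dedicated/cpus", "3");  /* CPU to isolate */
            write_str("/dev/cpuset/dedicated/mems", "0");  /* memory node   */
            write_str("/dev/cpuset/dedicated/cpu_exclusive", "1");
            /* split scheduler domains along cpuset boundaries */
            write_str("/dev/cpuset/sched_load_balance", "0");
            /* move the target task into the dedicated set; a complete setup
             * would also confine everything else to a sibling set that
             * excludes CPU 3 */
            write_str("/dev/cpuset/dedicated/tasks", argv[1]);
            return 0;
    }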

If cpusets as implemented today does not fit your needs then the (upstream acceptable) solution is not to add a completely different facility with its extra layering, but to fix the currently existing one.

That will benefit all current cpusets users as well, beyond just enabling the use cases you are interested in.

A new facility is only added if the old one is unfixable. That has not been outlined here - it has not even been argued to be unfixable. [If that is proven then the new facility will simply replace the old (broken) one.]

This is really how the Linux kernel is developed - and always was. We try to avoid reinventing the wheel and we try to avoid duplicate functionality in the core kernel as much as possible. This is what is happening here too.

It sure does mean extra work and requires willingness to work with existing upstream facilities.

Duplicate/overlapping functionality quickly becomes a mess to users and is unmaintainable as well in the long run due to the increased complexity. We try to avoid such overlap and duplication as much as possible.

The lkml discussions with you stalled because you basically only repeated your arguments why you'd want to have the offline scheduler (which in itself is fine) - without showing much interest in improving existing kernel facilities or showing that they are unfixable (which is not fine if you want to enhance the upstream kernel).

Anyway, there's lots of possibilities how to continue this on the technical level. Everyone agrees that undisturbed CPU cores are desirable, so if you (or someone else) implements it correctly it will be accepted upstream - and gladly so. The job of a maintainer (like me) is to say 'no' to patches that are (not yet) good enough technically.

Thanks,

Ingo

Dynamic scheduler tick

Posted Sep 4, 2009 20:46 UTC (Fri) by razb (guest, #43424) [Link]

Hello again Ingo

Well, I understand your arguments and agree with the "upstream" consideration. The offline scheduler approach is aggressive. When I offlined NAPI, I had to do some rewriting in dev.c.

> The lkml discussions with you stalled because you basically only repeated your arguments why you'd want to have the offline scheduler (which in itself is fine) - without showing much interest in improving existing kernel facilities or showing that they are unfixable (which is not fine if you want to enhance the upstream kernel).

In the case of cpusets, I argue that cpusets do not provide complete partitioning. Meaning, I cannot ask for a packet from a 10Gbps interface to be moved to processor X and another packet from the same 10Gbps interface to be moved to processor Y. Why should a Flash video packet be moved to processor 7 if processor 7 is heavily busy with incoming FTP traffic?

To the best of my knowledge, a NAPI context is triggered by the first packet, which can arrive on any processor "in the affinity".

But this is possible by offlining NAPI: just route packets by their service type, not by IRQ masking. And who cares about cache misses if I have an entire processor to do that work?

But you are correct that I haven't replied with technical details; I just posted the link to the essay.

What is the correct way to isolate a processor? What are the restrictions? What are the requirements?

Raz

Dynamic scheduler tick

Posted Sep 4, 2009 21:07 UTC (Fri) by mingo (subscriber, #31122) [Link]

> [...] In the case of cpusets, I argue that cpusets do not provide complete partitioning. [...]

Obviously they do not, as otherwise you would not have implemented your patch.

My point, which i outlined in more detail in my reply above, is that there are two possible approaches that are acceptable for upstreaming:

- either extend and fix cpusets with the features you desire

- or prove/show that that's impossible or undesirable. (in which case your solution will have to replace cpusets, cover all its usecases, migrate all its APIs and users smoothly, etc., etc.)

You took a third approach: "I added it as a new, separate, special-purpose feature, not integrated with existing cpusets facilities because it was the easiest for me that way".

That is the ... short-term easy but long-term expensive answer which people on lkml objected to for good reasons. We've been there, we've done that, we are still suffering the consequences ;-)

Linux is an 18+ year old kernel; there are not that many easy projects left in it anymore :-/ Core kernel features that look basic and which are not in Linux yet often turn out to be not that simple.

I hope this explains our point of view. We can continue this discussion on lkml - i'm very interested in extensions to cpusets, and Peter Zijlstra outlined models for integrating IRQ space partitioning into the cpusets model (he called them system-sets). He sent a few prototype patches to lkml as well - early 2008 IIRC. Those could be picked up and finished, if you are interested.

Thanks,

Ingo

Dynamic scheduler tick

Posted Sep 15, 2009 8:21 UTC (Tue) by linuxrocks123 (guest, #34648) [Link]

I've been using dynamic tick for over a year. I just checked and I have the kernel option for it enabled in my 2.6.29.6 kernel. When was it disabled? Will you bring it back?

The offline scheduler

Posted Sep 3, 2009 22:06 UTC (Thu) by razb (guest, #43424) [Link]

:) I simply decided to stay low. I did not know it would irritate so many people.

raz

The offline scheduler

Posted Sep 4, 2009 15:36 UTC (Fri) by mingo (subscriber, #31122) [Link]

As far as i'm concerned the patches do not irritate me - why should they? (I don't find them upstream acceptable in their current form, but hey, most of my own feature patches are not acceptable in their initial form either ;-)

"Staying low" is the worst possibly strategy if you want to improve the upstream kernel. Engaging in the process and listening to upstream feedback and acting on suggestions is important.

The offline scheduler

Posted Sep 10, 2009 4:10 UTC (Thu) by chojrak11 (guest, #52056) [Link]

Interesting topic, it reminds me of this thread:
http://lists.freebsd.org/pipermail/freebsd-performance/20...

The offline scheduler

Posted Sep 13, 2009 5:31 UTC (Sun) by zenaan (subscriber, #3778) [Link]

Can someone please refer us to the "somewhat heated digression about generally reducing latencies in the kernel"?

:)

Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds