
This week in the scheduling discussion

In last week's scheduler timeslice, Ingo Molnar had introduced his "completely fair scheduler" patch and Staircase Deadline scheduler author Con Kolivas had retreated in a bit of a sulk. Since then, Con has returned and posted several new revisions of the SD scheduler, but with little discussion. His intent, seemingly, is to raise the bar and ensure that whatever scheduler does eventually replace the current system is the best possible - a goal which few should be able to disagree with.

Most of the discussion, though, has centered around the CFS scheduler. Several testers have reported good results, but others have noted some behavioral regressions. These problems, like most of the others over the years, involve the X Window System, so potential solutions are being discussed yet again.

The classic response to X interactivity problems is to renice the X server. But this solution seems like a bit of a hack to many, so scheduler work has often been aimed at eliminating the need to run X at a higher priority. Con Kolivas questions this goal:

The one fly in the ointment for linux remains X. I am still, to this moment, completely and utterly stunned at why everyone is trying to find increasingly complex unique ways to manage X when all it needs is more cpu. Now most of these are actually very good ideas about _extra_ features that would be desirable in the long run for linux, but given the ludicrous simplicity of renicing X I cannot fathom why people keep promoting these alternatives.

Avoiding renicing remains a goal of CFS, but it's interesting to see that the v4 CFS patch does renice X - automatically. More specifically, the scheduler bumps the priority level of any process performing hardware I/O (as indicated by calls to ioperm() or iopl()), of the loop block device thread, and of worker threads associated with workqueues. With the X server automatically boosted (as a result of its iopl() use), it does tend to be more responsive.

While giving kernel threads a priority boost might make sense in the long term, Ingo sees renicing X as a temporary hack. The real solution to the problem seems to involve two different approaches: CPU credit transfers between processes and group scheduling.

Remember that, with the CFS scheduler, each process accumulates a certain amount of CPU time which is "owed" to it; this time is earned by waiting while others use the processor. This mechanism can enforce basic fairness between processes, in that each one gets something very close to an equal share of the available CPU time. Whether this calculation is truly "fair" depends on how one judges fairness; if the X server is seen as performing work for other processes, then fairness would call for X to share in the credit accumulated by those other processes. Linus has been pushing for a solution along these lines:

The "perfect" situation would be that when somebody goes to sleep, any extra points it had could be given to whoever it woke up last. Note that for something like X, it means that the points are 100% ephemeral: it gets points when a client sends it a request, but it would *lose* the points again when it sends the reply!

The CFS v5 patch has the beginnings of support for this mode of operation. Automatic transfers of credit are not there, but there is a new system call:

    long sched_yield_to(pid_t pid);

This call gives up the processor much like sched_yield(), but it also gives half of the yielding process's credit (if it has any) to the process identified by pid. This system call could be used by (for example) the X libraries as a way to explicitly transfer credit to the X server. There is currently no way for the X server to give back the credit it didn't use; Ingo has mentioned the notion of a sched_pay() system call for that purpose. There's also no way to ensure that X uses the credit for work done on the yielding process's behalf; it could just as easily squander it on wobbly window effects. But it's a step in the right direction.

A further step, in a highly prototypical form, is Ingo's scheduler economy patch. This mechanism allows kernel code to set up a scheduler "work account"; processes can then make deposits to and withdrawals from the account with:

    void sched_pay(struct sched_account *account);
    void sched_withdraw(struct sched_account *account);

At this point, deposits and withdrawals all involve a fixed amount of CPU time. The Unix-domain socket code has been modified to create one of these accounts associated with each socket. Any non-root process (X clients, for example) writing to a socket also makes a deposit into the work account; root-owned processes (the X server, in particular) reading messages also withdraw from the account. It's all a proof of concept; a real implementation would require a rather more sophisticated API. But the proof does show that X clients can convey some of their CPU credits to the server when processor time is scarce.

The other idea in circulation is per-user or group scheduling. Here, the idea is to fairly split CPU time between users instead of between processes. If one user is running a single text editor process when another starts a kernel build with make -j 100, the scheduler will have 101 processes all contending for the CPU. The current crop of fair schedulers will divide the processor evenly between all of them, allowing the kernel build to take over while the text editor must make do with less than 1% of the available CPU time. This situation may be just fine with kernel developers, but one can easily argue that the right split here would be to confine the kernel build to half of the available time while allowing the text editor to use the other half.

That is the essence of per-user scheduling. Among other things, it could ease the X interactivity problem: since X runs as a different user (root, normally), it will naturally end up in a separate scheduling group with its own fair share of the processor. Linus has been pushing hard for group scheduling as well (see the quote of last week). Ingo responds that group scheduling is on his mind - he just hasn't gotten around to it yet:

Firstly, i have not neglected the group scheduling related CFS regressions at all, mainly because there _is_ already a quick hack to check whether group scheduling would solve these regressions: renice. And it was tried in both of the two CFS regression cases i'm aware of: Mike's X starvation problem and Willy's "kevents starvation with thousands of scheddos tasks running" problem. And in both cases, applying the renice hack [which should be properly and automatically implemented as uid group scheduling] fixed the regression for them! So i was not worried at all, group scheduling _provably solves_ these CFS regressions. I rather concentrated on the CFS regressions that were much less clear.

In other words, the automatic renicing described above is not a permanent solution; instead, it's more of a proof of concept for group scheduling. Ingo goes on to say that there are a number of other important factors in getting interactive scheduling right; in particular, nanosecond accounting and strict division of CPU time are needed. Once all of those details are right, one can start thinking about the group scheduling problem.

So there would appear to be some work yet to be done on the CFS scheduler. That will doubtless happen; meanwhile, however, Linus has complained that some of this effort may be misdirected at the moment:

Anyway, I'd ask people to look a bit at the current *regressions* instead of spending all their time on something that won't even be merged before 2.6.21 is released, and we thus have some more pressing issues. Please?

One might argue that any work which is intended for the upcoming 2.6.22 merge window needs to be pulled into shape now. But the replacement of the CPU scheduler is likely to take a little bit longer than that. Given the number of open questions - and the amount of confidence replacing the scheduler requires - any sort of movement for 2.6.22 seems unlikely.

Index entries for this article
Kernel/Interactivity
Kernel/Scheduler/Completely fair scheduler



NIH

Posted Apr 26, 2007 8:24 UTC (Thu) by zdzichu (subscriber, #17118) [Link] (2 responses)

All those technical discussions lack one (obvious?) observation. There are many Unix-like systems running X. Many of them are open source and/or documented in detail. And none of them is known for having X interactivity problems. How do those systems solve the problem? Do they renice X? Why has no developer checked, instead of developing new syscalls (pushing work to user space and making Linux more incompatible with other *nixes)?

NIH

Posted Apr 26, 2007 8:56 UTC (Thu) by nix (subscriber, #2304) [Link]

It's not just X. Most large systems involve one process doing work on behalf of another. (The elephant in the room, as ever, is database servers, which often have similar problems internally as well as between them and their client processes.)

Older Unixes often had *severe* X interactivity problems. X has long been famous for its jerky rendering (especially `jerky mouse syndrome') and CPU hogginess; the jerky mouse has gone away of late, but only by the cheat of running the mouse pointer at insanely high priority and because machines are now fast enough that moving a mouse pointer is cheap.

NIH

Posted May 3, 2007 5:53 UTC (Thu) by renox (guest, #23785) [Link]

The general-purpose OS with the best interactivity I know is/was BeOS.

I think that one of the reasons for the good interactivity was that a bigger part of the GUI was inside the kernel.

I doubt that you'd be able to convince kernel developers to do the same thing.

This week in the scheduling discussion

Posted Apr 26, 2007 8:59 UTC (Thu) by nix (subscriber, #2304) [Link] (3 responses)

This time-as-money analogy can be extended further. I can imagine a system with a sort of futures market in runtime, where a process that will need runtime in the future can ask for it, spending some of its present runtime to do so and trading off against other processes asking for the same thing. (Asking for time at specific instants would cost more.)

This might be useful for things like multimedia apps that can tell that they have a bunch of especially hefty decoding work coming up in five seconds, or something like that.

(But I'm just babbling and have no code...)

This week in the scheduling discussion

Posted Apr 26, 2007 9:24 UTC (Thu) by nowster (subscriber, #67) [Link] (2 responses)

This time-as-money analogy can be extended further. I can imagine a system which has a sort of futures market in runtime...

Could a "rogue trader" break the bank? :-)

This week in the scheduling discussion

Posted Apr 27, 2007 14:11 UTC (Fri) by nix (subscriber, #2304) [Link] (1 response)

I don't see how (assuming `breaking the bank' to equate to a DoS attack or something like that). If a process demands heaps of time in the future and doesn't use it, it'll sacrifice all its current time and end up with nothing (i.e. an idling system or other processes using the time instead). If it demands heaps of time and other things do as well, then it won't get that much time. No problem in either case.

HOWTO deny service

Posted Apr 30, 2007 5:55 UTC (Mon) by kbob (guest, #1770) [Link]

Consider an isochronous process like a real-time audio processor. It needs a time chunk every N milliseconds, so that's what it bids for.

A malicious process could outbid the audio process for a smaller timeslice right when the audio app needs it. Then it could busy-loop for just long enough that the audio app won't be able to finish on time. Denial of service.

On a longer time scale, the isochronous app might be a machine vision system, a CD burner, or a monthly accounting report.

The malicious process wouldn't intentionally be malicious, of course. It would just have an unfortunate self-scheduling policy.

The SD scheduler is heaven on many server loads

Posted Apr 26, 2007 12:05 UTC (Thu) by hmh (subscriber, #3838) [Link] (2 responses)

All I know is that SD seems to be all I ever wanted for regular (i.e. not HPC) servers.

The SD design never starves anything, ever. It has bound latency. CFS can't guarantee either as well as SD can. SD allows one to set the exact scheduling priority of everything and it is always respected, as there is no interactive renicing: it is very predictable.

And since one can also set the rr_interval (scheduling granularity), if the load is not latency-sensitive you can reduce its cost in context switches. It still has a few ways to go, but I hope to see it in server -ck soon.

And SD is O(1), too... CFS isn't.

As someone said in LKML (maybe Con?), if the window managers were not so dumb and would renice things easily and intelligently, it might even be ideal for desktops. Oh well :-)

The SD scheduler is heaven on many server loads

Posted Apr 26, 2007 12:11 UTC (Thu) by simlo (guest, #10866) [Link]

In practice there is not much difference between O(1) and O(log(number of tasks)). Especially since "1" = number of priorities in the system = 140, and log(number of tasks) <= log(2^64) = 64.

The SD scheduler is heaven on many server loads

Posted Jun 4, 2007 9:00 UTC (Mon) by mingo (guest, #31122) [Link]

The SD design never starves anything, ever. It has bound latency. CFS can't guarantee either as well as SD can.

The CFS design does not starve anything ever either and has bound latency. I'm wondering why you think that "CFS can't guarantee either as well".

This week in the scheduling discussion

Posted Apr 26, 2007 15:40 UTC (Thu) by pjones (subscriber, #31722) [Link] (1 responses)

There's also no way to ensure that X uses the credit for work done on the yielding process's behalf; it could just as easily squander it on wobbly window effects. But it's a step in the right direction.

I'm not so sure that "squandering" is a reasonable characterization of desktop effects such as "wobbly window[s]", nor that ensuring a strong coupling between credit yielded to X and work it performs is desirable.

If you've got the system configured to draw wobbly windows or other desktop effects, then any failure to draw your application's widgets in reasonable time or to wobble the windows correctly would represent the same failure. In either case, the screen looks bad and usability is degraded; the user suffers. There is no part of the screen that's more important to get right than any other. It must all look good and perform with sufficient speed and responsiveness for the user to have a favorable experience.

X scheduling fairness

Posted Apr 27, 2007 15:58 UTC (Fri) by giraffedata (guest, #1954) [Link]

There is no part of the screen that's more important to get right than any other.

What if you have a long compile job running in an xterm and are simultaneously browsing the web? Wouldn't you prefer that CPU time go to drawing the browser windows than updating the xterm?

With the proposed approximation to credit transfer, it's possible for the X server to use the browser's credit on behalf of the less urgent compile job.

And even if you're simplistically setting a goal of having each consumer process get the same amount of CPU time, it's possible things would work out so the compile job would get more CPU time than the browser.

CPU vs disk schedulers

Posted May 3, 2007 8:47 UTC (Thu) by eduperez (guest, #11232) [Link] (1 responses)

While I really appreciate any effort put into improving Linux in general (and CPU schedulers in particular), I must say that I have never experienced X being sluggish because other processes were starving it of CPU cycles. However, processes doing intensive disk work tend to have a greater effect on responsiveness; bring some swapping to the table and the game is over...

CPU vs disk schedulers

Posted May 3, 2007 12:44 UTC (Thu) by boniek (guest, #45061) [Link]

Agreed. CFQ needs more work so it can schedule writes. Glory to the developer that does this :)


Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds