
Kernel development

Brief items

Kernel release status

The current 2.6 prepatch is still 2.6.21-rc7; the expected final 2.6.21 release has not happened as of this writing. Patches to fix regressions continue to accumulate in the mainline git repository.

There have been no -mm releases in the past week.

For older kernels: 2.6.16.49 was released on April 23 with a handful of fixes. Users of the 2.4 kernel can choose between 2.4.34.3 (April 22, networking fixes), 2.4.34.4 (fixes a build problem in 2.4.34.3), or 2.4.35-pre4 (April 22, various fixes).

Comments (none posted)

Kernel development news

Quotes of the week

So while with other, heuristic approaches we always had the problem of creating a "hyper-inflation" of an uneconomic virtual currency that could be freely printed by certain tasks, in CFS the economy of this is strict and the finegrained plus/minus balance is strictly managed by a conservative and independent central bank.
-- Ingo Molnar brings fiscal discipline to scheduling

We like it in the kernel, we find it to be warm and fuzzy. Whereas, user space is a cold, dark, and rainy place, and we just don't want to go there.
-- Matt Ranon

Comments (2 posted)

This week in the scheduling discussion

In last week's scheduler timeslice, Ingo Molnar had introduced his "completely fair scheduler" patch and Staircase Deadline scheduler author Con Kolivas had retreated in a bit of a sulk. Since then, Con has returned and posted several new revisions of the SD scheduler, but with little discussion. His intent, seemingly, is to raise the bar and ensure that whatever scheduler eventually replaces the current system is the best possible - a goal with which few could disagree.

Most of the discussion, though, has centered around the CFS scheduler. Several testers have reported good results, but others have noted some behavioral regressions. These problems, like most of the others over the years, involve the X Window System. So potential solutions are being discussed yet again.

The classic response to X interactivity problems is to renice the X server. But this solution seems like a bit of a hack to many, so scheduler work has often been aimed at eliminating the need to run X at a higher priority. Con Kolivas questions this goal:

The one fly in the ointment for linux remains X. I am still, to this moment, completely and utterly stunned at why everyone is trying to find increasingly complex unique ways to manage X when all it needs is more cpu. Now most of these are actually very good ideas about _extra_ features that would be desirable in the long run for linux, but given the ludicrous simplicity of renicing X I cannot fathom why people keep promoting these alternatives.

Avoiding renicing remains a goal of CFS, but it's interesting to see that the v4 CFS patch does renice X - automatically. More specifically, the scheduler bumps the priority level of any process performing hardware I/O (as seen by calls to ioperm() or iopl()), of the loop block device thread, and of worker threads associated with workqueues. With the X server automatically boosted (as a result of its iopl() use), it does tend to be more responsive.

While giving kernel threads a priority boost might make sense in the long term, Ingo sees renicing X as a temporary hack. The real solution to the problem seems to involve two different approaches: CPU credit transfers between processes and group scheduling.

Remember that, with the CFS scheduler, each process accumulates a certain amount of CPU time which is "owed" to it; this time is earned by waiting while others use the processor. This mechanism can enforce basic fairness between processes, in that each one gets something very close to an equal share of the available CPU time. Whether this calculation is truly "fair" depends on how one judges fairness; if the X server is seen as performing work for other processes, then fairness would call for X to share in the credit accumulated by those other processes. Linus has been pushing for a solution along these lines:

The "perfect" situation would be that when somebody goes to sleep, any extra points it had could be given to whoever it woke up last. Note that for something like X, it means that the points are 100% ephemeral: it gets points when a client sends it a request, but it would *lose* the points again when it sends the reply!

The CFS v5 patch has the beginnings of support for this mode of operation. Automatic transfers of credit are not there, but there is a new system call:

    long sched_yield_to(pid_t pid);

This call gives up the processor much like sched_yield(), but it also gives half of the yielding process's credit (if it has any) to the process identified by pid. This system call could be used by (for example) the X libraries as a way to explicitly transfer credit to the X server. There is currently no way for the X server to give back the credit it didn't use; Ingo has mentioned the notion of a sched_pay() system call for that purpose. There's also no way to ensure that X uses the credit for work done on the yielding process's behalf; it could just as easily squander it on wobbly window effects. But it's a step in the right direction.
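
For the curious, a user-space wrapper for this call might look like the sketch below. The syscall number shown is a placeholder - no glibc wrapper exists for an experimental call like this - so treat it as illustrative only:

    #include <sys/types.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Placeholder number; the real one would be assigned per
     * architecture in the CFS v5 patch. */
    #define __NR_sched_yield_to 9999

    static long sched_yield_to(pid_t pid)
    {
        /* Yield the CPU and donate half of the caller's accumulated
         * credit to the target process. */
        return syscall(__NR_sched_yield_to, pid);
    }

    /* An X client library might call it just before blocking on the
     * server's reply:  sched_yield_to(x_server_pid);  */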

A further step, in a highly prototypical form, is Ingo's scheduler economy patch. This mechanism allows kernel code to set up a scheduler "work account"; processes can then make deposits to and withdrawals from the account with:

    void sched_pay(struct sched_account *account);
    void sched_withdraw(struct sched_account *account);

At this point, deposits and withdrawals all involve a fixed amount of CPU time. The Unix-domain socket code has been modified to create one of these accounts associated with each socket. Any non-root process (X clients, for example) writing to a socket also makes a deposit into the work account; root-owned processes (the X server, in particular) withdraw from the account as they read messages. It's all a proof of concept; a real implementation would require a rather more sophisticated API. But the proof does show that X clients can convey some of their CPU credits to the server when processor time is scarce.
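
In outline, those socket hooks might work as in the following sketch; everything here beyond the two functions shown above - the account contents, the fixed credit amount, the hook names - is guesswork rather than code from the patch:

    /* Sketch of the socket-side use of the work account; the account
     * contents and fixed slice are assumptions. */
    struct sched_account { long balance; };

    #define SCHED_SLICE 1000  /* hypothetical fixed credit, in ns */

    void sched_pay(struct sched_account *a)      { a->balance += SCHED_SLICE; }
    void sched_withdraw(struct sched_account *a) { a->balance -= SCHED_SLICE; }

    /* A non-root client writing to a Unix-domain socket deposits
     * credit into the socket's account. */
    static void unix_write_hook(struct sched_account *a, int sender_is_root)
    {
        if (!sender_is_root)
            sched_pay(a);
    }

    /* A root-owned reader (the X server) withdraws it as it
     * consumes the request. */
    static void unix_read_hook(struct sched_account *a, int reader_is_root)
    {
        if (reader_is_root)
            sched_withdraw(a);
    }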

The other idea in circulation is per-user or group scheduling. Here, the idea is to fairly split CPU time between users instead of between processes. If one user is running a single text editor process when another starts a kernel build with make -j 100, the scheduler will have 101 processes all contending for the CPU. The current crop of fair schedulers will divide the processor evenly between all of them, allowing the kernel build to take over while the text editor must make do with less than 1% of the available CPU time. This situation may be just fine with kernel developers, but one can easily argue that the right split here would be to confine the kernel build to half of the available time while allowing the text editor to use the other half.
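
The arithmetic behind that argument is easy to check with a toy calculation - not scheduler code, just the division described above:

    #include <stdio.h>

    int main(void)
    {
        /* Per-process fairness: 101 tasks split the CPU evenly. */
        double per_process = 100.0 / 101;   /* each task's share, in % */

        /* Per-user fairness: each user gets 50%, split among tasks. */
        double editor   = 50.0 / 1;         /* the editor's share */
        double make_job = 50.0 / 100;       /* each of the 100 make jobs */

        printf("per-process: editor gets %.2f%%\n", per_process);
        printf("per-user:    editor gets %.2f%%, each make job %.2f%%\n",
               editor, make_job);
        return 0;
    }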

That is the essence of per-user scheduling. Among other things, it could ease the X interactivity problem: since X runs as a different user (root, normally), it will naturally end up in a separate scheduling group with its own fair share of the processor. Linus has been pushing hard for group scheduling as well (see the quote of last week). Ingo responds that group scheduling is on his mind - he just hasn't gotten around to it yet:

Firstly, i have not neglected the group scheduling related CFS regressions at all, mainly because there _is_ already a quick hack to check whether group scheduling would solve these regressions: renice. And it was tried in both of the two CFS regression cases i'm aware of: Mike's X starvation problem and Willy's "kevents starvation with thousands of scheddos tasks running" problem. And in both cases, applying the renice hack [which should be properly and automatically implemented as uid group scheduling] fixed the regression for them! So i was not worried at all, group scheduling _provably solves_ these CFS regressions. I rather concentrated on the CFS regressions that were much less clear.

In other words, the automatic renicing described above is not a permanent solution; instead, it's more of a proof of concept for group scheduling. Ingo goes on to say that there are a number of other important factors in getting interactive scheduling right; in particular, nanosecond accounting and a strict division of CPU time are needed. Once all of those details are right, one can start thinking about the group scheduling problem.

So there would appear to be some work yet to be done on the CFS scheduler. That will doubtless happen; meanwhile, however, Linus has complained that some of this effort may be misdirected at the moment:

Anyway, I'd ask people to look a bit at the current *regressions* instead of spending all their time on something that won't even be merged before 2.6.21 is released, and we thus have some more pressing issues. Please?

One might argue that any work which is intended for the upcoming 2.6.22 merge window needs to be pulled into shape now. But the replacement of the CPU scheduler is likely to take a little bit longer than that. Given the number of open questions - and the amount of confidence replacing the scheduler requires - any sort of movement for 2.6.22 seems unlikely.

Comments (14 posted)

Filesystems: chunkfs and reiser4

One of the fundamental problems facing filesystem developers is that, while disks are getting both larger and faster, the rate at which they are growing exceeds the rate at which they are speeding up. As a result, the time required to read an entire disk is growing. There is little joy in waiting for a filesystem checker to do its thing during a system reboot, so the prospect of ever-longer fsck delays is understandably lacking in appeal. Unfortunately, that is the direction in which things are going. Journaling filesystems can help avoid fsck, but only in situations where the filesystem has not suffered any sort of corruption.

Given that filesystem checks are something we have to deal with, it's worth thinking about how we might make them faster in the era of terabyte disks. One longstanding idea for improving the situation was recently posted in the form of chunkfs, "fs fission for faster fsck." The core idea is to take a filesystem and split it into several independent filesystems, each of which maintains its own clean/dirty state. Should things go wrong, only those sub-filesystems which were active at the time of failure need to be checked.

Like many experimental filesystem developments, chunkfs is built upon ext2. Internally, it is a series of separate ext2 filesystems which appear as a single filesystem to the layers above. Each chunk can be maintained independently by the filesystem code, but the individual chunks are not visible outside of the filesystem. The idea is relatively simple, though, as always, there are a few pesky details to work out.

One is that inode numbers in the larger chunkfs filesystem must be unique. Each chunk, however, maintains its own list of inodes starting with number one, so inode numbers will be reused from one chunk to the next. Chunkfs makes these numbers unique by putting the chunk number in the upper eight bits of every inode number. As a result, there is a maximum of 256 chunks in any chunkfs filesystem.
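
In code form, that scheme might look like this sketch, which assumes 32-bit inode numbers (so the chunk number occupies bits 24-31); the actual chunkfs macros may well differ:

    #include <stdint.h>

    /* Sketch of the chunkfs inode-numbering scheme described above. */
    #define CHUNK_SHIFT 24
    #define LOCAL_MASK  ((1u << CHUNK_SHIFT) - 1)

    static inline uint32_t chunkfs_make_ino(uint8_t chunk, uint32_t local)
    {
        /* Chunk number in the top 8 bits, per-chunk inode below. */
        return ((uint32_t)chunk << CHUNK_SHIFT) | (local & LOCAL_MASK);
    }

    static inline uint8_t chunkfs_ino_to_chunk(uint32_t ino)
    {
        return ino >> CHUNK_SHIFT;
    }

    static inline uint32_t chunkfs_ino_to_local(uint32_t ino)
    {
        return ino & LOCAL_MASK;
    }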

A trickier problem comes about when a file grows. The filesystem will try to allocate additional file blocks in the chunk where the file was originally created. Should that chunk fill up, however, something else needs to happen; it would not be good for the filesystem to return "no space" errors when free space does exist in other chunks. The answer here is the creation of a "continuation inode." These inodes track the allocation of blocks in a different chunk; they look much like files in their own right, but they are part of a larger array of block allocations. The "real" inode for a given file can have pointers to up to four continuation inodes in different chunks; if more are needed, each continuation inode can, itself, point to another four continuations. Thus, continuation inodes can be chained to create files of arbitrary length.
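
A purely illustrative layout for that chain might look like the following; the field names and sizes are invented, not taken from the patch:

    /* Illustrative data layout for the continuation scheme. */
    #define CHUNKFS_NUM_CONT 4

    struct cont_inode {
        unsigned int  chunk;            /* chunk holding these blocks */
        unsigned long block_map[12];    /* block allocations in that chunk */
        /* Up to four further continuations, each possibly living in
         * yet another chunk; chaining these gives arbitrary length. */
        struct cont_inode *cont[CHUNKFS_NUM_CONT];
    };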

This code is in a relatively early state; the text with the patch notes that "this is a preliminary implementation and lots of changes are expected before this code is even sanely usable." There is a set of tools which can be used by people who would like to test out chunkfs filesystems with well backed-up data. With some care and some testing, chunkfs may grow to the point that it's stable and shortening fsck times worldwide.

Meanwhile, one of the longest stories in Linux filesystem development has to be the reiser4 filesystem. By the time Hans Reiser first asked for the merging of reiser4 in July, 2003, the filesystem had been under development for some years. Almost four years have passed since then, and reiser4 remains outside of the mainline kernel. Hans Reiser is now out of the picture, his company (Namesys) is in trouble, and, to a casual observer, reiser4 appears not to be going anywhere.

There has been a recent increase in interest in this filesystem, though. It turns out that two Namesys employees are still working on the filesystem "mostly on enthusiasm." They have been feeding patches through to the -mm tree, and they are getting toward the end of their list of things to fix. So we might see a new push for inclusion of reiser4, perhaps as soon as 2.6.23. But, says Andrew Morton, some things would have to happen; in particular, there needs to be a new review of the reiser4 code.

To get it unstuck we'd need a general push, get people looking at and testing the code, get the vendors to have a serious think about it, etc. We could do that - it'd require that the namesys people (and I) start making threatening noises about merging it, I guess.

Or we could move all the reiser4 code into kernel/sched.c - that seems to get people fired up.

Your editor will go out on a limb and suggest that a mass move of the reiser4 code is unlikely. But a new round of talk on actually merging this filesystem is starting to look reasonably likely. There's enough work - and enough interesting ideas - in this code that people are unwilling to let it just fade away. Perhaps, soon, it will be heading for its long-sought spot in the mainline.

Comments (12 posted)

The suspend2 discussion resumes

One of the side discussions in the scheduler debate had to do with how the CFS scheduler broke the out-of-tree suspend2 suspend-to-disk code. Ingo Molnar, acting on the reports, found and fixed a bug in CFS. As a way of returning the favor, he then posted a review of the suspend2 code, noting that "the patch looks sane all around" and asking whether there were any plans to get suspend2 into the mainline kernel.

Perhaps Ingo wasn't listening the past few times this topic has been brought up. His question was music to suspend2 author Nigel Cunningham's ears; Nigel promptly responded with a lengthy "reasons to merge suspend2" document. Among many other things, he notes that the user-space software suspend implementation (uswsusp) is still running behind suspend2 in features. It is true that little has been heard from uswsusp in recent times; there has not been a release since last November, and uptake by distributors has been slow. But that didn't stop uswsusp hacker Pavel Machek from jumping in to say: "Well, current uswsusp code can do most of stuff suspend2 can do, with 20% (or so) of kernel code."

Those who followed the discussion one year ago when uswsusp was merged may remember that it triggered a debate on which functions can sensibly be moved out of the kernel to user space. Many developers thought that suspend-to-disk functionality was, perhaps, on the wrong side of that line. After this debate, the number of proposals for moving functionality out of the kernel fell significantly. People are still sensitive to the issue, though, as can be seen in this response from Linus:

This whole notion that "kernel lines of code" is somehow different is a stupid and idiotic _disease_ that is spread by microkernel people and people who have been brainwashed by them.

In a later, calmer moment he added:

This is why I don't believe in the whole kernel-line-counting thing. I'm personally 100% convinced that it's better to have ten times as many lines in the kernel, if it means that you can just forget about version skew and bad user-space interfaces etc.

This discussion should help to keep a lid on future "move kernel code to user space" projects. While there are certainly times when such moves make sense, there are also situations where putting functionality in user space just makes things harder. That said, one should not expect the recently-posted Kcli patch, intended to help move entire applications into the kernel, to get into the mainline anytime soon.

Meanwhile, what about suspend2? It is possible that the renewed discussion might provide some impetus for the merging of this longstanding development. Certainly suspend2 has a significant user community which would appreciate inclusion in the mainline. The amount of discussion has been relatively low, though. It may well be that enough systems now have working suspend-to-RAM support that the level of interest in suspend-to-disk is rather lower than it once was.

Comments (26 posted)

Patches and updates

Kernel trees

Adrian Bunk: Linux 2.6.16.49
Adrian Bunk: Linux 2.6.16.49-rc1
Willy Tarreau: Linux 2.4.35-pre3
Willy Tarreau: Linux 2.4.35-pre4
Willy Tarreau: Linux 2.4.34.4
Willy Tarreau: Linux 2.4.34.3

Architecture-specific

Ashok Raj: Intel IOMMU Support

Core kernel code

Development tools

Junio C Hamano: GIT 1.5.1.2

Device drivers

Filesystems and block I/O

Memory management

Christoph Lameter: Variable Order Page Cache
Andy Whitcroft: Lumpy Reclaim V6
Benjamin Herrenschmidt: Pass MAP_FIXED down to get_unmapped_area
Fengguang Wu: on-demand readahead

Networking

Security-related

Roberto De Ioris: UidBind LSM 0.1
Roberto De Ioris: UidBind LSM 0.2

Virtualization and containers

Miscellaneous

Page editor: Jonathan Corbet

