Kernel development
Brief items
Kernel release status
The current 2.6 prepatch is still 2.6.21-rc7; the expected final 2.6.21 release has not happened as of this writing. Patches to fix regressions continue to accumulate in the mainline git repository.

There have been no -mm releases in the past week.
For older kernels: 2.6.16.49 was released on April 23 with a handful of fixes. Users of the 2.4 kernel can choose among 2.4.34.3 (April 22, networking fixes), 2.4.34.4 (which fixes a build problem in 2.4.34.3), and 2.4.35-pre4 (April 22, various fixes).
Kernel development news
Quotes of the week
This week in the scheduling discussion
In last week's scheduler timeslice, Ingo Molnar had introduced his "completely fair scheduler" patch and Staircase Deadline scheduler author Con Kolivas had retreated in a bit of a sulk. Since then, Con has returned and posted several new revisions of the SD scheduler, though with little discussion. His intent, seemingly, is to raise the bar and ensure that whatever scheduler eventually replaces the current system is the best possible - a goal with which few should be able to disagree.

Most of the discussion, though, has centered around the CFS scheduler. Several testers have reported good results, but others have noted some behavioral regressions. These problems, like most of the others over the years, involve the X Window System, so potential solutions are being discussed yet again.
The classic response to X interactivity problems is to renice the X server. But this solution seems like a bit of a hack to many, so scheduler work has often been aimed at eliminating the need to run X at a higher priority. Con Kolivas questions this goal:
Avoiding renicing remains a goal of CFS, but it's interesting to see that the v4 CFS patch does renice X - automatically. More specifically, the scheduler bumps the priority level of any process performing hardware I/O (as seen by calls to ioperm() or iopl()), the loop block device thread, and worker threads associated with workqueues. With the X server automatically boosted (as a result of its iopl() use), it does tend to be more responsive.
While giving kernel threads a priority boost might make sense in the long term, Ingo sees renicing X as a temporary hack. The real solution to the problem seems to involve two different approaches: CPU credit transfers between processes and group scheduling.
Remember that, with the CFS scheduler, each process accumulates a certain amount of CPU time which is "owed" to it; this time is earned by waiting while others use the processor. This mechanism can enforce basic fairness between processes, in that each one gets something very close to an equal share of the available CPU time. Whether this calculation is truly "fair" depends on how one judges fairness; if the X server is seen as performing work for other processes, then fairness would call for X to share in the credit accumulated by those other processes. Linus has been pushing for a solution along these lines:
The CFS v5 patch has the beginnings of support for this mode of operation. Automatic transfers of credit are not there, but there is a new system call:
long sched_yield_to(pid_t pid);
This call gives up the processor much like sched_yield(), but it also gives half of the yielding process's credit (if it has any) to the process identified by pid. This system call could be used by (for example) the X libraries as a way to explicitly transfer credit to the X server. There is currently no way for the X server to give back the credit it didn't use; Ingo has mentioned the notion of a sched_pay() system call for that purpose. There's also no way to ensure that X uses the credit for work done on the yielding process's behalf; it could just as easily squander it on wobbly window effects. But it's a step in the right direction.
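For the curious, here is a rough sketch of how a client library might invoke such a call. This is purely illustrative: a new system call like this would have no glibc wrapper, so it would be reached via syscall(), and the syscall number used below is an invented placeholder, not the number used in the CFS patches.

/*
 * Illustrative sketch only: donating CPU credit to the X server with
 * the proposed sched_yield_to() call. The syscall number is a made-up
 * placeholder; the real patch may assign a different one.
 */
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#define __NR_sched_yield_to 4242   /* placeholder, not a real number */

static long sched_yield_to(pid_t pid)
{
    return syscall(__NR_sched_yield_to, pid);
}

/* Before blocking for a reply, hand half of our accumulated credit
 * to the server so it can do our work sooner. */
static void wait_for_server(pid_t x_server_pid)
{
    sched_yield_to(x_server_pid);
    /* ... then block on the server socket as usual ... */
}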
A further step, in a highly prototypical form, is Ingo's scheduler economy patch. This mechanism allows kernel code to set up a scheduler "work account"; processes can then make deposits to and withdrawals from the account with:
void sched_pay(struct sched_account *account);
void sched_withdraw(struct sched_account *account);
At this point, deposits and withdrawals all involve a fixed amount of CPU time. The Unix-domain socket code has been modified to create one of these accounts associated with each socket. Any non-root process (X clients, for example) writing to a socket will also make a deposit into the work account; root-owned processes (the X server, in particular) reading messages also withdraw from the account. It's all a proof of concept; a real implementation would require a rather more sophisticated API. But the proof does show that X clients can convey some of their CPU credits to the server when processor time is scarce.
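A minimal userspace model of that bookkeeping might look like the following; the fixed quantum and all of the names here are assumptions made for illustration, not code taken from Ingo's patch.

/*
 * Toy model of the "work account": clients deposit a fixed quantum
 * of CPU time, the server withdraws it. Illustrative only.
 */
#include <stdio.h>

#define QUANTUM_NS 1000000LL          /* assumed: 1ms per transaction */

struct sched_account {
    long long balance_ns;             /* CPU time held by the account */
};

/* A client deposits on writing to the socket. */
static void account_pay(struct sched_account *a)
{
    a->balance_ns += QUANTUM_NS;
}

/* The server withdraws on reading from the socket. */
static void account_withdraw(struct sched_account *a)
{
    if (a->balance_ns >= QUANTUM_NS)
        a->balance_ns -= QUANTUM_NS;  /* credited to the withdrawer */
}

int main(void)
{
    struct sched_account sock_acct = { 0 };

    account_pay(&sock_acct);          /* three clients send requests */
    account_pay(&sock_acct);
    account_pay(&sock_acct);
    account_withdraw(&sock_acct);     /* the server handles one */

    printf("credit left in account: %lld ns\n", sock_acct.balance_ns);
    return 0;
}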
The other idea in circulation is per-user or group scheduling. Here, the idea is to fairly split CPU time between users instead of between processes. If one user is running a single text editor process when another starts a kernel build with make -j 100, the scheduler will have 101 processes all contending for the CPU. The current crop of fair schedulers will divide the processor evenly between all of them, allowing the kernel build to take over while the text editor must make do with less than 1% of the available CPU time. This situation may be just fine with kernel developers, but one can easily argue that the right split here would be to confine the kernel build to half of the available time while allowing the text editor to use the other half.
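The arithmetic is easy to see in a toy model (purely illustrative, not code from any scheduler patch): divide the CPU evenly among users first, then among each user's runnable processes.

/*
 * Toy model of per-user CPU splitting: the editor's owner gets 50%
 * for one process; the build's owner gets 50% spread over 100.
 */
#include <stdio.h>

int main(void)
{
    const int nusers = 2;
    const int procs_per_user[] = { 1, 100 };  /* editor vs. make -j 100 */

    for (int u = 0; u < nusers; u++) {
        double user_share = 100.0 / nusers;             /* 50% per user */
        double per_process = user_share / procs_per_user[u];
        printf("user %d: %d process(es), %.2f%% CPU each\n",
               u, procs_per_user[u], per_process);
    }
    return 0;
}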
That is the essence of per-user scheduling. Among other things, it could ease the X interactivity problem: since X runs as a different user (root, normally), it will naturally end up in a separate scheduling group with its own fair share of the processor. Linus has been pushing hard for group scheduling as well (see the quote of last week). Ingo responds that group scheduling is on his mind - he just hasn't gotten around to it yet:
In other words, the automatic renicing described above is not a permanent solution; instead, it's more of a proof of concept for group scheduling. Ingo goes on to say that there are a number of other important factors in getting interactive scheduling right; in particular, nanosecond accounting and strict division of CPU time are needed. Once all of those details are right, one can start thinking about the group scheduling problem.
So there would appear to be some work yet to be done on the CFS scheduler. That will doubtless happen; meanwhile, however, Linus has complained that some of this effort may be misdirected at the moment:
One might argue that any work which is intended for the upcoming 2.6.22 merge window needs to be pulled into shape now. But the replacement of the CPU scheduler is likely to take a little bit longer than that. Given the number of open questions - and the amount of confidence replacing the scheduler requires - any sort of movement for 2.6.22 seems unlikely.
Filesystems: chunkfs and reiser4
One of the fundamental problems facing filesystem developers is that, while disks are getting both larger and faster, the rate at which they are growing exceeds the rate at which they are speeding up. As a result, the time required to read an entire disk is growing. There is little joy in waiting for a filesystem checker to do its thing during a system reboot, so the prospect of ever-longer fsck delays is understandably lacking in appeal. Unfortunately, that is the direction in which things are going. Journaling filesystems can help avoid fsck, but only in situations where the filesystem has not suffered any sort of corruption.

Given that filesystem checks are something we have to deal with, it's worth thinking about how we might make them faster in the era of terabyte disks. One longstanding idea for improving the situation was recently posted in the form of chunkfs, "fs fission for faster fsck." The core idea is to take a filesystem and split it into several independent filesystems, each of which maintains its own clean/dirty state. Should things go wrong, only those sub-filesystems which were active at the time of failure need to be checked.
Like many experimental filesystem developments, chunkfs is built upon ext2. Internally, it is a series of separate ext2 filesystems which look like a single system to the higher layers of the filesystem. Each chunk can be maintained independently by the filesystem code, but the individual chunks are not visible outside of the filesystem. The idea is relatively simple, though, as always, there are a few pesky details to work out.
One is that inode numbers in the larger chunkfs filesystem must be unique. Each chunk, however, maintains its own list of inodes starting with number one, so inode numbers will be reused from one chunk to the next. Chunkfs makes these numbers unique by putting the chunk number in the upper eight bits of every inode number. As a result, there is a maximum of 256 chunks in any chunkfs filesystem.
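That encoding is simple to sketch; the helper names and the assumption of a 32-bit inode number below are illustrative, not the patch's actual definitions, but the eight-bit chunk field is what produces the 256-chunk limit.

/*
 * Sketch of chunkfs inode numbering: the chunk number occupies the
 * top eight bits of a (here assumed 32-bit) inode number.
 */
#include <stdint.h>

#define CHUNK_SHIFT 24                          /* low 24 bits: per-chunk inode */
#define LOCAL_MASK  ((1u << CHUNK_SHIFT) - 1)

static inline uint32_t chunkfs_make_ino(uint8_t chunk, uint32_t local)
{
    return ((uint32_t)chunk << CHUNK_SHIFT) | (local & LOCAL_MASK);
}

static inline uint8_t chunkfs_ino_chunk(uint32_t ino)
{
    return ino >> CHUNK_SHIFT;                  /* which chunk owns it */
}

static inline uint32_t chunkfs_ino_local(uint32_t ino)
{
    return ino & LOCAL_MASK;                    /* inode within that chunk */
}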
A trickier problem comes about when a file grows. The filesystem will try to allocate additional file blocks in the chunk where the file was originally created. Should that chunk fill up, however, something else needs to happen; it would not be good for the filesystem to return "no space" errors when free space does exist in other chunks. The answer here is the creation of a "continuation inode." These inodes track the allocation of blocks in a different chunk; they look much like files in their own right, but they are part of a larger array of block allocations. The "real" inode for a given file can have pointers to up to four continuation inodes in different chunks; if more are needed, each continuation inode can, itself, point to another four continuations. Thus, continuation inodes can be chained to create files of arbitrary length.
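The chaining might be represented with something like the structure below; all of the names are invented for this sketch and are not the patch's actual data structures.

/*
 * Illustrative layout for continuation chaining: a file's inode
 * carries up to four pointers to continuation inodes in other
 * chunks, and each continuation can point onward to four more,
 * so the chain can cover any number of chunks.
 */
#include <stdint.h>

#define CHUNKFS_NCONT 4

struct chunkfs_cont {
    uint32_t cont_ino[CHUNKFS_NCONT];   /* continuation inodes in other
                                           chunks; 0 means unused */
    /* ...plus ordinary ext2-style block pointers for the blocks
       allocated in this continuation's own chunk... */
};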
This code is in a relatively early state; the text with the patch notes that "this is a preliminary implementation and lots of changes are expected before this code is even sanely usable." There is a set of tools which can be used by people who would like to test out chunkfs filesystems with well backed-up data. With some care and some testing, chunkfs may grow to the point that it's stable and shortening fsck times worldwide.
Meanwhile, one of the longest-running stories in Linux filesystem development has to be the reiser4 filesystem. By the time Hans Reiser first asked for the merging of reiser4 in July 2003, the filesystem had already been under development for some years. Almost four years have passed since then, and reiser4 remains outside of the mainline kernel. Hans Reiser is now out of the picture, his company (Namesys) is in trouble, and, to a casual observer, reiser4 appears not to be going anywhere.
There has been a recent increase in interest in this filesystem, though. It turns out that two Namesys employees are still working on the filesystem "mostly on enthusiasm." They have been feeding patches through to the -mm tree, and they are getting toward the end of their list of things to fix. So we might see a new push for inclusion of reiser4, perhaps as soon as 2.6.23. But, says Andrew Morton, some things would have to happen; in particular, there needs to be a new review of the reiser4 code.
Or we could move all the reiser4 code into kernel/sched.c - that seems to get people fired up.
Your editor will go out on a limb and suggest that a mass move of the reiser4 code is unlikely. But a new round of talk on actually merging this filesystem is starting to look reasonably likely. There's enough work - and enough interesting ideas - in this code that people are unwilling to let it just fade away. Perhaps, soon, it will be heading for its long-sought spot in the mainline.
The suspend2 discussion resumes
One of the side discussions in the scheduler debate had to do with how the CFS scheduler broke the out-of-tree suspend2 suspend-to-disk code. Ingo Molnar, acting on the reports, found and fixed a bug in CFS. As a way of returning the favor, he then posted a review of the suspend2 code, noting that "the patch looks sane all around" and asking whether there were any plans to get suspend2 into the mainline kernel.
Perhaps Ingo wasn't listening the past few times this topic has been brought up. His question was music to suspend2 author Nigel Cunningham's ears; Nigel promptly responded with a lengthy "reasons to merge suspend2" document. Among many other things, he notes that the user-space software suspend implementation (uswsusp) is still running behind suspend2 in features. It is true that little has been heard from uswsusp in recent times; there has not been a release since last November. Uptake by distributors has been slow. But that didn't stop uswsusp hacker Pavel Machek from jumping in, saying "Well, current uswsusp code can do most of stuff suspend2 can do, with 20% (or so) of kernel code."
Those who followed the discussion one year ago when uswsusp was merged may remember that it triggered a debate on which functions can sensibly be moved out of the kernel to user space. Many developers thought that suspend-to-disk functionality was, perhaps, on the wrong side of that line. After this debate, the number of proposals for moving functionality out of the kernel fell significantly. People are still sensitive to the issue, though, as can be seen in this response from Linus:
In a later, calmer moment he added:
This discussion should help to keep a lid on future "move kernel code to user space" projects. While there are certainly times when such moves make sense, there are also situations where putting functionality in user space just makes things harder. That said, one should not expect the recently-posted Kcli patch, intended to help move entire applications into the kernel, to get into the mainline anytime soon.
Meanwhile, what about suspend2? It is possible that the renewed discussion might provide some impetus for the merging of this longstanding development. Certainly suspend2 has a significant user community which would appreciate inclusion in the mainline. The amount of discussion has been relatively low, though. It may well be that enough systems now have working suspend-to-RAM support that the level of interest in suspend-to-disk is rather lower than it once was.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Networking
Security-related
Virtualization and containers
Miscellaneous