Brief items
The current 2.6 prepatch is still 2.6.21-rc7; the expected final
2.6.21 release has not happened as of this writing. Patches to fix
regressions continue to accumulate in the mainline git repository.
There have been no -mm releases in the past week.
For older kernels: 2.6.16.49 was released on April 23 with a handful of
fixes. Users of the 2.4 kernel can choose between 2.4.34.3 (April 22,
networking fixes), 2.4.34.4 (fixes a build problem in 2.4.34.3), or
2.4.35-pre4 (April 22, various fixes).
Kernel development news
So while with other, heuristic approaches we always had the problem
of creating a "hyper-inflation" of an uneconomic virtual currency
that could be freely printed by certain tasks, in CFS the economy
of this is strict and the finegrained plus/minus balance is
strictly managed by a conservative and independent central bank.
-- Ingo Molnar brings fiscal discipline to scheduling
We like it in the kernel, we find it to be warm and fuzzy. Whereas,
user space is a cold, dark, and rainy place, and we just don't want
to go there.
-- Matt Ranon
In last week's scheduler timeslice, Ingo Molnar had introduced his
"completely fair scheduler"
patch and Staircase Deadline scheduler author Con Kolivas had retreated in
a bit of a sulk. Since then, Con has returned and posted several new
revisions of the SD scheduler, but with little discussion. His intent,
seemingly, is to raise the bar and ensure that whatever scheduler does
eventually replace the current system is the best possible - a goal which
few should be able to disagree with.
Most of the discussion, though, has centered around the CFS scheduler.
Several testers have reported good results, but others have noted some
behavioral regressions. These problems, like most of the others over the
years, involve the X Window System. So potential solutions are being
discussed yet again.
The classic response to X interactivity problems is to renice the X server.
But this solution seems like a bit of a hack to many, so scheduler work
has often been aimed at eliminating the need to run X at a higher
priority. Con Kolivas questions this goal:
The one fly in the ointment for linux remains X. I am still, to
this moment, completely and utterly stunned at why everyone is
trying to find increasingly complex unique ways to manage X when
all it needs is more cpu. Now most of these are actually very good
ideas about _extra_ features that would be desirable in the long
run for linux, but given the ludicrous simplicity of renicing X I
cannot fathom why people keep promoting these alternatives.
Avoiding renicing remains a goal of CFS, but it's interesting to see that
the v4 CFS patch does renice X - automatically. More specifically, the
scheduler bumps the priority level of any process performing hardware I/O
(as seen by calls to ioperm() or iopl()), the loop block
device thread, and worker threads associated with workqueues. With the X
server automatically boosted (as a result of its iopl() use), it
does tend to be more responsive.
While giving kernel threads a priority boost might make sense in the long
term, Ingo sees renicing X as a temporary hack. The real solution to the
problem seems to involve two different approaches: CPU credit transfers
between processes and group scheduling.
Remember that, with the CFS scheduler, each process accumulates a certain
amount of CPU time which is "owed" to it; this time is earned by waiting
while others use the processor. This mechanism can enforce basic fairness
between processes, in that each one gets something very close to an equal
share of the available CPU time. Whether this calculation is truly "fair"
depends on how one judges fairness; if the X server is seen as performing
work for other processes, then fairness would call for X to share in the
credit accumulated by those other processes. Linus has been pushing for a solution along these
lines:
The "perfect" situation would be that when somebody goes to sleep,
any extra points it had could be given to whoever it woke up
last. Note that for something like X, it means that the points are
100% ephemeral: it gets points when a client sends it a request,
but it would *lose* the points again when it sends the reply!
The CFS v5 patch has the
beginnings of support for this mode of operation. Automatic transfers of
credit are not there, but there is a new system call:
long sched_yield_to(pid_t pid);
This call gives up the processor much like sched_yield(), but it
also gives half of the yielding process's credit (if it has any) to the
process identified by pid. This system call could be used by (for
example) the X libraries as a way to explicitly transfer credit to the X
server. There is currently no way for the X server to give back the credit
it didn't use; Ingo has mentioned the
notion of a sched_pay() system call for that purpose. There's
also no way to ensure that X uses the credit for work done on the yielding
process's behalf; it could just as easily squander it on wobbly window
effects. But it's a step in the right direction.
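The credit handoff described above can be modeled in a few lines of user-space C. This is only a sketch of the behavior as the text describes it, not the code from the CFS patch: struct task_model and yield_to_model() are hypothetical names, and measuring credit in nanoseconds is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space model of a task; not a kernel structure. */
struct task_model {
    int pid;
    int64_t credit_ns;  /* CPU time "owed" to the task (assumed unit) */
};

/*
 * Model of the handoff attributed to sched_yield_to() in the text:
 * the yielding task gives half of its credit, if it has any, to the
 * target task. Returns the amount transferred.
 */
static int64_t yield_to_model(struct task_model *from, struct task_model *to)
{
    int64_t gift;

    assert(from != to);         /* a task cannot donate to itself */

    if (from->credit_ns <= 0)
        return 0;               /* no credit, so nothing to give away */

    gift = from->credit_ns / 2;
    from->credit_ns -= gift;
    to->credit_ns += gift;
    return gift;
}
```

In this model an X client holding 8000ns of credit would hand 4000ns to the server and keep the rest; as the text notes, nothing here returns unused credit to the client.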
A further step, still very much a prototype, is Ingo's scheduler economy patch. This
mechanism allows kernel code to set up a scheduler "work account";
processes can then make deposits to and withdrawals from the account with:
void sched_pay(struct sched_account *account);
void sched_withdraw(struct sched_account *account);
At this point, deposits and withdrawals all involve a fixed amount of CPU
time. The Unix-domain socket code has been modified to create one of these
accounts associated with each socket. Any non-root process (X clients, for
example) writing to a socket will also make a deposit into the work
account; root-owned processes (the X server, in particular) reading
messages also withdraw from the account. It's all a proof of concept; a
real implementation would require a rather more sophisticated API. But the
proof does show that X clients can convey some of their CPU credits to the
server when processor time is scarce.
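The flow of credit through a work account might be pictured with the following user-space model. It is a sketch under stated assumptions rather than the patch itself: the structure and function names are hypothetical, the tasks are passed explicitly (the real calls act on the current process), the size of the fixed quantum is invented, and letting a payer's credit go negative is a simplification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed fixed transfer size; the text gives no actual value. */
#define PAY_QUANTUM_NS 1000

/* Hypothetical stand-ins for a work account and a task's credit. */
struct account_model { int64_t balance_ns; };
struct party_model   { int64_t credit_ns; };

/* Writer side (an X client, say): deposit one quantum of credit. */
static void pay_model(struct party_model *task, struct account_model *acct)
{
    assert(task != NULL && acct != NULL);
    task->credit_ns -= PAY_QUANTUM_NS;   /* may go negative in this model */
    acct->balance_ns += PAY_QUANTUM_NS;
}

/*
 * Reader side (the X server, say): take up to one quantum out of the
 * account. Returns the amount actually withdrawn.
 */
static int64_t withdraw_model(struct party_model *task,
                              struct account_model *acct)
{
    int64_t amount;

    assert(task != NULL && acct != NULL);
    amount = acct->balance_ns < PAY_QUANTUM_NS ? acct->balance_ns
                                               : PAY_QUANTUM_NS;
    acct->balance_ns -= amount;
    task->credit_ns += amount;
    return amount;
}
```

In the prototype described above, one such account is attached to each Unix-domain socket, so every write by a client is paired with a deposit and every read by the server with a withdrawal.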
The other idea in circulation is per-user or group scheduling. Here, the
idea is to fairly split CPU time between users instead of between
processes. If one user is running a single text editor process when
another starts a kernel build with make -j 100, the
scheduler will have 101 processes all contending for the CPU. The current
crop of fair schedulers will divide the processor evenly between all of
them, allowing the kernel build to take over while the text editor must
make do with less than 1% of the available CPU time. This situation may be
just fine with kernel developers, but one can easily argue that the right
split here would be to confine the kernel build to half of the available
time while allowing the text editor to use the other half.
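The arithmetic behind this example is simple enough to write down. The helpers below are illustrative only (their names are made up for this sketch), and they assume a single CPU with every process runnable and CPU-bound all the time.

```c
#include <assert.h>

/* Share of the CPU each process gets when time is split per process. */
static double per_process_share(int total_procs)
{
    assert(total_procs > 0);
    return 1.0 / total_procs;
}

/*
 * Share each process gets when time is first split evenly between
 * users, then evenly between that user's own processes.
 */
static double per_user_share(int users, int procs_of_this_user)
{
    assert(users > 0 && procs_of_this_user > 0);
    return (1.0 / users) / procs_of_this_user;
}
```

With 101 processes, per_process_share(101) comes to just under 1% for the text editor. With two users, per_user_share(2, 1) gives the editor half of the processor, while each of the 100 build processes gets per_user_share(2, 100), or half a percent.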
That is the essence of per-user scheduling. Among other things, it could
ease the X interactivity problem: since X runs as a different user (root,
normally), it will naturally end up in a separate scheduling group with its
own fair share of the processor. Linus has been pushing hard for group
scheduling as well (see last week's quote). Ingo responds that
group scheduling is on his mind - he just hasn't gotten around to it yet:
Firstly, i have not neglected the group scheduling related CFS
regressions at all, mainly because there _is_ already a quick hack
to check whether group scheduling would solve these regressions:
renice. And it was tried in both of the two CFS regression cases
i'm aware of: Mike's X starvation problem and Willy's "kevents
starvation with thousands of scheddos tasks running" problem. And
in both cases, applying the renice hack [which should be properly
and automatically implemented as uid group scheduling] fixed the
regression for them! So i was not worried at all, group scheduling
_provably solves_ these CFS regressions. I rather concentrated on
the CFS regressions that were much less clear.
In other words, the automatic renicing described above is not a permanent
solution; instead, it's more of a proof of concept for group scheduling.
Ingo goes on to say that there are many other important factors in
getting interactive scheduling right; in particular, nanosecond accounting
and strict division of CPU time were needed. Once all of those details are
right, one can start thinking about the group scheduling problem.
So there would appear to be some work yet to be done on the CFS scheduler.
That will doubtless happen; meanwhile, however, Linus has complained that some of this effort may be
misdirected at the moment:
Anyway, I'd ask people to look a bit at the current *regressions*
instead of spending all their time on something that won't even be
merged before 2.6.21 is released, and we thus have some more
pressing issues. Please?
One might argue that any work which is intended for the upcoming 2.6.22
merge window needs to be pulled into shape now. But the replacement of the
CPU scheduler is likely to take a little bit longer than that. Given the
number of open questions - and the amount of confidence replacing the
scheduler requires - any sort of movement for 2.6.22 seems unlikely.
One of the fundamental problems facing filesystem developers is that, while
disks are getting both larger and faster, the rate at which they are
growing exceeds the rate at which they are speeding up. As a result, the
time required to read an entire disk is growing. There is little joy in
waiting for a filesystem checker to do its thing during a system reboot, so
the prospect of ever-longer fsck delays is understandably lacking in
appeal. Unfortunately, that is the direction in which things are going.
Journaling filesystems can help avoid fsck, but only in situations
where the filesystem has not suffered any sort of corruption.
Given that filesystem checks are something we have to deal with, it's worth
thinking about how we might make them faster in the era of terabyte disks.
One longstanding idea for improving the situation was recently posted in
the form of chunkfs, "fs
fission for faster fsck." The core idea is to take a filesystem and split
it into several independent filesystems, each of which maintains its own
clean/dirty state. Should things go wrong, only those sub-filesystems which
were active at the time of failure need to be checked.
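To see why this helps, consider a rough model of the work a checker would face after a crash. This is a sketch, not chunkfs code: struct chunk_state and blocks_to_check() are hypothetical names, and using the number of blocks as a stand-in for fsck time is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-chunk record: its own clean/dirty flag and size. */
struct chunk_state {
    int dirty;              /* nonzero if active at the time of failure */
    uint64_t size_blocks;
};

/* Total blocks in the chunks that were dirty and so need checking. */
static uint64_t blocks_to_check(const struct chunk_state *chunks, size_t n)
{
    uint64_t total = 0;
    size_t i;

    assert(chunks != NULL || n == 0);
    for (i = 0; i < n; i++)
        if (chunks[i].dirty)
            total += chunks[i].size_blocks;
    return total;
}
```

A monolithic filesystem corresponds to a single chunk that is dirty whenever anything is; splitting it up lets the clean chunks drop out of the sum.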
Like many experimental filesystem developments, chunkfs is built upon ext2.
Internally, it is a series of separate ext2
filesystems which look like a single system to the higher layers of the
filesystem. Each chunk can be maintained independently by the filesystem
code, but the individual chunks are not visible outside of the filesystem.
The idea is relatively simple, though, as always, there are a few pesky
details to work out.
One is that inode numbers in the larger chunkfs filesystem must be unique.
Each chunk, however, maintains its own list of inodes starting with
number one, so inode numbers will be reused from one chunk to the next.
Chunkfs makes these numbers unique by putting the chunk number
in the upper eight bits of every inode number. As a result, there is a
maximum of 256 chunks in any chunkfs filesystem.
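The numbering scheme can be sketched as a pair of bit operations. The code below assumes 32-bit inode numbers (as in ext2), which leaves 24 bits for the per-chunk inode number; the macro and function names are hypothetical and are not taken from the chunkfs patch.

```c
#include <assert.h>
#include <stdint.h>

#define CHUNK_BITS  8
#define INO_BITS    32                       /* assumed inode number width */
#define LOCAL_BITS  (INO_BITS - CHUNK_BITS)  /* 24 bits left per chunk */
#define LOCAL_MASK  ((UINT32_C(1) << LOCAL_BITS) - 1)
#define MAX_CHUNKS  (UINT32_C(1) << CHUNK_BITS)

/* Combine a chunk number and a per-chunk inode number. */
static uint32_t make_ino(uint32_t chunk, uint32_t local_ino)
{
    assert(chunk < MAX_CHUNKS);
    assert(local_ino <= LOCAL_MASK);
    return (chunk << LOCAL_BITS) | local_ino;
}

/* Recover the chunk number from the upper eight bits. */
static uint32_t ino_chunk(uint32_t ino)
{
    return ino >> LOCAL_BITS;
}

/* Recover the inode number as the chunk itself knows it. */
static uint32_t ino_local(uint32_t ino)
{
    return ino & LOCAL_MASK;
}
```

Inode 1 in chunk 0 and inode 1 in chunk 1 thus map to different numbers, and the eight-bit field is what caps the filesystem at 256 chunks.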
A trickier problem comes about when a file grows. The filesystem will try
to allocate additional file blocks in the chunk where the file was
originally created. Should that chunk fill up, however, something else needs
to happen; it would not be good for the filesystem to return "no space"
errors when free space does exist in other chunks. The answer
here is the creation of a "continuation inode." These inodes track the
allocation of blocks in a different chunk; they look much like files in
their own right, but they are part of a larger array of block allocations.
The "real" inode for a given file can have pointers to up to four
continuation inodes in different chunks; if more are needed, each
continuation inode can, itself, point to another four continuations. Thus,
continuation inodes can be chained to create files of arbitrary length.
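One way to picture the resulting structure is as a tree with a fan-out of four. The model below is a sketch only: struct inode_model and total_blocks() are hypothetical, and the real on-disk layout is not described in the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CONT 4  /* continuation pointers per inode, per the text */

/* Hypothetical in-memory model of an inode and its continuations. */
struct inode_model {
    uint32_t chunk;                     /* chunk holding these blocks */
    uint64_t nblocks;                   /* blocks allocated in that chunk */
    struct inode_model *cont[MAX_CONT]; /* NULL where unused */
};

/* Sum the blocks held by an inode and all of its continuations. */
static uint64_t total_blocks(const struct inode_model *ino)
{
    uint64_t sum;
    size_t i;

    assert(ino != NULL);
    sum = ino->nblocks;
    for (i = 0; i < MAX_CONT; i++)
        if (ino->cont[i] != NULL)
            sum += total_blocks(ino->cont[i]);
    return sum;
}
```

A file that starts in one chunk and spills into two others would be a head inode with a continuation that has, in turn, a continuation of its own; the file's size is the sum over the whole tree.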
This code is in a relatively early state; the text with the patch notes
that "this is a preliminary implementation and lots of changes are
expected before this code is even sanely usable." There is a set of
tools which can be used by people who would like to test out chunkfs
filesystems with well backed-up data. With some care and some testing,
chunkfs may grow to the point that it's stable and shortening fsck times
worldwide.
Meanwhile, one of the longest stories in Linux filesystem development has
to be the reiser4 filesystem. By the time Hans Reiser first asked for the merging of
reiser4 in July, 2003, the filesystem had been under development for
some years. Almost four years have passed since then, and reiser4 remains
outside of the mainline kernel. Hans Reiser is now out of the picture, his
company (Namesys) is in trouble, and, to a casual observer, reiser4 appears
not to be going anywhere.
There has been a recent increase in interest in this filesystem, though.
It turns out that two Namesys employees are
still working on the filesystem "mostly on enthusiasm." They have been
feeding patches through to the -mm tree, and they are getting toward the
end of their list of things to fix. So we might see a new push for
inclusion of reiser4, perhaps as soon as 2.6.23. But, says Andrew Morton, some things would have to
happen; in particular, there needs to be a new review of the reiser4 code.
To get it unstuck we'd need a general push, get people looking at
and testing the code, get the vendors to have a serious think about
it, etc. We could do that - it'd require that the namesys people
(and I) start making threatening noises about merging it, I guess.
Or we could move all the reiser4 code into kernel/sched.c - that
seems to get people fired up.
Your editor will go out on a limb and suggest that a mass move of the
reiser4 code is unlikely. But a new round of talk on actually merging this
filesystem is starting to look reasonably likely. There's enough work -
and enough interesting ideas - in this code that people are unwilling to
let it just fade away. Perhaps, soon, it will be heading for its
long-sought spot in the mainline.
One of the side discussions in the scheduler debate had to do with how the
CFS scheduler broke the out-of-tree
suspend2 suspend-to-disk code. Ingo
Molnar, acting on the reports, found and fixed a bug in CFS. As a way of
returning the favor, he then posted a review of the suspend2 code, noting
that "the patch looks sane all around" and asking whether there were any
plans to get suspend2 into the mainline kernel.
Perhaps Ingo wasn't listening the past few times this topic has been
brought up. His question was music to suspend2 author Nigel Cunningham's
ears; Nigel promptly responded with a lengthy "reasons to merge suspend2" document. Among
many other things, he notes that the user-space software suspend
implementation (uswsusp) is still running behind suspend2 in features. It
is true that little has been heard from uswsusp in recent times; there has
not been a release since last November. Uptake by distributors has been
slow. But that didn't stop uswsusp hacker Pavel Machek from jumping in, saying "Well, current uswsusp
code can do most of stuff suspend2 can do, with 20% (or so) of kernel
code."
Those who followed the discussion one year ago when uswsusp was merged may
remember that it triggered a debate on which functions can sensibly be
moved out of the kernel to user space. Many developers thought that
suspend-to-disk functionality was, perhaps, on the wrong side of that
line. After this debate, the number of proposals for moving functionality
out of the kernel fell significantly. People are still sensitive to the
issue, though, as can be seen in this response
from Linus:
This whole notion that "kernel lines of code" is somehow different
is a stupid and idiotic _disease_ that is spread by microkernel
people and people who have been brainwashed by them.
In a later, calmer moment he added:
This is why I don't believe in the whole kernel-line-counting
thing. I'm personally 100% convinced that it's better to have ten
times as many lines in the kernel, if it means that you can just
forget about version skew and bad user-space interfaces etc.
This discussion should help to keep a lid on future "move kernel code to
user space" projects. While there are certainly times when such moves make
sense, there are also situations where putting functionality in user space
just makes things harder. That said, one should not expect the
recently-posted Kcli patch,
intended to help move entire applications into the kernel, to get into the
mainline anytime soon.
Meanwhile, what about suspend2? It is possible that the renewed discussion
might provide some impetus for the merging of this longstanding
development. Certainly suspend2 has a significant user community which
would appreciate inclusion in the mainline. The amount of discussion has
been relatively low, though. It may well be that enough systems now have
working suspend-to-RAM support that the level of interest in
suspend-to-disk is rather lower than it once was.
Page editor: Jonathan Corbet