Brief items
The current 2.6 development kernel is 2.6.30-rc2,
released on April 14.
"New 'microblaze' architecture, a somewhat late 'input' layer merge,
a new intel virtual networking driver and some firmware loading
updates. And mn10300 and frv moved their header files from include/asm to
arch. That accounts for the bulk, but shouldn't affect anybody." The
short-form changelog is in the announcement; see the
full changelog for all the details.
There have been no stable 2.6 updates released in the last week, and none
are in the review process.
Kernel development news
When the revolution comes, and the people who haven't converted to
git get sent to the gulags, we'll make "-M" the default.
--
Linus Torvalds
I have been asked to include aufs into mainline from several people
several times. As long as you have strong NACK for aufs and reject
all union-type filesystems, I have to give up unwillingly and will
answer them "Aufs was rejected. Let's give it up."
--
J.R. Okajima gives up
while(my_rootfs_hasnt_appeared_and_i_am_sad()) {
    wait_on(&new_disk_discovery);
}
--
Alan Cox extends the boot API
IBM has a well-known disdain for vowels, and basically refuses to
use them for mnemonics (they were called on this, and did "eieio"
as an instruction just to try to make up for it).
But I'm from Finland. In Finnish, about 75% of all letters are
vowels. I find this dis-emvoweling to be stupid and
impractical. Without vowels, you can't tell Finnish words apart
(admittedly, _with_ vowels, you generally cannot pronounce them, so
to a non-Finn it doesn't much matter).
--
Linus Torvalds (thanks to Ben Hutchings)
Our recent coverage from the 2009 Linux Storage and
Filesystem Workshop (day 1, day 2) contained no notes
from the storage track - an unfortunate result of your editor's inability
to be in two places at the same time. Happily, James Bottomley took good
notes, which he has now made available to us to publish. Topics covered
include multipathing, I/O scheduling and tracing, ATA issues, and more;
click below for the full text.
Full Story (comments: 3)
By Jonathan Corbet
April 14, 2009
In a typical development cycle, Linus Torvalds
pulls patches from over 100 git
trees into the mainline repository. While this is going on, it's not
unusual for him to complain about how some of those trees are managed; most
of the gripes have to do with excessive use of rebasing and merging
operations. In a recent discussion on the dri-devel list, Linus
clarified his rules somewhat on subsystem tree
management. Your editor, on the theory that there might be a developer or
two out there who does not read dri-devel, thought that it might be good to
expose those rules more widely.
The git "rebase" operation takes a set of patches applied to one tree and
reworks them to apply to a different tree. If a developer has written some
patches against 2.6.29, he or she can use "git rebase" to turn them into
patches against 2.6.30-rc1 instead. With git, rebasing can also be used to
make edits to the commit history. If something needs to be fixed in a
patch which was made some time ago, the developer can (1) remove the
more recent patches from the tree, (2) make the needed changes, and
(3) rebase the removed patches back onto the fixed patch. This
technique can be used to silently disappear an embarrassing bug from the
history, improve patch changelogs, fix a patch conflict against somebody
else's tree, and more. It's something that git-based developers simply end up
doing occasionally.
There are a couple of problems associated with rebasing, though. One of
those is that it changes the commit history. Whenever a series of commits
is rebased, anybody who was working with the old history is left out in the
cold. If a heavily-used tree is rebased, all developers depending on that
tree are forced to scramble to readjust to the new reality. The other
problem is that rebased patches are changed patches; any testing that they
saw may no longer be applicable. That is why Linus tends to grumble hard at
trees which have obviously been rebased just prior to the sending of a pull
request. The changes in those trees probably worked before the rebase, but
the post-rebase changes have not been tested and may not work as well.
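The effect on history can be seen in a toy model. The sketch below assumes, purely for illustration, that a commit's identity is a hash of its parent's identity plus the patch text; git's real object IDs are cryptographic hashes over more data, and the names toy_commit_id() and toy_apply_series() are invented here. Applying the same two patches to two different bases yields two different tips, which is why anybody holding the old commits is stranded after a rebase.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toy commit ID: an FNV-style 64-bit hash seeded with the parent's ID and
 * then fed the patch text.  Every step is reversible, so two different
 * parents always give two different IDs for the same patch.
 */
uint64_t toy_commit_id(uint64_t parent_id, const char *patch)
{
    const uint64_t prime = 1099511628211ULL;
    uint64_t h = (14695981039346656037ULL ^ parent_id) * prime;

    for (const char *p = patch; *p != '\0'; p++)
        h = (h ^ (unsigned char)*p) * prime;
    return h;
}

/* Apply a series of patches on top of `base`; return the ID of the tip. */
uint64_t toy_apply_series(uint64_t base, const char *const patches[], size_t n)
{
    uint64_t tip = base;

    for (size_t i = 0; i < n; i++)
        tip = toy_commit_id(tip, patches[i]);
    return tip;
}

/* The same patches on a new base produce a new history. */
void toy_rebase_demo(void)
{
    const char *const series[] = { "fix locking bug", "add new ioctl" };
    const uint64_t old_base = 2629;   /* stands in for 2.6.29 */
    const uint64_t new_base = 2630;   /* stands in for 2.6.30-rc1 */

    uint64_t before = toy_apply_series(old_base, series, 2);
    uint64_t after = toy_apply_series(new_base, series, 2);

    assert(before != after);                                  /* rebased */
    assert(before == toy_apply_series(old_base, series, 2));  /* untouched */
}
```

The patch text is identical in both runs; only the parent differs, and that alone is enough to change every ID downstream of it.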
Rebasing is clearly a useful technique, though. Linus does not tell
developers not to use it; in fact, he encourages it sometimes. The key rule
that was passed down is this: Thou Shalt Not Rebase Trees With History
Visible To Others. If a developer has pulled in somebody else's tree, the
resulting tree can no longer be rebased, since that would break the second
developer's history. Similarly, once a tree has been exported such that
others may be making use of it, it can no longer be rebased.
On the other hand, private history can be rebased at will - and it probably
should be. If a patch is seen to introduce a bug, it's best to fix it at
the source rather than reverting it or adding a second, fixup patch; the
result is a cleaner history which is less likely to create problems for
people trying to bisect unrelated bugs. Your editor has found that
rebasing is often needed to add tags ("Acked-by," for example) to patches
which have been circulated for review. When one is creating a set of
patches for the mainline kernel, one is really creating an entire history,
not just the end result. Making that history clean and readable is to
everybody's benefit.
The associated rule that goes with this, though, is that trees which are
subject to rebasing should not be exposed to the world:
This means: if you're still in the "git rebase" phase, you don't
push it out. If it's not ready, you send patches around, or use
private git trees (just as a "patch series replacement") that you
don't tell the public at large about.
So, in other words, trees which might be rebased should be kept private.
They should also not have other developers' trees pulled into them.
It's worth noting that Linus very much practices what he preaches on this
front. The mainline git repository accepts 10,000 or so changesets every
development cycle, but it is never rebased. And that is a good thing:
rebasing the mainline would cause massive pain throughout the development
community.
Merging is the other place where subsystem maintainers can run afoul of the
Chief Penguin. A "merge" in git is similar to a merge in most other source
code management systems; it joins two (or more) independent lines of
development into the current branch. Git merges differ, though, in that
they can have more than two incoming branches; Ingo Molnar is famous for
his use of "octopus merges" joining vast numbers of branches in a single
operation. In almost all cases, performing a merge adds a special commit
to the repository indicating that the merge has been done and noting which
files, if any, had conflicts.
Merges go both ways. When Linus pulls a subsystem tree into the mainline,
the result is a merge. But it is also common for developers to perform
merges in the other direction; they will pull the mainline (or some
higher-level subsystem tree) into a branch containing a local line of
development. It is natural to want to develop code against the current
state of the art; it gives confidence that the end result will work with
everybody else's changes and minimizes the chances of an ugly merge conflict
later on.
But excessive pulling from the mainline (as evidenced by the merge commits
which result) tends to irritate Linus. As he put it:
But if I see a lot of "Merge branch 'linus'" in your logs, I'm not
going to pull from you, because your tree has obviously had random
crap in it that shouldn't be there. You also lose a lot of
testability, since now all your tests are going to be about all my
random code.
As anybody who has worked with tip-of-the-repository kernels knows, the
state of the mainline at any random point can be, well, random. So
frequent pulling of the mainline into a development branch will add a
certain amount of randomness to that branch; this randomness is not
particularly helpful for somebody who is trying to get a feature working.
It also increases the chances that another developer who ends up in the middle of
the series while running a bisect operation will encounter unrelated bugs.
So Linus would rather that developers not pull down from upstream trees:
And, in fact, preferably you don't pull my tree at ALL, since
nothing in my tree should be relevant to the development work _you_
do. Sometimes you have to (in order to solve some particularly
nasty dependency issue), but it should be a very rare and special
thing, and you should think very hard about it.
The reality of the situation tends not to be so strict, though. An
occasional merge to stay on top of what's happening elsewhere can make
sense. What Linus suggests, though, is that the merges happen at specific
release points. So pulling the tip of the mainline tree into a development
tree probably does not make sense, but there might be an argument for
pulling in 2.6.29 or 2.6.30-rc1. Doing things this way allows development
to be based on a (hopefully) relatively stable point where the issues are
known.
The temptation to merge in the mainline during development can be hard to
resist; one likes to know whether one's work is even remotely relevant to
the current state of the code. Fortunately, git makes it really easy to
create throwaway branches and test out merges and integration there. Once
it's clear that things work, the test branch can be deleted and the
(unmerged) development branch sent upstream.
Similar rules apply to the merging of downstream code. The receiving
repository should be in a reasonably well defined and stable state;
typically developers maintain a "for upstream" branch for this kind of
merge. And the downstream code should be "ready": it should be at some
sort of release point and not in a random state of development.
Of course, these rules are not absolute:
Git does allow people to do many different things, and solve
problems different ways. I just want all the regular workflows to
be "good practice", but then if you have to occasionally break the
rules to solve some odd problem, go ahead and break the rules (and
tell people why you did it that way this time).
Linus first started playing with BitKeeper in February, 2002, so the kernel
community now has seven years' worth of experience with distributed version
control. But the truth of the matter is that we are still figuring out the
best way to use this particular tool. This is a process which is likely to
continue for some time yet. As other large projects move toward using
tools like git, they may want to look hard at the processes and rules which
have been developed in the kernel community; they might just be able to
shorten their own learning experience.
By Jonathan Corbet
April 14, 2009
One might think that the ext3 filesystem, by virtue of being standard on
almost all installed Linux systems for some years now, would be reasonably
well tuned for performance. Recent events have shown, though, that some
performance problems remain in ext3, especially in places where the
fsync() system call is used. It's impressive what can happen when
attention is drawn to a problem; the 2.6.30 kernel will contain
fixes which seemingly eliminate many of the latencies experienced by ext3
users. This article will look at the changes that were made, including a
surprising change to the default journaling mode made just before the
2.6.30-rc1 release.
The problem, in short, is this: the ext3 filesystem, when running in the
default data=ordered mode, can exhibit lengthy stalls when some
process calls fsync() to flush data to disk. This issue most
famously manifested itself as the much-lamented Firefox system-freeze problem, but it goes
beyond just Firefox. Anytime there is reasonably heavy I/O going on, an
fsync() call can bring everything to a halt for several seconds.
Some stalls on the order of minutes have been reported. This behavior has
tended to discourage the use of fsync() in applications and it
makes the Linux desktop less fun to use. It's clearly worth fixing - but
nobody did that for years.
When Ted Ts'o looked into the problem, he noticed an obvious issue: data
sent to the disk via fsync() is put at the back of the I/O
scheduler's queue, behind all other outstanding requests. If processes on
the system are writing a lot of data, that queue could be quite long. So it
takes a long time for fsync() to complete. While that is happening, other
parts of the filesystem can stall, eventually bringing much of the system
to a halt.
The first fix was to mark I/O requests generated by fsync() with the
WRITE_SYNC operation bit, marking them as synchronous requests.
The CFQ I/O scheduler tries to run synchronous requests (which generally
have a process waiting for the results) ahead of asynchronous ones (where
nobody is waiting). Normally, reads are considered to be synchronous,
while writes are not. Once the fsync()-related requests were made
synchronous, they were able to jump ahead of normal I/O. That
makes fsync() much faster, at the expense of slowing down the
I/O-intensive tasks in the system. This is considered to be a good
tradeoff by just about everybody involved. (It's amusing to note that this
change is conceptually similar to the I/O priority patch posted by
Arjan van de Ven some time ago; some ideas take a while to reach
acceptance).
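The difference between the two orderings can be illustrated with a small user-space model. This is a sketch under simplifying assumptions, not the CFQ code: it treats "synchronous first" as a strict priority (the real scheduler is more nuanced), and the structure and function names are invented for the example. It counts how many requests are dispatched ahead of a given one.

```c
#include <assert.h>
#include <stddef.h>

struct toy_request {
    int id;
    int sync;   /* non-zero: a process is waiting for this request */
};

/* Requests dispatched ahead of `target` when the queue is strictly FIFO. */
size_t toy_fifo_wait(const struct toy_request *q, size_t n, int target)
{
    for (size_t i = 0; i < n; i++)
        if (q[i].id == target)
            return i;
    return n;   /* not queued */
}

/*
 * Requests dispatched ahead of `target` when every synchronous request
 * runs before any asynchronous one, with FIFO order inside each class.
 */
size_t toy_sync_first_wait(const struct toy_request *q, size_t n, int target)
{
    size_t pos = toy_fifo_wait(q, n, target);
    size_t sync_before = 0, sync_total = 0;

    if (pos == n)
        return n;
    for (size_t i = 0; i < n; i++) {
        if (!q[i].sync)
            continue;
        sync_total++;
        if (i < pos)
            sync_before++;
    }
    if (q[pos].sync)
        return sync_before;
    return sync_total + (pos - sync_before);
}

void toy_sync_demo(void)
{
    /* Four background writes, then the write issued by fsync(). */
    const struct toy_request q[] = {
        { 0, 0 }, { 1, 0 }, { 2, 0 }, { 3, 0 }, { 4, 1 },
    };

    assert(toy_fifo_wait(q, 5, 4) == 4);        /* fsync() waits for all */
    assert(toy_sync_first_wait(q, 5, 4) == 0);  /* fsync() jumps the queue */
    assert(toy_fifo_wait(q, 5, 3) == 3);
    assert(toy_sync_first_wait(q, 5, 3) == 4);  /* background I/O pays */
}
```

The last assertion is the tradeoff described above: the fsync() request gets out immediately, and the background writer waits one request longer.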
Block subsystem maintainer Jens Axboe disliked
the change, stating that it would cause severe performance
regressions for some workloads. Linus made it
clear, though, that the patch was probably going to go in, and that, if
the CFQ I/O scheduler couldn't handle it, there would soon be a change to a
different default scheduler. Jens probably would have looked further in
any case, but the extra motivation supplied by Linus is unlikely to have
slowed this process down.
The problem, as it turns out, is that WRITE_SYNC actually does two
things: putting the request onto the higher-priority synchronous queue, and
unplugging the queue. "Plugging" is the technique used by the block layer
to issue requests to the underlying disk driver in bursts. Between bursts,
the queue is "plugged," causing requests to accumulate there. This
accumulation gives the I/O scheduler an opportunity to merge adjacent
requests and issue them in some sort of reasonable order. Judicious use of
plugging improves block subsystem performance significantly.
Unplugging the queue for a synchronous request can make sense in some
situations; if somebody is waiting for the operation, chances are they will
not be adding any adjacent requests to the queue, so there is no point in
waiting any longer.
As it happens, though, fsync() is not one of those situations.
Instead, a call to fsync() will usually generate a whole series of
synchronous requests, and the chances of those requests being adjacent to
each other are fairly good. So unplugging the queue after each synchronous
request is likely to make performance worse. Upon identifying this
problem, Jens posted a series of
patches to fix it. One of them adds a new WRITE_SYNC_PLUG
operation which queues a synchronous write without unplugging the queue.
This allows operations like fsync() to create a series of
operations, then unplug the queue once at the end.
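A toy model shows why unplugging once at the end helps. The sketch assumes a much simpler merge rule than the block layer really uses (a request is merged only if it starts exactly where the previous one ended), and all of the names are made up for the example. It counts how many requests the driver would see in each case.

```c
#include <assert.h>
#include <stddef.h>

struct toy_io {
    unsigned long start;   /* first sector */
    unsigned long len;     /* number of sectors */
};

/*
 * Queue stays plugged while the whole series is submitted, then is
 * unplugged once: each request that begins where the previous one ended
 * is merged into it, so the driver sees fewer, larger requests.
 */
size_t toy_dispatch_count_plugged(const struct toy_io *reqs, size_t n)
{
    size_t dispatched = 0;

    for (size_t i = 0; i < n; i++) {
        if (i > 0 && reqs[i].start == reqs[i - 1].start + reqs[i - 1].len)
            continue;   /* merged with its predecessor */
        dispatched++;
    }
    return dispatched;
}

/* Queue is unplugged after every request: nothing is left to merge with. */
size_t toy_dispatch_count_unplugged(const struct toy_io *reqs, size_t n)
{
    (void)reqs;
    return n;
}

void toy_plug_demo(void)
{
    /* Three adjacent 8-sector writes, then one somewhere else. */
    const struct toy_io series[] = {
        { 0, 8 }, { 8, 8 }, { 16, 8 }, { 100, 8 },
    };

    assert(toy_dispatch_count_unplugged(series, 4) == 4);
    assert(toy_dispatch_count_plugged(series, 4) == 2);
}
```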
While he was at it, Jens fixed a couple of related issues. One was that
the block subsystem could still run synchronous requests behind
asynchronous requests in some situations. The code here is a bit tricky,
since it may be desirable to let a few asynchronous requests through occasionally to
prevent them from being starved entirely. Jens changed the balance to
ensure that synchronous requests get through in a timely manner.
Beyond that, the CFQ scheduler
uses "anticipatory" logic with synchronous requests; upon executing one
such request, it will stall the queue to see if an adjacent request shows
up. The idea is that the disk head will be ideally positioned to satisfy
that request, so the best performance is obtained by not moving it away
immediately.
This logic can work well for synchronous reads, but it's not helpful
when dealing with write operations generated by fsync(). So now there's a
new internal flag that prevents anticipation when WRITE_SYNC_PLUG
operations are executed.
Linus liked the changes:
Goodie. Let's just do this. After all, right now we would otherwise
have to revert the other changes as being a regression, and I
absolutely _love_ the fact that we're actually finally getting
somewhere on this fsync latency issue that has been with us for so
long.
It turns out that there's a little more,
though. Linus noticed that he was still getting stalls, even if they were
much shorter than before, and he wondered why:
One thing that I find intriguing is how the fsync time seems so
_consistent_ across a wild variety of drives. It's interesting how
you see delays that are roughly the same order of magnitude, even
though you are using an old SATA drive, and I'm using the Intel
SSD.
The obvious conclusion is that there was still something else going on.
Linus's hypothesis was that the volume of requests pending to the drive was
large enough to cause stalls even when the synchronous requests go to the
front of the queue. With a default configuration, requests can contain up
to 512KB of data; stack up a couple dozen or so of those, and it's going to
take the drive a little while to work through them. Linus experimented
with setting the maximum size (controlled by
/sys/block/drive/queue/max_sectors_kb) to 64KB, and reports
that things worked a lot better. As of this writing, though, the default
has not been changed; Linus suggested that it might make sense to cap the
maximum amount of outstanding data, rather than the size of any individual
request. More experimentation is called for.
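The arithmetic behind that hypothesis is easy to sketch. The figures below are assumptions chosen for illustration: "a couple dozen" is taken as 24 requests, and the drive is assumed to sustain 50MB/s (51200KB/s) with no seek overhead, which is generous for a rotating disk. The function names are invented for the example.

```c
#include <assert.h>

/* Data queued ahead of a new request, in KB. */
unsigned long long toy_pending_kb(unsigned int nr_requests,
                                  unsigned int max_sectors_kb)
{
    return (unsigned long long)nr_requests * max_sectors_kb;
}

/*
 * Time to drain that data, in milliseconds, at a steady transfer rate.
 * kb_per_sec must be non-zero.
 */
unsigned long long toy_drain_ms(unsigned long long pending_kb,
                                unsigned long long kb_per_sec)
{
    return pending_kb * 1000 / kb_per_sec;
}

void toy_queue_depth_demo(void)
{
    /* Default 512KB requests: 12MB outstanding, about a quarter second. */
    assert(toy_pending_kb(24, 512) == 12288);
    assert(toy_drain_ms(12288, 51200) == 240);

    /* max_sectors_kb lowered to 64: 1.5MB outstanding, about 30ms. */
    assert(toy_pending_kb(24, 64) == 1536);
    assert(toy_drain_ms(1536, 51200) == 30);
}
```

Under these assumptions the smaller request size cuts the worst-case wait by a factor of eight; capping the total outstanding data instead would bound the same product directly.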
There is one other important change needed to get a truly quick
fsync() with ext3, though: the filesystem must be mounted in
data=writeback mode. This mode eliminates the requirement that
data blocks be flushed to disk ahead of metadata; in data=ordered
mode, instead, the amount of data to be written guarantees that
fsync() will always be slower. Switching to
data=writeback eliminates those writes, but, in the process, it
also turns off the feature which made ext3 seem more robust than ext4.
Ted Ts'o has mitigated that problem somewhat, though, by adding in the same
safeguards he put into ext4. In some situations (such as when a new file
is renamed on top of an existing file), data will be forced out ahead of
metadata. As a result, data loss resulting from a system crash should be less of a
problem.
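The rename case matters because it is how many applications update a file in place: write the new contents to a temporary file, then rename it over the old one. A minimal user-space sketch of that pattern follows; replace_file() and the paths in the demo are invented for the example. It calls fsync() explicitly before the rename, which is the conservative approach and does not depend on the safeguards described above.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Replace `path` with `len` bytes from `data` by writing `tmp_path` and
 * renaming it into place.  Returns 0 on success, -1 on failure.
 */
int replace_file(const char *path, const char *tmp_path,
                 const void *data, size_t len)
{
    const char *p = data;
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return -1;
    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0 && errno == EINTR)
            continue;               /* interrupted, try again */
        if (n < 0)
            goto fail;
        p += n;
        len -= (size_t)n;
    }
    if (fsync(fd) != 0)             /* data on disk before the rename */
        goto fail;
    if (close(fd) != 0) {
        unlink(tmp_path);
        return -1;
    }
    if (rename(tmp_path, path) != 0) {
        unlink(tmp_path);
        return -1;
    }
    return 0;

fail:
    close(fd);
    unlink(tmp_path);
    return -1;
}

void replace_file_demo(void)
{
    const char *path = "/tmp/toy_replace_demo.txt";
    const char *tmp = "/tmp/toy_replace_demo.txt.tmp";
    char buf[16] = { 0 };

    assert(replace_file(path, tmp, "old contents", 12) == 0);
    assert(replace_file(path, tmp, "new", 3) == 0);

    FILE *fp = fopen(path, "r");
    assert(fp != NULL);
    size_t got = fread(buf, 1, sizeof(buf) - 1, fp);
    fclose(fp);
    assert(got == 3 && strcmp(buf, "new") == 0);
    unlink(path);
}
```

An application which skips the fsync() call is relying on the filesystem to force the data out before the rename takes effect, which is exactly the behavior the new safeguards are meant to provide.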
Sidebar: data=guarded
Another alternative to data=ordered may be the data=guarded mode proposed by
Chris Mason. This mode would delay size updates to prevent information
disclosure problems. It is a very new patch, though, which won't be ready
for 2.6.30.
The other potential problem with
data=writeback is that, in some
situations, a crash can leave a file with blocks allocated to it which have
not yet been written. Those blocks may contain somebody else's old data,
which is a potential security problem. Security is a smaller issue than it
once was, for the simple reason that multiuser Linux systems are relatively
scarce in 2009. In a world where most systems are dedicated to a single
user, the potential for information disclosure in the event of a crash
seems vanishingly small. In other words, it's not clear
that the extra security provided by
data=ordered is worth the
associated performance costs anymore.
So Ted suggested that, maybe,
data=writeback should be made the default. There was some
resistance to this idea; not everybody thinks that ext3, at this stage of
its life, should see a big option change like that. Linus, however, was
unswayed by the arguments. He merged a patch which creates a configuration
option for the default ext3 data mode, and set it to "writeback." That will
cause ext3 mounts to silently switch to data=writeback mode with 2.6.30
kernels. Says Linus:
I'm expecting that within a few months, most modern distributions
will have (almost by mistake) gotten a new set of saner defaults,
and anybody who keeps their machine up-to-date will see a smoother
experience without ever even realizing _why_.
It's worth noting that this default will not change anything if
(1) the data mode is explicitly specified when the filesystem is
mounted, or (2) a different mode has been wired into the filesystem
with tune2fs. It will also be ineffective if distributors change
it back to "ordered" when configuring their kernels. Some distributors, at
least, may well decide that they do not wish to push that kind of change to
their users. We will not see the answer to that question for some months
yet.
By Jonathan Corbet
April 14, 2009
Once upon a time, operating systems did not have to worry about hardware
coming and going at awkward times. Whatever peripherals were bolted into
the box when the system booted could be counted on to still be there at
shutdown time. Contemporary systems don't work that way; devices will come
and go at the whim of the user. Various subsystems have evolved mechanisms
for coping with hardware which suddenly vanishes, but that code tends to be
subsystem-specific and complex. Eric Biederman recently encountered this
code and didn't really like what he saw. So he has set out to make
something better.
Eric's patch series begins
with this observation:
Not long after I touched the tun driver and made it safe to delete
the network device while still holding it's file descriptor open I [saw]
someone else touch the code adding a different feature and my
careful work went up in flames. Which brought home another point:
at the best of it this is ultimately complex tricky code that
subsystems should not need to worry about.
Eric also notes that the growth in hotplug-capable PCI devices will increase the
number of subsystems and drivers which need to be prepared for this
eventuality. Rather than spread hotplug-specific code through more parts
of the kernel, he would like to create one central, well-supported mechanism.
The issue that Eric is looking at in particular is: what happens to open
file descriptors when the underlying resource goes away? Regardless of
whether that resource is a physical device, a module, or something
different altogether, the kernel needs to do the right thing when the file
descriptor no longer points to something valid. Eric's patches create a
new infrastructure which allows any subsystem to easily revoke access to a
file descriptor in a more reliable and robust manner than has been seen
before.
The first issue that comes up is, invariably, mmap(). If a
no-longer-existing device or file has been mapped into a process's address
space, interesting and unpleasant things could happen. Eric's answer is a
new function:
    void remap_file_mappings(struct file *file,
                             struct vm_operations_struct *vm_ops);
A call to remap_file_mappings() will locate every virtual memory
area (VMA) associated with the given file. All mapped pages will
be unmapped, making them inaccessible to the process which had mapped
them. The operations associated with the VMA will be replaced with
vm_ops; those operations will normally be revoked_vm_ops,
which simply return a bus error whenever the process attempts to access one
of the affected pages.
The kernel also clearly needs to block any other operations -
read(), write(), ioctl(), etc. - which might be
performed on this file descriptor. The way to do that, of course, is to
replace the file_operations structure associated with the file.
The function to do that is:
    int fops_substitute(struct file *file, const struct file_operations *f_op,
                        struct vm_operations_struct *vm_ops);
One might imagine that this function could be quite simple, along the lines
of:
    file->f_op = f_op;
    remap_file_mappings(file, vm_ops);
But the truth of the matter is rather more complicated. To begin with,
there may be threads running in the old file operations, and some of those
might be waiting for events which will, now, never happen. As a way of
helping drivers unwedge themselves in this situation, Eric's patches add a
new entry to struct file_operations:
    int (*awaken_all_waiters)(struct file *filp);
This function should cause any thread which is waiting for the given file
to wake up and take note that the world has changed.
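The basic idea of swapping out the operations table can be modeled in user space. The sketch below is not Eric's code: the toy_ types stand in for the kernel's struct file and struct file_operations, only a read operation is modeled, and the choice of -EIO as the error returned by the revoked operations is an assumption made for the example. It leaves out everything that makes the real problem hard (sleeping threads, mappings, and concurrency).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct toy_file;

/* Cut-down stand-in for the kernel's struct file_operations. */
struct toy_file_operations {
    long (*read)(struct toy_file *file, char *buf, size_t len);
};

struct toy_file {
    const struct toy_file_operations *f_op;
};

/* Normal operation: pretend the device returned `len` bytes of data. */
long toy_dev_read(struct toy_file *file, char *buf, size_t len)
{
    (void)file;
    memset(buf, 'x', len);
    return (long)len;
}

/* Revoked operation: the device is gone, so every read fails. */
long toy_revoked_read(struct toy_file *file, char *buf, size_t len)
{
    (void)file;
    (void)buf;
    (void)len;
    return -EIO;    /* error code chosen for this sketch */
}

const struct toy_file_operations toy_dev_fops = { .read = toy_dev_read };
const struct toy_file_operations toy_revoked_fops = { .read = toy_revoked_read };

/* The naive substitution: just point the file at the new table. */
void toy_fops_substitute(struct toy_file *file,
                         const struct toy_file_operations *f_op)
{
    file->f_op = f_op;
}

void toy_substitute_demo(void)
{
    struct toy_file file = { .f_op = &toy_dev_fops };
    char buf[4];

    assert(file.f_op->read(&file, buf, sizeof(buf)) == 4);
    toy_fops_substitute(&file, &toy_revoked_fops);
    assert(file.f_op->read(&file, buf, sizeof(buf)) == -EIO);
}
```

The descriptor itself stays open across the substitution; only what happens when it is used has changed.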
The next sticking point is that, now that the file operations have been
swapped out, there is no way for the underlying driver to know when all
file descriptors have been closed. That is handled by waiting until there
are no more known users of the old file operations, then calling the
release() function directly from fops_substitute(). That
leads to the sticky question of what happens if some thread never wakes up
and the usage count never goes to zero; in the current patch,
fops_substitute() will simply hang in this situation.
Before one can even worry about that, though, there is the troublesome
point that the kernel has no idea how many users of a given
file_operations structure exist. So Eric has had to add a
reference counting mechanism. In the new way of doing things, any kernel
code must bracket calls into a file's file_operations with:
    int fops_read_lock(struct file *file);
    void fops_read_unlock(struct file *file, int revoked);
The return value from fops_read_lock() (which Eric invariably
calls fops_idx) is non-zero if access to the file has already been
revoked; it must be passed into the matching call to
fops_read_unlock(). The biggest part of the patch series is a
slog through the core VFS code adding locking around every
file_operations access. That's a lot of little code changes which
have to be made in a lot of places.
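The bracketing pattern can be sketched with a single-threaded user-space model. It follows only the description given here: the lock call returns non-zero once access has been revoked, and that value is handed back to the unlock call. The toy_ names, the plain integer counter, and the -EIO return are assumptions made for the example; the real code has to be safe against concurrent callers, which this sketch makes no attempt at.

```c
#include <assert.h>
#include <errno.h>

struct toy_counted_file {
    int revoked;   /* set once access to the file has been revoked */
    int users;     /* callers currently inside the file operations */
};

/* Returns non-zero if access was already revoked; otherwise counts us in. */
int toy_fops_read_lock(struct toy_counted_file *file)
{
    if (file->revoked)
        return 1;
    file->users++;
    return 0;
}

/* Must be given the value returned by the matching lock call. */
void toy_fops_read_unlock(struct toy_counted_file *file, int revoked)
{
    if (!revoked)
        file->users--;
}

/* How a VFS entry point would bracket its call into the driver. */
long toy_vfs_read(struct toy_counted_file *file)
{
    int revoked = toy_fops_read_lock(file);
    long ret;

    if (revoked)
        ret = -EIO;     /* nothing valid left to call into */
    else
        ret = 42;       /* stands in for calling the driver's read() */
    toy_fops_read_unlock(file, revoked);
    return ret;
}

/* The substitution code may call release() only when nobody is inside. */
int toy_can_release(const struct toy_counted_file *file)
{
    return file->users == 0;
}

void toy_read_lock_demo(void)
{
    struct toy_counted_file file = { 0, 0 };

    int revoked = toy_fops_read_lock(&file);
    assert(revoked == 0 && !toy_can_release(&file));
    toy_fops_read_unlock(&file, revoked);
    assert(toy_can_release(&file));

    assert(toy_vfs_read(&file) == 42);
    file.revoked = 1;
    assert(toy_vfs_read(&file) == -EIO);
    assert(toy_can_release(&file));
}
```

The counter is what lets the substitution code know when the last user of the old operations has left; a caller that never returns keeps it above zero, which is the hang described above.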
There is a payoff, though: the handling of revoked files in various other
subsystems can be ripped out and replaced with the new, generic code. The
changes to the /proc filesystem, for example, leave the code
almost 400 lines shorter. So the kernel gets smaller, and the new code
should, with luck, be more robust and more maintainable.
This mechanism is useful for situations where devices disappear, but there
is also a bigger goal in sight. There has long been a desire for a generic
revoke() system call which would disconnect all open descriptors
to a given file or device. It could be used to implement some sort of
secure attention key, killing all processes which have open file
descriptors to a console device, for example. revoke() would also
be useful for forced unmounting of filesystems. It's a useful idea, with
only one problem: revoke() is really hard. Nobody has yet come
through with an implementation that looks complete and robust enough to be
put into the kernel.
Eric's patch set has not gotten there yet either. But it does represent
another stab at the problem using an approach which, most developers agree,
is the way that revoke() needs to be implemented. Over time, it
might just evolve into the general solution which has evaded other
developers for years.
Page editor: Jonathan Corbet