Brief items
The current 2.6 prepatch is 2.6.26-rc1,
released on May 3.
"So this merge window was somewhat rocky in the sense that there was
a lot of arguments about it, but at the same time I at least personally
think that from a technical angle, we had somewhat less scary stuff going
on than has been almost the rule lately." At about 7500 commits,
this cycle has fewer changes than the last couple have; a lot of the
changes are infrastructural, so there will be fewer obvious new features
with 2.6.26 than with some of its predecessors. See
the short-form changelog for details, or
the full changelog for lots of details.
A relatively slow stream of patches has been heading into the mainline git
repository since the -rc1 release.
The current stable 2.6 release is 2.6.25.2, released on May 6. This
release contains a single fix for a locally-exploitable security problem in
the filesystem locks code. 2.6.24.7 and 2.4.36.4 were also released with
this fix.
Previously, 2.6.25.1 and 2.6.24.6 had been released with
a larger set of fixes. In the absence of another security issue, there
will probably not be any more 2.6.24 stable updates.
Kernel development news
Usually my git problems are root-caused down to my lack of a PhD in
hermeneutic metaphysiology, but not this time, methinks.
--
Andrew Morton
Kids: do not shove random modules into your kernel. Just because
Linus does something doesn't make it a good idea...
We've moved
half the kernel brains to userspace with udev, initrd and modules;
it's really unfair that you're not sharing all that
why-won't-my-machine-boot love.
--
Rusty Russell
[T]he kernel team has evolved from a small team of buddies to a
large enterprise. And to survive this evolution, we may need to
apply the immoral principles found in big companies.
--
Willy Tarreau
By Jonathan Corbet
May 5, 2008
About 500 changesets were merged after the publication of the
first and
second 2.6.26 merge window
summaries. The merge window is now closed; here is the final set of
changes which got in:
- New drivers for Solarflare Communications Solarstorm SFC4000
controller-based Ethernet controllers,
Hauppauge HVR-1600 TV tuner cards,
ISP 1760 USB host controllers,
Cypress c67x00 OTG controllers, and
Intel PXA 27x USB controllers.
- 8KB stacks are, once again, the default for the x86 architecture.
"Out-of-memory situations are less problematic than silent and
hard to debug stack corruption."
- The klist type now has the usual-form macros for declaration and
initialization: DEFINE_KLIST() and KLIST_INIT().
Two new functions (klist_add_after() and
klist_add_before()) can be used to add entries to a klist in
a specific position.
- As had been planned, struct class_device has been removed
from the driver core, along with all of the associated infrastructure.
Classes are now implemented with an ordinary struct device.
- kmap_atomic_to_page() is no longer exported to modules.
- There are some new generic functions for performing 64-bit integer
division in the kernel:
u64 div_u64(u64 dividend, u32 divisor);
u64 div_u64_rem(u64 dividend, u32 divisor, u32 *remainder);
s64 div_s64(s64 dividend, s32 divisor);
s64 div_s64_rem(s64 dividend, s32 divisor, s32 *remainder);
Unlike do_div(), these functions are explicit about whether
signed or unsigned math is being done. The x86-specific
div_long_long_rem() has been removed in favor of these new
functions.
- There is a new string function:
bool sysfs_streq(const char *s1, const char *s2);
It compares the two strings while ignoring an optional trailing
newline.
- The prototype for i2c probe() methods has changed:
int (*probe)(struct i2c_client *client,
const struct i2c_device_id *id);
The new id argument supports i2c device name aliasing.
- There is a new configuration option (MODULE_FORCE_LOAD) which
controls whether the loading of modules can be forced if the kernel
thinks something is not right; it defaults to "no."
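The division and string helpers listed above can be modeled in user space. What follows is a minimal sketch, not the kernel's code: the `model_` names are hypothetical, the fixed-width types from `<stdint.h>` stand in for the kernel's `u64`/`s64`/`u32`/`s32`, and the signed variant assumes C99 division semantics (quotient truncated toward zero, remainder taking the sign of the dividend).

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of div_u64_rem(): unsigned 64-bit dividend, 32-bit divisor. */
static uint64_t model_div_u64_rem(uint64_t dividend, uint32_t divisor,
                                  uint32_t *remainder)
{
    *remainder = (uint32_t)(dividend % divisor);
    return dividend / divisor;
}

/* Model of div_s64_rem(): the signed counterpart, stated explicitly. */
static int64_t model_div_s64_rem(int64_t dividend, int32_t divisor,
                                 int32_t *remainder)
{
    *remainder = (int32_t)(dividend % divisor);
    return dividend / divisor;
}

/*
 * Model of sysfs_streq(): the strings are equal if they match exactly,
 * or if they differ only by a single trailing newline on one of them.
 */
static bool model_sysfs_streq(const char *s1, const char *s2)
{
    /* Skip the common prefix. */
    while (*s1 && *s1 == *s2) {
        s1++;
        s2++;
    }
    if (*s1 == *s2)                          /* both ended together */
        return true;
    if (!*s1 && *s2 == '\n' && !s2[1])       /* s2 has the extra newline */
        return true;
    if (*s1 == '\n' && !s1[1] && !*s2)       /* s1 has the extra newline */
        return true;
    return false;
}
```

A sysfs store method would typically use the string helper to compare user input such as "on\n" (as written by echo) against a keyword like "on" without stripping the newline first.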
By Jonathan Corbet
May 7, 2008
All communities develop rituals over time. One of the enduring
linux-kernel rituals is the regular heated discussion on development
processes and kernel quality. To an outside observer, these events
can give the impression that the whole enterprise is about to come crashing
down. But the reality is a lot like the New Year celebrations your editor
was privileged enough to see in Beijing: vast amounts of smoke and noise,
but everybody gets back to work as usual the next day.
Beyond that, though, discussions of this nature have real value. Any group
which is concerned about issues like quality must, on occasion, take a step
back and evaluate the situation. Even if there are no immediate outcomes,
the ideas raised often reverberate over the following months, sometimes
leading to real improvements.
The immediate inspiration for this round of discussion was broken systems
resulting from the 2.6.26 merge window. This development cycle has had a
rougher start than some, with more than the usual number of patches causing
boot failures and other sorts of inconvenient behavior. That led to some
back-and-forth between developers on how patches should be handled. Broken
patches are unfortunate, but one thing is worth noting here: these problems
were caught and fixed even before the 2.6.26-rc1 kernel release was made.
The problems which set off this round of discussion are not bugs which will
affect Linux users.
But, beyond any doubt, there will be other bugs which are slower to surface
and slower to be fixed. The number of these bugs has led to a number of
calls to slow down the development process in one way or another. To that
end, it is worth noting that the process has slowed down somewhat,
with the 2.6.26 merge window bringing in far fewer changesets than were
seen for 2.6.24 or 2.6.25. Whether this slower pace will continue into
future development cycles, or whether it's simply a lull after two
exceptionally busy cycles, remains to be seen.
But, if the process does not slow down on its own, there are developers who
would like to find a way to force it to happen. Some have argued for
simply throttling the process by, for example, limiting new features in
each development cycle to specific subsystems of the kernel. There has
also been talk of picking the subsystems with the worst regression counts
and excluding new features from those subsystems until things improve. The
fact of the matter, though, is that throttling is unlikely to help the
situation.
Slowing down merging does not keep developers from developing; it just
keeps their code out of the tree. An extreme example can be found in the
2.4 kernel: the merging of new code was heavily throttled for a long time.
What happened was that the distributors started merging new developments
themselves because their users were demanding them. So a lot of kernels
which went under the name "2.4" were far removed from anything which could
be downloaded from kernel.org. That way lies fragmentation - and almost
certainly lower quality as well.
Linus actually takes this argument further
by arguing that quickly merging patches leads to better quality:
[M]y personal belief is that the best way to raise quality of code
is to distribute it. Yes, as patches for discussion, but even more
so as a part of a cohesive whole - as _merged_ patches!
The thing is, the quality of individual patches isn't what
matters! What matters is the quality of the end result. And people
are going to be a lot more involved in looking at, testing, and
working with code that is merged, rather than code that isn't.
Andrew Morton has also argued against
throttling:
If we simply throttled things, people would spend more time
watching the shopping channel while merging smaller amounts of the
same old crap.
Kernel developers are, of course, known to be hard-core shoppers, so giving
them more opportunity to pursue that activity is probably not the best
idea. Seriously, though: Andrew is in favor of a slower development
process, but only when approached from a different angle: his point is that
an increased focus on quality will, as a side effect, result in slower
development. Kernel developers need to be focused on finding and fixing
bugs rather than creating new ones and/or shopping.
It is worth noting that a substantial portion of the development community
appears to believe that there are no real problems in this regard. Bugs
are being found and fixed at a high rate and the kernel is solid for most
users. Arjan van de Ven notes:
Are we doing worse on quality? My (subjective) opinion is that we
are doing better than last year. We are focused more on
quality. We are fixing the bugs that people hit most. We are fixing
most of the regressions (yes, not all). Subsystems are seeing flat
or lower bugcounts/bugrates.
Ted Ts'o points out that a lot of problems
result from obscure and low-quality hardware, and that it's not possible to
make everybody happy. Andrew is unconvinced, though, and seems to fear that
the kernel is declining in quality.
In a sense, though, that part of the discussion is moot. Nobody would
argue against the idea that fewer bugs is a worthy goal, regardless of whether one believes
that the current process has quality problems. So talk of ways to make
things better is always on-topic.
Testing remains a big issue; the kernel, more than almost any other
project, is highly sensitive to the systems on which it is run. Many
problems (arguably the majority of them) are related to specific hardware,
or specific combinations of hardware; there is no way for the developers,
who do not have all possible hardware to test on, to ever find all of these
bugs. Users have to help with that process. Getting widespread testing
coverage is always hard; Peter Anvin argues
that the current process has actually made that harder:
One thing is that we keep fragmenting the tester base by adding new
confidence levels: we now have -mm, -next, mainline -git, mainline
-rc, mainline release, stable, distro testing, and distro release
(and some distros even have aggressive versus conservative tracks.)
Furthermore, thanks to craniorectal immersion on the part of
graphics vendors, a lot of users have to run proprietary drivers on
their "main work" systems, which means they can't even test newer
releases even if they would dare.
There is, in fact, a wealth of development kernels to test, and it is not
always clear where users and developers should be concentrating their
testing effort. A consensus may be forming, though, that more people
should be looking at the linux-next tree in particular. Linux-next is
where all of the patches intended for the next merge window are supposed to
congregate; the current contents of linux-next, as of this writing, are
targeted toward 2.6.27. This is the place where early integration issues
and other problems should be found; if linux-next is well tested, the
number of problems showing up in the next merge window should be somewhat
reduced.
The linux-next tree is an interesting experiment. It is, for all practical
purposes, making the development cycle longer: since linux-next exists, the
2.6.27 cycle has, in some sense, already started. Linux-next also does
something which kernel developers have tended to resist: causing the
stabilization period for one development cycle to overlap with active
development for the next cycle. In the past, it has been argued that this
kind of overlap will cause developers to prioritize the creation of new
toys over fixing the problems with last week's toys.
Some people argue that this is happening now: that developers are not
spending enough time dealing with bugs, and that their carelessness is
creating too many bugs in the first place. Others assert that, while it will
never be possible to fix every reported bug, the bugs that really matter
are being addressed. A real resolution to this disagreement seems
unlikely; the creation of meaningful metrics on kernel quality is a
difficult task. About the best that can be done is to try to keep the
regression list as small as possible; as long as systems which once worked
continue to work, it is hard to argue too forcefully that things are headed
in the wrong direction.
By Jonathan Corbet
May 6, 2008
Bind mounts can be thought of as a sort of symbolic link at the filesystem
level. Using
mount --bind, it is possible to create a second
mount point for an existing filesystem, making that filesystem visible at a
different spot in the namespace. Bind mounts are thus useful for creating
specific views of the filesystem namespace; one can, for example, create a
bind mount which makes a piece of a filesystem visible within an
environment which is otherwise closed off with
chroot().
There is one constraint to be found with bind mounts as implemented in
kernels through 2.6.25, though: they have the same mount options as the
primary mount. So a command like:
mount --bind -o ro /vital_data /untrusted_container/vital_data
will fail to make /vital_data read-only under
/untrusted_container if it was mounted writable initially. On
your editor's 2.6.25 system, the failure is silent - the bind mount will be
made writable despite the read-only request and no error message will be
generated (the mount man page does document that options cannot be
changed).
There is clear value in the ability to make bind mounts read-only, though.
Containers are one example: an administrator may wish to create a container
in which processes may be running as root. It may be useful for that
container to have access to filesystems on the host, but the container
should not necessarily have write access to those filesystems. As of
2.6.26, this sort of configuration will be possible, thanks to the merging
of the read-only bind mounts patches by Dave Hansen.
As it happens, it's still not possible to create a read-only bind
mount with the command shown above; the read-only attribute can only be
added with a remount operation afterward. So the necessary sequence is
something like:
mount --bind /vital_data /untrusted_container/vital_data
mount -o remount,ro /untrusted_container/vital_data
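The same two-step sequence can be issued from a program through the mount() system call. This is a sketch under stated assumptions: the helper name `bind_mount_readonly()` is hypothetical, and the second call follows the mount(2) manual page, which describes combining MS_REMOUNT with MS_BIND and MS_RDONLY to change the read-only flag of a single mount point. The caller needs sufficient privilege to mount.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mount.h>

/*
 * Bind-mount src at dst, then make only that mount point read-only.
 * Returns 0 on success, or -1 with errno set on failure.
 */
static int bind_mount_readonly(const char *src, const char *dst)
{
    /* Step 1: create the bind mount; it starts with the original's options. */
    if (mount(src, dst, NULL, MS_BIND, NULL) != 0)
        return -1;

    /* Step 2: remount this one mount point read-only. */
    if (mount(NULL, dst, NULL, MS_REMOUNT | MS_BIND | MS_RDONLY, NULL) != 0) {
        int saved = errno;
        umount(dst);        /* do not leave a writable bind mount behind */
        errno = saved;
        return -1;
    }
    return 0;
}
```

Undoing the bind mount when the second step fails is a design choice made for this sketch; it avoids leaving the container with write access it was never meant to have.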
This example raises an interesting question: what if some process opens a
file for write access between the two mount operations? A system
administrator has the right to expect that a read-only mount will, in fact,
only be used for read operations. The 2.6.26 patch is designed to live up
to that expectation, though the amount of work required turned out to be
more than the developers might have expected.
Filesystems normally track which files are opened for write access, so an
attempt to remount a filesystem read-only can be passed to the low-level
filesystem code for approval. But the low-level filesystem knows nothing
about bind mounts, which are implemented entirely within the virtual
filesystem (VFS) layer. So making read-only access for bind mounts work
requires that the VFS keep track of all files which have been opened for
write access. Or, more precisely, the VFS really only needs to keep track
of how many files are open for write access.
The technique chosen was to create something which looks like a write lock
for filesystems. Whenever the VFS is about to do something which involves
writing, it must first call:
int mnt_want_write(struct vfsmount *mnt);
The return value is zero if write access is possible, or a negative error
code otherwise. This call can be found in obvious places - such as in the
implementation of open() - when write access is requested. But
write access comes into play in many other situations as well; for example,
renaming a file requires write access for the duration of the operation.
So mnt_want_write() calls have been sprinkled throughout the VFS
code.
When write access is no longer needed, the "write lock" should be released
with a call to:
void mnt_drop_write(struct vfsmount *mnt);
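The pairing of these two calls, and the way it closes the window between the bind mount and the read-only remount, can be illustrated with a small user-space model. Everything here is hypothetical: `struct model_mount` stands in for `struct vfsmount`, the counter is a plain integer with no locking, and the specific error codes (-EROFS for a refused write, -EBUSY for a refused remount) are assumptions of the model rather than statements from the text above.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's struct vfsmount. */
struct model_mount {
    bool readonly;  /* set once a read-only remount succeeds */
    long writers;   /* outstanding "write locks" */
};

/* Model of mnt_want_write(): zero on success, negative error otherwise. */
static int model_want_write(struct model_mount *mnt)
{
    if (mnt->readonly)
        return -EROFS;
    mnt->writers++;
    return 0;
}

/* Model of mnt_drop_write(): release a previously granted write access. */
static void model_drop_write(struct model_mount *mnt)
{
    mnt->writers--;
}

/* A read-only remount is refused while any writer is outstanding. */
static int model_remount_readonly(struct model_mount *mnt)
{
    if (mnt->writers > 0)
        return -EBUSY;
    mnt->readonly = true;
    return 0;
}

/* Typical caller pattern: bracket the modifying operation. */
static int model_rename(struct model_mount *mnt)
{
    int err = model_want_write(mnt);

    if (err)
        return err;
    /* ... the operation which modifies the filesystem goes here ... */
    model_drop_write(mnt);
    return 0;
}
```

In this model a file opened for writing holds its write access until it is closed, so a remount attempted in the meantime fails rather than silently leaving a writer behind.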
One of the discoveries which has been made is that write access is needed
in rather more places than one might have thought. In particular, it turns
out that there is need for mnt_want_write() calls within the
low-level filesystems as well as in the VFS layer. So getting the
read-only bind mounts patch into shape has been an ongoing process of
finding the spots which have been missed and adding
mnt_want_write() calls there. In an attempt to make this process
a bit less error-prone, Miklos Szeredi has put together a set of VFS helper functions
which encapsulate the situations where write access is needed. Those
functions have not been merged for 2.6.26, however.
Superficially, mnt_want_write() is easy to understand - it simply
increments a counter of outstanding write accesses. The problem with a
simple implementation, though, is that a shared, per-filesystem counter
would create scalability problems. On multiprocessor systems, the cache
line containing the counter would bounce around the system, slowing things
considerably.
A common response to this type of problem is to turn the counter into a per-CPU
variable, allowing operations on the counter to remain local to each
processor. When somebody needs to know the total value of the counters,
it's a simple matter of adding each CPU's version; this operation is slow,
but it is also rare. On big systems, though, the number of CPUs can be
large - as can the number of filesystems, and bind mounts will only
increase that number. The result is a multiplicative effect which, once
again, is a scalability problem, only this time it manifests itself in the
form of excessive memory use.
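A rough sketch of the per-CPU approach and its memory cost follows. The names are hypothetical, and a plain array stands in for real per-CPU variables, which the kernel places in separate per-processor areas; the array here only illustrates the arithmetic.

```c
#include <stddef.h>

#define PCPU_NR_CPUS 4  /* hypothetical CPU count for this model */

/* One slot per CPU; each CPU only ever touches its own slot. */
struct pcpu_counter {
    long slot[PCPU_NR_CPUS];
};

/* Fast path: a purely local update. */
static void pcpu_add(struct pcpu_counter *c, int cpu, long delta)
{
    c->slot[cpu] += delta;
}

/* Slow, rare path: the true value is the sum over all CPUs. */
static long pcpu_sum(const struct pcpu_counter *c)
{
    long total = 0;

    for (int cpu = 0; cpu < PCPU_NR_CPUS; cpu++)
        total += c->slot[cpu];
    return total;
}

/* Memory used by the counters alone: one slot per CPU per mount. */
static size_t pcpu_counter_bytes(size_t ncpus, size_t nmounts, size_t slot_size)
{
    return ncpus * nmounts * slot_size;
}
```

Note that an individual slot can go negative when write access is taken on one CPU and dropped on another; only the sum is meaningful. With, say, 1024 CPUs, 1000 mounts, and 8-byte slots, the counters alone would occupy 8,192,000 bytes.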
The read-only bind mounts patch resolves this situation by, in effect,
going back to global counters which are cached on specific processors. To
that end, each CPU has one of these structures:
struct mnt_writer {
    spinlock_t lock;
    unsigned long count;
    struct vfsmount *mnt;
};
At any given time, this structure will hold a local count for one
filesystem, represented by mnt. If the processor needs to adjust
the write count for that filesystem, it's a simple matter of incrementing
or decrementing count. When the processor's attention turns to a
different filesystem, it must first adjust the global count for the old
filesystem, then it can switch its local mnt_writer structure to
the new one. The result is a compromise between purely local and purely
global counters which yields "good enough" performance on benchmarks
designed to stress the system.
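The caching scheme can be sketched in user space as follows. This is a simplified model, not the merged code: the `wc_` names are hypothetical, the spinlock is omitted, and the local count is signed so that a drop on a different CPU than the matching want needs no special handling.

```c
#include <stddef.h>

#define WC_NR_CPUS 2  /* hypothetical CPU count for this model */

/* Hypothetical stand-in for struct vfsmount, holding the global count. */
struct wc_mount {
    long global_writers;
};

/* Per-CPU cache: a local count for at most one mount at a time. */
struct wc_writer {
    long count;
    struct wc_mount *mnt;   /* mount currently cached, or NULL */
};

static struct wc_writer wc_writers[WC_NR_CPUS];

/* Adjust the write count for mnt from the given CPU. */
static void wc_adjust(int cpu, struct wc_mount *mnt, long delta)
{
    struct wc_writer *w = &wc_writers[cpu];

    if (w->mnt != mnt) {
        /* Fold the cached count into the old mount's global counter. */
        if (w->mnt)
            w->mnt->global_writers += w->count;
        w->count = 0;
        w->mnt = mnt;
    }
    w->count += delta;      /* fast path: purely local */
}

/* Slow path: the global counter plus whatever is still cached on CPUs. */
static long wc_total_writers(const struct wc_mount *mnt)
{
    long total = mnt->global_writers;

    for (int cpu = 0; cpu < WC_NR_CPUS; cpu++)
        if (wc_writers[cpu].mnt == mnt)
            total += wc_writers[cpu].count;
    return total;
}
```

Memory use in this arrangement grows with the number of CPUs plus the number of mounts, rather than with their product, at the cost of a global update each time a processor switches from one mount to another.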
Read-only bind mounts join with other features (such as shared subtrees) to create a
flexible set of tools for the construction of the filesystem namespace. It
is not clear how much of this functionality is being used at this time,
but, as the implementation of containers in the mainline gets closer to
completion, there is likely to be more interest in this capability. Linux
systems in coming years may have much more complex filesystem layouts than
have been seen in the past.
Page editor: Jonathan Corbet