Brief items
The current development kernel is 2.6.37-rc4, released on November 29.
"As suspected, spending most of the week in Japan made some kernel
developers break out in gleeful shouts of 'let's send Linus patches when he
is jet-lagged, and see if we can confuse him even more than usual'. As a
result -rc4 has about twice the commits that -rc3 had." It's still
mostly fixes, though; see the announcement for the short-form changelog, or
the full changelog for all the details.
Stable updates: there have been no stable updates released over the
last week.
Yeah, restricting information is always a double edged sword - and
by locking down we are implicitly assuming that the number of
people trying to do harm is larger than the number of people trying
to help. It is probably true though - and the damage they can
inflict is becoming more and more serious (financially, legally and
socially - and, in some cases, physically) with every year of
humanity moving their lives to the 'net.
-- Ingo Molnar
Well yes. We take something which will fail occasionally and with
your patch replace it with something which will fail a bit more
often. Why don't we go all the way and do something which will
fail *even more often*. Namely, just delete the damn function in
the hope that the resulting failures will provoke the ext4 crew
into doing something sane this time?
-- Andrew Morton
The Linux Foundation has announced the annual update of its report on
kernel development. There is little there that will be new to LWN readers,
but, in the humble opinion of your editor (who is one of the authors), it
is a good summary of the situation. "This paper documents a bit less
frenzied development than the last one, which was expected given all the
new features of 2.6.30 (ext4, ftrace, btrfs, perf etc) as well as the peak
of merged drivers from Linux stable tree. Regardless, this report continues
to paint a picture of a very strong and vibrant development community."
By Jonathan Corbet
December 1, 2010
During the 2.6.37 merge window, a change was merged which made
/proc/kallsyms unreadable by unprivileged users by default. That
change was subsequently reverted when it was found to break the bootstrap
process on an older Ubuntu release. A new form of the patch has returned
which fixes that problem - but it still may not be merged.
The new patch is quite simple: if the
process reading the file lacks the CAP_SYS_ADMIN capability,
/proc/kallsyms appears to be an empty file. It has been confirmed
that this version of the patch no longer breaks user space. But there were
complaints anyway: rather than restricting access to the file with the
usual access control bits, this patch encodes a policy
(CAP_SYS_ADMIN) into the kernel where it cannot be changed. That
rubs a number of people the wrong way, so this patch probably will not go
in either. Instead, concerned administrators (or distributors) will need
to simply change the permissions on the file at boot time.
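As a rough illustration of that last option, the permission change could be made by a small helper run from an init script. This is only a sketch: the function names are hypothetical, and it assumes that a plain chmod() on the file (here reduced to owner-read-only) is the policy the administrator wants.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical helper: make a file readable by its owner only (mode 0400).
 * An init script might call it on "/proc/kallsyms" early in boot. */
int restrict_to_owner(const char *path)
{
    assert(path != NULL);
    if (chmod(path, S_IRUSR) != 0) {
        perror(path);
        return -1;
    }
    return 0;
}

/* Returns 1 if the permission bits of path are exactly 0400, 0 otherwise. */
int is_owner_read_only(const char *path)
{
    struct stat st;

    assert(path != NULL);
    if (stat(path, &st) != 0)
        return 0;
    return (st.st_mode & 0777) == S_IRUSR;
}
```

Calling restrict_to_owner("/proc/kallsyms") as root would be the intended use; the policy then stays in user space, where it can be changed without rebuilding the kernel.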
By Jonathan Corbet
December 1, 2010
Many of the kernel security vulnerabilities reported are information leaks
- passing the contents of uninitialized memory back to user space. These
leaks are not normally seen to be severe problems, but the potential for
trouble always exists. An attacker may be able to find a sequence of
operations which puts useful information (a cryptographic key, perhaps)
into a place where the kernel will leak it. So information leaks should be
avoided, and they are routinely fixed when they are found.
Many information leaks are caused by uninitialized structure members. It
can be easy to forget to assign to all members in all paths, or, possibly,
the form of the structure might change over time. One way to avoid that
possibility is to use something like memset() to clear the entire
structure at the outset. Kernel code uses memset() in many
places, but there are places where that is seen as an expensive and
unnecessary call; why clear a bunch of memory which will be assigned to
anyway?
One way of combining operations is with a structure initialization like:
    struct foo {
        int bar, baz;
    } f = {
        .bar = 1,
    };
In this case, the baz field will be implicitly set to zero. This
kind of declaration should ensure that there will be no information leaks
involving this structure. Or maybe not. Consider this structure instead:
    struct holy_foo {
        short bar;
        long baz;
    };
On a 32-bit system, this structure likely contains a two-byte hole between
the two members. It turns out that the C standard does not require the
compiler to initialize holes; it also turns out that GCC duly leaves them uninitialized. So, unless
one knows that a given structure cannot have any holes on any relevant
architecture, structure initializations are not a reliable way of avoiding
uninitialized data.
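The size of such a hole can be checked rather than guessed. The following user-space sketch (the helper name is made up for illustration) uses offsetof() to measure the padding between the two members; the result depends on the ABI, typically two bytes on 32-bit systems and six on 64-bit ones.

```c
#include <assert.h>
#include <stddef.h>

struct holy_foo {
    short bar;
    long baz;
};

/* Number of padding bytes the compiler placed between bar and baz. */
size_t holy_foo_hole_bytes(void)
{
    size_t hole = offsetof(struct holy_foo, baz) - sizeof(short);

    /* Padding before baz can never be as large as baz itself. */
    assert(hole < sizeof(long));
    return hole;
}
```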
There has been some talk of asking the GCC developers to change their
behavior and initialize holes, but, as Andrew Morton pointed out, that would not help for at least
the next five years, given that older compilers would still be in use. So it
seems that there is no real alternative to memset() when
initializing structures which will be passed to user space.
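A minimal user-space sketch of that pattern follows; the function names are illustrative only. The structure is cleared in full before any member is assigned, so the hole holds zeroes rather than stale data. Strictly speaking, the C standard leaves padding contents unspecified after a member store, so this relies on how compilers behave in practice.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct holy_foo {
    short bar;
    long baz;
};

/* Clear the whole object, hole included, then assign the members. */
void holy_foo_init(struct holy_foo *f, short bar, long baz)
{
    assert(f != NULL);
    memset(f, 0, sizeof(*f));
    f->bar = bar;
    f->baz = baz;
}

/* Returns 1 if every padding byte between bar and baz is zero. */
int holy_foo_hole_is_clear(const struct holy_foo *f)
{
    const unsigned char *bytes = (const unsigned char *)f;
    size_t i;

    assert(f != NULL);
    for (i = sizeof(short); i < offsetof(struct holy_foo, baz); i++)
        if (bytes[i] != 0)
            return 0;
    return 1;
}
```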
Kernel development news
By Jonathan Corbet
November 30, 2010
Tracepoints are small hooks placed into kernel code; when they are enabled,
they can generate event information which can be consumed through the
ftrace or perf interfaces. These tracepoints are defined via the decidedly
gnarly TRACE_EVENT() macro which Steven Rostedt nicely described in detail
for LWN earlier this year. As kernel developers add more tracepoints to the
kernel, they are
occasionally finding things which can be improved. One of those seems
relatively simple: what if a tracepoint should only fire some of the time?
Arjan van de Ven recently posted a patch adding
a tracepoint to __mark_inode_dirty(), a function called deep
within the virtual filesystem layer to, surprisingly, mark an inode as
being dirty. Arjan's purpose is to figure out which processes are causing
files to have dirty contents; that will allow tools like PowerTop to tell
laptop users which process is causing their disk to spin up. The only
problem is that some calls to __mark_inode_dirty() are essentially
noise from this point of view; they happen, for example, when an inode is
first created or is being freed. Tracing those calls could create a stream
of useless events which would have to be filtered out by PowerTop, causing
PowerTop itself to require more power. So it is preferable to avoid
creating those events in the first place if possible.
For that reason, Arjan made the call to the tracepoint be
conditional:
    if (flags & (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES))
        trace_writeback_inode_dirty(inode, flags);
This code works in that it causes the tracepoint to be "hit" only when an
application has actually done something to dirty an inode.
The VFS developers seem to have no objection to this tracepoint being
added; the resulting information can be useful. But they didn't like the
conditional nature of it. Part of the problem is that tracepoints are
supposed to keep a low profile; developers want to be able to ignore them
most of the time. Expanding a tracepoint to two lines and an if
statement rather defeats that goal. But tracepoints are also supposed to
not affect execution time. They have been carefully coded to impose almost
no overhead when they are not enabled (which is most of the time); with
techniques like jump label, that overhead
can be reduced even further. But that if statement, being outside
of the tracepoint altogether, will always be executed regardless of whether
the tracepoint is currently enabled or not. Multiply that test-and-jump
across millions of calls to __mark_inode_dirty() on each of
millions of machines, and the extra CPU cycles start to add up.
So it was asked: could this test be moved into the tracepoint
itself? One approach might be to put the test into the
TP_fast_assign() portion of the tracepoint, which copies the
tracepoint data into the tracing ring buffer. The problem with that idea
is that, by that time, the tracepoint has already fired, space has been
allocated in the ring buffer, etc. There is currently no mechanism to
cancel a tracepoint hit partway through. There has, in the past, been
talk of adding some sort of "never mind" operation which could be invoked
within TP_fast_assign(), but that idea seems less than entirely
elegant.
What might happen, instead, is the creation of a variant of
TRACE_EVENT() with a name like TRACE_EVENT_CONDITION().
It would take an extra parameter which would be, of course, another tricky
macro. For Arjan's tracepoint, the condition would look something like:
    TP_CONDITION(
        if (flags & (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES))
            return 1;
        else
            return 0;
    ),
The tracepoint code would then test the condition before doing any other
work associated with the tracepoint - but only if the tracepoint itself has
been enabled.
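The ordering described here can be modeled in user space. The sketch below is not kernel code: the DEMO_ flag values, the structure, and the function names are all invented for illustration. It only shows the intended order of tests, with the enabled check first and the condition evaluated only when the tracepoint is on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in flag values for illustration; not the kernel's definitions. */
#define DEMO_I_DIRTY_SYNC     (1u << 0)
#define DEMO_I_DIRTY_DATASYNC (1u << 1)
#define DEMO_I_DIRTY_PAGES    (1u << 2)
#define DEMO_I_NEW            (1u << 3)

struct demo_tracepoint {
    bool enabled;
    unsigned long condition_checks; /* how often the condition was tested */
    unsigned long events;           /* how often an event was recorded */
};

/* The condition that would be supplied through TP_CONDITION(). */
static bool demo_dirty_condition(unsigned int flags)
{
    return (flags & (DEMO_I_DIRTY_SYNC | DEMO_I_DIRTY_DATASYNC |
                     DEMO_I_DIRTY_PAGES)) != 0;
}

/* Disabled tracepoints return before the condition is ever looked at. */
static void demo_trace_inode_dirty(struct demo_tracepoint *tp,
                                   unsigned int flags)
{
    assert(tp != NULL);
    if (!tp->enabled)
        return;
    tp->condition_checks++;
    if (!demo_dirty_condition(flags))
        return;
    tp->events++; /* the real code would write to the ring buffer here */
}
```

With the tracepoint disabled, a call costs one test of the enabled flag and nothing else, which is the property the VFS developers were asking for.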
This solution should help to keep the impact of tracepoints to a minimum
once again, especially when those tracepoints are not enabled. There is
one potential problem in that the condition is now hidden deeply within the
definition of the tracepoint; that definition is usually found in a special
header file far from the code where the tracepoint is actually inserted.
At the tracepoint itself, the condition which might cause it not to fire is
not visible in any way. So, if somebody other than the initial developer
wants to use the tracepoint, they could misinterpret a lack of output as a
sign that the surrounding code is not being executed at all. That little
problem could presumably be worked around with clever tracepoint naming,
better documentation, or simply expecting users to understand what
tracepoints are actually telling them.
By Jonathan Corbet
December 1, 2010
An old-style rotating disk drive does not really care if any specific block
contains useful data or not. Every block sits in its assigned spot (in a
logical sense, at least), and the knowledge that the operating system does
not care about the contents of any particular block is not something the
drive can act upon in
any way. More recent storage devices are different, though; they can - in
theory, at least - optimize their behavior if they know which blocks are
actually worth hanging onto. Linux has a couple of mechanisms for
communicating that knowledge to the block layer - one added for 2.6.37 -
but it's still not clear which of those is best.
So when might a block device want to know about blocks that the host system
no longer cares about? The answer is: just about any time that there is a
significant mapping layer between the host's view of the device and the
true underlying medium. One example is solid-state storage devices (SSDs).
These devices must carefully shuffle data around to spread erase cycles
across the media; otherwise, the device will almost certainly fail
prematurely. If an SSD knows which blocks the system actually cares about,
it can avoid copying the others and make the best use of each erase cycle.
A related technology is "thin provisioning," where a storage array claims
to be much larger than it really is. When the installed storage fills, the
device can gently suggest that the operator install more drives,
conveniently available from the array's vendor. In the absence of
knowledge about disused blocks, the array must assume that every block that
has ever been
written to contains useful data. That approach may sell more drives in the
short term, but vendors who want their customers to be happy in the long
term might want to be a bit smarter about space management.
Regardless of the type of any specific device, it cannot know about
uninteresting blocks unless the operating system tells it. The ATA and
SCSI standard committees have duly specified operations for communicating
this information; those operations are often called "trim" or "discard" at
the operating system level. Linux has had support for trim operations for
some time in the block layer; a few filesystems (and the swap code) have
also been modified to send down trim commands when space is freed up. So
Linux should be in good shape when it comes to trim support.
The only problem is that on-the-fly trim (also called "online discard")
doesn't work that well. On some devices, it slows operation considerably;
there have also been some claims that excessive trimming can, itself, shorten
drive life. The fact that the SATA version of trim is a non-queued
operation (so all other I/O must be stopped before a trim can be sent to
the drive) is also extremely unhelpful. The observed problems have been so
widespread that SCSI maintainer James Bottomley was recently heard to say:
However, I think it's time to question whether we actually still
want to allow online discard at all. Most of the benchmarks show
it to be a net lose to almost everything (either SSD or Thinly
Provisioned arrays), so it's become an "enable this to degrade
performance" option with no upside.
The alternative is "batch discard," where a trim operation is used to mark
large chunks of the device unused in a single operation. Batch discard
operations could
be run from the filesystem code; they could also run periodically from user
space. Using batch discard to run trim on every free space extent would be
a logical thing to do after an fsck run as well. Batching
discard operations implies that the drive does not know immediately when
space becomes unused, but it should be a more performance- and
drive-friendly way to do things.
The 2.6.37 kernel includes a new ioctl() command called FITRIM
which is intended for batch discard operations. The parameter to FITRIM is
a structure describing the region to be marked:
    struct fstrim_range {
        uint64_t start;
        uint64_t len;
        uint64_t minlen;
    };
An ioctl(FITRIM) call will instruct the filesystem that the free
space between start and start+len-1 (in bytes) should be
marked as unused. Any extent less than minlen bytes will be
ignored in this process. The operation can be run over the entire device
by setting start to zero and len to ULLONG_MAX.
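From user space, a whole-filesystem trim might look something like the sketch below. The helper names are hypothetical; the code assumes a Linux system whose <linux/fs.h> exports FITRIM and struct fstrim_range, and that the caller has sufficient privilege and a filesystem which implements the command.

```c
#include <assert.h>
#include <fcntl.h>
#include <limits.h>
#include <linux/fs.h>   /* FITRIM, struct fstrim_range */
#include <stddef.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Build a range covering the entire filesystem. */
struct fstrim_range whole_filesystem_range(unsigned long long minlen)
{
    struct fstrim_range range;

    range.start = 0;
    range.len = ULLONG_MAX;
    range.minlen = minlen;
    return range;
}

/* Ask the filesystem mounted at mount_point to discard its free space,
 * skipping free extents shorter than minlen bytes.
 * Returns 0 on success, -1 on failure. */
int trim_free_space(const char *mount_point, unsigned long long minlen)
{
    struct fstrim_range range = whole_filesystem_range(minlen);
    int fd, ret;

    assert(mount_point != NULL);
    fd = open(mount_point, O_RDONLY);
    if (fd < 0)
        return -1;
    ret = ioctl(fd, FITRIM, &range);
    close(fd);
    return ret < 0 ? -1 : 0;
}
```

A periodic job could call trim_free_space("/", 0) at a quiet time of day; on a filesystem without FITRIM support the ioctl() simply fails and the function returns -1.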
It's worth repeating that this command is implemented by the filesystem, so
only the space known by the filesystem to be free will actually be
trimmed. In 2.6.37, it appears that only ext4 will have FITRIM
support, but other filesystems will certainly get that support in time.
Batch discard using FITRIM should address the problems seen with
online discard - it can be applied to large chunks of space, at a time
which is convenient for users of the system. So it may be tempting to just
give up on online discard. But Chris Mason cautions against doing that:
At any rate, I definitely think both the online trim and the FITRIM
have their uses. One thing that has burnt us in the past is coding
too much for the performance of the current crop of ssds when the
next crop ends up making our optimizations useless.
This is the main reason I think the online trim is going to be better
and better.
So the kernel developers will probably not trim online discard support at
this time. No filesystem enables it by default, though, and that seems
unlikely to change. But if, at some future time, implementations of the
trim operation improve, Linux should be ready to use them.
By Jonathan Corbet
November 30, 2010
The kernel has historically been developed independently of anything that
runs in user space. The well-defined kernel ABI, built around the POSIX
standard, has allowed for a nearly absolute separation between the kernel
and the rest of the system. Linux is nearly unique, however, in its
division of kernel and user-space development. Proprietary operating systems
have always been managed as a single project encompassing both user and
kernel space; other free systems (the
BSDs, for example) are run that way as well. Might Linux ever take a
more integrated approach?
Christopher Yeoh's cross-memory attach
patch was covered here last September. He recently sent out a new
version of the patch, wondering, in the process, how he could get a
response other than silence. Andrew Morton answered that new system calls are
increasingly hard to get into the mainline:
We have a bit of a track record of adding cool-looking syscalls and
then regretting it a few years later. Few people use them, and
maybe they weren't so cool after all, and we have to maintain them
for ever. Bugs (sometimes security-relevant ones) remain
undiscovered for long periods because few people use (or care
about) the code.
Ingo Molnar jumped in with a claim that the C
library (libc) is the real problem. Getting a new feature into the
kernel and, eventually, out to users takes long enough. But getting
support for new system calls into the C library seems to take much longer.
In the meantime, those system calls languish, unused. It is
possible for a suitably motivated developer to invoke an unsupported system
call with syscall(), but that approach is fiddly, Linux-specific,
and not portable across architectures (since system call numbers can change
from one architecture to the next). So most real-world use of
syscall() is probably due to kernel developers testing out new
system calls.
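For the curious, direct invocation looks roughly like the sketch below. It uses getpid only because that call exists everywhere and is easy to check; the wrapper name is made up, and a developer testing a genuinely new system call would have to supply its number by hand for each architecture.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>   /* SYS_getpid */
#include <sys/types.h>
#include <unistd.h>        /* syscall() */

/* Invoke getpid by number, bypassing the C library's wrapper. */
pid_t raw_getpid(void)
{
    long ret = syscall(SYS_getpid);

    assert(ret > 0);
    return (pid_t)ret;
}
```

The result should match what the libc getpid() wrapper returns; the difference is that the caller, not the library, is now responsible for the call number and argument conventions.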
But, Ingo said, it doesn't have to be that way:
If we had tools/libc/ (mapped by the kernel automagically via the
vDSO), where people could add new syscall usage to actual,
existing, real-life libc functions, where the improvements could
thus propagate into thousands of apps immediately, without
requiring any rebuild of apps or even any touching of the
user-space installation, we'd probably have _much_ more lively
development in this area.
Ingo went on to describe some of the
benefits that could come from a built-in libc. At the top of the list is
the ability to make standard libc functions take advantage of new system
calls as soon as they are available; applications would then get immediate
access to the new calls. Instrumentation could be added, eventually
integrating libc and kernel tracing. Perhaps something better could have
been done with asynchronous I/O. And so on. He concluded by saying
"Apple and Google/Android understands that single-project mentality
helps big time. We don't yet."
As of this writing, nobody has responded to this suggestion. Perhaps it
seems too fantastical, or, perhaps, nobody is reading the cross-memory
attach thread. But it is an interesting idea to ponder.
In the early days of Linux kernel development, the purpose was to create an
implementation of a well-established standard for which a great deal of
software had already been written. There was room for discussion about how
a specific system call might be implemented between the C library and
the kernel, but the basic nature of the task was well understood. At this
point, Linux has left POSIX far behind; that standard is fully
implemented and any new functionality goes beyond it. New system calls are
necessarily outside of POSIX, so taking advantage of them will
require user-space changes that, say, a
better open() implementation would not. But new features are
only really
visible if and when libc responds by making use of them and by making them
available to applications. The library most of us use (glibc) has not
always been known for its quick action in that regard.
Turning libc into an extension of the kernel itself would short out the
current library middlemen. Kernel developers could connect up and make use of
new system calls immediately; they would be available to applications at
the same time that the kernel itself is. The two components would
presumably, over time, work together better. A kernel-tied libc could also
shed a lot of compatibility code which is required if it must work properly
with a wide range of kernel versions. If all went well, we could have a
more tightly integrated libc which offers more functionality and better
performance.
Such a move would also raise some interesting questions, naturally,
starting with "which libc?" The obvious candidate would be glibc, but it's
a large body of code which is not universally loved. The developers of
whichever version of libc is chosen might want to have a say in the matter;
they might not immediately welcome their new kernel overlords.
One would hope that the ability to run the system with an
alternative C library would not be compromised. Picking up the pace
of libc development might bring interesting new capabilities, but there is
also the ever-present possibility of introducing new regressions.
Licensing could raise some issues of its own; an integrated libc would
have to remain separate enough to carry a different license.
And, one
should ask, where would the process stop? Putting nethack into the kernel
repository might just pass muster, but, one assumes, Emacs would encounter
resistance and LibreOffice is probably out of the question.
So a line needs to be drawn somewhere. This idea has come up in the past,
and the result has been that the line has stayed where it always was: at
the kernel/user-space boundary. Putting perf into the kernel repository
has distorted that line somewhat, though. By most accounts, the perf
experiment has been a success; perf has evolved from a rough utility to a
powerful tool in a surprisingly short time. Perhaps an integrated C
library would be an equally interesting experiment. Running that
experiment would take a lot of work, though; until somebody shows up with a
desire to do that work, it will continue to be no more than a
thought experiment.
Page editor: Jonathan Corbet