Brief items
The current 2.6 development kernel is 2.6.30-rc6, released on May 15.
"Things definitely are calming down, with just about 300 commits in
the last week. And most of them are pretty small too, although the powerpc
updates brought some defconfig changes that look largish." The long-format
changelog has all the details.
The current stable kernel is 2.6.29.4, released on May 19. It
contains a fair number of fixes, some with apparent security implications,
and some user-space API improvements for labeled networking. 2.6.27.24, released at the same
time, contains a smaller (but still significant) list of fixes.
Kernel development news
The Titanic sank because there was too much water onboard after
all. How it got in there (Iceberg, torpedoed by the CIA (oh it did
not yet exist) in retaliation for British misbehavior) would be
good to know. But the message should not suggest increasing the
size of the Titanic because it cannot hold all the incoming water.
--
Christoph Lameter (thanks to Florian Mickler)
Even it if is, in fact, correct, it's such an egregious violation
of good style, that your good programmer's card is going to lose a
big coupon and have a hole punched in it.
--
Pete Zaitcev
By Jonathan Corbet
May 20, 2009
The use of get_random_int() for address space layout randomization was
examined on last week's Security page,
but, since then, some additional ideas have been discussed.
Performance was the main reason to stick with the partial MD4 hash that
is in the current code, but other possibilities are being considered.
Eric Biederman noted that using a
cipher such as AES might produce high-quality randomness without
the performance penalty, but Matt Mackall didn't agree: "It's also unclear
that encrypting small blocks with software AES is actually a
performance win relative to SHA1, once you look at the key scheduling
overhead and the cache footprint of its s-boxes."
Willy Tarreau pointed out that the hash
functions used to produce the random numbers actually generate far more
data than is currently used. Storing the output of
half_md4_transform() (or that of a different implementation
using the much stronger SHA1 hash) and returning 4-byte chunks from the
full (128- or 160-bit) hash value for each call would perform
much better. But Linus Torvalds is concerned that giving out more of the hash
value could lead to easier attacks: "I personally suspect that it
would be _much_ easier to attack the hash if we actually gave out the
whole 16 bytes (over several iteration), when compared to what we do
now (only give out a small part and then re-hash)." No patches
emerged from the discussion, but no one is completely happy with the
current implementation, so that could change at some point.
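Willy's suggestion can be illustrated with a small user-space sketch. It is not kernel code: toy_hash() is a made-up, non-cryptographic stand-in for half_md4_transform(), and the rand_pool structure and its function names are invented for this example. The point is only that one hash invocation can feed several 4-byte requests.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DIGEST_WORDS 4	/* a 128-bit digest, handed out 32 bits at a time */

/* Hypothetical stand-in for the real hash; NOT cryptographically strong. */
static void toy_hash(uint32_t digest[DIGEST_WORDS],
		     const uint32_t secret[DIGEST_WORDS])
{
	for (int i = 0; i < DIGEST_WORDS; i++) {
		uint32_t x = digest[i] ^ secret[i];

		x ^= x << 13;
		x ^= x >> 17;
		x ^= x << 5;
		digest[i] = x + digest[(i + 1) % DIGEST_WORDS];
	}
}

struct rand_pool {
	uint32_t digest[DIGEST_WORDS];
	uint32_t secret[DIGEST_WORDS];
	unsigned int next;	/* index of the next unused digest word */
	unsigned int rehashes;	/* hash invocations so far, for illustration */
};

static void pool_init(struct rand_pool *p, const uint32_t secret[DIGEST_WORDS])
{
	assert(p != NULL);
	memset(p, 0, sizeof(*p));
	memcpy(p->secret, secret, sizeof(p->secret));
	p->next = DIGEST_WORDS;	/* force a hash on first use */
}

/* Return the next cached word, re-hashing only when the digest is used up. */
static uint32_t pool_get_int(struct rand_pool *p)
{
	if (p->next >= DIGEST_WORDS) {
		toy_hash(p->digest, p->secret);
		p->rehashes++;
		p->next = 0;
	}
	return p->digest[p->next++];
}
```

With a 128-bit digest, the hash runs once for every four calls rather than once per call. That saving is exactly what Linus's objection is about: every word of each digest ends up in the hands of a potential attacker.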
OOM by any other name. The out-of-memory (OOM) killer has often
been featured on this page. But Christoph Lameter raised a previously unseen complaint about OOM
in a recent discussion:
While we are at it: Could we get rid of the name "Out of Memory"
and stop printing texts to that effect? What we call an OOM is a
failure to perform memory reclaim or we are running out of reserves
due to not being able to run reclaim. Mostly this is due to OS
internal issues having nothing to do actual amounts of memory
available.
Christoph says that users are tempted to add more memory in response to OOM
situations, while the real fix is often to be found elsewhere - fixing the
application which is locking down the bulk of RAM, for example. There was
some discussion of changing the OOM messages to read something like "Unable
to satisfy memory allocation request and not making progress reclaiming
from other sources," but nothing has been merged at this point.
Tracepoints leave a trace: one of the key design criteria for kernel
tracepoints is that their impact must be minimal when they are not being
used. So Jason Baron was surprised to find out that, with the current
tracepoint implementation, arguments to tracepoints are being evaluated
before the determination of whether the tracepoint is active. In response,
he prepared a patch to
prevent the evaluation of arguments for inactive tracepoints. The patch
works, but at a cost: it requires a change to the internal tracepoint API.
There is resistance to the API change, mostly because the people involved
seem to think that the new version is uglier and that the problem is really
a bug in GCC. It may prove hard to avoid,
though; even if GCC is fixed soon, older versions will be out there for a long
time. Minimizing overhead is seen as more important than API beauty, so,
unless somebody comes up with a clever workaround, there may be no avoiding
an unwanted tracepoint change.
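The difference between evaluating arguments before or after the "is it active?" test can be shown in plain C. This is a sketch of the general idea only; it is neither the kernel's tracepoint macros nor Jason's patch, and every name in it (trace_eager(), TRACE_LAZY(), expensive_arg()) is invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

static int expensive_calls;	/* how often the argument was computed */
static int probe_hits;		/* how often the probe actually ran */

/* Stands in for a costly argument expression with side effects. */
static int expensive_arg(void)
{
	expensive_calls++;
	return 42;
}

static void probe(int value)
{
	assert(value == 42);
	probe_hits++;
}

/* Eager form: the caller computes the argument before the check is made. */
static void trace_eager(bool enabled, int value)
{
	if (enabled)
		probe(value);
}

/*
 * Lazy form: the argument expression is placed inside the conditional,
 * so it is only evaluated when the tracepoint is enabled.
 */
#define TRACE_LAZY(enabled, expr)		\
	do {					\
		if (enabled)			\
			probe(expr);		\
	} while (0)
```

Calling trace_eager(false, expensive_arg()) still runs expensive_arg(), while TRACE_LAZY(false, expensive_arg()) does not. The price of the lazy form is that the call site has to be a macro, which hints at why an API change is needed to get this behavior.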
Union mounts: there is a
new set of union mount patches out there for evaluation, along with the necessary user-space tools.
Valerie Aurora has put together a
howto page for those who would like to experiment with this code. See
this LWN article (also by
Valerie) for information on the implementation of union mounts.
The GlusterFS 2.0 release is available. GlusterFS is an interesting cluster
filesystem which runs mostly in user space and which claims to scale into
the petabyte range. See the feature list for an overview. "As your volume
size grows beyond 32TBs, fsck (filesystem check) downtime becomes a huge
problem. GlusterFS has no fsck. It heals itself transparently with very
little impact on performance." License is GPLv3.
By Jonathan Corbet
May 19, 2009
The proposed
reflink() system call creates an interesting cross
between a hard link and a file copy. The end result of a successful
reflink() call is a new, distinct file - with its own inode -
which shares data blocks with the original file. A copy-on-write policy is
used, so the two files remain distinct; if one is modified, the changes
will not be visible in the other. This call has a number of uses,
including fast snapshotting and as a sort of optimized copy operation.
But, as was described in the previous article on reflink(), there is some
disagreement over how file ownership and security-related metadata should
be handled.
It comes down to the different use cases for this system call. In the
"snapshot" case, security information must be preserved; that, in turn,
means that reflink() can only be used by the owner of the file (or
by a process with sufficient capabilities to get around ownership
restrictions). On the other hand, those wanting to use reflink()
as a fast file copy
would rather see security information treated like it would be with a file
copy; the user creating the reflink must have read access to the original
file and ends up owning the new one.
For a while, it seemed like the reflink-as-copy use case was simply going
to be left out in the cold. But then Joel Becker, the author of the
reflink() patches, proposed a compromise. If the
process calling reflink() had ownership or suitable privilege, the
snapshot semantics would prevail. Otherwise, read access would be required
and a new set of security attributes would be applied. The idea was to try
to automatically do the right thing in all situations.
In the end, though, this approach didn't fly either. From Andy Lutomirski's objection:
There are plenty of syscalls that require some privilege and fail
if the caller doesn't have it. But I can think of only one syscall
that does *something different* depending on who called it: setuid.
Please search the web and marvel at the disasters caused by
setuid's magical caller-dependent behavior (the sendmail bug is
probably the most famous). This proposal for reflink is just
asking for bugs where an attacker gets some otherwise privileged
program to call reflink but to somehow lack the privileges
(CAP_CHOWN, selinux rights, or whatever) to copy security
attributes, thus exposing a link with the wrong permissions.
Others agreed that automagically changing behavior depending on caller
privilege was not the best way to go. So Joel went back to the drawing
board yet another time.
On May 15, he came back with a new
proposal. The reflink() API would now look like:
int reflink(const char *oldpath, const char *newpath, int preserve);
The new preserve parameter would be a set of flags allowing the
caller to specify which bits of security-oriented information are to be
preserved. Anticipated values are:
- REFLINK_ATTR_OWNER: keep the ownership of the file the
same. The caller must either be the owner or have the
CAP_CHOWN capability.
- REFLINK_ATTR_SECURITY: preserves the SELinux/SMACK/TOMOYO
Linux security state. This flag is only valid if
REFLINK_ATTR_OWNER is also provided. In the absence of
REFLINK_ATTR_SECURITY, the new link gets a brand-new security
state, as if it were any other new file.
- REFLINK_ATTR_MODE: the discretionary access control
permissions bits remain the same; requires ownership or
CAP_FOWNER.
- REFLINK_ATTR_ACL: all access control lists are preserved.
This only works if REFLINK_ATTR_MODE is specified.
The API would also provide REFLINK_ATTR_NONE and
REFLINK_ATTR_ALL, with the obvious semantics. Importantly, if the
caller lacks
the requisite credentials to preserve the requested information, the call
will simply fail. There will be no magically-changing semantics depending
on the caller's capabilities.
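The rules described above amount to a small validation step. The sketch below is one possible reading of them, not Joel's code: the numeric flag values, the struct caller type, the check_preserve() function, and the choice of -EINVAL and -EPERM as error codes are all assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit values; the proposal names the flags, not the numbers. */
#define REFLINK_ATTR_NONE	0x0
#define REFLINK_ATTR_OWNER	0x1
#define REFLINK_ATTR_SECURITY	0x2
#define REFLINK_ATTR_MODE	0x4
#define REFLINK_ATTR_ACL	0x8
#define REFLINK_ATTR_ALL	(REFLINK_ATTR_OWNER | REFLINK_ATTR_SECURITY | \
				 REFLINK_ATTR_MODE | REFLINK_ATTR_ACL)

/* Invented summary of what the calling process is allowed to do. */
struct caller {
	bool is_owner;
	bool cap_chown;
	bool cap_fowner;
};

/* Return 0 if the request is acceptable, or a negative errno if not. */
static int check_preserve(int preserve, const struct caller *c)
{
	assert(c != NULL);

	if (preserve & ~REFLINK_ATTR_ALL)
		return -EINVAL;
	/* SECURITY is only valid together with OWNER. */
	if ((preserve & REFLINK_ATTR_SECURITY) && !(preserve & REFLINK_ATTR_OWNER))
		return -EINVAL;
	/* ACL is only valid together with MODE. */
	if ((preserve & REFLINK_ATTR_ACL) && !(preserve & REFLINK_ATTR_MODE))
		return -EINVAL;
	/* Missing credentials cause failure, never a silent downgrade. */
	if ((preserve & REFLINK_ATTR_OWNER) && !(c->is_owner || c->cap_chown))
		return -EPERM;
	if ((preserve & REFLINK_ATTR_MODE) && !(c->is_owner || c->cap_fowner))
		return -EPERM;
	return 0;
}
```

An unprivileged reader asking for REFLINK_ATTR_ALL simply gets an error back; the same caller asking for REFLINK_ATTR_NONE succeeds, which covers the reflink-as-copy case.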
Joel also proposes some new flags to the ln command:
- -r requests that a reflink be made.
- -P says that the reflink() call should use
REFLINK_ATTR_ALL.
- -p (lower case) is like -P, except that it will
retry with REFLINK_ATTR_NONE if the first call fails.
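The difference between -P and -p can be sketched as a thin user-space wrapper. Since reflink() is only a proposal, the call is passed in as a function pointer here; the flag values, the ln_mode enum, ln_reflink(), and the stub are all hypothetical names made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical values; only the flag names come from the proposal. */
#define REFLINK_ATTR_NONE	0x0
#define REFLINK_ATTR_ALL	0xf

/* Same shape as the proposed reflink() prototype. */
typedef int (*reflink_fn)(const char *oldpath, const char *newpath,
			  int preserve);

enum ln_mode {
	LN_PRESERVE_STRICT,	/* -P: preserve everything or fail */
	LN_PRESERVE_TRY,	/* -p: fall back to preserving nothing */
};

static int ln_reflink(reflink_fn do_reflink, enum ln_mode mode,
		      const char *oldpath, const char *newpath)
{
	assert(do_reflink != NULL);

	int ret = do_reflink(oldpath, newpath, REFLINK_ATTR_ALL);

	/* The retry is explicit and happens in user space, not in the kernel. */
	if (ret != 0 && mode == LN_PRESERVE_TRY)
		ret = do_reflink(oldpath, newpath, REFLINK_ATTR_NONE);
	return ret;
}

/* Stub modeling an unprivileged caller: only "preserve nothing" works. */
static int stub_unprivileged(const char *oldpath, const char *newpath,
			     int preserve)
{
	(void)oldpath;
	(void)newpath;
	return preserve == REFLINK_ATTR_NONE ? 0 : -1;
}
```

The fallback that was rejected as kernel magic reappears here as a policy the user asks for by name, with the system call itself behaving the same way for every caller.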
There was some question as to whether all the flags are necessary; perhaps
all that is really needed is "preserve all" or "preserve none." But Joel
feels that one might as well add the flexibility, given that the argument
is being added to the API anyway, and there doesn't seem to be much
strong sentiment to the contrary. All told, the reflink() API
would appear to be stabilizing toward something that everybody can agree
on. It's probably too late for 2.6.31, but this new system call could conceivably be
ready for the 2.6.32 development cycle.
By Jonathan Corbet
May 18, 2009
Prior to the 2.6 kernel series, the Linux block layer was somewhat
simplistic and inflexible; it showed a lot of history from the early days
of the Linux kernel. With the 2.5 development series came a complete
rewrite; there have, of course, been a great many changes since then as
well. But there are still
bits of history to be found in the Linux block API. If Tejun Heo has his
way, some of that history will be gone in the near future.
The standard way for a block driver to gain access to the next I/O request
in the queue is with a call to:
struct request *elv_next_request(struct request_queue *queue);
This function returns the request which is, in the I/O scheduler's opinion,
the best one to execute next. An interesting feature of
elv_next_request() is that it leaves the request on the queue; two
calls to elv_next_request() in quick succession will return
pointers to the same request. A block driver can explicitly remove the
request from the queue with a call to blkdev_dequeue_request(),
but that step is not necessary. If a request remains at the head of the
queue when the block driver indicates that it has been completed, the block
layer will dequeue the request at that time.
Leaving the request on the queue is a throwback to the very early days,
when requests were handled one at a time - often a single sector at a
time. By hiding the queuing details, the block layer made life easier for
simple block drivers. But this apparent simplicity comes at a cost: it
complicates the block API and makes it impossible for the block layer to
know when processing of a request has begun. So it's not possible to do
reliable request timing when drivers work on requests which remain on the
queue.
This feature is also increasingly useless. Any contemporary driver worth
its salt will process multiple requests at once; that, in turn, requires
that the driver dequeue requests and keep track of them itself. So few
drivers that people actually care about use the process-on-queue model.
Given that, Tejun has come to the conclusion that processing on-queue
requests is an idea whose time has passed. He has posted a lengthy patch series to make
it go away.
The bulk of the patches are concerned with converting all drivers over to
the "dequeue the request first" mode of operation. Typically that's just a
matter of adding a blkdev_dequeue_request() call in the right
place. A few places (the IDE subsystem, for example) are a bit more
complex, but, for the most part, the changes are straightforward.
Once that has been done, the patch series culminates with a set of API
changes. There is no more elv_next_request(); instead, a driver
wanting to look at a request without dequeueing it will call:
struct request *blk_peek_request(struct request_queue *queue);
Following that, a request can be dequeued with a call to
blk_start_request(), which replaces
blkdev_dequeue_request():
void blk_start_request(struct request *req);
In addition to removing the request from the queue,
blk_start_request() will start a timer for the request, allowing
the block layer to eventually respond if completion is never signaled. Most of the
time, though, drivers will just call:
struct request *blk_fetch_request(struct request_queue *q);
which is a combination of blk_peek_request() and
blk_start_request().
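The relationship between the three calls can be modeled with a toy queue in user space. Nothing below is block-layer code: the toy_ structures and functions are invented for illustration, the "timer" is just a flag, and the model only allows the request at the head of the queue to be started.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_queue;

struct toy_request {
	struct toy_queue *q;		/* queue this request was added to */
	struct toy_request *next;
	int sector;
	bool on_queue;
	bool timer_running;		/* stands in for the timeout timer */
};

struct toy_queue {
	struct toy_request *head;
};

/* Append a request to the tail of the queue. */
static void toy_add_request(struct toy_queue *q, struct toy_request *req)
{
	struct toy_request **pp = &q->head;

	while (*pp)
		pp = &(*pp)->next;
	req->q = q;
	req->next = NULL;
	req->on_queue = true;
	req->timer_running = false;
	*pp = req;
}

/* Modeled on blk_peek_request(): look at the head, leave it queued. */
static struct toy_request *toy_peek_request(struct toy_queue *q)
{
	return q->head;
}

/* Modeled on blk_start_request(): dequeue and start the timer. */
static void toy_start_request(struct toy_request *req)
{
	assert(req->on_queue && req->q->head == req);
	req->q->head = req->next;
	req->next = NULL;
	req->on_queue = false;
	req->timer_running = true;
}

/* Modeled on blk_fetch_request(): peek, then start if there is a request. */
static struct toy_request *toy_fetch_request(struct toy_queue *q)
{
	struct toy_request *req = toy_peek_request(q);

	if (req)
		toy_start_request(req);
	return req;
}
```

Two peeks in a row return the same request, just as two calls to elv_next_request() did; a fetch returns it once and moves the queue along. Because dequeueing and starting the timer happen in one place, the moment processing begins is always known.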
There is one other, under-the-hood change which goes along with the above:
any attempt to complete a request which remains on the request queue will
oops the system. One can think of this as a very clear message that
on-queue processing is no longer considered to be the right thing to do in
the Linux kernel. That, in turn, is part of the motivation for the API
changes, which, for the most part, are just name changes: Tejun wants to be
sure that maintainers of out-of-tree block drivers will notice that
something has changed and respond accordingly.
These patches have been through a couple of rounds of review. Nothing is
ever certain, but it's entirely possible that this set of changes could go
in for the 2.6.31 kernel.
By Jonathan Corbet
May 19, 2009
In an ideal world, our computers would have enough memory to run all of the
applications we need. In the real world, our systems are loaded with
contemporary desktop environments, office suites, and more. So, even with
the large amounts of memory being shipped on modern systems, there still
never quite seems to be enough. Memory gets paged out to make room for new
demands, and performance suffers. Some help may be on the way in the form
of a new patch by Wu Fengguang which has the potential to make things
better, should it ever be merged.
The kernel maintains two least-recently-used (LRU) lists for pages owned by
user processes. One of these lists holds pages which are backed up by
files - they are the page cache; the other list holds anonymous pages which
are backed up by the swap device, assuming one exists. When the kernel
needs to free up memory, it will do its best to push out pages which are
backed up by files first. Those pages are much more likely to be
unmodified, and I/O to them tends to be faster. So, with luck, a system
which evicts file-backed pages first will perform better.
It may be possible to do things better, though. Certain kinds of
activities - copying a large file, for example - can quickly fill memory
with file-backed pages. As the kernel works to recover those pages, it
stands a good chance of pushing out other file-backed pages which are
likely to be more useful. In particular, pages containing executable code
are relatively likely to be wanted in the near future. If the kernel pages
out the C library, for example, chances are good that running processes
will cause it to be paged back in quickly. The loss of needed
executable pages is part of why operations involving large amounts of file
data can make the system seem sluggish for a while afterward.
Wu's patch tries to improve the situation through a fairly simple change:
when the page reclaim scanning code hits a file-backed, executable page which has the
"referenced" bit set, it simply clears the bit and moves on. So executable
pages get an extra trip through the LRU list; that will happen repeatedly
for as long as somebody is making use of the page. If all goes well, pages
running useful code will stay in RAM, while those holding less useful file
data will get pushed out first. It should lead to a more responsive
system.
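The heuristic is small enough to model in a few lines. The following is a deliberately simplified sketch and not Wu's patch: struct toy_page, keep_on_lru(), and scan_pages() are invented names, and the model ignores the active/inactive list split and everything else the real reclaim code takes into account.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_page {
	bool file_backed;
	bool executable;
	bool referenced;
};

/* Return true if the page should get another trip through the LRU list. */
static bool keep_on_lru(struct toy_page *page)
{
	assert(page != NULL);

	if (page->file_backed && page->executable && page->referenced) {
		/* It must be referenced again to survive the next scan. */
		page->referenced = false;
		return true;
	}
	return false;
}

/* Scan a set of pages; return how many are now candidates for eviction. */
static size_t scan_pages(struct toy_page *pages, size_t n)
{
	size_t candidates = 0;

	for (size_t i = 0; i < n; i++)
		if (!keep_on_lru(&pages[i]))
			candidates++;
	return candidates;
}
```

A referenced executable page survives one scan; if nothing touches it before the next scan, it becomes a candidate like any other page. Code that is in active use keeps getting its referenced bit set again, so it keeps earning extra trips through the list.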
The code seems to be in a relatively finished state at this point. So one
might well ask whether it will be merged in the near future. That is never
a straightforward question with memory management code, though. This patch
may well make it into the mainline, but it will have to get over some
hurdles in the process.
The first of those hurdles is a simple
question from Andrew Morton:
Now. How do we know that this patch improves Linux?
Claims like "it feels more responsive" are notoriously hard to quantify.
But, without some sort of reasonably objective way to see what benefit is
offered by this patch, the kernel developers are going to be reluctant to
make changes to low-level memory management heuristics. The fear of
regressions is always there as well; nobody wants to learn about some large
database workload which gets slower after a patch like this goes in. In
summary: knowing whether this kind of patch really makes the situation
better is not as easy as one might wish.
The second problem is that this change would make it possible for a sneaky
application to keep its data around by mapping its files with the
"executable" bit set. The answer to this objection is easier: an
application which seeks unfair advantage by playing games can already do
so. Since anonymous pages receive preferential treatment already, the sneaky
application could obtain a similar effect on current kernels by allocating
memory and reading in the full file contents. Sites which are truly
worried about this sort of abuse can (1) use the memory controller to
put a lid on memory use, and/or (2) use SELinux to prevent
applications from mapping file-backed pages with execute permission
enabled.
Finally, Alan Cox has wondered whether this
kind of heuristic-tweaking is the right approach in the first place:
I still think the focus is on the wrong thing. We shouldn't be
trying to micro-optimise page replacement guesswork - we should be
macro-optimising the resulting I/O performance. My disks each do
50MBytes/second and even with the Gnome developers finest creations
that ought to be enough if the rest of the system was working
properly.
Alan is referring to some apparent performance problems with the memory
management and block I/O subsystems which crept in a few years ago. Some
of these issues have been addressed for 2.6.30, but others remain
unidentified and unresolved so far.
Wu's patch will not change that, of course. But it may still make life a
little better for desktop Linux users. It is sufficiently simple and well
contained that, in the absence of clear performance regressions for other
workloads, it will probably find its way into the mainline sooner or later.
Page editor: Jonathan Corbet