Brief items
The 3.0 kernel is not yet released as of this writing. Subtle core
kernel bugs, first in the VFS (see below) and then in RCU, have delayed a
release which is otherwise ready; history suggests it will come out
immediately after the LWN Weekly Edition is published.
Stable updates: no stable updates have been released in the last
week, and none are in the review process as of this writing.
This patch set contains fixes for a trainwreck involving RCU, the
scheduler, and threaded interrupts. This trainwreck involved RCU
failing to properly protect one of its bit fields, use of RCU by
the scheduler from portions of irq_exit() where in_irq() returns
false, uses of the scheduler by RCU colliding with uses of RCU by
the scheduler, threaded interrupts exercising the problematic
portions of irq_exit() more heavily, and so on.
-- Paul McKenney on why we can't have nice 3.0 things (yet)
That said, have fun and make sure that you have the fire
extinguisher ready when you start using this!
-- Thomas Gleixner
Matthew Garrett investigates the subtleties of booting Linux with EFI. Once again, hardware vendors are myopically focusing on Windows.
"As we've seen many times in the past, the only thing many hardware vendors do is check that Windows boots correctly. Which means that it's utterly unsurprising to discover that there are some systems that appear to ignore EFI boot variables and just look for the fallback bootloader instead. The fallback bootloader that has no namespacing, guaranteeing collisions if multiple operating systems are installed on the same system.
[...]
It could be worse. If there's already a bootloader there, Windows won't overwrite it. So things are marginally better than in the MBR [Master Boot Record] days. But the Windows bootloader won't boot Linux, so if Windows gets there first we still have problems."
Those who follow the realtime preemption patch set know that it has been
stuck on 2.6.33 for some time. With the release of
a new patch based on 3.0-rc7, Thomas Gleixner
tells us why: the entire series has been reworked and cleaned up, a new
solution to the per-CPU variable problem has been implemented, and a nasty
bug held up what would otherwise have been a release based on 2.6.38.
"The beast insisted on destroying filesystems with reproduction times
measured in days and the total refusal to reveal at least a minimalistic
hint to debug the root cause. Staring into completely useless traces for
months is not a very pleasant pastime." The 3.0-rc7 version of the
patch, happily, shows no such behavior.
By Jonathan Corbet
July 20, 2011
One of the nice things that the IPv6 protocol was supposed to do for us was
to eliminate the need for network address translation (NAT). The address
space is large enough that many of the motivations for the use of NAT (lack
of addresses, having to renumber networks when changing providers) are no
longer present. NAT is often seen as a hack which breaks the architecture
of the Internet, so there has been no shortage of people who would be happy
to see it go; the IPv6 switch has often looked like the opportunity to make
it happen.
So it is not surprising that, when Terry Moës posted an IPv6 NAT implementation for Linux, the
first response was less than favorable. Anybody wanting to see the end of
NAT is unlikely to welcome an implementation which can only serve to
perpetuate its use after the IPv6 transition. The sad fact, though, is
that NAT appears to be here to stay. David Miller expressed it in a typically direct manner:
People want to hide the details of the topology of their internal
networks, therefore we will have NAT with ipv6 no matter what we
think or feel.
Everyone needs to stop being in denial, now.
Like it or not, we will be dealing with NAT indefinitely. For those who
are curious about how it might work in Linux, Terry's implementation can be
found on SourceForge along with a paper describing the design of the code.
Both stateless (RFC 6296) and stateful NAT are supported.
Kernel development news
By Jonathan Corbet
July 19, 2011
It's all Hugh's fault.
Linus was all set to release the final 3.0 kernel when Hugh Dickins showed
up on the list with a little problem:
occasionally a full copy of the kernel source tree fails because one of the
files found therein vanishes temporarily. What followed was a determined
bug-chasing exercise which demonstrates how subtle and tricky some of our
core code has become. The problem has been found and squashed, but there
may be more.
A bit of background might help in understanding what was happening here.
The 2.6.38 release included the dcache
scalability patches; this code uses a number of tricks to avoid taking
locks during the process of looking up file names. For the right kind of
workload, the "RCU walk" method yields impressive performance improvements.
But that only works if all of the relevant directory entry ("dentry")
structures are in the kernel's dentry cache and the lookup process does not
race with other CPUs which may be making changes on the same path.
Whenever such a situation is encountered, the lookup process will fall back
to the older, slower algorithm which requires locking each dentry.
The dentry cache (dcache) is a highly dynamic data structure, with dentries
coming and going at all times. So one CPU might be removing a dentry at
the same time that another is using it to look up a name. Chaos is avoided
through the use of read-copy-update (RCU) to manage the removal of dentries; a
dentry may be removed from the cache, but, if the thread using that dentry
for lookup got a reference to it before its removal, the structure itself will
continue to exist for as long as that thread needs it. The same should be
true of the inode structure associated with that dentry.
Hugh tracked the problem down to a bit of code in
walk_component():
    err = do_lookup(nd, name, path, &inode);
    /* ... */
    if (!inode) {
        path_to_nameidata(path, nd);
        terminate_walk(nd);
        return -ENOENT;
    }
If do_lookup() returns a null inode pointer,
walk_component() assumes that a "negative dentry" has been
encountered. Negative dentries are kept in the dentry cache to record the
fact that a specific name does
not exist; they are an important performance-enhancing feature in
the Linux virtual filesystem layer. To see an example, run any simple
program under strace and watch how many system calls return with
ENOENT; lookups on nonexistent files happen frequently. What Hugh
determined was that this inode pointer was coming back null even though the
file exists, leading the code to believe that a negative dentry had been
found and causing the "briefly vanishing file" problem.
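The ENOENT results mentioned above are easy to observe from user space. The following is a minimal sketch, not kernel code: the helper name lookup_errno() is made up for this illustration, and the dentry cache itself is not visible from here. It only shows the kind of failed lookup that a negative dentry allows the kernel to answer quickly when it is repeated.

```c
#include <errno.h>
#include <sys/stat.h>

/*
 * Hypothetical helper for illustration: returns 0 if the path can be
 * looked up, otherwise the errno value reported by stat().
 */
int lookup_errno(const char *path)
{
    struct stat st;

    if (stat(path, &st) == 0)
        return 0;
    return errno;
}
```

Calling lookup_errno() on a name that does not exist should yield ENOENT; the bug described here made that same answer appear, briefly, for files that did exist.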
Hugh must have looked at this code for some time before concluding that the
kernel must be removing the dentry from the dcache at just the wrong time
during the lookup process. As described above, the dentry itself continues
to exist after its removal from the cache, but that does not mean that it
is unchanged: the removal process sets its d_inode pointer to
NULL. (It's worth noting that this behavior goes against normal
RCU practice, which calls for the structure to be preserved unmodified
until the last reference is known to be gone.)
Hugh concluded that this null pointer was being picked up
later by the lookup process, causing walk_component()
to conclude that the file does not exist when all that had happened was the
removal of a dentry from the cache. His problem report included a patch
causing the lookup code to check much more carefully when the inode pointer
comes up null.
Linus acknowledged the problem but didn't
like the fix which, he thought, was too specific to one particular
situation. He proposed an alternative: just don't set d_inode to
NULL; that would keep the inode pointer from picking up that value
later. Al Viro posted an alternative
fix which changed dcache behavior in less subtle ways, and worried about the possibility of introducing
other weird bugs:
I'm not entirely convinced that it's a valid optimization in the
first place (probably is, but I'm seriously scared by the
complexity we already have there), and I'm really not fond of the
idea of dealing with whatever subtle crap we might discover with
Linus' patch. Again, dcache is not in a healthy shape right now;
at this point dumb and straightforward is, IMO, better than subtle
and risking to step on toes of very odd code out there...
Once we are done with code audit, sure, I'm fine with ->d_inode
being kept until dentry is actually freed. Any code that relies
on that thing being cleared is asking for trouble and should be
rewritten anyway. The only thing is, it needs to be found before
we rewrite it...
Linus didn't like Al's fix either; it threatened to force slow lookups when
negative dentries are involved.
The discussion of the patches went on at some length; in the process of
trying to find the safest way to fix this subtle bug the participants slowly
came to the realization that they did not actually know what was
happening. After looking at things closely, Linus threw up his hands and admitted he didn't
understand it:
So how could Hugh's NULL inode ever happen in the first place? Even
with the current sources? It all looks solid to me now that I look
at all the details.
As it happens, Linus's exposition was enough to point Hugh at the real
problem. Just as the process of transiting through a specific dentry is
almost complete, do_lookup() makes a call to
__follow_mount_rcu(), whose job is to redirect the lookup process
if it is passing through a mount point. The inode pointer is passed to
__follow_mount_rcu() separately; Hugh noticed that this function
was doing the following:
    *inode = path->dentry->d_inode;
In other words, the inode pointer is being re-fetched from the dentry
structure; this assignment happens regardless of whether the dentry
represents a mount point.
That is the true source of the
problem: if the dentry has been removed from the dcache after the lookup
process gained a reference, d_inode will be NULL. So
__follow_mount_rcu() will zero a pointer which had pointed to a
valid inode, causing later code to think that the file does not exist at
all.
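The sequence of events can be modeled in a few lines of user-space C. This is a deliberately simplified sketch: the structures, the "mounted" field, and all function names below are stand-ins invented for the illustration, there is no real concurrency, and the "guarded" variant shows only one plausible way to avoid the problem, not necessarily the patch that was merged.

```c
#include <stddef.h>

/* Toy stand-ins for the kernel structures; only the fields needed here. */
struct inode { int ino; };
struct dentry {
    struct inode *d_inode;   /* set to NULL when the dentry leaves the cache */
    struct dentry *mounted;  /* root of a filesystem mounted here, or NULL */
};

/* Re-fetches the inode unconditionally, as in the assignment shown above. */
void follow_mount_refetch(struct dentry **dp, struct inode **inode)
{
    while ((*dp)->mounted)
        *dp = (*dp)->mounted;
    *inode = (*dp)->d_inode;      /* runs even if no mount was crossed */
}

/* Assumed alternative: only update the inode after crossing a mount point. */
void follow_mount_guarded(struct dentry **dp, struct inode **inode)
{
    while ((*dp)->mounted) {
        *dp = (*dp)->mounted;
        *inode = (*dp)->d_inode;
    }
}

/*
 * Model a lookup that samples the inode, then has the dentry removed
 * "by another CPU" before the mount check runs. Returns 1 if the walk
 * still holds a valid inode afterward, 0 if it now sees NULL.
 */
int lookup_survives_removal(void (*follow)(struct dentry **, struct inode **))
{
    struct inode file = { 42 };
    struct dentry d = { &file, NULL };
    struct dentry *cur = &d;
    struct inode *inode = cur->d_inode;  /* lookup got a valid inode */

    d.d_inode = NULL;                    /* removal clears d_inode */
    follow(&cur, &inode);                /* mount check comes last */

    return inode != NULL;
}
```

With follow_mount_refetch() the walk loses its inode even though the file exists; with follow_mount_guarded() the pointer obtained earlier is left alone because no mount point was crossed.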
Linus posted a fix for the real problem
along with his now-famous
Google+ posting saying that he was delaying the 3.0 release for a day
just in case:
We have a patch, we understand the problem, and it looks
ObviouslyCorrect(tm), but I don't think I want to release 3.0 just
a couple of hours after applying it.
Linus delayed the release despite the
inconvenient fact that it will push the 3.1 merge window into his
planned vacation. That was a well-placed bit of caution on his part: the
ObviouslyCorrect(tm) patch had YetAnotherSubtleBug(tm) in it. A fixed
version of the patch exists, and this particular bug should, at this point,
be history.
There is a sobering conclusion to be drawn from this episode, though. The
behavior of the dentry cache is, at this point, so subtle that even the
combined brainpower of developers like Linus, Al, and Hugh has a hard time
figuring out what is going on. These same developers are visibly nervous
about making changes in that part of the kernel. Our once approachable and
hackable kernel has, over time, become more complex and difficult to
understand. Much of that is unavoidable; the environment the kernel runs
in has, itself, become much more complex over the last 20 years. But if we
reach a point where almost nobody can understand, review, or fix some of
our core code, we may be headed for long-term trouble.
Meanwhile, we should be able to enjoy a 3.0 release (and a 2.6.39 update)
without mysteriously vanishing files. One potential short-term problem
remains, though: given that the next merge window will push into Linus's
vacation, there is a distinct chance that he might be more than usually
grumpy with maintainers who get their pull requests in late. Wise
subsystem maintainers may want to be ready to go when the merge window
opens.
By Jake Edge
July 20, 2011
The setuid() system call has always been something of a security
problem for Linux (and other Unix systems). It interacts oddly with
security and other kernel features (e.g. the unfortunately named "sendmail-capabilities
bug") and is often used incorrectly in programs. But it is part of the
Unix legacy, and one that will be with us at least until the 2038 bug puts
Unix systems out of their misery. A recent patch from Vasiliy Kulikov arguably shows
these kinds of problems in action: weird interactions with resource limits
coupled with misuse of the setuid() call.
There is a fair amount of history behind the problem that Kulikov is trying
to solve. Back in 2003, programs that used setuid() to switch to
a non-root user could be used to evade the limit on the number of processes
that an administrator had established for that user
(i.e. RLIMIT_NPROC). But that was fixed with a patch from Neil Brown that
would cause the setuid() call to fail if the new user was at or above
their process limit.
Unfortunately, many programs do not check the return value from calls to
setuid() that are meant to reduce their privileges. That, in
fact, was exactly the hole that sendmail fell into when Linux capabilities
were introduced, as it did not check to see that the change to a new UID
actually succeeded. Buggy programs that don't check that return
value can cause fairly serious security problems because they assume their
actions are limited by the reduced privileges of the switched-to user, but
are actually still operating with the increased privileges (often root)
that they started with. In effect, the 2003 change made it easier for
attackers to cause setuid() to fail when RLIMIT_NPROC was being used.
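For reference, a privilege drop that does check its return values might look like the sketch below. The function name drop_privileges() is hypothetical, and the example is intentionally minimal: a real daemon would normally also clear its supplementary groups (with setgroups()) before changing IDs, which is omitted here because it requires privilege to run.

```c
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: switch to the given group and user.
 * Returns 0 on success and -1 if any step fails; the caller is
 * expected to abort rather than continue with its old privileges.
 */
int drop_privileges(uid_t uid, gid_t gid)
{
    /* Change the group first; after setuid() we may no longer be allowed to. */
    if (setgid(gid) != 0)
        return -1;

    /* The call whose return value is so often ignored. */
    if (setuid(uid) != 0)
        return -1;

    /* Double-check that the change actually took effect. */
    if (getuid() != uid || geteuid() != uid)
        return -1;

    return 0;
}
```

The important part is what the caller does with a -1 result: exiting immediately is almost always safer than carrying on as root.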
Kulikov described the problem back in June,
noting that it was not a bug in Linux, but allowed buggy privileged
programs to wreak havoc:
I don't consider checking RLIMIT_NPROC in
setuid() as a bug (a lack of syscalls return code checking is a real
bug), but as a pouring oil on the flames of programs doing poorly
written privilege dropping. I believe the situation may be improved by
relatively small ABI changes that shouldn't be visible to normal
programs.
In the posting, he suggested two possible solutions to the problem. The
first is to
move the check against RLIMIT_NPROC from set_user()
(a setuid() helper function) to execve() as most programs
will check the status of that call (and can't really cause
any harm if they don't). The other suggestion is one that was proposed by Alexander
Peslyak (aka Solar Designer) in 2006 to cause a failed setuid()
call to send a SIGSEGV to the process,
which would presumably terminate those misbehaving programs.
The first solution is not complete because it would still allow users
to violate their process limit by using programs that do a
setuid() that is not followed by an execve(), but that is a
sufficiently rare case that it isn't considered to be a serious problem.
Peslyak's solution was seen as too big of a hammer when it was proposed,
especially for programs that do check the status of
setuid(), and might have proper error handling for that case.
There were no responses to Kulikov's initial posting, but when he brought it back
up on July 6, he was pleasantly surprised
to get a positive response from Linus Torvalds:
My reaction is: "let's just remote the crazy check from set_user()
entirely". If somebody has credentials to change users, they damn well
have credentials to override the RLIMIT_NPROC too, and as you say,
failure is likely a bigger security threat than success.
The whole point of RLIMIT_NPROC is to avoid fork-bombs. If we go over
the limit for some other reason that is controlled by the super-user,
who cares?
That led to the patch, which changed do_execve_common() to return
an error (EAGAIN) if the user was over their process limit and
removed the check from set_user(). The patch was generally
well-received,
though several commenters were not convinced that it should go into the -rc
for 3.0 as Torvalds had suggested. In fact, as Brown dug into the patch, he
saw a problem that might need addressing:
Note that there is room for a race that could have unintended consequences.
Between the 'setuid(ordinary-user)' and a subsequent 'exit()' after execve()
has failed, any other process owned by the same user (and we know where are
quite a few) would fail an execve() where it really should not.
Basically, the problem is that switching the process to a new user could
now exceed the process limit, but that limit wouldn't actually be enforced
until an execve() was done (the failure of which would presumably
cause the process to exit). In the interim, any execve() from
another of the user's processes would fail. It's not clear how big of a
problem that is,
though it could certainly lead to unexpected behavior. Brown offered up
a patch that would address the problem by
adding a process flag (PF_NPROC_EXCEEDED) that would be set
if a setuid() caused the process to exceed RLIMIT_NPROC
and would then be checked in do_execve_common(). Thus, only the
execve() in the offending process would fail.
Kulikov and Peslyak liked the approach, though Peslyak was not convinced it
added any real advantages over Kulikov's original patch. He also pointed out that there could be an
indeterminate amount of time between the setuid() and
execve(), so the RLIMIT_NPROC test should be repeated when
execve() is called: "It would be surprising to see a process
fail on execve() because of RLIMIT_NPROC when that limit had been
reached, say, days ago and is no longer reached at the time of
execve()."
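The combination of Brown's flag and Peslyak's suggested re-check can be modeled in user space as follows. This is a sketch of the logic only, under stated assumptions: the structures and the model_setuid()/model_execve() functions are invented for the illustration and do not correspond to actual kernel code or to either posted patch.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy per-user accounting: current process count and RLIMIT_NPROC value. */
struct user_model {
    int processes;
    int nproc_limit;
};

/* Toy task: its current user plus a stand-in for PF_NPROC_EXCEEDED. */
struct task_model {
    struct user_model *user;
    bool nproc_exceeded;
};

/* setuid() side: never fails because of the limit, only records the fact. */
int model_setuid(struct task_model *t, struct user_model *new_user)
{
    t->user->processes--;
    new_user->processes++;
    t->user = new_user;
    t->nproc_exceeded = new_user->processes > new_user->nproc_limit;
    return 0;
}

/*
 * execve() side: fail only in the task that exceeded the limit, and only
 * if the limit is still exceeded now (the re-check Peslyak asked for).
 */
int model_execve(struct task_model *t)
{
    if (t->nproc_exceeded && t->user->processes > t->user->nproc_limit)
        return -EAGAIN;
    t->nproc_exceeded = false;
    return 0;
}
```

In this model, other tasks belonging to the same user are unaffected because their flag was never set, and the offending task succeeds once the user's process count has dropped back under the limit.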
So far, Brown has not respun the patch to add that test. There is also the
question of whether the problem that Brown is concerned about needs to be
addressed at all, and whether it is worth using up another process flag
bit (there are currently only three left) to do so. In the end, some kind
of fix is likely to go in for 3.1 given Torvalds's interest in seeing this
problem with buggy programs disarmed. It's unclear which approach will win
out, but either way, setuid() will not fail due to exceeding the
allowable number of processes.
As Kulikov and others noted, it is definitely not a bug in the
kernel that is being fixed here. But it is a common enough error in
user-space programs, often with dire consequences, that it is
worthwhile to fix as a proactive security
measure. Peslyak listed several recent
security problems that arose from programs that do not check the return
value from setuid(). He also noted that the problem is not
limited to setuid-root programs, as other programs that try to switch to a
lesser (or merely differently) privileged user can also cause
problems when using setuid() incorrectly.
The impact of this fix is quite small, and badly written user-space
programs—even those meant to run with privileges—abound, which
makes this change more palatable than some other pro-active fixes. As we
have seen before, setuid() is subtle and quick to anger; it can
have surprising interactions with other
seemingly straightforward security measures. Closing a hole with
setuid(), even if the problem lives in user space, will definitely
improve overall Linux security.
By Jonathan Corbet
July 19, 2011
There are numerous use cases for a checkpoint/restart capability in the
kernel, but the highest level of interest continues to come from the
containers area. There is clear value in being able to save the complete
state of a container to a disk file and to restart that container's
execution at some future time, possibly on a different machine. The
kernel-based checkpoint/restart patch has been discussed here a number of
times, including a report from last year's Kernel Summit and a followup
published shortly thereafter. In the end, the developers of this patch do not seem
to have been able to convince the kernel community that the complexity of
the patch is manageable and that the feature is worth merging.
As a result, there has been relatively little news from the
checkpoint/restart community in recent months. That has changed, though,
with the posting of a new patch by Pavel
Emelyanov. Previous patches have implemented the entire checkpoint/restart
process in the kernel, with the result that the patches added a lot of
seemingly fragile (though the developers dispute that assessment) code into
the kernel. Pavel's approach, instead, is focused on simplicity and doing
as much as possible in user space.
Pavel notes in the patch introduction that almost all of the information
needed to checkpoint a simple process tree can already be found in
/proc; he just needs to augment that information a bit. So his
patch set adds some relevant information there:
- There is a new /proc/pid/mfd directory containing
information about files mapped into the process's address space. Each
virtual memory area is represented by a symbolic link whose name is
the area's starting virtual
address and whose target is the mapped file. The bulk of this
information already exists in /proc/pid/maps, but the
mfd directory collects it in a useful format and makes it
possible for a checkpoint program to be sure it can open the exact
same file that the process has mapped.
- /proc/pid/status is enhanced with a line listing all of the
process's children. Again, that is information which could be
obtained in other ways, but having it in one spot makes life easier.
- The big change is the addition of a /proc/pid/dump
file. A process reading this file will obtain the information about
the process which is not otherwise available: primarily the contents
of the CPU registers and its anonymous memory.
The dump file has an interesting format: it looks like a new
binary executable format to the kernel. Another patch in Pavel's series
implements the necessary logic to execute a "program" represented in that
format; it restores the register and memory contents, then resumes
executing where the process was before it was checkpointed. This approach
eliminates the need to add any sort of special system call to restart a
process.
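As noted in the list above, much of the mapping information already exists in /proc/pid/maps. The sketch below shows how a user-space tool might collect the file-backed mappings of the calling process from that existing file; it does not use the proposed mfd directory or dump file, and the function name list_file_mappings() is hypothetical.

```c
#include <stdio.h>

/*
 * Hypothetical helper: walk /proc/self/maps and count the mappings that
 * are backed by a file. If out is non-NULL, also print one line per
 * mapping in the form "start-address -> path". Returns the count, or -1
 * if the maps file cannot be opened.
 */
int list_file_mappings(FILE *out)
{
    FILE *maps = fopen("/proc/self/maps", "r");
    char line[4352];
    int count = 0;

    if (!maps)
        return -1;

    while (fgets(line, sizeof(line), maps)) {
        unsigned long start;
        char path[4096];

        /* Fields: start-end perms offset dev inode [pathname] */
        if (sscanf(line, "%lx-%*x %*s %*s %*s %*s %4095[^\n]",
                   &start, path) != 2)
            continue;           /* anonymous mapping: no pathname */
        if (path[0] != '/')
            continue;           /* pseudo-entries such as [heap] or [stack] */

        if (out)
            fprintf(out, "%lx -> %s\n", start, path);
        count++;
    }

    fclose(maps);
    return count;
}
```

Parsing path names this way cannot guarantee that the file opened later is the one that was mapped, which is the gap the symbolic links in the proposed mfd directory are meant to close.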
There is a need for one other bit of support, though: checkpointed processes
may become very confused if they are restarted with a different process ID
than they had before. Various enhancements to (or replacements for) the
clone() system call have been proposed to deal with this problem
in the past. Pavel's answer is a new flag to clone(), called
CLONE_CHILD_USEPID, which allows the parent process to request
that a specific PID be used.
With this much support, Pavel is able to create a set of tools which can
checkpoint and restart simple trees of processes. There are numerous
things which are not handled; the list would include network connections,
SYSV IPC, security contexts, and more. Presumably, if this patch set looks
like it can be merged into the mainline, support for other types of objects
can be added. Whether adding that support would cause the size and
complexity of the patch to grow to the point where it rivals its
predecessors remains to be seen.
Thus far, there has been little discussion of this patch set. The fact
that it was posted to the containers list - not the largest or most active
list in our community - will have something to do with that. The few
comments which have been posted have been positive, though. If this patch
is to go forward, it will need to be sent to a larger list where a wider
group of developers will have the opportunity to review it. Then we'll be
able to restart the whole discussion for real - and maybe actually get a
solution into the kernel this time.
Page editor: Jonathan Corbet