Brief items
The current development kernel is 2.6.39-rc6,
released on May 3. Linus said:
We're still chasing some stuff down, but I think we're ok for
-rc6. This isn't going to be the final -rc, but it doesn't seem to
be in bad shape, and people who have posted regression reports
please check them again, and people who haven't, please do give it
a test.
See the
full changelog for all the details.
Stable updates: 2.6.38.5
(May 2), 2.6.35.13 (April 28),
and 2.6.27.59 (April 30). Each
contains a long list of important changes.
Let ARM rot in mainline. I really don't care anymore.
--
Russell King
Nah, I'll leave it in so that they can have a blast from the past
and can boast on the slashdot of the future or on ((LWN++)..)++
that they've fixed the oldest bug in Linux when booting it on
quantum x86 computers and tried running their RAS daemon.
--
Borislav Petkov
Yes, open source programming is a team sport, but finding the right
people is really the killer feature (the same is obviously true in
the kernel too - I really do think we have a great set of
maintainers. I may complain about them and I'm somewhat infamous
for my flames when things don't work, but at the same time I'm
convinced there's some of the best people out there working on
maintaining the kernel).
--
Linus Torvalds
In response to the EFF's
call
for owners of wireless networks to allow open access, Luis Rodriguez has
put up
a wiki
page to gather information on how providing open access can be made
safer. Contributions of ideas and comments are welcome.
LinuxFR has an
interview with
Linus Torvalds on a wide range of subjects. "
LinuxFR : What is your opinion about Android ? Are you mostly happy they made cellphones very usable or sad because it's really a kernel fork ?
Linus Torvalds : I think forks are good things, they don't make me sad. A lot of Linux development has come out of forks, and it's the only way to keep developers honest - the threat that somebody else can do a better job and satisfy the market better by being different. The whole point of open source to me is really the very real ability to fork (but also the ability for all sides to then merge the forked content back, if it turns out that the fork was doing the right things!)"
By Jonathan Corbet
May 4, 2011
As was discussed at the
2011 filesystem,
storage, and memory management summit, there is an increasing level of
interest in restricting the amount of kernel memory which can be used by
groups of processes. One area of special interest is the directory entry
(dentry) cache; a
malicious program can, by creating a deep enough directory hierarchy, run
the kernel out of memory with an explosion of the size of the dentry
cache. So limiting dentry use has some real appeal, especially for those
working to ensure that containers running on a Linux system cannot
interfere with each other.
Pavel Emelyanov's per-container dcache
management patches are a first attempt at limiting dentry use. The
patch set works by organizing dentries into "mobs": groups of dentries,
all of which represent names in a specific subtree of the filesystem. If
the root of a mob were the root of a container's filesystem namespace, all
dentries created by that container would be contained within that mob. At
that point, a simple sort of resource control can be applied: adding a
dentry to a mob which has hit its maximum size would require the removal of
another dentry to compensate. If no dentries can be removed, attempts to
add others will fail.
The patch set adds three new ioctl() calls: FIMOBROOT to
create a new mob at a given point in the filesystem, FIMOBSIZE to
set the maximum size of a mob, and FIMOBSTAT to query the current
usage of a mob. Pavel is somewhat apologetic about this interface; he
seems to think it will have to change before the work could be considered
for inclusion upstream. But the first step is to get some discussion of the concept; so
far, there have been no responses to Pavel's patches.
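The accounting rule described above can be modeled in a few lines of user-space C. This is only a sketch of the concept, not code from Pavel's patches: the struct mob layout, the nr_pinned counter (standing in for dentries that are in use and cannot be removed), and the function names are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model of a "mob"; not taken from the patch set. */
struct mob {
    size_t max_size;     /* limit, as FIMOBSIZE would set it */
    size_t nr_dentries;  /* dentries currently charged to the mob */
    size_t nr_pinned;    /* charged dentries that cannot be evicted */
};

/* Try to evict one unused dentry; returns true if one was removed. */
static bool mob_shrink_one(struct mob *m)
{
    if (m->nr_dentries == m->nr_pinned)
        return false;    /* everything that remains is in use */
    m->nr_dentries--;
    return true;
}

/*
 * Charge one new dentry to the mob. If the mob is already at its limit,
 * another dentry must be removed first; if none can be, the charge fails.
 */
static bool mob_charge(struct mob *m, bool pinned)
{
    assert(m->nr_pinned <= m->nr_dentries);

    if (m->nr_dentries >= m->max_size && !mob_shrink_one(m))
        return false;

    m->nr_dentries++;
    if (pinned)
        m->nr_pinned++;
    return true;
}
```

In this model a mob limited to two dentries accepts a third only by evicting an unpinned one; once both slots are pinned, further calls to mob_charge() fail.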
Kernel development news
By Jonathan Corbet
May 3, 2011
The perf events subsystem often looks like it's on the path to take over
the kernel; there is a great deal of development activity there, and it has
become a sort of generalized event reporting mechanism. But the original
purpose of perf events was to provide access to the performance monitoring
counters made available by the hardware, and it is still used to that end.
The merging of perf was a bit of a hard pill for users of alternative
performance monitoring tools to swallow, but they have mostly done so. The
recent discussion on "offcore" events shows that there are still some
things to argue about in this area, even if everybody seems likely to get
what they want in the end.
The performance monitoring unit (PMU) is normally associated with the CPU;
each processor has its own PMU for monitoring its own specific events.
Some newer processors (such as Intel's Nehalem series) also provide a PMU
which is not tied to any CPU; in the Nehalem case it's part of the "uncore"
which handles memory access at the package level. The off-core PMU has a
viewpoint which allows it to provide a better picture of the overall memory
behavior of the system, so there is interest in gaining access to events
from that PMU. Current kernels, though, do not provide access to these
offcore events.
For a while, the 2.6.39-rc kernel did provide access to these
events, following the merging of a
patch by Andi Kleen in March. One piece that was missing, though, was
a patch to the user-space perf tool to provide access to this
functionality. There was an attempt to merge that piece toward the end of
April, but it did
not yield the desired results; rather than merge the additional
change, perf maintainer Ingo Molnar removed
the ability to access offcore events entirely.
Needless to say, that action has led to some unhappiness in the perf user
community; there are developers who had already been making use of those
events. Normally, breaking things in this way would be considered a
regression, and the patch would be backed out again. But, since this
functionality never appeared in a released kernel, it cannot really be
called a regression. That, of course, is part of the point of removing the
feature now.
Ingo's complaint is straightforward: the interface to these events was too
low-level and too difficult to use. The rejected perf patch had an example
command which looked like:
    perf stat -e r1b7:20ff -a sleep 1
Non-expert readers may, indeed, be forgiven for not immediately
understanding that this command would monitor access to remote DRAM -
memory which is hosted on a different socket. Ingo asserted that the
feature should be more easily used, perhaps with a command like (from the
patch removing the feature):
    perf record -e dram-remote ./myapp
He also said:
But this kind of usability is absolutely unacceptable - users
should not be expected to type in magic, CPU and model specific
incantations to get access to useful hardware functionality.
The proper solution is to expose useful offcore functionality via
generalized events - that way users do not have to care which
specific CPU model they are using, they can use the conceptual
event and not some model specific quirky hexa number.
The key is the call for "generalized events" which are mapped, within the
kernel, onto whatever counters the currently-running hardware uses to
obtain that information. Users need not worry about the exact type of
processor they are running on, and they need not dig out the data sheet to
figure out what numbers will yield the results they want.
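The idea of a generalized event can be sketched as a lookup from a conceptual name to a model-specific raw encoding. The table below is a user-space illustration only; it is not how the perf subsystem is implemented. The structure names, the "nehalem" model string, and the split of the r1b7:20ff string from the rejected patch into two hexadecimal fields are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical raw encoding: the two hex fields from "r1b7:20ff". */
struct raw_event {
    unsigned int config;   /* first field, 0x1b7 in the example command */
    unsigned int offcore;  /* second field, 0x20ff in the example command */
};

/* One row per (CPU model, generalized event name) pair. */
struct event_map {
    const char *model;
    const char *name;
    struct raw_event raw;
};

/* Only the encoding quoted in the article is listed; others are unknown here. */
static const struct event_map event_table[] = {
    { "nehalem", "dram-remote", { 0x1b7, 0x20ff } },
};

/* Map a generalized event onto the raw encoding for a model, or NULL. */
static const struct raw_event *resolve_event(const char *model,
                                             const char *name)
{
    assert(model && name);

    for (size_t i = 0; i < sizeof(event_table) / sizeof(event_table[0]); i++) {
        const struct event_map *e = &event_table[i];

        if (!strcmp(e->model, model) && !strcmp(e->name, name))
            return &e->raw;
    }
    return NULL;  /* no mapping known for this model */
}
```

A caller asks for "dram-remote" and never sees the hexadecimal values; supporting another processor means adding a row to the table rather than changing every command line.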
Criticism of this move has taken a few forms. Generalized events, it is
said, are a fine thing to have, but they can never reflect all of the
weird, hardware-specific counters that each processor may provide. These
events should also be managed in user space where there is more flexibility
and no need to bloat the kernel. There were some complaints about how some
of the existing generalized events have not always been implemented
correctly on all architectures. And, they say, there will always be people
who want to know what's in a specific hardware counter without having the
kernel trying to generalize it away from them. As Vince Weaver put it:
Blocking access to raw events is the wrong idea. If anything, the
whole "generic events" thing in the kernel should be ditched.
Wrong events are used at times (see AMD branch events a few
releases back, now Nehalem cache events). This all belongs in
userspace, as was pointed out at the start. The kernel has no
business telling users which perf events are interesting, or
limiting them!
Ingo's response is that the knowledge and
techniques around performance monitoring should be concentrated in one
place:
Well, the thing is, i think users are helped most if we add useful,
highlevel PMU features added and not just an opaque raw event
pass-through engine. The problem with lowlevel raw ABIs is that the
tool space fragments into a zillion small hacks and there's no good
concentration of know-how. I'd like the art of performance
measurement to be generalized out, as well as it can be.
Vince, meanwhile, went on to claim that perf was a
reinvention of the wheel which has ignored a lot of the experience built
into its predecessors. There are, it seems, still some scars from that
series of events. Thomas Gleixner disagreed with
the claim that perf is an exercise in wheel reinvention, but he did say
that he thought the raw events should be made available:
The problem at hand which ignited this flame war is definitely
borderline and I don't agree with Ingo that it should not made be
available right now in the raw form. That's an hardware enablement
feature which can be useful even if tools/perf has not support for
it and we have no generalized event for it. That's two different
stories. perf has always allowed to use raw events and I don't see
a reason why we should not do that in this case if it enables a
subset of the perf userbase to make use of it.
It turns out that Ingo is fine with raw events
too. His stated concern is that access to raw events should not be the
primary means by which most users gain access to those performance
counters. So he is blocking the availability of those events for now for
two reasons. One of those is that he wants the generalized mode of access
to be available first so that users will see it as the normal way to access
offcore events. If there is never any need to figure out hexadecimal
incantations, many user-space developers will never bother; as a result,
their commands and code should eventually work on other processors as well.
The other reason for blocking raw events now is that, as the interface to
these events is thought through, the ABI by which they are exposed to user
space may need to change. Releasing the initial ABI in a stable kernel
seems almost certain to cement it in place, given that people were already
using it. By deferring these events for one cycle (somebody will certainly
come up with a way to export them in 2.6.40), he hopes to avoid being stuck
with a second-rate interface which has to be supported forever.
There can be no doubt that managing this feature in this way makes life
harder for some developers. The kernel process can be obnoxious to deal
with at times. But the hope is that doing things this way will lead to a
kernel that everybody is happier with five years from now. If things work
out that way, most of us can deal with a bit of frustration and a one-cycle
delay now.
By Jake Edge
May 4, 2011
Sandboxing processes such that they cannot make "dangerous" system calls is
an attractive feature that has already been implemented in a limited way
for Linux with seccomp. Two years ago, we looked at a proposal to expand seccomp to
allow more fine-grained control over which system calls would be allowed.
That proposal has been mostly dormant since then, but was recently
resurrected after incorporating some of the suggestions made at that time.
The reaction to the current proposal so far seems positive, and it might
just be gaining some traction that the previous patchset lacked.
Seccomp (from "secure computing") is enabled via a prctl() call
and, once enabled, restricts the process from making any further system calls beyond
read(), write(), exit(), and
sigreturn()—any other system call will abort the
process. That creates a pretty secure sandbox, but it is also
extremely limited as there are other things that developers might want to
do from within such a sandbox. In fact, the Chromium browser has gone to great lengths to implement its own
sandbox that uses seccomp, but expands the range of legal system calls
through some contortions.
That led Adam Langley of the Chromium team to propose adding a bitmask of allowable system
calls for a new seccomp mode. That would have allowed processes to make a
binary choice (allowed or disallowed) for each system call. At the time,
Ingo Molnar suggested using the Ftrace filter
code to make the interface even more flexible by allowing filters to be
applied to the system call arguments. Essentially, that would make for
three choices for each system call: enabled, disabled, or filtered.
Fast-forward to today, and that is what a patchset from Will Drewry implements. It
should come as no surprise that Molnar was pleased to see his idea result in working
code: "Ok, color me thoroughly
impressed - AFAICS you implemented my suggestions [...] and you made it
work in practice!". Eric Paris was likewise impressed, noting that an expanded seccomp
could be used for QEMU. Molnar and Paris did not agree about replacing the
LSM approach using filters, but that was something of an aside. Serge
E. Hallyn also pointed out that the new feature
would be useful for containers: "to try and provide some bit of
mitigation for the fact that they are sharing a kernel with the
host".
The proposed interface, which is likely to change based on comments in the
thread, looks like:
    /* One string holding all of the rules, one rule per line. */
    const char filters[] =
        "sys_read: (fd == 1) || (fd == 2)\n"
        "sys_write: (fd == 0)\n"
        "sys_exit: 1\n"
        "sys_exit_group: 1\n"
        "on_next_syscall: 1";
    prctl(PR_SET_SECCOMP, 2, filters);
That example is taken from Drewry's
documentation file that accompanies the patches.
It would allow reading from two file descriptors (1 and 2) and writing to
one (0), while
allowing any calls to the two other system calls listed. The
on_next_syscall means that the rules would not be enforced until
after one more system call is made. That would allow a parent to
fork(), set up the seccomp sandbox in the child process, then exec
a new
program which would be governed by the new rules.
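The three-way choice for each system call (enabled, disabled, or filtered) can be modeled in user space as a small rule table. This is a sketch of the decision logic only, not of Drewry's kernel implementation; the struct syscall_rule type, the predicate functions, and syscall_allowed() are hypothetical names, and only the first argument is examined. It uses system call numbers from the C library headers rather than kernel symbol names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/syscall.h>

/* Predicate over the first system call argument; NULL means "always allow". */
typedef bool (*arg_filter)(long arg0);

struct syscall_rule {
    long nr;            /* system call number */
    arg_filter filter;  /* optional argument check */
};

/* The same conditions as the example filter string. */
static bool read_fd_ok(long fd)  { return fd == 1 || fd == 2; }
static bool write_fd_ok(long fd) { return fd == 0; }

static const struct syscall_rule rules[] = {
    { SYS_read,       read_fd_ok  },  /* filtered */
    { SYS_write,      write_fd_ok },  /* filtered */
    { SYS_exit,       NULL },         /* enabled unconditionally */
    { SYS_exit_group, NULL },         /* enabled unconditionally */
};

/* Decide whether a call would be permitted under the rule table. */
static bool syscall_allowed(long nr, long arg0)
{
    assert(nr >= 0);

    for (size_t i = 0; i < sizeof(rules) / sizeof(rules[0]); i++) {
        if (rules[i].nr != nr)
            continue;
        return rules[i].filter ? rules[i].filter(arg0) : true;
    }
    return false;  /* no rule at all: the call is disabled */
}
```

With this table, syscall_allowed(SYS_read, 1) is true, syscall_allowed(SYS_read, 0) is false, and any system call without a rule, such as SYS_getpid, is refused.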
That on_next_syscall piece drew a few comments. As it turns out,
there are really only two cases that need to be handled, either the rules
should go into effect immediately (for a process that wants to restrict
itself before handling untrusted input for example), or they should go into
effect after an exec (for a parent that is spawning an untrusted child).
Making the "after exec" case the default, while still allowing a
process
to request immediate application, seems to be the way things are headed.
There were also questions about using kernel-internal symbol names like
sys_read. Exporting those as a kernel ABI is not likely to pass
muster, as it might restrict the option of changing those function names
down the road—or require a messy compatibility layer if they did
change. Drewry wanted
to avoid using the system call numbers as Langley's original patch did, but
as Frederic Weisbecker pointed out, those
numbers are already part of the kernel ABI. Drewry is planning to make
that switch and users of the interface will need to use the
unistd.h header file or a library to map system call names to
numbers.
The patches also modify the /proc/PID/status file to output any
existing filters that are applied to the process. Given that most applications
that read that file don't need the extra information, though, Motohiro
Kosaki suggested that seccomp get its own
file. Drewry's plan is to provide that information in the
/proc/PID/seccomp_filter file instead, and remove it from
status.
Since it uses the Ftrace infrastructure and hooks, the new seccomp mode
only works for those system calls that have Ftrace events associated with
them. Using one of those non-instrumented system calls in the filters will
result in an
EINVAL from the prctl() call.
Enabling CONFIG_SECCOMP_FILTER (which depends on
CONFIG_FTRACE_SYSCALLS) will allow the use of the new mode.
Overall, Drewry has been very receptive to suggestions for changes, and
the feedback to the concept has been pretty uniformly positive. Molnar
suggested breaking
out the Ftrace filter engine further—beyond the minimal changes that
Drewry's patches make—so that it would be available for more
widespread use in the kernel. Molnar does wonder whether Linus Torvalds or
Andrew Morton might object to more use of the filter mechanism, however: "are you guys opposed to such flexible, dynamic
filters conceptually? I think we should really think hard about the actual ABI
as this could easily spread to more applications than
Chrome/Chromium." So far, neither has spoken up one way or the other.
Currently it would seem that Drewry is off working on the next revision of
the patchset, and it certainly doesn't seem like anything that would be
merged in the upcoming 2.6.40 cycle. As Molnar notes, the ABI needs to be carefully
thought-out, there are still some RCU issues that are being discussed, and
it probably needs some soaking time in the -next tree, but barring some major
complaint
cropping up, it's a feature that will likely make its way into the
mainline relatively soon. While that won't allow Chromium to immediately ditch
its complicated sandboxing arrangement, it may well be able to
do so a few years down the road. Other applications will benefit
from an expanded seccomp as
well.
By Jake Edge
May 4, 2011
Back in October, Bryce Lelbach announced that he (and others) had built
a working Linux kernel using (mostly) Clang, the LLVM-based C compiler. At
the Linux Foundation Collaboration Summit (LFCS) back in April, Lelbach
gave a talk about the progress that had been made, and the work still to be
done, for the LLVM Linux (LLL)
project. That talk, along with the rest of the LLVM track, was quite
interesting, and once again showed that having two (or more) "competing"
projects is generally beneficial to both.
Why build Linux with Clang?
Lelbach started off describing the reasons behind the decision to try to
build Linux with Clang, most of which centered around the diagnostics that
the compiler produces. The Clang static analyzer has the ability to show
"what the compiler sees when it's looking at your code", he
said. He thought that a huge codebase like Linux could benefit from that
kind of analysis.
In fact, the Clang diagnostics were quite useful when he was building the
Broadcom wireless driver for his MacBook, he said. Clang doesn't forget
things, so it can show macros before their expansion, typedefs, and so on.
It also shows the line in the source code with a caret pointing to the
offending code, along with "fixit hints". Those hints can be automatically
applied to the source code to fix the problem in question.
The project got a 2.6.36-based kernel running back in October, and now has
working kernels based on .37 and .38. Neither Xen nor KVM worked at the
time of the talk and Xen won't even compile, though KVM
is said to work now. More than 90% of the drivers in the kernel will at
least compile, and many will work. Some out-of-tree binary drivers (Broadcom, NVIDIA) will
work as well. SMP versions of the kernel for both 32- and 64-bit x86
platforms are now working, though some of the code needs to be patched in
order to build correctly.
Things that don't work
The integrated assembler (IA) for Clang does not have support for
generating "real
mode" code using .code16gcc directives, so the Linux boot code cannot
be built using IA. There is a "nasty pile" of real mode code
required to boot on x86, Lelbach said. IA is the default for recent
versions of Clang, but using the
GNU Assembler (gas) was required for the boot code. Adding support for an
LLVM x86-16 backend is the right approach, he said, and LLVM project
members in the audience agreed that it was something that could be added to
IA.
The "vast majority of GCC extensions are supported" by Clang,
even those which are not documented, which makes compiling the kernel much
easier. Things like inline assembly, the __attribute__ and
__builtin__ syntax, and so on, all just work. He expected that
there might be problems with inline assembly, but that has not proven to be
the case. Clang defaults to the C99 standard, though, so the
gnu89 standard needs to be specified to build the kernel.
There are some GCC extensions that aren't implemented, however, including
explicit register variables. That lack blocks Xen and some user-space
libraries (like glibc) from compiling. There
are also some "intentionally unsupported extensions",
including local/nested functions, which are only used in a Thinkpad driver.
A bigger problem is that Clang lacks support for variable-length arrays in
structures (VLAIS). A declaration like:
    void f (int i) {
        struct foo_t {
            char a[i];
        } foo;
    }
cannot be compiled in Clang, though declarations like:
    void f (int i) {
        char foo[i];
    }
are perfectly acceptable. Code like the former is used in the iptables
code, the
kernel hashing (HMAC) routines, and some drivers. Those parts have to be
patched
in order to be built, he said.
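One way such a declaration could be restructured so that Clang accepts it is to replace the variable-length member with a C99 flexible array member and allocate the structure dynamically. The sketch below is an assumption about how code of this shape might be patched; it does not describe the LLL project's actual patches, and the len field and foo_alloc() helper are hypothetical. User-space malloc() stands in for whatever allocator kernel code would use.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* The array length moves from the type into a field. */
struct foo_t {
    size_t len;
    char a[];   /* C99 flexible array member, sized at allocation time */
};

/* Allocate a foo_t with room for len bytes in a[]; NULL on failure. */
static struct foo_t *foo_alloc(size_t len)
{
    struct foo_t *foo = malloc(sizeof(*foo) + len);

    if (!foo)
        return NULL;
    foo->len = len;
    memset(foo->a, 0, len);
    return foo;
}

/* Counterpart of f() above: returns 0 on success, -1 if allocation fails. */
static int f(int i)
{
    struct foo_t *foo;

    assert(i >= 0);
    foo = foo_alloc((size_t)i);
    if (!foo)
        return -1;
    /* ... use foo->a[0] through foo->a[foo->len - 1] here ... */
    free(foo);
    return 0;
}
```

The cost of this approach is an allocation that can fail and must be freed, where the original structure simply lived on the stack; that difference is part of why such rewrites touch more code than the declaration alone.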
Once again, someone from the audience piped up to say that support for
VLAIS could be added as long as the patches were not "wildly invasive".
The LLL folks "prefer adding things to Clang rather than patching the
kernel", Lelbach said.
That led to a question about whether the project was pushing any of its
patches upstream to the kernel. Lelbach said that the PaX team (who is
another LLL developer) had
submitted a few, but that those were rejected; "after three, we
stopped" submitting them. Part of the problem is that the patches
are not ready for inclusion because there is a lack of developer time to get
them into shape. As an audience member noted, though, the kernel folks are
quick
to take any patches that fix bugs found by Clang.
Code generation and optimization problems
There are several code generation and optimization options for GCC that
aren't supported by Clang. One of those is -mregparm, which governs
the number of registers used to pass integer arguments. Without it, Clang
generates calls to functions like memcpy() that ignore the custom
calling conventions.
Also,
-fcall-saved-reg is not supported by Clang and that affects the
uses of the ALTERNATIVE() macro in the kernel, which chooses
between assembly instructions depending on the processor model. For some of the
__arch_hweight*() implementations ALTERNATIVE() buries
the actual function
call inside
assembly code, so Clang doesn't know about it. That means that the
generated code is not saving all of the registers that it needs to, so
uses
of ALTERNATIVE() are commented out and a normal call to the
function is used instead.
Another problem is with -pg, which enables instrumentation code
for function calls in GCC, and is used when building Ftrace. For inline
functions, the calls to mcount() get added multiple times, both
when the code is generated and when it is expanded inline. The
no_instrument_function attribute is not properly propagated to inline
functions, he said.
The final
problem that Lelbach mentioned is the -fno-optimize-sibling-calls
flag that is not supported by Clang. The flag disables tail call
elimination, and the kernel introspection code (like Ftrace) assumes
specific stack
depths in various places. Because Clang doesn't support the flag,
code which walks the call stack can end up dereferencing user-space
pointers, which
leads to runtime crashes. This was worked around by defining
HAVE_ARCH_CALLER_ADDR for x86 and defining
CALLER_ADDR[1-6] as dummy values, effectively disabling the stack
backtracing.
It is not just Lelbach who is working on LLL, and he noted that the PaX team,
Alp Toker, and Török Edwin have all contributed, along with various
Clang/LLVM and Linux
kernel hackers. There are plans to create a mailing list for the project
and the beginnings of a wiki are taking
shape. Overall, it's an interesting project that will likely end up
helping to find bugs in the kernel while discovering features that could or
should be supported by LLVM/Clang.
[ Thanks to Bryce Lelbach, PaX team, and Török Edwin for
filling in holes in my notes. ]
Page editor: Jonathan Corbet