Brief items
The current development kernel is 3.4-rc4, released on April 21. Linus
says: "But none of it really looks all that scary. It looks like the 3.4
release is all on track, but please do holler if you see regressions."
There are some reports of ongoing difficulties with the Nouveau graphics
driver; hopefully those will be sorted out soon.
Stable updates: the 3.0.29, 3.2.16, and 3.3.3 stable kernel updates were released on
April 22. With 3.2.16, Greg Kroah-Hartman will
stop producing 3.2 updates, but Ben Hutchings has stated his intent to
pick up and continue maintenance for this kernel for some time yet.
The 3.0.30 and 3.3.4 updates are in the review process as of
this writing; they can be expected on or after April 26.
Pretty much the FIRST thing you add after "hello world" is a device
tree parser, because that tells you how much memory you've got and
where it is.
--
Rob Landley
So I think the industry's well within range of making single
package Linux-capable devices with sufficient RAM, flash, CPU, and
basic peripherals like oscillator, USB, and I²C/SPI in a 8mm²
package for $5 today. In fact, the engineering effort for an ARM
licensee to do that is significantly less than you'd spend trying
to cut Linux's memory footprint in half. Ergo, the days of projects
like Linux-tiny are behind us.
--
Matt Mackall
Some patches scare me when they successfully boot.
--
Tejun Heo
TechCrunch has an interview with Linus Torvalds. "Would some kernel people
prefer getting tucked in at night instead of being cursed at? I’m sure that
would be appreciated. I don’t think I have it in me, though."
Google has announced that it has put up a (read-only) git.kernel.org mirror
at kernel.googlesource.com. "kernel.googlesource.com is served out of
multiple Google data centers, utilizing facilities in Asia, the United
States and Europe to provide speedy access from almost anywhere in the
world." (Thanks to several LWN readers.)
Kernel development news
By Jonathan Corbet
April 24, 2012
While storage devices are billed as being "random access" in nature, the
truth of the matter is that operations to some parts of the device can be
faster than operations to others. Rotating storage has a larger speed
differential than flash, while hybrid devices may show a large difference
indeed. Given that differences exist, it is natural to want to place more
frequently-accessed data on the faster part of the device. But a recent
proposal to allow applications to influence this placement has met with
mixed reviews; the problem, it seems, is a bit more complicated than it
appears.
The idea, as posted by Ted Ts'o, is to
create a couple of new flags to be provided by applications at the time a
file is created. A file expected to be accessed frequently would be
created with
O_HOT, while a file that will see traffic only rarely would be
marked with O_COLD. It is assumed that the filesystem would, if
possible, place O_HOT files in the fastest part of the underlying
device.
The implementation requires a change to the create() inode
operation; a new parameter is added to allow the VFS layer to pass down the
flags passed by the application. That change is the most intrusive part of
the patch, requiring tweaks to most filesystems—43 files changed in all.
The only filesystem actually implementing these flags at the outset is, naturally, ext4.
In that implementation, O_HOT files will be placed in low-numbered
blocks, while O_COLD files occupy the high-numbered blocks—but
only if the filesystem is stored on a rotating device. Requesting
O_HOT placement requires the CAP_SYS_RESOURCE capability or
the ability to dip into the reserved block pool.
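As a rough illustration of how an application might pass such a hint at
file-creation time, here is a minimal user-space sketch. O_HOT and O_COLD
are not defined by any released header; the placeholder value of 0 used
below is an assumption made only so the sketch compiles (the flags then
have no effect), and create_with_hint() is a hypothetical helper.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

/* Placeholders: the proposed flags are not in any released header.
 * Defining them as 0 turns them into no-ops; real values would have
 * to come from the kernel headers if the proposal were merged. */
#ifndef O_HOT
#define O_HOT 0
#endif
#ifndef O_COLD
#define O_COLD 0
#endif

/* Hypothetical helper: create a file and pass a placement hint.
 * The hint is only meaningful at creation time, so it always
 * travels together with O_CREAT. */
static int create_with_hint(const char *path, int hint)
{
    return open(path, O_CREAT | O_WRONLY | O_TRUNC | hint, 0644);
}
```

A caller would use create_with_hint(path, O_HOT) for, say, a frequently
consulted index and create_with_hint(path, O_COLD) for an archive; whether
the hint is honored would be up to the filesystem.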
Many people seem to like the core idea, but there were plenty of
questions about the specifics. What happens when the storage device is an
array of rotating devices? Why assume that a file is all "hot" or all
"cold" when some parts of a given file may be rather hotter than others? If an
application is using both hot and cold files, will the (long) seeks between
them reduce performance overall? What about files whose "hotness" varies
over time? Should this concept be tied into the memory management
subsystem's notion of hot and cold pages? And what do "hot" and "cold"
really mean, anyway?
With regard to the more general question, Ted responded that, while it would be possible to
rigorously define the meanings of "hot" and "cold" in this context, it's
not what he would prefer to do:
The other approach is to leave things roughly undefined, and accept
the fact that applications which use this will probably be
specialized applications that are very much aware of what file
system they are using, and just need to pass minimal hints to the
application in a general way, and that's the approach I went with
in this O_HOT/O_COLD proposal.
In other words, this proposal seems well suited to the needs of, say, a
large search engine company that is trying to get the most out of its
massive array of compute nodes. That is certainly a valid use case, but a
focus on that case may make it hard to generalize the feature for wider
use.
Generalizing the feature may also not be helped by placing the decision on
who can mark files as "hot" at the individual filesystem level.
That design could lead to different
policies provided by different filesystems; indeed, Ted expects that to
happen. Filesystem-level policy will allow for experimentation, but it
will push the feature further into an area where it is only useful for
specific applications where the developers have full control over the
underlying system. One would not expect to see O_HOT showing up
in random applications, since developers would have no real way to know
what using that flag would do for them. And that, arguably, is just as
well; otherwise, it would not be surprising to see the majority of files
eventually designated as "hot."
Interestingly, there is an alternative approach which was not discussed
here. In 2010, a set of "data temperature"
patches was posted to the btrfs list. This code watched accesses to
files and determined, on the fly, which blocks were most in demand. The
idea was that btrfs could then migrate the "hot" data to the faster parts
of the storage device, improving overall performance. That work would
appear to have stalled; no new versions of those patches have appeared for
some time. But, for the general case, it would stand to reason that actual
observations of access patterns would be likely to be more accurate than
various developers' ideas of which files might be "hot."
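To make the observation-based idea a little more concrete, here is a toy
sketch of one way access "temperature" could be tracked: a per-block score
that is bumped on each access and halved periodically so that old activity
fades. This is not the algorithm used by the btrfs patches; the structure,
function names, and constants are all assumptions chosen for illustration.

```c
#include <stdint.h>

/* Toy per-block heat counter (hypothetical, not from the btrfs work). */
struct block_heat {
    uint32_t heat;   /* decayed access score */
};

/* Record one access: bump the score, saturating at UINT32_MAX. */
static void heat_touch(struct block_heat *b)
{
    if (b->heat <= UINT32_MAX - 16)
        b->heat += 16;
    else
        b->heat = UINT32_MAX;
}

/* Called periodically: halve the score so old accesses fade away. */
static void heat_decay(struct block_heat *b)
{
    b->heat >>= 1;
}

/* A block counts as "hot" once its score reaches the threshold. */
static int heat_is_hot(const struct block_heat *b, uint32_t threshold)
{
    return b->heat >= threshold;
}
```

A filesystem using something along these lines could migrate blocks whose
score crosses the threshold, with no application involvement at all.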
In summary, it seems that, while there is apparent value in the concept of
preferential treatment for frequently-accessed data, figuring out how to
request and implement that treatment will take some more time. Among other
things, any sort of explicit marker (like O_HOT) will quickly
become part of the kernel ABI, so it will be difficult to change once
people start using it. So it is probably worthwhile to ponder for a while
on how this feature can be suitably designed for the long haul, even if
some hot files will have to languish in cold storage in the meantime.
By Jonathan Corbet
April 24, 2012
Locking and the associated possibility of deadlocks are a challenge for
developers working anywhere in the kernel. But that challenge appears to
be especially acute in the virtual filesystem (VFS) layer, where the needs
of many collaborating subsystems all come together in one place. The
difficulties inherent in VFS locking were highlighted recently when the
proposed
IMA appraisal extension ran into
review problems.
The proposed fix shows that, while these issues can be worked around, the
solution is not necessarily simple.
The management of file structure reference counts is done with
calls to fget() and fput(). A file structure,
which represents an open file, can depend on a lot of resources: as long as
a file is open, the kernel must maintain its underlying storage device,
filesystem, network protocol information, security-related information,
user-space notification requests, and more. An fget() call will
ensure that all of those resources stay around as long as they are needed.
A call to fput(), instead, might result in the destruction of any
of those resources. For example, closing the last file on an unmounted
filesystem will cause that filesystem to truly go away.
What all this means is that a call to fput() can do a lot of work,
and that work may require the acquisition of a number of locks. The
problem is that fput() can also be called from any number of
contexts; there are a few hundred fput() and fput_light()
calls in the kernel. Each of those call sites has its own locking
environment and, usually, no knowledge of what code in other subsystems
may be called from fput(). So the potential for problems like
locking-order violations is real.
The IMA developers ran into exactly that sort of problem. The IMA
appraisal cleanup code is one of those functions that can be invoked from
an arbitrary fput() call. That code, it seems, sometimes needs to
acquire the associated inode's i_mutex lock. But the locking
rules say that, if both i_mutex and the task's mmap_sem
are to be acquired, i_mutex must be taken first. If somebody
calls fput() with mmap_sem held—something that happens in
current kernels—the ordering rule will be violated, possibly deadlocking
the system. A deadlocked system is arguably secure, but IMA users might be
forgiven for complaining about this situation anyway.
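The ordering rule can be modeled in a few lines of user-space C. The
sketch below is a toy: each lock class gets a rank, and taking a lock is
only allowed if no higher-ranked lock is already held. The names and the
bitmask representation are assumptions for illustration and do not
correspond to any kernel interface.

```c
#include <stdbool.h>

/* Toy lock classes, ranked in the order they must be acquired. */
enum lock_rank { RANK_I_MUTEX = 1, RANK_MMAP_SEM = 2 };

/* Per-task record of held locks, one bit per rank. */
struct held_locks {
    unsigned int mask;
};

/* True if taking `rank` now respects the ordering rule, that is,
 * no lock with a higher rank is currently held. */
static bool may_acquire(const struct held_locks *h, enum lock_rank rank)
{
    unsigned int higher = ~((1u << (rank + 1)) - 1);  /* bits above rank */

    return (h->mask & higher) == 0;
}

static void note_acquire(struct held_locks *h, enum lock_rank rank)
{
    h->mask |= 1u << rank;
}

static void note_release(struct held_locks *h, enum lock_rank rank)
{
    h->mask &= ~(1u << rank);
}
```

In this model, an fput() call made with mmap_sem held corresponds to
may_acquire(&held, RANK_I_MUTEX) returning false: the cleanup code would
be taking i_mutex in the wrong order.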
To get around this problem, IMA tried to check for the possibility of
deadlock inside fput(), and, in that case, defer the underlying
__fput() call (which does the real work) to a later and safer
context. This idea did not impress VFS
maintainer Al Viro, who pointed out that there is no way to encode all
of the kernel's locking rules into fput(). In such situations, it
can be common for core kernel developers to say "NAK" and get back to what
they were doing before, but Al continued to ponder the problem, saying:
If it had been IMA alone, I would've cheerfully told them where to
stuff the whole thing. Unfortunately, fput() *is* lock-heavy even
without them.
After thinking for a bit, he came up with a
plan that offered a way out. Like the scheme used by IMA, Al's idea
involves turning risky fput() calls into an asynchronous operation
running in a separate thread. But there is no knowledge of locking rules
added to fput(); instead, the situation is avoided altogether
whenever possible, and all remaining calls are done asynchronously.
In particular, Al is looking at all callers of fget() and
fput() to see if they can be replaced with fget_light()
and fput_light() instead. The "light" versions have a number of
additional requirements: they come close to requiring that the calling code
run in atomic context while the reference to the file structure is
held. For a lot of situations - many system calls, for example - these
rules don't get in the way. As the name suggests, the "light" versions are
less expensive, so switching to them whenever possible makes sense
regardless of any other issues.
Then, fput() in its current form is renamed to
fput_nodefer(). A new fput() is added that, when
the final reference to a file is released, queues the real work to be done
asynchronously later on. The "no defer" version will obviously be
faster—the deferral mechanism will have a cost of its own—so its use will
be preferred whenever possible. In this case, "whenever possible" means
whenever the caller does not hold any locks. That is a constraint that can
be independently verified for each call site; the "no defer" name should
also hopefully serve as a warning for any developer who might change the
locking environment in the future.
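The split between the two functions can be pictured with a small
user-space model. Everything here is hypothetical: toy_file stands in for
the file structure, toy_final_release() for __fput(), and a simple linked
list for whatever deferral mechanism is actually used. The model is
single-threaded; real code would need atomic reference counts.

```c
#include <stddef.h>

/* Toy stand-in for a file structure. */
struct toy_file {
    int refcount;
    int released;            /* set once the "real work" has run */
    struct toy_file *next;   /* linkage on the deferred list */
};

static struct toy_file *deferred_list;

/* Stand-in for __fput(): the lock-heavy final cleanup. */
static void toy_final_release(struct toy_file *f)
{
    f->released = 1;
}

/* Modeled on the proposed fput_nodefer(): the final cleanup runs
 * immediately, so the caller must not hold any locks. */
static void toy_fput_nodefer(struct toy_file *f)
{
    if (--f->refcount == 0)
        toy_final_release(f);
}

/* Modeled on the proposed new fput(): dropping the last reference
 * only queues the file; the cleanup happens later. */
static void toy_fput(struct toy_file *f)
{
    if (--f->refcount == 0) {
        f->next = deferred_list;
        deferred_list = f;
    }
}

/* Run later, from a context known to hold no locks. */
static void toy_run_deferred(void)
{
    while (deferred_list) {
        struct toy_file *f = deferred_list;

        deferred_list = f->next;
        toy_final_release(f);
    }
}
```

The model makes the cost of deferral visible: after the last toy_fput(),
the file stays unreleased until toy_run_deferred() gets around to it.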
With luck, all of the performance-critical calls can be moved to the "no
defer" version, minimizing the performance hit that comes from the deferral
of the fput() call. So it seems like a workable solution—except
for one little problem pointed out by
Linus: deferral can change the behavior seen by user space. In particular,
the actual work of closing a file may not be complete by the time control
returns to user space, causing the process's environment to differ in
subtle and timing-dependent ways. Any program
that expects that the cleanup work will be fully done by the time a
close() call returns might just break.
The "totally asynchronous deferral" literally *breaks*semantics*.
Sure, it won't be noticeable in 99.99% of all cases, and I doubt
you can trigger much of a test for it. But it's potential real
breakage, and it's going to be hard to ever see. And then when it
*does* happen, it's going to be totally impossible to debug.
That does not seem like a good outcome either. The good news is that there
is a potential solution out there in the form of Oleg Nesterov's
task_work_add() patch set. This patch set adds functionality similar to
workqueues, but with a fundamental difference: the work is run in the
context of the process that was active at the time the work is added.
In brief, the interface defines work to be done this way:
#include <linux/task_work.h>

typedef void (*task_work_func_t)(struct task_work *);

struct task_work {
    struct hlist_node hlist;
    task_work_func_t func;
    void *data;
};
The task_work structure can be initialized with:
void init_task_work(struct task_work *twork, task_work_func_t func,
                    void *data);
The work can be queued for execution with:
int task_work_add(struct task_struct *task, struct task_work *twork,
                  bool notify);
A key aspect of this interface is that it will run any queued work before
returning to user space from the kernel. So that work is guaranteed to be
done before user space can run again; in the case of a function like
close(), that guarantee means that user space will see the same
semantics it did before, without subtle timing issues. So, Linus
suggested, this API may be just what is needed to make the new
fput() scheme work.
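The "runs before returning to user space" guarantee is the important part,
and it can be modeled outside the kernel. In the sketch below, every
toy_-prefixed name is hypothetical; toy_task_work mirrors the shape of
struct task_work above, with a plain next pointer standing in for the
hlist_node, and the order in which queued items run is an arbitrary choice.

```c
#include <stddef.h>

struct toy_task_work;
typedef void (*toy_work_func_t)(struct toy_task_work *);

struct toy_task_work {
    struct toy_task_work *next;
    toy_work_func_t func;
    void *data;
};

/* Stand-in for the task: just its list of pending work. */
struct toy_task {
    struct toy_task_work *works;
};

static void toy_init_task_work(struct toy_task_work *w,
                               toy_work_func_t func, void *data)
{
    w->next = NULL;
    w->func = func;
    w->data = data;
}

static void toy_task_work_add(struct toy_task *t, struct toy_task_work *w)
{
    w->next = t->works;
    t->works = w;
}

/* Drain the queue in the context of the task itself. */
static void toy_task_work_run(struct toy_task *t)
{
    while (t->works) {
        struct toy_task_work *w = t->works;

        t->works = w->next;
        w->func(w);
    }
}

/* Model of a system call: run the body, then all pending work,
 * and only then "return to user space". */
static void toy_syscall(struct toy_task *t, void (*body)(struct toy_task *))
{
    body(t);
    toy_task_work_run(t);
}

/* Example: a close()-like body that defers its cleanup. */
static int cleanup_done;
static struct toy_task_work close_work;

static void mark_done(struct toy_task_work *w)
{
    *(int *)w->data = 1;
}

static void toy_close_body(struct toy_task *t)
{
    toy_init_task_work(&close_work, mark_done, &cleanup_done);
    toy_task_work_add(t, &close_work);
}
```

By the time toy_syscall() returns, cleanup_done has been set, which is the
property that keeps close() looking synchronous from user space.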
There is just one final little problem: about a half-dozen architectures
lack the proper infrastructure to support task_work_add()
properly. That makes it unsuitable for use in the core VFS layer. Unless,
of course, you're Al Viro; in that case it's just a matter of quickly reviewing all the architectures and
coming up with a proposed fix—perhaps in assembly language—for each one.
Assuming Al's work passes muster with the architecture maintainers, all of
this work is likely to be merged for 3.5 - and the IMA appraisal work
should be able to go in with it.
By Jake Edge
April 25, 2012
Now that it is looking like Linux will be getting an enhanced "secure
computing" (seccomp) facility, some are starting to turn toward actually
using the new feature in applications. To that end, Paul Moore has introduced libseccomp, which is meant to make
it easier for applications to take advantage of the packet-filter-based
seccomp mode. That will lead to more secure applications that can
permanently reduce their ability to make "unsafe" system calls, which can
only be a good thing for Linux application security overall.
Enhanced seccomp has taken a somewhat tortuous path toward the
mainline—and it's not done yet. Will Drewry's BPF-based solution (aka seccomp filter or
seccomp mode 2) is
currently in linux-next,
and recent complaints about it have been few and far between, so it would seem
likely that it will appear in the 3.5 kernel. It will provide
fine-grained control over the system calls that the process (and its
children) can make.
What libseccomp does is make it easier for applications to add support
for sandboxing themselves by providing a simpler API to use the new
seccomp mode. By way of contrast, Kees Cook posted a seccomp filter
tutorial that describes how to build an application using the filters
directly.
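To give a feel for what "directly" means, here is a sketch of a
hand-assembled filter built with the BPF_STMT() and BPF_JUMP() macros from
<linux/filter.h>. It is not taken from the tutorial; build_filter() is a
hypothetical helper, the choice of allowed calls is arbitrary, and
installing the result (via prctl() with PR_SET_SECCOMP and
SECCOMP_MODE_FILTER) is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Fill `f` (room for at least 9 entries) with a filter that allows
 * only write() and exit_group(); returns the instruction count.
 * `arch` is the AUDIT_ARCH_* value the filter should accept. */
static size_t build_filter(struct sock_filter *f, uint32_t arch)
{
    size_t n = 0;

    /* Load the architecture; kill the process on a mismatch. */
    f[n++] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                                          offsetof(struct seccomp_data, arch));
    f[n++] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
                                          arch, 1, 0);
    f[n++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL);

    /* Load the system call number. */
    f[n++] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                                          offsetof(struct seccomp_data, nr));

    /* For each allowed call: on a match fall through to ALLOW,
     * otherwise skip over it to the next test. */
    f[n++] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
                                          SYS_write, 0, 1);
    f[n++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW);
    f[n++] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
                                          SYS_exit_group, 0, 1);
    f[n++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW);

    /* Anything else: kill the process. */
    f[n++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL);
    return n;
}
```

Even this tiny filter requires counting jump offsets by hand, which is the
sort of bookkeeping a library can take over.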
In addition, it is also interesting to see that the recent OpenSSH 6.0
release contains support for seccomp filtering using a
(pre-libseccomp) patch from
Drewry. The patch limits the privilege-separated OpenSSH child process to
a handful of legal system calls, while setting up open() to fail with an
EACCES error.
As described in the man pages that accompany the libseccomp code, the
starting point is to include seccomp.h; the application must then call:
int seccomp_init(uint32_t def_action);
The def_action parameter governs the default action that is taken when a
system call is rejected by the filter. SCMP_ACT_KILL will kill the
process, while SCMP_ACT_TRAP will cause a SIGSYS signal to be issued.
There are also options to force rejected system calls to return a certain
error (SCMP_ACT_ERRNO(errno)), to generate a ptrace() event
(SCMP_ACT_TRACE(msg_num)), or to simply allow the system call to proceed
(SCMP_ACT_ALLOW).
Next, the application will want to add its filter rules. Those rules can
apply to any invocation of a particular system call, or they can restrict
calls to only use certain values for the system call arguments. So, a rule
could specify that write() can only be used on file descriptor 1, or that
open() is forbidden, for example.
The interface for adding rules is:
int seccomp_rule_add(uint32_t action,
                     int syscall, unsigned int arg_cnt, ...);
The action parameter uses the same action macros as are used in
seccomp_init(). The syscall argument is the system call number of interest
for this rule, which could be specified using __NR_syscall values, but it
is recommended that the SCMP_SYS() macro be used to properly handle
multiple architectures. The arg_cnt parameter specifies the number of
argument rules that are being passed; those rules then follow.
In the simplest case, where the rule is just allowing a system call for
example, there are no argument rules. So, if the default action is
to kill the process, adding a rule to allow close() would look
like:
seccomp_rule_add(SCMP_ACT_ALLOW, SCMP_SYS(close), 0);
Doing filtering based on the system call arguments relies on a set of
macros that specify the argument of interest by number (SCMP_A0() through
SCMP_A5()), and the comparison to be done (SCMP_CMP_EQ, SCMP_CMP_GT, and
so on). So, adding a rule that allows writing to stderr would look like:
seccomp_rule_add(SCMP_ACT_ALLOW, SCMP_SYS(write), 1,
                 SCMP_A0(SCMP_CMP_EQ, STDERR_FILENO));
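How a default action and a set of rules like the two above combine can be
modeled in plain C. The sketch below is not libseccomp code and does not
reflect how the library or the kernel evaluates filters internally; the
toy_ types and toy_eval() are hypothetical names, and only two comparison
operators are modeled.

```c
#include <stddef.h>
#include <stdint.h>

enum toy_action { TOY_KILL, TOY_ALLOW };
enum toy_cmp { TOY_CMP_ANY, TOY_CMP_EQ, TOY_CMP_GT };

struct toy_rule {
    int syscall;             /* system call number */
    unsigned int arg;        /* argument index, 0 through 5 */
    enum toy_cmp op;         /* TOY_CMP_ANY: no argument check */
    uint64_t datum;          /* value the argument is compared to */
    enum toy_action action;  /* what to do when the rule matches */
};

/* Return the action of the first matching rule, or the default
 * action if no rule matches the call. */
static enum toy_action toy_eval(const struct toy_rule *rules, size_t n,
                                enum toy_action def_action,
                                int syscall, const uint64_t args[6])
{
    for (size_t i = 0; i < n; i++) {
        const struct toy_rule *r = &rules[i];

        if (r->syscall != syscall)
            continue;
        if (r->op == TOY_CMP_ANY ||
            (r->op == TOY_CMP_EQ && args[r->arg] == r->datum) ||
            (r->op == TOY_CMP_GT && args[r->arg] > r->datum))
            return r->action;
    }
    return def_action;
}
```

With a default of TOY_KILL, one unconditional rule for close() and one
rule for write() with argument 0 equal to 2, a write to descriptor 2 is
allowed while a write to any other descriptor gets the default action.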
Once all the rules have been added, the filter is loaded into the kernel
(and activated) with:
int seccomp_load(void);
The internal library state that was used to build the filter is no longer
needed after the call to seccomp_load(), so it can be released with a call
to:
void seccomp_release(void);
There are a handful of other functions that libseccomp provides, including
two ways to extract the filter code from the library:
int seccomp_gen_bpf(int fd);
int seccomp_gen_pfc(int fd);
Those functions will write the filter code in either kernel-readable BPF or
human-readable Pseudo Filter Code (PFC) to fd. One can also set the
priority of system calls in the filter. That priority is used as a hint by
the filter generation code to put higher priority calls earlier in the
filter list to reduce the overhead of checking those calls (at the expense
of the others in the rules):
int seccomp_syscall_priority(int syscall, uint8_t priority);
In addition, there are a few attributes for the seccomp filter that can be
set or queried using:
int seccomp_attr_set(enum scmp_filter_attr attr, uint32_t value);
int seccomp_attr_get(enum scmp_filter_attr attr, uint32_t *value);
The attributes available are the default action for the filter
(SCMP_FLTATR_ACT_DEFAULT, which is read-only), the action taken when the
loaded filter does not match the architecture it is running on
(SCMP_FLTATR_ACT_BADARCH, which defaults to SCMP_ACT_KILL), and whether
PR_SET_NO_NEW_PRIVS is turned on or off before activating the filter
(SCMP_FLTATR_CTL_NNP, which defaults to NO_NEW_PRIVS being turned on). The
NO_NEW_PRIVS flag is a recent kernel addition that stops a process and its
children from ever being able to get new privileges (via setuid() or
capabilities, for example).
The last attribute came about after some discussion in the announcement
thread. The consensus on the list was that it was desirable to set
NO_NEW_PRIVS by default, but allow libseccomp users to override that if
desired. Other than some kudos from other developers about the project,
the only other messages in the thread concerned the GPLv2 license. Moore
said that the GPL was really just his default license and, since it made
more sense for a library to use the LGPL, he was able to get the other
contributors to agree to switch to the LGPLv2.1.
While it is by no means a panacea, the seccomp filter will provide a way
for applications to make themselves more secure. In particular, programs
that handle untrusted user input, like the Chromium browser which was the
original impetus for the feature, will be able to limit the ways in which
damage can be done through a security hole in their code.
One would guess we will see more applications using the feature via
libseccomp. Seccomp mode 2 is currently available in Ubuntu kernels, and is
slated for inclusion in ChromeOS—with luck we'll see it in the
mainline soon too.
Patches and updates
Miscellaneous
- Lucas De Marchi: kmod 8. (April 19, 2012)
Page editor: Jonathan Corbet