The 3.4 merge window remains open, so there is no current
development kernel. See the article below for a summary of changes pulled
in for the 3.4 release.
Stable updates: the 3.0.26 and 3.2.13 updates were released on March 23;
each contains a small set of important fixes.
The 188.8.131.52 update came out on March 22; it
has a much longer list of fixes.
Current trends are: for every 1000 patches sent there's maybe one
patch that has a tad too much information in its changelog - but
instead offers good entertainment in the changelog so it's still
perfectly fine. 990 patches have too little information. The
remaining 9 are just fine.
-- Ingo Molnar
When trying to review this I went completely crosseyed then fell on
-- Andrew Morton has a dangerous job
In a classic computer environment you would want the log filled
with notifications so that the user could do something about it. On
a phone, settop box, TV set or seatback entertainment system
logging is evil. No one who has any business seeing a log message
has any desire to see one. It does not matter how important the log
message might be.
It's getting harder and harder to have rational error handling at
the OS level as application environments move to higher levels and
-- Casey Schaufler
Linus has merged a change
which moves the Nouveau graphics driver out of its symbolic
location in staging and into the mainline proper; among other things, this
move is an indication that no further ABI breaks (which have not happened
for a while anyway) are expected. Also merged is initial mode-setting
support for the just-released "Kepler" chipset from NVIDIA.
("Symbolic" because the Nouveau code has never been in the staging tree;
only the configuration option was placed there.)
Kernel development news
In the 3.3 release announcement, Linus
warned developers that he would be taking a bit of time off during the
merge window; that did indeed happen over the last week. Still, he managed
to pull some 4,000 changesets since last week's
summary. Some of the more significant changes merged in the last week
include:
- The PowerPC has gained a new firmware-assisted dump facility for the
quick capture and analysis of crash dumps.
- The GFS2 filesystem now supports the FITRIM ioctl()
command, which can be used to send discard requests to the underlying
storage device.
- The prctl() system call has a new option called
PR_SET_CHILD_SUBREAPER. Marking a process this way will
cause any orphan descendant processes to be reparented to the marked
process rather than to the init process. There is a
corresponding PR_GET_CHILD_SUBREAPER option as well.
- The Microblaze architecture now has high memory support.
- The ext4 "noacl" and "noattr" mount options have been marked
deprecated with an eye toward removal in the near future. Without
these options, it will not be possible to disable ACL and extended
attribute support. No other filesystem allows that support to be
disabled. The "journal=update" and "resize" mount options have been
removed entirely. On the other hand, plans to remove the "bsd_df",
"minix_df", "grpid" and "nogrpid" options have been dropped in
response to complaints from users.
- New hardware support includes:
Changes visible to kernel developers include:
Also worthy of note is that there has been a vast amount of work done in
the ARM architecture tree; the process of consolidating and cleaning up the
ARM code continues at a high rate.
The 3.4 merge window would normally be expected to end around
April 2. When he announced his vacation, Linus said that he would
extend the merge window for a bit if necessary - though he warned that he
would still only consider pull requests received during the window.
Whether that will happen remains to be seen; either way, next week's Kernel
Page will summarize the last new features merged for 3.4.
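The PR_SET_CHILD_SUBREAPER option described in the list above can be tried from user space. What follows is a minimal sketch, assuming a Linux kernel that supports the option (3.4 or later) and Python's ctypes module. The option numbers 36 and 37 come from the kernel's prctl.h header; the helper names are purely illustrative and are not part of any library.

```python
import ctypes
import os

# Option numbers as defined in <linux/prctl.h>.
PR_SET_CHILD_SUBREAPER = 36
PR_GET_CHILD_SUBREAPER = 37


def _prctl(option, arg2):
    """Call prctl(2) through the C library; raise OSError on failure."""
    libc = ctypes.CDLL(None, use_errno=True)
    zero = ctypes.c_ulong(0)
    # prctl() is variadic, so every argument is given an explicit C type.
    if libc.prctl(ctypes.c_int(option), arg2, zero, zero, zero) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def set_child_subreaper(enable=True):
    """Mark (or unmark) the calling process as a child subreaper."""
    _prctl(PR_SET_CHILD_SUBREAPER, ctypes.c_ulong(1 if enable else 0))


def is_child_subreaper():
    """Return True if the calling process is currently a child subreaper."""
    flag = ctypes.c_int(0)
    # The kernel writes the current setting into the int that arg2 points to.
    _prctl(PR_GET_CHILD_SUBREAPER, ctypes.byref(flag))
    return bool(flag.value)
```

A service manager could call set_child_subreaper() early in its startup so that orphaned descendants are reparented to it, and can then be reaped with the usual wait() family of calls, rather than going to init.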
The "integrity" of a Linux system is based on whether it is running the
code that the administrator expects. If not, a compromise of the system
may have occurred. The Linux integrity subsystem is meant to detect
those unexpected changes to files in order to protect systems against
compromise. That is done by creating integrity "measurements" (hashes of
contents and metadata) of files of interest.
Much of what is needed to do integrity management has already landed in
the mainline, but there are a few remaining pieces.
The integrity measurement
architecture (IMA) appraisal extension patch set
from Mimi Zohar and Dmitry Kasatkin fills in one missing piece: storing and
validating the integrity measurement of files.
A hash of a file's contents and metadata will be stored in the security.ima
extended attribute (xattr) of the file, and the patch set will create and
maintain those xattrs.
In addition, it can enforce that the file contents are "correct" when the file is opened for reading or executing based on
integrity values that were stored.
The integrity subsystem has taken a rather twisted path into the kernel.
It was proposed as far back as 2005, but
the subsystem has been broken up into smaller pieces several times along
the way. Much of IMA was added to the kernel in 2.6.30, but another piece,
the extended verification module (EVM) was
not merged until 3.2. Digital signature support was added to EVM in 3.3,
and IMA appraisal is currently under review.
As described on the Linux
IMA web page, the integrity subsystem is meant to thwart various kinds
of attacks against the contents of files, both on- and off-line. Unexpected
changes to files, particularly executables, may be a sign that the system
has been compromised. In addition, the subsystem allows the use of the
"Trusted Platform Module" (TPM) to collect integrity measurements and sign
them in such a way that the system can "attest" to its integrity. That
attestation could be sent to another system to "prove" that the system is
intact—only approved code is running.
Current kernels can generate an integrity measurement of files that are
executed, collect and digitally sign them with keys from the
TPM (or the kernel keyring), and use that information for remote
attestation. EVM adds the ability to thwart offline attacks against the
file contents or metadata by hashing the values of the security xattrs of
the file (e.g. security.selinux, security.ima), signing
that hash, and storing it as security.evm.
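The EVM idea described above (a keyed hash over the security xattr values, stored as security.evm) can be illustrated in user space. This is a simplified sketch rather than the kernel's implementation: it models xattrs as a plain dictionary, assumes HMAC-SHA1 as the keyed hash, hashes the values in sorted-name order, and leaves out the inode metadata that the real EVM also covers. The function names are hypothetical.

```python
import hashlib
import hmac


def evm_digest(xattrs, key):
    """Compute a keyed hash over the values of the security.* xattrs.

    xattrs maps xattr names to bytes values; security.evm itself is skipped.
    """
    mac = hmac.new(key, digestmod=hashlib.sha1)
    names = sorted(
        n for n in xattrs
        if n.startswith("security.") and n != "security.evm"
    )
    for name in names:
        mac.update(xattrs[name])
    return mac.digest()


def evm_verify(xattrs, key):
    """Return True if the stored security.evm matches the other xattrs."""
    stored = xattrs.get("security.evm")
    if stored is None:
        return False
    return hmac.compare_digest(stored, evm_digest(xattrs, key))
```

Because the key is not available to an offline attacker, a change to security.ima or security.selinux made while the system is down cannot be accompanied by a matching security.evm value, so evm_verify() would fail at the next check.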
But, there is nothing in place that would stop a running system from
executing or reading a file that has been changed. If a file with an IMA
hash is opened for reading or executing, the appraisal extension will
check to see if the contents match the stored hash. If they don't match, the
ima_appraise kernel command-line parameter determines what happens. If it is set to
"enforce", access to the file is denied, while "fix" will update the IMA
xattr with the new value. In addition, "off" can be used to turn off any
appraisal.
In order to recognize that a file has changed while it is open, the extension
requires the filesystem to support i_version, which is a counter
that gets incremented any time the file's inode gets updated. Filesystems
must be mounted with the i_version option in order for the appraisal
extension to work. That allows the extension to notice the change when the
file is closed and either update the xattr or
flag the file change as a policy violation.
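The appraisal behavior described above can be modeled with a small user-space simulation. This sketch makes several assumptions that are not in the patch set: files are in-memory objects, SHA-256 stands in for the measurement hash, a file with no stored hash is treated as a mismatch, and the may_update flag is a hypothetical stand-in for whatever policy decides between updating the xattr and flagging a violation at close time.

```python
import hashlib


class SimFile:
    """In-memory stand-in for a file with an i_version counter."""

    def __init__(self, contents):
        self.contents = contents
        self.i_version = 0      # bumped on every modification
        self.ima_xattr = None   # simulated security.ima value

    def write(self, contents):
        self.contents = contents
        self.i_version += 1


def measure(f):
    """Hash the file contents (SHA-256 is an assumption of this sketch)."""
    return hashlib.sha256(f.contents).digest()


def appraise_open(f, mode="enforce"):
    """Return True if opening the file would be allowed under the given mode."""
    if mode == "off":
        return True
    if f.ima_xattr == measure(f):
        return True
    if mode == "fix":
        # "fix" repairs the stored value instead of denying access.
        f.ima_xattr = measure(f)
        return True
    return False  # "enforce": the contents do not match the stored hash


def on_close(f, version_at_open, may_update=True):
    """Compare i_version with its value at open time and react to a change."""
    if f.i_version == version_at_open:
        return "unchanged"
    if may_update:
        f.ima_xattr = measure(f)
        return "updated"
    return "violation"
```

The i_version comparison is what makes the close-time check cheap: the contents only need to be hashed again when the counter shows that the inode actually changed while the file was open.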
In order to get the initial security.ima xattrs
on files that are to be appraised (by default, all files owned by root),
one boots the kernel with ima_appraise_tcb (which enables the default
appraisal policy) and ima_appraise=fix, and then opens all files of interest (e.g. via
a find command as suggested
on the IMA web page).
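The labeling pass described above only requires that each file of interest be opened once while the kernel is running in "fix" mode. As a rough sketch of that step, assuming nothing beyond the Python standard library, the hypothetical helper below walks a directory tree and opens every regular file owned by a given user. It is not the command from the IMA web page, and it does nothing useful unless the kernel has been booted with the parameters mentioned above.

```python
import os
import stat


def touch_files_for_labeling(root="/", uid=0):
    """Open every regular file under root that is owned by uid.

    Returns the number of files successfully opened. Symbolic links and
    special files are skipped; unreadable files are silently ignored.
    """
    opened = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
                if not stat.S_ISREG(st.st_mode) or st.st_uid != uid:
                    continue
                # Reading a single byte is enough to force an open-for-read.
                with open(path, "rb") as f:
                    f.read(1)
                opened += 1
            except OSError:
                continue
    return opened
```

On a real system the walk would normally be restricted to the filesystems that are to be appraised; starting at "/" as shown would also descend into pseudo-filesystems such as /proc.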
The IMA appraisal extension will complete the off-line attack detection
that EVM provides. Because the extension will create and maintain
the security.ima xattr,
EVM will be able to detect changes to the file contents.
In response to an earlier version of the patch set, James Morris asked if
there were any distributions that were planning to use IMA and EVM once all
the pieces are in place. George Wilson said that IBM plans to use it
internally once distributions have incorporated it. In addition, Ryan Ware
and Kasatkin said that the Tizen mobile distribution plans to use it for
some product profiles.
But, before any of that can happen, the appraisal extension needs to find a
way to change its locking behavior to get past a NAK by Al Viro.
In the current patches,
the final __fput() is deferred if a file is closed before
is called in kernels using IMA appraisal. Viro is concerned that this
changes the locking behavior depending
on whether the kernel is using IMA or not, which may make locking
problems harder to spot. He also said that the overhead is too high for a
commonly used path, and that not all of the places where __fput()
is used were covered by the patch.
So far, no solution to the problem has been
found. Viro did suggest possibly
using a different mutex for changing xattrs, but said that it would take a fair
amount of code review to
determine whether that could be done.
Given that the patch set completes a job started by EVM, and will, for the
most part, complete the integrity subsystem, it seems likely that a
solution will be found. There are a few lingering pieces of IMA
appraisal that are still coming, according to the "An
Overview of the Linux Integrity Subsystem [PDF]" white paper. Two
specific pieces are mentioned, one to add digital signature capabilities
for vendor-signed files, and another that will protect directory contents
(e.g. filenames). While the currently proposed patches may still need some
work before they can be considered for the mainline, those working on the
integrity subsystem are probably finally starting to see the light at the
end of a long tunnel.
Last week's Kernel Page included an article
on Peter Zijlstra's NUMA scheduling patch set. As it happens, Peter is not
the only developer working in this area; Andrea Arcangeli has posted a NUMA
scheduling patch set of his own called AutoNUMA. Andrea's goal is the same - keep
processes and their memory together on the same NUMA node - but the
approach taken to get there is quite different. These two patch sets
embody a disagreement on how the problem should be solved that could take
some time to work out.
Peter's patch set works by assigning a "home node" to each process, then
trying to concentrate the process and its memory on that node. Andrea's
patch lacks the concept of home nodes; he thinks it is an idea that will
not work well for programs that don't fit into a single node unless
developers add code to use Peter's new system calls. Instead, Andrea would
like NUMA scheduling to "just work" in the same way that transparent huge
pages do. So his patch set seems to assume that
resources will be spread out across the system; it then focuses on cleaning
things up afterward. The key to the cleanup task is a bunch of statistics
and a couple of new kernel threads.
The first of these threads is called knuma_scand. Its primary job
is to scan through each process's address space, marking its in-RAM
anonymous pages with a special set of bits that makes the pages look, to
the hardware, like they are not present. If the process tries to access such a
page, a page fault will result; the kernel will respond by marking the page
"present" again so that the process can go about its business. But the
kernel also tracks the node that the page lives on and the node the
accessing process was running on, noting any mismatches. For each process,
the kernel maintains an array of counters to track which node each of its
recently-accessed pages was
located on. For pages, the information tracked is necessarily more coarse;
the kernel simply remembers the last node to access each page.
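The bookkeeping described above can be modeled in a few lines. This is a user-space simulation under stated assumptions, not kernel code: pages and processes are identified by plain Python values, the "not present" marking is a set, and all class and method names are invented for illustration.

```python
from collections import defaultdict


class NumaStats:
    """Toy model of the scan-and-fault statistics gathering."""

    def __init__(self, nr_nodes):
        self.page_home = {}         # page -> node the page lives on
        self.not_present = set()    # pages marked by the scanning thread
        self.page_last_node = {}    # page -> last node to access it
        # Per-process counters: faults seen on pages living on each node.
        self.proc_faults = defaultdict(lambda: [0] * nr_nodes)
        self.mismatches = 0

    def add_page(self, page, node):
        self.page_home[page] = node

    def scan(self, pages):
        """What knuma_scand does: make the pages look not present."""
        self.not_present.update(pages)

    def access(self, pid, page, cpu_node):
        """Simulate an access; return True if it caused a fault."""
        if page not in self.not_present:
            return False
        self.not_present.discard(page)       # mark "present" again
        home = self.page_home[page]
        self.proc_faults[pid][home] += 1     # coarse per-process counters
        self.page_last_node[page] = cpu_node
        if home != cpu_node:
            self.mismatches += 1             # page and process are on different nodes
        return True
```

Only the first access after each scan is counted, which is why the statistics are a sample of recent behavior rather than an exact access count.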
When the time comes for the scheduler to make a decision, it passes over
the per-process statistics to determine whether the target process would be better
off if it were moved to another node. If the process seems to be accessing
most of its pages remotely, and it is better suited to the remote node than
the processes already running there, it will be migrated over. This code
drew a strenuous objection from Peter, who
does not like the idea of putting a big for-each-CPU loop into the middle
of the scheduler's hot path. After some light resistance, Andrea agreed
that this logic eventually needs to find a different home where it would
run less often. For testing, though, he likes things the way they are,
since it causes the scheduler to converge more quickly on its chosen
Moving processes around will only help so much, though, if their memory is
spread across multiple NUMA nodes. Getting the best performance out of the
system clearly requires a mechanism to gather pages of memory onto the same
node as well. In the AutoNUMA patch, the first non-local fault (in
response to the special page marking described above) will cause that
page's "last node ID" value to be set to the accessing node; the page will
also be queued to be migrated to that node. A subsequent fault from a
different node will cancel that migration, though; after the first fault,
two faults in a row from the same node are required to cause the page to be
queued for migration.
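The migration rule described above amounts to a small per-page state machine. The sketch below is one reading of that description, written as a user-space simulation; the class and field names are hypothetical, and the handling of a first fault that happens to be local (record the node, queue nothing) is an assumption.

```python
class PageState:
    """Toy model of the per-page 'last node ID' migration filter."""

    def __init__(self, home_node):
        self.home_node = home_node   # node the page currently lives on
        self.last_nid = None         # last node to fault on this page
        self.queued_for = None       # target node if migration is queued

    def fault(self, accessing_node):
        """Record a fault and return the node the page is queued for, if any."""
        if self.last_nid is None:
            # First fault: remember the node and queue right away if remote.
            self.last_nid = accessing_node
            if accessing_node != self.home_node:
                self.queued_for = accessing_node
        elif accessing_node != self.last_nid:
            # A fault from a different node cancels any pending migration.
            self.last_nid = accessing_node
            self.queued_for = None
        elif accessing_node != self.home_node:
            # Two faults in a row from the same remote node: queue it.
            self.queued_for = accessing_node
        return self.queued_for
```

For a page living on node 0, faults arriving from nodes 1, 2, 2 in that order would queue the page for node 1, cancel that, and then queue it for node 2. The two-in-a-row requirement keeps pages shared between nodes from bouncing back and forth.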
Every NUMA node gets a new kernel thread (knuma_migrated) that is
charged with passing over the lists of pages queued for migration and
actually moving them to the target node. Migration is not unconditional -
it depends, for example, on there being sufficient memory available on the
destination node. But, most of the time, these migration threads should
manage to pull pages toward the nodes where they are actually used.
Beyond the above-mentioned complaint about putting heavy computation into
schedule(), Peter has found a number of things to dislike about
this patch set. He doesn't like the worker
threads, to begin with:
The problem I have with farming work out to other entities is that
its thereafter terribly hard to account it back to whoemever caused
the actual work. Suppose your kworker thread consumes a lot of cpu
time -- this time is then obviously not available to your
application -- but how do you find out what/who is causing this and
Andrea responds that the cost of these threads is small to the point that
it cannot really be measured. It is a little harder to shrug off Peter's other
complaint, though: that this patch set consumes a large amount of memory.
The kernel maintains one struct page for every page of memory
in the system. Since a typical system can have millions of pages, this
structure must be kept as small as possible. But the AutoNUMA patch adds a
list_head structure (for the migration queue) and two counters to
each page structure. The end result can be a lot of memory lost
to the AutoNUMA machinery.
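A back-of-the-envelope calculation shows why the per-page cost matters. The sketch below assumes 4KB pages and a 64-bit system, where a list_head (two pointers) takes 16 bytes; the 4-byte counter size is purely an assumption made for illustration, since the actual field sizes are not given here.

```python
def autonuma_overhead_bytes(ram_bytes, page_size=4096,
                            list_head_bytes=16, counter_bytes=4,
                            nr_counters=2):
    """Estimate the memory added to the page structures for a given RAM size."""
    nr_pages = ram_bytes // page_size
    per_page = list_head_bytes + nr_counters * counter_bytes
    return nr_pages * per_page


# Example: a 64GiB machine has 16,777,216 pages of 4KB each.
overhead = autonuma_overhead_bytes(64 * 2**30)
print(overhead // 2**20, "MiB")   # 384 MiB under the assumptions above
```

Under these assumptions the added 24 bytes per page come to about 0.6% of total memory, regardless of how much of that memory ever takes part in NUMA migration.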
The plan is to eventually move this information out of struct
page; then, among other things, the kernel can avoid allocating it at
all if AutoNUMA is not actually in use. But, for the NUMA case, that
memory will still be consumed regardless of its location, and some users
are unlikely to be happy even if others, as Andrea asserts, will be happy to give up a big chunk
of memory if they get a 20% performance improvement in return. This looks
like an argument that will not be settled in the near future, and, chances
are, the memory impact of AutoNUMA will need to be reduced somehow.
Perhaps, your editor naively suggests, knuma_migrated and its
per-page list_head structure could be replaced by the "lazy
migration" scheme used in Peter's patch.
NUMA scheduling is hard and doing it right requires significant expertise
in both scheduling and memory management. So it seems like a good thing
that the problem is receiving attention from some of the community's top
scheduler and memory management developers. It may be that one or both of
their solutions will be shown to be unworkable for some workloads to the
point that it simply needs to be dropped. What may be more likely, though,
is that these developers will eventually stop poking holes in each other's
patches and, together, figure out how to combine the best aspects of each
into a working solution that all can live with. What seems certain is that
getting to that point will probably take some time.
Patches and updates
Core kernel code
- Artem Bityutskiy: Aiaiai (March 28, 2012)
Filesystems and block I/O
Virtualization and containers
Page editor: Jonathan Corbet