Kernel development
Brief items
Kernel release status
The 3.4 merge window remains open, so there is no current development kernel. See the article below for a summary of changes pulled in for the 3.4 release.
Stable updates: the 3.0.26 and 3.2.13 updates were released on March 23; each contains a small set of important fixes.
The 2.6.34.11 update came out on March 22; it has a much longer list of fixes.
Quotes of the week
It's getting harder and harder to have rational error handling at the OS level as application environments move to higher levels and greater abstraction.
The Nouveau driver graduates from staging
Linus has merged a patch which moves the Nouveau graphics driver out of its symbolic location in staging and into the mainline proper; among other things, this move is an indication that no further ABI breaks (which have not happened for a while anyway) are expected. Also merged is initial mode-setting support for the just-released "Kepler" chipset from NVIDIA.
("Symbolic" because the Nouveau code has never been in the staging tree; only the configuration option was placed there.)
Kernel development news
3.4 Merge window part 2
In the 3.3 release announcement, Linus warned developers that he would be taking a bit of time off during the merge window; that did indeed happen over the last week. Still, he managed to pull some 4,000 changesets since last week's summary. Some of the more significant changes merged in the last week include:
- The PowerPC has gained a new firmware-assisted dump facility for the
quick capture and analysis of crash dumps.
- The GFS2 filesystem now supports the FITRIM ioctl()
command which can be used to send discard requests to the underlying
storage device.
- The prctl() system call has a new option called
PR_SET_CHILD_SUBREAPER. Marking a process this way will
cause any orphan descendant processes to be reparented to the marked
process rather than to the init process. There is a
corresponding PR_GET_CHILD_SUBREAPER option as well.
- The Microblaze architecture now has high memory support.
- The ext4 "noacl" and "noattr" mount options have been marked
deprecated with an eye toward removal in the near future. Without
these options, it will not be possible to disable ACL and extended
attribute support. No other filesystem allows that support to be
disabled. The "journal=update" and "resize" mount options have been
removed entirely. On the other hand, plans to remove the "bsd_df",
"minix_df", "grpid" and "nogrpid" options have been dropped in
response to complaints from users.
- New hardware support includes:
- Processors and systems:
GE Intelligent Platforms IMP3A boards,
Atmel AT91SAM9x5 processors,
Bluegiga APX4 development kits, and
OMAP4 "remote" processors (see below).
- Audio:
Wolfson WM2200 CODECs, and
Maxim MAX9768 amplifiers.
- Graphics:
Intel Medfield-based GMA500 adapters,
NVIDIA Kepler chipsets (mode-setting only),
ATI RadeonHD 7xxx and "Trinity" chipsets,
USB-attached Displaylink video adapters,
Samsung S5PC210 and EXYNOS MIPI-DSI controllers, and
Intel 750 graphics cards.
- Input:
Cypress TrueTouch Standard Product touchscreen controllers,
Synaptics USB touchpads,
TI touchscreen controllers,
MAXIM MAX8997 haptic controllers, and
Ilitek ILI210X based touchscreens.
- Miscellaneous:
Freescale IFC external NAND controllers,
NVIDIA Tegra pinmuxes,
CSR SiRFprimaII-based I2C interfaces,
TI LP8550/LP8551/LP8552/LP8553/LP8556 backlight devices,
Pandora console backlight devices, and
Dialog DA9052/DA9053 RTCs.
- Video4Linux:
Afatech AF9005 based DVB-T/DVB-C receivers, and
Keene FM transmitters.
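The PR_SET_CHILD_SUBREAPER option from the list above can be tried from user space with very little code. What follows is a minimal sketch rather than anything taken from the patch itself; the helper names become_child_subreaper() and is_child_subreaper() are made up for illustration, and the code assumes a 3.4 or later kernel with C library headers that define the two prctl() options.

```c
#include <sys/prctl.h>

/*
 * Mark the calling process as a "child subreaper": orphaned descendants
 * will be reparented to it instead of to init.
 * Returns 0 on success, -1 on failure (for example, on a pre-3.4 kernel).
 */
int become_child_subreaper(void)
{
    return prctl(PR_SET_CHILD_SUBREAPER, 1UL, 0UL, 0UL, 0UL);
}

/*
 * Query the flag with the corresponding "get" option.
 * Returns 1 if the caller is a subreaper, 0 if not, -1 on failure.
 */
int is_child_subreaper(void)
{
    int flag = 0;

    /* The kernel writes the current setting into the int we point to. */
    if (prctl(PR_GET_CHILD_SUBREAPER, (unsigned long)&flag, 0UL, 0UL, 0UL) < 0)
        return -1;
    return flag;
}
```

A service manager would presumably call something like become_child_subreaper() once at startup and then collect its adopted orphans with the usual wait() family of calls.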
Changes visible to kernel developers include:
- A new subsystem called "remoteproc" has been merged; it allows for the
control of remote processors (those on the same SoC but running
something other than Linux) through shared memory. The new "rpmsg"
subsystem is a virtio-based mechanism for communicating with those
processors. There will probably be a separate article on these
facilities soon; in the meantime, see Documentation/remoteproc.txt and rpmsg.txt for more information.
- The new for_each_clear_bit() macro iterates through each
un-set bit in a word.
- The poll_requested_events()
function has been added as a way for drivers to learn exactly what
events user space is polling for. Also added is:
bool poll_does_not_wait(const poll_table *p);
which returns true if and only if it is known that the poll() call will not block.
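For readers who have not used the kernel's bit-iteration macros, here is a user-space analogue of the for_each_clear_bit() idea from the list above. It is only a sketch: the kernel's version works with the kernel's own bitmap helpers, while this one operates on a single unsigned long, and the names next_clear_bit(), for_each_clear_bit_sketch(), and count_clear_bits() are invented for the example.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define WORD_BITS (sizeof(unsigned long) * CHAR_BIT)

/*
 * Return the index of the first clear (zero) bit in `word` at or after
 * position `start`, looking only at the low `nbits` bits.
 * Returns `nbits` if no clear bit is found.
 */
size_t next_clear_bit(unsigned long word, size_t nbits, size_t start)
{
    assert(nbits <= WORD_BITS);
    for (size_t i = start; i < nbits; i++)
        if (!(word & (1UL << i)))
            return i;
    return nbits;
}

/* Visit each clear bit among the low `nbits` bits of `word`. */
#define for_each_clear_bit_sketch(bit, word, nbits)           \
    for ((bit) = next_clear_bit((word), (nbits), 0);          \
         (bit) < (nbits);                                     \
         (bit) = next_clear_bit((word), (nbits), (bit) + 1))

/* Example use of the macro: count the clear bits. */
size_t count_clear_bits(unsigned long word, size_t nbits)
{
    size_t bit, count = 0;

    for_each_clear_bit_sketch(bit, word, nbits)
        count++;
    return count;
}
```

A typical use would be finding free slots in an allocation map, where a clear bit means "unused".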
Also worthy of note is that there has been a vast amount of work done in the ARM architecture tree; the process of consolidating and cleaning up the ARM code continues at a high rate.
The 3.4 merge window would normally be expected to end around April 2. When he announced his vacation, Linus said that he would extend the merge window for a bit if necessary - though he warned that he would still only consider pull requests received during the window. Whether that will happen remains to be seen; either way, next week's Kernel Page will summarize the last new features merged for 3.4.
IMA appraisal extension
The "integrity" of a Linux system is based on whether it is running the code that the administrator expects. If not, a compromise of the system may have occurred. The Linux integrity subsystem is meant to detect those unexpected changes to files in order to protect systems against compromise. That is done by creating integrity "measurements" (hashes of contents and metadata) of files of interest.
Much of what is needed to do integrity management has already landed in the mainline, but there are a few remaining pieces. The integrity measurement architecture (IMA) appraisal extension patch set from Mimi Zohar and Dmitry Kasatkin fills in one missing piece: storing and validating the integrity measurement of files. A hash of a file's contents and metadata will be stored in the security.ima extended attribute (xattr) of the file, and the patch set will create and maintain those xattrs. In addition, it can enforce that the file contents are "correct" when the file is opened for reading or executing based on the integrity values that were stored.
The integrity subsystem has taken a rather twisted path into the kernel. It was proposed as far back as 2005, but the subsystem has been broken up into smaller pieces several times along the way. Much of IMA was added to the kernel in 2.6.30, but another piece, the extended verification module (EVM) was not merged until 3.2. Digital signature support was added to EVM in 3.3, and IMA appraisal is currently under review.
As described on the Linux IMA web page, the integrity subsystem is meant to thwart various kinds of attacks against the contents of files, both on- and off-line. Unexpected changes to files, particularly executables, may be a sign that the system has been compromised. In addition, the subsystem allows the use of the "Trusted Platform Module" (TPM) to collect integrity measurements and sign them in such a way that the system can "attest" to its integrity. That attestation could be sent to another system to "prove" that the system is intact—only approved code is running.
Current kernels can generate an integrity measurement of files that are executed, collect and digitally sign them with keys from the TPM (or the kernel keyring), and use that information for remote attestation. EVM adds the ability to thwart offline attacks against the file contents or metadata by hashing the values of the security xattrs of the file (e.g. security.selinux, security.ima), signing that hash, and storing it as security.evm.
But there is currently nothing in place that would stop a running system from executing or reading a file that has been changed; the appraisal extension fills that gap. If a file with an IMA hash is opened for reading or executing, the appraisal extension will check to see if the contents match the stored hash. If they don't match, the ima_appraise kernel command-line parameter determines what happens. If it is set to "enforce", access to the file is denied, while "fix" will update the IMA xattr with the new value. In addition, "off" can be used to turn off any file appraisal.
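The three ima_appraise settings amount to a small decision table, which can be sketched as follows. This is a user-space illustration, not code from the patch set: the types, the appraise() function, and the 20-byte hash length are all hypothetical, and the sketch assumes that "fix" mode lets the access proceed after updating the stored value, which the description above does not say explicitly.

```c
#include <assert.h>
#include <string.h>

#define HASH_LEN 20   /* arbitrary length chosen for the sketch */

enum appraise_mode   { APPRAISE_OFF, APPRAISE_ENFORCE, APPRAISE_FIX };
enum appraise_result { ACCESS_ALLOWED, ACCESS_DENIED };

/* Stands in for a file and its security.ima extended attribute. */
struct file_state {
    unsigned char stored_hash[HASH_LEN];
};

/*
 * Decide what happens when a file is opened, given the hash of its
 * current contents and the appraisal mode.
 */
enum appraise_result appraise(struct file_state *f,
                              const unsigned char *current_hash,
                              enum appraise_mode mode)
{
    assert(f != NULL && current_hash != NULL);

    if (mode == APPRAISE_OFF)
        return ACCESS_ALLOWED;              /* no appraisal at all */

    if (memcmp(f->stored_hash, current_hash, HASH_LEN) == 0)
        return ACCESS_ALLOWED;              /* contents match stored hash */

    if (mode == APPRAISE_FIX) {
        /* Update the stored value to match the new contents. */
        memcpy(f->stored_hash, current_hash, HASH_LEN);
        return ACCESS_ALLOWED;              /* assumption: access proceeds */
    }

    return ACCESS_DENIED;                   /* "enforce": mismatch is fatal */
}
```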
In order to recognize that a file has changed while it is open, the appraisal extension requires the filesystem to support i_version, which is a counter that gets incremented any time the file's inode gets updated. Filesystems must be mounted with the i_version option in order for the appraisal extension to work. That allows the extension to notice the change when the file is closed and either update the xattr or flag the file change as a policy violation.
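The role of the version counter can be shown with a toy model: remember the counter's value at open time and compare it at close time. The structures and function names below are invented for the illustration and do not correspond to the kernel's actual data structures.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy inode: only the version counter matters here. */
struct inode_sketch {
    uint64_t i_version;
};

/* Toy open file: remembers the counter value seen at open time. */
struct open_file_sketch {
    struct inode_sketch *inode;
    uint64_t version_at_open;
};

void sketch_open(struct open_file_sketch *f, struct inode_sketch *inode)
{
    f->inode = inode;
    f->version_at_open = inode->i_version;
}

/* Any update to the inode bumps the counter. */
void sketch_modify(struct inode_sketch *inode)
{
    inode->i_version++;
}

/* At close time: did the file change while we had it open? */
bool changed_while_open(const struct open_file_sketch *f)
{
    return f->inode->i_version != f->version_at_open;
}
```

If changed_while_open() reports a change, the real extension would then either refresh the xattr or record a policy violation, as described above.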
In order to get the initial security.ima xattrs on files that are to be appraised (by default, all files owned by root), one boots the kernel with ima_appraise_tcb (which enables appraisal) and ima_appraise=fix, and then opens all files of interest (e.g. via a find command as suggested on the IMA web page).
The IMA appraisal extension will complete the off-line attack detection that EVM provides. Because the extension will create and maintain the security.ima xattr, EVM will be able to detect changes to the file contents.
In response to an earlier version of the patch set, James Morris asked if there were any distributions that were planning to use IMA and EVM once all the pieces are in place. George Wilson said that IBM plans to use it internally once distributions have incorporated it. In addition, Ryan Ware and Kasatkin said that the Tizen mobile distribution plans to use it for some product profiles.
But, before any of that can happen, the appraisal extension needs to find a way to change its locking behavior to get past a NAK by Al Viro. In the current patches, the final __fput() is deferred if a file is closed before munmap() is called in kernels using IMA appraisal. Viro is concerned that this changes the locking conditions based on whether the kernel is using IMA or not, which may make locking problems harder to spot. He also said that the overhead is too high for a commonly used path, and that not all of the places where __fput() is used were covered by the patch. So far, no solution to the problem has been found. Viro did suggest possibly using a different mutex for changing xattrs, but said that it would take a fair amount of code review to determine whether that could be done.
Given that the patch set completes a job started by EVM, and will, for the most part, complete the integrity subsystem, it seems likely that a solution will be found. There are a few lingering pieces of IMA appraisal that are still coming, according to the "An Overview of the Linux Integrity Subsystem [PDF]" white paper. Two specific pieces are mentioned, one to add digital signature capabilities for vendor-signed files, and another that will protect directory contents (e.g. filenames). While the currently proposed patches may still need some work before they can be considered for the mainline, those working on the integrity subsystem are probably finally starting to see the light at the end of a long tunnel.
AutoNUMA: the other approach to NUMA scheduling
Last week's Kernel Page included an article on Peter Zijlstra's NUMA scheduling patch set. As it happens, Peter is not the only developer working in this area; Andrea Arcangeli has posted a NUMA scheduling patch set of his own called AutoNUMA. Andrea's goal is the same - keep processes and their memory together on the same NUMA node - but the approach taken to get there is quite different. These two patch sets embody a disagreement on how the problem should be solved that could take some time to work out.
Peter's patch set works by assigning a "home node" to each process, then trying to concentrate the process and its memory on that node. Andrea's patch lacks the concept of home nodes; he thinks it is an idea that will not work well for programs that don't fit into a single node unless developers add code to use Peter's new system calls. Instead, Andrea would like NUMA scheduling to "just work" in the same way that transparent huge pages do. So his patch set seems to assume that resources will be spread out across the system; it then focuses on cleaning things up afterward. The key to the cleanup task is a bunch of statistics and a couple of new kernel threads.
The first of these threads is called knuma_scand. Its primary job is to scan through each process's address space, marking its in-RAM anonymous pages with a special set of bits that makes the pages look, to the hardware, like they are not present. If the process tries to access such a page, a page fault will result; the kernel will respond by marking the page "present" again so that the process can go about its business. But the kernel also tracks the node that the page lives on and the node the accessing process was running on, noting any mismatches. For each process, the kernel maintains an array of counters to track which node each of its recently-accessed pages was located on. For pages, the information tracked is necessarily more coarse; the kernel simply remembers the last node to access each page.
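The bookkeeping described above can be modeled in a few lines. This is a simplified sketch of the idea, not AutoNUMA's code: the structure names, the numa_hinting_fault() function, and the four-node limit are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES 4   /* arbitrary limit for the sketch */

/* Per-process: how many recently-accessed pages lived on each node. */
struct task_numa_stats {
    unsigned long faults_on_node[MAX_NODES];
};

/* Per-page: coarser information, just the last node to touch it. */
struct page_numa_info {
    int home_node;   /* node whose memory holds the page */
    int last_nid;    /* last node to access the page, -1 if none yet */
};

/*
 * Called when a process running on `cpu_node` faults on a page that the
 * scanner had marked not-present. Updates both sets of statistics and
 * returns true if the access was remote (page and CPU on different nodes).
 */
bool numa_hinting_fault(struct task_numa_stats *t,
                        struct page_numa_info *p, int cpu_node)
{
    assert(cpu_node >= 0 && cpu_node < MAX_NODES);
    assert(p->home_node >= 0 && p->home_node < MAX_NODES);

    t->faults_on_node[p->home_node]++;   /* where the page was found */
    p->last_nid = cpu_node;              /* who touched it last */
    return p->home_node != cpu_node;     /* note any mismatch */
}
```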
When the time comes for the scheduler to make a decision, it passes over the per-process statistics to determine whether the target process would be better off if it were moved to another node. If the process seems to be accessing most of its pages remotely, and it is better suited to the remote node than the processes already running there, it will be migrated over. This code drew a strenuous objection from Peter, who does not like the idea of putting a big for-each-CPU loop into the middle of the scheduler's hot path. After some light resistance, Andrea agreed that this logic eventually needs to find a different home where it would run less often. For testing, though, he likes things the way they are, since it causes the scheduler to converge more quickly on its chosen solution.
Moving processes around will only help so much, though, if their memory is spread across multiple NUMA nodes. Getting the best performance out of the system clearly requires a mechanism to gather pages of memory onto the same node as well. In the AutoNUMA patch, the first non-local fault (in response to the special page marking described above) will cause that page's "last node ID" value to be set to the accessing node; the page will also be queued to be migrated to that node. A subsequent fault from a different node will cancel that migration, though; after the first fault, two faults in a row from the same node are required to cause the page to be queued for migration.
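The queuing rule just described is essentially a small state machine. The sketch below encodes one reading of that description; the structure and function names are invented, and the real patch may well differ in its details.

```c
#include <assert.h>

/* Per-page migration state for the sketch. */
struct page_state {
    int last_nid;     /* last node to fault on this page, -1 if none */
    int migrate_to;   /* node the page is queued for, -1 if not queued */
};

/* Handle a non-local fault on the page from node `nid`. */
void nonlocal_fault(struct page_state *p, int nid)
{
    assert(nid >= 0);

    if (p->last_nid < 0) {
        /* First fault: remember the node and queue the migration. */
        p->last_nid = nid;
        p->migrate_to = nid;
    } else if (p->last_nid == nid) {
        /* Two faults in a row from the same node: queue (or keep) it. */
        p->migrate_to = nid;
    } else {
        /* Fault from a different node: cancel any pending migration. */
        p->last_nid = nid;
        p->migrate_to = -1;
    }
}
```

The effect is that a page bouncing between two nodes is never queued after its first fault, while a page that is consistently used from one node settles there.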
Every NUMA node gets a new kernel thread (knuma_migrated) that is charged with passing over the lists of pages queued for migration and actually moving them to the target node. Migration is not unconditional - it depends, for example, on there being sufficient memory available on the destination node. But, most of the time, these migration threads should manage to pull pages toward the nodes where they are actually used.
Beyond the above-mentioned complaint about putting heavy computation into schedule(), Peter has found a number of things to dislike about this patch set. He doesn't like the worker threads, to begin with.
Andrea responds that the cost of these threads is small to the point that it cannot really be measured. It is a little harder to shrug off Peter's other complaint, though: that this patch set consumes a large amount of memory. The kernel maintains one struct page for every page of memory in the system. Since a typical system can have millions of pages, this structure must be kept as small as possible. But the AutoNUMA patch adds a list_head structure (for the migration queue) and two counters to each page structure. The end result can be a lot of memory lost to the AutoNUMA machinery.
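To get a feel for the numbers, the per-page cost can be multiplied out. The arithmetic below rests on assumptions that are not in the patch description: a 64-bit system (so a list_head is two 8-byte pointers), 4KB pages, and counters of 8 bytes each; the counter size in particular is a guess, so the result is an order-of-magnitude estimate only.

```c
#include <assert.h>
#include <stdint.h>

#define LIST_HEAD_BYTES   16u  /* two 8-byte pointers on a 64-bit system */
#define COUNTER_BYTES      8u  /* ASSUMPTION: actual counter size not given */
#define PER_PAGE_OVERHEAD (LIST_HEAD_BYTES + 2u * COUNTER_BYTES)

/*
 * Total bytes added across all page structures for a system with
 * `ram_bytes` of memory and pages of `page_size` bytes.
 */
uint64_t per_page_overhead_total(uint64_t ram_bytes, uint64_t page_size)
{
    assert(page_size > 0);
    return (ram_bytes / page_size) * PER_PAGE_OVERHEAD;
}
```

Under those assumptions, a 64GB system has about 16.8 million pages, and 32 extra bytes on each comes to 512MB.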
The plan is to eventually move this information out of struct page; then, among other things, the kernel can avoid allocating it at all if AutoNUMA is not actually in use. But, for the NUMA case, that memory will still be consumed regardless of its location, and some users are unlikely to be happy even if others, as Andrea asserts, will be happy to give up a big chunk of memory if they get a 20% performance improvement in return. This looks like an argument that will not be settled in the near future, and, chances are, the memory impact of AutoNUMA will need to be reduced somehow. Perhaps, your editor naively suggests, knuma_migrated and its per-page list_head structure could be replaced by the "lazy migration" scheme used in Peter's patch.
NUMA scheduling is hard and doing it right requires significant expertise in both scheduling and memory management. So it seems like a good thing that the problem is receiving attention from some of the community's top scheduler and memory management developers. It may be that one or both of their solutions will be shown to be unworkable for some workloads to the point that it simply needs to be dropped. What may be more likely, though, is that these developers will eventually stop poking holes in each other's patches and, together, figure out how to combine the best aspects of each into a working solution that all can live with. What seems certain is that getting to that point will probably take some time.
Page editor: Jonathan Corbet