Kernel development
Brief items
Kernel release status
The current development kernel is 3.7-rc6, released on November 16; things have been slow since then as Linus has gone on vacation. "I'll have a laptop with me as I'm away, but if things calm down even further, I'll be happy. I'll do an -rc7, but considering how calm things have been, I suspect that's the last -rc. Unless something dramatic happens."
Stable updates: 3.0.52, 3.2.34, 3.4.19, and 3.6.7 were all released on November 17 with the usual set of important fixes.
Quotes of the week
"Consolidate a bit! The context switch code!"I guess because it just sounds Greek to me.
"Consolidate a bit! The context switch code!"
"Consolidate a bit! The context switch code!"
"Consolidate a bit! The context switch code!"
Barcelona Media Summit Report
Hans Verkuil has posted a report from the meeting of kernel-space media developers recently held in Barcelona. Covered topics include a new submaintainer organization, requirements for new V4L2 drivers, asynchronous loading, and more. "Basically the number of patch submissions increased from 200 a month two years ago to 700 a month this year. Mauro is unable to keep up with that flood and a solution needed to be found."
Kernel development news
LCE: Checkpoint/restore in user space: are we there yet?
Checkpoint/restore refers to the ability to snapshot the state of an application (which may consist of multiple processes) and then later restore the application to a running state, possibly on a different (virtual) system. Pavel Emelyanov's talk at LinuxCon Europe 2012 provided an overview of the current status of the checkpoint/restore in user space (CRIU) system that has been in development for a couple of years now.
Uses of checkpoint/restore
There are various uses for checkpoint/restore functionality. For example, Pavel's employer, Parallels, uses it for live migration, which allows a running application to be moved between host machines without loss of service. Parallels also uses it for so-called rebootless kernel updates, whereby applications on a machine are checkpointed to persistent storage while the kernel is updated and rebooted, after which the applications are restored; the applications then continue to run, unaware that the kernel has changed and the system has been restarted.
Another potential use of checkpoint/restore is to speed start-up of applications that have a long initialization time. An application can be started and checkpointed to persistent storage after the initialization is completed. Later, the application can be quickly (re-)started from the checkpointed snapshot. (This is analogous to the dump-emacs feature that is used to speed up start times for emacs by creating a preinitialized binary.)
Checkpoint/restore also has uses in high-performance computing. One such use is for load balancing, which is essentially another application of live migration. Another use is incremental snapshotting, whereby an application's state is periodically checkpointed to persistent storage, so that, in the event of an unplanned system outage, the application can be restarted from a recent checkpoint rather than losing days of calculation.
"You might ask, is it possible to already do all of these things
on Linux right now? The answer is that it's almost possible.
" Pavel
spent the remainder of the talk describing how the CRIU implementation
works, how close the implementation is to completion, and what work remains
to be done. He began with some history of the checkpoint/restore project.
History of checkpoint/restore
The origins of the CRIU implementation go back to work that started in 2005 as part of the OpenVZ project. The project provided a set of out-of-mainline patches to the Linux kernel that supported a kernel-space implementation of checkpoint/restore.
In 2008, when the first efforts were made to upstream the checkpoint/restore functionality, the OpenVZ project communicated with a number of other parties who were interested in that functionality. At the time, it seemed natural to employ an in-kernel implementation of checkpoint/restore. A few years' work resulted in a set of more than 100 patches that implemented almost all of the same functionality as OpenVZ's kernel-based checkpoint/restore mechanism.
However, concerns from the upstream kernel developers eventually led to the rejection of the kernel-based approach. One concern related to the sheer scale of the patches and the complexity they would add to the kernel: the patches amounted to tens of thousands of lines and touched a very wide range of subsystems in the kernel. There were also concerns about the difficulties of implementing backward compatibility for checkpoint/restore, so that an application could be checkpointed on one kernel version and then successfully restored on a later kernel version.
Over the course of about a year, the OpenVZ project then turned its efforts to developing an implementation of checkpoint/restore that was done mainly in user space, with help from the kernel where it was needed. In January 2012, that effort was repaid when Linus Torvalds merged a first set of CRIU-related patches into the mainline kernel, albeit with an amusingly skeptical covering note from Andrew Morton:
A note on this: this is a project by various mad Russians to perform checkpoint/restore mainly from userspace, with various oddball helper code added into the kernel where the need is demonstrated.
So rather than some large central lump of code, what we have is little bits and pieces popping up in various places which either expose something new or which permit something which is normally kernel-private to be modified.
Since then, two versions of the corresponding user-space tools have been released: CRIU v0.1 in July, and CRIU v0.2, which added support for Linux Containers (LXC), in September.
Goal and concept
The ultimate goal of the CRIU project is to allow the entire state of an application to be dumped (checkpointed) and then later restored. This is a complex task, for several reasons. First of all, there are many pieces of process state that must be saved, for example, information about virtual memory mappings, open files, credentials, timers, process ID, parent process ID, and so on. Furthermore, an application may consist of multiple processes that share some resources. The CRIU facility must allow all of these processes to be checkpointed and restored to the same state.
For each piece of state that the kernel records about a process, CRIU needs two pieces of support from the kernel. The first piece is a mechanism to interrogate the kernel about the value of the state, in preparation for dumping the state during a checkpoint. The second piece is a mechanism to pass that state back to the kernel when the process is restored. Pavel illustrated this point using the example of open files. A process may open an arbitrary set of files. Each open() call results in the creation of a file descriptor that is a handle to some internal kernel state describing the open file. In order to dump that state, CRIU needs a mechanism to ask the kernel which files are opened by that process. To restore the application, CRIU then re-opens those files using the same descriptor numbers.
The CRIU system makes use of various kernel APIs for retrieving and restoring process state, including files in the /proc file system, netlink sockets, and system calls. Files in /proc can be used to retrieve a wide range of information about processes and their interrelationships. Netlink sockets are used both to retrieve and to restore various pieces of state information.
System calls provide a mechanism to both retrieve and restore various pieces of state. System calls can be subdivided into two categories. First, there are system calls that operate only on the process that calls them. For example, getitimer() can be used to retrieve only the caller's interval timer value. System calls in this category can't easily be used to retrieve or restore the state of arbitrary processes. However, later in his talk, Pavel described a technique that the CRIU project came up with to employ these calls. The other category of system calls can operate on arbitrary processes. The system calls that set process scheduling attributes are an example: sched_getscheduler() and sched_getparam() can be used to retrieve the scheduling attributes of an arbitrary process and sched_setscheduler() can be used to set the attributes of an arbitrary process.
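To make the second category concrete, here is a small sketch (not CRIU code, just an illustration of the calls named above) that reads another process's scheduling attributes and then writes the same values back, much as a checkpoint and a later restore would:

#include <stdio.h>
#include <stdlib.h>
#include <sched.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
    struct sched_param sp;
    pid_t pid;
    int policy;

    if (argc < 2)
        return 1;
    pid = atoi(argv[1]);

    /* "Checkpoint": query the target's scheduling policy and priority. */
    policy = sched_getscheduler(pid);
    if (policy == -1 || sched_getparam(pid, &sp) == -1) {
        perror("sched_get*");
        return 1;
    }
    printf("pid %d: policy %d, priority %d\n", pid, policy, sp.sched_priority);

    /* "Restore": push the same values back into the kernel. */
    if (sched_setscheduler(pid, policy, &sp) == -1)
        perror("sched_setscheduler");
    return 0;
}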
CRIU requires kernel support for retrieving each piece of process state. In some cases, the necessary support already existed. However, in other cases, there is no kernel API that can be used to interrogate the kernel about the state; for each such case, the CRIU project must add a suitable kernel API. Pavel used the example of memory-mapped files to illustrate this point. The /proc/PID/maps file provides the pathnames of the files that a process has mapped. However, the file pathname is not a reliable identifier for the mapped file. For example, after the mapping was created, filesystem mount points may have been rearranged or the pathname may have been unlinked. Therefore, in order to obtain complete and accurate information about mappings, the CRIU developers added a new kernel API: /proc/PID/map_files.
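As a (non-CRIU) illustration of how the new interface can be consumed: each entry in /proc/PID/map_files is named after the address range of a mapping and is a symbolic link that resolves to the mapped file, no matter what has since happened to the file's pathname:

#include <stdio.h>
#include <dirent.h>
#include <unistd.h>
#include <limits.h>

int main(int argc, char **argv)
{
    char dirpath[PATH_MAX], linkpath[PATH_MAX + 300], target[PATH_MAX];
    const char *pid = argc > 1 ? argv[1] : "self";
    struct dirent *de;
    DIR *dir;

    snprintf(dirpath, sizeof(dirpath), "/proc/%s/map_files", pid);
    dir = opendir(dirpath);
    if (dir == NULL) {
        perror(dirpath);   /* examining other processes needs CAP_SYS_ADMIN */
        return 1;
    }

    /* Each entry is named "start-end"; readlink() yields the mapped file. */
    while ((de = readdir(dir)) != NULL) {
        ssize_t n;

        if (de->d_name[0] == '.')
            continue;
        snprintf(linkpath, sizeof(linkpath), "%s/%s", dirpath, de->d_name);
        n = readlink(linkpath, target, sizeof(target) - 1);
        if (n < 0)
            continue;
        target[n] = '\0';
        printf("%-36s -> %s\n", de->d_name, target);
    }
    closedir(dir);
    return 0;
}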
The situation when restoring process state is often a lot simpler: in many cases the same API that was used to create the state in the first place can be used to re-create the state during a restore. However, in some cases, restoring process state is not so simple. For example, getpid() can be used to retrieve a process's PID, but there is no corresponding API to set a process's PID during a restore (the fork() system call does not allow the caller to specify the PID of the child process). To address this problem, the CRIU developers added an API that could be used to control the PID that was chosen by the next fork() call. (In response to a question at the end of the talk, Pavel noted that in cases where the new kernel features added to support CRIU have security implications, access to those features has been restricted by a requirement that the user must have the CAP_SYS_ADMIN capability.)
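The knob that was merged to control PID selection is the /proc/sys/kernel/ns_last_pid file, available when the kernel is built with CONFIG_CHECKPOINT_RESTORE. A minimal sketch (not CRIU code) of the idea, which is that the next PID handed out in the namespace is the written value plus one:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(int argc, char **argv)
{
    pid_t want, child;
    FILE *f;

    if (argc < 2)
        return 1;
    want = atoi(argv[1]);        /* the PID the child should receive */

    f = fopen("/proc/sys/kernel/ns_last_pid", "w");
    if (f == NULL) {
        perror("ns_last_pid");   /* needs CONFIG_CHECKPOINT_RESTORE and privilege */
        return 1;
    }
    /* Racy if anything else forks in the namespace meanwhile; CRIU
     * serializes this step during a restore. */
    fprintf(f, "%d", want - 1);
    fclose(f);

    child = fork();
    if (child == 0) {
        printf("child got pid %d\n", getpid());
        _exit(0);
    }
    waitpid(child, NULL, 0);
    return 0;
}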
Kernel impact and new kernel features
The CRIU project has largely achieved its goal, Pavel said. Instead of having a large mass of code inside the kernel that does checkpoint/restore, there are instead many small extensions to the kernel that allow checkpoint/restore to be done in user space. By now, just over 100 CRIU-related patches have been merged upstream or are sitting in "-next" trees. Those patches added nine new features to the kernel, of which only one was specific to checkpoint/restore; all of the others have turned out to also have uses outside checkpoint/restore. Approximately 15 further patches are currently being discussed on the mailing lists; in most cases, the principles have been agreed on by the stakeholders, but details are being resolved. These "in flight" patches provide two additional kernel features.
Pavel detailed a few of the more interesting new features added to the kernel for the CRIU project. One of these was parasite code injection, which was added by Tejun Heo, "not specifically within the CRIU project, but with the same intention". Using this feature, a process can be made to execute an arbitrary piece of code. The CRIU framework employs parasite code injection to use those system calls mentioned earlier that operate only on the caller's state; this obviated the need to add a range of new APIs to retrieve and restore various pieces of state of arbitrary processes. Examples of system calls used to obtain process state via injected parasite code are getitimer() (to retrieve interval timers) and sigaction() (to retrieve signal dispositions).
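CRIU's parasite engine is far more elaborate than can be shown here, but the underlying trick, borrowing a stopped process's registers and code space to run a system call on its behalf, can be sketched with plain ptrace(). The following is illustrative only, is specific to x86-64, and ignores the awkward cases (interrupted system calls, pending signals) that a real implementation must handle:

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>

/* Run one system call inside a stopped tracee by temporarily patching a
 * "syscall" instruction (0x0f 0x05) at its current instruction pointer. */
static long remote_syscall(pid_t pid, long nr, long a1, long a2, long a3)
{
    struct user_regs_struct saved, regs;
    long orig, patched, ret;

    ptrace(PTRACE_GETREGS, pid, NULL, &saved);
    regs = saved;

    errno = 0;
    orig = ptrace(PTRACE_PEEKTEXT, pid, (void *)saved.rip, NULL);
    if (orig == -1 && errno)
        return -1;
    patched = (orig & ~0xffffL) | 0x050f;
    ptrace(PTRACE_POKETEXT, pid, (void *)saved.rip, (void *)patched);

    regs.rax = nr;                     /* system call number and arguments */
    regs.rdi = a1;
    regs.rsi = a2;
    regs.rdx = a3;
    ptrace(PTRACE_SETREGS, pid, NULL, &regs);

    ptrace(PTRACE_SINGLESTEP, pid, NULL, NULL);  /* execute just the syscall */
    waitpid(pid, NULL, 0);
    ptrace(PTRACE_GETREGS, pid, NULL, &regs);
    ret = regs.rax;                    /* the system call's return value */

    /* Put the original code and registers back. */
    ptrace(PTRACE_POKETEXT, pid, (void *)saved.rip, (void *)orig);
    ptrace(PTRACE_SETREGS, pid, NULL, &saved);
    return ret;
}

int main(int argc, char **argv)
{
    pid_t pid;

    if (argc < 2)
        return 1;
    pid = atoi(argv[1]);

    if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) == -1) {
        perror("PTRACE_ATTACH");
        return 1;
    }
    waitpid(pid, NULL, 0);

    /* The tracee reports its own PID: the call ran in its context. */
    printf("remote getpid() -> %ld\n",
           remote_syscall(pid, SYS_getpid, 0, 0, 0));

    ptrace(PTRACE_DETACH, pid, NULL, NULL);
    return 0;
}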
The kcmp() system call was added as part of the CRIU project. It allows the comparison of various kernel objects used by two processes. Using this system call, CRIU can build a full picture of what resources two processes share inside the kernel. Returning to the example of open files gives some idea of how kcmp() is useful.
Information about an open file is available via /proc/PID/fd and the files in /proc/PID/fdinfo. Together, these files reveal the file descriptor number, pathname, file offset, and open file flags for each file that a process has opened. This is almost enough information to be able to re-open the file during a restore. However, one notable piece of information is missing: sharing of open files. Sometimes, two open file descriptors refer to the same file structure. That can happen, for example, after a call to fork(), since the child inherits copies of all of its parent's file descriptors. As a consequence of this type of sharing, the file descriptors share file offset and open file flags.
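A quick (non-CRIU) demonstration of where that information lives: the pathname of an open file can be recovered from the /proc/self/fd symbolic link, and the offset and flags from the matching fdinfo file:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    char path[64], target[256], line[128];
    int fd = open("/etc/passwd", O_RDONLY);
    ssize_t n;
    FILE *f;

    if (fd < 0)
        return 1;
    lseek(fd, 100, SEEK_SET);                    /* give it a non-zero offset */

    snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
    n = readlink(path, target, sizeof(target) - 1);
    if (n > 0) {
        target[n] = '\0';
        printf("fd %d -> %s\n", fd, target);     /* the pathname */
    }

    snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", fd);
    f = fopen(path, "r");
    while (f && fgets(line, sizeof(line), f))
        fputs(line, stdout);                     /* "pos:", "flags:", ... */
    if (f)
        fclose(f);
    return 0;
}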
This sort of sharing of open file descriptions can't be restored via simple calls to open(). Instead, CRIU makes use of the kcmp() system call to discover instances of file sharing when performing the checkpoint, and then uses a combination of open() and file descriptor passing via UNIX domain sockets to re-create the necessary sharing during the restore. (However, this is far from the full story for open files, since there are many other attributes associated with specific kinds of open files that CRIU must handle. For example, inotify file descriptors, sockets, pseudo-terminals, and pipes all require additional work within CRIU.)
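The kcmp() system call has no C library wrapper, so it is invoked via syscall(). Here is a minimal sketch, again not CRIU code, that checks whether two descriptors within one process refer to the same open file description:

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <linux/kcmp.h>

#ifndef SYS_kcmp
#define SYS_kcmp 312            /* x86-64 value; adjust for other architectures */
#endif

static int sys_kcmp(pid_t pid1, pid_t pid2, int type,
                    unsigned long idx1, unsigned long idx2)
{
    return syscall(SYS_kcmp, pid1, pid2, type, idx1, idx2);
}

int main(void)
{
    pid_t self = getpid();
    int fd1 = open("/dev/null", O_RDONLY);
    int fd2 = dup(fd1);                      /* shares the open file description */
    int fd3 = open("/dev/null", O_RDONLY);   /* a separate open file description */

    /* kcmp() returns 0 when both descriptors refer to the same kernel object. */
    printf("fd1 vs fd2 (dup):  %s\n",
           sys_kcmp(self, self, KCMP_FILE, fd1, fd2) == 0 ? "shared" : "distinct");
    printf("fd1 vs fd3 (open): %s\n",
           sys_kcmp(self, self, KCMP_FILE, fd1, fd3) == 0 ? "shared" : "distinct");
    return 0;
}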
Another notable feature added to the kernel for CRIU is sock_diag. This is a netlink-based subsystem that can be used to obtain information about sockets. sock_diag is an example of how a CRIU-inspired addition to the kernel has also benefited other projects. Nowadays, the ss command, which displays information about sockets on the system, also makes use of sock_diag. Previously, ss used /proc files to obtain the information it displayed. The advantage of employing sock_diag is that, by comparison with the corresponding /proc files, it is much easier to extend the interface to provide new information without breaking existing applications. In addition, sock_diag provides some information that was not available with the older interfaces. In particular, before the advent of sock_diag, ss did not have a way of discovering the connections between pairs of UNIX domain sockets on a system.
Pavel briefly mentioned a few other kernel features added as part of the CRIU work. TCP repair mode allows CRIU to checkpoint and restore an active TCP connection, transparently to the peer application. Virtualization of network device indices allows virtual network devices to be restored in a network namespace; it also had the side-benefit of a small improvement in the speed of network routing. As noted earlier, the /proc/PID/map_files file was added for CRIU. CRIU has also implemented a technique for peeking at the data in a socket queue, so that the contents of a socket input queue can be dumped. Finally, CRIU added a number of options to the getsockopt() system call, so that various options that were formerly only settable via setsockopt() are now also retrievable.
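As an illustration of the repair-mode interface (a sketch only, and one that requires CAP_NET_ADMIN), a checkpointer can flip an established socket into repair mode and read the sequence numbers of its send and receive queues without disturbing the peer; the option values are defined locally in case the installed headers predate them:

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Values from the kernel's linux/tcp.h, for older user-space headers. */
#ifndef TCP_REPAIR
#define TCP_REPAIR        19
#define TCP_REPAIR_QUEUE  20
#define TCP_QUEUE_SEQ     21
#endif
#ifndef TCP_SEND_QUEUE
#define TCP_RECV_QUEUE    1
#define TCP_SEND_QUEUE    2
#endif

static void dump_queue_seq(int sk, int queue, const char *name)
{
    unsigned int seq;
    socklen_t len = sizeof(seq);

    setsockopt(sk, IPPROTO_TCP, TCP_REPAIR_QUEUE, &queue, sizeof(queue));
    if (getsockopt(sk, IPPROTO_TCP, TCP_QUEUE_SEQ, &seq, &len) == 0)
        printf("%s queue sequence: %u\n", name, seq);
}

int main(void)
{
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_addr.s_addr = htonl(INADDR_LOOPBACK) };
    socklen_t alen = sizeof(addr);
    int lsk = socket(AF_INET, SOCK_STREAM, 0);
    int sk  = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1, zero = 0;

    /* Build a throwaway established connection over loopback. */
    bind(lsk, (struct sockaddr *)&addr, sizeof(addr));
    listen(lsk, 1);
    getsockname(lsk, (struct sockaddr *)&addr, &alen);
    connect(sk, (struct sockaddr *)&addr, sizeof(addr));

    if (setsockopt(sk, IPPROTO_TCP, TCP_REPAIR, &one, sizeof(one)) < 0) {
        perror("TCP_REPAIR");          /* not privileged, or kernel too old */
        return 1;
    }
    dump_queue_seq(sk, TCP_SEND_QUEUE, "send");
    dump_queue_seq(sk, TCP_RECV_QUEUE, "receive");

    /* Leave repair mode; the connection carries on as before. */
    setsockopt(sk, IPPROTO_TCP, TCP_REPAIR, &zero, sizeof(zero));
    return 0;
}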
Current status
Pavel then summarized the current state of the CRIU implementation, looking at what is supported by the mainline 3.6 kernel. CRIU currently supports (only) the x86-64 architecture. Asked at the end of the talk how much work would be required to port CRIU to a new architecture, Pavel estimated that the work should not be large. The main tasks are to implement code that dumps architecture-specific state (mainly registers) and reimplement a small piece of code that is currently written in x86 assembler.
Arbitrary process trees are supported: it is possible to dump a process and all of its descendants. CRIU supports multithreaded applications, memory mappings of all kinds, terminals, process groups, and sessions. Open files are supported, including shared open files, as described above. Established TCP connections are supported, as are UNIX domain sockets.
The CRIU user-space tools also support various kinds of non-POSIX files, including inotify, epoll, and signalfd file descriptors, but the required kernel-side support is not yet available. Patches for that support are currently queued, and Pavel hopes that they will be merged for kernel 3.8.
Testing
The CRIU project tests its work in a variety of ways. First, there is the ZDTM (zero-down-time-migration) test suite. This test suite consists of a large number of small tests. Each test program sets up a test before a checkpoint, and then reports on the state of the tested feature after a restore. Every new feature merged into the CRIU project adds a test to this suite.
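The real suite has its own small helper library, but, purely as an illustration of the shape of such a test (none of this is actual ZDTM code), a test program might set up an interval timer, wait to be dumped and restored, and then report whether the timer survived:

#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <sys/time.h>

static volatile sig_atomic_t go;
static void handler(int sig) { (void)sig; go = 1; }

int main(void)
{
    struct itimerval set = { .it_interval = { 10, 0 }, .it_value = { 10, 0 } };
    struct itimerval got;

    signal(SIGUSR1, handler);
    setitimer(ITIMER_REAL, &set, NULL);     /* the state under test */

    printf("ready for checkpoint, pid %d\n", getpid());
    while (!go)                             /* dump and restore happen here;   */
        pause();                            /* SIGUSR1 tells the test to check */

    getitimer(ITIMER_REAL, &got);
    if (got.it_interval.tv_sec == 10)
        printf("PASS: interval timer survived the restore\n");
    else
        printf("FAIL: interval timer was lost\n");
    return 0;
}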
In addition, from time to time, the CRIU developers take some real software and test whether it survives a checkpoint/restore. Among the programs that they have successfully checkpointed and restored are the Apache web server, MySQL, a parallel compilation of the kernel, tar, gzip, an SSH daemon with connections, nginx, VNC with XScreenSaver and a client connection, MongoDB, and tcpdump.
Plans for the near future
The CRIU developers have a number of plans for the near future. (The CRIU wiki has a TODO list.) First among these is to complete the coverage of resources supported by CRIU. For example, CRIU does not currently support POSIX timers. The problem here is that the kernel doesn't currently provide an API to detect whether a process is using POSIX timers. Thus, if an application using POSIX timers is checkpointed and restored, the timers will be lost. There are some other similar examples. Fixing these sorts of problems will require adding suitable APIs to the kernel to expose the required state information.
Another outstanding task is to integrate the user-space crtools into LXC and OpenVZ to permit live migration of containers. Pavel noted that OpenVZ already supports live migration, but with its own out-of-tree kernel modules.
The CRIU developers plan to improve the automation of live migration. The issue here is that CRIU deals only with process state. However, there are other pieces of state in a container. One such piece of state is the filesystem. Currently, when checkpointing and restoring an application, it is necessary to ensure that the filesystem state has not changed in the interim (e.g., no files that are open in the checkpointed application have been deleted). Some scripting around rsync to automate copying files from the source system to the destination system could be implemented to ease the task of live migration.
One further piece of planned work is to improve the handling of user-space memory. Currently, around 90% of the time required to checkpoint an application is taken up by reading user-space memory. For many use cases, this is not a problem. However, for live migration and incremental snapshotting, improvements are possible. For example, when performing live migration, the whole application must first be frozen, and then the entire memory is copied out to the destination system, after which the application is restarted on the destination system. Copying out a huge amount of memory may require several seconds; during that time the application is unavailable. This situation could be alleviated by allowing the application to continue to run at the same time as memory is copied to the destination system, then freezing the application and asking the kernel which pages of memory have changed since the checkpoint operation began. Most likely, only a small amount of memory will have changed; those modified pages can then be copied to the destination system. This could result in a considerable shortening of the interval during which the application is unavailable. The CRIU developers plan to talk with the memory-management developers about how to add support for this optimization.
Concluding remarks
Although many groups are interested in having checkpoint/restore functionality, an implementation that works with the mainline kernel has taken a long time in coming. When one looks into the details and realizes how complex the task is, it is perhaps unsurprising that it has taken so long. Along the way, one major effort to solve the problem—checkpoint/restore in kernel space—was considered and rejected. However, there are some promising signs that the mad Russians led by Pavel may be on the verge of success with their alternative approach of a user-space implementation.
VFS hot-data tracking
At any level of the system, from the hardware to high-level applications, performance often depends on keeping frequently-used data in a place where it can be accessed quickly. That is the principle behind hardware caches, virtual memory, and web-browser image caches, for example. The kernel already tries to keep useful filesystem data in the page cache for quick access, but there can also be advantages to keeping track of "hot" data at the filesystem level and treating it specially. In 2010, a data temperature tracking patch set for the Btrfs filesystem was posted, but then faded from view. Now the idea has returned as a more general solution. The current form of the patch set, posted by Zhi Yong Wu, is called hot-data tracking. It works at the virtual filesystem (VFS) level, tracking accesses to data and making the resulting information available to user space via a couple of mechanisms.

The first step is the instrumentation of the VFS to obtain the needed information. To that end, Zhi Yong's patch set adds hooks to a number of core VFS functions (__blockdev_direct_IO(), readpage(), read_pages(), and do_writepages()) to record specific access operations. It is worth noting that hooking at this level means that this subsystem is not tracking data accesses as such; instead, it is tracking operations that cause actual file I/O. The two are not quite the same thing: a frequently-read page that remains in the page cache will generate no I/O; it could look quite cold to the hot-data tracking code.
The patch set uses these hooks to maintain a surprisingly complicated data structure, involving a couple of red-black trees, that is hooked into a filesystem's superblock structure. Zhi Yong used this bit of impressive ASCII art to describe it in the documentation file included with the patch set:
[The documentation's ASCII diagram shows two red-black trees, hot_inode_tree and hot_range_tree, whose hot_comm_item nodes carry the frequency data and are also linked, via their list_head fields, into the heat_inode_map and heat_range_map lists.]
In short, the idea is to track which inodes are seeing the most I/O traffic, along with the hottest data ranges within those inodes. The subsystem can produce a sorted list on demand. Unsurprisingly, this data structure can end up using a lot of memory on a busy system, so Zhi Yong has added a shrinker to clean things up when space gets tight. Specific file information is also dropped after five minutes (by default) with no activity.
There is a new ioctl() command (FS_IOC_GET_HEAT_INFO) that can be used to obtain the relevant information for a specific file. The structure it uses shows the information that is available:
struct hot_heat_info {
    __u64 avg_delta_reads;
    __u64 avg_delta_writes;
    __u64 last_read_time;
    __u64 last_write_time;
    __u32 num_reads;
    __u32 num_writes;
    __u32 temp;
    __u8 live;
};
The hot-data tracking subsystem monitors the number of read and write operations, when the last operations occurred, and the average period between operations. A complicated calculation boils all that information down to a single temperature value, stored in temp. The live field is an input parameter to the ioctl() call: if it is non-zero, the temperature will be recalculated at the time of the call; otherwise a cached, previously-calculated value will be returned.
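A user-space consumer might query a file's temperature along the following lines. Note that the ioctl request number used here is only a placeholder for illustration; the real definition lives in the (unmerged) patch set's headers:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/types.h>

struct hot_heat_info {
    __u64 avg_delta_reads;
    __u64 avg_delta_writes;
    __u64 last_read_time;
    __u64 last_write_time;
    __u32 num_reads;
    __u32 num_writes;
    __u32 temp;
    __u8 live;
};

/* Placeholder: the actual request number comes from the patch set. */
#ifndef FS_IOC_GET_HEAT_INFO
#define FS_IOC_GET_HEAT_INFO _IOR('f', 17, struct hot_heat_info)
#endif

int main(int argc, char **argv)
{
    struct hot_heat_info info = { .live = 1 };   /* 1 = recalculate temp now */
    int fd;

    if (argc < 2)
        return 1;
    fd = open(argv[1], O_RDONLY);
    if (fd < 0 || ioctl(fd, FS_IOC_GET_HEAT_INFO, &info) < 0) {
        perror("FS_IOC_GET_HEAT_INFO");
        return 1;
    }
    printf("%s: temp %u, %u reads, %u writes\n",
           argv[1], info.temp, info.num_reads, info.num_writes);
    close(fd);
    return 0;
}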
The ioctl() call does not provide a way to query which parts of the file are the hottest, or to get a list of the hottest files. Instead, the debugfs interface must be used. Once debugfs is mounted, each device or partition with a mounted filesystem will be represented by a directory under hot_track/ containing two files. The most active files can be found by reading rt_stats_inode, while the hottest file ranges can be read from rt_stats_range. These are the interfaces that user-space utilities are expected to use to make decisions about, for example, which files (or portions of files) should be stored on a fast, solid-state drive.
Should a filesystem want to influence how the calculations are done, the patch set provides a structure (called hot_func_ops) as a place for filesystem-provided functions to calculate access frequencies, temperatures, and when information should be aged out of the system. In the posted patch set, though, only Btrfs uses the hot-data tracking feature, and it does not override any of those operations, so it is not entirely clear why they exist. The changelog states that support for ext4 and xfs has been implemented; perhaps one of those filesystems needed that capability.
The patch set has been through several review cycles and a lot of changes have been made in response to comments. The list of things still to be done includes scalability testing, a simpler temperature calculation function, and the ability to save file temperature data across an unmount. If nothing else, some solid performance information will be required before this patch set can be merged into the core VFS code. So hot-data tracking is not 3.8 material, but it may be ready for one of the subsequent development cycles.
The module signing endgame
Inserting a loadable module into the running kernel is a potential security problem, so some administrators want to be able to restrict which modules are allowed. One way to do that is to cryptographically sign modules and have the kernel verify the signature before loading a module. That security benefit comes with a cost, though: slower kernel builds. Module signing isn't for everyone, and those who aren't interested probably don't want to pay much of a price for the new feature; even those who are interested will want to minimize the price they do pay. When that cost is multiplied across a vast number of kernel builds, it draws some attention.
David Miller complained on Google+ about the cost of module signing in mid-October. Greg Kroah-Hartman agreed in the comments, noting that an allmodconfig build took more than 10% longer between 3.6 and 3.7-rc1. The problem is the addition of module signing to the build process. Allmodconfig builds the kernel with as many modules as possible, which has the effect of build-testing nearly all of the kernel. Maintainers like Miller and Kroah-Hartman do that kind of build frequently, typically after each patch they apply, in order to ensure that the kernel still builds. Module signing can, of course, be disabled by turning off CONFIG_MODULE_SIG, but that adds a manual configuration step to the build process, which is annoying.
Linus Torvalds noted Miller's complaint and offered up a "*much* simpler" solution: defer module signing until install time. There is already a mechanism to strip modules during the make modules_install step. Torvalds's change adds module signing into that step, which means that you don't pay the signing price until you actually install the modules. There are some use cases that would not be supported by this change, but Torvalds essentially dismissed them.
One of the main proponents behind the module signing feature over the years has been David Howells; his code was used as the basis for module maintainer Rusty Russell's signature infrastructure patch. But Howells was not particularly happy with Torvalds's changes. He would like to be able to handle some of the use cases that Torvalds dismissed, including loading modules from the kernel build tree. He thinks that automatic signing should probably just be removed from the build process; a script could be provided to do signing manually.
Howells is looking at the signed modules problem from a distribution view. Currently, the keys used to sign modules can be auto-generated at build time, with the public key getting built into the kernel and the private portion being used for signing—and then likely deleted once the build finishes. That isn't how distributions will do things, so auto-generating keys concerns Howells.
Such an approach may make sense for distributions or others using long-lived keys, but it makes little sense for a more basic use case. With characteristic bluntness, Torvalds pointed that out:
One of the main sane use-cases for module signing is:
- CONFIG_CHECK_SIGNATURE=y
- randomly generated one-time key
- "make modules_install; make install"
- "make clean" to get rid of the keys.
- reboot.
and now you have a custom kernel that has the convenience of modules, yet is basically as safe as a non-modular build. The above makes it much harder for any kind of root-kit module to be loaded, and basically entirely avoids one fundamental security scare of modules.
Kroah-Hartman agreed with the need to support the use case Torvalds described, though he noted that keys are not removed by make clean, which he considered a bit worrisome. It turns out that make clean is documented to leave the files needed to build modules, so make distclean should be used to get rid of the key files.
Russell, who has always been a bit skeptical of module signing, pointed out that Torvalds's use case could be handled by just storing the hashes of the modules in the kernel—no cryptography necessary. While that's true, Russell's scheme would disallow some other use cases. Signing provides flexibility, Torvalds said, and is "technically the right thing to do". Russell remained unconvinced.
Russell's concerns notwithstanding, it is clear that module signing is here to stay. Torvalds's change was added for 3.7 (with some additions by Russell and Howells). For distributions, Josh Boyer has a patch that will add a "modules_sign" target. It will operate on the modules in their installed location (i.e. after a modules_install) and remove the signature, which will allow the distribution packaging system (e.g. RPM) to generate debuginfo for the modules before re-signing them. In that way, distributions can use Torvalds's solution at the cost of signing modules twice. Since that process should be far less frequent than developers building kernels (or build farms building kernels or ...), the tradeoff is likely to be well worth the small amount of pain.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Memory management
Networking
Security-related
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet