
Kernel development

Brief items

Kernel release status

The current development kernel is 3.4-rc5, released on April 29. "And like -rc4, quite a bit of the changes came in on Friday (with some more coming in yesterday). And we haven't been calming down, quite the reverse. -rc5 has almost 50% more commits than -rc4 had. Not good." That said, what's going in is mostly fixes; see the announcement for the short-form changelog.

Stable updates: the 3.0.30 and 3.3.4 updates were released on April 27 with the usual set of important fixes.


Quotes of the week

I am not making up the fact that I had a nightmare last night in which r12 (ARM IP register) was trying to kill me. I might have spent too long staring at kernel disassembly over the weekend.
-- Jon Masters

Indeed, my goal was "less bonkers" rather than "not bonkers". A "not bonkers" description remains a long-term aspiration rather than a short-term goal for the moment.
-- Paul "moderately bonkers" McKenney

What is it with all these Linuses these days? There's a Linus at google too. Some day I will get myself my own broadsword, and run around screaming "There can be only one".

I used to be _special_ dammit. Snif.

-- Linus Torvalds (Thanks to Nicolas Pitre)


Some useful perf documentation

For those who would like more information on how to use the Linux perf subsystem, there is an extensive tutorial posted by Google, written by Stephane Eranian. It probably merits a bookmark for anybody wanting to learn how to do interesting things with perf.


Kernel development news

Better active/inactive list balancing

By Jonathan Corbet
May 2, 2012
Memory management is a notoriously tricky task, though the underlying objective is quite clear: look into the future and ensure that the pages that will be needed by applications are in memory. Unfortunately, existing crystal ball peripherals tend not to work very well; they also usually require proprietary drivers. So the kernel is stuck with a set of heuristics that try to guess future needs based on recent behavior. Adjusting those heuristics is always a bit of a challenge; it is easy to put in changes that will break obscure workloads years in the future. But that doesn't stop developers from trying.

A core part of the kernel's memory management subsystem is a pair of lists called the "active" and "inactive" lists. The active list contains anonymous and file-backed pages that are thought (by the kernel) to be in active use by some process on the system. The inactive list, instead, contains pages that the kernel thinks might not be in use. When active pages are considered for eviction, they are first moved to the inactive list and unmapped from the address space of the process(es) using them. Thus, once a page moves to the inactive list, any attempt to reference it will generate a page fault; this "soft fault" will cause the page to be moved back to the active list. Pages that sit in the inactive list for long enough are eventually removed from the list and evicted from memory entirely.

One could think of the inactive list as a sort of probationary status for pages that the kernel isn't sure are worth keeping. Pages can get there from the active list as described above, but there is another path to inactive status as well: file-backed pages, when they are faulted in, are placed directly on the inactive list. It is quite common for a process to access a file's contents only once; requiring a second access before moving file-backed pages to the active list lets the kernel get rid of single-use data relatively quickly.
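
The rules described so far are simple enough to capture in a toy model. The following user-space sketch is purely illustrative (the types and transitions are made up for this article, not taken from the kernel's code), but it shows the state machine a page moves through:

    #include <stdio.h>

    enum lru_state { NOT_PRESENT, INACTIVE, ACTIVE };

    struct page { enum lru_state lru; };

    /* File-backed pages start life on the inactive list when first read in. */
    static void fault_in_file_page(struct page *p)
    {
        p->lru = INACTIVE;
    }

    /* A reference to an inactive (unmapped) page is a soft fault: activate it. */
    static void reference_page(struct page *p)
    {
        if (p->lru == INACTIVE)
            p->lru = ACTIVE;
    }

    /* Under pressure, active pages are demoted before being evicted, and
     * pages that linger on the inactive list are evicted entirely. */
    static void reclaim(struct page *p)
    {
        p->lru = (p->lru == ACTIVE) ? INACTIVE : NOT_PRESENT;
    }

    int main(void)
    {
        struct page p;

        fault_in_file_page(&p);  /* first access: inactive */
        reference_page(&p);      /* second access: promoted to active */
        reclaim(&p);             /* demoted back to inactive */
        reclaim(&p);             /* evicted from memory entirely */
        printf("final state: %d\n", p.lru);  /* prints 0 (not present) */
        return 0;
    }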

Splitting memory into two pools in this manner leads to an immediate policy decision: how big should each list be? A very large inactive list gives pages a long time to be referenced before being evicted; that can reduce the number of pages kicked out of memory only to be read back in shortly thereafter. But a large inactive list comes at the cost of a smaller active list; that can slow down the system as a whole by causing lots of soft page faults for data that's already in memory. So, as is the case with many memory management decisions, regulating the relative sizes of the two lists is a balancing act.

The way that balancing is done in current kernels is relatively straightforward: the active list is not allowed to grow larger than the inactive list. Johannes Weiner has concluded that this heuristic is too simple and insufficiently adaptive, so he has come up with a proposal for a replacement. In short, Johannes wants to make the system more flexible by tracking how long evicted pages stay out of memory before being faulted back in.

Doing so requires some significant changes to the kernel's page-tracking infrastructure. Currently, when a page is removed from the inactive list and evicted from memory, the kernel simply forgets about it; that clearly will not do if the kernel is to try to track how long the page remains out of memory. The page cache is tracked via a radix tree; the kernel's radix tree implementation already has a concept of "exceptional entries" that is used to track tmpfs pages while they are swapped out. Johannes's patch extends this mechanism to store "shadow" entries for evicted pages, providing the needed long-term record-keeping for those pages.

What goes into those shadow entries is a representation of the time the page was evicted. That time can be thought of as a counter of removals from the inactive list; it is represented as an atomic_t variable called workingset_time. Every time a page is removed from the inactive list, either to evict it or to activate it, workingset_time is incremented by one. When a page is evicted, the current value of workingset_time is stored in its associated shadow entry. This time, thus, can be thought of as a sort of sequence counter for memory-management events.

If and when that page is faulted back in, the difference between the current workingset_time and the value in the shadow entry gives a count of how many pages were removed from the inactive list while that page was out of memory. In the language of Johannes's patch, this difference is called the "refault distance." The observation at the core of this patch set is that, if a page returns to memory with a refault distance of R, its eviction and refaulting would have been avoided had the inactive list been R pages longer. R is thus a sort of metric describing how much longer the inactive list should be made to avoid a particular page fault.
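
The bookkeeping involved is small; a user-space toy (again, not the patch's actual code — the simple array here stands in for the radix-tree shadow entries) might look like:

    #include <stdio.h>

    static unsigned long workingset_time;   /* inactive-list removals so far */
    static unsigned long shadow[1024];      /* stand-in for the shadow entries */

    /* Called whenever a page leaves the inactive list, for any reason. */
    static void inactive_list_removal(void)
    {
        workingset_time++;
    }

    /* On eviction, record the current "time" in the page's shadow entry. */
    static void evict(unsigned int page)
    {
        inactive_list_removal();
        shadow[page] = workingset_time;
    }

    /* On refault, the distance is the number of inactive-list removals
     * that happened while this page was out of memory. */
    static unsigned long refault_distance(unsigned int page)
    {
        return workingset_time - shadow[page];
    }

    int main(void)
    {
        int i;

        evict(7);                           /* page 7 is evicted */
        for (i = 0; i < 100; i++)           /* 100 more pages cycle out */
            inactive_list_removal();
        printf("refault distance: %lu\n", refault_distance(7));  /* 100 */
        return 0;
    }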

Given that number, one has to decide how it should be used. The algorithm used in Johannes's patch is simple: if R is less than the length of the active list, one page will be moved from the active to the inactive list. That shortens the active list by one entry and places the formerly-active page on the inactive list immediately next to the page that was just refaulted in (which, as described above, goes onto the inactive list until a second access occurs). If the formerly-active page is still needed, it will be reactivated in short order. If, instead, the working set is shifting toward a new set of pages, the refaulted page may be activated instead, taking the other page's place. Either way, it is hoped, the kernel will do a better job of keeping the right pages active. Meanwhile, the inactive list gets slightly longer in the hope of avoiding refaults in the near future.
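
That decision rule fits in a few lines; in this toy version the lists are reduced to counters and the numbers are made up, but the logic mirrors the description above:

    #include <stdio.h>

    static unsigned long nr_active = 1000;    /* pages on the active list */
    static unsigned long nr_inactive = 1000;  /* pages on the inactive list */

    static void handle_refault(unsigned long refault_distance)
    {
        if (refault_distance < nr_active) {
            /* The inactive list was "R pages too short": demote one page
             * from the active list, placing it next to the refaulted page. */
            nr_active--;
            nr_inactive++;
        }
        /* The refaulted page itself lands on the inactive list either way
         * and must be referenced again before it is activated. */
        nr_inactive++;
    }

    int main(void)
    {
        handle_refault(400);    /* 400 < 1000: one page is deactivated */
        handle_refault(5000);   /* larger than the active list: no demotion */
        printf("active=%lu inactive=%lu\n", nr_active, nr_inactive);
        return 0;
    }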

How well all of this works is not yet clear: Johannes has not posted any benchmark results for any sort of workload. This is early-stage work at this point, a long way from acceptance into a mainline kernel release. So it could evolve significantly or fade away entirely. But more sophisticated balancing between the active and inactive lists seems like an idea whose time may be coming.


TCP connection repair

By Jonathan Corbet
May 1, 2012
Migrating a running container from one physical host to another is a tricky job on a number of levels. Things get even harder if, as is likely, the container has active network connections to processes outside of that container. It is natural to want those connections to follow the container to its new host, preferably without the remote end even noticing that something has changed, but the Linux networking stack was not written with this kind of move in mind. Even so, it appears that transparent relocation of network connections, in the form of Pavel Emelyanov's TCP connection repair patches, will be supported in the 3.5 kernel.

The first step in moving a TCP connection is to gather all of the information possible about its current state. Much of that information is available from user space now; by digging around in /proc and /sys, one can determine the address and port of the remote end, the sizes of the send and receive queues, TCP sequence numbers, and a number of parameters negotiated between the two end points. There are still a few things that user space will need to obtain, though, before it can finish the job; that requires some additional support from the kernel.

With Pavel's patch, that support is available to suitably privileged processes. To dig into the internals of an active network connection, user space must put the associated socket into a new "repair mode." That is done with the setsockopt() system call, using the new TCP_REPAIR option. Changing a socket's repair-mode status requires the CAP_NET_ADMIN capability; the socket must also either be closed or in the "established" state. Once the socket is in repair mode, it can be manipulated in a number of ways.

One of those is to read the contents of the send and receive queues. The send queue contains data that has not yet been successfully transmitted to the remote end; that data needs to move with the connection so it can be transmitted from the new location. The receive queue, instead, contains data received from the remote end that has not yet been consumed by the application being moved; that data, too, should move so it will be waiting on the new host when the application gets around to reading it. Obtaining the contents of these queues is done with a two-step sequence: (1) call setsockopt(TCP_REPAIR_QUEUE) with either TCP_RECV_QUEUE or TCP_SEND_QUEUE to select a queue, then (2) call recvmsg() to read the contents of the selected queue.

It turns out there is only one other important piece of information that cannot already be obtained from user space: the maximum value of the MSS (maximum segment size) negotiated between the two endpoints at connection setup time. To make this value available, Pavel's patch changes the semantics of the TCP_MAXSEG socket option (for getsockopt()) when the connection is in repair mode: it returns the maximal "clamp" MSS value rather than the currently active value.

Finally, if a connection is closed while it is in the repair mode, it is simply deleted with no notification to the remote end. No FIN or RST packets will be sent, so the remote side will have no idea that things have changed.
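
Putting those pieces together, the dump side might look roughly like the sketch below. Error handling is omitted, the buffer sizes are arbitrary, and the TCP_REPAIR* constants are assumed to come from <linux/tcp.h> on a kernel carrying the patch; MSG_PEEK is used here so that the peeked data stays in the queues.

    #include <stdio.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <linux/tcp.h>

    static void dump_queue(int sk, int which, char *buf, size_t len)
    {
        /* Select a queue, then peek at its contents; MSG_PEEK leaves the
         * data in place for the connection to keep using. */
        setsockopt(sk, IPPROTO_TCP, TCP_REPAIR_QUEUE, &which, sizeof(which));
        recv(sk, buf, len, MSG_PEEK | MSG_DONTWAIT);
    }

    void dump_connection(int sk)
    {
        int on = 1, mss_clamp = 0;
        socklen_t optlen = sizeof(mss_clamp);
        static char sndbuf[65536], rcvbuf[65536];

        /* Requires CAP_NET_ADMIN; the socket must be established or closed. */
        setsockopt(sk, IPPROTO_TCP, TCP_REPAIR, &on, sizeof(on));

        dump_queue(sk, TCP_SEND_QUEUE, sndbuf, sizeof(sndbuf));
        dump_queue(sk, TCP_RECV_QUEUE, rcvbuf, sizeof(rcvbuf));

        /* In repair mode, TCP_MAXSEG reports the negotiated MSS clamp
         * rather than the currently active segment size. */
        getsockopt(sk, IPPROTO_TCP, TCP_MAXSEG, &mss_clamp, &optlen);
        printf("MSS clamp: %d\n", mss_clamp);

        /* A close() while still in repair mode tears the socket down
         * silently: no FIN or RST is sent to the remote end. */
    }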

Then there is the matter of establishing the connection on the new host. That is done by creating a new socket and putting it immediately into the repair mode. The socket can then be bound to the proper port number; a number of the usual checks for port numbers are suspended when the socket is in repair mode. The TCP_REPAIR_QUEUE setsockopt() call comes into play again, but this time sendmsg() is used to restore the contents of the send and receive queues.

Another important task is to restore the send and receive sequence numbers. These numbers are normally generated randomly when the connection is established, but that cannot be done when a connection is being moved. These numbers can be set with yet another call to setsockopt(), this time with the TCP_QUEUE_SEQ option. This operation applies to whichever queue was previously selected with TCP_REPAIR_QUEUE, so the refilling of a queue's content and the setting of its sequence number are best done at the same time.

A few negotiated parameters also need to be restored so that the two ends will remain in agreement with each other; these include the MSS clamp described above, along with the active maximum segment size, the window size, and whether the selective acknowledgment and timestamp features can be used. One last setsockopt() option, TCP_REPAIR_OPTIONS, has been added to make it possible to set these parameters from user space.

Once the socket has been restored to a state approximating that which existed on the old host, it's time to put it into operation. When connect() is called on a socket in repair mode, much of the current setup and negotiation code is shorted out; instead, the connection goes directly to the "established" state without any communication from the remote end. As a final step, when the socket is taken out of the repair mode, a window probe is sent to restart traffic between the two ends; at that point, the socket can resume normal operation on the new host.
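
Assembled from the steps above, the restore side might look something like the following sketch. The dumped addresses, queue contents, and sequence numbers are assumed to be at hand; the helper names and the exact ordering of the calls are illustrative rather than taken from the patch, and error handling is again omitted.

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <linux/tcp.h>

    static void restore_seq(int sk, int which, unsigned int seq)
    {
        setsockopt(sk, IPPROTO_TCP, TCP_REPAIR_QUEUE, &which, sizeof(which));
        /* The sequence numbers are restored before connect() is called. */
        setsockopt(sk, IPPROTO_TCP, TCP_QUEUE_SEQ, &seq, sizeof(seq));
    }

    static void refill_queue(int sk, int which, const void *data, size_t len)
    {
        setsockopt(sk, IPPROTO_TCP, TCP_REPAIR_QUEUE, &which, sizeof(which));
        if (len)
            send(sk, data, len, 0);
    }

    int restore_connection(const struct sockaddr_in *self,
                           const struct sockaddr_in *peer,
                           const void *sndq, size_t sndlen, unsigned int snd_seq,
                           const void *rcvq, size_t rcvlen, unsigned int rcv_seq)
    {
        int sk = socket(AF_INET, SOCK_STREAM, 0);
        int on = 1, off = 0;

        /* Go straight into repair mode; the usual port-number checks are
         * relaxed, so the original local port can be re-bound. */
        setsockopt(sk, IPPROTO_TCP, TCP_REPAIR, &on, sizeof(on));
        bind(sk, (const struct sockaddr *)self, sizeof(*self));

        restore_seq(sk, TCP_SEND_QUEUE, snd_seq);
        restore_seq(sk, TCP_RECV_QUEUE, rcv_seq);

        /* connect() in repair mode jumps directly to "established"
         * without exchanging any packets with the remote end. */
        connect(sk, (const struct sockaddr *)peer, sizeof(*peer));

        /* The saved queue contents can now be pushed back in; a call
         * with TCP_REPAIR_OPTIONS would also go here to restore the MSS
         * clamp, window scaling, SACK, and timestamp settings (the
         * option structure itself is omitted from this sketch). */
        refill_queue(sk, TCP_SEND_QUEUE, sndq, sndlen);
        refill_queue(sk, TCP_RECV_QUEUE, rcvq, rcvlen);

        /* Leaving repair mode triggers a window probe, restarting normal
         * traffic between the two ends. */
        setsockopt(sk, IPPROTO_TCP, TCP_REPAIR, &off, sizeof(off));
        return sk;
    }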

These patches have been through a few revisions over a number of months; with version 4, networking maintainer David Miller accepted them into net-next. From there, those changes will almost certainly hit the mainline during the 3.5 merge window. The TCP connection repair patches do not represent a complete solution to the problem of checkpointing and restoring containers, but they are an important step in that direction.


Fixing the unfixable autofs ABI

By Jonathan Corbet
April 30, 2012
One of the few hard rules of kernel development is that breaking the user-space binary interface is not acceptable. If there is user-space code that depends on specific behavior, that behavior must be maintained regardless of how inconvenient that may be. But what is to be done if two different programs depend on mutually-incompatible behaviors, so that it is seemingly impossible to keep them both working? The answer may be to violate another rule by putting an ugly hack into the kernel—or to do something rather more tricky.

The "autofs" protocol is used to communicate between the kernel and an automounter daemon. It allows the automounter to set up special virtual filesystems that, when referenced by user space, can be replaced by a remote-mounted real filesystem. Much of this protocol is implemented with ioctl() calls on a special autofs device, but it also makes use of pipes between the kernel and user space when specific filesystems are mounted.

This protocol is certainly part of the kernel ABI, so its components have been defined with some care. One of the key elements of the autofs protocol is the autofs_v5_packet structure, which is sent from the kernel to user space via a pipe; it is used, among other things, to report that a filesystem has been idle for some time and should be unmounted. This structure looks like:

    struct autofs_v5_packet {
	struct autofs_packet_hdr hdr;
	autofs_wqt_t wait_queue_token;
	__u32 dev;
	__u64 ino;
	__u32 uid;
	__u32 gid;
	__u32 pid;
	__u32 tgid;
	__u32 len;
	char name[NAME_MAX+1];
    };

The size of every field is precisely defined, so this structure should look the same on both 32- and 64-bit systems. And it does, except for one tiny little problem. The size of the structure as defined is 300 bytes, which is not divisible by eight. So if two of these structures were to be placed contiguously in memory, the 64-bit ino field would have to be misaligned in one of them. To avoid this problem, the compiler will, on 64-bit systems, round the size of the structure up to a multiple of eight, adding four bytes of padding at the end. So sizeof() on struct autofs_v5_packet will return 300 on a 32-bit system, and 304 on a 64-bit system.
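
The effect is easy to see in isolation with a simplified stand-in structure (this is not the real autofs_v5_packet, whose layout is shown above; it just exhibits the same padding rule):

    #include <stdio.h>
    #include <stdint.h>

    struct demo_packet {
        uint64_t ino;     /* needs 8-byte alignment on x86-64, 4 on i386 */
        uint32_t len;
        char name[256];   /* the payload ends at byte 268 */
    };

    int main(void)
    {
        /* Prints 272 when built as a 64-bit binary (four bytes of trailing
         * padding keep ino aligned in arrays) and 268 when built with -m32,
         * mirroring the 304-versus-300 discrepancy described above. */
        printf("sizeof(struct demo_packet) = %zu\n",
               sizeof(struct demo_packet));
        return 0;
    }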

That disparity is not a problem most of the time, but there is an exception. Automounting is one of the many tasks being assimilated by the systemd daemon. When systemd reads one of the above structures from the kernel, it checks the size of what it read against its idea of the size of the structure to ensure that everything is operating as it should be. That check works just fine, as long as systemd and the kernel agree on that size. And normally they do, but there is an exception: if systemd is running as a 32-bit process on a 64-bit kernel, it will get a 304-byte structure when it is expecting 300 bytes. At that point, systemd concludes that something has gone wrong and gives up.

In February, Ian Kent merged a patch to deal with this problem. One could be forgiven for calling the solution hacky: on 64-bit systems, the kernel's automount code will subtract four from the size of that structure if (and only if) it is talking with a user-space client running in 32-bit mode. This patch makes systemd work in this situation; it was merged for 3.3-rc5 and fast-tracked into the various stable kernel releases. Everybody then lived happily ever after.

...except they didn't. It seems that the automount program from the autofs-tools package, which is still in use on a great many systems, had run into this problem a number of years ago. At that time, the autofs-tools developers decided to work around the problem in user space. So, if automount determines that it is running in 32-bit mode on a 64-bit kernel (Linus has little respect for how that determination is done, incidentally), it will correct its idea of what the structure size should be. If the kernel messes with that size, the automount "fix" no longer works, so Ian's patch fixes systemd at the cost of breaking automount.

So we are now in a situation where two deployed programs have different ideas of how the autofs protocol should work. On pure 32- or 64-bit systems, both programs work just fine, but, depending on which kernel is being run, one or the other of the two will break in the 32-on-64 configuration. If Ian's patch remains, some users will be most unhappy, but reverting it will upset other users. It is, in other words, a somewhat unfortunate situation.

Unfortunate, but not necessarily unrecoverable. One possible way to fix things can be seen in this patch from Michael Tokarev. In short, this patch looks at the name of the current command (current->comm) and compares it against "automount". If the currently-running program is called "automount," the structure-size tweak is not applied and things work again. For any other program (including systemd), the previous fix remains. So things are fixed at the expense of having the kernel ABI depend on the name of the running program. At best, this solution can be described as "inelegant." At worst, there may be some other, unknown program with a different name that breaks in the same way automount does; any such program will remain broken with this fix in place.
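
The idea can be sketched in isolation as follows; this is a standalone paraphrase of the concept, not Michael's actual patch, and the comm argument stands in for the kernel's current->comm:

    #include <stdio.h>
    #include <string.h>

    #define FULL_PACKET_SIZE    304   /* padded, 64-bit layout */
    #define COMPAT_PACKET_SIZE  300   /* what most 32-bit readers expect */

    static int autofs_packet_size(int reader_is_32bit, const char *comm)
    {
        if (!reader_is_32bit)
            return FULL_PACKET_SIZE;

        /* automount compensates for the 32-on-64 case on its own, so the
         * kernel-side size tweak has to be skipped when it is the reader. */
        if (strcmp(comm, "automount") == 0)
            return FULL_PACKET_SIZE;

        /* Everybody else (systemd included) gets the 300-byte size. */
        return COMPAT_PACKET_SIZE;
    }

    int main(void)
    {
        printf("%d\n", autofs_packet_size(1, "automount"));  /* 304 */
        printf("%d\n", autofs_packet_size(1, "systemd"));    /* 300 */
        return 0;
    }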

Still, Linus has conceded that "it's probably what we have to go with". But he preferred to look for a less kludgy and more robust solution. One possibility was for the kernel to look at the size of the read() operation that would obtain the autofs_v5_packet structure prior to writing that structure; if that size is either 300 or 304, the kernel could give the calling program the size it is expecting. The problem here is that the read() operation is hidden behind the pipe, so the autofs code does not actually have access to the size of the buffer provided by user space.

So Linus came up with a different solution, the concept of "packetized pipes". A packetized pipe resembles the normal kind with a couple of exceptions: each write() is kept in a separate buffer, and a read() consumes an entire buffer, even if the size of the read is smaller than the amount of data in the buffer. With a packetized pipe, the kernel can always just write the larger (304-byte) structure size; if user space is only trying to read 300 bytes, then it will get what it expects and be happy. So there is no need for special hacks in the kernel, just a slightly different type of pipe dynamics. Following a suggestion from Alan Cox, Linus made opening a pipe with O_DIRECT turn on the packetized behavior, so user space can create such pipes if need be.
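
From user space, creating such a pipe is just a matter of passing O_DIRECT at creation time; the sketch below assumes the pipe2() interface and the packet semantics described above (each write() is one packet, and a short read() consumes the whole packet):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fds[2];
        char packet[304] = { 0 };
        char buf[300];
        ssize_t n;

        if (pipe2(fds, O_DIRECT) < 0) {      /* create a packetized pipe */
            perror("pipe2");
            return 1;
        }

        /* The writer (the kernel, in the autofs case) always sends the
         * full 304-byte structure as one packet... */
        write(fds[1], packet, sizeof(packet));

        /* ...but a reader asking for only 300 bytes still consumes the
         * whole packet; the excess four bytes are simply discarded. */
        n = read(fds[0], buf, sizeof(buf));
        printf("read %zd bytes\n", n);       /* prints 300 */
        return 0;
    }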

After a couple of false starts, Linus got this patch working and merged it just prior to the 3.4-rc5 release. So the 3.4 kernel should work fine for either automount or systemd.

The kernel community got a bit lucky here; it was possible for a suitably clever and motivated developer to figure out a way to give both programs what they expect and make the system work for everybody. The next time this kind of problem arises, the solution may not be so simple. Maintaining ABI stability is not always easy or fun, but it is necessary to keep the system viable in the long term.


Patches and updates

Kernel trees

Linus Torvalds: Linux 3.4-rc5
Greg KH: Linux 3.3.4
Greg KH: Linux 3.0.30
Steven Rostedt: 3.0.30-rt50


Page editor: Jonathan Corbet


Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds