Kernel development
Brief items
Kernel release status
The current development kernel is 3.11-rc7, released on August 25 with a special announcement: "I'm doing a (free) operating system (just a hobby, even if it's big and professional) for 486+ AT clones and just about anything else out there under the sun. This has been brewing since april 1991, and is still not ready. I'd like any feedback on things people like/dislike in Linux 3.11-rc7." He is, of course, celebrating the 22nd anniversary of his original "hello everybody" mail.
Stable updates: no stable updates have been released in the last week. As of this writing, the 3.10.10, 3.4.60, and 3.0.94 updates are in the review process; they can be expected on or after August 29.
Quotes of the week
On the other hand, after grad school, this is second nature.
flink() flunks
A few weeks back, we noted the merging of a patch adding support for the flink() system call, which would allow a program to link to a file represented by an open file descriptor. This functionality had long been blocked as the result of security worries; it went through this time on the argument that the kernel already provided that feature via the linkat() call. At that point, it seemed like this feature was set for the 3.11 kernel.

Not everybody was convinced, though; in particular, Brad Spengler apparently raised the issue of processes making links to file descriptors that had been passed in from a more privileged domain. So Andy Lutomirski, the author of the original patch, posted a followup to restrict the functionality to files created with the O_TMPFILE option. The only problem was that nobody much liked the patch; Linus was reasonably clear about his feelings.
What followed was a long discussion on how to better solve the problem, with a number of patches going back and forth. About the only thing that became clear was that the best solution was not yet clear. So, on August 28, Linus reverted the original patch, saying "Clearly we're not done with this discussion, and the patches flying around aren't appropriate for 3.11." So the flink() functionality will have to wait until at least 3.12.
Kernel development news
TSO sizing and the FQ scheduler
One might think that, by now, we would have a pretty good idea of how to optimally manage data streams with the TCP protocol. In truth, there still seems to be substantial room for improvement. Larger problems like bufferbloat have received a lot of attention recently, but there is ongoing work in other aspects of real-world networking as well. A couple of patches posted recently by Eric Dumazet show the kind of work that is being done.
TSO sizing
TCP segmentation offload (TSO) is a hardware-assisted technique for improving the performance of outbound data streams. As a stream of data (a "flow") is transmitted, it must be broken up into smaller segments, each of which fits into one packet. A network interface that supports TSO can accept a large buffer of data and do this segmentation work within the hardware. That reduces the number of operations performed and interrupts taken by the host CPU, making the transmission process more efficient. The use of techniques like TSO makes it possible for Linux systems to run high-end network interfaces at their full speed.
A potential problem with TSO is that it can cause the interface to dump a large number of packets associated with a single stream onto the wire in a short period of time. In other words, data transmission can be bursty, especially if the overall flow rate for the connection is not all that high. All of those packets will just end up sitting in a buffer somewhere, contributing to bufferbloat and increasing the chances that some of those packets will be dropped. If those packets were transmitted at a more steady pace, the stress on the net as a whole would be reduced and throughput could well increase.
The simple TSO automatic sizing patch posted by Eric (with a Cc to Van Jacobson at a new google.com address) tries to spread out transmissions in just that way. There are two changes involved, the first of which is to make intelligent choices about how much data should be handed to the interface in a single TSO transmission. Current kernels will provide a maximally-sized buffer — usually 64KB — to be transmitted all at once. With the automatic sizing patch, that buffer size is reduced to an amount that, it is estimated, will take roughly 1ms to transmit at the current flow rate. In this way, each transmission will produce a smaller burst of data if the flow rate is not high.
The other piece of the puzzle is called "TCP pacing"; it is a TCP implementation change intended to set the pace at which packets are transmitted to approximately match the pace at which they can get through the network. The existing TCP flow control mechanisms tell each endpoint how much data it is allowed to transmit, but they don't provide a time period over which that transmission should be spread, so, once again, the result tends to be bursts of packets. TCP pacing ensures that packets are transmitted at a rate that doesn't cause them to pile up in buffers somewhere between the two endpoints. It can, of course, also be used to restrict the data rate of a given flow to something lower than what the network could handle, but that is not the objective of this patch.
Interestingly, the patch does not actually implement pacing, which would add some significant complexity to the TCP stack — code that does not really need more complexity. All it does is to calculate the rate that should be used, in the hope that some other level of the stack can then enforce that rate. For now, that other part would appear to be the new "fair queue" packet scheduler.
The FQ scheduler
A packet scheduler is charged with organizing the flow of packets through the network stack to meet a set of policy objectives. The kernel has quite a few of them, including CBQ for fancy class-based routing, CHOKe for routers, and a couple of variants on the CoDel queue management algorithm. FQ joins this list as a relatively simple scheduler designed to implement fair access across large numbers of flows with local endpoints while keeping buffer sizes down; it also happens to implement TCP pacing.
FQ keeps track of every flow it sees passing through the system. To do so, it calculates an eight-bit hash based on the socket associated with the flow, then uses the result as an index into an array of red-black trees. The data structure is designed, according to Eric, to scale well up to millions of concurrent flows. A number of parameters are associated with each flow, including its current transmission quota and, optionally, the time at which the next packet can be transmitted.
That transmission time is used to implement the TCP pacing support. If a given socket has a pace specified for it, FQ will calculate how far the packets should be spaced in time to conform to that pace. If a flow's next transmission time is in the future, that flow is added to another red-black tree with the transmission time used as the key; that tree, thus, allows the kernel to track delayed flows and quickly find the one whose next packet is due to go out the soonest. A single timer is then used, if needed, to ensure that said packet is transmitted at the right time.
The scheduler maintains two linked lists of active flows, the "new" and "old" lists. When a flow is first encountered, it is placed on the new list. The packet dispatcher services flows on the new list first; once a flow uses up its quota, that flow is moved to the old list. The idea here appears to be to give preferential treatment to new, short-lived connections — a DNS lookup or HTTP "GET" command, for example — and not let those connections be buried underneath larger, longer-lasting flows. Eventually the scheduler works its way through all active flows, sending a quota of data from each; then the process starts over.
There are a number of additional details, of course. There are limits on the amount of data queued for each flow, as well as a limit on the amount of data buffered within the scheduler as a whole; any packet that would exceed one of those limits is dropped. A special "internal" queue exists for high-priority traffic, allowing it to reach the wire more quickly. And so on.
One other detail is garbage collection. One problem with this kind of flow tracking is that nothing tells the scheduler when a particular flow is shut down; indeed, nothing can tell the scheduler for flows without local endpoints or for non-connection-oriented protocols. So the scheduler must figure out on its own when it can stop tracking any given flow. One way to do that would be to drop the flow as soon as there are no packets associated with it, but that would cause some thrashing as the queues empty and refill; it is better to keep flow data around for a little while in anticipation of more traffic. FQ handles this by putting idle flows into a special "detached" state, off the lists of active flows. Whenever a new flow is added, a pass is made over the associated red-black tree to clean out flows that have been detached for a sufficiently long time — three seconds in the current patch.
The result, says Eric, is fair scheduling of packets from any number of flows with nice spacing in situations where pacing is being used. Given that the comments so far have been mostly concerned with issues like white space, it seems unlikely that anybody is going to disagree with its merging. So TCP pacing and the FQ scheduler seem like candidates for the mainline in the near future — though the upcoming 3.12 cycle may be just a bit too near at this point.
Device namespaces
Lightweight virtualization using containers is a technique that has finally come together for Linux, though there are still some rough edges that may need filing down. Containers are created by using two separate kernel features: control groups (cgroups) and namespaces. Cgroups are in the process of being revamped, while there may still need to be more namespaces added to the varieties currently available. For example, there is no way to separate most devices into their own namespace. That's a hole that Oren Laadan would like to see filled, so he put out an RFC on device namespaces recently.
Namespaces partition global system resources so that different sets of processes have their own view of those resources. For example, mount namespaces partition the mounted filesystems into different views, with the result that processes in one namespace are unable to see or interact with filesystems that are only mounted in another. Similarly, PID namespaces give each namespace its own set of process IDs (PIDs). Certain devices are currently handled by their own namespace or similar functionality: network namespaces for network devices and the devpts pseudo-filesystem for pseudo-terminals (i.e. pty). But there is no way to partition the view of all devices in the system, which is what device namespaces would do.
The motivation for the feature is to allow multiple virtual phones to run on a single physical phone. For example, one could have two complete Android systems running on the phone, with one slated for work purposes and the other for personal use. Each system would run in its own container, isolated from the other. That isolation would allow companies to control the apps and security of the "company half" of a phone, while allowing the user to keep their personal information separate. A video gives an overview of the idea. Much of that separation can be done today, but there is a missing piece: virtualizing the devices (e.g. frame buffer, touchscreen, buttons).
The proposal adds the concept of an "active" device namespace, which is the one that the user is currently interacting with. The upshot is that a user could switch between the phone personalities (or personas) as easily as they switch between apps today. Each personality would have access to all of the capabilities of the phone while it was the active namespace, but none while it was the inactive (or background) namespace.
Setting up a device namespace is done in the normal way, using the clone(), setns(), or unshare() system calls. One surprise is that there is no new CLONE_* flag added for device namespaces, and the CLONE_NEWPID flag is overloaded. A comment in the code explains why:
    /*
     * Couple device namespace semantics with pid-namespace.
     * It's convenient, and we ran out of clone flags anyway.
     */

While coupling PID and device namespaces may work, it does seem like some kind of long-term solution to the clone flag problem is required. Once a process has been put into a device namespace, any open() of a namespace-aware device will restrict that device to the namespace.
At some level, adding device namespaces is simply a matter of virtualizing the major and minor device numbers so that each namespace has its own set of them. The major/minor numbers in a namespace would correspond to the driver loaded for that namespace. Drivers that might be available to multiple namespaces would need to be changed to be namespace-aware. For some kinds of drivers, for example those without any real state (e.g. for Android, the LED subsystem or the backlight/LCD subsystem), the changes would be minimal—essentially just a test. If the namespace that contains the device is the active one, proceed, otherwise, ignore any requested changes.
Devices, though, are sometimes stateful. One can't suddenly switch sending frame buffer data mid-stream (or mix two streams) and expect the screen contents to stay coherent. So, drivers and subsystems will need to handle the switching behavior. For example, the framebuffer device should only reflect changes to the screen from the active namespace, but it should buffer changes from the background namespace so that those changes will be reflected in the display after a switch.
Laadan and his colleagues at Cellrox have put together a set of patches based on the 3.4 kernel for the Android emulator (goldfish). There is also a fairly detailed description of the patches and the changes made for both stateless and stateful devices. An Android-based demo that switches between a running phone and an app that displays a changing color palette has also been created.
So far, there hasn't been much in the way of discussion of the idea on the containers and lxc-devel mailing lists that the RFC was posted to. On one hand, it makes sense to be able to virtualize all of the devices in a system, but on the other that means there are a lot of drivers that might need to change. There may be some "routing" issues to resolve, as well—when the phone rings, which namespace handles it? The existing proof-of-concept API for switching the active namespace would also likely need some work.
While it may be a worthwhile feature, it could also lead to a large ripple effect of driver changes. How device namespaces fare in terms of moving toward the mainline may well hinge on others stepping forward with additional use cases. In the end, though, the core changes to support the feature are fairly small, so the phone personality use case might be enough all on its own.
Cramming more into struct page
As a general rule, kernel developers prefer data structures that are designed for readability and maintainability. When one understands the data structures used by a piece of code, an understanding of the code itself is usually not far away. So it might come as a surprise that one of the kernel's most heavily-used data structures is also among its least comprehensible. That data structure is struct page, which represents a page of physical memory. A recent patch set making struct page even more complicated provides an excuse for a quick overview of how this structure is used.

On most Linux systems, a page of physical memory contains 4096 bytes; that means that a typical system contains millions of pages. Management of those pages requires the maintenance of a page structure for each of those physical pages. That puts a lot of pressure on the size of struct page; expanding it by a single byte will cause the kernel's memory use to grow by (possibly many) megabytes. That creates a situation where almost any trick is justified if it can avoid making struct page bigger.
Enter Joonsoo Kim, who has posted a patch set aimed at squeezing more information into struct page without making it any bigger. In particular, he is concerned about the space occupied by struct slab, which is used by the slab memory allocator (one of three allocators that can be configured into the kernel, the others being called SLUB and SLOB). A slab can be thought of as one or more contiguous pages containing an array of structures, each of which can be allocated separately; for example, the kmalloc-64 slab holds 64-byte chunks used to satisfy kmalloc() calls requesting between 32 and 64 bytes of space. The associated slab structures are also used in great quantity; /proc/slabinfo on your editor's system shows over 28,000 active slabs for the ext4 inode cache alone. A reduction in that space use would be welcome; Joonsoo thinks this can be done — by folding the contents of struct slab into the page structure representing the memory containing the slab itself.
What's in struct page
Joonsoo's patch is perhaps best understood by stepping through struct page and noting the changes that are made to accommodate the extra data. The full definition of this structure can be found in <linux/mm_types.h> for the curious. The first field appears simple enough:
unsigned long flags;
This field holds flags describing the state of the page: dirty, locked, under writeback, etc. In truth, though, this field is not as simple as it seems; even the question of whether the kernel is running out of room for page flags is hard to answer. See this article for some details on how the flags field is used.
Following flags is:
struct address_space *mapping;
For pages that are in the page cache (a large portion of the pages on most systems), mapping points to the information needed to access the file that backs up the page. If, however, the page is an anonymous page (user-space memory backed by swap), then mapping will point to an anon_vma structure, allowing the kernel to quickly find the page tables that refer to this page; see this article for a diagram and details. To avoid confusion between the two types of page, anonymous pages will have the least-significant bit set in mapping; since the pointer itself is always aligned to at least a word boundary, that bit would otherwise be clear.
This is the first place where Joonsoo's patch makes a change. The mapping field is not currently used for kernel-space memory, so he is able to use it as a pointer to the first free object in the slab, eliminating the need to keep it in struct slab.
Next is where things start to get complicated:
    struct {
        union {
            pgoff_t index;
            void *freelist;
            bool pfmemalloc;
        };
        union {
            unsigned long counters;
            struct {
                union {
                    atomic_t _mapcount;
                    struct {            /* SLUB */
                        unsigned inuse:16;
                        unsigned objects:15;
                        unsigned frozen:1;
                    };
                    int units;
                };
                atomic_t _count;
            };
        };
    };
(Note that this piece has been somewhat simplified through the removal of some #ifdefs and a fair number of comments). In the first union, index is used with page-cache pages to hold the offset into the associated file. If, instead, the page is managed by the SLUB or SLOB allocators, freelist points to a list of free objects. The slab allocator does not use freelist, but Joonsoo's patch makes slab use it the same way the other allocators do. The pfmemalloc member, instead, acts like a page flag; it is set on a free page if memory is tight and the page should only be used as part of an effort to free more pages.
In the second union, both counters and the innermost anonymous struct are used by the SLUB allocator, while units is used by the SLOB allocator. The _mapcount and _count fields are both usage counts for the page; _mapcount is the number of page-table entries pointing to the page, while _count is a general reference count. There are a number of subtleties around the use of these fields, though, especially _mapcount, which helps with the management of compound pages as well. Here, Joonsoo adds another field to the second union:
unsigned int active; /* SLAB */
It is the count of active objects, again taken from struct slab.
Next we have:
    union {
        struct list_head lru;
        struct {
            struct page *next;
            int pages;
            int pobjects;
        };
        struct list_head list;
        struct slab *slab_page;
    };
For anonymous and page-cache pages, lru holds the page's position in one of the least-frequently-used lists. The anonymous structure is used by SLUB, while list is used by SLOB. The slab allocator uses slab_page to refer back to the containing slab structure. Joonsoo's patch complicates things here in an interesting way: he overlays an rcu_head structure over lru to manage the freeing of the associated slab using read-copy-update. Arguably that structure should be added to the containing union, but the current code just uses lru and casts instead. This trick will also involve moving slab_page to somewhere else in the structure, but the current patch set does not contain that change.
The next piece is:
    union {
        unsigned long private;
    #if USE_SPLIT_PTLOCKS
        spinlock_t ptl;
    #endif
        struct kmem_cache *slab_cache;
        struct page *first_page;
    };
The private field essentially belongs to whatever kernel subsystem has allocated the page; it sees a number of uses throughout the kernel. Filesystems, in particular, make heavy use of it. The ptl field is used if the page is used by the kernel to hold page tables; it allows the page table lock to be split into multiple locks if the number of CPUs justifies it. In most configurations, a system containing four or more processors will split the locks in this way. slab_cache is used as a back pointer by slab and SLUB, while first_page is used within compound pages to point to the first page in the set.
After this union, one finds:
    #if defined(WANT_PAGE_VIRTUAL)
        void *virtual;
    #endif /* WANT_PAGE_VIRTUAL */
This field, if it exists at all, contains the kernel virtual address for the page. It is not useful in many situations because that address is easily calculated when it is needed. For systems where high memory is in use (generally 32-bit systems with 1GB or more of memory), virtual holds the address of high-memory pages that have been temporarily mapped into the kernel with kmap(). Following virtual are a couple of optional fields used when various debugging options are turned on.
With the changes described above, Joonsoo's patch moves much of the information previously kept in struct slab into the page structure. The remaining fields are eliminated in other ways, leaving struct slab with nothing to hold and, thus, no further reason to exist. These structures are not huge, but, given that there can be tens of thousands of them (or more) in a running system, the memory savings from their elimination can be significant. Concentrating activity on struct page may also have beneficial cache effects, improving performance overall. So the patches may well be worthwhile, even at the cost of complicating an already complex situation.
And the situation is indeed complex: struct page is a complicated structure with a number of subtle rules regarding its use. The saving grace, perhaps, is that it is so heavily used that any kind of misunderstanding about the rules will lead quickly to serious problems. Still, trying to put more information into this structure is not a task for the faint of heart. Whether Joonsoo will succeed remains to be seen, but he clearly is not the first to eye struct page as a place to stash some useful memory management information.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Networking
Security-related
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet