Kernel development
Brief items
Kernel release status
The 3.17 kernel was released on October 5 (announcement), so there is no development kernel as of this writing. The flow of patches into the mainline repository for the 3.18 development cycle has begun; see the separate article below for details on what has been merged so far.

Stable updates: 3.16.4, 3.14.20, and 3.10.56 were released on October 5. The 3.10.57, 3.14.21, and 3.16.5 updates are in the review process as of this writing; they can be expected on or after October 9.
Quotes of the week
A Git repository for Fedora's kernel
Josh Boyer describes the new repository he has put together to hold the Fedora kernel. "I thought the tree itself could still be more useful to people as well. As the last commit of each update, I include the generated kernel configs that we build the kernel with. Our config setup in the package is somewhat confusing to people that don't work with it daily, and it's not obvious how all the fragments go together. This has them all in one place."
Kernel development news
3.18 Merge window part 1
Linus had stated his intent to take a week off from merging patches before starting the 3.18 merge window around October 12. Even so, somehow, a few thousand (2936, to be precise) changesets showed up in the mainline repository when nobody was looking. The initial pulls focus on driver code (including the staging tree), but there are a few other items in the mix as well.

User-visible changes merged so far include:
- The arm64 architecture now has support for just-in-time compilation
of extended Berkeley packet filter (eBPF) programs.
- The cryptographic layer has gained support for multibuffer
operations. The idea here is to use parallel hardware operations to
perform the same transform on multiple buffers concurrently. In 3.18,
there is an implementation of SHA1 that can make use of this feature.
- The NFS server now supports the NFS 4.2 SEEK operation,
allowing the implementation of the Linux SEEK_HOLE and
SEEK_DATA lseek() options.
- The F2FS filesystem supports atomic writes (where a series of
operations succeeds or fails as a unit) via filesystem-specific
ioctl() operations. F2FS has also gained support for the
FITRIM (discard) operation.
- New hardware support includes:
- Human input/output:
PenMount 6000 touch controllers,
TI DRV260X and DRV2667 haptic controllers,
TI Palmas power buttons, and
MAXIM MAX77693 haptic controllers.
- Miscellaneous:
Freescale i.MX21 pin control units,
Qualcomm APQ8084 pin controllers,
Broadcom BCM53xx SPI controllers,
APM X-Gene true random number generators,
Maxim MAX5821 digital-to-analog converters,
Bosch BMC150 accelerometers,
Bosch BMG160 tri-axis gyro sensors,
Texas Instruments ADC128S052 analog-to-digital converters (ADCs),
Rockchip SARADC ADCs,
Dyna Image AL3320A ambient light sensors,
Fintek F81216A LPC to 4 UARTs,
Amlogic Meson serial ports, and
Mediatek serial ports.
- Regulators:
Dialog Semiconductor DA9213 regulators,
HiSilicon Hi6421 PMIC voltage regulators,
Intersil ISL9305 regulators,
Qualcomm RPM regulators,
Maxim 77802 regulators,
Rockchip RK808 power regulators, and
Ricoh RN5T618 voltage regulators.
- USB:
Xilinx USB2 peripheral controllers,
STMicroelectronics OHCI/EHCI controllers,
Renesas R-Car generation 2 USB PHYs,
STMicroelectronics USB2 picoPHYs,
STMicroelectronics STiH41x USB transceivers,
National Instruments USB-6501 controllers, and
Richtek RT8973A USB port accessory detectors.
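The SEEK_HOLE and SEEK_DATA options mentioned above are exercised from user space with an ordinary lseek() call. As a minimal sketch (the temporary-file setup is just for illustration):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Locate the first data region and the first hole in a file. */
off_t first_data(int fd) { return lseek(fd, 0, SEEK_DATA); }
off_t first_hole(int fd) { return lseek(fd, 0, SEEK_HOLE); }

/* Build a sample file with 16 bytes of data at the start. */
int make_sample(void)
{
	char path[] = "/tmp/seekXXXXXX";
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);
	if (write(fd, "0123456789abcdef", 16) != 16)
		return -1;
	return fd;
}
```

Note that every file has an implicit hole at end-of-file, so SEEK_HOLE on a fully-populated file returns the file size rather than an error.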
Changes visible to kernel developers include:
- A few kernel "tinification" patches have been merged for those trying
to build the smallest kernels possible. In 3.18 the table describing
processor capability bits and the madvise() and
fadvise() system calls can be configured out.
- Module parameters can be defined with a new "unsafe" flag; any attempt
to modify such a parameter will generate a warning and taint the
kernel. The module_param_unsafe() macro can be used to set
up such parameters.
- Kernel modules can now be installed in compressed form by the build
system.
- The driver core has a new "device coredump" mechanism that can be used to obtain core dumps and other diagnostic information from peripheral devices. It is intended to be used as an aid for firmware debugging.
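The unsafe-parameter mechanism mentioned above can be sketched as follows; module_param_unsafe() takes the same name/type/permission arguments as the familiar module_param() macro (the module and parameter name here are invented for illustration):

```c
#include <linux/module.h>
#include <linux/moduleparam.h>

/* Hypothetical tuning knob; writing to it will warn and taint the kernel. */
static int risky_timeout = 100;
module_param_unsafe(risky_timeout, int, 0644);
MODULE_PARM_DESC(risky_timeout,
		 "Unsupported timeout override (setting this taints the kernel)");

MODULE_LICENSE("GPL");
```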
As can be seen, the 3.18 merge window has barely gotten started so far. The pace can be expected to pick up in the near future, once Linus completes his travels and arrives in Düsseldorf for LinuxCon / CloudOpen / ELCE / KVM Forum / LPC / etc.
Bulk network packet transmission
One of the keys to good performance on contemporary systems is batching — getting a lot of work done relative to a given fixed cost. If, for example, a lock must be acquired to do one unit of work in a specific subsystem, doing multiple units of work while the lock is held will reduce the overall overhead of the system. Much of the scalability work that has been done in recent years has, in some way, related to increasing batching where possible. Some recent changes in the networking subsystem show that batching can improve performance there as well.

Every time a packet is transmitted over the network, a sequence of operations must be performed. These include acquiring the lock for the queue of outgoing packets, passing a packet to the driver, putting the packet in the device's transmit queue, and telling the device to start transmitting. Some of those operations are inherently per-packet, but others are not. The acquisition of the queue lock could be amortized across multiple packet transmissions, for example, and the act of telling the device to start transmission may be expensive indeed; it can involve hardware operations or even, on some systems, hypervisor calls.
Often, when there is one packet to transmit, there are others waiting in the queue as well; network traffic can be inherently bursty. So it would make sense for the networking stack to try to split the fixed costs of starting packet transmission across as many packets as possible. Some techniques, such as segmentation offload (wherein the network interface splits large chunks of data into packets) perform that kind of batching. But, in current kernels, if the networking stack has a set of packets ready to go, they will be sent out the slow way, one at a time.
That situation will begin to change in 3.18, when a relatively small set of changes will be merged. Consider the function exported by drivers now to send a packet:
netdev_tx_t (*ndo_start_xmit) (struct sk_buff *skb, struct net_device *dev);
This function takes the packet pointed to by skb and transmits it via the specified dev. Every call is a standalone operation, with all the associated fixed costs. The initial plan for 3.18 was to specify a new function that drivers could provide:
void (*ndo_xmit_flush)(struct net_device *dev, u16 queue);
If a driver provided this function, it was indicating to the networking stack that it is prepared for (and can benefit from) batched transmission. In this case, the networking stack could make multiple calls to ndo_start_xmit() to queue packets for transmission; the driver would accept them, but not actually start the transmission operation. At the end of a sequence of such calls, ndo_xmit_flush() would be called to indicate the end; at that point, actual hardware transmission would be started.
There were concerns, though, that putting another indirect function call into the transmit path would add too much overhead, so this particular function was ripped out almost as soon as it landed in the net-next repository. In its place, the sk_buff structure has gained a new Boolean variable called xmit_more. If that variable is true, then there are more packets coming and the driver can defer starting hardware transmission. This variable takes out the extra function call while making the needed information available to drivers that can make use of it.
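From a driver's perspective, the new flag is simply a hint that the expensive "doorbell" operation can be deferred. A hypothetical transmit routine might look like this sketch (all mydrv_* names are invented):

```c
/* Sketch of a driver transmit path that honors skb->xmit_more. */
static netdev_tx_t mydrv_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct mydrv_priv *priv = netdev_priv(dev);

	mydrv_queue_descriptor(priv, skb);	/* place packet in the TX ring */

	/*
	 * Only kick the hardware when the stack says no more packets are
	 * coming, or when the ring is full and must be drained anyway.
	 */
	if (!skb->xmit_more || mydrv_ring_full(priv))
		mydrv_ring_doorbell(priv);

	return NETDEV_TX_OK;
}
```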
This mechanism, added by David Miller, makes batching possible. A couple of drivers were fixed to support batching, but David did not change the networking stack to actually do the batching. That work fell to Jesper Dangaard Brouer, whose bulk dequeue support patches have also been merged for 3.18. This work, too, is limited in scope; in particular, it will only work with queuing disciplines that have a single transmit queue.
The change Jesper made is simple enough: in a situation where a packet is being transmitted, the stack will attempt to send out a series of packets together while the queue lock is held. The byte queue limits mechanism is used to put an upper bound on the amount of data that can be in flight at once. Once the limit is hit (or the supply of packets runs out), skb->xmit_more will be set to false and the traffic will be on its way.
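The effect of that dequeue loop can be modeled in plain user-space C: keep pulling packets while a byte budget (standing in for byte queue limits) remains, marking every packet except the last with xmit_more. This is a simplified model of the logic, not the kernel code itself:

```c
#include <assert.h>	/* for the usage check below */
#include <stdbool.h>
#include <stddef.h>

struct pkt { int len; bool xmit_more; };

/*
 * Model of bulk dequeue: take packets from queue[] until the byte budget
 * is exhausted or the queue empties.  Every dequeued packet except the
 * last has xmit_more set, telling the "driver" to hold off the doorbell.
 * Returns the number of packets dequeued.
 */
size_t bulk_dequeue(struct pkt *queue, size_t n, int budget)
{
	size_t sent = 0;
	int bytes = 0;

	while (sent < n && bytes < budget) {
		bytes += queue[sent].len;
		queue[sent].xmit_more = true;	/* assume more will follow */
		sent++;
	}
	if (sent > 0)
		queue[sent - 1].xmit_more = false; /* last one starts the hardware */
	return sent;
}
```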
Eric Dumazet looked at the patch set and realized that things could be taken a bit further: the process of validating packets for transmission could be moved outside of the queue lock entirely, increasing concurrency in the system. The resulting patch had benefits that Eric described as awesome: full 40Gb/sec wire speed, even in the absence of segmentation offload. Needless to say, this patch, too, has been accepted into the net-next tree for the 3.18 merge window.
All told, the changes are relatively small. But small changes can have big effects when they are applied to the right places. These little changes should help to ensure that the networking stack in the 3.18 release is the fastest yet.
Page faults in user space: MADV_USERFAULT, remap_anon_range(), and userfaultfd()
The quest for performance often seems to lead developers to want to move functionality out of the kernel and back into user space, where a dedicated application can, in theory, make things happen more quickly. Networking functions are often handled in this way, for example. A desire to move memory-management functions into user space is somewhat less common, but, as can be seen from Andrea Arcangeli's user-space page fault handling patch set, it is not unheard of.

Page-fault handling usually requires fetching data from secondary storage and placing it in the correct place in the faulting process's address space. Why would one want to do that in user space? The primary use case here is the live migration of virtual machines running under KVM. Migration requires moving the virtual machine's memory, which can take a long time, but the owner of that machine would like to see as brief an outage as possible while the migration is happening. Preferably, the migration would not be noticeable at all. One way to approach that goal is to move the minimal amount of memory needed to represent the virtual machine on the new host. Once the machine starts running in the new location, it will certainly try to access pages which have not yet been moved. If the (user-space) virtual machine manager can catch the resulting page faults, it can prioritize the transfer of the pages the running machine actually needs. It is, in other words, a form of cross-host demand paging that makes migration happen with lower latency.
Other uses — shared memory distributed across the network, for example — are possible as well.
The patch set starts by adding a couple of new variants to the get_user_pages() function, which is charged with making user-space pages accessible to the kernel:
long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
			   unsigned long start, unsigned long nr_pages,
			   int write, int force, struct page **pages,
			   int *locked);
long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
			     unsigned long start, unsigned long nr_pages,
			     int write, int force, struct page **pages);
The former version is intended to be called with the mmap_sem semaphore held. It may release that semaphore while running, in which case *locked will be set to zero. The second form, instead, assumes that mmap_sem is not held. Using these functions in the kernel improves performance by allowing mmap_sem to be dropped while page-fault handling is in progress. That is useful even in current kernels, but, if handling of faults is going to be entrusted to user space, it will become necessary. Holding mmap_sem while calling out to user space would not be a recipe for happy times.
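An in-kernel caller of the locked variant would follow this pattern (a sketch; the surrounding function is invented for illustration):

```c
/* Sketch of a get_user_pages_locked() caller (invented context). */
static long pin_user_range(struct task_struct *tsk, struct mm_struct *mm,
			   unsigned long start, unsigned long nr_pages,
			   struct page **pages)
{
	int locked = 1;
	long ret;

	down_read(&mm->mmap_sem);
	ret = get_user_pages_locked(tsk, mm, start, nr_pages,
				    1 /* write */, 0 /* force */,
				    pages, &locked);
	/* mmap_sem may have been dropped while faults were handled. */
	if (locked)
		up_read(&mm->mmap_sem);
	return ret;
}
```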
The next step is to add the MADV_USERFAULT flag to the madvise() system call. If that flag is set on a region of memory, the kernel will no longer attempt to resolve page faults in that region. Instead, in the absence of other measures (described below), the faulting process will receive a SIGBUS signal. That, of course, leaves the process in the position of having to resolve the page fault on its own. A tool provided to help with that task is the new remap_anon_pages() system call:
int remap_anon_pages(void *dest, void *src, unsigned long len, unsigned long flags);
This system call will take the pages holding len bytes starting at src and move them in the process's address space to the region starting at dest. A number of conditions must be met for this operation to succeed, starting with the fact that the full range in dest must currently be unmapped — remap_anon_pages() will not overwrite an existing page mapping. The range in src, instead, must all be present and mapped, and the pages cannot be shared with other processes. All of these rules exist to simplify the implementation, but also to try to catch race conditions in user-space fault handling.
If src is a huge page, and len is a multiple of 2MB, then the full huge page(s) will be moved to dest without being split.
With this mechanism in place, an application's SIGBUS signal handler can respond to a fault by allocating memory, filling it with the needed contents, and mapping it into the proper location with remap_anon_pages(). Once the signal handler returns, the page fault will be retried, but, this time, the needed memory will be in place, so application execution will continue.
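That signal-handler flow might look like the following sketch. Note that remap_anon_pages() is a proposed system call from this patch set, so it is shown as a raw syscall() invocation; fill_page() and __NR_remap_anon_pages are invented names for illustration:

```c
/*
 * Sketch of a SIGBUS-based user fault handler using the proposed
 * remap_anon_pages() interface (hypothetical syscall number and
 * fill_page() helper).
 */
static void sigbus_handler(int sig, siginfo_t *info, void *ctx)
{
	void *fault_page = (void *)((unsigned long)info->si_addr & PAGE_MASK);

	/* Allocate a staging page and populate it with the needed data. */
	void *staging = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
			     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	fill_page(staging, fault_page);

	/* Move the staging page into place; the fault is then retried. */
	syscall(__NR_remap_anon_pages, fault_page, staging, PAGE_SIZE, 0);
}
```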
Anybody who has worked with signal handlers on Unix-like systems is probably thinking at this point that all that work does not belong in such a handler. And, indeed, signal handlers are not the way that processes are expected to deal with page-fault handling. To make life easier, Andrea adds another system call:
int userfaultfd(int flags);
This call will return an open file descriptor which may be used to communicate with the kernel about page fault handling. The flags argument is mostly unused, though O_NONBLOCK may be provided to request non-blocking behavior.
The first step after acquiring the file descriptor is for the application to write a 64-bit integer indicating which version of the userfault protocol it understands. The kernel will respond with the same number if the protocol is supported, -1 otherwise. Once agreement has been reached in that area, the application can read a 64-bit address whenever a page fault occurs. It should resolve the fault, then write back two pointers indicating the range of memory which has been mapped in response to the fault.
The idea here is that a process can dedicate a thread to page-fault handling. Whenever a fault occurs, the faulting thread will pause while the handler thread puts things in place. No SIGBUS signals will be delivered if userfaultfd() has been called. So, for the faulting thread, life just continues as usual, with the possible exception that some page faults may take longer to handle than one might expect.
As was mentioned above, there might be multiple use cases for user-space page fault handling. What if a single application wishes to exercise more than one of those cases? To that end, the application can open more than one file descriptor with userfaultfd() and restrict each to a specific range of memory. That restriction is requested by writing two pointers indicating the range to be covered; the least-significant bit should be set on the start pointer. Thereafter, only faults within the given range will be directed to that file descriptor. The application must still set MADV_USERFAULT on the ranges in question. Multiple ranges can be set up to go to a single file descriptor, but a given range of memory can only have its faults handled by a single descriptor.
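A handler thread speaking the protocol described above might be structured like this sketch. This is the proposal's read/write protocol, not a stable ABI, and USERFAULT_PROTOCOL and resolve_fault() are invented names:

```c
/*
 * Sketch of the userfaultfd() handler-thread protocol as proposed in
 * this patch set (not a stable ABI; helper names are invented).
 */
static void handle_faults(int ufd)
{
	uint64_t proto = USERFAULT_PROTOCOL;
	uint64_t addr;
	uint64_t range[2];

	/* Handshake: propose a protocol version, read back the answer. */
	write(ufd, &proto, sizeof(proto));
	read(ufd, &proto, sizeof(proto));
	if (proto == (uint64_t)-1)
		return;		/* protocol not supported */

	for (;;) {
		/* Each read yields the address of a pending fault. */
		if (read(ufd, &addr, sizeof(addr)) != sizeof(addr))
			break;
		resolve_fault(addr);	/* e.g. fetch the page and map it */

		/* Report the range that has now been mapped in. */
		range[0] = addr & ~(PAGE_SIZE - 1);
		range[1] = range[0] + PAGE_SIZE;
		write(ufd, range, sizeof(range));
	}
}
```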
The bulk of the commentary on the patch set has been around the remap_anon_pages() system call. Linus initially wondered whether remap_anon_pages() made more sense than remap_file_pages(), which he called an "unmitigated disaster" and which may be removed in the near future. Later he added that he would prefer an interface where the fault-handler process would simply write() the data to the page of interest, causing it to be allocated and mapped. Andrea responded that such an interface might be possible; the handler would write the data to the userfaultfd() file descriptor and the kernel would handle the rest. But he worried about losing the zero-copy behavior that was carefully designed into the current interface. Linus's answer to that made it clear that he was not concerned about zero-copy behavior, which, he said, is almost never worth the cost of implementing it.
What we may see is that the get_user_pages() optimizations will find their way in relatively soon, though Linus wasn't entirely happy with those either. The remaining work will take a while longer, and the end result seems unlikely to include remap_anon_pages(). But, given that the use case is real, a significant improvement to live migration is going to be hard to turn down in the long run.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Device driver infrastructure
Documentation
Filesystems and block I/O
Memory management
Networking
Security-related
Virtualization and containers
Page editor: Jonathan Corbet