Kernel development
Brief items
Kernel release status
The 3.17 kernel was released on October 5 (announcement), so there is no development kernel as of this writing. The flow of patches into the mainline repository for the 3.18 development cycle has begun; see the separate article below for details on what has been merged so far.

Stable updates: 3.16.4, 3.14.20, and 3.10.56 were released on October 5. The 3.10.57, 3.14.21, and 3.16.5 updates are in the review process as of this writing; they can be expected on or after October 9.
Quotes of the week
A Git repository for Fedora's kernel
Josh Boyer describes the new repository he has put together to hold the Fedora kernel. "I thought the tree itself could still be more useful to people as well. As the last commit of each update, I include the generated kernel configs that we build the kernel with. Our config setup in the package is somewhat confusing to people that don't work with it daily, and it's not obvious how all the fragments go together. This has them all in one place."
Kernel development news
3.18 Merge window part 1
Linus had stated his intent to take a week off from merging patches before starting the 3.18 merge window around October 12. Even so, somehow, a few thousand (2936, to be precise) changesets showed up in the mainline repository when nobody was looking. The initial pulls focus on driver code (including the staging tree), but there are a few other items in the mix as well.

User-visible changes merged so far include:
- The arm64 architecture now has support for just-in-time compilation
of extended Berkeley packet filter (eBPF) programs.
- The cryptographic layer has gained support for multibuffer
operations. The idea here is to use parallel hardware operations to
perform the same transform on multiple buffers concurrently. In 3.18,
there is an implementation of SHA1 that can make use of this feature.
- The NFS server now supports the NFS 4.2 SEEK operation,
allowing the implementation of the Linux SEEK_HOLE and
SEEK_DATA lseek() options.
- The F2FS filesystem supports atomic writes (where a series of
operations succeeds or fails as a unit) via filesystem-specific
ioctl() operations. F2FS has also gained support for the
FITRIM (discard) operation.
- New hardware support includes:
- Human input/output:
PenMount 6000 touch controllers,
TI DRV260X and DRV2667 haptic controllers,
TI Palmas power buttons, and
MAXIM MAX77693 haptic controllers.
- Miscellaneous:
Freescale i.MX21 pin control units,
Qualcomm APQ8084 pin controllers,
Broadcom BCM53xx SPI controllers,
APM X-Gene true random number generators,
Maxim MAX5821 digital-to-analog converters,
Bosch BMC150 accelerometers,
Bosch BMG160 tri-axis gyro sensors,
Texas Instruments ADC128S052 analog-to-digital converters (ADCs),
Rockchip SARADC ADCs,
Dyna Image AL3320A ambient light sensors,
Fintek F81216A LPC to 4 UARTs,
Amlogic Meson serial ports, and
Mediatek serial ports.
- Regulators:
Dialog Semiconductor DA9213 regulators,
HiSilicon Hi6421 PMIC voltage regulators,
Intersil ISL9305 regulators,
Qualcomm RPM regulators,
Maxim 77802 regulators,
Rockchip RK808 power regulators, and
Ricoh RN5T618 voltage regulators.
- USB:
Xilinx USB2 peripheral controllers,
STMicroelectronics OHCI/EHCI controllers,
Renesas R-Car generation 2 USB PHYs,
STMicroelectronics USB2 picoPHYs,
STMicroelectronics STiH41x USB transceivers,
National Instruments USB-6501 controllers, and
Richtek RT8973A USB port accessory detectors.
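The SEEK_HOLE and SEEK_DATA options mentioned above are exercised from user space with an ordinary lseek() call. As a minimal sketch (the temporary-file setup is just for illustration):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Locate the first data region and the first hole in a file. */
off_t first_data(int fd) { return lseek(fd, 0, SEEK_DATA); }
off_t first_hole(int fd) { return lseek(fd, 0, SEEK_HOLE); }

/* Build a sample file with 16 bytes of data at the start. */
int make_sample(void)
{
	char path[] = "/tmp/seekXXXXXX";
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);
	if (write(fd, "0123456789abcdef", 16) != 16)
		return -1;
	return fd;
}
```

Note that every file has an implicit hole at end-of-file, so SEEK_HOLE on a fully-populated file returns the file size rather than an error.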
Changes visible to kernel developers include:
- A few kernel "tinification" patches have been merged for those trying
to build the smallest kernels possible. In 3.18 the table describing
processor capability bits and the madvise() and
fadvise() system calls can be configured out.
- Module parameters can be defined with a new "unsafe" flag; any attempt
to modify such a parameter will generate a warning and taint the
kernel. The module_param_unsafe() macro can be used to set
up such parameters.
- Kernel modules can now be installed in compressed form by the build
system.
- The driver core has a new "device coredump" mechanism that can be used to obtain core dumps and other diagnostic information from peripheral devices. It is intended to be used as an aid for firmware debugging.
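The unsafe-parameter mechanism mentioned above can be sketched as follows; module_param_unsafe() takes the same name/type/permission arguments as the familiar module_param() macro (the module and parameter name here are invented for illustration):

```c
#include <linux/module.h>
#include <linux/moduleparam.h>

/* Hypothetical tuning knob; writing to it will warn and taint the kernel. */
static int risky_timeout = 100;
module_param_unsafe(risky_timeout, int, 0644);
MODULE_PARM_DESC(risky_timeout,
		 "Unsupported timeout override (setting this taints the kernel)");

MODULE_LICENSE("GPL");
```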
As can be seen, the 3.18 merge window has barely gotten started so far. The pace can be expected to pick up in the near future, once Linus completes his travels and arrives in Düsseldorf for LinuxCon / CloudOpen / ELCE / KVM Forum / LPC / etc.
Bulk network packet transmission
One of the keys to good performance on contemporary systems is batching — getting a lot of work done relative to a given fixed cost. If, for example, a lock must be acquired to do one unit of work in a specific subsystem, doing multiple units of work while the lock is held will reduce the overall overhead of the system. Much of the scalability work that has been done in recent years has, in some way, related to increasing batching where possible. Some recent changes in the networking subsystem show that batching can improve performance there as well.

Every time a packet is transmitted over the network, a sequence of operations must be performed. These include acquiring the lock for the queue of outgoing packets, passing a packet to the driver, putting the packet in the device's transmit queue, and telling the device to start transmitting. Some of those operations are inherently per-packet, but others are not. The acquisition of the queue lock could be amortized across multiple packet transmissions, for example, and the act of telling the device to start transmission may be expensive indeed; it can involve hardware operations or even, on some systems, hypervisor calls.
Often, when there is one packet to transmit, there are others waiting in the queue as well; network traffic can be inherently bursty. So it would make sense for the networking stack to try to split the fixed costs of starting packet transmission across as many packets as possible. Some techniques, such as segmentation offload (wherein the network interface splits large chunks of data into packets) perform that kind of batching. But, in current kernels, if the networking stack has a set of packets ready to go, they will be sent out the slow way, one at a time.
That situation will begin to change in 3.18, when a relatively small set of changes will be merged. Consider the function exported by drivers now to send a packet:
netdev_tx_t (*ndo_start_xmit) (struct sk_buff *skb, struct net_device *dev);
This function takes the packet pointed to by skb and transmits it via the specified dev. Every call is a standalone operation, with all the associated fixed costs. The initial plan for 3.18 was to specify a new function that drivers could provide:
void (*ndo_xmit_flush)(struct net_device *dev, u16 queue);
If a driver provided this function, it was indicating to the networking stack that it is prepared for (and can benefit from) batched transmission. In this case, the networking stack could make multiple calls to ndo_start_xmit() to queue packets for transmission; the driver would accept them, but not actually start the transmission operation. At the end of a sequence of such calls, ndo_xmit_flush() would be called to indicate the end; at that point, actual hardware transmission would be started.
There were concerns, though, that putting another indirect function call into the transmit path would add too much overhead, so this particular function was ripped out almost as soon as it landed in the net-next repository. In its place, the sk_buff structure has gained a new Boolean variable called xmit_more. If that variable is true, then there are more packets coming and the driver can defer starting hardware transmission. This variable takes out the extra function call while making the needed information available to drivers that can make use of it.
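From a driver's perspective, the new flag is simply a hint that the expensive "doorbell" operation can be deferred. A hypothetical transmit routine might look like this sketch (all mydrv_* names are invented):

```c
/* Sketch of a driver transmit path that honors skb->xmit_more. */
static netdev_tx_t mydrv_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct mydrv_priv *priv = netdev_priv(dev);

	mydrv_queue_descriptor(priv, skb);	/* place packet in the TX ring */

	/*
	 * Only kick the hardware when the stack says no more packets are
	 * coming, or when the ring is full and must be drained anyway.
	 */
	if (!skb->xmit_more || mydrv_ring_full(priv))
		mydrv_ring_doorbell(priv);

	return NETDEV_TX_OK;
}
```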
This mechanism, added by David Miller, makes batching possible. A couple of drivers were fixed to support batching, but David did not change the networking stack to actually do the batching. That work fell to Jesper Dangaard Brouer, whose bulk dequeue support patches have also been merged for 3.18. This work, too, is limited in scope; in particular, it will only work with queuing disciplines that have a single transmit queue.
The change Jesper made is simple enough: in a situation where a packet is being transmitted, the stack will attempt to send out a series of packets together while the queue lock is held. The byte queue limits mechanism is used to put an upper bound on the amount of data that can be in flight at once. Once the limit is hit (or the supply of packets runs out), skb->xmit_more will be set to false and the traffic will be on its way.
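The effect of that dequeue loop can be modeled in plain user-space C: keep pulling packets while a byte budget (standing in for byte queue limits) remains, marking every packet except the last with xmit_more. This is a simplified model of the logic, not the kernel code itself:

```c
#include <assert.h>	/* for the usage check below */
#include <stdbool.h>
#include <stddef.h>

struct pkt { int len; bool xmit_more; };

/*
 * Model of bulk dequeue: take packets from queue[] until the byte budget
 * is exhausted or the queue empties.  Every dequeued packet except the
 * last has xmit_more set, telling the "driver" to hold off the doorbell.
 * Returns the number of packets dequeued.
 */
size_t bulk_dequeue(struct pkt *queue, size_t n, int budget)
{
	size_t sent = 0;
	int bytes = 0;

	while (sent < n && bytes < budget) {
		bytes += queue[sent].len;
		queue[sent].xmit_more = true;	/* assume more will follow */
		sent++;
	}
	if (sent > 0)
		queue[sent - 1].xmit_more = false; /* last one starts the hardware */
	return sent;
}
```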
Eric Dumazet looked at the patch set and realized that things could be taken a bit further: the process of validating packets for transmission could be moved outside of the queue lock entirely, increasing concurrency in the system. The resulting patch had benefits that Eric described as awesome: full 40Gb/sec wire speed, even in the absence of segmentation offload. Needless to say, this patch, too, has been accepted into the net-next tree for the 3.18 merge window.
All told, the changes are relatively small. But small changes can have big effects when they are applied to the right places. These little changes should help to ensure that the networking stack in the 3.18 release is the fastest yet.
Page faults in user space: MADV_USERFAULT, remap_anon_range(), and userfaultfd()
The quest for performance often seems to lead developers to want to move functionality out of the kernel and back into user space, where a dedicated application can, in theory, make things happen more quickly. Networking functions are often handled in this way, for example. A desire to move memory-management functions into user space is somewhat less common, but, as can be seen from Andrea Arcangeli's user-space page fault handling patch set, it is not unheard of.

Page-fault handling usually requires fetching data from secondary storage and placing it in the correct place in the faulting process's address space. Why would one want to do that in user space? The primary use case here is the live migration of virtual machines running under KVM. Migration requires moving the virtual machine's memory, which can take a long time, but the owner of that machine would like to see as brief an outage as possible while the migration is happening. Preferably, the migration would not be noticeable at all. One way to approach that goal is to move the minimal amount of memory needed to represent the virtual machine on the new host. Once the machine starts running in the new location, it will certainly try to access pages which have not yet been moved. If the (user-space) virtual machine manager can catch the resulting page faults, it can prioritize the transfer of the pages the running machine actually needs. It is, in other words, a form of cross-host demand paging that makes migration happen with lower latency.
Other uses — shared memory distributed across the network, for example — are possible as well.
The patch set starts by adding a couple of new variants to the get_user_pages() function, which is charged with making user-space pages accessible to the kernel:
long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
			   unsigned long start, unsigned long nr_pages,
			   int write, int force, struct page **pages,
			   int *locked);
long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
			     unsigned long start, unsigned long nr_pages,
			     int write, int force, struct page **pages);
The former version is intended to be called with the mmap_sem semaphore held. It may release that semaphore while running, in which case *locked will be set to zero. The second form, instead, assumes that mmap_sem is not held. Using these functions in the kernel improves performance by allowing mmap_sem to be dropped while page-fault handling is in progress. That is useful even in current kernels, but, if handling of faults is going to be entrusted to user space, it will become necessary. Holding mmap_sem while calling out to user space would not be a recipe for happy times.
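An in-kernel caller of the locked variant would follow this pattern (a sketch; the surrounding function is invented for illustration):

```c
/* Sketch of a get_user_pages_locked() caller (invented context). */
static long pin_user_range(struct task_struct *tsk, struct mm_struct *mm,
			   unsigned long start, unsigned long nr_pages,
			   struct page **pages)
{
	int locked = 1;
	long ret;

	down_read(&mm->mmap_sem);
	ret = get_user_pages_locked(tsk, mm, start, nr_pages,
				    1 /* write */, 0 /* force */,
				    pages, &locked);
	/* mmap_sem may have been dropped while faults were handled. */
	if (locked)
		up_read(&mm->mmap_sem);
	return ret;
}
```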
The next step is to add the MADV_USERFAULT flag to the madvise() system call. If that flag is set on a region of memory, the kernel will no longer attempt to resolve page faults in that region. Instead, in the absence of other measures (described below), the faulting process will receive a SIGBUS signal. That, of course, leaves the process in the position of having to resolve the page fault on its own. A tool provided to help with that task is the new remap_anon_pages() system call:
int remap_anon_pages(void *dest, void *src, unsigned long len, unsigned long flags);
This system call will take the pages holding len bytes starting at src and move them in the process's address space to the region starting at dest. A number of conditions must be met for this operation to succeed, starting with the fact that the full range in dest must currently be unmapped — remap_anon_pages() will not overwrite an existing page mapping. The range in src, instead, must all be present and mapped, and the pages cannot be shared with other processes. All of these rules exist to simplify the implementation, but also to try to catch race conditions in user-space fault handling.
If src is a huge page, and len is a multiple of 2MB, then the full huge page(s) will be moved to dest without being split.
With this mechanism in place, an application's SIGBUS signal handler can respond to a fault by allocating memory, filling it with the needed contents, and mapping it into the proper location with remap_anon_pages(). Once the signal handler returns, the page fault will be retried, but, this time, the needed memory will be in place, so application execution will continue.
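That signal-handler flow might look like the following sketch. Note that remap_anon_pages() is a proposed system call from this patch set, so it is shown as a raw syscall() invocation; fill_page() and __NR_remap_anon_pages are invented names for illustration:

```c
/*
 * Sketch of a SIGBUS-based user fault handler using the proposed
 * remap_anon_pages() interface (hypothetical syscall number and
 * fill_page() helper).
 */
static void sigbus_handler(int sig, siginfo_t *info, void *ctx)
{
	void *fault_page = (void *)((unsigned long)info->si_addr & PAGE_MASK);

	/* Allocate a staging page and populate it with the needed data. */
	void *staging = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
			     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	fill_page(staging, fault_page);

	/* Move the staging page into place; the fault is then retried. */
	syscall(__NR_remap_anon_pages, fault_page, staging, PAGE_SIZE, 0);
}
```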
Anybody who has worked with signal handlers on Unix-like systems is probably thinking at this point that all that work does not belong in such a handler. And, indeed, signal handlers are not the way that processes are expected to deal with page-fault handling. To make life easier, Andrea adds another system call:
int userfaultfd(int flags);
This call will return an open file descriptor which may be used to communicate with the kernel about page fault handling. The flags argument is mostly unused, though O_NONBLOCK may be provided to request non-blocking behavior.
The first step after acquiring the file descriptor is for the application to write a 64-bit integer indicating which version of the userfault protocol it understands. The kernel will respond with the same number if the protocol is supported, -1 otherwise. Once agreement has been reached in that area, the application can read a 64-bit address whenever a page fault occurs. It should resolve the fault, then write back two pointers indicating the range of memory which has been mapped in response to the fault.
The idea here is that a process can dedicate a thread to page-fault handling. Whenever a fault occurs, the faulting thread will pause while the handler thread puts things in place. No SIGBUS signals will be delivered if userfaultfd() has been called. So, for the faulting thread, life just continues as usual, with the possible exception that some page faults may take longer to handle than one might expect.
As was mentioned above, there might be multiple use cases for user-space page fault handling. What if a single application wishes to exercise more than one of those cases? To that end, the application can open more than one file descriptor with userfaultfd() and restrict each to a specific range of memory. That restriction is requested by writing two pointers indicating the range to be covered; the least-significant bit should be set on the start pointer. Thereafter, only faults within the given range will be directed to that file descriptor. The application must still set MADV_USERFAULT on the ranges in question. Multiple ranges can be set up to go to a single file descriptor, but a given range of memory can only have its faults handled by a single descriptor.
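A handler thread speaking the protocol described above might be structured like this sketch. This is the proposal's read/write protocol, not a stable ABI, and USERFAULT_PROTOCOL and resolve_fault() are invented names:

```c
/*
 * Sketch of the userfaultfd() handler-thread protocol as proposed in
 * this patch set (not a stable ABI; helper names are invented).
 */
static void handle_faults(int ufd)
{
	uint64_t proto = USERFAULT_PROTOCOL;
	uint64_t addr;
	uint64_t range[2];

	/* Handshake: propose a protocol version, read back the answer. */
	write(ufd, &proto, sizeof(proto));
	read(ufd, &proto, sizeof(proto));
	if (proto == (uint64_t)-1)
		return;		/* protocol not supported */

	for (;;) {
		/* Each read yields the address of a pending fault. */
		if (read(ufd, &addr, sizeof(addr)) != sizeof(addr))
			break;
		resolve_fault(addr);	/* e.g. fetch the page and map it */

		/* Report the range that has now been mapped in. */
		range[0] = addr & ~(PAGE_SIZE - 1);
		range[1] = range[0] + PAGE_SIZE;
		write(ufd, range, sizeof(range));
	}
}
```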
The bulk of the commentary on the patch set has been around the remap_anon_pages() system call. Linus initially wondered whether remap_anon_pages() made more sense than remap_file_pages(), which he called an "unmitigated disaster" and which may be removed in the near future. Later he added that he would prefer an interface where the fault-handler process would simply write() the data to the page of interest, causing it to be allocated and mapped. Andrea responded that such an interface might be possible; the handler would write the data to the userfaultfd() file descriptor and the kernel would handle the rest. But he worried about losing the zero-copy behavior that was carefully designed into the current interface. Linus's answer to that made it clear that he was not concerned about zero-copy behavior, which, he said, is almost never worth the cost of implementing it.
What we may see is that the get_user_pages() optimizations will find their way in relatively soon, though Linus wasn't entirely happy with those either. The remaining work will take a while longer, and the end result seems unlikely to include remap_anon_pages(). But, given that the use case is real, a significant improvement to live migration is going to be hard to turn down in the long run.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Device driver infrastructure
Documentation
Filesystems and block I/O
Memory management
Networking
Security-related
Virtualization and containers
Page editor: Jonathan Corbet