
Kernel development

Brief items

Kernel release status

The current development kernel is 4.1-rc3, released on May 10. Linus said: "Go out and test. By -rc3, things really should be pretty non-threatening and this would be a good time to just make sure everything is running smoothly if you haven't tried one of the earlier development kernels already."

Stable updates: 3.10.77, 3.14.41, 3.19.7, and 4.0.2 were released on May 7. 3.19.8 (the final 3.19 update) followed on May 11. The 4.0.3, 3.14.42, and 3.10.78 updates came out on May 13.

Canonical has announced that it will be maintaining the 3.19 kernel series through July 2016.

Comments (none posted)

Quotes of the week

I dislike "turn off safety for performance" options because Joe SpeedRacer will always select performance over safety.
Dave Chinner

Ingo, I feel like you just gave me a free puppy...
Rusty Russell

— The kernel feature-naming bikeshed committee

Comments (none posted)

Kernel development news

Memory protection keys

By Jonathan Corbet
May 13, 2015
The memory-management units built into most contemporary processors are able to control access to memory on a per-page basis. Operating systems like Linux make that control available to applications in user space; the protection bits supplied to system calls like mmap() and mprotect() allow a process to say whether any given page should be readable, writable, or executable. This level of protection has served well for a long time, so one might be tempted to conclude that it provides everything that applications need. But a new hardware feature under development at Intel suggests otherwise; the first round of patches supporting that feature explores how programs might gain access to this new form of memory protection.

This feature is called "memory protection keys" (MPK); it will only be available in future 64-bit Intel processors. When this feature is enabled, four (previously unused) bits in each page-table entry can be used to assign one of sixteen "key" values to any given page. There is also a new 32-bit processor register with two bits for each key value. Setting the "write disable" bit for a given key will block all attempts to write a page with that key value; setting the "access disable" bit will block reads as well. The MPK feature thus allows a process to partition its memory into a maximum of sixteen regions and to selectively disable or enable access to any of those regions. The control register is local to each thread, so different threads can enable or disable different regions independently.
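
As a rough sketch of how those per-key bits pack into the register (the bit ordering and macro names below are assumptions chosen for illustration, not taken from Intel's documentation or the posted patches), the value controlling a given key might be computed like this:

    /* Assumed encoding: two bits per key, "access disable" in the low bit
       of each pair, "write disable" in the high bit. */
    #define PKEY_DISABLE_ACCESS  0x1
    #define PKEY_DISABLE_WRITE   0x2

    /* Register bits that disable the given rights for one key value. */
    static inline unsigned int pkey_disable_bits(int key, unsigned int rights)
    {
        return rights << (2 * key);
    }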

A patch set enabling the MPK feature has been posted by Dave Hansen for review even though, as he noted, nobody outside of Intel will be able to actually run that code at this time. Dave is hoping to get comments on the (minimal) user-space API changes needed to support MPK once the hardware is available.

In the proposed design, applications can set the page keys using any of the system calls that set the other page protections — mprotect(), for example. There are four new flags defined (PROT_PKEY0 through PROT_PKEY3) to represent the key bits. Within the kernel, these bits are stored in the virtual memory area (VMA), and pushed into the relevant location in the hardware page tables. If a process attempts to access a page in a way that is not allowed by the protection keys, it will get the usual SIGSEGV signal. Should it catch that signal, it can look for the new SEGV_PKUERR code (in the si_code field of the siginfo_t structure passed to the handler) to detect a fault caused by a protection key. There is not currently a way to determine which key caused the fault, but adding that is on the list of things to do in the future.
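
As a minimal sketch of how the proposed API might be used (PROT_PKEY0 and SEGV_PKUERR come from the patch set and are not in released kernel headers, so the numeric values defined below are placeholders), a program could tag a mapping with a key and watch for key-related faults roughly like this:

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Placeholder values; the real ones would come from the patched headers. */
    #ifndef PROT_PKEY0
    #define PROT_PKEY0   0x10
    #endif
    #ifndef SEGV_PKUERR
    #define SEGV_PKUERR  4
    #endif

    static void handler(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)ctx;
        if (si->si_code == SEGV_PKUERR)
            fprintf(stderr, "fault caused by a protection key\n");
        _exit(1);
    }

    int main(void)
    {
        struct sigaction sa = { 0 };
        sa.sa_sigaction = handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, NULL);

        /* Map a page and assign it protection key 1 (low key bit set). */
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_PKEY0,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return 1;

        /* If this thread had set the "access disable" bit for key 1 in the
           new register, the store below would fault with SEGV_PKUERR. */
        p[0] = 1;
        return 0;
    }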

One might well wonder why this feature is needed when everything it does can be achieved with the memory-protection bits that already exist. The problem with the current bits is that they can be expensive to manipulate. A change requires invalidating translation lookaside buffer (TLB) entries across the entire system, which is bad enough, but changing the protections on a region of memory can require individually changing the page-table entries for thousands (or more) pages. Instead, once the protection keys are set, a region of memory can be enabled or disabled with a single register write. For any application that frequently changes the protections on regions of its address space, the performance improvement will be large.

There is still the question (as asked by Ingo Molnar) of just why a process would want to make this kind of frequent memory-protection change. There would appear to be a few use cases driving this development. One is the handling of sensitive cryptographic data. A network-facing daemon could use a cryptographic key to encrypt data to be sent over the wire, then disable access to the memory holding the key (and the plain-text data) before writing the data out. At that point, there is no way that the daemon can leak the key or the plain text over the wire; protecting sensitive data in this way might also make applications a bit more resistant to attack.

Another commonly mentioned use case is to protect regions of data from being corrupted by "stray" write operations. An in-memory database could prevent writes to the actual data most of the time, enabling them only briefly when an actual change needs to be made. In this way, database corruption due to bugs could be fended off, at least some of the time. Ingo was unconvinced by this use case; he suggested that a 64-bit address space should be big enough to hide data in and protect it from corruption. He also suggested that a version of mprotect() that optionally skipped TLB invalidation could address many of the performance issues, especially if huge pages were used. Alan Cox responded, though, that there is real-world demand for the ability to change protection on gigabytes of memory at a time, and that mprotect() is simply too slow.

Being able to turn off unexpected writes could be especially useful when the underlying memory is a persistent memory device; any erroneous write there will go immediately to permanent storage. There have also been suggestions that tools like Valgrind could make good use of MPK.

Ingo's concerns notwithstanding, the MPK hardware feature is being added in response to customer interest; it would be surprising if the kernel did not end up supporting it, especially given that the required changes are not hugely invasive. So the real question is whether the proposed user-space API is correct and supportable in the long run. Hopefully, developers who think they might make use of this feature will take a look at the patches and make themselves heard if they find something they don't like.

Comments (11 posted)

Persistent memory and page structures

By Jonathan Corbet
May 13, 2015
As is suggested by its name, persistent memory (or non-volatile memory) is characterized by the persistence of the data stored in it. But that term could just as well be applied to the discussions surrounding it; persistent memory raises a number of interesting development problems that will take a while yet to work out. One of the key points of discussion at the moment is whether persistent memory should, like ordinary RAM, be represented by page structures and, if so, how those structures should be managed.

One page structure exists for each page of (non-persistent) physical memory in the system. It tracks how the page is used and, among other things, contains a reference count describing how many users the page has. A pointer to a page structure is an unambiguous way to refer to a specific physical page independent of any address space, so it is perhaps unsurprising that this structure is used with many APIs in the kernel. Should a range of memory exist that lacks corresponding page structures, that memory cannot be used with any API expecting a struct page pointer; among other things, that rules out DMA and direct I/O.

Persistent memory looks like ordinary memory to the CPU in a number of ways. In particular, it is directly addressable at the byte level. It differs, though, in its persistence, its performance characteristics (writes, in particular, can be slow), and its size — persistent memory arrays are expected to be measured in terabytes. At a 4KB page size, billions of page structures would be needed to represent this kind of memory array — too many to manage efficiently. As a result, currently, persistent memory is treated like a device, rather than like memory; among other things, that means that the kernel does not need to maintain page structures for persistent memory. Many things can be made to work without them, but this aspect of persistent memory does bring some limitations; one of those is that it is not currently possible to perform I/O directly between persistent memory and another device. That, in turn, thwarts use cases like using persistent memory as a cache between the system and a large, slow storage array.

Page-frame numbers

One approach to the problem, posted by Dan Williams, is to change the relevant APIs to do away with the need for page structures. This patch set creates a new type called __pfn_t:

    typedef struct {
        union {
            unsigned long data;
            struct page *page;
        };
    } __pfn_t;

As is suggested by the use of a union type, this structure leads a sort of double life. It can contain a page pointer as usual, but it can also be used to hold an integer page frame number (PFN). The two cases are distinguished by setting one of the low bits in the data field; the alignment requirements for page structures guarantee that those bits will be clear for an actual struct page pointer.

A small set of helper functions has been provided to obtain the information from this structure. A call to __pfn_t_to_pfn() will obtain the associated PFN (regardless of which type of data the structure holds), while __pfn_t_to_page() will return a struct page pointer, but only if a page structure exists. These helpers support the main goal for the __pfn_t type: to allow the lower levels of the I/O stack to be converted to use PFNs as the primary way to describe memory while avoiding massive changes to the upper layers where page structures are used.
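
A rough sketch of that tagging logic (the helper names come from the posted patches, but the flag bit and shift used here are assumptions for illustration only) might look like:

    /* Assumed encoding: a low flag bit marks a raw PFN, with the PFN itself
       stored above the flag bits.  These values are illustrative only. */
    #define PFN_T_IS_PFN  0x1UL
    #define PFN_T_SHIFT   2

    static inline unsigned long __pfn_t_to_pfn(__pfn_t pfn)
    {
        if (pfn.data & PFN_T_IS_PFN)
            return pfn.data >> PFN_T_SHIFT;
        /* Otherwise the union holds a struct page pointer. */
        return page_to_pfn(pfn.page);
    }

    static inline struct page *__pfn_t_to_page(__pfn_t pfn)
    {
        if (pfn.data & PFN_T_IS_PFN)
            return NULL;    /* no struct page exists for this memory */
        return pfn.page;
    }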

With that infrastructure in place, the block layer is changed to use __pfn_t instead of struct page; in particular, the bio_vec structure, which describes a segment of I/O, becomes:

    struct bio_vec {
        __pfn_t         bv_pfn;
        unsigned short  bv_len;
        unsigned short  bv_offset;
    };

The ripple effects from this change end up touching nearly 80 files in the filesystem and block subtrees. At a lower level, there are changes to the scatter/gather DMA API to allow buffers to be specified using PFNs rather than page structures; this change has architecture-specific components to enable the mapping of buffers by PFN.

Finally, there is the problem of enabling kmap_atomic() on PFN-specified pages. kmap_atomic() maps a page into the kernel's address space; it is only really needed on 32-bit systems where there is not room to map all of main memory into that space. On 64-bit systems it is essentially a no-op, turning a page structure into its associated kernel virtual address. That problem gets a little trickier when persistent memory is involved; the only code that really knows where that memory is mapped is the low-level device driver. Dan's patch set adds a function by which the driver can inform the rest of the kernel of the mapping between a range of PFNs and kernel space; kmap_atomic() is then adapted to use that information.

All together, this patch set is enough to enable direct block I/O to persistent memory. Linus's initial response was on the negative side, though; he said "I detest this approach". Instead, he argued in favor of a solution where special page structures are created for ranges of persistent memory when they are needed. As the discussion went on, though, he moderated his position, saying: "So while I (very obviously) have some doubts about this approach, it may be that the most convincing argument is just in the code." That code has since been reposted with some changes, but the discussion is not yet finished.

Back to page structures

Various alternatives have been suggested, but the most attention was probably drawn by Ingo Molnar's "Directly mapped pmem integrated into the page cache" proposal. The core of Ingo's idea is that all persistent memory would have page structures, but those structures would be stored in the persistent memory itself. The kernel would carve out a piece of each persistent memory array for these structures; that memory would be hidden from filesystem code.

Despite being stored in persistent memory, the page structures themselves would not be persistent — a point that a number of commenters seemed to miss. Instead, they would be initialized at boot time, using a lazy technique so that this work would not overly slow the boot process as a whole. All filesystem I/O would be direct I/O; in this picture, the kernel's page cache has little involvement. The potential benefits are huge: vast amounts of memory would be available for fast I/O without many of the memory-management issues that make life difficult for developers today.

It is an interesting vision, and it may yet bear fruit, but various developers were quick to point out that things are not quite as simple as Ingo would like them to be. Matthew Wilcox, who has done much of the work to make filesystems work properly with persistent memory, noted that there is an interesting disconnect between the lifecycle of a page-cache page and that of a block on disk. Filesystems have the ability to reassign blocks independently of any memory that might represent the content of those blocks at any given time. But in this directly mapped view of the world, filesystem blocks and pages of memory are the same thing; synchronizing changes to the two could be an interesting challenge.

Dave Chinner pointed out that the directly mapped approach makes any sort of data transformation by the filesystem (such as compression or encryption) impossible. In Dave's view, the filesystem needs to have a stronger role in how persistent memory is managed in general. The idea of just using existing filesystems (as Ingo had suggested) to get the best performance out of persistent memory is, in his view, not sustainable. Ingo, instead, seems to feel that management of persistent memory could be mostly hidden from filesystems, just like the management of ordinary memory is.

In any case, the proof of this idea would be in the code that implements it, and, currently, no such code exists. About the only thing that can be concluded from this discussion is that the kernel community still has not figured out the best ways of dealing with large persistent-memory arrays. Likely as not, it will take some years of experience with the actual hardware to figure that out. Approaches like Dan's might just be merged as a way to make things work for now. The best way to make use of such memory in the long term remains undetermined, though.

Comments (1 posted)

Trading off safety and performance in the kernel

By Jonathan Corbet
May 12, 2015
The kernel community ordinarily tries to avoid letting users get into a position where the integrity of their data might be compromised. There are exceptions, though; consider, for example, the ability to explicitly flush important data to disk (or, more to the point, to avoid flushing it at any given time). Buffering I/O in this manner can significantly improve write throughput, but if application developers are careless, the result can be data loss should the system go down at an inopportune time. Recently there have been a couple of proposed performance-oriented changes that have tested the community's willingness to let users put their data in danger.

O_NOMTIME

A file's "mtime" tracks the last modification time of the file's contents; it is typically updated when the file is written to. Zach Brown recently posted a patch creating a new open() flag called O_NOMTIME; if that flag is present, the filesystem will not update mtime when the file is changed. This change is wanted by the developers of the Ceph filesystem, which has no use for mtime updates:

The ceph servers don't use mtime at all. They're using the local file system as a backing store and any backups would be driven by their upper level ceph metadata. For ceph, slow IO from mtime updates in the file system is as daft as if we had block devices slowing down IO for per-block write timestamps that file systems never use.

Disabling mtime updates, Zach said, can reduce total I/O associated with a write operation by a factor of two or more.
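
As a minimal illustration (O_NOMTIME comes from Zach's patch and is not defined in mainline headers, so the value below is a placeholder; the final interface may well end up as an ioctl() instead, as discussed below), a server like Ceph might open its backing files this way:

    #include <fcntl.h>

    /* Placeholder: O_NOMTIME exists only in the proposed patch. */
    #ifndef O_NOMTIME
    #define O_NOMTIME  040000000
    #endif

    /* Open a backing-store file whose mtime the application never uses. */
    int open_backing_file(const char *path)
    {
        return open(path, O_RDWR | O_NOMTIME);
    }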

There are, of course, a couple of problems with turning off mtime updates. Trond Myklebust noted that it would break NFS "pretty catastrophically" to not maintain that information; NFS clients would lose the ability to detect when they have stale cached data, leading to potential data corruption. The biggest concern, though, appears to be the effect on filesystem backups; if a file's mtime is not updated when the file is modified, that file will not be picked up in an incremental backup (assuming the backup scheme uses mtime, which most do). A system administrator might decide to run that risk, but there is always the possibility that an application will make that choice for them. As Dave Chinner put it:

The last thing an admin wants when doing disaster recovery is to find out that the app started using O_NOMTIME as a result of the upgrade they did 6 months ago. Hence the last 6 months of production data isn't in the backups despite the backup procedure having been extensively tested and verified when it was first put in place.

Another way of putting it is that the mtime value is often not there for the benefit of the creator of the file; it is often used by others as part of the management of the system. Allowing the creator to disable mtime updates may have implications for those others, who would then have cause to wish that they had been part of that decision before it was made.

Despite the concerns, most developers appear to recognize that there is a real use case for being able to turn off mtime updates. So the discussion shifted quickly to how this capability could be provided without creating unpleasant surprises for system administrators. There appear to be two approaches toward achieving that goal.

The first of those is to not allow applications to disable mtime updates unless the system administrator has agreed to it. That agreement is most likely to take the form of a special mount option; unless a specific filesystem has been mounted with the "allow_nomtime" option, attempts to disable mtime updates on that filesystem will be denied. The second is to hide the option in a place where it does not look like part of the generic POSIX API. In practice, that means that, rather than being a flag for the open() system call, O_NOMTIME will probably become a mode that is enabled with an ioctl() call.

Syncing and suspending

Putting a system into the suspended state is a complicated task with a number of steps; in current kernels, one of those steps is to call sys_sync() to flush all dirty file pages back out to persistent storage. It might seem intuitively obvious that saving the contents of files before suspending is a good thing to do, but that has not stopped Len Brown from posting a patch to remove the sys_sync() call from the suspend path.

Len's contention is that flushing disks can be an expensive operation (it can take multiple seconds) and that this cost should not necessarily be paid every time the system is suspended. Doing the sync unconditionally in the kernel, in other words, is a policy decision that may not match what all users want. Anybody who wants file data to be flushed is free to run sync before suspending the system, so removing the call just increases the flexibility of the system.
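
A user-space policy along those lines is simple enough; this minimal sketch flushes dirty data explicitly, then suspends by writing to /sys/power/state, the usual interface for entering suspend-to-RAM:

    #include <fcntl.h>
    #include <unistd.h>

    /* Flush dirty data, then suspend to RAM via /sys/power/state. */
    int suspend_with_sync(void)
    {
        sync();    /* the flush the kernel would no longer do automatically
                      if the proposed patch were merged */

        int fd = open("/sys/power/state", O_WRONLY);
        if (fd < 0)
            return -1;
        if (write(fd, "mem", 3) != 3) {
            close(fd);
            return -1;
        }
        close(fd);
        return 0;
    }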

This change concerns some developers; Alan Cox was quick to point out some reasons why it makes sense to flush out file data, including the facts that resume doesn't always work and that users will sometimes disconnect drives from a suspended system. It has also been pointed out that, sometimes, a suspended system will never resume due to running out of battery or the kernel being upgraded. For cases like this, it was argued, removing the sys_sync() call is just asking for data to be lost.

Nobody, of course, is trying to make the kernel more likely to lose data. The driving force here is something different: the meaning of "suspending" a system is changing. A user who suspends a laptop by closing the lid prior to tossing it into a backpack almost certainly wants all data written to disk first. But when a system is using suspend as a power-management mechanism, the case is not quite so clear. If a system is able to suspend itself between every keystroke — as some systems are — it may not make sense to do a bunch of disk I/O every time. That may be doubly true on small mobile devices where the power requirements are strict and the I/O devices are slow. On such systems, it may well make sense to suspend the system without flushing I/O to persistent storage first.

The end result is that most (but not all) developers seem to agree that there is value in being able to suspend the system without syncing the disks first. There is rather less consensus, though, on whether that should be the kernel's default behavior. If this change goes in, it is likely to be controlled by a sysctl knob, and the default value of that knob will probably be to continue to sync files as is done in current kernels.

Comments (108 posted)

Patches and updates

Kernel trees

Linus Torvalds Linux 4.1-rc3
Greg KH Linux 4.0.3
Greg KH Linux 4.0.2
Greg KH Linux 3.19.8
Greg KH Linux 3.19.7
Luis Henriques Linux 3.16.7-ckt11
Greg KH Linux 3.14.42
Greg KH Linux 3.14.41
Kamal Mostafa Linux 3.13.11-ckt20
Greg KH Linux 3.10.78
Greg KH Linux 3.10.77
Ben Hutchings Linux 3.2.69

Architecture-specific

Build system

Core kernel code

Development tools

Device drivers

Device driver infrastructure

Documentation

Filesystems and block I/O

Memory management

Networking

Security-related

Virtualization and containers

Paolo Bonzini KVM: x86: SMM support

Miscellaneous

Page editor: Jonathan Corbet


Copyright © 2015, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds