Kernel development
Brief items
Kernel release status
The current development kernel is 4.5-rc5, released on February 20. "Things continue to look normal, and things have been fairly calm. Yes, the VM THP cleanup seems to still be problematic on s390, but other than that I don't see anything particularly worrisome."
Stable updates: 4.3.6 (the final 4.3.x update) and 3.10.97 were released on February 19. The 4.4.3, 3.14.62, and 3.10.98 updates are in the review process as of this writing; they can be expected on or after February 24.
Quotes of the week
Kernel development news
A BoF on kernel network performance
Whether one measures by attendance or by audience participation, one of the most popular sessions at the Netdev 1.1 conference in Seville, Spain, was the network-performance birds-of-a-feather (BoF) session led by Jesper Brouer. The session was held in the largest conference room to a nearly packed house. Brouer and seven other presenters took the stage, taking turns presenting topics related to finding and removing bottlenecks in the kernel's packet-processing pipeline; on each topic, the audience weighed in with opinions and, often, proposed fixes.
The BoF was not designed to produce final solutions, but rather to encourage debate and discussion—hopefully fostering further work. Debate was certainly encouraged, to the point where Brouer was not able to get to every topic on the agenda before time had elapsed. But what was covered provides a good snapshot of where network-optimization efforts stand today.
DDoS mitigation
The first to speak was Gilberto Bertin from web-hosting provider CloudFlare. The company periodically encounters network bottlenecks on its Linux hosts, he said, with the most egregious being those caused by distributed denial-of-service (DDoS) attacks. Even a relatively small packet flood, such as two million UDP packets per second (2Mpps), will max out the kernel's packet-processing capabilities, saturating the receive queue faster than it can be emptied and causing the system to drop packets. 2Mpps is nowhere near the full 10G Ethernet wire speed of 14Mpps.
DDoS attacks are usually primitive, and an iptables drop rule targeting each source address should suffice, but CloudFlare has found it insufficient. Instead, the company is forced to offload traffic to a user-space packet handler. Bertin proposed two approaches to solving the problem: using Berkeley Packet Filter (BPF) programs shortly after packet ingress to parse incoming packets (dropping DDoS packets before they enter the receive queue), and using circular buffers to process incoming traffic (thus eliminating many memory allocations).
Brouer reported that he had tested several possible solutions himself, including using receive packet steering (RPS) and dedicating a CPU to handling the receive queue. Using RPS alone, he was able to handle 7Mpps in laboratory tests; by also binding a CPU, the number rose to 9Mpps. Audience members proposed several other approaches; Jesse Brandeburg suggested designating a queue for DDoS processing and steering other traffic away from it. Brouer discussed some tests he had run attempting to put drop rules as early as possible in the pipeline; none made a drastic difference in the throughput. When an audience member asked if BPF programs could be added to the network interface card's (NIC's) driver, David Miller suggested that running drop-only rules against the NIC's DMA buffer would be the fastest the kernel could possibly respond.
There was also a lengthy discussion about how to reduce the overhead caused by memory operations. Brouer reported that memcpy() calls accounted for as much as 43% of the time required to process a received packet. Jamal Hadi Salim asked whether sk_buff buffers could simply be recycled in a ring; Alexander Duyck replied that not all NIC drivers would support that approach. Ultimately, Brouer wrapped up the topic by saying there was no clear solution: latency hides in a number of places in the pipeline, so reducing cache misses, using bulk memory allocation, and re-examining the entire allocation strategy on the receive side may be required.
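The recycling idea Salim raised can be sketched in user-space C. This is a minimal, hypothetical model — the names and sizes are illustrative, not the kernel's — showing how a small free ring lets a driver reuse packet buffers instead of paying an allocator round trip per packet:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of a recycled-buffer ring: instead of an
 * allocate/free pair per packet, completed buffers go back onto a
 * fixed-size ring and are reused for the next received packet. */
#define RING_SIZE 8
#define BUF_SIZE  2048

struct buf_ring {
	void *slot[RING_SIZE];
	unsigned int head, tail;	/* head: next to pop, tail: next to push */
};

static void ring_init(struct buf_ring *r)
{
	memset(r, 0, sizeof(*r));
}

/* Get a buffer: reuse one from the ring if possible, else allocate. */
static void *buf_get(struct buf_ring *r)
{
	if (r->head != r->tail) {
		void *b = r->slot[r->head % RING_SIZE];
		r->head++;
		return b;
	}
	return malloc(BUF_SIZE);
}

/* Return a buffer: recycle it into the ring if there is room, else free it. */
static void buf_put(struct buf_ring *r, void *b)
{
	if (r->tail - r->head < RING_SIZE) {
		r->slot[r->tail % RING_SIZE] = b;
		r->tail++;
	} else {
		free(b);
	}
}
```

Duyck's objection applies here too: a real driver's DMA and completion model may not allow buffers to be handed back this cleanly, which is why no single approach won the day.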
Transmit powers
Brouer then presented the next topic, improving transmit performance. He noted that bulk transmission with the xmit_more API had solved the outgoing-traffic bottleneck, enabling the kernel to transmit packets at essentially full wire speed. But, he said, the "full wire speed" numbers are really achievable only in artificial workloads. For practical usage, it is hard to activate the bulk dequeuing discipline. Since the technique lowers CPU utilization, it would be beneficial to many users if it could be enabled well before one approaches the bandwidth limit.
He suggested several possible alternative means to activate xmit_more, including setting off a trigger whenever the hardware transmit queue gets full, tuning Byte Queue Limits (BQLs), and providing a user-space API to activate bulk sending. He had experimented some with the BQL idea, he reported: adjusting the BQL downward until the bulk queuing discipline kicks in resulted in a 64% increase in throughput.
Tom Herbert was not thrilled about that approach, noting that BQL was, by design, intended to be configuration-free; using it as a tunable feature seems like asking for trouble. John Fastabend asked if a busy driver could drop packets rather than queuing them, thus triggering the bulk discipline. Another audience member proposed adding an API through which the kernel could tell a NIC driver to split its queues. There was little agreement on approaches, although most in attendance seemed to feel that further discussion in this area was well warranted.
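The principle behind xmit_more can be illustrated with a small C sketch. All names here are invented stand-ins for the real driver interfaces: the point is only that the expensive doorbell (an MMIO write notifying the NIC) is deferred until the last packet of a burst, which is where the CPU savings come from:

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the xmit_more idea: when the stack knows more packets are
 * queued behind the current one, the driver skips the expensive
 * doorbell write until the final packet of the burst. The function
 * names are illustrative, not the actual driver API. */
static int doorbell_rings;

static void nic_ring_doorbell(void)
{
	doorbell_rings++;		/* stands in for an MMIO write */
}

static void ndo_xmit(int pkt, bool xmit_more)
{
	(void)pkt;			/* descriptor setup would go here */
	if (!xmit_more)
		nic_ring_doorbell();	/* notify hardware only on the last packet */
}

static void xmit_burst(int npkts)
{
	for (int i = 0; i < npkts; i++)
		ndo_xmit(i, i < npkts - 1);
}
```

An eight-packet burst rings the doorbell once rather than eight times; the difficulty discussed in the BoF is getting the stack to present such bursts to the driver outside of artificial, line-rate workloads.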
The trials of small devices
Next, Felix Fietkau of the OpenWrt project spoke, raising concerns that recent development efforts in the kernel networking space focused too much on optimizing behavior for high-end Intel-powered machines, at the risk of hurting performance on low-end devices like home routers and ARM systems. In particular, he pointed out that these smaller devices have significantly smaller data cache sizes, comparable instruction cache sizes but without smart pre-fetchers, and smaller cache-line sizes. Some of the recent optimizations, particularly cache-line optimizations, can hurt performance on small systems, he said.
He showed some benchmarks of kernel 4.4 running on a 720MHz Qualcomm QCA9558 system-on-chip. Base routing throughput was around 268Mbps; activating nf_conntrack_rtcache raised throughput to 360Mbps. Also removing iptable_mangle and iptable_raw increased throughput to 400Mbps. The takeaway, he said, was that removing or conditionally disabling unnecessary hooks (such as statistics-gathering hooks) was vital, as was eliminating redundant accesses to packet data.
Miller commented that the transactional overhead of the hooks in question was the real culprit, and asked whether or not many of the small devices in question would be a good fit for hardware offloading via the switchdev driver model. Fietkau replied that many of the devices do support offload, but that it is usually crippled in some fashion, such as not being configurable.
Fietkau also presented some out-of-tree hacks used to improve performance on small devices, including using lightweight socket buffers and using dirty pointer tricks to avoid invalidating the data cache.
Caching
Brouer then moved on to the topic of instruction-cache optimization. The network stack, he said, does a poor job of utilizing the instruction cache, since the typical cache is smaller than the code path used to process the average Ethernet packet. Furthermore, even though many packets arriving in the same time window are handled in the same manner, processing each packet individually means that each one incurs the same instruction-cache misses.
He proposed several possible ways to better utilize the cache, starting with processing packets in bundles, enabling several to be processed simultaneously at each stage. NIC drivers could bundle received packets, he said, for more optimal processing. The polling routine already processes many packets at once, but it currently calls "the full stack" for each packet. And the driver can view all of the packets available in the receive ring, so it could simply treat them all as having arrived at the same time and process them in bulk. A side effect of this approach, he said, would be that it hides latency caused by cache misses.
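The staged-processing idea can be shown in miniature. In this hypothetical sketch (stage names are invented), the naive path would run parse-then-deliver per packet, interleaving the two code paths; the bundled poll routine instead runs each stage across the whole bundle, so one stage's instructions stay hot in the cache:

```c
#include <assert.h>
#include <string.h>

/* Sketch of staged (bundled) RX processing: each stage runs across the
 * whole bundle before the next stage starts, rather than running the
 * full stack once per packet. The stages here are illustrative. */
#define BUNDLE 4

static char call_log[2 * BUNDLE + 1];	/* records the order stages ran in */
static int  log_pos;

static void stage_parse(int pkt)   { (void)pkt; call_log[log_pos++] = 'P'; }
static void stage_deliver(int pkt) { (void)pkt; call_log[log_pos++] = 'D'; }

static void rx_poll_bundled(int npkts)
{
	/* pass 1: parse headers for every packet in the bundle */
	for (int i = 0; i < npkts; i++)
		stage_parse(i);
	/* pass 2: deliver them all; stage_deliver's code is now hot */
	for (int i = 0; i < npkts; i++)
		stage_deliver(i);
}
```

The per-packet path would log "PDPDPDPD"; the bundled path logs "PPPPDDDD" — same work, but each stage's instruction-cache misses are paid once per bundle instead of once per packet.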
A related issue, he said, is that the first cache miss often happens too soon for prefetching, in the eth_type_trans() function. By delaying the call to eth_type_trans() in the network stack's receive loop, the miss can be avoided. Even better, he said, would be to avoid calling eth_type_trans() altogether. The function is used to determine the packet's protocol ID, he said, which could also be determined from the hardware RX descriptor.
Brouer also proposed staging bundles of packets for processing at the generic receive offload (GRO) and RPS layers. GRO does this already, he said, though it could be further optimized. Implementing staged processing for RPS faces one hurdle in the fact that RPS takes cross-CPU locks for each packet. But Eric Dumazet pointed out that bulk enqueuing for remote CPUs should be easily doable. RPS already defers sending the inter-processor interrupt, which essentially amortizes the cost across multiple packets.
TC and other topics
Fastabend then spoke briefly (as time was running short) about the queuing discipline (qdisc) code path in the kernel's traffic control (TC) mechanism. Currently, qdisc takes six lock operations, even if the queue is empty and the packet is transmitted directly. He ran some benchmarks showing that the locks account for 70–82% of the time spent in qdisc, and thus set out to re-implement qdisc in a lockless manner. He has posted an RFC implementation that reduces the lock count to two; the work is not yet complete, and other items remain on the to-do list: one is support for bulk dequeuing, another is gathering some real-world numbers to determine whether the performance improvement is as anticipated.
Brouer then gave a quick overview of the "packet-page" concept: at a very early point in the receive process, a packet could be extracted from the receive ring into a memory page, allowing it to be sent on an alternative processing path. "It's a crazy idea," he warned the crowd, but it has several potential use cases. First, it could be a point for kernel-bypass tools (such as the Data Plane Development Kit) to hook into. It could also allow the outgoing network interface to simply move the packet directly into the transmit ring, and it could be useful for virtualization (allowing guest operating systems to rapidly forward traffic on the same host). Currently, implementing packet-page requires hardware support (in particular, hardware that marks packet types in the RX descriptor), but Brouer reported that he has seen some substantial and encouraging results in his own experiments.
As the session time finally elapsed for good, Brouer also briefly addressed some ideas for reworking the memory-allocation strategy for received packets (as alluded to in the first mini-presentation of the BoF). One idea is to write a new allocator specific to the network receive stack. There are a number of allocations identified as introducing overhead, so there is plenty of room for improvement.
But other approaches are possible, too, he said. Perhaps using a DMA mapping would be preferable, thus avoiding all allocations. There are clear pitfalls, such as needing a full page for each packet and the overhead of clearing out enough headroom for inserting each sk_buff.
Finally, Brouer reminded the audience of just how far the kernel networking stack has come in recent years. In the past two years alone, he said, the kernel has moved from a maximum transmit throughput of 4Mpps to the full wire speed of 14.8Mpps. IPv4 forwarding speed has increased from 1Mpps to 2Mpps on single-core machines (and even more on multi-core machines). The receive throughput started at 6.4Mpps and, with the latest experimental patches, now hovers around 12Mpps. Those numbers should be an encouragement; if the BoF attendees are anything to judge by, further performance gains are still on the horizon.
[The author would like to thank the Netconf and Netdev organizers for travel assistance to Seville.]
Sigreturn-oriented programming and its mitigation
In the good old days (from one point of view, at least), attackers had an easy life; all they had to do was to locate a buffer overrun vulnerability, then they could inject whatever code they liked into the vulnerable process. Over the years, kernel developers have worked to ensure that data that can be written by an application cannot be executed by that application; that has made simple code-injection unfeasible in most settings. Attackers have responded with techniques like return-oriented programming (ROP), but ROP attacks are relatively hard to get right. On some systems, attackers may be able to use the simpler sigreturn-oriented programming (SROP) technique instead; kernel patches have been circulating in an attempt to head off that class of attacks.
Some background
If data on the stack cannot be executed, a buffer overflow vulnerability cannot be used to inject code directly into an application. Such vulnerabilities can, however, be used to change the program counter by overwriting the current function's return address. If the attacker can identify code existing within the target process's address space that performs the desired task, they can use a buffer overflow to "return" to that code and gain control.
Unfortunately for attackers, most programs lack a convenient "give me a shell" location to jump to via an overwritten return address. But it is still likely that the program contains the desired functionality; it is just split into little pieces and scattered throughout the address space. The core idea behind return-oriented programming is to find these pieces in places where they are followed by a return instruction. The attacker, who controls the stack, can not only jump to the first of these pieces; they can also place a return address on the stack so that when this piece executes its return instruction, control goes to another attacker-chosen location — the next piece of useful code. By stringing together a set of these "gadgets," the attacker can create a new program within the target process.
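The control-flow pattern — "return" into a gadget, which returns into the next attacker-chosen address — can be modeled harmlessly in C. This toy interpreter is purely illustrative (real gadgets are instruction fragments in the target binary, and the chain lives on the corrupted stack); here, ordinary functions stand in for gadgets and a loop stands in for the chained return instructions:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of ROP chaining: the attacker-controlled "stack" is a list
 * of gadget addresses; each time a gadget "returns", control transfers
 * to the next address on the list. Here the gadgets are plain
 * functions and the chain is driven by a loop. */
static int reg;				/* stands in for a CPU register */

static void gadget_load(void) { reg = 41; }	/* e.g. "pop rax; ret" */
static void gadget_inc(void)  { reg += 1; }	/* e.g. "inc rax; ret" */

static void run_chain(void (*chain[])(void), size_t n)
{
	/* each iteration models one "ret" into the next gadget */
	for (size_t i = 0; i < n; i++)
		chain[i]();
}
```

Stringing two gadgets together computes a value neither gadget produces alone — the essence of building "a new program within the target process" out of borrowed pieces.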
There are various tools out there to help with the creation of ROP attacks. Scanners can pass through an executable image and identify gadgets of interest. "ROP compilers" can then create a program to accomplish the attacker's objective. But the necessary gadgets may not be available, and techniques like address-space layout randomization (ASLR) make ROP attacks harder. So ROP attacks tend to be fiddly affairs, often specific to the system being attacked (or even to the specific running process). Attackers, being busy people like the rest of us, cannot be blamed if they look for easier ways to compromise a system.
Exploiting sigreturn()
Enter sigreturn(), a Linux system call that nobody calls directly. When a signal is delivered to a process, execution jumps to the designated signal handler; when the handler is done, control returns to the location where execution was interrupted. Signals are a form of software interrupt, and all of the usual interrupt-like accounting must be dealt with. In particular, before the kernel can deliver a signal, it must make a note of the current execution context, including the values stored in all of the processor registers.
It would be possible to store this information in the kernel itself, but that might make it possible for an attacker (of a different variety) to cause the kernel to allocate arbitrary amounts of memory. So, instead, the kernel stores this information on the stack of the process that is the recipient of the signal. Prior to invoking the signal handler, the kernel pushes an (architecture-specific) variant of the sigcontext structure onto the process's stack; this structure contains register information, floating-point status, and more. When the signal handler has completed its job, it calls sigreturn(), which restores all that information from the on-stack structure.
Attackers employing ROP techniques have to work to find gadgets that will store the desired values into specific processor registers. If they can call sigreturn(), though, life gets easier, since that system call sets the values of all registers directly from the stack. As it happens, the kernel has no way to know whether a specific sigreturn() call comes from the termination of a legitimate signal handler or not; the whole system was designed so that the kernel would not have to track that information. So, as Erik Bosman and Herbert Bos noted in this paper [PDF], sigreturn() looks like it might be helpful to attackers.
There is one obstacle that must be overcome first, though: an attacker must find a ROP gadget that makes a call to sigreturn() — and few applications do that directly. One way to do that would be to locate a more generic gadget for invoking system calls, then arrange for the appropriate number to be passed to indicate sigreturn(). But in many cases that is unnecessary; for years, the kernel developers conveniently put a sigreturn() call in a place where attackers could find it — at a fixed address that is not subject to ASLR. That address is in the "virtual dynamic shared object" (vDSO) area, a page mapped by the kernel in a known location into every process to optimize some system calls. On other systems, the sigreturn() call can be found in the C library; exploiting that one requires finding a way to leak some ASLR information first.
Bosman and Bos demonstrated that sigreturn() can be used to exploit processes with a buffer overflow vulnerability. Often, the sigreturn() gadget is the only one that is required to make the exploit work; in some cases, the exploit can be written in a system-independent way, able to be reused with no additional effort. More recent kernels have made these exploits harder (the vDSO area is no longer usable, for example), but they are still far from impossible. And, in any case, many interesting targets are running older kernels.
Stopping SROP
Scott Bauer recently posted a patch set meant to put an end to SROP attacks. Once the problem is understood, the solution becomes clear relatively quickly: the kernel needs a way to verify that a sigcontext structure on the stack was put there by the kernel itself. That would ensure that sigreturn() can only be called at the end of a real signal delivery.
Scott's patch works by generating a random "cookie" value for each process. As part of the signal-delivery process, that cookie is stored onto the stack, next to the sigcontext structure. Prior to being stored, it is XORed with the address of the stack location where it is to be stored, making it a bit harder to read back; future plans call for hashing the value as well, making the recovery of the cookie value impossible. Even without hashing, though, the cookie should be secure enough; an attacker who can force a signal and read the cookie off the stack is probably already in control.
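The location-dependent encoding can be sketched in a few lines. This is a user-space illustration with an invented cookie value — the real patch does this inside the kernel's signal-delivery and sigreturn() paths — but the XOR arithmetic is the same:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the cookie scheme: the per-process secret is XORed with
 * the stack address it is stored at, so the stored value differs by
 * location; verification recomputes the XOR and compares. The cookie
 * value here is a made-up constant; the kernel generates a random one
 * per process. */
static uint64_t task_cookie = 0x5eed5eed5eed5eedULL;

static void store_cookie(uint64_t *slot)
{
	/* mixing in the address makes a leaked value location-specific */
	*slot = task_cookie ^ (uintptr_t)slot;
}

static int verify_cookie(const uint64_t *slot)
{
	return (*slot ^ (uintptr_t)slot) == task_cookie;
}
```

A forged sigcontext placed anywhere the attacker likes will fail verification unless they know both the secret and the exact stack address — and the planned move to a hashed encoding would make recovering the secret from a leaked slot value harder still.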
The sigreturn() implementation just needs to verify that the cookie exists in the expected location; if it's there, the call is legitimate and can proceed. Otherwise the operation ends and a SIGSEGV signal is delivered to the process, killing it unless the process has made other arrangements.
There are some practical problems with the patch still; for example, it will not do the right thing in settings where checkpoint-restore in user space is in use (a restored process will have a new and different random cookie value, but old cookies may still be on the stack). Such problems can be worked around, but they may force the addition of a sysctl knob to turn this protection off in settings where it breaks things. It also does nothing to protect against ROP attacks in general; it just closes off one relatively easy-to-exploit form of those attacks. But, as low-hanging fruit, it is probably worth pursuing; there is no point in making an attacker's life easier.
DAX and fsync: the cost of forgoing page structures
DAX, the support library that can help Linux filesystems provide direct access to persistent memory (PMEM), has seen substantial ongoing development since we covered it nearly 18 months ago. Its main goal is to bypass the page cache, allowing reads and writes to become memory copies directly to and from the PMEM, and to support mapping that PMEM directly into a process's address space with mmap().
Consequently, it was a little surprising to find that one of the challenges in recent months was the correct implementation of fsync() and related functions, which are primarily responsible for synchronizing the page cache with permanent storage.
While that primary responsibility of fsync() is obviated by not caching any data in volatile memory, there is a secondary responsibility that is just as important: ensuring that all writes that have been sent to the device have landed safely and are not still in the pipeline. For devices attached using SATA or SCSI, this involves sending (and waiting for) a particular command; the Linux block layer provides the blkdev_issue_flush() API (among a few others) for achieving this. For PMEM we need something a little different.
There are actually two "flush" stages needed to ensure that CPU writes have made it to persistent storage. One stage is a very close parallel to the commands sent by blkdev_issue_flush(). There is a subtle distinction between PMEM "accepting" a write and "committing" a write; if power fails between these events, data could be lost. The necessary "flush" can be performed transparently by a memory controller using Asynchronous DRAM Refresh (ADR) [PDF], or explicitly by the CPU using, for example, the new x86_64 instruction PCOMMIT. This can be seen in the wmb_pmem() calls sprinkled throughout the DAX and PMEM code in Linux; handling this stage is no great burden.
The burden is imposed by the other requirement: the need to flush CPU caches to ensure that the PMEM has "accepted" the writes. This can be avoided by performing "non-temporal writes" that bypass the cache, but that cannot be ensured when the PMEM is mapped directly into applications. Currently, on x86_64 hardware, this requires explicitly flushing each cache line that might be dirty by invoking the CLFLUSH (Cache Line Flush) instruction, or possibly a newer variant if available (CLFLUSHOPT, CLWB).
An easy approach, referred to in discussions as the "Big Hammer", is to implement the blkdev_issue_flush() API by calling CLFLUSH on every address of the entire persistent memory. While CLFLUSH is not a particularly expensive operation, performing it over potentially terabytes of memory was seen as worrisome.
The alternative is to keep track of which regions of memory might have been written recently and to flush only those. This can be expected to bring the amount of memory being flushed down from terabytes to gigabytes at the very most, and hence to reduce the run time by several orders of magnitude.
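The per-range flush is just address arithmetic over cache lines. In this sketch the flush itself is stubbed out (on real x86_64 it would be CLFLUSH, CLFLUSHOPT, or CLWB on each line); what matters is rounding down to a line boundary and stepping in line-sized increments, so the work is proportional to the dirty data rather than to the device:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of a cache-line-granular flush walk over a dirty range.
 * flush_line() is a stub standing in for the CLFLUSH-family
 * instructions. */
#define CACHELINE 64

static int lines_flushed;

static void flush_line(uintptr_t addr)
{
	(void)addr;		/* real code: asm volatile("clflush (%0)"...) */
	lines_flushed++;
}

static void flush_range(uintptr_t addr, size_t len)
{
	/* round down to the containing cache line, then walk line by line */
	uintptr_t p   = addr & ~(uintptr_t)(CACHELINE - 1);
	uintptr_t end = addr + len;

	for (; p < end; p += CACHELINE)
		flush_line(p);
}
```

Flushing a 200-byte dirty region touches four or five lines instead of the billions that the "Big Hammer" approach would sweep over a terabyte-scale device.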
Keeping track of dirty memory is easy when the page cache is in use: a flag in struct page serves the purpose. Since DAX bypasses the page cache, there are no page structures for most of PMEM, so an alternative is needed. Finding that alternative was the focus of most of the discussion and of the implementation of fsync() support for DAX, culminating in patch sets posted by Ross Zwisler (original and fix-ups) that landed upstream for 4.5-rc1.
Is it worth the effort?
There was a subthread running through the discussion that wondered whether it might be best to avoid the problem rather than fix it. A filesystem does not have to use DAX simply because it is mounted from a PMEM device. It can selectively choose to use DAX or not based on usage patterns or policy settings (and, for example, would never use DAX on directories, as metadata generally needs to be staged out to storage in a controlled fashion). Normal page-cache access could be the default, and write-out to PMEM would use non-temporal writes. DAX would only be enabled while a file is memory-mapped with a new MMAP_DAX flag. In that case, the application would be explicitly requesting DAX access (probably using the nvml library) and would take on the responsibility of calling CLFLUSH as appropriate. It is even conceivable that future processors could make cache flushing for a physical address range much more direct, so keeping track of addresses to flush would become pointless.
Dan Williams championed this position, putting his case quite succinctly:
DAX in my opinion is not a transparent accelerator of all existing apps, it's a targeted mechanism for applications ready to take advantage of byte addressable persistent memory.
He also expressed a concern that fsync() would end up being painful for large amounts of data.
Dave Chinner didn't agree. He provided a demonstration suggesting that the overhead of the proposed fsync() implementation would be negligible. He asserted instead:
DAX is a method of allowing POSIX compliant applications get the best of both worlds - portability with existing storage and filesystems, yet with the speed and byte [addressability] of persistent storage through the use of mmap.
Williams' position resurfaced from time to time as it became clear that there were real and ongoing challenges in making fsync() work, but he didn't seem able to rally much support.
Shape of the solution
In general, the solution chosen is to still use the page-cache data structures, but not to store struct page pointers in them. The page cache uses a radix tree that can store a pointer and a few tags (single bits of extra information) at every page-aligned offset in a file. The space reserved for the page pointer can be used for anything else by setting the least significant bit to mark it as an exception. For example, the tmpfs filesystem uses exception entries to keep track of file pages that have been written out to swap.
Keeping track of dirty regions of a file can be done by allocating entries in this radix tree, storing a blank exception entry in place of the page pointer, and setting the PAGECACHE_TAG_DIRTY tag. Finding all entries with a tag set is quite efficient, so flushing all the cache lines in each dirty page to perform fsync() should be quite straightforward.
As this solution was further explored, it was repeatedly found that some of those fields in struct page really are useful, so an alternative needed to be found.
Page size: PG_head
To flush "all the cache lines in each dirty page" you need to know how big the page is — it could be a regular page (4K on x86) or it could be a huge page (2M on x86). Huge pages are particularly important for PMEM, which is expected to sometimes be huge. If the filesystem creates files with the required alignment, DAX will automatically use huge pages to map them. There are even patches from Matthew Wilcox that aim to support the direct mapping for extra-huge 1GB pages — referred to as "PUD pages" after the Page Upper Directory level in the four-level page tables from which they are indexed.
With a struct page, the PG_head flag can be used to determine the page size. Without that, something else is needed. Storing 512 entries in the radix tree for each huge page would be an option, but not an elegant one. Instead, one bit in the otherwise-unused pointer field is used to flag a huge-page entry, which is also known as a "PMD" entry because it is linked from the Page Middle Directory.
Locking: PG_locked
The page lock is central to handling concurrency within filesystems and memory management. With no struct page, there is no page lock. One place where this has caused a problem is in managing races between one thread trying to sync a page and mark it as clean, and another thread dirtying that page. Ideally, clean pages should be removed from the radix tree completely, as they are not needed there, but attempts to do that have, so far, failed to avoid the race.
Jan Kara suggested that another bit in the pointer field could be used as a bit-spin-lock, effectively duplicating the functionality of PG_locked. That seems a likely approach, but it has not yet been attempted.
Physical memory address
Once we have enough information in the radix tree to reliably track which pages are dirty and how big they are, we just need to know where each page is in PMEM so it can be flushed. This information is generally of little interest to common code, so handling it is left up to the filesystem.
Filesystems will normally attach something to the struct page using the private pointer. In filesystems that use the buffer_head library, the private pointer links to a buffer_head that contains a b_blocknr field identifying the location of the stored data.
Without a struct page, the address needs to be found some other way. There are a number of options, several of which have been explored.
The filesystem could be asked to perform the lookup from file offset to physical address using its internal indexing tables. This is an indirect approach and may require the filesystem to reload some indexing data from the PMEM (it wouldn't use direct access for that). While the first patch set used this approach, it did not survive long.
Alternately, the physical address could be stored in the radix tree when the page is marked as dirty; the physical address will already be available at that time as it is just about to be accessed for write. This leads to another question: exactly how is the physical address represented? We could use the address where the PMEM is mapped into the kernel address space, but that leads to awkward races when a PMEM device is disabled and unmapped. Instead, we could use a sector offset into the block device that represents the PMEM. That is what the current implementation does, but it implicitly assumes there is just one block device, or at least just one per file. For a filesystem that integrates volume management (as Btrfs does), this may not be the case.
Finally, we could use the page frame number (PFN), which is a stable index assigned by the BIOS when the memory is discovered. Wilcox has patches to move in this direction, but the work is 70%, maybe 50%, done. Assuming that the PFN can be reliably mapped to the kernel address that is needed for CLFLUSH, this seems like the best solution.
Is this miniature struct page enough?
One way to look at this development is that a 64-bit miniature struct page has been created for the DAX use case to avoid the cost of a full struct page. It currently contains a "huge page" flag and a physical sector number. It may yet gain a lock bit and have a PFN in place of the sector number. It seems prudent to ask if there is anything else that might be needed before DAX functionality is complete.
As quoted above, Chinner appears to think that transparent support for full POSIX semantics should be the goal. He went on to opine that:
This is just another example of how yet another new-fangled storage technology maps precisely to a well known, long serving storage architecture that we already have many, many experts out there that know to build reliable, performant storage from... :)
Taking that position to its logical extreme would suggest that anything that can be done in the existing storage architecture should work with PMEM and DAX. One such item of functionality that springs to mind is the pvmove tool.
When a filesystem is built on an LVM2 volume, it is possible to use pvmove to move some of the data from one device to another, to balance the load, decommission old hardware, or start using new hardware. Similar functionality could well be useful with PMEM.
There would be a number of challenges to making this work with DAX, but possibly the biggest would be tearing down memory mappings of a section of the old memory before moving data across to the new. The Linux kernel has some infrastructure for memory migration that would be a perfect fit — if only the PMEM had a table of struct page as regular memory does. Without those page structures, moving memory that is currently mapped becomes a much more interesting task, though likely not an insurmountable one.
On the whole, it seems like DAX is showing a lot of promise but is still in its infancy. Currently, it can only be used on ext2, ext4, and XFS, and only where they are directly mounted on a PMEM device (i.e. there is no LVM support). Given the recent rate of change, it is unlikely to stay this way. Bugs will be fixed, performance will be improved, coverage and features will likely be added. When inexpensive persistent memory starts appearing on our motherboards it seems that Linux will be ready to make good use of it.
Page editor: Jonathan Corbet