
Kernel development

Brief items

Kernel release status

The current development kernel is 4.6-rc4, released on April 17. "So there really isn't anything particularly interesting here. Just like I like it in the rc series. Let's hope it stays that way."

Stable updates: none have been released in the last week. The 4.5.2, 4.4.8, and 3.14.67 updates are in the review process as of this writing; they can be expected imminently.

Sasha Levin has released a set of stable-security kernel updates as well.


Kernel development news

The 2016 Linux Storage, Filesystem, and Memory-Management Summit

By Jonathan Corbet
April 20, 2016

LSFMM 2016
The 2016 Linux Storage, Filesystem, and Memory-Management Summit was held April 18 and 19 in Raleigh, North Carolina, USA. On the order of 100 developers representing those subsystems discussed a wide range of highly technical topics. LWN was there, resulting in the following reports.

The summit ran in one to three parallel tracks, depending on the subjects being discussed.

The plenary track included developers from all three subsystems and covered issues relevant to the kernel as a whole. The sessions from this track were:

The memory management track discussed the following topics:

See also: Rik van Riel's notes for a terse summary of the memory-management sessions.

The filesystem-only track had a relatively small number of discussions, which were:

The storage-only track also had a small number of sessions.

  • James Bottomley has posted his notes from those discussions.

The combined filesystem and storage track had the following discussions:

Note that these sessions are still being written up; they will be added to this page once they become available.

Group photo

This photo of the LSFMM 2016 group was provided by the Linux Foundation; more photos can be found on Flickr.


Acknowledgments

Thanks are due to LWN subscribers and the Linux Foundation for supporting our travel to this event.


A storage standards update

By Jake Edge
April 20, 2016

LSFMM 2016

The opening plenary at the 2016 Linux Storage, Filesystem, and Memory Management Summit (LSFMM) was an update from Fred Knight and Peter Onufryk on some of the relevant storage standards. Knight covered changes that are coming from the T10 (SCSI) and T13 (ATA) technical committees, while Onufryk talked about changes in the NVM Express (NVMe) interface specification for SSDs.

T10 and T13

Knight began with "conglomerates", which is a feature that has been in the T10 standard for a while. It is a way to group logical units. The 64-bit logical unit number (LUN) is split into two pieces: a major number, which identifies the group (conglomerate), and a minor number that identifies the unit within the group. There are new SCSI commands (BIND, UNBIND, ...) to create and manage these conglomerates.
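
As a rough illustration of the idea, the split might look like the sketch below; the actual field widths are defined by the T10 proposal, so the 32/32 division used here is purely an assumption.

    #include <stdint.h>

    /* Hypothetical 32/32 split of the 64-bit LUN into a conglomerate
     * ("major") number and a unit ("minor") number. */
    static inline uint32_t lun_conglomerate(uint64_t lun)
    {
            return (uint32_t)(lun >> 32);           /* which group */
    }

    static inline uint32_t lun_unit(uint64_t lun)
    {
            return (uint32_t)(lun & 0xffffffffu);   /* which unit within the group */
    }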

The WRITE ATOMIC and WRITE SCATTER commands were up next. The former will either write all of the data or none of it. Some think that is how SCSI currently works, but if a write returns an error, there is no guarantee of how much data has been written.

[Fred Knight]

WRITE SCATTER allows specifying logical block address (LBA) and length pairs, followed by the data for each segment, and will write the segments. There are no guarantees of atomicity, however, for the entire scatter write. Developers have asked for scatter and atomic to be combined somehow, but that made the storage vendors pull their hair out, Knight said, so the T10 committee said "no way". One thing that could be added that "may be useless" is to guarantee that each individual segment will either be written or not in its entirety. But the order of the writes is not known, so an error will indicate that some or all of the writes were not done.
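
The descriptor layout below is only a sketch of that idea (a list of LBA/length pairs, each with its own data); it is not the actual T10 command format, and the structure and field names are made up for illustration.

    #include <stdint.h>

    /* One segment of a scatter write: where it goes and what to write.
     * No atomicity is guaranteed across segments. */
    struct scatter_segment {
            uint64_t lba;           /* starting logical block address */
            uint32_t num_blocks;    /* segment length, in logical blocks */
            const void *data;       /* num_blocks * block_size bytes */
    };

    /* A scatter write is simply an array of such segments. */
    struct scatter_write {
            uint16_t nsegments;
            struct scatter_segment seg[];
    };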

There were complaints from the audience that some vendors could do atomic scatter writes, but that the large vendors don't want to do so, which is unfortunate. Knight said creating the standard for it is not the hard part—it is getting the vendors to agree to implement it.

Write streams are a way to associate writes so that the data ends up physically "close" on the device. The feature originated with current flash hardware, which can be more efficient if writes to the same file end up in the same erase blocks rather than being scattered all over the drive. The number of streams is vendor dependent; it is currently up to 255, but there are vendors that want to have much lower limits. There is a reserved byte in the command to potentially expand the stream ID to 16 bits. It is a feature that is focused on supporting current hardware, so it is unclear how long it will stay relevant as flash hardware evolves in coming years, he said.

SCSI universally unique IDs (UUIDs) are assigned to vendors by the IEEE but, with the advent of software-defined storage (SDS), there needs to be a way to generate these IDs. The standard has been updated so that SDS devices can create an ID for their use. This was a feature that came directly from the Linux community, Knight said.

"Hints" for I/O (which are called "logical block markup" in T13) give information to the device about the expected access pattern for the data. The device is not obligated to remember or act on the hints; there are a bunch that are listed in the standards, but there are only two or three that make a difference. One of those is a way to "tag" I/O operations, which can, for example, separate operations for keys, data, and metadata that will help make databases run faster.

There was a question about why both streams and hints were supported, as it seems that they do much the same thing. But Knight said that streams are meant to place the data in the stream physically close on the device, while hints are meant to indicate the access patterns for the data. They can be used together, but hints can be ignored by the device.

There are some new log pages for additional statistics and counters, including counts of misaligned I/Os (e.g. 512-byte writes to a device with a 4KB block size) and compression and deduplication statistics. There is a new application tag mode page for the protection information that is used for data integrity, as well. The application tag can be placed in that page and does not need to be sent with the protection information on each block.

Next up was "depop" (or depopulation), which is a way to remove a portion of a device based on the failure of a head or other element. There are proposals for depop being worked on in T10, T13, and NVMe, Knight said. For today's 10TB or 20TB drives, if some component in the device goes bad, people want to keep using it. Manufacturers are already doing this; if they make a 10TB drive and find that 4TB are bad, they sell it as a 6TB drive to consumers.

Offline depop is effectively just a reformat of the drive for its new capacity. There is work on online depop going on, but there are "some interesting problems to be solved". The organization of logical blocks on a device is such that loss of a head (or a die on an SSD) does not equate to a simple range of LBAs, so the drive will need to report the list of bad LBAs to the host. Knight said that the committees are interested in any input that Linux developers can provide.

Knight closed his portion of the talk by noting that he has some amount of his time available to work with the Linux community to take its ideas to the standards bodies. He said that the protection-information change he mentioned earlier, which avoids sending an application tag with every block, was a suggestion from outside, as was the idea of hybrid drives that are partially regular disks and partially shingled magnetic recording (SMR) drives. He encouraged those interested to give him ideas to take back to the committees.

NVMe

[Peter Onufryk]

Onufryk introduced the NVMe organization, which consists of several workgroups under a board. The management interface workgroup is concerned with controlling and monitoring NVMe devices. For example, the interface provides information on what devices are available, the temperature of the disks, and so on. Vendors want to be able to use it to update the firmware on the devices as well. There will also be work coming up on enclosure management and LED control, among other things.

James Bottomley noted that the management interface allows changes that happen without any notification to the operating system. He asked if there was a way for Linux to find out what these changes are. That is a "real mess right now", Onufryk said. The first version of the specification just focused on how to control and monitor the drive, but there will be work on how the management interface should integrate with the host. It is a "complex problem", though.

The technical workgroup is working on two big things: NVMe over fabrics and enhancements to NVMe for streams, directives, and virtualization. Fabric support is aimed at scaling NVMe to thousands of drives, which is beyond what is practical with PCIe. The idea is to have a thin encapsulation of the NVMe protocol that will work on fabrics. The first target is remote DMA (RDMA), though work is also being done on Fibre Channel.

There has been a lot of "blood, sweat, and tears" to keep the PCIe and fabric versions aligned. That means device makers should be able to switch out PCIe for RDMA or some other fabric and not change things much. We will "see if it works".

While there is roughly 90% commonality between the PCIe and fabric implementations, there are some differences. The major ones are the identifier used to target a particular device and the way that devices are discovered. The queueing and data transfer are quite similar between the two.

The specification for NVMe over fabrics will likely be accepted by the end of May, at which point it will become public. Host and target drivers for Linux are under development. They will be released as open source when the specification is released.

The first of the NVMe enhancements Onufryk presented was on host-device information exchange. To best use NVM, the host and device need to exchange various kinds of information beyond just the I/O data. That exchange can take place at various points (before, during, or after a data access) and can target various elements (LBA, the physical media or partition, or the access relationship).

One piece of information that the host can give to the device was mentioned by Knight earlier: stream IDs. For NVMe, those IDs are 16-bit values in the write commands that identify writes associated with each other because they are expected to have the same lifetime. There is little available space in the NVMe command packet, so a "directive" type and ID were added that can be used to send stream IDs or other hints along with commands, though only on a per-command basis.

There are also some virtualization enhancements to NVMe. Today there are emulated hardware devices or paravirtualized drivers, which just puts software in the way, Onufryk said. There is a need to provide direct access to NVMe devices from virtual machines.

The NVMe architecture supports single-root I/O virtualization (SR-IOV), but more is needed. In particular, the new virtualization enhancements will define a standard mechanism to allocate interrupt and queue resources to a virtual NVMe device. There are also additions to control the performance of the virtual devices so that the performance of the entire device can be split up between virtual machines—with each getting a specific portion of that performance.

Those enhancements have all been approved for the next revision of the specification. Someone asked about the status of copy offload for NVMe. Onufryk said that it is not yet being worked on and that thin provisioning is in the same bucket.


Persistent-memory error handling

By Jonathan Corbet
April 20, 2016

LSFMM 2016
One of the key advantages of persistent memory is that it is, for lack of a better word, persistent; data stored there will be available for recall in the future, regardless of whether the system has remained up in the meantime. But, like memory in general, persistent memory can fail for a number of reasons and, given the quantities in which it is expected to be deployed, failures are a certainty. How should the operating system and applications deal with errors in persistent memory? One of the first plenary sessions at the 2016 Linux Storage, Filesystem, and Memory-Management Summit, led by Jeff Moyer, took on this question.

Error handling with traditional block storage is relatively easy: an I/O request will fail with an EIO error, and the application, assuming it is prepared, can handle the error in whatever way seems best. But persistent memory looks like memory to the system, and memory errors are handled differently; in particular, they can trigger a low-level machine-check error. Some systems can recover from that machine check, others will be forced to reboot. Either way, the system has to be able to handle the problem.

Time for a bit of terminology that caused some confusion in the session. Jeff was talking in particular about errors in "load" operations — reading from persistent memory using normal CPU instructions. Those were differentiated from "reads," which are file operations performed with a system call like read(). Similarly, "stores" (using memory operations) and "writes" (file operations) are seen differently. Errors with reads and writes can be returned via the normal system call status; errors with loads and stores are a bit more complicated.

In cases where a machine check from a load operation is recoverable, the kernel can simply deliver the error to the application via a SIGBUS signal. But, even there, it became clear that the situation is not entirely simple: Keith Packard noted that the discussion was about load errors, and asked what happens when a store goes wrong. The problem there is that store operations are not usually synchronous, so there will be no immediate indication of an error. Paranoid software can do a flush and a load after every store to ensure that the data has been stored properly; there does not seem to be any better way.
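
A minimal user-space sketch of catching such a recoverable load error follows; it assumes that the kernel delivers SIGBUS with the faulting address in si_addr (as it does for memory-failure events today), and it leaves any real recovery policy to the application.

    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Invoked when a load from a poisoned persistent-memory range fails. */
    static void pmem_sigbus(int sig, siginfo_t *si, void *ctx)
    {
            /* si->si_addr points into the failed mapping; si_code will be
             * BUS_MCEERR_AR for an action-required machine-check error. */
            fprintf(stderr, "lost data at %p (si_code %d)\n",
                    si->si_addr, si->si_code);
            _exit(1);       /* a real application would fall back to a replica */
    }

    int main(void)
    {
            struct sigaction sa = {
                    .sa_sigaction = pmem_sigbus,
                    .sa_flags = SA_SIGINFO,
            };

            sigaction(SIGBUS, &sa, NULL);
            /* ... mmap() a DAX file and load from it here ... */
            return 0;
    }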

On "less expensive" systems where the machine check is not recoverable, it's entirely possible that the system will end up in a reboot loop where, after each boot, it tries again to access the failing persistent-memory range. This behavior is generally seen as undesirable. As it turns out, even fancier hardware is sometimes subject to non-recoverable machine checks, so something has to be done to ensure reliable operation on all systems.

The ACPI specification includes a mechanism for scrubbing an address range for errors; the UEFI firmware can run it as part of the boot process. Address ranges with errors can be flagged, and the operating system can, once it boots, query that list of ranges with errors and create a bad-block list that it knows must be avoided. When an application tries to access a range with an error via mmap(), the bad pages can be left unmapped and, should the application try to access them, the SIGBUS can be delivered. The scrubbing operation is not necessarily fast, so it would be unsurprising if it didn't run on every boot, but it can be run when errors begin happening.
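
Current kernels already export such a list for pmem block devices; the sketch below assumes the /sys/block/pmem0/badblocks attribute and its "start-sector count" line format, both of which should be verified against the running kernel.

    #include <stdio.h>

    /* Print the known-bad ranges of a persistent-memory block device.
     * Sector numbers are in 512-byte units. */
    int main(void)
    {
            FILE *f = fopen("/sys/block/pmem0/badblocks", "r");
            unsigned long long start, count;

            if (!f) {
                    perror("badblocks");
                    return 1;
            }
            while (fscanf(f, "%llu %llu", &start, &count) == 2)
                    printf("bad: sectors %llu-%llu\n", start, start + count - 1);
            fclose(f);
            return 0;
    }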

The solution as described so far, though, only works at the level of pages. The error granularity reported by the hardware can be as fine as a single 64-byte cache line; marking an entire (4KB) page as being bad when only 64 bytes have been truly lost is less than ideal. One way of narrowing things down would be for the application to open the file with the reported data loss and issue a series of 512-byte reads, narrowing the problem down to a single 512-byte "sector." But, Jeff said, that "still seems a little perverse." It would be nice to be able to directly inform an application about exactly what has been lost.
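
That narrowing approach might look like the sketch below: re-read the suspect page in 512-byte pieces and see which pread() fails (an EIO return is assumed for the lost sector).

    #include <errno.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Narrow a bad 4KB page down to a single 512-byte "sector"; returns
     * the file offset of the first failing sector, or -1 if all of the
     * small reads succeed. */
    static off_t find_bad_sector(int fd, off_t page_off)
    {
            char buf[512];

            for (off_t off = page_off; off < page_off + 4096; off += 512) {
                    if (pread(fd, buf, sizeof(buf), off) < 0 && errno == EIO)
                            return off;
            }
            return -1;
    }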

A number of possibilities for providing that information were discussed. Christoph Hellwig suggested that the information provided with the SIGBUS signal could be expanded to include the exact range that was lost. Dan Williams said that the application could read the bad-block list from sysfs, then use the FIEMAP ioctl() operation to figure out which block in the file was bad. That works today, he said, except that the bad-block list is not updated while the system is live. Ted Ts'o said it would be useful to have a new ioctl() command to query the failing byte range directly.
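
The FIEMAP route Dan described is available today; a minimal sketch (error handling trimmed) maps a file offset to a physical location so it can be compared against the bad-block list.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <linux/fiemap.h>
    #include <linux/fs.h>

    /* Ask the filesystem where byte "offset" of an open file lands on
     * the underlying device. */
    static int map_offset(int fd, uint64_t offset)
    {
            struct fiemap *fm = calloc(1, sizeof(*fm) +
                                          sizeof(struct fiemap_extent));
            int ret = -1;

            if (!fm)
                    return -1;
            fm->fm_start = offset;
            fm->fm_length = 1;              /* just the byte we care about */
            fm->fm_extent_count = 1;
            if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0 && fm->fm_mapped_extents == 1) {
                    printf("logical %llu -> physical %llu\n",
                           (unsigned long long)fm->fm_extents[0].fe_logical,
                           (unsigned long long)fm->fm_extents[0].fe_physical);
                    ret = 0;
            }
            free(fm);
            return ret;
    }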

James Bottomley said that the most friendly approach would be to remap the bad block and hide it entirely, only informing applications if data has actually been lost. It was agreed that remapping would work if an error is detected on a write operation, but the real problem is with reads (or, properly, loads) where data is known to have been lost. In that case, applications should not be forced to dig through the bad-block list; there should be a more direct interface. There also needs to be some sort of interface to clear the error (typically remapping the block) so that the given address range becomes usable again.

As the session wound down, a few residual questions came up, but no real decisions were reached. Ted asked whether the problem of non-recoverable machine checks would go away as the hardware improves; Jeff answered that it might, but that doesn't change the real issue of how to convey problems to user space. Ted also asked whether this information should be provided to applications at all — isn't that assuming a fundamental change in application behavior? Ric Wheeler answered that applications that care about data integrity already keep multiple copies of the data; they just need to know where things go wrong.

As has been seen for a while, persistent memory raises a number of questions with regard to how it should be presented to user space. While many of the problems are being solved, it seems likely that persistent memory will be a discussion topic at events like this for some time yet.


Two transparent huge page cache implementations

By Jonathan Corbet
April 20, 2016

LSFMM 2016
The transparent huge pages (THP) mechanism, in the kernel since 2.6.38, allows the system to use huge (typically 2MB) pages without application knowledge or involvement. Huge pages can be a significant performance improvement for a number of workloads. The feature only works with anonymous (application data) pages, though; support for transparent huge pages for the page cache (file data) has never made it into the mainline, despite the fact that the page cache often is the biggest user of memory on the system. There are, however, two working implementations out there; the first task of the memory-management track at the 2016 Linux Storage, Filesystem, and Memory-Management Summit was to try to choose between them.
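
While no application changes are required, an application that knows a particular anonymous mapping will benefit can ask for huge pages explicitly; the madvise() hint below is the existing interface for that and is shown only for context.

    #include <stddef.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t len = 64 << 20;          /* 64MB of anonymous memory */
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED)
                    return 1;
            /* Mark the range as a good THP candidate; the kernel may
             * still fall back to 4KB pages. */
            madvise(p, len, MADV_HUGEPAGE);
            return 0;
    }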

One of the contenders is Kirill Shutemov's THP-enabled tmpfs patch set. Kirill's work is based on compound pages, a fairly elaborate mechanism for binding individual memory pages into a single, larger page. The other solution is the team-pages patch set from Hugh Dickins. A "team page" can be thought of as a new, arguably lighter-weight way of grouping pages together. Hugh's work has the advantage of having been deployed in production at Google for over a year; that gives developers a relatively high level of confidence that it lacks obscure bugs — always a nice feature in memory-management patches. Even so, it became clear, the choice between the two is far from straightforward.

Differing objectives

Kirill started off by saying that one of his primary goals was the ability for applications to access individual 4KB subpages of a huge page without the need to split the huge page itself. The current anonymous THP implementation does not have that property, but the ability to access smaller subpages is more important in the page-cache setting. Hugh, instead, was more focused on getting things working quickly, meaning that a less intrusive (but not necessarily smaller) implementation was required. Compound pages, he said, are hard to manipulate; they also force a deeper level of integration with various subsystems, including memory control groups, that he would rather avoid. There was some back-and-forth on whether compound pages are truly more complex than team pages, but no conclusions were reached.

Both implementations currently work with the tmpfs filesystem, which is a start, but a proper transparent huge page cache implementation needs to work with "real" filesystems as well. Kirill stated that he is working on that objective which, he asserted, will be more easily achieved with compound pages than with the team-pages approach, which is known to be incompatible with ext4 in its current form.

[Kirill Shutemov and Hugh Dickins]

There was some discussion of how smaller files are handled. Team pages are assembled from individual pages, and, thus, should deal with small files reasonably well. A huge page will be allocated (memory availability allowing) at the beginning, but isn't fully charged against the allocating process. If need be, the huge page can be split to return memory to the system. Compound pages seemingly have worse small-file performance at the moment, allocating and charging a huge page from the outset. That has led to some testers reporting out-of-space errors with this patch set.

Andrea Arcangeli, the original author of the THP feature, thought that people were worrying too much about the small-file issue. He said that THP should be seen purely as a performance optimization. Anybody who is concerned about whether huge pages are using too much memory should simply not enable the feature.

Sticking points

Hugh made the assertion that the choice was between one implementation that is working, running on thousands of machines, and popular with its users and another that, he said, is "getting there." The team-pages code, he said, is ready to go in, with the possible exception of review of the ABI aspects — mount options, sysfs features, etc. It quickly became clear that there was no consensus for that in the room, though.

One of the key sticking points first came up at this time. The current THP code uses compound pages; Kirill's work is an extension of that approach. The team-pages mechanism is, instead, entirely new. If it is merged, the kernel will be using different techniques for anonymous and page-cache huge pages, essentially doubling (or worse) the amount of code that must be maintained going forward. Vlastimil Babka asked whether Hugh's patches could be converted to use compound pages; Hugh said that might be possible, but he would rather merge what he has now, then let Kirill do the conversion later if he is interested in doing it.

Another issue is "recovery" — substituting a huge page for a set of small pages at some future time if, for whatever reason, a huge page is not allocated at the outset. Team pages seem to have fewer problems with recovery, especially in the case of small files that grow, since the underlying huge page is allocated at the outset if possible. Failing that, either approach needs to run work in the background to coalesce ("collapse") sets of small pages into huge pages. The team-pages patches currently lack such a mechanism; compound pages appear to be in better shape in that regard.

There was some discussion of details — whether the work should be done in the khugepaged thread or in work queues in the context of the processes owning the pages — but that was peripheral to the main issue. As Mel Gorman put it, this question doesn't affect the decision that was being discussed.

Mel went on to complain that the purpose of the session was to choose between the two implementations, and that the group seemed no closer to that objective. This question has implications beyond just THP; he noted that the team-pages patches are currently in Andrew Morton's -mm tree. They are creating conflicts with other patches, notably his own node-accounting patches. If we are not merging team pages for 4.7, he said, those patches should not be in -mm (and, thus, in linux-next) at this time. Hugh said he really wanted the patches to get some exposure and testing, but that they could be backed out for now if need be.

Kirill said that there is no reason to rush the team-pages patches into the kernel now, just to add compound pages later. That, he said, would just add a bunch of churn. But, Hugh said, it would also give Linux users the opportunity to benefit from this work now, to which Kirill responded that Hugh had held on to the patch set for a year, so there should be no urgency now. The team-pages patches are ready, and have been for six months or so, Hugh said; he went on to say that, while Kirill has done well with the compound-page work, he (Hugh) doesn't know how long it will be until he is confident in that work.

Setting requirements

At about this point, Mel went to the front of the room in an attempt to focus the group and get a decision made. Using the flip chart, he started a list of requirements each approach would have to meet before the room — including the competing developer — would accept it into the mainline. It took a while, but some clear requirements did result.

On the compound-pages side, small files must not waste memory. Mel noted that, when the PowerPC architecture went to a 64KB native page size, the amount of memory required to run a basic system quadrupled. Andrea said that the anonymous THP implementation never allocates huge pages for small virtual memory areas; THP for the page cache should make similar decisions for small files. Kirill said that something like this is supported now via a mount option, but some work needs to be done. Among other things, the mount option should go away; things need to just work without administrative tuning. This requirement was extended to include fast recovery when small files grow into large files — the system should be able to swap in huge pages in short order.

Hugh also suggested that the compound-pages approach needs to demonstrate a high level of robustness before it can be considered. This requirement was seen as being somewhat unfair, though: it is easy to show a lack of robustness, but difficult to show its presence. In the end, it will be incumbent on Hugh to show which robustness problems, if any, exist in the compound-pages implementation.

One of the requirements for team pages is similar: it has to have a recovery mechanism for files that didn't get huge pages assigned initially. In particular, khugepaged or something like it must be able to collapse pages when appropriate.

The harder requirement, though, is to move away from having independent mechanisms for anonymous and page-cache pages. If the team pages approach is to be adopted, there must be a plausible commitment to implement team pages for anonymous pages as well. If, once the implementation is in place, it is shown (in the form of performance problems, for example) that the problems are sufficiently different that two approaches are necessary, the two can remain separate. But until such a thing has been conclusively demonstrated, the goal needs to be a single approach for both cases. There is a lot of concern about excessive complexity in the memory-management code; few people want to add to it.

Finally, the current incompatibility between team pages and non-tmpfs filesystems (the ext4 filesystem in particular) needs to be resolved. In practice that means that team pages must stop using the PagePrivate flag, since ext4 (along with other filesystems) is already using it.

The session concluded with both Kirill and Hugh agreeing that, if the other developer's system met the requirements, they would not block its merging. Hugh also agreed that team pages would come out of the -mm tree for now, since it is not destined for merging in 4.7. What will be merged in subsequent cycles remains to be seen; it would not be entirely surprising if it were a topic of discussion again at LSFMM 2017.


Ideas for rationalizing GFP flags

By Jonathan Corbet
April 20, 2016

LSFMM 2016
The kernel's memory-allocation functions normally take as an argument a set of flags describing how the allocation is to be performed. These "GFP flags" (for "get free page") control both the placement of the allocated memory and the techniques the kernel can use to make memory available if need be. For some time, developers have been saying that these flags need to be rethought; in two separate sessions at the 2016 Linux Storage, Filesystem, and Memory-Management Summit, Michal Hocko explored ways of doing that.

GFP_REPEAT

The first session, in the memory-management track, started with a discussion of the GFP_REPEAT flag which, as its name would suggest, is meant to tell the allocator to retry an attempt should it fail the first time. This flag, Michal said, has never been useful. It is generally used for order-0 (single-page) allocations, but those allocations are not allowed to fail and, thus, will retry indefinitely anyway. For larger requests, he said, it "pretends to try harder," but does not actually do anything beneficial. Michal would like to clean this flag up and create a better-defined set of semantics for it.

The kernel does have the opposite flag in the form of GFP_NORETRY, but that one, he said, is not useful for anything outside of order-0 allocations. What he would like to see instead is something he called GFP_BESTEFFORT; it would try hard to satisfy the request, but would not try indefinitely. So it could retry a failed request, and even invoke the out-of-memory killer but, should that prove fruitless, it would give up. This flag would be meant to work for all sizes of requests.

He is trying to move things in that direction, starting with the removal of GFP_REPEAT from order-0 allocation requests around the kernel. The next step would be to start placing the new flag in the places where it makes sense. As an example, he mentioned transparent huge pages and the hugetlbfs filesystem. Both need to allocate huge pages but, while an allocation failure for a transparent huge page is just a missed optimization opportunity, a failure in hugetlbfs is a hard failure that will be passed back to user space. It clearly makes sense to try harder for hugetlbfs allocations.

Johannes Weiner asked whether it would be a good idea to provide best-effort semantics by default while explicitly annotating the exceptions where it is not wanted. The existing GFP_NORETRY flag could be used for that purpose. Michal said that doing so would cause performance regressions, leading Andrew Morton to question whether "taking longer but succeeding" constitutes a regression. The point is that some callers do have reasonable fallback paths for failed allocations and would rather see the failures happen quickly if they are going to. Andrew asked how often that sort of failure happens, but nobody appeared to have any sort of answer to that question. It will be, in any case, highly workload-dependent.
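
The kind of fallback path being referred to commonly looks like the kernel-side sketch below, which asks the allocator to fail fast and then falls back to vmalloc(); the function name and context are made up for illustration.

    #include <linux/slab.h>
    #include <linux/vmalloc.h>

    /* Try for physically contiguous memory without retries or the OOM
     * killer; fall back to vmalloc(), which only needs order-0 pages.
     * The caller frees the result with kvfree(). */
    static void *alloc_big_table(size_t size)
    {
            void *p = kmalloc(size, GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);

            if (!p)
                    p = vmalloc(size);
            return p;
    }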

Johannes persisted, saying that it can be difficult to know where the memory allocator should be told to try harder, but it is usually easy to see the places where failure can be handled easily. There was also a suggestion to make the flags more fine-grained; rather than use a vague "best effort" flag, have flags to specify that retries should not be done, or that the out-of-memory killer should not be invoked. Mel Gorman noted that he has already done some work in that direction, adding flags to control how reclaim should be performed.

That led to a wandering discussion on whether the flags should be positive ("perform direct reclaim") or negative ("no direct reclaim"). Positive flags are more descriptive, but they are a bit more awkward to use since call sites will have to mask them out of combined mask sets like GFP_KERNEL. There are also concerns that there aren't many flag bits available for fine-grained control.

The session ended with Michal asking if the group could at least come to a consensus that his work cleaning up GFP_REPEAT made sense. There seemed to be no objection there, so that work can be expected to continue.

GFP_NOFS

Later that day, the entire LSFMM group was present while Michal talked about a different GFP flag: GFP_NOFS. This flag instructs the memory allocator to avoid actions that involve calling into filesystem code — writing out dirty pages to files, for example. It exists for use by filesystem code for a number of reasons, the most straightforward of which is the avoidance of deadlocks. If a filesystem acquires locks then discovers that it must allocate memory, it doesn't want the allocator coming back and trying to obtain the same locks. But there is more to it than that; GFP_NOFS reflects a number of "indirect dependencies" within the filesystems. Also, XFS uses it for all page-cache allocations, regardless of deadlock concerns, to avoid calling so deeply into filesystem code that the kernel stack overflows.

There are, Michal said, too many uses of GFP_NOFS in the kernel tree; they needlessly constrain the memory allocator's behavior, making memory harder to obtain than it should be. So he would like to clean them up, but, he acknowledged, that will not be easy. The reason for any given use of GFP_NOFS is often far from clear — if there is one at all.

His suggestion is to get rid of direct use of that flag entirely; instead, setting a new task flag would indicate that the current task cannot call back into filesystem code. XFS has a similar mechanism internally now; it could be pulled up and used in the memory-management layer. A call to a function like nofs_store() would set the flag; all subsequent memory allocations would implicitly have GFP_NOFS set until the flag was cleared.
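
A sketch of how filesystem code might use such an interface appears below; the nofs_store()/nofs_restore() names follow the suggestion above but are hypothetical, as are the surrounding function and structure names.

    /* Hypothetical scoped-NOFS API: every allocation made between the
     * two calls behaves as if GFP_NOFS had been passed, even in code
     * that knows nothing about GFP flags. */
    static int fs_do_transaction(struct fs_info *fs)
    {
            unsigned int flags = nofs_store();      /* set the per-task flag */
            int ret;

            ret = fs_log_start(fs);         /* may allocate; implicitly NOFS */
            /* ... filesystem work under locks ... */
            nofs_restore(flags);            /* clear (or restore) the flag */
            return ret;
    }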

There are a number of reasons for preferring this mechanism. Each call to nofs_store() would be expected to include documentation describing why it's needed. It allows the "no filesystem calls" state to follow the task's execution into places — security modules, for example — that have no knowledge of that state. Chris Mason noted that it would save filesystem developers from sysfs, which brings surprises of its own. Ted Ts'o added that there are a number of places where code called from ext4 should be using GFP_NOFS for its allocations, but that doesn't happen because it would simply be too much work to push the GFP flags through the intervening layers. Thus far, he has been crossing his fingers and hoping that nothing goes wrong; this mechanism would be more robust.

Michal asked the filesystem developers in the room how much work it would be to get rid of the GFP_NOFS call sites. Chris said that the default in Btrfs has been to use it everywhere; a bunch of those sites have since been fixed, but quite a few remain. He would be happy to switch to the new API, he said. Ted agreed, as long as the transition would be gradual and GFP_NOFS would not disappear in a flag day, as it were. The end result, he said, would be nice.

There was some talk of refining the mechanism to specify the specific filesystem that should be avoided, allowing the memory allocator to call into other filesystems. The consensus seemed to be that this idea would be tricky to implement; the possibility of stack overruns was also raised. Michal will go ahead and put together an API proposal for review. He hopes it will succeed: the fewer GFP_NOFS sites there are, the better the memory allocator's behavior will be.


Persistent memory as remote storage

By Jake Edge
April 20, 2016

LSFMM 2016

In a combined storage and filesystem session at the 2016 Linux Storage, Filesystem, and Memory Management Summit (LSFMM), Chuck Lever talked about using remote DMA (RDMA) for access to persistent memory. But, more generally, he was "soliciting feedback and rotten tomatoes" about what changes might be needed to make the various protocols and persistent storage classes/types work well together.

[Chuck Lever]

While he mostly discussed RDMA, Lever said that many of the same issues apply to other protocols, such as iSER and SRP at the block layer and SMB Direct and NFS/RDMA at the file layer. The performance equation for those protocols and fast persistent devices is such that the cost of making data durable may be less than that of the I/O to get the data to the device.

So, Lever asked, why marry slow technology to this new fast technology? Data replication for disaster recovery is one particularly good use case. It can be set up so that there are geographically diverse failure domains so that the data will be available for recovery. There are other use cases as well.

Today, Linux uses a "pull mode" to do I/O to remote targets, where the initiator exposes a region of its memory to the RDMA controller and sends a request to the target, which then uses that memory to complete the request. Once the initiator receives a reply, it invalidates the memory it exposed so it can no longer be accessed. For a read, the target simply places the data into the initiator's memory using an RDMA write. But for a write request, the target must do an RDMA read to the initiator to get the data to be written and await the response before it can write it. That means there is an additional round-trip for writes.

There are some advantages to pull mode, Lever said, including good memory security, since the initiator only exposes small amounts of memory and only for the duration of the request. In addition, the work to do the transfer is moved to the target side, leaving the CPU on the initiator available for application work. There are several downsides too, however: each request involves more than one interrupt, write requests require the extra round-trip, and the target CPU has to be involved in all requests.
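
For concreteness, the target-side RDMA write that completes a read request in pull mode could be posted with libibverbs roughly as below; queue-pair setup, memory registration, and completion handling are omitted, and the names are placeholders.

    #include <stdint.h>
    #include <string.h>
    #include <infiniband/verbs.h>

    /* Push read data straight into the initiator's exposed buffer with a
     * one-sided RDMA write; the initiator's CPU is not involved. */
    static int rdma_push_read_data(struct ibv_qp *qp, struct ibv_mr *mr,
                                   void *data, uint32_t len,
                                   uint64_t remote_addr, uint32_t rkey)
    {
            struct ibv_sge sge = {
                    .addr   = (uintptr_t)data,
                    .length = len,
                    .lkey   = mr->lkey,
            };
            struct ibv_send_wr wr, *bad_wr;

            memset(&wr, 0, sizeof(wr));
            wr.opcode              = IBV_WR_RDMA_WRITE;
            wr.sg_list             = &sge;
            wr.num_sge             = 1;
            wr.send_flags          = IBV_SEND_SIGNALED;
            wr.wr.rdma.remote_addr = remote_addr;   /* from the request */
            wr.wr.rdma.rkey        = rkey;          /* initiator's key */

            return ibv_post_send(qp, &wr, &bad_wr);
    }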

The NFS server on Linux does not have zero-copy write—except for small I/O operations, as Christoph Hellwig pointed out. Lever said that RDMA could perhaps do zero-copy writes to get better performance. He asked: should splice() be used to do so? Hellwig replied that "splice() is really nicely over-hyped" and doesn't really help this kind of problem. He suggested that any I/O for a fast device should be using direct I/O to avoid the page cache.

For the future, Lever wondered about switching to a "push mode" instead. The initiator would register its interest in regions of a file and the target would expose memory for the initiator to use for read and write operations on those regions. It would return handles to the regions for the initiator to use; multiple RDMA read and write operations could be performed by the initiator before it informed the target that it was done. At that point, the handles would be invalidated (and the memory no longer exposed).

Ted Ts'o asked what the "security story" was for push mode. Lever replied that it uses "reliable connections" where there are only two peers. That connection is set up so that one side can view the other's memory based on the handles. Those handles are only valid for a single connection and the hardware guarantees that other endpoints can't interfere with the connection.

One problem is that there is no generic way to ensure that writes have reached durable storage for the remote storage protocols. Each operating system, network/fabric, and device has different durability guarantees and its own way to ensure that a remote write is stored safely. Sagi Grimberg suggested that code to ensure durability could be written once for all the different options and made available as a library, something like what DAX has. There was general agreement that there should be an API made available that hides the differences.


Patches and updates

Kernel trees

Linus Torvalds: Linux 4.6-rc4
Sebastian Andrzej Siewior: 4.4.7-rt16
Kamal Mostafa: Linux 4.2.8-ckt8
Sasha Levin: Linux 4.1.22
Kamal Mostafa: Linux 3.19.8-ckt19
Sasha Levin: Linux 3.18.31
Luis Henriques: Linux 3.16.7-ckt27
Kamal Mostafa: Linux 3.13.11-ckt39
Jiri Slaby: Linux 3.12.58

Architecture-specific

Core kernel code

Device drivers

Device driver infrastructure

Filesystems and block I/O

Memory management

Networking

Security-related

Virtualization and containers

Miscellaneous

Page editor: Jonathan Corbet


Copyright © 2016, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds