
Kernel development

Brief items

Kernel release status

The current development kernel is 4.6-rc5, released on April 24. Linus said: "Things continue to be fairly calm: rc5 is bigger than rc4 was, but rc4 really was tiny. And while we're back to fairly normal commit counts for this time in the release window, the kinds of bugs people are finding remain very low grade: there's absolutely nothing scary in here. If things continue this way, this might be one of those rare releases that don't even get to rc7."

Stable updates: 4.5.2, 4.4.8, and 3.14.67 were released on April 21.

Comments (none posted)

Quote of the week

I don't mind small pull requests at all, and I don't see "just one tiny commit" as being a bad thing. Quite the reverse. Those pull requests are easy, and it just makes me feel "good, that subsystem is calm and quiet, but not because the maintainer is not responding to people".
Linus Torvalds

Comments (none posted)

Lots more LSFMM coverage

As can be seen from the articles below, we are continuing to make progress on writing up the 2016 Linux Storage, Filesystem, and Memory-Management Summit. At this point, the memory-management track is complete, while the other tracks are getting close. Interested readers can go to our LSFMM 2016 page for the full set, along with the inevitable group photo.

Comments (none posted)

Kernel development news

reflink() and other topics

By Jake Edge
April 26, 2016

LSFMM 2016

At the 2016 Linux Storage, Filesystem, and Memory-Management Summit, Darrick Wong led a session to discuss several features that he has been working on for XFS. While the session was slated as a plenary, the memory-management track was embroiled in another discussion so many of those developers were absent. Wong said that he had expected to stir up Dave Chinner with some of the topics, but Chinner ran into some travel difficulties and was unable to make it to the summit.

Wong has implemented a reverse mapping of physical blocks to files and metadata for XFS. He had to change the btree that tracks that information in ways that broke many of the assumptions in the code. This is all in preparation for getting reflink() working for XFS.

[Darrick Wong]

He has also been working on an online scrubber for XFS that could be used by other filesystems. It would find and fix problems in the filesystem data structures. The idea is to make the scrubber "pluggable" so that it could sensibly deal with metadata and other differences between filesystems. The scrubber will walk the filesystem, locking directories and scrubbing their contents as it encounters them. He has run some "fairly trivial tests" of the scrubber on XFS and ext4.

He has also been working on allowing two files to share pages in the page cache. Al Viro asked if that was for reading or writing; Wong said both would be supported. Chris Mason wondered why writing was in the mix. Wong said that the idea was to share pages as long as they aren't modified; a copy-on-write would be done if they are changed and both copies would be maintained at that point.

Viro was concerned that the tracking of sections of files would need to be concerned with more operations than simply writes. For example, collapsing a range to shrink a file would need to be reflected in the page cache. Jan Kara said that the page cache entries for the file could just be invalidated when those operations are performed. It may be somewhat tricky to identify the pages of interest, but he thought it should be possible to make it all work.

Wong said that he had come to the summit prepared to hear: "Gee, Darrick, you have a ten-year project on your hands". As it turned out, though, his impression is that sharing pages can probably be done, but there will be a lot of bookkeeping needed.

Returning to reflink(), Wong said that he is trying to make the XFS interface as close as he can to what Btrfs does "to avoid splitting people's brains". He is also trying internally to get OCFS to use the same interface as well. Christoph Hellwig has been helping with testing and there are still some bugs in the btree code that need to be worked out.
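
For reference, the Btrfs-style clone interface that Wong is matching looks roughly like the following from user space. This is a minimal sketch, assuming the FICLONE ioctl() (the VFS alias for BTRFS_IOC_CLONE) is available in the installed kernel headers; it is not code from Wong's XFS patches.

```c
/* Share one file's blocks with another via the Btrfs-style clone ioctl(). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* FICLONE */
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <source> <dest>\n", argv[0]);
		return 1;
	}

	int src = open(argv[1], O_RDONLY);
	int dst = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (src < 0 || dst < 0) {
		perror("open");
		return 1;
	}

	/* Share the source file's blocks with the destination; blocks are
	 * copied only when one side is later written (copy-on-write). */
	if (ioctl(dst, FICLONE, src) < 0) {
		perror("FICLONE");
		return 1;
	}

	close(src);
	close(dst);
	return 0;
}
```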

One question he had was how to handle quotas for reflinked files. An easy way to do it would be to charge the full size of the file at reflink time, but it might be better to wait until a copy-on-write happens. But Hellwig pointed out that once a file is reflinked, the user loses control of when a copy-on-write operation might be done. Charging against the quota at that point could lead to situations where the quota was exceeded, so the full charge should be done at reflink time.

Comments (1 posted)

Quickly: Filesystems and containers / Self-encrypting drives

By Jake Edge
April 27, 2016

LSFMM 2016

Two lightning talks ended day one of the 2016 Linux Storage, Filesystem, and Memory-Management Summit. One looked at the problems with user namespaces and the image files used by unprivileged containers. The other was concerned with self-encrypting disk drives.

Containers and filesystem images

James Bottomley kicked things off by describing a problem inside containers using user namespaces, where root in the container is mapped to some unprivileged user outside the namespace. Filesystems that are mounted in the container will have files that are owned by root but, by the time a read or write hits the virtual filesystem (VFS) layer, the UID is for the unprivileged user, so those operations fail. There is a need to not do this UID remapping for some mounts.
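
To make the mismatch concrete, here is a toy C sketch (not kernel code) of the kind of single-line UID map an unprivileged container might use; the specific numbers are illustrative assumptions.

```c
/* Toy illustration of the uid_map line "0 100000 65536": container UID 0
 * is host UID 100000, so a file owned by host UID 0 on a mounted image
 * has no valid mapping inside the namespace and appears as the overflow
 * UID, which is why reads and writes by "root" in the container fail. */
#include <stdio.h>

#define MAP_INSIDE   0      /* first UID inside the namespace  */
#define MAP_OUTSIDE  100000 /* first UID outside the namespace */
#define MAP_COUNT    65536  /* length of the mapped range      */
#define OVERFLOW_UID 65534  /* /proc/sys/kernel/overflowuid    */

static long host_to_container(long host_uid)
{
	if (host_uid >= MAP_OUTSIDE && host_uid < MAP_OUTSIDE + MAP_COUNT)
		return MAP_INSIDE + (host_uid - MAP_OUTSIDE);
	return OVERFLOW_UID;	/* unmapped: e.g. image files owned by host root */
}

int main(void)
{
	printf("host 100000 -> container %ld\n", host_to_container(100000));
	printf("host      0 -> container %ld\n", host_to_container(0));
	return 0;
}
```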

One way to do that would be to give up on using bind mounts and to use FUSE mounts that are user-namespace-aware instead. David Howells said that the performance would be lacking for FUSE filesystems, but Bottomley was not so sure. There have been lots of performance enhancements to FUSE, so "in theory we can get reasonable performance".

As he saw it, there were three options: use a FUSE filesystem, re-work the user-namespace remapping code throughout the VFS, or take the patches that systemd uses. He expected that Al Viro (who was not present for the talk) was likely to be resistant to making these changes at the VFS level. Alternatively, there are twelve or so "horrible patches" that systemd uses to handle this problem, but he noted that there are many more users of unprivileged containers that also need a solution to the problem—it is not systemd-specific at all.

There was a question about why the filesystem image wasn't simply changed to remap the UIDs before mounting. Bottomley said that breaks the checksum of the image, which is used to verify its integrity.

Ted Ts'o suggested that a specific UID-remapping filesystem could be created, along the lines of overlayfs. That would limit the UID remapping to that filesystem, rather than scattering it throughout the VFS layer.

That idea had some appeal. Bottomley noted that FUSE has options for better performance, including direct I/O and writeback caching. But it would seem that the overlayfs-based solution may be given a long look.

Self-encrypting drives

Keith Busch wanted to discuss self-encrypting drives and how best to support them in Linux. In particular, what is responsible for unlocking the drives after the system goes to sleep? When the power is removed from these drives, they lock; when power is restored, they require user input to unlock them. Other operating systems store the user's password somewhere (such as in EFI variables) and then play it back when the system wakes to unlock the drives.

Martin Petersen questioned the value of self-encrypting drives other than as a check-mark for "security". Overall, there was general skepticism about the security value of the feature. Busch said that there were customer requests to support the feature, however.

There are Trusted Computing Group (TCG) specifications to handle the authentication problem, but Busch guessed that adding that code to the kernel would not be welcome. Hannes Reinecke concurred, saying that the kernel security developers would not want that since the TCG code implementation is "horrible".

Dan Williams suggested that the BIOS could put up a prompt to ask for the password to unlock the drive. That could be done as a pre-resume hook that re-runs the authentication step. Others, though, believed the problem had already been solved: "dm-crypt, check it out". In the end, there did not seem to be much support for handling these devices, even though both Busch and Brian King said that there are growing customer requests.

Comments (3 posted)

The multi-order radix tree

By Jonathan Corbet
April 27, 2016

LSFMM 2016
Radix trees have a number of uses in the kernel, the most prominent of which is storing the association between pages in memory and the file blocks that hold their backing store. That particular tree was designed under the assumption that all pages are the same size. When huge pages are used, each single page must be represented by many (typically 512) entries in the radix tree, which is less than fully efficient. Matthew Wilcox has been working on adding multi-order (multiple sizes) support to the radix tree; he described this work in a plenary session at the 2016 Linux Storage, Filesystem, and Memory-Management Summit.

[Matthew Wilcox]

The core idea behind this work is to allow the radix tree to hold a single entry to represent an entire huge page. This is useful for persistent memory, which is most efficiently managed as huge pages, but it is also desirable for the transparent huge pages mechanism. There have been suggestions that it might assist filesystems in the management of large file blocks, but Matthew is not sure whether that is true or not. Regardless, he wanted developers to know that this functionality is now available.

The radix tree API has changed little as a result of this work. The low-level functions __radix_tree_create() and __radix_tree_insert() now have order arguments indicating the size of the inserted entry. Code making use of multi-order radix trees may require significant changes, though, so that the code is prepared for the multi-order entries that can be returned by lookup operations.
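
As a rough illustration, a caller inserting one entry that stands in for a 2MB huge page might look like the sketch below. It is a hedged example based on the order argument described above, not code from Wilcox's patches, and locking and error handling are omitted.

```c
#include <linux/radix-tree.h>
#include <linux/gfp.h>

/* A tree using the usual page-cache-style radix tree root. */
static RADIX_TREE(my_tree, GFP_KERNEL);

static int insert_huge_entry(unsigned long index, void *entry)
{
	/*
	 * Order 9 (512 base pages, i.e. a 2MB huge page on x86) means this
	 * single slot covers the whole range; a lookup anywhere in
	 * [index, index + 511] is expected to return 'entry'.
	 */
	return __radix_tree_insert(&my_tree, index, 9, entry);
}
```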

Internally, the feature is implemented by tagging pointers with an "indirect" bit to indicate nodes in the tree; if that bit is clear, the pointer refers to user data instead. At the bottom level of the tree, a number of "sibling" entries contain pointers to the canonical entry for the page. Notably, the tree doesn't store the order of the page; users have to get that information via some other means.

James Bottomley asked whether the tree could be used to detect opportunities for the use of larger pages; Matthew answered that it was probably not the right place for that. Chris Mason said that he once tried to get Btrfs to use stubs in the radix tree as a sort of lock when direct I/O is being performed; that technique might be more easily implemented using the multi-order feature. It could, perhaps, make direct I/O "a bit less of a nightmare." Jan Kara noted that similar things are done to support the DAX direct-access mechanism. He also said that the page cache uses radix-tree entries to mark evicted pages, which could perhaps interfere with other uses. The right solution, perhaps, would be the long-discussed range-locking mechanism; at this point, though, the session wound down and the idea was not discussed further.

Comments (none posted)

Performance-differentiated memory

By Jake Edge
April 27, 2016

LSFMM 2016

There are memory devices coming down the pike that have different performance characteristics than traditional DRAM. Linux will need to support these devices, but there is a question whether they should be treated as traditional memory that is managed by the kernel or if they should be presented as separate devices with their own drivers. Dan Williams led a plenary discussion on that topic on the second day of the Linux Storage, Filesystem, and Memory-Management Summit.

[Dan Williams]

Technologies like 3D XPoint and High Bandwidth Memory (HBM) perform differently from DRAM. These types of devices might also be mirrored or serve as caches for slower memory. So applications may need to know that there is a limited amount of high-performance memory available, with more, slower memory behind that cache.

If the memory-management subsystem is to handle this type of memory, it could be classified as a memory zone or some kind of NUMA node. There will be a need to find this memory by its type and location, so a new memory type may be warranted, Williams said, rather than tracking it as a "crazy kind of NUMA node".

High-performance computing (HPC) applications, databases, and other applications that know how to do their own buffer management just need the kernel to tell them what type it is and where it is. The kernel would then get out of the way and let the application manage that space. To Williams, that seems like a device.

Other applications, though, simply want something sane done with that memory. They don't need strong guarantees about what type of memory is used. That seems more like a memory-management subsystem job to him.

It comes down to whether the memory is tracked with struct page or not, Ted Ts'o said. Williams thinks that the memory would start off being managed as a device without page structures, but it could be handed off to the memory-management subsystem at some point, which would create the page structures at that point.

For persistent memory, there is a use case for hypervisors or databases that don't need a filesystem. The persistent memory can be carved up, in a way similar to partitions, then handed out in huge regions to these applications. Keith Packard said that for his work (on The Machine), the plan is to put a filesystem on top of chunks of persistent memory. But some of that memory can also be hotplugged into the kernel and get page structures at that time.

The device side of things is fairly well understood, Williams said. It presents a file or device that an application can mmap() into its address space. The problem comes when you want to get the memory-management subsystem involved. He asked, is there a need to have something besides NUMA nodes to direct applications to the different memory types?
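
A minimal sketch of that device model, with a hypothetical device node name and only basic error handling, might look like this:

```c
/* Map a persistent-memory device and manage the space from user space;
 * "/dev/dax0.0" is an assumed device node, not a requirement. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 1UL << 30;			/* map 1GB of the device */
	int fd = open("/dev/dax0.0", O_RDWR);	/* hypothetical device node */

	if (fd < 0) {
		perror("open");
		return 1;
	}

	void *base = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_SHARED, fd, 0);
	if (base == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* From here the application does its own buffer management; the
	 * kernel is out of the data path for ordinary loads and stores. */
	munmap(base, len);
	close(fd);
	return 0;
}
```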

Rik van Riel said that there is some existing code to direct applications to certain types of memory. What is missing, though, is a way to evict pages from faster memory back to slower memory. Williams said that persistent memory can have swap memory associated with it, which might form the basis of a rudimentary migration strategy. But it seemed to him that what was really needed was to send some patches for discussion.

The memory access patterns will need to be tracked for different regions, Van Riel said, so that memory management can make decisions on migration and placement. There is some information available from the CPU performance counters, but that will not cover memory accesses, so there will need to be a way to track processes that are using the wrong type of memory.

There is also a need for a unified way for different architectures to describe this memory to user space. But Packard wondered if it would make sense to wait and see what the applications actually need. For now, he plans to simply expose the hardware and let the application developers figure out what more they need.

Comments (none posted)

VFS parallel lookups

By Jake Edge
April 27, 2016

LSFMM 2016

In one of just a handful of filesystem-only sessions at the 2016 Linux Storage, Filesystem, and Memory-Management Summit, Al Viro reported on work he has done to allow VFS lookups to proceed in parallel. Today, all directory operations are done with the inode mutex (i_mutex) held, which prevents anything else from touching that directory. But the most common operation, lookup, is non-destructive, so there is no real conceptual reason to stop it from happening in parallel. Solving that scalability problem took some work, though, as Viro described.

[Al Viro]

The obvious choice to replace the mutex is a read/write semaphore (rwsem), Viro said. But there is a problem: the mutex currently protects the directory entry (dentry). A lookup operation can cause dentries to be created, which can lead to races if two dentries are created for the same name. Unwinding that took some effort, he said. There is a need to ensure that there are never two hashed dentries with the same parent and name at the same time. If that were to happen, subsequent lookups would only find one or the other, which must be avoided. If two lookups on the same name in the same directory run in parallel, there is a danger that these two dentries would be created.

There was a need for an object that would be used to indicate that a lookup was in progress for a given parent and name. When a dentry is not found in the dentry cache, a new dentry is created to be passed to the filesystem lookup() function. That dentry is the obvious place to track a lookup in progress for the given parent and name. There are some fields that are unused at that point, so they can be repurposed for lookup tracking.

These in-progress lookup dentries are tracked in a hash on the parent. That hash can't be bigger than the number of in-progress lookups for that directory. If a lookup finds an entry on the parent's hash for the same name, it simply waits until the earlier lookup is done. So there are no parallel lookups for the same parent/name combination.
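
The waiting scheme can be illustrated with a small user-space toy; this is emphatically not the kernel code, just a sketch of the "one in-flight lookup per parent and name" idea using a fixed-size table, a mutex, and a condition variable.

```c
#include <pthread.h>
#include <stdbool.h>
#include <string.h>

#define MAX_IN_FLIGHT 64	/* table-overflow handling is omitted */

struct in_flight {
	bool used;
	const void *parent;
	char name[256];
};

static struct in_flight table[MAX_IN_FLIGHT];
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t done = PTHREAD_COND_INITIALIZER;

static bool in_progress(const void *parent, const char *name)
{
	for (int i = 0; i < MAX_IN_FLIGHT; i++)
		if (table[i].used && table[i].parent == parent &&
		    strcmp(table[i].name, name) == 0)
			return true;
	return false;
}

/* Begin a lookup: wait while another lookup for the same name is running. */
static void lookup_begin(const void *parent, const char *name)
{
	pthread_mutex_lock(&lock);
	while (in_progress(parent, name))
		pthread_cond_wait(&done, &lock);
	for (int i = 0; i < MAX_IN_FLIGHT; i++) {
		if (!table[i].used) {
			table[i].used = true;
			table[i].parent = parent;
			strncpy(table[i].name, name, sizeof(table[i].name) - 1);
			break;
		}
	}
	pthread_mutex_unlock(&lock);
}

/* Finish a lookup: drop the entry and wake any waiters for that name. */
static void lookup_end(const void *parent, const char *name)
{
	pthread_mutex_lock(&lock);
	for (int i = 0; i < MAX_IN_FLIGHT; i++) {
		if (table[i].used && table[i].parent == parent &&
		    strcmp(table[i].name, name) == 0) {
			table[i].used = false;
			break;
		}
	}
	pthread_cond_broadcast(&done);
	pthread_mutex_unlock(&lock);
}
```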

His "lookups" branch that implements parallel lookups "actually works", Viro said. i_mutex is replaced with i_rwsem and lookups are done using that shared lock.

For readdir(), a different choice was made. Because there is state associated with readdir() (i.e. directory position), it doesn't really make sense to allow two threads to be calling getdents() on the same directory file descriptor in parallel. The struct file that represents the open directory file has a lock that prevents read() and lseek() from happening in parallel; it is used to prevent parallel readdir()/getdents() calls.

One problem that he ran into when converting to an rwsem is a lack of "killable" variants of the semaphore primitives (like down_write()). There is a patch series floating around that adds down_write_killable(), but it has not stabilized yet so, for now, he replaced mutex_lock_killable() calls with down_write(), which is fine for testing purposes.

Jan Kara asked about the performance of the semaphore. Viro said that there have been no performance regressions that he has seen on the tests he runs. But read/write semaphores are a bit costlier. Kara was concerned that all lookups are paying the cost of the semaphore, when only some lookups get the benefit of parallelism. Hugh Dickins said that a lot of effort has been put into improving the performance of the semaphore, so the differences should be minimal.

Comments (none posted)

Block and filesystem interfaces

By Jake Edge
April 26, 2016

LSFMM 2016

Steven Whitehouse led a discussion about the interfaces between the block subsystem and filesystems in a combined storage and filesystem session at the 2016 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM). In some ways, the discussion was a catch-all for topics that were not slated for their own session during the two-day summit.

For a long time, the block interface was not that complicated, Whitehouse said. But, over the last few years, there have been more and more device types (both real and virtual), more filesystem choices, as well as features like encryption, compression, copy on write, copy offload, and so on. All of these have an impact on the block and filesystem interfaces.

[Steven Whitehouse]

He believes that getting the interfaces right should be the primary consideration that will ensure maximum interoperability. It allows modularization, which is more than simply a good engineering principle; things that don't work can simply be replaced.

One of the changes that has come about is "dynamic devices", which are devices that can change their attributes after they have been mounted. Thin provisioning is a good example of that, he said. For example, the device mapper thin-provisioning (dm-thin) module needs a way for filesystems to communicate their requirements for space that has been allocated, but not written to, so that dm-thin can arrange to have that storage present (which might require operator intervention). Otherwise, operations that should always succeed (because of an earlier fallocate(), say) might return ENOSPC.
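
The guarantee at issue is the ordinary fallocate() contract, shown in the hedged sketch below (the mount point is illustrative): once the allocation succeeds, a later write into that range is not supposed to fail with ENOSPC, which is exactly what a thin-provisioned device can break if the underlying pool runs dry.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* The path is illustrative: a file on a thinly provisioned volume. */
	int fd = open("/mnt/thin/data", O_RDWR | O_CREAT, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Ask the filesystem to set aside 1GB of space up front. */
	if (fallocate(fd, 0, 0, 1UL << 30) < 0) {
		perror("fallocate");
		return 1;
	}

	/* This write lands inside the preallocated range, so the
	 * application expects never to see ENOSPC here. */
	char buf[4096];
	memset(buf, 0xab, sizeof(buf));
	if (pwrite(fd, buf, sizeof(buf), 0) < 0)
		perror("pwrite");

	close(fd);
	return 0;
}
```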

Jan Kara said that there had been talk of an interface to provide dm-thin with the information it needs, but it might be better to use notifiers. Al Viro agreed and said that notifiers would also provide a more natural way to inform the other layers that the device topology had changed in some way.

Mike Snitzer said that XFS is now working with dm-thin using a new reservation interface to ask that some space be set aside for the filesystem to use. It is nice to have that software interface, but he wondered if there was an equivalent way to reserve space on the hardware. Martin Petersen said that there is an "anchor" facility that can guarantee that certain logical block addresses (LBAs) are available for writes in the future.

That led to a quick discussion of what is really needed. Snitzer said that specific LBA numbers are not important, as he is just looking for a reservation of a certain amount of space. Kara said that the LBAs are not known when writes go into the page cache, but the amount of space required is known. But reserving blocks on the device can lead to other problems. Fred Knight asked, when does the storage device know that it can release those blocks if the OS or application crashes? There was a thought that perhaps this could be combined with streams (as was discussed in the standards update earlier in the day) and a timeout, which Knight said had been talked about in the standards bodies along the way.

Whitehouse then moved to another topic: error reporting. In particular, what changes might be needed to support thin provisioning and other new types of devices. There is a need to inform other layers of changes in the topology of the storage, but also to report when certain kinds of operations are not supported, such as shrinking a filesystem while it is mounted. There was general agreement that work was needed in this area, but no concrete plans emerged.

Shingled magnetic recording (SMR) was up next. Whitehouse noted that it was the only new device type that didn't have its own session at LSFMM. Hannes Reinecke said that he didn't want to bore the attendees with yet another SMR session. He has posted patches that add a red-black tree to the request queue to track the write pointer in the various zones of the device. There have been no comments, "so it must be OK", he said.

Reinecke mentioned that those patches also map the "reset write pointer" command to the existing discard functionality, since they are closely related. Whitehouse wondered about that because discard is more of a hint than a command. Petersen noted that currently the ioctl() commands are implemented directly as calls to the block library functions, but that should change. More specific building blocks should be used to implement those ioctl() commands.

Ted Ts'o brought up a problem he had run into on mobile phones with inline cryptographic processors. Those processors keep the actual key information internally; the kernel just gets a key ID that it uses to decrypt a block device. Device mapper does not currently support that, but he would like to implement the functionality for the mainline kernel in a cleaner way than was done for the phone.

Snitzer wondered why he had never heard of the problem, since he is a device mapper maintainer. Ts'o said that it was something that was done quickly internally, but he hadn't gotten around to filing bugs and the like. He suggested that perhaps the data integrity (DIF/DIX) support in the device mapper might be used instead, but was concerned that the device mapper support might be lacking.

Darrick Wong said that device mapper does work with DIF/DIX, so it should in theory be possible to use it. He did not know how much "rigorous testing" it had seen, however. Snitzer said that he thought device mapper had "a pretty good story" with respect to using DIF/DIX and stacking protection profiles.

Ts'o did mention another problem, though. Getting a key ID from the inline processor would involve an ARM TrustZone operation that could be slow, since it might require getting a PIN from a user. That part shouldn't go into device mapper, he said.

As time for the session expired, Whitehouse asked about the status of copy offload. Petersen said that the support for token-based copies is ready. It is awaiting some block layer cleanups from Mike Christie, but is in "pretty good shape" overall. He said the copy_file_range() interface will be available in the 4.6 kernel.

Comments (1 posted)

DAX, mmap(), and a "go faster" flag

By Jake Edge
April 26, 2016

LSFMM 2016

At the 2016 Linux Storage, Filesystem, and Memory-Management Summit, Dan Williams led a combined storage and filesystem session he jokingly called "mmap() ponies" (referring to the O_PONIES debate from 2009). The discussion was about the "I know what I'm doing" flag for mmap() that would allow user space to manage its dirty cache lines when persistent memory and the DAX direct-access mechanism are being used. The overhead of the kernel tracking those dirty cache lines in the page cache can be avoided, but many saw it as a premature optimization.

The flag in question is actually called MAP_PMEM_AWARE, but it certainly acts like its longer name would imply. Williams acknowledged that it breaks POSIX semantics and noted that Dave Chinner was strongly against it. It is an attempt to "have our cake and eat it too", he said.

[Dan Williams]

Jan Kara called it a "go faster" flag, but claimed that most of the overhead would still be present. The tracking would be needed to avoid races between faults for regular pages and huge pages. So the flag really won't even go that much faster.

Williams said that one of Chinner's complaints was that there was no reason to have a "go faster" flag when we don't know how slow it goes now. But it is roughly ten times faster to simply flush the cache lines from user space, as opposed to calling msync() to flush a 4KB page when data in that range is dirtied. The difference is in the granularity that is being flushed, he said: 64 bytes for a cache line versus 4KB for a page.
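
A hedged sketch of the user-space flushing being compared against msync() might look like this on x86; it assumes a DAX mapping and uses clflush for simplicity, where real code would prefer clwb or clflushopt when the CPU supports them.

```c
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>		/* _mm_clflush, _mm_sfence */

#define CACHELINE 64

/* Flush only the dirtied 64-byte lines of a DAX mapping, instead of
 * calling msync() on a whole page. */
static void flush_range(const void *dst, size_t len)
{
	uintptr_t p = (uintptr_t)dst & ~(uintptr_t)(CACHELINE - 1);
	uintptr_t end = (uintptr_t)dst + len;

	for (; p < end; p += CACHELINE)
		_mm_clflush((const void *)p);	/* write back + invalidate one line */
	_mm_sfence();				/* order the flushes before continuing */
}
```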

As several pointed out, though, pages can be larger than that, including 2MB huge pages or 1GB extra-huge pages. Kara also said that the PPC architecture has 64KB pages, so it is tracking on that granularity. The problem, he said, is that user space believes that flushing the cache lines is enough to ensure data integrity, but it isn't. There is metadata of various sorts that filesystems need to persistently store, which requires an fsync() or similar.

Williams acknowledged that managing the dirty data in user space without calling fsync() will not allow filesystems to do copy-on-write (CoW) or "other fancy stuff" behind the application's back. But Chris Mason pointed out that the filesystem doesn't know when the application is done with the data; the filesystem defines when the read-only phase of its data begins and ends, so without an fsync() it doesn't know when to write its metadata.

Williams suggested that the DAX semantics could be changed so that all faults are synchronous with respect to the metadata. It would be "less surprising" if that were simply a property of DAX, he said. Kara said that he still thinks it is a premature optimization, but that making faults synchronous with metadata updates is probably the right way forward.

The flag may still be useful, Williams said, as it does mean that the application knows that it is bypassing sync operations. But he disagreed with the "premature optimization" characterization. Some data that he received just before the session started showed the 10x performance difference he had mentioned earlier. The problem is that if a customer asks "what happens if I don't call fsync()?", the answer from filesystem developers will be that their data is not guaranteed to be persistent. That is what the flag would allow.

Kara suggested that the dirty data could still be tracked in the kernel, but that information simply wouldn't be used. Williams said that there is "no real cost" to doing that tracking, so that would probably be fine.

Ted Ts'o suggested a flag to msync() just for metadata flushes as an approach. Or perhaps an fmetadatasync() or similar system call, that would simply sync all applicable filesystem metadata for a range—trusting user space to flush its data cache lines correctly. Kara said that it might make it difficult to determine what caused data-loss problems, but Ts'o said that if data is lost, it is an application problem, but if the filesystem is corrupted, that would indicate a filesystem bug.

Kara agreed and thought that fmetadatasync() would both be more future-proof and more in line with what filesystem developers would prefer. It might not perform all that well for small updates, but should be fine for large operations, he said. Ts'o cautioned that it is still early, so kernel developers really do not know how applications will want to use persistent memory. As time goes on, new or different interfaces may be needed.

Comments (none posted)

Partial drive depopulation

By Jake Edge
April 27, 2016

LSFMM 2016

With today's large storage devices there are times when a component of the drive will fail (e.g. a head in a disk or a die in an SSD), which reduces the capacity of the device without rendering it completely unusable. But the arrangement of logical block addresses (LBAs) on the devices is such that the non-functioning LBAs are scattered across the device's address space. There is a need to "depopulate" (or "depop") those LBAs so that the rest of the device can continue to be used. Hannes Reinecke and Damien Le Moal led a combined storage and filesystem session at the 2016 Linux Storage, Filesystem, and Memory-Management Summit to discuss depop and how it should be handled by the kernel.

[Hannes Reinecke & Damien Le Moal]

Le Moal began by outlining the problem, noting that there are several types of components (head, surface, die, channel) that can go bad in a device without taking the entire device with them. The device will report the problem with a "unit attention" condition. One way to handle that is with offline logical depop, where the drive is simply reformatted to the new, smaller capacity. Reinecke said that would "not require a lot of work" to handle.

The question of recovering data from the good portion of the device prior to reformatting came up. Ted Ts'o asked if there would be a list of bad sectors delivered to the kernel. Le Moal said there was a way for the host to get that list, but James Bottomley thought that sounded like an "awful lot of data to store in the kernel". For offline depop, though, the data would not need to be stored, Le Moal said.

It is a large list, Fred Knight said, as the bad sectors are likely to be spread across the LBA range. Christoph Hellwig called the list "useless" to the kernel, but Knight said that if it was just needed for recovering the good data, the block list need not be stored. The problem is that disks are not uniform in the number of sectors per track across the drive and bad-block remapping can also complicate things.

The discussion then turned to online logical depop, where the idea is to try to avoid reformatting the drive. The healthy LBAs would be kept intact, which would leave holes in the LBA space. The holes could be "amputated", removing them from the LBA range and never using them again. Or the blocks could be "regenerated" by allocating other blocks and remapping them into the holes.

All of that seemed "overly complicated" to Ric Wheeler. He suggested that users would simply regenerate the filesystem from backups rather than fix the holes. They would truncate the size of the device and reformat it to get it back into production. The data still on the platters would just be ignored.

Chris Mason agreed that users are likely to take the drive out of production, truncate and reformat it, then put it back. "Healing" drives is not an online process, he said. Wheeler said that he thought any work on online depop was likely a waste of time.

But Knight said that a failure that only affected 10% of the drive would only take 10% of the time to rebuild, which might be attractive in some cases. Mason, though, felt that most would want some kind of verification step before bringing a partially failing drive back online. It may be true that it is simply one component that has failed, but that isn't truly known until the drive is examined and tested. Failing to do that could result in a "bunch of borderline stuff" running in production, he said.

Bottomley and Martin Petersen both said that a large discontiguous LBA range was not really usable. Wheeler summed up the feeling in the room by saying that offline depop is something that can be supported, but that unless the LBA regions were large or computable, they were not something that the kernel developers would use; "scatter-gather lists of LBAs" are not helpful.

Comments (2 posted)

fallocate() and the block layer

By Jake Edge
April 27, 2016

LSFMM 2016

In a session he dubbed "block device fallocate() bikeshedding", Darrick Wong led a discussion on some recent ideas on moving some functionality from ioctl() commands to a higher level in the stack. The session was in a combined filesystem and storage track session at the 2016 Linux Storage, Filesystem, and Memory-Management Summit.

There are some block-layer ioctl() commands that could be considered as candidates for changing to fallocate() flags. For example, he had proposed a BLKZEROOUT2 ioctl() command to provide a way for user space to access the zeroing facility in the block layer. In the discussion on the mailing list, others, including Linus Torvalds, thought that it made more sense to use the FALLOC_FL_PUNCH_HOLE and FALLOC_FL_ZERO_RANGE flags to fallocate() to do that.

Wong has implemented those changes, but wondered about the alignment requirements. His patches currently require that the ranges specified are aligned with the 512-byte logical block. That avoids the complexity of manually zeroing out both ends, while punching out multi-block holes in the middle. Ted Ts'o suggested that, for simplicity, non-aligned ranges should simply get EINVAL.
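
From user space, the resulting interface might look like the sketch below. It depends on Wong's proposed patches for block-device support (regular files already accept these flags), and the device name is only a placeholder.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/dev/sdX", O_RDWR);	/* placeholder device name */
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Zero a 1MB range starting at 4MB; with the proposed patches the
	 * block layer can use its zeroing facility instead of writing
	 * zeroes from user space. Offsets and lengths are 512-byte aligned. */
	if (fallocate(fd, FALLOC_FL_ZERO_RANGE, 4 << 20, 1 << 20) < 0)
		perror("FALLOC_FL_ZERO_RANGE");

	/* Discard (punch) the same range; KEEP_SIZE is required with PUNCH_HOLE. */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      4 << 20, 1 << 20) < 0)
		perror("FALLOC_FL_PUNCH_HOLE");

	close(fd);
	return 0;
}
```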

The conversation then turned to thin provisioning (dm-thinp) in the context of the out-of-tree FALLOC_FL_NO_HIDE_STALE functionality. That flag will cause space to be allocated, but that space will not be zeroed before being made available to user space. So it may return stale data in those blocks, which is generally considered to be a security problem. Dm-thinp needs to allocate space from its pool, but those blocks could still have stale data. By default, dm-thinp will zero out any reads from those regions until they have been written and will zero out the rest of the block if a write is smaller than a block.

But, more often than not, that zeroing is disabled in dm-thinp by users because it is not needed, since some filesystems (e.g. XFS) already handle getting stale data from the block layer. Wong asked if that "no hide stale" functionality would be useful to in-kernel callers, so that no extra zeroing was done for them. Chris Mason said that many filesystems already expect garbage from the drives, so it would be a surprise to get zeroes. It would make sense to have a way to get zeroes, but Btrfs and others don't really need it.

A problem with fallocate() is that it might take a long time to complete for certain kinds of operations, Wong said. Since using it is supposed to ensure that user space will never receive an ENOSPC error when operating on the file, copy-on-write (or reflink()-supporting) filesystems need to "unshare" the blocks of a file. For a 1TB file, that effectively means a 1TB copy operation, but not doing so would mean that "the thing that is not supposed to happen, happens".

So, Wong asked, should there be an fallocate() flag for "do expensive operations" to avoid violating users' expectations? But David Howells asked in return: is fallocate() supposed to be fast or is it intended to ensure that ENOSPC doesn't happen? Christoph Hellwig said that there are already instances where fallocate() is not fast. The questions remained unresolved and the session wound down.

Comments (none posted)

Using the multiqueue block subsystem by default

By Jake Edge
April 27, 2016

LSFMM 2016

The multiqueue block layer has been a part of the kernel for a few years now, and it works reliably, Bart Van Assche said to start a session he led at the 2016 Linux Storage, Filesystem, and Memory-Management Summit. He wondered if it was time to make multiqueue the default with an eye toward eventually removing the single-queue API.

[Bart Van Assche]

The SCSI multiqueue (scsi-mq) and device mapper multiqueue (dm-mq) drivers have to be explicitly enabled currently, but there is a long list of block and SCSI devices that support multiqueue at this point. The challenge, Van Assche said, is that both the single and multiqueue paths have to be maintained and each needs to be tested when changes are made.

He suggested that the first step toward eliminating the single-queue code would be to enable multiqueue by default for SCSI devices, but there are some missing pieces that would need to be filled in. I/O accounting support would need to be added, he said, but Jens Axboe disagreed, saying that the accounting support was present. A bigger missing piece is I/O scheduling for block multiqueue, he said.

Evolving to a single code path will be difficult because the two paths are significantly different, Van Assche said. Hannes Reinecke said that it would require all drivers to support multiqueue, which will be hard. There is a need to document which block-layer functions do not work with multiqueue—effectively, which functions are safe to use with multiqueue and which are not. Van Assche suggested that some kind of tool might help with the conversion.

Reinecke asked if there were all that many drivers that had not been converted yet. Christoph Hellwig said that there was some "weird stuff" in the block directory that hasn't been converted. After some quick analysis of the tree, he said that there were 38 drivers that still needed conversion. It will be hard to test the conversion of some of them, since they are for hard-to-obtain hardware and the like.

In the near term, there is a question of how to handle multiqueue, Reinecke said. If there is a global switch that chooses between multiqueue and single queue, it won't be good for legacy devices. Distributions can either disable multiqueue and lose out on its performance gains, or enable it and get bad performance from legacy devices, he said.

The main problem seems to be a lack of scheduling for multiqueue, but Axboe said that it is actively being worked on. A patch that would select between single and multiqueue for different devices "on the fly" is possible, he said, and could be used as a near-term fix.

The biggest problem, Axboe said, is for systems with regular SATA drives, where the read latency goes "through the roof" with multiqueue. That is likely a problem that writeback throttling would solve. Once those patches get in, scheduling will not be as important, he said. Multiqueue is not only for fast devices, so getting I/O scheduling or throttling in will be likely to remove the barriers for multiqueue on legacy devices.

Comments (none posted)

Bulk memory-allocation APIs

By Jonathan Corbet
April 23, 2016

LSFMM 2016
Jesper Dangaard Brouer has been working on improving networking performance for some time. When one is trying to process packets with a tight per-packet time budget, every bit of overhead counts, and the overhead of the memory allocator turns out to count quite a bit. At the 2016 Linux Filesystem, Storage, and Memory-Management Summit, he presented some ideas he had for reducing that overhead.

One area of trouble is the DMA API, especially on PowerPC systems. There, the mapping and unmapping operations are expensive, and pages that have been mapped for DMA must be considered read-only by the networking code. Unfortunately, those pages must be written at times (to change an IP address, for example), leading to expensive unmapping operations that, perhaps, can even be destructive to the data in the buffer.

[Jesper Brouer]

Jesper suggested that, instead, a set of pages could be kept permanently mapped to the device and recycled when mapping requests are made. The pages could remain mapped, with the dma_sync_* functions used to control whether the device or the CPU "owns" the pages at any given time. That would cut out a lot of the overhead and speed up packet processing. One little detail is that it would require space in the page structure to indicate which device a page is dedicated to; space in that structure tends to be in short supply.
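
A simplified driver-side sketch of that recycling pattern, using the standard DMA API calls named above, might look like this; the surrounding structure and helper names are illustrative, not from any particular driver.

```c
#include <linux/dma-mapping.h>
#include <linux/gfp.h>
#include <linux/mm.h>

struct rx_page {
	struct page *page;
	dma_addr_t dma;
};

/* Map a receive page once, at setup time. */
static int rx_page_map(struct device *dev, struct rx_page *rp)
{
	rp->page = alloc_page(GFP_ATOMIC);
	if (!rp->page)
		return -ENOMEM;
	rp->dma = dma_map_page(dev, rp->page, 0, PAGE_SIZE, DMA_FROM_DEVICE);
	return dma_mapping_error(dev, rp->dma) ? -EIO : 0;
}

/* Packet arrived: hand the page to the CPU without unmapping it. */
static void rx_page_to_cpu(struct device *dev, struct rx_page *rp)
{
	dma_sync_single_for_cpu(dev, rp->dma, PAGE_SIZE, DMA_FROM_DEVICE);
}

/* Done with it: give the same mapped page back to the device for reuse. */
static void rx_page_to_device(struct device *dev, struct rx_page *rp)
{
	dma_sync_single_for_device(dev, rp->dma, PAGE_SIZE, DMA_FROM_DEVICE);
}
```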

On x86 systems, DMA is not usually the problem; instead, memory allocation is. He is working with a target of 14.8 million packets per second (full wire speed on a 10Gb/s link); on a 3GHz system, that gives the kernel about 200 cycles in which to process each packet. Allocating a single page, though, takes 277 cycles on that system, making the 14.8Mpps rate unattainable. He pointed out Mel Gorman's recent work to reduce memory-allocator overhead; that work reduced the cost to 230 cycles — a significant improvement, but nowhere near enough.

The DPDK user-space networking code can achieve the 14.8Mpps rate, he said, so it is clear that the kernel is doing something wrong and not using the hardware optimally. After two years of optimization work, he has managed to double the kernel's processing rate to about 2Mpps, which, while being a step in the right direction, is far from the target.

Drivers tend to work around the problem by allocating large pages and splitting them into pieces. An order-3 (eight-page) chunk can be allocated in 503 cycles, which is more than the single-page cost, but, when split into 4KB chunks, the per-page cost drops to 62 cycles. That clearly helps to reach the time budgets, but it has the disadvantage of pinning down large chunks of memory if some piece of code hangs on to just one of the component pages. That, in turn, can push the system as a whole into an out-of-memory situation, which can also adversely affect packet rates (among other things). For this reason, Google turns this mechanism off internally.

Wouldn't it be better, he said, to just have a bulk page-allocation API that would return multiple pages without the need to allocate them as a single large page? Mel Gorman answered that, in the end, it was just a matter of coding to make that idea work. The page allocator already has the bulking concept internally, there just has been no reason to expose it to the rest of the kernel before. There should be no fundamental problem with doing so, he said. In general, there has not been a huge emphasis on page-allocation performance optimization because most users immediately zero each page after allocating it. The cost of clearing the page's contents overwhelms the cost of allocating the page in the first place.
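
For illustration only, a bulk interface along the lines being discussed might look like the following; alloc_pages_bulk() as written here is purely hypothetical and did not exist at the time, it merely shows the shape of the request.

```c
#include <linux/gfp.h>
#include <linux/mm.h>

/* Hypothetical: fill 'pages' with up to 'count' single pages and return
 * the number actually allocated, amortizing the allocator's fixed costs. */
unsigned int alloc_pages_bulk(gfp_t gfp, unsigned int count,
			      struct page **pages);

static int refill_rx_ring(struct page **ring, unsigned int needed)
{
	unsigned int got = alloc_pages_bulk(GFP_ATOMIC, needed, ring);

	/* Fall back to single-page allocations for any shortfall. */
	for (; got < needed; got++) {
		ring[got] = alloc_page(GFP_ATOMIC);
		if (!ring[got])
			return -ENOMEM;
	}
	return 0;
}
```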

What worries Mel, though, is the idea of implementing an entirely new memory allocator. Eventually that allocator would have to gain features like NUMA awareness and would pick up "all the old mistakes," at which point it would be as complex and slow as what we have now. It would be better, he said, to use the existing per-CPU allocator and optimize it for this case; then all users in the kernel would benefit. He has a patch now that can halve the overhead of the page allocator if "certain assumptions are made"; in particular, the user needs to not care about NUMA placement and there should be a fair amount of memory available in general.

Jesper returned to the idea of a "page pool" that would be used to quickly satisfy requests from network drivers (or other time-sensitive users). This pool would be small; the number of pages required would be about equal to the sum of the sizes of the receive ring and the transmit queue for each device. It is important to bound the size of the pool, or a persistent traffic overload can run the system out of memory. To that end, he said, the page allocator should be able to shrink the page pool when memory is tight.

He saw two ways of creating this pool. One, that he called "all in," would be to write an entirely new slab-like allocator. The alternative is to wrap some sort of layer around the existing page allocator. Predictably, Mel was not fond of the "all in" strategy; he said we already have too many allocators and adding another will just create new problems. It would be better to add a bulk interface to the current allocator which, he repeated, could be made to be much faster in some settings.

The direction for this work thus seems clear, but the challenges to overcome are significant. It may be a little while yet before a stock kernel can hit that 14.8Mpps networking target.

Comments (2 posted)

CMA and compaction

By Jonathan Corbet
April 23, 2016

LSFMM 2016
The nice thing about virtual-memory systems is that the physical placement of memory does not matter — most of the time. There are situations, though, where physically contiguous memory is needed; operating systems often struggle to satisfy that need. At the 2016 Linux Storage, Filesystem, and Memory-Management Summit, two brief sessions discussed issues relating to a pair of techniques used to ensure access to physically contiguous memory: the contiguous memory allocator (CMA) and compaction.

CMA troubles

Aneesh Kumar started off the CMA session by bringing up a problem that comes up when running virtualized guests under KVM on PowerPC systems. The hardware page table for such guests must be stored in a large, contiguous memory region, which can be hard to come by after the system has been running for a while and memory has become fragmented. The solution is to use CMA, which reserves a region of memory for these allocations, to allocate this page table. But things don't work quite as desired.

One problem is that, if the kernel is doing a lot of movable allocations (those that can be relocated if need be), the kernel will go into reclaim far earlier than it should. By the design of CMA, the kernel should be able to obtain movable allocations within the CMA area, since they can be moved out should a need for a large contiguous area arise. The kernel tends to avoid the CMA area, though, leading to situations where the system behaves like it's out of memory while much of the CMA area remains free.

The ZONE_CMA patches are meant to address this problem. They put the CMA area into a separate memory zone that is available for movable allocations. But, Aneesh reported, using ZONE_CMA just replaces one set of problems with another. Allegedly movable allocations become pinned in place when an application places them under I/O or locks them with mlock(). The compaction code will not relocate compound pages (transparent huge pages, for example). The result is that the CMA area becomes unfixably fragmented and CMA allocations fail, defeating the original purpose. So users like Aneesh are left wondering what other approaches they can try.

Mel Gorman came in at this point with a bit of a lecture; according to him, the ZONE_CMA approach is simply not acceptable. Memory zones exist to deal with addressing limitations — whether the memory can be used for DMA, for example — and should not be used for other purposes. Like ZONE_MOVABLE, which is in the kernel now, ZONE_CMA just brings a new set of problems with it. ZONE_MOVABLE was a mistake, he said, one which should not be repeated here.

The better solution, he said, would be to migrate pages out of the CMA area prior to pinning them. In addition, page blocks (large groups of pages used to try to keep similar allocation types together) could gain a sticky MIGRATE_MOVABLE bit that would prevent nonmovable allocations from being performed there. Finally, if problems remain, the lumpy reclaim mechanism should be brought back to help clean up the mess. There was some talk about the details, but it seemed to be generally agreed that this was the direction to go to improve the interaction between CMA and the rest of the memory-management subsystem.

Compaction

"Compaction" is the process of shifting pages of memory around to create contiguous areas of free memory. It helps the system's ability to satisfy higher-order allocations, and is crucial for the proper functioning of the transparent huge pages (THP) mechanism. Vlastimil Babka started off the session on compaction by noting that it is not invoked by default for THP allocations, making those allocations harder to satisfy. That led to some discussion of just where compaction should be done.

One option is the khugepaged thread, whose job is to collapse sets of small pages into huge pages. It might do some compaction on its own, but it can be disabled, which would disable compaction as well. Thus, khugepaged cannot guarantee that background compaction will be done. The kswapd thread is another possibility, but Rik van Riel pointed out that it tends to be slow for this purpose, and it can get stuck in a shrinker somewhere waiting for a lock. Another possibility, perhaps the best one, is a separate kcompactd thread dedicated to this particular task.

Michal Hocko said that he ran into compaction problems while working on the out-of-memory detection problem. He found that the compaction code is hard to get useful feedback from; it "does random things and returns random information." It has no notion of costly allocations, and makes decisions that are hard to understand.

Part of the problem, he said, is that compaction was implemented for the THP problem and is focused a little too strongly there. THP requires order-9 (i.e. "huge") pages; if the compaction code cannot create such a page in a given area, it just gives up. The system needs contiguous allocations of smaller sizes, down to the order-2 (four-page) allocations needed for fork() to work, but the compaction code doesn't care about creating contiguous chunks of that size. A similar problem comes from the "skip" bits used to mark blocks of memory that have proved resistant to compaction. They are an optimization meant to head off fruitless attempts at compaction, but they also prevent successful, smaller-scale compaction. Hacking the compaction code to ignore the skip bits leads to better results overall.

Along the same lines, compaction doesn't even try with page blocks that hold unmovable allocations. As Mel pointed out, that was the right decision for THP, since a huge page cannot be constructed from such a block, but it's the wrong thing to do for smaller allocations. It might be better, he said, for the compaction code to just scan all of memory and do the best it can.

There was some talk of adding flexibility to the compaction code so that it will be better suited for more use cases. If the system is trying to obtain huge pages for THP, compaction should not try too hard or do anything too expensive. But if there is a need for order-2 blocks to keep things running, compaction should try a lot harder. One option here would be to have a set of flags describing what the compaction code is allowed to do, much like the "GFP flags" used for memory allocation requests. The alternative, which seemed to be more popular, is to have a single "priority" level controlling compaction behavior.

The final topic of discussion was the process of finding target pages when compaction decides to migrate a page that is in the way. The current compaction code works from both ends of a range of memory toward the middle, trying to accumulate free pages at one end by migrating pages to the other end. But it seems that, in some settings, scanning for the target pages takes too long; it was suggested that, maybe, those pages should just come from the free list instead. Mel worried, though, that such a scheme could result in two threads doing compaction just moving the same pages back and forth; the two-scanner approach was designed to avoid that. There was some talk of marking specific blocks as migration targets, but it is not clear that work in this area will be pursued.

Comments (none posted)

Virtual machines as containers

By Jonathan Corbet
April 23, 2016

LSFMM 2016
Containers and virtualization are two distinct mechanisms for sharing a physical host across multiple tenants. Containers tend to be more resource-efficient than virtualization, but virtual machines can provide stronger isolation. Rik van Riel started a memory-management track session at the 2016 Linux Storage, Filesystem, and Memory-Management Summit by stating that there is an increasing level of interest in using virtual machines as if they were containers. One problem that results is that each virtual machine (VM) does its own caching, and, if left to its own devices, will fill its memory with cached data. That results in systems using much more memory than they really need, and reduces the number of VMs that can be packed into the host.

[Rik van Riel]

A longstanding approach to this problem is balloon drivers, which will "expand" by allocating memory from the guest and returning it to the host system. Ballooning is effective for extracting memory from guests, but it doesn't answer one important question: when should this be done? Despite years of experience with virtualization, we don't really know how to do this sort of memory balancing.

James Bottomley suggested that it might be a good idea to use paravirtualization to move some memory-management decisions from the guest to the host. The Clear Containers project, for example, is using the DAX mechanism — implemented to allow direct access to file data stored in persistent memory — to share file pages with the host. That works well, though sharing of anonymous pages would be harder. Perhaps the guest could share its LRU list with the host; the host could then see what the guest is trying to do and make more intelligent memory-balancing decisions.

It should be possible to share all cached file data across the guests and the host if we had a paravirtualized page cache, James said: "how hard can it be?"

Even if page caching is moved out of guests, though, there would still need to be a way to put memory pressure on guests. Other caches, such as the inode and dentry caches, could still expand to fill all available memory. So the need for a way to quantify memory pressure and communicate it between the host and the guests does not go away. As the session wound down, it was agreed that there were some interesting ideas in play. How soon those ideas will be turned into code remains to be seen, though.

Comments (2 posted)

Partial address-space mirroring

By Jonathan Corbet
April 27, 2016

LSFMM 2016
A feature found in some systems designed for high availability is memory mirroring: providing two copies of data stored in main memory so that said data can be recovered should something happen to one of the copies. But, as Taku Izumi noted during the memory-management track of the 2016 Linux Storage, Filesystem, and Memory-Management Summit, as memory sizes grow larger, the cost of providing that mirror grows as well. So there is interest in building systems that only mirror part of the physical address space. It quickly became clear, though, that the memory-management developers have strong doubts about the wisdom of such an arrangement.

[Taku Izumi]

This feature, Taku said, is managed by the low-level BIOS; it can be configured there or by using the efibootmgr command. The amount of memory to be mirrored can be set there, and read from the EFI memory map by the kernel at boot time. Unlike fully mirrored memory, partially mirrored memory requires support from the kernel. The idea is to improve fault tolerance by using mirrored memory for the kernel and its data, while putting user space in single-copy memory. Everything that is not mirrored would be placed into the ZONE_MOVABLE zone, so that kernel memory would not be allocated there.

By default, in such a system, one would want user-space memory to be kept out of ZONE_NORMAL, since that's where the mirrored memory is. To that end, a new __GFP_NONMIRROR allocation flag would be added; it would be part of GFP_HIGHUSER_MOVABLE. But, occasionally, there might be critical user data that should go into mirrored memory. That could be obtained via a new MADV_MIRROR flag to the madvise() system call.
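
A hedged sketch of the proposed user-space side follows; MADV_MIRROR is not in mainline, so the flag value used here is a placeholder for the proposed interface.

```c
#include <stdio.h>
#include <sys/mman.h>

#ifndef MADV_MIRROR
#define MADV_MIRROR 70		/* placeholder value for the proposed flag */
#endif

int main(void)
{
	size_t len = 16 << 20;	/* 16MB of "critical" application data */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* With the proposed patches, this would request placement in the
	 * mirrored range rather than in ZONE_MOVABLE. */
	if (madvise(buf, len, MADV_MIRROR) < 0)
		perror("madvise(MADV_MIRROR)");

	return 0;
}
```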

Kirill Shutemov objected that madvise() is the wrong interface to use; placement in mirrored memory would be mandatory, while madvise() is, as its name suggests, an advisory system call. Rik van Riel asked why we would want to put user-space memory into the mirrored range; the answer seems to be to enable the running of a particularly important virtual machine with mirroring. The problem is that, once you try to put a user-space program into that range, everything has to go there, including shared libraries. Making all that work properly could get a little messy.

On the other hand, it was pointed out that computers exist to run applications. If a particular application is so important that it needs a computer with (expensive) mirrored memory, why not protect that application, too? Aneesh Kumar said that one has to start somewhere, and that protecting the kernel is the first step. Protection can be expanded from there.

There was some talk about preventing user space from exhausting the mirrored range; perhaps requesting mirrored memory should be a privileged operation. It's also not clear what the kernel should do if mirrored memory runs out; should it fall back to non-mirrored memory? The conclusion seemed to be that falling back would remove the reliability guarantee that mirrored memory is meant to provide, so it should not be done. Instead, if possible, the range of memory that is mirrored should be expanded.

It was Michal Hocko who raised the key objection to this scheme, though: it threatens to bring back all of the low-memory problems that, the developers had thought, we had finally left behind. On 32-bit systems, only a portion of the physical address space is directly addressable by the kernel; that portion is called "low memory." Kernel data structures, as a rule, can only be placed in the low-memory region. That has led to many problems over the years where the system runs out of low memory and finds itself crippled, even though quite a bit of memory is available in general. 64-bit systems do not have this problem, since the kernel can directly map all of physical memory.

By creating a new zone that must contain all kernel memory, partial memory mirroring recreates the low-memory problem. It will place hard limits on the amount of user-space memory that can be allocated, leading, Michal said, to out-of-memory situations when, in fact, lots of memory is free. Rik added that experience has shown that the ratio of non-kernel memory to kernel-addressable memory should not go much above 4:1; after that, problems start to develop.

Returning to the user-space side, Rik said that it would be necessary to place some user-space data in mirrored memory. If the init process dies due to memory corruption, for example, the fact that the kernel is protected will provide little comfort. Then the C library probably needs to live there, and probably no end of other things. In the end, Michal said, the obvious conclusion is that one should simply mirror the entire address space.

Mike Kravetz suggested that mirrored memory could be an opt-out resource rather than opt-in, but Kirill pointed out that an application that opts out would likely end up placing important libraries in unmirrored memory. Those would have to be somehow upgraded later on when another application needs them. Mel Gorman said that, in the end, nobody would volunteer to opt out; as Linus noted recently, few developers think that their application is not important.

Mel went on to say that partial mirroring is simply the wrong approach; if a system needs that level of reliability, he said, it should just mirror all of memory. Trying to work around that requirement is trading a potential future problem (memory corruption) for a definite problem now (kernel-memory issues). Beyond that, we can't pretend that user space can make mirroring decisions correctly. Security issues remain; even if requesting mirrored memory directly is a privileged operation, an unprivileged process will still be able to force the exhaustion of mirrored memory. There is no privilege separation in this scheme; it promises the ability to protect specific applications, but is unable to deliver on it. The result will be worse than a false hope; it will create a system that is fundamentally fragile.

Andrea Arcangeli pointed out that some of this fragility is already inherent in the ZONE_MOVABLE memory zone. Mel agreed, saying that ZONE_MOVABLE is a curse that should never have been used the way it is. As a result, he said, features like memory hotplug are fundamentally broken and systems are more fragile than they should be. But, he said, if you see a car crash, you don't normally drive in to join it; the same approach should be taken here.

At this point, time ran out, and it became clear that the conversation was circling around on itself. But it was also clear that the memory-management developers think little of the partial-mirroring idea and would rather not see code added to the kernel to support it.

Comments (3 posted)

Heterogeneous memory management

By Jonathan Corbet
April 27, 2016

LSFMM 2016
The processor that one thinks of as "the" CPU is not the only processor on most systems; indeed, it is often not the fastest. Attached devices, first and foremost the graphics processor (GPU), have their own processors that can speed a number of computing tasks. They often have full access to system memory, but there are obvious challenges to sharing that memory completely between the CPU and other processors. The heterogeneous memory management (HMM) subsystem aims to make that sharing possible; Jérôme Glisse led a session on HMM for the memory-management track at the 2016 Linux Storage, Filesystem, and Memory-Management Summit.

The key feature of HMM, Jérôme said, is making it possible to mirror a process's address space within the attached processor. This should happen without the need to use a special allocator in user space. On the hardware side, there are a couple of technologies out there that make this mirroring easier. One is the PowerPC CAPI interface; another is the PASID mechanism for the PCI Express bus. On the software side, options are to either mirror the CPU's page table in the attached processor, or to migrate pages back and forth between CPU and device memory. Regardless of how this is done, the hope is to present the same API to user space.

We care about this, Jérôme said, because the hardware is out there now; he mentioned products from Mellanox and NVIDIA in particular. Drivers exist for this hardware which, he said, is expensive at the moment, but which will get cheaper later this year. If we don't provide a solution in the kernel, things will run much more slowly and will require the pinning of lots of memory. It will be necessary to add more memory-management unit (MMU) notifiers to device-driver code, which few see as desirable. OpenCL support will only be possible on integrated GPUs. In general, he said, it is better to support this capability in the kernel if possible.

The solution to these ills is the HMM patch set, which provides a simple driver API for memory-management tasks. It is able to mirror CPU page tables on the attached device, and to keep those page tables synchronized as things change on the CPU side. Pages can be migrated between the CPU and the device; a page that has been migrated away from the CPU is represented by a special type of swap entry — it looks like it has been paged out, in other words. HMM also handles DMA mappings for the attached device.
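
In rough outline, a driver using HMM registers a mirror of the process's mm and receives callbacks when the CPU-side page tables change. The sketch below follows the general shape of the posted patches, but the structure and function names are illustrative and may not match the API exactly; my_device_invalidate_range() is a stand-in for the device-specific work.

    /*
     * Illustrative sketch of a driver mirroring a process's address space
     * with HMM; names follow the posted patches loosely and are not a
     * definitive API reference.
     */
    static void my_sync_pagetables(struct hmm_mirror *mirror,
                                   unsigned long start, unsigned long end)
    {
            struct my_device *dev = container_of(mirror, struct my_device, mirror);

            /* The CPU mapping for [start, end) changed; drop the device's copy. */
            my_device_invalidate_range(dev, start, end);
    }

    static const struct hmm_mirror_ops my_mirror_ops = {
            .sync_cpu_device_pagetables = my_sync_pagetables,
    };

    static int my_device_bind_mm(struct my_device *dev, struct mm_struct *mm)
    {
            dev->mirror.ops = &my_mirror_ops;
            return hmm_mirror_register(&dev->mirror, mm);
    }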

Andrew Morton noted that the patch set is "a ton of code," which always makes it harder to get merged. There was some talk of splitting the patch set into more palatable pieces; some of the code, evidently, is also useful for KVM virtualization. Andrew told Jérôme to take care to document who the potential users of this code are. Then, he said, "it's a matter of getting off our asses and reviewing the code." There might be trouble, he said, with the use of MMU notifiers, since Linus has been clear about his dislike of notifiers in the past.

Overall, though, no objections to the core model were expressed. The HMM code has been in development for some years; maybe it is finally getting closer to inclusion into the mainline kernel.

Comments (3 posted)

Memory-management testing

By Jonathan Corbet
April 27, 2016

LSFMM 2016
Memory-management subsystem testing is notoriously difficult; mistakes in the code often make themselves felt far from the place where things actually went wrong. Things are being done to improve testing in the kernel, though; Sasha Levin led a session at the 2016 Linux Storage, Filesystem, and Memory-Management Summit to discuss what has been done and how memory-management testing can be improved.

Sasha started by saying that he has been working with the KASan memory-error detection tool in an attempt to find places where the necessary locks are not being taken. But, to get there, he needs to be able to annotate which memory is protected by each lock, and he is not sure how to proceed. Adding that information inline, or in spin_lock_init(), doesn't seem like the best solution. Christoph Lameter suggested that the locking requirements could be put into the relevant structure definitions, but that could get messy; there were concerns about what that would do to the already convoluted struct page, for example.
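
No such annotation exists in the kernel today; purely to make the shape of that suggestion concrete, an in-structure annotation might look something like the sketch below, where GUARDED_BY() is an entirely hypothetical marker that KASan (or a static checker) could consume.

    /*
     * Entirely hypothetical: a structure-level annotation tying fields to
     * the lock that protects them, as Christoph Lameter suggested.  No
     * GUARDED_BY() macro exists in the kernel; the structure is made up
     * for the example.
     */
    struct frag_cache {
            spinlock_t              lock;
            struct list_head        pages GUARDED_BY(lock);     /* protected by lock */
            unsigned long           nr_pages GUARDED_BY(lock);
    };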

The discussion moved on to VM_BUG_ON() calls, which cause an oops when an assertion finds something wrong. Many of these checks, he said, could be relocated to the places where, for example, page flags are written. That would catch problems at the source, rather than tripping over them at some later point. The problem with that approach might be the performance cost, since it would be adding checks at every write to the page flags. Kirill Shutemov also worried about potential false positives; if flags are changed in multiple steps, they could appear to be in an incorrect state between those steps. Hugh Dickins agreed that there would be a lot of noise resulting from such a change, and said that nobody would bother to try to fix it all.
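
As a hedged illustration of checking at the source, a wrapper around a flag write could assert its preconditions at the moment the flag is set; the specific check below is made up for the example, not an existing kernel assertion.

    /*
     * Illustrative only: assert a precondition (here, that the page is
     * locked) at the point where a flag is written, instead of tripping
     * over inconsistent state somewhere else later on.
     */
    static inline void my_set_page_reclaim(struct page *page)
    {
            VM_BUG_ON_PAGE(!PageLocked(page), page);
            SetPageReclaim(page);
    }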

One possible improvement might be to validate the page flags only when a page is unlocked. Sasha said he would put a proposal together and see what sort of response he gets.

He is also trying to improve testing of the handling of huge pages. To that end, he has written a patch to expose the list of huge pages to debugfs; that allows him to force page splits at inconvenient times to see what happens. He plans to clean that patch up and submit it soon.

The final topic of discussion was kernel trees for testing. Sasha said that Andrew Morton's "mmotm" tree contains changes that are thought to be suitable for the next merge window, but that is of limited help for large memory-management patch sets, which often go through several iterations before reaching that point. So he is thinking about running a testing tree of his own containing patches from "known authors," in the hope of catching problems earlier. Would that be useful, he asked, and would others make use of it?

Andrew responded that it's often more useful if one person provides the testing service, rather than putting out a tree in the hope that others will test it. But he might consider pulling it into mmotm. Hugh worried, though, that doing so might destabilize the mmotm tree. Sasha responded that such destabilization was exactly the purpose — it would help to bring out problems early. But whether such destabilization would be welcome for mmotm users in general is not clear.

Comments (none posted)

Memory control group fairness

By Jonathan Corbet
April 27, 2016

LSFMM 2016
Control groups running with the memory controller (known as "memory control groups" or "memcgs") allow the system administrator to regulate memory use by groups of processes. Getting this controller working well has been a long process, though, and, as Vladimir Davydov made clear during his session at the 2016 Linux Storage, Filesystem, and Memory-Management Summit, the job is not done yet.

The main problem that remains, according to Vladimir, is fairness. When the system is under memory pressure, all memcgs are scanned and trimmed in proportion to their configured limits. But, if one process is creating lots of inactive pages — by streaming through a large file, for example — it will claim space from the others. This is not a useful result; it is causing needed pages to be pushed out in favor of pages that will never be used again. Unless other processes in the group touch their pages just as quickly as the streamer, they will lose those pages.

Johannes Weiner said that this behavior is only a problem if memory is overcommitted on the system. But it tends to come up with workloads like virtualization, for which the entire point is to overcommit the system's resources.

One possible solution would be to adjust each process's memory limits dynamically depending on how much memory pressure is created by each. It is not clear how that pressure would be detected and quantified, though.

Another possibility is to store the time that each page was added to the LRU list, and to track the oldest page on each list. The system could try to achieve an approximate balance of ages. A control-group parameter could configure a minimum age for pages in the list. Only the active list would be affected by this parameter, so it would tend to protect actively used pages over the streamer's pages, which are in the inactive list.

Johannes said that the hard and soft memory limits implemented by memory control groups should be used for this purpose; their whole reason for existence, after all, is to route memory pressure. If the streaming process's limits are set to a relatively low value, it will be trimmed accordingly. The problem is that setting these limits appropriately is not an easy task; there would really need to be a system daemon charged with doing that.
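
As a reminder of the interface in question, a management daemon would simply write new values into the group's control files. The sketch below assumes the version-1 memory controller mounted at /sys/fs/cgroup/memory and an existing group named "streamer"; both the mount point and the group name are assumptions made for the example.

    #include <stdio.h>

    /*
     * Minimal sketch: adjust a memcg's hard and soft limits by writing to
     * its cgroup-v1 control files.  The mount point and group name are
     * assumed for the example.
     */
    static int write_control(const char *file, const char *value)
    {
            char path[256];
            FILE *f;

            snprintf(path, sizeof(path),
                     "/sys/fs/cgroup/memory/streamer/%s", file);
            f = fopen(path, "w");
            if (!f)
                    return -1;
            fputs(value, f);
            return fclose(f);
    }

    int main(void)
    {
            write_control("memory.limit_in_bytes", "268435456");        /* 256MB hard */
            write_control("memory.soft_limit_in_bytes", "134217728");   /* 128MB soft */
            return 0;
    }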

It was suggested that the refault distance work could help in this case. Refault distance is essentially a measurement of a process's working set; it tells how long a process's pages tend to stay paged out before being brought back into working memory. This measurement is a bit too one-sided for this task, though; it can tell when to increase a group's limits, but not when to shrink them.

Another possibility is the vmpressure mechanism, which is meant to notify processes when memory starts to get tight. Michal Hocko said, though, that vmpressure only works well on small systems; on larger systems, pressure tends to look high even when the situation is not that severe.

Johannes said that it might be possible to track how much time processes spend waiting for memory. If a process spends, say, over 10% of its execution time waiting for pages to be faulted back in, its memory limits need to be expanded. The kernel has no such metric at the moment, though, so it's not possible to tell how "healthy" the workload is. Rik van Riel suggested that the same metric could be used to shrink working sets if the wait time goes below a low watermark.

Vladimir concluded the session by saying that he would start by trying to use the facilities that are available now and add a daemon to try to tweak the memory limits on the fly.

Comments (none posted)

TLB flush optimization

By Jonathan Corbet
April 27, 2016

LSFMM 2016
The translation lookaside buffer (TLB) caches mappings from virtual to physical addresses in an attempt to minimize the number of traversals of the page tables the CPU needs to make. When the page tables are changed, though, the information in the TLB may be rendered incorrect and, as a result, need to be flushed. TLB flushes are expensive; they drop cached information and can involve sending inter-processor interrupts (IPIs) across the system. So there is interest in reducing their cost; Aneesh Kumar and Andrea Arcangeli led a session at the 2016 Linux Storage, Filesystem, and Memory-Management Summit to discuss ideas of how to do that.

Aneesh started by saying there needs to be an easier way to flush a range of TLB entries. But, when it comes time to do a TLB flush, it is not always easy to know what the size of the range is. A possible solution would be to track multiple flushes in the mmu_gather structure used with TLB flushing and push them all out at once. The idea seemed to be reasonably well accepted and was not discussed at length.
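
The idea, roughly, is to keep widening a single tracked range as page-table entries are torn down, then issue one ranged flush at the end. The sketch below is only illustrative; the field names mimic the existing start/end tracking in mmu_gather, and flush_range() stands in for the architecture's ranged-flush primitive.

    /*
     * Illustrative sketch of batching TLB flushes: accumulate the affected
     * range in the mmu_gather structure, then flush once.  flush_range()
     * is a placeholder for the architecture's real ranged flush.
     */
    static inline void tlb_track_range(struct mmu_gather *tlb,
                                       unsigned long start, unsigned long end)
    {
            tlb->start = min(tlb->start, start);
            tlb->end = max(tlb->end, end);
    }

    static inline void tlb_flush_tracked(struct mmu_gather *tlb)
    {
            if (tlb->end > tlb->start)
                    flush_range(tlb->mm, tlb->start, tlb->end);
    }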

Andrea got up to talk about a related issue: overly long reverse-mapping (rmap) walks. Reverse mappings are the mechanism by which the kernel can determine which processes have a given page in their page tables. They are kept in a linked list which, Andrea said, can get to be very long, causing the kernel to spend a long time traversing the list. On a virtualization-heavy system with KSM enabled, the lists can grow toward one million entries.

That leads to a clear desire to reduce the length of the lists. One simple proposal is to just cap the length of the rmap list to something like 256; after that, a new mapping would be needed. Or, even without a maximum list length, setting a maximum sharing factor after which KSM would not merge pages would help a lot.

Andrea is also interested in reducing the cost of IPIs associated with memory-management changes. One place where IPIs can happen is with the tracking of referenced pages; he noted that page_referenced(), which checks whether a given page has been referenced via any of its mappings, no longer sends IPIs in the normal case. It traverses the rmap lists, though, so is affected by long list lengths. But, beyond that, the memory-management unit notifier in the KVM hypervisor does do IPIs, since there is no "accessed" bit that is maintained by the hardware in the shadow page tables in guest systems. That can lead to scalability problems.

Another place with rmap scalability problems is page migration, which must walk the entire rmap list and perform TLB flushing. Offlining memory from guests requires migration, so this can be a significant issue. There are patches in circulation to do the flushing in batches and reduce the resulting IPIs. This work is good but needs to be extended somewhat.

The session wound down without much in the way of real conclusions. TLB flushing and the related machinery, it was agreed, present some scalability issues, and work will be required, as always, to mitigate those issues.

Comments (none posted)

Improving the OOM killer

By Jonathan Corbet
April 27, 2016

LSFMM 2016
When the system becomes so short of memory that nothing can make forward progress, the only possible way to salvage anything may be to kill processes until some memory becomes available. That is where the dreaded out-of-memory (OOM) killer comes into play. For as long as the OOM killer has existed, developers have been trying to improve its operation. The latest to work in this area is Michal Hocko, who led a session during the memory-management track of the 2016 Linux Storage, Filesystem, and Memory-Management Summit to talk about what he has been doing.

One of Michal's goals is to make the process of detecting OOM situations more reliable and deterministic. How things have been done in practice so far, he said, is to try to reclaim memory until nothing more can be found for several iterations in a row, then invoke the OOM killer. The problem is that there have always been bugs in this code. The OOM killer is only summoned for order-0 (single-page) requests and, worse, a single free page resets the scan counter. That means that, with a tiny trickle of pages becoming free, the kernel can loop forever without ever starting the OOM killer.
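
In greatly simplified form, the old heuristic behaves like the sketch below (illustrative code with stand-in helpers, not the real page-allocator slow path); the reset of the retry counter on any progress at all is where the trickle-of-pages livelock comes from.

    /*
     * Greatly simplified sketch of the old order-0 retry logic; the helper
     * functions are stand-ins, not real kernel interfaces.
     */
    #include <stdbool.h>

    extern unsigned long shrink_some_memory(void);      /* "direct reclaim" */
    extern bool allocation_would_succeed(void);         /* "watermark check" */
    extern void invoke_oom_killer(void);

    #define MAX_RETRIES 16

    void slowpath_sketch(void)
    {
            int no_progress = 0;

            while (no_progress < MAX_RETRIES) {
                    unsigned long reclaimed = shrink_some_memory();

                    if (allocation_would_succeed())
                            return;              /* enough memory came free */
                    if (reclaimed)
                            no_progress = 0;     /* one freed page resets the count... */
                    else
                            no_progress++;       /* ...so a slow trickle loops forever */
            }
            invoke_oom_killer();
    }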

Michal's work in this area involves getting feedback from the reclaim and compaction code, and invoking the OOM killer if the situation doesn't improve over time. In the future, he would like to make the code more conservative, and to detect when the system is thrashing. In thrashing situations, the OOM killer could be started even if the system is not strictly out of memory. Christoph Lameter complained that starting the OOM killer "wrecks the system" by killing off useful processes, but Michal responded that, in such situations, the system is already lost, so it makes sense to try to recover it partially. Then, if nothing else, an administrator can get in and try to figure out what the problem is. The situation as it exists now is fragile, he said, and is worth changing. The developers in the room seemed to agree with that sentiment, and it was decided that this work should be merged.

Michal's other area of work is OOM-killing reliability — making sure that something useful happens after the OOM killer is started. Some developers have been trying to add timeouts to the OOM-killing code, meaning that, if killing one process does not yield free memory within a bounded time, the OOM killer would move on to a new victim. Michal has been pushing back on those patches; in his opinion, other means should be used if possible. His alternative is the OOM reaper, which deprives a victim process of its memory resources even before that process can exit. That allows the memory to be freed even if the victim process is blocked on some lock and unable to exit. This code was merged for the 4.6 release.

While nobody objected to that work, some developers still felt that there is a place for timeouts in the OOM killer code. There are situations, for example, in which the OOM reaper will be unable to free a process's memory. Should things get wedged, the feeling seemed to be, it's better to try killing another process than to lose the system altogether. Michal said that, if somebody wants to work on adding timeouts, it would be acceptable to him as long as the code was deterministic. Timeouts are, after all, orthogonal to the rest of what he is working on. Andrea Arcangeli warned against attempts to make the OOM killer perfect, since it is unlikely to ever get there.

As the session came to a close, Hugh Dickins raised another problem: what to do if all of the system's memory is tied up in the tmpfs filesystem (which has no backing store and only stores files in memory). Killing processes will not, in general, cause that memory to be freed, and there is, he said, no way to randomly truncate files to free their pages. There is an experiment in Google, he said, to try to truncate large tmpfs files when the system runs out of memory. The immediate reaction in the room, though, was that any such approach is dangerous at best, so this patch may not ever make it out into the wider world.

Comments (none posted)

Memory-management subsystem workflow

By Jonathan Corbet
April 27, 2016

LSFMM 2016
The final session in the memory-management track of the 2016 Linux Storage, Filesystem, and Memory-Management Summit concerned how the developers work together. It was led by Andrew Morton, the maintainer of the -mm tree and primary memory-management maintainer. As a whole, things seem to be working smoothly, but there are always ways to make improvements.

Andrew started off by saying that he does not actually do that much of the work; he relies heavily on the other memory-management developers to do code reviews. He allowed that he ignores acks from some developers while respecting others. He apologized for the "peculiar management" of the -mm tree (but didn't say anything about making changes to how it is managed). Were there any changes that the other developers would like to see?

Michal Hocko asked for the inclusion of message IDs in patches to make it easier to figure out where they came from. That can be especially hard when batches of fixes are merged together. Andrew agreed to do this.

With regard to code review, he requested that developers let him know if they don't have time to review patches of interest to them at the moment. He is happy, he said, to "park" the patches until time for reviewing becomes available.

Matthew Wilcox asked Andrew if he would rather, when patches are updated, see incremental fixes or entirely new patch sets. Andrew replied that he would rather see the fixes separately; that allows reviewers to easily see what has changed. But, he said, developers should send whatever seems easiest to review in general; he can always generate a delta from one patch set to the next.

Mel Gorman asked for a more clear distinction between patches that have gone into -mm because they are intended for merging soon and those that are just there for testing. He mentioned, in particular, the team pages patches (discussed the previous day) which, he said, created conflicts with other near-term work. In this case, though, it turns out that Andrew had indeed expected to merge those patches in the 4.7 merge window — a plan which had been changed in the previous day's discussion.

Kirill Shutemov complained that it can be hard to get memory-management patches reviewed sometimes. He wondered if he should be dividing his work into smaller batches, since big patch sets can be scary. Andrew suggested he should "bribe people" to review his work, but acknowledged that he didn't have any better answers for what is a hard and persistent problem. Michal said that he sometimes holds off on posting reviews if he doesn't like a particular patch but lacks ideas for a better solution. Hugh Dickins added that, if he comments on a patch, he ends up on the CC list for subsequent versions. So he is tempted to hold back on reviews in order to keep his email load under control.

At the end of the session, Michal pointed out the large volume of non-memory-management work that goes through the -mm tree and asked if that should change. Andrew remains the maintainer of last resort for much of the kernel. He replied, though, that carrying those other patches is not a large amount of work, so he sees no strong need to change things.

Comments (none posted)

Patches and updates

Kernel trees

Linus Torvalds Linux 4.6-rc5
Greg KH Linux 4.5.2
Greg KH Linux 4.4.8
Greg KH Linux 3.14.67
Jiri Slaby Linux 3.12.59

Architecture-specific

Build system

Core kernel code

Device drivers

Device driver infrastructure

Documentation

Filesystems and block I/O

Memory management

Networking

Security-related

Miscellaneous

Page editor: Jonathan Corbet


Copyright © 2016, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds