
Kernel development

Brief items

Kernel release status

The current development kernel is 4.0-rc4, released on March 15. Linus said: "Nothing particularly stands out here. Shortlog appended, I think we're doing fine for where in the release cycle we are."

Stable updates: 3.19.2, 3.14.36, and 3.10.72 were released on March 18.


Kdbus on track for 4.1

Greg Kroah-Hartman has added the kdbus tree to linux-next with an eye toward merging it during the next merge window. "The code has been reworked and reviewed many times, and this last round seems to have no objections, so I'm queueing it up to be merged for 4.1-rc1."


Kernel development news

Virtual filesystem layer changes, past and future

By Jonathan Corbet
March 16, 2015

LSFMM 2015
While most of the 2015 Linux Storage, Filesystem and Memory Management summit was dedicated to subsystem-specific discussions, some subjects were of sufficiently wide interest that they called for plenary sessions. Al Viro's session about the evolution of the kernel's virtual filesystem (VFS) layer was one such session. There is little that happens in the system that does not involve the VFS in one way or another; in a rapidly changing kernel, that implies a need for the VFS to change quickly as well.

One of the things that has not yet happened, despite wishes to the contrary, is the provision of a better set of system calls to replace mount(). Al did some work in that area but the patches got bogged down before they were even posted for review. So there is no real progress to report in that area yet. On the other hand, there has been some limited progress toward the creation of a revoke() system call. The full implementation remains distant, but some of the infrastructure work is done.

An area that has seen more work is the transition to the iov_iter interface. Al's hope is that, by the time the 4.1 merge window closes, the reworking of aio_read() and aio_write() (part of the asynchronous I/O implementation) to use iov_iter will be complete. There are several instances that still need to be converted, but he is reasonably confident that there are no significant roadblocks.
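For reference, the change amounts to swapping one pair of file_operations methods for another. A simplified sketch (method names as found in kernels of this era, with "foo" standing in for any given filesystem) looks something like this:

```c
#include <linux/fs.h>
#include <linux/uio.h>

/* Old-style pair: the iovec array and segment count are passed directly,
 * and each filesystem walks the array itself. */
static ssize_t foo_aio_read(struct kiocb *iocb, const struct iovec *iov,
			    unsigned long nr_segs, loff_t pos);

/* iov_iter-based replacement: the iterator hides whether the data lives in
 * a user-space iovec, a kernel iovec, or a bio_vec, and generic helpers
 * such as copy_to_iter() do the walking. */
static ssize_t foo_read_iter(struct kiocb *iocb, struct iov_iter *to);

static const struct file_operations foo_fops = {
	.read_iter	= foo_read_iter,	/* replaces .aio_read */
	/* ... */
};
```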

In the last year, the send and receive paths in the network stack have seen iov_iter conversions. The sendpage() path remains to be done, but there do not seem to be any obstacles to getting it done. The conversion of the splice() system call is a bit harder. The code on the write side has almost all been switched, with one exception: the filesystem in user space (FUSE) module. The problem with FUSE is that it wants to do zero-copy I/O, moving pages directly between a splice() buffer and the page cache.

When splice() was first added to the kernel, this sort of "page stealing" was part of the plan; it seemed like a useful optimization. But page stealing had a number of problems, including confusion in the filesystem code when an up-to-date page is stuffed directly into the page cache. So Nick Piggin removed that feature in 2007 and nobody has ever gotten around to putting it back. Al noted that Nick described some of the problems in his commit message, but there are others and, since Nick has proved hard to reach in recent years, they will have to remain a mystery until somebody else rediscovers them.

Meanwhile, zero-copy operation in splice() is disabled, with one exception: FUSE. The problems that affected page stealing with other filesystems do not come up with FUSE, so there was no reason to disable it there; beyond that, FUSE needs zero-copy operation or its performance will suffer. This has prevented the conversion of FUSE over to iov_iter for now. Al's preferred solution to this problem would be to restore the zero-copy mode for all cases, but that is going to take some exploration.

The read side (as represented by the splice_read() file_operations method) will probably be converted sometime this year.

In summary, Al said, he is surprised by how many iovec instances (the predecessor to iov_iter) remain in the kernel. It is not about to go extinct quite yet, but there are fewer and fewer places where it is used.

Another upcoming change that might be visible outside of the VFS is that the nameidata structure is about to become completely opaque. It will only be defined within the VFS code. Al would like to eventually get rid of even the practice of passing around pointers to this structure and switch to using a pointer out of the task structure. This change should not affect non-VFS code that much, but he wanted to mention it because there are patch sets out there that will be broken.

Work continues on the project of getting rid of the numerous variants of d_add(), the basic function that adds a directory entry (dentry) structure into the dentry cache. One of those variants — d_materialise_unique() — was removed in 3.19. Others, like d_splice_alias(), remain. The ideal situation would be to have a single primitive to associate dentries with inodes. Matthew Wilcox asked if the other variants might still have value for documentation purposes, but Al said such cases should be handled with assertions.
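A rough idea of why consolidation is attractive can be seen from the prototypes of the surviving variants (simplified from the dentry-cache API of this era, details hedged; note that even the argument order is inconsistent):

```c
#include <linux/dcache.h>

/* Simplified prototypes of the variants in question. */
void d_add(struct dentry *dentry, struct inode *inode);
void d_instantiate(struct dentry *dentry, struct inode *inode);
struct dentry *d_splice_alias(struct inode *inode, struct dentry *dentry);
/* d_materialise_unique(), yet another variant, was removed in 3.19. */
```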

A couple of other recent changes include unmounting of filesystems on invalidation and better shutdown processing. The unmounting change causes a filesystem to be automatically removed if its mount point is invalidated; it went in some months ago. The big change with filesystem shutdown processing is that it is now delayed and always run on a shallow stack. That should address concerns about stack overflows that might otherwise occur during shutdown processing.

Al's final topic had to do with BSD process accounting. What happens if you start accounting to a file, then unmount the underlying filesystem? On a BSD system, the unmount will fail with an EBUSY error. But, on Linux, "somebody decided to be helpful" and thought it would be a friendly gesture to automatically stop the accounting and allow the unmount to proceed. This policy seems useful, but there is a catch: it creates a situation where an open file on a filesystem does not actually make that filesystem busy. That has led to a lot of interesting races dating back to 2000 or so; it is, he said, a "massive headache."

This mechanism has now been ripped out of the kernel. In its place is a mechanism by which an object can be added to a vfsmount structure (which represents a mounted filesystem); that object supports only one method: kill(). These "pin" objects hang around until the final reference to the vfsmount goes away, at which point each one's kill() function is called. It thus is a clean mechanism for performing cleanup when a filesystem goes away.

The first use of this mechanism is to handle shutdown of BSD process accounting. But it can also be put to good use when unmounting a large tree with multiple filesystems. If one filesystem depends on another, a pin object can be placed to ensure that the cleanup work is done in the right order. This facility, found in fs/fs_pin.c, looks to be useful but, as Ted Ts'o noted, it is also completely undocumented at the moment. Al finished the session with an acknowledgment that some comments in that file would be helpful for other users.
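A hedged sketch of how a pin user might look (the helper names approximate the fs/fs_pin.c interface of this era rather than quoting it exactly, and my_acct_state is a hypothetical example):

```c
#include <linux/fs_pin.h>
#include <linux/kernel.h>
#include <linux/slab.h>

/* Hypothetical per-mount state that must be torn down when the mount
 * finally goes away. */
struct my_acct_state {
	struct fs_pin pin;
	/* ... resources tied to the mounted filesystem ... */
};

static void my_acct_kill(struct fs_pin *p)
{
	struct my_acct_state *s = container_of(p, struct my_acct_state, pin);

	/* Called when the final reference to the vfsmount is dropped:
	 * release everything tied to the filesystem, then detach and free
	 * the pin. */
	pin_remove(p);
	kfree(s);
}

/* Attaching the pin to a mount would look roughly like:
 *	s->pin.kill = my_acct_kill;
 *	pin_insert(&s->pin, mnt);
 */
```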


Filesystem/block interfaces

By Jake Edge
March 17, 2015

LSFMM 2015

In his session at the 2015 LSFMM Summit, Steven Whitehouse wanted to try to pull together lots of individual projects that are affecting the interfaces between the filesystem and block layers. There may be certain commonalities between them, so it would be good if the projects knew about each other. When looking at making interface changes, it is also important for the storage and filesystem maintainers to consider the needs of all of these related projects rather than just looking at them piecemeal.

[Steven Whitehouse]

These projects come under one of three broad headings: dynamic devices, innovative I/O, and snapshots. Dynamic devices refers to "intelligent storage" devices; normally, a block device has the same characteristics throughout its life, but dynamic devices change capacity or other attributes over time. Innovative I/O refers to working with devices like shingled magnetic recording (SMR) and persistent memory devices as well as supporting data integrity features like checksums. Snapshots could fit in either of the other two headings, but he thought it was best to pull them out on their own.

Dynamic devices are those that have changes made to the device post-mount. For example, thin provisioning changes the capacity of the underlying devices as the available disk space shrinks, up to the capacity the kernel believes it has. But dynamic devices may require a different kind of interface for error reporting so that filesystems can distinguish between temporary and permanent errors. Topology changes for multipath devices are another dynamic change. If Btrfs experiences checksum failures while trying to read data, it may want to be able to ask for a different mirror or to change the path to the data. He asked: what information is needed from the block layer, and how do the filesystems get that information?

There is a difference between informational reporting and error reporting, James Bottomley said. One contains hints that filesystems might want to use, while the other means the filesystem needs to do something about the event. Another question is how applications would want to get that kind of information, Ted Ts'o said, though it is clear that most applications won't change to take advantage of this kind of information.

Hannes Reinecke said that there have been some attempts to use udev notifications to provide information to user space. The problem with that is that there is no device information available for udev to attach the information to. Even if the information is available, there needs to be a way to transport it, he said.

But it is the filesystems that really need to know about changes in the block layer, Ts'o said. Maybe there needs to be a callback added to struct super_block that the block layer can make use of to alert filesystems to changes. Even a simple "something changed" message would be helpful.
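A purely hypothetical sketch of what such a callback might look like; no such hook exists in the mainline kernel, and all of the names below are invented for illustration:

```c
#include <linux/fs.h>

/* Hypothetical block-layer-to-filesystem notification hook. */
enum bdev_event {
	BDEV_CAPACITY_CHANGED,		/* e.g. thin-provisioning changes */
	BDEV_PATH_CHANGED,		/* e.g. multipath topology change */
	BDEV_SOMETHING_CHANGED,		/* the minimal "something changed" hint */
};

struct super_block_notify {		/* hypothetical extension */
	void (*bdev_event)(struct super_block *sb, enum bdev_event event);
};
```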

There are a variety of new features that require different ways to communicate between the filesystems and the block layer, Whitehouse said in transitioning to the innovative I/O topic. SMR devices need to provide ways for the filesystem to find out where the write pointer is and the layout of the zones in the device. Data integrity (e.g. DIF/DIX) requires ways for checksums and/or checksum failures to be communicated between the block and filesystem layers. If the filesystem wants to read from a specific disk in a mirror, to provide hints to the block layer, or to initiate a copy offload operation, there needs to be an interface available to do so. He wondered if the same sorts of mechanisms could be used to support all of these kinds of operations.

The short answer would seem to be "no". Ts'o said that there are too many differences for all of those to be able to share much. But too much specificity in the interfaces won't be good either, Ric Wheeler said. Sometimes the right thing to request is for the block layer to "do something different than you did last time" when there is a problem, he continued. Christoph Hellwig agreed that "try again" can be the right approach for both disk failures and transport failures, while Dave Chinner suggested that adding some kind of "retry as hard as you can" operation might be helpful.

The problem comes back to error reporting and distinguishing transient from permanent errors, which is a recurring topic in the storage and filesystems tracks at LSFMM. The kernel is currently limited to the POSIX-defined errors, Chinner said. What is really needed are more fine-grained errors that give more information than just ENOSPC. A proper error interface from the block layer is really needed, he said.

Getting consistency between the snapshot operations across various devices was Whitehouse's last topic. Trying to take a filesystem snapshot on a single device is much different than doing so on a thin-provisioned array that may involve multiple underlying block devices. There are different granularities for snapshots as well. It could be that a single-file snapshot or application snapshot (which might include files on multiple filesystems) is desired.

For this topic, though, there was little time for discussion. Whitehouse was able to at least introduce the problem a bit for consideration down the road.

[I would like to thank the Linux Foundation for travel support to Boston for the summit.]


Overlayfs issues and experiences

By Jake Edge
March 17, 2015

LSFMM 2015

David Howells and Mike Snitzer led a discussion at the 2015 Linux Storage, Filesystem, and Memory Management (LSFMM) Summit about the overlay filesystem (overlayfs), which is the union filesystem implementation that was adopted into the kernel in 3.18. There are a number of problems that need to be addressed for this new filesystem.

Howells was first up. He noted that overlayfs does not play nicely with security technologies that use object labels (e.g. SELinux). There are a couple of problems that he reported back in November. Overlay filesystems can have three different inodes for any given file, one in the overlayfs itself, one in the read-only lower layer, and another in the writable upper layer if the file has been written (and, thus, copied up to the upper layer). The problem for SELinux and others regards which of the three different possible versions of the inode (i.e. lower, upper, or overlay) is visible to them. That affects what security labels will be seen on the file. But those problems have largely been solved at this point.

[David Howells]

There are two more problems, for file locking and fanotify, that still need to be addressed. The first is a Jeff Layton problem, while the other is an Eric Paris problem, Howells said with a chuckle. Layton was present, so the discussion turned to locking. What happens when an overlayfs file that has not been written to is locked (so the lock must be placed on the lower layer), then written to so that it must be copied up from the lower layer into the upper? Should the lock be copied up too? And if there are two overlays referring to the same underlying file, each of which has a copied-up version of the file, where should the lock go then?

As it turns out, the fanotify problems are similar. If an application requests notifications on an overlayfs file that has not been written to, the notification must get placed on the lower layer inode. If the notifications are not copied up when the file gets written, then applications won't get notified even if changes are being made to the file.

James Bottomley suggested that the semantics for file locking and fanotify need to be worked out before a mechanism to satisfy them can be proposed. Ted Ts'o was uncomfortable having different behavior based on whether the file was part of an overlayfs. Howells noted that things can get worse than he had described when you add in network filesystems (e.g. SMB or NFS) as the overlayfs layers. He noted that he had posted a message in January with all of the problems he could think of, but "there are probably more".

Layton suggested returning ENOLCK when trying to lock files in an overlayfs until the semantics could be worked out and implemented. Al Viro noted that with overlayfs, a file opened for reading may have a different inode number than one opened for writing. That could be a problem for a number of different applications. The classic example is a mail user agent, Viro said, but some editors also care.

Bottomley said that there is a need to avoid surprise semantics. To do that, the developers need to know what actually matters and what users care about. POSIX semantics were broken for overlayfs, but does that really harm real users? "There is a limit to how far we need to dig to find problems that people are not complaining about", he said.

[Mike Snitzer]

One of the users of overlayfs is Docker, so Snitzer wanted to look at that use case. Docker tried Btrfs, but didn't like it, he said. The project can't use block-based solutions, such as those based on device mapper and thin provisioning (thinp) that most Linux distributions use. The reason behind that is "lame" in Snitzer's view. Essentially, the project wants its Go programs to be built once (on Ubuntu) and then run on any other distribution forever, which requires statically built binaries. But there is no static library available for udev, which means that the devicemapper graph driver cannot be used. That is a political, not a technical, issue, Snitzer said.

The big reason that Docker has switched to overlayfs is to gain the memory efficiency that comes from pages in the page cache being shared between the containers. That doesn't happen with thinp currently, but Snitzer said that Dave Chinner has some ideas for using XFS on top of thinp to achieve it.

Chinner spoke up to describe the problem, which is that there might be a hundred containers running on a system all based on a snapshot of a single root filesystem. That means there will be a hundred copies of glibc in the page cache because they come from different namespaces with different inodes, so there is no sharing of the data. Basically, he said, there needs to be a kind of page cache deduplication to fix the problem.

Bottomley noted that it was a similar problem to the one that KSM tries to solve. KSM basically uses hashes of the contents of various pages of memory to share memory better between virtual machines. For containers, the main need is to deduplicate the page cache specifically. Bottomley said that the company he works for, Parallels, has a solution to the deduplication problem that does not require hashing each page, but that it is, currently at least, proprietary. Sharing of memory between containers is something that many are looking for, though, so there was some discussion of how to do it without the overhead that KSM incurs. That is where things wound down.

[I would like to thank the Linux Foundation for travel support to Boston for the summit.]


Asynchronous buffered read operations

By Jake Edge
March 18, 2015

LSFMM 2015

A problem that Milosz Tanski has run into throughout his career is part of what brought him to the 2015 Linux Storage, Filesystem, and Memory Management Summit. Some reads can be satisfied immediately from the page cache, while others require an expensive I/O. Distinguishing between the two can lead to more efficient programs. He has implemented a new mode for read() that does so, though it requires adding a new system call.

The problem typically occurs in low-level network applications, Tanski said. Not every application can use sendfile(). For example, applications using TLS modify the data to encrypt it before sending it, which means they can't use sendfile(). So they must do their own copies but, depending on whether the data is in the page cache, some reads will be "slow", while others are "fast". Programs that want to do asynchronous disk I/O often just use O_DIRECT and replicate the page cache concept in user space. That way they can track the contents of the cache to determine if an I/O can be satisfied quickly or not.

[Milosz Tanski]

The normal workaround for these problems is to use thread pools for the I/O, but that pattern "kinda sucks". The latency added due to synchronization between the threads is not insubstantial. It is also often the case that requests that could be satisfied quickly get stuck behind slower requests.

So, with the help of Christoph Hellwig, he has implemented preadv2(), which is like preadv() except that there is a new flags argument (which, as was pointed out by several attendees, really should have been added with preadv()). There is only one flag available in his patches: RWF_NONBLOCK (which could also have been called RWF_NOWAIT, he said). That flag will cause reads to succeed only if the data is already in the page cache, otherwise it will return EAGAIN.

Basically, that flag allows reads from the network loop to skip the queue if the data needed is already available in the page cache. It essentially provides a fast path with minimal changes to the user-space application. He has been using it with an internal application and it works well.
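A hedged user-space sketch of that fast-path pattern, assuming a preadv2() wrapper and the RWF_NONBLOCK flag from the proposed patches (the flag's value and final name could change, and submit_to_io_threads() is a hypothetical stand-in for the application's thread pool):

```c
#define _GNU_SOURCE
#include <sys/uio.h>
#include <errno.h>

#ifndef RWF_NONBLOCK
#define RWF_NONBLOCK	0x00000001	/* assumed value; illustrative only */
#endif

/* Hypothetical stand-in for the application's existing thread-pool path. */
extern ssize_t submit_to_io_threads(int fd, void *buf, size_t len, off_t off);

static ssize_t serve_read(int fd, void *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };

	/* Fast path: only succeeds if the data is already in the page cache. */
	ssize_t ret = preadv2(fd, &iov, 1, off, RWF_NONBLOCK);
	if (ret >= 0 || errno != EAGAIN)
		return ret;

	/* Slow path: real I/O is needed, so hand the request to the thread
	 * pool instead of blocking the network event loop. */
	return submit_to_io_threads(fd, buf, len, off);
}
```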

His patches drew one major comment, he said, which was about using functionality like that in fincore() to get a list of the pages of a file that are resident in the page cache. The problem with that is a race condition where a page that was present at the time of the check is no longer there when the read is performed, which puts that read back into the slow lane.

He has also tested the patches with Samba, where they reduce the latency significantly. For his internal application, which is a large, columnar data store using the Ceph filesystem, he got 23% lower response times. The average response times dropped by 200ms, he said.

There have been some objections to adding another system call, Tanski said. James Bottomley was not particularly concerned about that, since the new system call is just adding a flag argument that should have been there already. Hellwig added that it required a new system call just to get the flag in, which is not an unusual situation in recent times.

Hellwig has also implemented pwritev2() as part of the patch set to add a flag argument for the write() side. There are no write flags included in the patch, though some will be added as separate patches down the road. There are some potential user-space uses for flags for writes, including a "high priority" flag and a non-blocking flag that could be used for logging, Hellwig said.

No one in the room seemed opposed to the idea. It seems likely that the two new system calls could show up as early as the 4.1 kernel.

[I would like to thank the Linux Foundation for travel support to Boston for the summit.]


Handling 32KB-block drives

By Jake Edge
March 18, 2015

LSFMM 2015

There have been requests from certain disk drive manufacturers for the kernel to support 32KB block (or sector) sizes, James Bottomley said to kick off the discussion at a combined storage and filesystem session at the 2015 LSFMM Summit. He noted that the page cache could only handle 4KB granularity, and he didn't see that changing any time soon, which means that 32KB block sizes cannot be directly supported. But he wondered if aligning and sizing requests for 32KB boundaries most of the time would work for the disk drives.

Dave Chinner said that XFS can already handle making requests that are aligned and sized correctly, but Bottomley asked if that included metadata reads and writes. Metadata is the biggest problem, Bottomley said. Shorter writes can be supported by doing a read-modify-write (RMW) underneath the covers, in the filesystem, block layer, or in the disk itself.
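A minimal sketch of the read-modify-write idea for a sub-block write, assuming a 32KB device block size; this is a generic illustration, not code from any existing layer:

```c
/* Round a byte offset out to the enclosing 32KB device blocks. */
#define DEV_BLOCK_SIZE	(32 * 1024ULL)

static inline unsigned long long blk_round_down(unsigned long long off)
{
	return off & ~(DEV_BLOCK_SIZE - 1);
}

static inline unsigned long long blk_round_up(unsigned long long off)
{
	return blk_round_down(off + DEV_BLOCK_SIZE - 1);
}

/* A 4KB write at offset 20KB then becomes:
 *   1. read the 32KB block starting at blk_round_down(20KB) = 0,
 *   2. copy the new 4KB into that buffer at offset 20KB,
 *   3. write the full 32KB block back.
 */
```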

Support for 4KB disk sectors, instead of the traditional 512-byte sectors, was added to Linux long ago, Ric Wheeler said. There are disk drives with 4KB logical and physical sectors out there now, Bottomley added. But that change matched up with the 4KB Linux page size. As Ted Ts'o pointed out, the page cache will need to be able to evict 4KB pages, which means that something will need to do an RMW operation on disks with larger block sizes.

[James Bottomley]

Chris Mason pointed out that even if all filesystems had changes made in their data paths to do all I/O in 32KB chunks, and those changes were ready for the 4.1 kernel (which is, of course, only a thought experiment), it would be years before the code is in the hands of users. It would take at least a year before the enterprise distributions pick up the changes and at least another year before users are comfortable switching. Given that the disk drive makers want support now, it would make sense for them to add emulation of 512-byte sectors, as they did with the 4KB drives, so no changes are required of the kernel.

Christoph Hellwig agreed, noting that virtual-memory eviction has various corner cases that will require page-sized writes. Chinner was also on board with that, saying that the "easy solution is to fix it in the drive". That is also true for supporting shingled magnetic recording (SMR) drives, he continued.

Bottomley asked about ext4 support for doing 32KB I/O. Ts'o said that it would require some work but that it could be done. The same is true for Btrfs, Mason said. "We're all wrong but in slightly different ways", he said of Linux filesystem support. Ts'o said that there would need to be support added to the virtual-memory subsystem to support 32KB I/O. The filesystems could do their own RMW to ensure the full 32KB was in the cache when doing writes.

Chinner asked about workloads that generate lots of small files. Bottomley said those would essentially waste an additional 28KB per file. Each would require an RMW operation as well, which might not perform all that well for some workloads.

There was a suggestion that having 4KB emulation (rather than 512-byte emulation) would be better, but Chinner called it "immaterial". There are all kinds of "mapping tricks" already done by SSDs; any emulation would essentially be the same. SSD makers won't even say what the sector size is for those devices, Bottomley said. But Chinner said that he didn't care and didn't really want to know. Some were concerned about the performance implications of hiding RMW operations in the drive, however.

One way to support larger block sizes in the page cache would be to move to larger pages throughout the kernel. The last time the idea of larger page sizes was raised with the memory management (MM) folks, they were not happy with the idea, Bottomley said. He wondered if it was worth raising the issue on day two of the summit in a plenary session. But Ric Wheeler said that the topic was raised in New Orleans (in 2013) and he didn't think the MM developers were "adamantly opposed" to the idea, just that no one was working on it.

But, as Chinner pointed out, 32KB is not likely to be the end of the line. Even if the page size were increased to 32KB, disk drive manufacturers will someday want 128KB or 256KB (or beyond) for the block size. So a solution that is not dependent on the page size of the system is needed. Using vmalloc() allocations rather than contiguous allocations might help. Compound pages might also be part of any eventual solution.

In the end, Bottomley summed up the discussion by saying that filesystems could "pull tricks" to make most I/O 32KB-friendly, but would need help from the MM subsystem to have it all be aligned correctly. Given the time frames, it would seem that drive makers need to do some kind of emulation for now.

[I would like to thank the Linux Foundation for travel support to Boston for the summit.]


Filesystem support for SMR devices

By Jake Edge
March 18, 2015

LSFMM 2015

Two back-to-back sessions at the 2015 Linux Storage, Filesystem, and Memory Management Summit looked at different attempts to support Linux filesystems on shingled magnetic recording (SMR) devices. In the first, Hannes Reinecke gave a status report on some prototyping he has done to support SMR in Linux. The second was led by Adrian Palmer of Seagate about a project to port the ext4 filesystem to host-managed SMR devices.

[Hannes Reinecke]

Reinecke described some prototyping he has done in the block layer to support SMR. Those devices have a number of interesting attributes that require code in the kernel to support. For example, SMR devices have multiple zones, some of which are normal random-access disk zones, while others must be written to sequentially. He has been looking specifically at supporting host-managed SMR devices, which require that the host never violate the sequential-write restriction in those types of zones.

SMR drives disallow I/O that spans zones, Reinecke said, which means that I/O operations need to be split at those boundaries. The zone layout could have a different size for each of the different zones, though none of the drives currently does that. To support that possibility, though, he used a red-black tree to track all of the zones. The current SMR specification allows for deferred lookup of some of the zone information, so the tree could just be partially filled for devices with lots of irregular zones.
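A hedged sketch of the sort of per-zone bookkeeping described above; the field and function names are illustrative, not Reinecke's actual prototype:

```c
#include <linux/rbtree.h>
#include <linux/types.h>

/* One node per zone, kept in a red-black tree keyed by starting sector. */
struct zone_info {
	struct rb_node	node;
	sector_t	start;
	sector_t	len;
	sector_t	write_pointer;	/* next writable sector (sequential zones) */
	bool		seq_required;	/* sequential-write-required zone? */
};

/* Requests may not span zones, so anything crossing the end of a zone must
 * be split at the boundary before being issued. */
static bool needs_split(const struct zone_info *z,
			sector_t sector, sector_t nr_sects)
{
	return sector + nr_sects > z->start + z->len;
}
```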

Ted Ts'o suggested that "insane drives" with a variety of zone sizes might be supported with a different data structure. That way, the majority of drives that have a straightforward layout could have all of that information available in kernel memory. He was concerned that there might be I/O performance degradation when issuing the "report zones" command once the device has been mounted.

There is also a question about "open zones" and the maximum number of open zones. Reinecke said that it is a topic that is still under discussion among the drive makers. From the LSFMM discussion it seems clear that there is no agreement on what an open zone is. Some believe that any partially filled zone qualifies, while to others it means zones that are simultaneously available to write to. In addition, the maximum may range from the four to eight that Martin Petersen has heard to the 128 that the drive makers have proposed.

In fact, someone from one of the storage vendors asked what the kernel developers would like the maximum to be. The reply was, not surprisingly, "all of them". Reinecke said that he is lobbying that "zone control" (maximum number of open zones) be optional and that any I/O that violates the maximum open zones should be allowed, possibly with a performance penalty. Ts'o agreed with that, saying that writing to one more zone than is allowed must not cause an I/O error, though adding some extra latency would be acceptable. Reinecke said that he had hoped to avoid the whole topic of open zones "because it is horrible".

Reinecke then moved back to his prototype work. He noted that sequential writes must be guaranteed. Each sequential zone has its own write pointer, which is where the next write for that zone must be. That "sort of works" using the NOP I/O scheduler, since it just merges adjacent writes. If out-of-order writes from multiple tasks are encountered, they can be requeued at the tail of the queue. The queue size must be monitored, he said, since if it never gets smaller, the I/O is making no progress, which should cause an I/O error.

But Dave Chinner said that once a filesystem has allocated blocks to different tasks, it must then guarantee an ordering of those writes "all the way down". The only way to do that is to serialize the I/O to the zone once the allocation has been done. Reinecke said that requeueing at the tail can solve that problem, but Chinner said that in a preemptible kernel that won't work. "Sequential I/O is basically synchronous I/O", he said.

There is a philosophical question about whether it makes sense to try to put a regular filesystem on SMR devices, Ts'o said. Chinner said that SMR is really a firmware problem. Actually solving the problems of SMR at the filesystem level is not really possible, he said.

Reinecke wondered if the host-managed SMR drives would actually sell. Petersen piled on, noting that the flash-device makers had made lots of requests for extra code to support their devices, but that eventually all of those requests disappeared when those types of devices didn't sell. Reinecke's conclusion was that it may not make a lot of sense to try to make an existing filesystem work for host-managed SMR drives.

Ext4 on host-managed SMR

On the other hand, though, Palmer is quite interested in doing just that. He works on host-managed drives and is trying to get ext4 working on them.

He started by looking at block groups as a way to track the zones, but ran into a problem with that idea. Zones are 256MB in length, but a block group's bitmap must fit in a single block, and a 4KB block only has enough bits to address 128MB worth of 4KB blocks; covering a whole zone would require using 8KB blocks, which is a sizable change. He also noted that O_DIRECT I/O was going to be a problem for host-managed SMR, without really going into any details.

As Reinecke said earlier, the order of writes to the disk is critical for host-managed drives. Out-of-order writes may not be written at all. Palmer looked at putting the code to keep write operations sequential into either the I/O scheduler or the block device. For now, the block device seems to be the right place.

Ts'o said that he is mentoring a student who is working on making the ext4 journal writes more SMR-friendly. But Chinner is worried about fsck. A corrupt block in the middle of a sequential zone may need to be rewritten, but it can't be overwritten in place. Ts'o suggested a 256MB read-modify-write with a chuckle.

One attendee noted that the drive makers want to start with host-aware drives (which will perform better with mostly sequential writes to those zones, but will not fail out-of-order writes) to get them working. That will allow the companies to learn from the market how much conventional space (zones without the sequential requirement) and overprovisioning is required.

Chinner suggested that some of that conventional space might be used for metadata sections. Another attendee cautioned that SSD makers are also looking at zone block devices, so it may be more than just SMR drives that need this kind of support. But Chinner said that the kernel developers had "more than enough" on their hands rewriting filesystems for use on SMR.

Another way to approach the problem, Chinner said, might be to have a new kind of write command for disks (perhaps "write allocate") that would return the logical block address (LBA) where the data was written, rather than getting the LBA from the filesystem or block layers with the write. That way, the drive would decide where to place the data and return that to the operating system. One attendee said that the drive vendors would probably welcome a discussion about what the API to these drives would look like.

There was some discussion on how to proceed with a new command, which would (eventually) need to be handled by the T10 committee (for SCSI interface standards). Petersen (who represents Linux on T10) noted that it is difficult to change the standard. An attendee from one of the drive makers thought it might be possible to prototype the idea to try it out completely separate from the standards process.

That is where the conversation trailed off, but the "write allocate" idea seemed to generate some interest. Whether that translates into action (or standards) remains to be seen.

After the summit, on March 16, Dave Chinner posted a pointer to a design document on supporting XFS on host-aware SMR drives.

[I would like to thank the Linux Foundation for travel support to Boston for the summit.]


Testing power failures

By Jake Edge
March 18, 2015

LSFMM 2015

Trying to replicate failures that can happen in filesystems when the power suddenly fails was the topic of a discussion led by Josef Bacik at the 2015 LSFMM Summit. He has been working on a tool based on the device mapper to try to make power-failure scenarios more reproducible, but he was wondering if he should continue that work or shift to something else.

In Btrfs, he believes there are ways that the balancing operation can lead to a corrupted filesystem if there is a power failure at just the "right" moment. He has not caught it yet, but the problem has inspired the development of a new tool. It uses the device mapper and two disks: one holds the normal filesystem, while the other keeps a log of all the writes that go to the first disk. The log disk keeps a list of all the write operations that have completed, which is updated with each flush operation to the first disk.

[Josef Bacik]

The tool has been integrated into xfstests and works for ext4 and XFS as well as Btrfs. It does take a good bit longer on those other two filesystems, but it works. The idea is to be able to test "weird interactions", where the filesystem is fine at point A and at point B but, if the power fails in between those points, the filesystem gets corrupted. Bacik asked: does this log approach make sense?

Someone asked about using fault injection instead. But Bacik wants these tests to be generic for any filesystem without adding code to the kernel. Logging allows for replaying the problem. It is also finer-grained, as you can check the filesystem consistency at each flush.

He would like others to look at his assumptions to help ensure he isn't off base. He is only logging information for write operations that have completed. The tool drops all writes that have not completed at flush time.

There was a suggestion that blktrace could be changed to log the data that is being written. Bacik seemed to be leaning toward dropping his tool in favor of that, but Chris Mason wondered about maintaining the ordering of writes using blktrace. One attendee said that blktrace has sequence numbers that are maintained per-CPU but are not synchronized, so the order of the writes may not be preserved. Since the device mapper does preserve that order, Bacik concluded that he would finish up that tool, rather than switch.

[I would like to thank the Linux Foundation for travel support to Boston for the summit.]


Reservations for must-succeed memory allocations

By Jonathan Corbet
March 17, 2015

LSFMM 2015
When the schedule for the 2015 Linux Storage, Filesystem, and Memory Management Summit was laid out, its authors optimistically set aside 30 minutes on the first day for the thorny issue of memory-allocation problems in low-memory situations. That session (covered here) didn't get past the issue of whether small allocations should be allowed to fail, so the remainder of the discussion, focused on finding better solutions for the problem of allocations that simply cannot fail, was pushed into a plenary session on the second day.

Michal Hocko started off by saying that the memory-management developers would like to deprecate the __GFP_NOFAIL flag, which is used to mark allocation requests that must succeed at any cost. But doing so, it turns out, just drives developers to put infinite retry loops into their own code rather than using the allocator's version. That, he noted dryly, is not a step forward. Retry loops spread throughout the kernel are harder to find and harder to fix, and they hide the "must succeed" nature of the request from the memory-management code.

Getting rid of those loops is thus, from the point of view of the memory-management developers, a desirable thing to do. So Michal asked the gathered developers to work toward their elimination. Whenever such a loop is encountered, he said, it should just be replaced by a __GFP_NOFAIL allocation. Once that's done, the next step is to figure out how to get rid of the must-succeed allocation altogether. Michal has been trying to find ways of locating these retry loops automatically, but attempts to use Coccinelle to that end have shown that the problem is surprisingly hard.
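Roughly, the transformation Michal is asking for looks like this (a generic illustration, not a specific patch):

```c
#include <linux/slab.h>

/* The open-coded retry loop the memory-management developers would like to
 * see eliminated: */
static void *alloc_forever_loop(size_t size)
{
	void *p;

	do {
		p = kmalloc(size, GFP_NOFS);
	} while (!p);
	return p;
}

/* The requested first step: make the must-succeed nature of the allocation
 * explicit, so the allocator (not the caller) owns the retry behavior. */
static void *alloc_forever_nofail(size_t size)
{
	return kmalloc(size, GFP_NOFS | __GFP_NOFAIL);
}
```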

Johannes Weiner mentioned that he has been working recently to improve the out-of-memory (OOM) killer, but that goal proved hard to reach as well. No matter how good the OOM killer is, it is still based on heuristics and will often get things wrong. The fact that almost everything involved with the OOM killer runs in various error paths does not help; it makes OOM-killer changes hard to verify.

The OOM killer is also subject to deadlocks. Whenever code requests a memory allocation while holding a lock, it is relying on there being a potential OOM-killer victim task out there that does not need that particular lock. There are some workloads, often involving a small number of processes running in a memory control group, where every task depends on the same lock. On such systems, a low-memory situation that brings the OOM killer into play may well lead to a full system lockup.

Rather than depend on the OOM killer, he said, it is far better for kernel code to ensure that the resources it needs are available before starting a transaction or getting into some other situation where things cannot be backed out. To that end, there has been talk recently of creating some sort of reservation system for memory. Reservations have downsides too, though; they can be more wasteful of memory overall. Some of that waste can be reduced by placing reclaimable pages in the reserve; that memory is in use, but it can be reclaimed and reallocated quickly should the need arise.

James Bottomley suggested that reserves need only be a page or so of memory, but XFS maintainer Dave Chinner was quick to state that this is not the case. Imagine, he said, a transaction to create a file in an XFS filesystem. It starts with allocations to create an inode and update the directory; that may involve allocating memory to hold and manipulate free-space bitmaps. Some blocks may need to be allocated to hold the directory itself; it may be necessary to work through 1MB of stuff to find the directory block that can hold the new entry. Once that happens, the target block can be pinned.

This work cannot be backed out once it has begun. Actually, it might be possible to implement a robust back-out mechanism for XFS transactions, but it would take years and double the memory requirements, making the actual problem worse. All of this is complicated by the fact that the virtual filesystem (VFS) layer will have already taken locks before calling into the filesystem code. It is not worth the trouble to implement a rollback mechanism, he said, just to be able to handle a rare corner case.

Since the amount of work required to execute the transaction is not known ahead of time, it is not possible to preallocate all of the needed memory before crossing the point of no return. It should be possible, though, to produce a worst-case estimate of memory requirements and set aside a reserve in the memory-management layer. The size of that reserve, for an XFS transaction, would be on the order of 200-300KB, but the filesystem would almost never use it all. That memory could be used for other purposes while the transaction is running as long as it can be grabbed if need be.

XFS has a reservation system built into it now, but it manages space in the transaction log rather than memory. The amount of concurrency in the filesystem is limited by the available log space; on a busy system with a large log he has seen 7,000-8,000 transactions active at once. The reservation system works well and is already generating estimates of the amount of space required; all that is needed is to extend it to memory.

A couple of developers raised concerns about the rest of the I/O stack; even if the filesystem knows what it needs, it has little visibility into what the lower I/O layers will require. But Dave replied that these layers were all converted to use mempools years ago; they are guaranteed to be able to make forward progress, even if it's slow. Filesystems layered on top of other filesystems could add some complication; it may be necessary to add a mechanism where the lower-level filesystem can report its worst-case requirement to the upper-level filesystem.

The reserve would be maintained by the memory-management subsystem. Prior to entering a transaction, a filesystem (or other module with similar memory needs) would request a reservation for its worst-case memory use. If that memory is not available, the request will stall at this point, throttling the users of reservations. Thereafter, a special GFP flag would indicate that an allocation should dip into the reserve if memory is tight. There is a slight complication around demand paging, though: as XFS is reading in all of those directory blocks to find a place to put a new file, it will have to allocate memory to hold them in the page cache. Most of the time, though, the blocks are not needed for any period of time and can be reclaimed almost immediately; these blocks, Dave said, should not be counted against the reserve. Actual accounting of reserved memory should, instead, be done when a page is pinned.
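A purely hypothetical sketch of the flow being discussed; no such reservation API exists in the kernel, and every name below is invented for illustration:

```c
/* Hypothetical reservation interface maintained by the MM subsystem. */
struct mem_reserve;

/* Before the transaction starts: block until the worst-case estimate can be
 * set aside.  This is where throttling of reservation users would happen. */
struct mem_reserve *mem_reserve_acquire(size_t worst_case_bytes);

/* During the transaction: a (hypothetical) GFP flag, say __GFP_RESERVED,
 * would let allocations dip into the reserve when memory is tight, with the
 * accounting done when a page is actually pinned rather than on every
 * short-lived page-cache allocation. */

/* After the transaction commits: return whatever was never consumed. */
void mem_reserve_release(struct mem_reserve *res);
```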

Johannes pointed out that all reservations would be managed in a single, large pool. If one user underestimates their needs and allocates beyond their reservation, it could ruin the guarantees for all users. Dave answered that this eventuality is what the reservation accounting is for. The accounting code can tell when a transaction overruns its reservation and put out a big log message showing where things went wrong. On systems configured for debugging it could even panic the system, though one would not do that on a production system, of course.

The handling of slab allocations brings some challenges of its own. The way forward there seems to be to assume that every object allocated from a slab requires a full page allocation to support it. That adds a fair amount to the memory requirements — an XFS transaction can require as many as fifty slab allocations.

Many (or most) transactions will not need to use their full reservation to complete. Given that there may be a large number of transactions running at any given time, it was suggested, perhaps the kernel could get away with a reservation pool that is smaller than the total number of pages requested in all of the active reservations. But Dave was unenthusiastic, describing this as another way of overcommitting memory that would lead to problems eventually.

Johannes worried that a reservation system would add a bunch of complexity to the system. And, perhaps, nobody will want to use it; instead, they will all want to enable overcommitting of the reserve to get their memory and (maybe) performance back. Ted Ts'o also thought that there might not be much demand for this capability; in the real world, deadlocks related to low-memory conditions are exceedingly rare. But Dave said that the extra complexity should be minimal; XFS, in particular, already has almost everything that it needs.

Ted insisted, though, that this work is addressing a corner case; things work properly, he said, 99.9999% of the time. Do we really want to add the extra complexity just to make things work better on under-resourced systems? Ric Wheeler responded that we really shouldn't have a system where unprivileged users can fire off too much work and crash the box. Dave agreed that such problems can, and should, be fixed.

Even if there is a reserve, Ted said, administrators will often turn it off in order to eliminate the performance hit from the reservation system (which he estimated at 5%); they'll do so with local kernel patches if need be. Dave agreed that it should be possible to turn the reservation system off, but doubted that there would be any significant runtime impact. Chris Mason agreed, saying that there is no code yet, so we should not assume that it will cause a performance hit. Dave said that the real effect of a reservation would be to move throttling from the middle of a transaction to the beginning; the throttling happens in either case. James was not entirely ready to accept that, though; in current systems, he said, we usually muddle through a low-memory situation, while with a reservation we will be actively throttling requests. Throughput could well suffer in that situation.

The only reliable way to judge the performance impact of a reservation system, though, will be to observe it in operation; that will be hard to do until this system is implemented. Johannes closed out the session by stating the apparent consensus: the reservation system should be implemented, but it should be configurable for administrators who want to turn it off. So the next step is to wait for the patches to show up.


Heterogeneous memory management

By Jonathan Corbet
March 13, 2015

LSFMM 2015
Jérôme Glisse started an LSFMM 2015 memory-management track session on heterogeneous memory management (HMM) by noting that memory bandwidth for CPUs has increased slowly in recent years. There is little motivation for faster progress, since not many workloads sustain maximum memory bandwidth; instead, CPU access patterns are relatively random, and latency is usually the determining factor in the performance of any given workload.

When one looks at graphical processing units (GPUs), the story is a bit different. Contemporary GPUs are designed for good performance with up to 10,000 running threads; to get there, they can have a maximum memory bandwidth that exceeds CPU-memory bandwidth by a factor of ten. Even so, a good GPU can saturate that bandwidth. GPUs, in other words, can do some things extremely quickly.

Increasingly, Jérôme said, we are seeing systems where the CPU and the GPU are placed on the same die, both with access to the same memory. The GPU is useful for "light" gaming, user-interface rendering, and more. On such systems, most of the memory bandwidth is used by the GPU.

The HMM code exists to allow the CPU and GPU to share the same memory and the same address space; it could eventually be useful for other devices with access to memory as well. The GPU gains software capabilities similar to those the CPU has; it runs its own page table, can incur page faults, and more. The key is to provide a way to manage the ownership of a given block of memory to avoid race conditions. And that is what HMM does; it provides a way to "migrate" memory between the CPU and the GPU, with only one side having access at any given time. If, say, the CPU attempts to access memory that currently belongs to the GPU, it will incur a page fault. The fault-handling code can then migrate the memory back and allow the CPU's work to proceed.

Implementing this functionality requires the ability to keep page tables synchronized on both sides; that is done on the CPU side through the use of a memory-management unit (MMU) notifier callback. Whenever the status of a block of memory changes, the appropriate page-table invalidations can be done. There is one catch, though: to work properly, the notifier needs to be able to sleep, which is not something that MMU notifiers are currently allowed to do. That has been a sticking point for the acceptance of this patch so far.
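A hedged sketch of how such a notifier hook fits together; the hmm_* names are illustrative rather than the actual patch set, and the callback signature is as found in kernels of this era:

```c
#include <linux/mmu_notifier.h>

/* Mirror CPU page-table invalidations into the device's page table. */
static void hmm_invalidate_range_start(struct mmu_notifier *mn,
				       struct mm_struct *mm,
				       unsigned long start, unsigned long end)
{
	/* Tear down the GPU's mappings for [start, end) so that a later
	 * device access faults and triggers migration.  The sticking point:
	 * this may have to wait on the device, i.e. sleep, which MMU
	 * notifiers are not currently allowed to do. */
}

static const struct mmu_notifier_ops hmm_mirror_ops = {
	.invalidate_range_start	= hmm_invalidate_range_start,
	/* .invalidate_range_end, .release, ... */
};
```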

Andrew Morton jumped in to express some concerns about the generality of this system. GPUs are changing rapidly, he said; we could easily reach a point where, five years from now, nobody is using the HMM code anymore, but it still must be maintained. Jérôme responded that he believes the system is sufficiently general to be useful for GPUs, digital signal processors, and other devices for a long time.

Jérôme finished up by saying that HMM support is needed in order to provide full, transparent GPU support to applications. The compiler projects are working on the ability to vectorize loops for execution on the GPU; when this works, applications will be able to use the GPU without even knowing about it.

Rik van Riel asked if the group had any issues with the HMM code that needed discussion. Mel Gorman asked how many people had actually read the patch set; it turned out that not many had. Rik had reviewed an older version and didn't find any real issues with it. Andrew noted that there have not been a whole lot of reviews of the HMM code in general, and there do not appear to be many other users waiting in the wings.

The session finished with some scattered discussion of various HMM details. How is the migration of anonymous pages to a device handled? The answer is that the device looks like a special type of swap file. The trick here is in handling of fork(); in this case, all of the relevant memory must be migrated back to the CPU first. Atomic access by the device is handled by mapping the relevant page(s) as read-only on the CPU; subsequent write faults look a lot like copy-on-write faults. It would be nice to be able to handle file-backed pages in the HMM system; that would require the creation of a special entry type in the page cache. That brings up a problem similar to the MMU-notifier issue: the filesystem code assumes that page-cache lookups are atomic, but, in this case, the code will need to sleep. It is not clear how to handle that one; adding HMM-specific code to each filesystem was mentioned, but that does not appear to be an appealing option.


Current issues with memory control groups

By Jonathan Corbet
March 13, 2015

LSFMM 2015
The memory controller for control groups has often been a prominent topic at the annual Linux Storage, Filesystem, and Memory Management Summit. At the 2015 event, control groups were mostly notable by their absence, suggesting that the worst of the problems have been solved. That said, there was time for a brief session where some of the remaining issues were discussed.

Initially, memory control groups ("memcgs") only tracked user-space memory. Over time, the tracking of kernel-space memory has been added, but, until recently, this feature was acknowledged to not be in particularly good shape. Vladimir Davydov spent quite a lot of time fixing it up, and things work better now. One of the biggest problems was the fact that, while the controller could track and limit kernel memory use, it had no way of reclaiming memory. So, when a particular group hit its limit, things simply came to a stop. Vladimir added per-memcg least-recently-used (LRU) lists for heavily used data structures like dentries and inodes, and kernel-space reclaim now works.

Much of the remaining discussion centered on whether administrators really need the separate kmem.limit_in_bytes knob that controls how much kernel-space memory a control group can use, or whether an overall limit for both kernel-space and user-space memory is sufficient. Michal Hocko noted that kernel-space limits are often used to throttle forking processes, a task that might be better handled in other ways. Perhaps it should be possible to apply ordinary Unix-style resource limits to control groups. Peter Zijlstra said that a number of users want that feature; it will need to be provided or people will continue to propose other control-group-based solutions.

That left the group without an answer to the question of whether a separate knob for kernel-space memory limits is needed. In the end, there were not a lot of strong feelings on the subject. It will come down to collecting the use cases and seeing whether any are strong enough to warrant adding another knob.

The final topic discussed was where the biggest holes are in the accounting of kernel memory usage. The most prominent one at this point, it would seem, is tracking the memory used for page tables. So that may be where the next round of memcg development effort is targeted.


Memory-management scalability

By Jonathan Corbet
March 13, 2015

LSFMM 2015
One of the drivers of memory-management development is scalability — performing well on ever-larger systems. So it is not surprising that scalability is a perennial discussion topic at kernel development gatherings; the 2015 Linux Storage, Filesystem, and Memory Management Summit was no exception. Andi Kleen and Peter Zijlstra led the first of two sessions on virtual memory scalability during the memory-management track at that event.

Andi started by pointing out that systems were growing, not only in the number of CPU cores available, but also in the amount of attached memory. The number of cores per NUMA node is on the rise, which is bringing out some new scalability problems.

One of the well-used scalability tactics found in the kernel is per-CPU variables; when each CPU has its own data, there can be no contention between them. But, Andi asserted, as the number of CPUs grows, it no longer makes sense to do things on a per-CPU basis. It just adds a lot of work whenever it becomes necessary to touch every CPU's version of a variable. Instead, data should be made local to groups of N cores (where N was not specified).
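For reference, the per-CPU pattern in question looks roughly like this (a generic illustration using the kernel's per-CPU API; the my_stat names are invented):

```c
#include <linux/percpu.h>
#include <linux/cpumask.h>

/* Each CPU gets its own copy of the counter. */
DEFINE_PER_CPU(unsigned long, my_stat);

static inline void my_stat_inc(void)
{
	this_cpu_inc(my_stat);			/* cheap, contention-free update */
}

static unsigned long my_stat_sum(void)
{
	unsigned long sum = 0;
	int cpu;

	/* The part whose cost grows with the core count: every CPU's copy
	 * must be visited to get a total. */
	for_each_possible_cpu(cpu)
		sum += per_cpu(my_stat, cpu);
	return sum;
}
```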

Christoph Lameter said that a lot of these scaling problems can be addressed by limiting subsystems to specific cores. Andi replied that this approach works great at installations where there is an experienced person configuring the system. In the absence of that person, it does not work quite so well.

Mel Gorman asked the group what other scalability problems are being experienced now. Christoph complained about I/O bandwidth; in particular, he said, he is unable to push more than about 2GB/second to a filesystem. The problems come down to locking and the handling of 4KB pages in the XFS filesystem. Writeback tends to slow things down, since a lot of CPU time is spent making it happen.

That led to a discussion of batching operations — another tried-and-true scalability technique. It was noted that the reverse-mapping code, which maintains data structures to enable the kernel to tell which processes have references to a given physical page, takes its locks on a per-page basis. Fixing that, evidently, is not hard, but it will require some reorganization of the code.

The current least-recently-used (LRU) lists track memory in units of 4KB pages. That is considered at this point to be overly fine-grained; there is no need for LRU accuracy at that level. There was talk of implementing a "bucket LRU" that would track larger groups of pages.

Inter-processor interrupts (IPIs) for translation lookaside buffer flushes have long been seen as a potential scalability problem. But, it seems that, while people worry about IPIs, it is hard to find a workload where they create a bottleneck. Usually the much-maligned mmap_sem semaphore gets in the way first.

There was some vague talk of other scalability issues; memory compaction was mentioned as a problem on large systems. If compaction tries to migrate a lot of pages, that can lead to large latencies in process execution. Mel Gorman said that compaction shouldn't be doing that, though, so it is not clear where the problem is.

The session wound down without coming to any real conclusions. The scalability topic returned on the second day, though, when Davidlohr Bueso led a session focused on mmap_sem in particular. This semaphore controls access to a process's page tables, along with a number of other, not always well-defined things; it has been on the list of things to fix for some time now. Davidlohr stated a wish to walk out with some tangible action items for improving the situation.

He started by looking back at past action items, especially those that came out of the LSFMM 2014 locking session. One of the concerns then was use of mmap_sem in drivers and other code outside of the memory-management subsystem. Jan Kara has been working on getting drivers to use the gup_fast() variant of get_user_pages() in order to eliminate dependencies on mmap_sem; the biggest problem he is facing at the moment is a deadlock problem in the media subsystem.
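
For reference, the attraction of the "fast" variant is that it can walk the page tables without the caller holding mmap_sem, falling back to the slower path only when necessary. A driver pinning a user page might do something along these lines (a minimal sketch using the interface as it existed around this time; error handling is abbreviated and the function name is made up):

    #include <linux/mm.h>

    /* Pin one page of user memory for I/O; the caller need not hold mmap_sem. */
    static struct page *pin_user_page(unsigned long uaddr, int write)
    {
        struct page *page;

        /* Walks the page tables locklessly where the architecture allows. */
        if (get_user_pages_fast(uaddr & PAGE_MASK, 1, write, &page) != 1)
            return NULL;

        /* ... perform the I/O ... */

        return page;    /* the caller must put_page() when finished */
    }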

Jan would also like to get mmap_sem out of the filesystem code. Al Viro wondered, though, about how virtual memory area (VMA) structures would be protected in its absence. Peter said he has a patch that shifts the protection of VMAs to sleepable RCU if anybody wanted to push that work forward. Meanwhile, Jan hopes to get his driver patches submitted soon.

Davidlohr said that his focus is moving stuff out from under mmap_sem entirely and, eventually, breaking up the lock into something finer-grained. The problem with that, as Peter pointed out, is that what's protected by the lock now is not entirely clear. The way to start, he said, would be to document what's protected by mmap_sem; after that, one can start thinking about better locking schemes.

One problem with mmap_sem is that it protects a process's entire address space. Concurrency could be increased by locking only portions of that space instead. The concept of "range locks" is thus of interest here. Michal Hocko suggested that developers could start by replacing mmap_sem with a range lock that still covers the entire address space; the locking could then be made more precise in an incremental manner.

Hugh Dickins, though, wondered if that was the right approach and what problems, exactly, were being solved with range locks. His impression was that the top priority was to get page-fault handling out from under mmap_sem entirely. The answer was that there are, in fact, two different issues to be addressed regarding mmap_sem: it protects too much, and the hold times are too long. Range locks are one attempt to address the first part of the problem. Peter added that, among other things, range locking would allow concurrent mmap() calls to proceed, which is important for some threaded workloads.

There was some concern about surprises that can pop up when it turns out that an unexpected corner of the code was relying on mmap_sem. In extreme cases, Hugh said, user-space code may even rely on it. He described a complaint from a user about a change in mlock() semantics. Changes in the kernel increased mlock() concurrency and, in the process, exposed a lack of locking on the user-space side. Sympathy for the affected user was relatively low in this case, but, Hugh said, it would be wise to be prepared for nasty surprises.

In the end, Davidlohr's desire for tangible action items went mostly unfulfilled. About the only firm conclusion was that the range-lock code will be cleaned up and posted in the near future.

Comments (1 posted)

Memory-management testing and debugging

By Jonathan Corbet
March 16, 2015

LSFMM 2015
Memory-management problems can be hard to identify and track down; this is true for bugs that affect either correctness or performance. Quite a bit of work has been done in recent years to develop tools that can help with this task, though. The 2015 LSFMM gathering had a number of sessions dedicated to this area; like a large array on a virtual-memory system, though, they were scattered throughout the program. This article provides a virtual view of the entire discussion in one place.

Testing

Davidlohr Bueso started a session on testing by saying that he has been working on the mmtests benchmark suite, with the goal of improving its ability to detect changes across kernel versions. To that end, he has looked at a couple of test suites that are being used in academia: Mosbench and Parsec. There were questions about how well these tests worked for testing the kernel in particular, but, Davidlohr said, these suites do contain some useful tests.

Andi Kleen said there is a new suite out there that is promising despite being named, inevitably, "cloudbench."

Davidlohr asked if anybody else had workload tests that they would like to contribute to mmtests. Laura Abbott said that she would like to see a good set of tests for mobile systems. Scalability tests, she said, tend to be oriented toward scaling up, but mobile developers need tests that focus on scaling down.

Firm conclusions from this session were hard to come by; Davidlohr will continue to work on integrating and documenting other tests aimed at memory-management scalability.

Debugging

Memory-management debugging was the topic of another session run by Dave Jones, Sasha Levin, and Dave Hansen. Dave Hansen started off by saying that, while developers have added a number of debugging features to the memory-management subsystem, they have so far left an important technology on the table. He was talking about Intel's MPX mechanism, which is able to check pointer accesses and ensure, in hardware, that they don't go outside a set of defined boundaries. The nice thing about MPX is that it has almost no runtime cost, so it can be enabled on production systems.

Of course, developers may have some excuse for not making much use of MPX so far. It requires the (not yet released) GCC 5 compiler to instrument code properly, and hardware that actually implements MPX is not yet available. So, he said, there is still time to get our act together.

There was some immediate interest in using MPX with the slab allocator in the kernel. That would take some work, though, since the kernel would have to be changed to load the appropriate MPX registers before accessing a given slab object. Christoph Lameter asked if access to all slab objects could be monitored with MPX. It turns out that there's a small practical difficulty there: a typical running kernel has many thousands of slab-allocated objects, but there are only four sets of registers in the MPX hardware. So tracking more than four objects requires juggling information into and out of those registers.

Peter Zijlstra suggested that MPX could be applied to the kernel stack. It is not clear, though, that MPX-based stack checking would provide advantages over the explicit stack-overflow checks done in the kernel now. Still, it may be possible to dedicate one of the registers to the kernel stack and gain some extra protection.

Andy Lutomirski asked if the MPX registers could be written to while running in atomic context. That turns out to be tricky, since setting up these registers involves doing a memory allocation. Andy also suggested that MPX could be used to block direct access to user-space addresses from the kernel. Laura asked about checking of DMA operations, but MPX only applies to accesses from the CPU.

Sasha shifted the discussion to the VM_BUG_ON() macro. This macro, which comes in a few variants, dumps out a bunch of information specific to the memory-management subsystem; it is thus useful for identifying memory-management bugs. Sasha would like to add more VM_BUG_ON() instances in the kernel, but he is worried about complaints of false positives. These complaints have kept debugging code out in the past; the result, he said, was that users suffered from a number of race conditions that could otherwise have been caught.
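
For those unfamiliar with it, VM_BUG_ON() compiles away entirely unless the kernel is built with CONFIG_DEBUG_VM; the VM_BUG_ON_PAGE() variant also dumps the state of the page involved when an assertion fires. A contrived example of the kind of assertion under discussion (the check itself is invented for illustration):

    #include <linux/mm.h>
    #include <linux/mmdebug.h>

    static void example_check(struct page *page)
    {
        /*
         * A no-op on production kernels; with CONFIG_DEBUG_VM set, a
         * failure here dumps the page's flags, mapping, and reference
         * counts before triggering a BUG().
         */
        VM_BUG_ON_PAGE(!PageLocked(page), page);
    }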

There was some talk about additional information that could be printed out by VM_BUG_ON(), but few conclusions. It was suggested that a full kernel memory dump would be helpful — but that, of course, is rather a large amount of data to print into the kernel log. Dave Jones would, instead, like more information about how the system got into the bad state; that would require adding some sort of transaction log. It was suggested that Intel's upcoming Processor Trace functionality could be helpful in this regard.

Dave Hansen then asked whether any developers had sets of memory-management tracepoints that could be considered for merging. It seems that some exist, but Andi said that, rather than adding more tracepoints now, it would be better to focus on improving the documentation of existing tracepoints. Andrea Arcangeli questioned the value of memory-management tracepoints in general; he does his memory-management development on virtualized systems and wonders why anybody would do anything else. When a system is run under virtualization, it can be examined with an ordinary debugger. But others argued that there are a lot of problems that only show up on bare-metal systems, so there will always be a place for debugging infrastructure that works in that environment.

Fernando Vasquez Cao noted that his group uses SystemTap heavily for memory-management debugging. Among other things, it is handy for injecting faults at specific locations, making it easier to get at hard-to-reproduce problems. Dave Jones agreed that the tools have made life better; it is, he said, a miracle that we were able to solve anything five years ago. He also wondered why there was not more use of the existing fault-injection framework; when he turns it on, he said, "everything breaks," so he concludes that nobody else is doing so. Fernando responded that the injection framework does not allow sufficiently specific fault injection. Besides, he said, when you turn it on everything else breaks, making it hard to focus on the specific problem at hand. It was agreed that somebody (currently unnamed) should fix those problems.

KASan

One tool that has been merged relatively recently is the kernel address sanitizer (or KASan). This tool uses a "shadow memory" array to track which memory the kernel should legitimately be accessing; it can then throw an error whenever the kernel goes out of bounds. KASan developer Andrey Ryabinin led a session on this tool and how it might be improved.
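
The shadow encoding is straightforward: one shadow byte describes each eight-byte granule of kernel memory, so every access check reduces, conceptually, to a lookup like the one below (a simplified sketch of the idea, not the actual KASan code; KASAN_SHADOW_OFFSET is the architecture-defined constant that positions the shadow region):

    #define KASAN_SHADOW_SCALE_SHIFT    3

    /* Map an address to the shadow byte that describes its granule. */
    static inline u8 *shadow_of(const void *addr)
    {
        return (u8 *)(((unsigned long)addr >> KASAN_SHADOW_SCALE_SHIFT)
                      + KASAN_SHADOW_OFFSET);
    }

    static inline bool granule_is_addressable(const void *addr)
    {
        /* Zero means the whole eight-byte granule may be accessed;
         * other values encode partially valid or poisoned memory. */
        return *shadow_of(addr) == 0;
    }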

The first idea that came out was to enable KASan to properly validate accesses to memory obtained with vmalloc(). Doing so would require putting hooks into vmalloc() itself and creating a new, dynamic shadow memory array. The amount of work required is not huge; it is much like tracking slab allocations, except that shadow memory for slab can be allocated at boot time. There were, unsurprisingly, no objections, so this work should go forward soon.

A slightly trickier problem is memory that is freed and quickly reallocated to a new user. That memory looks fine to KASan, but quick reallocation can mask use-after-free bugs in the code that previously owned it. The proposed solution here is to put freed memory into a "quarantine" area for a period, delaying its availability to the rest of the system. Memory would emerge from quarantine after a defined period; alternatively, a shrinker could be used to remove memory from quarantine when the system starts to run low. There are concerns that delaying free operations in this way could create a certain amount of memory fragmentation. Andrey is not quite sure how to move forward with this feature, and the group did not appear to have a lot of fresh ideas to share.

Then there is the possibility of catching reads of uninitialized memory. It is possible to get the compiler to instrument code to make this testing possible, but the results include a lot of false positives that are hard to get rid of. Among other things, memory initialized in assembly code must be annotated manually. Andrey has tried doing this and found the result difficult to support. He's afraid that developers will turn the feature on, see all the false positives, and just give up on the whole thing.

Another possibility is using KASan to find data races; there are some tools out there to help with this now. But, he said, it involves some "crazy overhead" — four bytes of shadow memory for every byte of normal memory. There's also a need for a lot of manual annotation; large numbers of false positives are also a problem. The end result is that this feature does not appear to be useful for now.

Other ideas for the more distant future include a quarantine for the page allocator (and not just the slab allocator), and the instrumentation of some inline assembly operations like the atomic bit operators.

Sasha made a plea for developers to enable KASan when they are running their own tests. It has turned up a lot of bugs, he said; the code is in the upstream kernel, it's easy to turn on, and the overhead is low. The only catch is that GCC 5 is needed to gain all of the features, though 4.9 works with reduced functionality.

The final question in this session was: now that we have KASan, is there still a need to maintain the older kmemcheck utility? Kmemcheck only works on single-processor systems, it is painful to use, and it is slow. It seems that nobody is actually making use of it. The consensus of the group was that kmemcheck should be removed. (It should be noted that Sasha's attempt to implement this decision ran into some opposition from developers who still use kmemcheck, so it may stay around for a while yet).

Comments (4 posted)

Improving page reclaim

By Jonathan Corbet
March 17, 2015

LSFMM 2015
Dave Hansen started a brief LSFMM 2015 memory-management track session on page-reclaim performance by saying that we have a problem: over the years, the kernel's memory-management and swap subsystems have been designed around the use of slow secondary storage devices. But now we are heading toward an era increasingly driven by the availability of massive nonvolatile memory, and we are not fully prepared for it.

The fundamental question, he said, was how to integrate these technologies into the Linux kernel. We have a number of subsystems like DAX that can provide high-speed access to persistent memory devices, but they require applications to be changed. If we run current kernels over such devices without using special interfaces, swapping is no faster than it is with older, slower devices. There is just too much overhead in the memory-management layer, and, in particular, in the manipulation of the least-recently-used (LRU) lists that track reclaimable pages in the system. The LRU, he said, is a fancy system to find the best eviction candidate at any given time, but, in this situation, perhaps it would be better to use something else?

Christoph Lameter suggested that users who care about performance should just put their entire application into memory and be done with it. But Dave was not so easily deterred; he would like to find ways for existing applications to get better performance on persistent-memory devices without changes.

Andrea Arcangeli said that we should not be worrying about memory in 4KB units when we are dealing with devices that can hold 100GB or more. Swapping pages in 2MB units would, he said, go a long way toward solving the problem. Andi Kleen agreed to a point — but he felt that 2MB was still far too small. In general, he said, we need to move toward managing memory in larger chunks or just do away with the LRU lists altogether.

Dave suggested that there are a number of opportunities to run the LRU lists in a more relaxed mode. One idea, he said, was to add a third LRU level for pages that are ready to be swapped out. (The kernel currently manages two levels of LRU lists, one for active pages and one for pages that seem to be inactive and should be considered for eviction). Perhaps some sort of "scanaround" algorithm could be applied to that third level to batch up pages for writing out to the swap device. Johannes Weiner answered that he had tried something similar a few years ago. It didn't work well, he said, due to disk seek issues, but it might work better on truly random-access devices.

Hugh Dickins expressed skepticism toward the entire idea, though. To him, it looks like an attempt to reduce memory-management overhead by adding even more complex algorithms to cluster things. That is increasing the complexity of the system rather than reducing it. Batching things up may help to speed things up, but you still have to deal with items individually to make up the batches.

As things wound down, Dave said that he was going away with a couple of interesting ideas to explore.

Comments (none posted)

Huge pages and persistent memory

By Jonathan Corbet
March 17, 2015

LSFMM 2015
One of the final sessions in the memory-management track of LSFMM 2015 had to do with the intersection of persistent memory and huge pages. Since persistent memory looks set to come in huge sizes, using huge pages to deal with it looks like a performance win on a number of levels. But support for huge pages on these devices is not yet in the kernel.

Matthew Wilcox started off by saying that he has posted a patch set adding huge-page support to the DAX subsystem. But, he said, only one other developer reviewed the code. The biggest complaint was the introduction of the pmd_special() function, which tracks a "special" bit at the page middle directory (PMD) level of the page table hierarchy, the level at which huge pages are managed.

Some background: the kernel allows architecture-level code to mark specific pages as being "special" by providing a pte_special() function. These pages have some characteristic that causes them to behave differently than ordinary memory. In cases where the architecture has enough bits available in its page table entries, pte_special() just checks a bit there; otherwise things get more complicated. The core memory-management code treats so-marked pages, well, specially; for example, virtual memory areas containing "special" pages should also have a find_special_page() method to get the associated struct page.
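
On an architecture with a spare bit, the test really is as cheap as it sounds; the x86 version amounts to checking a dedicated software flag in the page table entry, roughly as follows (paraphrased from the architecture code):

    /* A spare software bit in the PTE marks "special" pages on x86. */
    static inline int pte_special(pte_t pte)
    {
        return pte_flags(pte) & _PAGE_SPECIAL;
    }

    static inline pte_t pte_mkspecial(pte_t pte)
    {
        return pte_set_flags(pte, _PAGE_SPECIAL);
    }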

Back to the discussion: adding pmd_special() requires that the "specialness" of the huge page be tracked at the PMD level. It is not clear that every architecture has a free bit available in the PMD to track that state. In theory, free bits should abound there since as many as 20 bits in the lower part of the entry are not needed to map to a page frame number, but some quick searching by developers in the room revealed that, on x86 at least, the "extra" bits must be set to zero. For now, though, Matthew is using the same bit that pte_special() uses, so his code should work on every architecture that supports pte_special().

In the case of huge pages backed by persistent memory, the pmd_special() bit indicates to the memory-management code that there is no associated page structure. Andrea Arcangeli asked why a special bit was needed to mark that condition; Matthew responded that it's because he doesn't really understand the memory-management subsystem, so he implemented something he knew he could make work.

This code may eventually be pushed in a direction where pmd_special() is no longer needed. But there are some other issues that come up. Matthew raised one: what happens when an application creates a MAP_PRIVATE mapping of a file into memory, then writes to a page in that mapping? The write will cause the memory-management code to allocate anonymous memory to replace the 2MB huge page being written to; the question is: should it allocate and copy a full 2MB page, or just copy the 4KB page that was actually written? Andy Lutomirski suggested that the answer had to be to copy 4KB; copying the full 2MB for each single-page change would be too expensive. But Kirill Shutemov replied that copy-on-write for huge pages does a 2MB copy now; the behavior with persistent memory, he said, should be consistent.
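
To make the scenario concrete, the case under discussion looks roughly like this from user space (a minimal example; the file path is hypothetical and assumed to live on a DAX filesystem):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define MAP_LEN (2UL * 1024 * 1024)    /* one huge page */

    int main(void)
    {
        int fd = open("/mnt/pmem/data", O_RDONLY);
        char *p;

        if (fd < 0) {
            perror("open");
            return 1;
        }

        /*
         * A private mapping: reads are served from the (huge-page) file
         * mapping, but the first write to any page must be backed by
         * anonymous memory instead.
         */
        p = mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        p[42] = 1;    /* the copy-on-write fault: copy 4KB or the full 2MB? */

        munmap(p, MAP_LEN);
        close(fd);
        return 0;
    }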

Matthew moved on to the topic of in-kernel uses for persistent memory. There will be some interesting ones, he thought, but how it should all work has yet to be worked out. HP, for example, is using ioremap() to map persistent memory into the kernel as if it were device memory; Matthew said that seems like the wrong approach to him. We should, he said, be using logical interfaces to persistent memory rather than direct physical interfaces like ioremap(). So he would like to see the creation of some sort of mapping interface implemented within the virtual filesystem layer that would allow persistent memory to be mapped into the kernel's address space.

Andy said that the pstore mechanism could benefit from directly-mapped persistent memory. There was also talk of maybe being able to load kernel modules from persistent memory without the need to copy them into "regular" memory. It might be possible to even map the entire kernel, but there is one little catch: the kernel patches its own code for a number of reasons, including use of optimal instructions for the specific hardware in use and turning tracepoints on and off. If the kernel were mapped from persistent memory, that patching would change the version stored in the device as well — probably not the desired result.

Finally, Matthew said, there have been requests for the ability to use extra-huge, 1GB pages as well as 2MB pages. He is looking at adding that functionality, but he has been struck by the amount of code duplication that exists at each of the four page table levels. He has some thoughts about creating a level-independent "virtual page table entry" abstraction that could be used to get rid of much of that duplication. The reaction from the assembled memory-management developers was cautiously positive; Matthew was encouraged to implement this abstraction within the DAX code. If it works out well there, it can then spread into the rest of the memory-management code.

Comments (4 posted)

Investigating a performance bottleneck

By Jake Edge
March 18, 2015

LSFMM 2015

In a short plenary session near the end of day one of the 2015 LSFMM Summit, Chuck Lever and Peter Zijlstra led a discussion on performance bottlenecks. The original idea for the session was to look at various performance problems, one of which came from Lever, with others to be offered up by those in attendance. As it turned out, though, only Lever's problem was discussed, perhaps due to low energy after a long day.

Lever described a problem he is seeing in NFS on low-latency transports, which have latencies an order of magnitude less than Ethernet. For his test, the latency added by the RPC infrastructure is on the order of 20µs and the round-trip network time is around 25µs. On idle clients, the performance is much as he expects, but if he loads the client with, say, a kernel build, these RPC tests start taking 300µs.

Lever has narrowed the problem down to wake_up_bit(). That function is taking "too bloody long", Zijlstra said. There is some contention on waking, he continued, but it is not clear what that could be.
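
wake_up_bit() is part of the kernel's bit-wait machinery: one thread sleeps until a flag bit in some word is cleared, and whoever clears the bit wakes the waiters. The usual pattern looks roughly like this (a sketch only; the flag name and functions are invented):

    #include <linux/wait.h>
    #include <linux/bitops.h>
    #include <linux/sched.h>

    #define MY_FLAG_BUSY    0    /* hypothetical flag bit */

    static void release_resource(unsigned long *flags)
    {
        clear_bit(MY_FLAG_BUSY, flags);
        /* Order the bit clear before the wakeup, as wake_up_bit() requires. */
        smp_mb__after_atomic();
        wake_up_bit(flags, MY_FLAG_BUSY);
    }

    static int wait_for_resource(unsigned long *flags)
    {
        /* Sleeps (interruptibly) until MY_FLAG_BUSY is cleared. */
        return wait_on_bit(flags, MY_FLAG_BUSY, TASK_INTERRUPTIBLE);
    }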

Dave Chinner suggested using the latency tracer in ftrace to help further narrow it down.

Chris Mason noted that he has started benchmarking newer kernels at Facebook and had not run into anything surprising yet.

Lever said that it is not just a spinlock that is being contended, as the resources are being held far longer than that. Zijlstra said that the wakeup itself should not be that expensive. Perhaps it is the runqueue locks that are being contended in that situation.

Andy Lutomirski wondered if inter-processor interrupts (IPIs) take longer to send in this case. There is a different path in the code when the system is under load, he said. Mel Gorman suggested testing with a maximum cstate value set to zero to ensure that power management wasn't affecting things. At the end, Zijlstra suggested gathering more data and said that he and others would have a look then.

[I would like to thank the Linux Foundation for travel support to Boston for the summit.]

Comments (none posted)

Patches and updates

Kernel trees

Linus Torvalds Linux 4.0-rc4
Greg KH Linux 3.19.2
Sebastian Andrzej Siewior 3.18.9-rt5
Greg KH Linux 3.14.36
Greg KH Linux 3.10.72

Architecture-specific

Core kernel code

Development tools

Device drivers

Device driver infrastructure

Filesystems and block I/O

Janitorial

Memory management

Networking

Security-related

Miscellaneous

Page editor: Jonathan Corbet


Copyright © 2015, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds