Kernel development
Brief items
Kernel release status
The current development kernel is 4.0-rc5, released on March 22. Linus said: "There's nothing particularly worrisome going on, although I'm still trying to think about the NUMA balancing performance regression. It may not be a show-stopper, but it's annoying, and I want it fixed. We'll get it, I'm sure."
Stable updates: none have been released in the last week. The 3.19.3, 3.14.37, and 3.10.73 updates are in the review process as of this writing; they can be expected on or after March 26.
Kernel development news
NFS performance
On day two of the 2015 Linux Storage, Filesystem, and Memory Management Summit, Chuck Lever led a discussion on NFS performance. There are some bottlenecks to look at, and suggestions were made on ways to avoid some of them.
The transport_lock is a spinlock used by the Remote Procedure Call (RPC) layer. It is a bit like the Big Kernel Lock (BKL), Lever said, in that it protects all of the transport data on a per-socket basis. It is used as a queueing mechanism to prevent RPCs from interleaving on the wire. He is looking for ways to break up that lock, much as the BKL-removal work did with the BKL.
Currently, a thread is woken up to copy the received data, but it might make more sense to do that work in software interrupt (softirq) context, Jeff Layton said. That is how remote DMA (RDMA) does things, Lever said. Layton said you could start by simply doing copies out of the socket buffer from the softirq, but eventually using splice() might provide even better performance.
Lever said that there is also a proposal to make incoming data be page-aligned. Andreas Gruenbacher said that the idea was to use large network frames and to receive them into page-aligned buffers.
Dave Chinner said that will require the sending side to be aware of that setting so that it can form its TCP packets in large frames. Bruce Fields said that the networking developers didn't like the change. Chinner said that he was not surprised, as messing with segment boundaries is always tricky. Gruenbacher noted that it required using the new huge frames to get enough data into one packet, as doing page-aligned receives on small packets will just waste space.
One of the two data copies that are currently being done could be saved if the softirq code changed to look inside the RPC packets, Fields said. By figuring out what the packet contains, the RPC code could route it to the right place, sometimes using splice(). Lever said that RDMA solves the copying problem nicely, but that it is a niche use case and likely to remain that way.
Another area of performance improvement is to use NFS compounds, which allow multiple read and write operations in a single NFS transaction. Lever said that Fields has been working on support for that as part of the NFS 4.2 support in Linux.
In addition, Lever said, there is a new operation in 4.2 called READ_PLUS that will assist when clients are reading sparse files. That operation allows the server to report the holes optimally. There was concern that rematerializing the holes on the client might be expensive, but that turned out not to be the case.
Fields said that he used SEEK_HOLE and SEEK_DATA flags to lseek() to add the holes to the files on the client side. But Chinner cautioned that there is no way of atomically finding holes and returning data beyond them, as it will always race with other operations that are happening on the file.
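A rough illustration of the lseek() approach Fields described, written in Python for brevity: the function below walks a file with SEEK_DATA and SEEK_HOLE and returns the byte ranges that hold data. The function name is made up for this sketch. As Chinner's caution implies, the result is only a snapshot; another writer can change the layout between the two lseek() calls. On filesystems without hole support, the kernel typically reports the whole file as a single data range.

```python
import errno
import os


def data_segments(fd):
    """Return a list of (start, end) byte ranges that contain data.

    Illustrative helper, not taken from any NFS code. The answer can be
    stale as soon as it is returned if the file is being modified.
    """
    size = os.fstat(fd).st_size
    segments = []
    offset = 0
    while offset < size:
        try:
            # Find the next byte at or after 'offset' that is not in a hole.
            start = os.lseek(fd, offset, os.SEEK_DATA)
        except OSError as exc:
            if exc.errno == errno.ENXIO:
                break  # only a hole remains up to end of file
            raise
        # Find where that run of data ends (a hole or end of file).
        end = os.lseek(fd, start, os.SEEK_HOLE)
        segments.append((start, end))
        offset = end
    return segments
```

Everything between two consecutive ranges is a hole, which is the information a client would need to recreate the sparse layout locally.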
Lever said that NFS delegations, which are a kind of file lock, would be required from the server when the READ_PLUS operation is used. That will only be granted by the server if no one has the file open for writing. However, delegation is not enabled on all NFS servers. And that is where the conversation kind of trailed off.
[I would like to thank the Linux Foundation for travel support to Boston for the summit.]
Filesystem defragmentation
Dmitry Monakhov prefaced his 2015 LSFMM Summit session on filesystem defragmentation with a statement that the "problem is almost already solved". His session turned into a largely informational description of the status of a defragmentation tool that he has been working on.
Over time, filesystems change and cannot avoid fragmentation issues, he said. For example, extracting a Linux source tree results in many small files that the filesystem tries to allocate close to each other. Building in the tree results in lots of temporary files that get removed, so the filesystem gets fragmented.
Beyond appearing in regular filesystems, these fragmentation problems show up in thin provisioning systems, as well as for shingled magnetic recording (SMR) devices, he said. In addition, to make boot times shorter, it would be best to lay out all the needed files sequentially on the disk, which may require defragmentation.
The fragmentation problem is already solved for large files. Btrfs, XFS, and ext4 all have tools for doing defragmentation on files. But there is no solution for directory fragmentation. The filesystems try to put files that are in the same directory close to each other on the disk, but as files get deleted or moved, fragmentation of the directory occurs.
To perform defragmentation, it is often necessary to copy file data from one place to another. Monakhov suggested that a checksum could be calculated on the data when doing that copy, which could then be stored in a "trusted" extended attribute (xattr). He noted that overlayfs uses the "trusted.overlay" xattr, which can only be modified by processes with CAP_SYS_ADMIN, so a "trusted.sha1" (or other hash) could be calculated and stored when copying data for defragmentation.
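The idea of hashing data while it is being copied can be sketched in user space as follows. This is not how e4defrag2 works (a defragmenter moves blocks inside the filesystem rather than copying whole files); the function names and chunk size are invented for illustration, and only the "trusted.sha1" attribute name comes from Monakhov's suggestion. Writing an attribute in the trusted namespace requires CAP_SYS_ADMIN, so store_checksum() will fail for unprivileged callers.

```python
import hashlib
import os


def copy_with_sha1(src_path, dst_path, chunk_size=1 << 20):
    """Copy src to dst, hashing each chunk as it passes through.

    Returns the hex SHA-1 digest of the copied data.
    """
    digest = hashlib.sha1()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)  # hash the same bytes that get written
            dst.write(chunk)
    return digest.hexdigest()


def store_checksum(path, hexdigest, name="trusted.sha1"):
    """Record the digest in an xattr (trusted.* needs CAP_SYS_ADMIN)."""
    os.setxattr(path, name, hexdigest.encode("ascii"))
```

A later check would read the attribute back with os.getxattr() and compare it with a freshly computed digest of the file contents.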
Executable files could then have their contents checked and compared to the hash value before being executed. He proposed adding that capability to his tool, but it seemed to be something of an aside. It is not clear how it relates to the integrity measurement architecture (IMA), for example.
He has been working on a tool called e4defrag2 (developed in a branch of e2fsprogs) that will perform defragmentation. It is mostly independent of the filesystem type. It uses the same block scanning code to find fragmentation, but ext4 and XFS have a different ioctl() name for their defragmentation operations.
The result is a "giant utility that works for everything", Monakhov said. The filesystem-dependent part is roughly 100 lines of code. This "universal defragmenter" will be released soon.
Ted Ts'o asked what would be needed to eliminate the 100 lines. He asked if wiring up the XFS ioctl() name into ext4 would help. Monakhov said that the tool needs to get the block bitmap from the filesystem, which is also different between the filesystems. Ts'o and Dave Chinner indicated that they would attempt to provide the same interfaces. Chinner did caution that XFS cannot defragment a range in a file, only the whole file. That is different than ext4, Monakhov said.
UID/GID identity and filesystems
"User namespaces only solve half the problem", Andy Lutomirski said to start off his session at the 2015 LSFMM Summit. User namespaces remap user IDs (UIDs) and group IDs (GIDs) in the running kernel, but they don't do anything for the UID and GID values stored in filesystems. Those IDs are simply integers stored in the filesystem metadata.
Lutomirski noted that when inserting a USB stick with a "real filesystem, not FAT" on it, the mounted filesystem will have UIDs and GIDs that are likely to be wrong. It would be nice, he said, if instead the files showed up as being owned by the user's UID.
This is also a problem for both NFS and FUSE filesystems, he continued. There is a partial solution in that mounting a FUSE filesystem inside a user namespace will map the UIDs inside the namespace before writing them to the filesystem. NFS has a solution as well. He wondered if there could be a more general approach.
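For readers unfamiliar with how a user namespace remaps IDs, the sketch below models the translation. It assumes the three-column format (ID inside the namespace, ID outside, range length) used by /proc/PID/uid_map as described in the user_namespaces(7) man page; that format is not spelled out in the session, and the function names are illustrative only.

```python
def parse_id_map(text):
    """Parse uid_map-style text into (inside, outside, length) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        inside, outside, length = (int(field) for field in fields)
        entries.append((inside, outside, length))
    return entries


def to_outside_id(entries, inside_id):
    """Translate an ID seen inside the namespace to the outer ID.

    Returns None when the ID falls outside every mapped range.
    """
    for inside, outside, length in entries:
        if inside <= inside_id < inside + length:
            return outside + (inside_id - inside)
    return None
```

With a map of "0 1000 1", UID 0 inside the namespace corresponds to UID 1000 outside it. The IDs stored in a filesystem's on-disk metadata are not touched by this translation, which is the gap Lutomirski was describing.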
Dave Chinner pointed out that some filesystems have mount options to do simple UID remapping. Those options might simply squash all UID/GIDs on the filesystem into a single UID/GID. An option like that could be added to the virtual filesystem (VFS) layer so that all filesystems had access to it.
That might be a reasonable way to approach the problem, Lutomirski said. Obviously NFS has already solved it, he said, though he had not looked to see what it does. Jeff Layton said that NFS has traditionally mapped UIDs and GIDs between the server and the client. That was originally done using strings for the user and group names, which would get mapped at the other end to integers. The current NFS solution is more complicated, Bruce Fields said, involving LDAP lookups, which is probably not what Lutomirski is looking for.
For his use case, squashing to a single UID would be sufficient, Lutomirski said. Handling Linux Security Module (LSM) contexts is trickier, but that could perhaps be added later. There was some discussion of the different ways that filesystems interpret the uid= and gid= mount options; he would like to see some uniformity, which might require an entirely new mount option (possibly something like vfs_uid=).
Issues with epoll()
In a filesystem session at the 2015 LSFMM Summit, Jason Baron led a discussion about the epoll() system call. He and others have observed some performance problems with epoll(), especially for large sets of monitored file descriptors. There are two problems that Baron is trying to address: the "thundering herd" problem on wakeups and the use of global locks when manipulating the epoll() sets. He has posted patches for both, but they haven't really been commented on, he said. He also noted that Fam Zheng has posted some patches that add new system calls for epoll().
The thundering herd problem occurs when there are multiple threads that share a wakeup source in their epoll() sets. When that file descriptor becomes ready, all of the threads waiting wake up, even though only one of them is needed to service the event. One solution that had been suggested was to have a single epoll() queue, with all events being taken off that single queue. But that is not optimal for what he is trying to do, he said.
His patches simply wake up the first idle thread that is waiting, then round-robin through the threads on subsequent wakeups. Some suggested using CPU affinity to wake up the thread on the CPU where the interrupt has come in. But epoll() doesn't currently have access to that kind of information. Baron has "heard vaguely" that some people are doing this, but he hasn't seen any patches. He would like to explore the idea further.
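The difference between the two wakeup policies can be shown with a toy model. This is plain Python, not kernel code, and it has nothing to do with how Baron's patches are implemented; in particular, it ignores whether a waiter is idle and simply rotates through the queue. The class and method names are invented for the sketch.

```python
from collections import deque


class WaitQueue:
    """Toy model of several waiters sharing one wakeup source."""

    def __init__(self, waiters):
        self._waiters = deque(waiters)

    def wake_all(self):
        # Thundering herd: one event wakes every waiter, even though
        # only one of them will end up servicing it.
        return list(self._waiters)

    def wake_one_round_robin(self):
        # Wake only the waiter at the head of the queue, then move it
        # to the tail so the next event goes to a different waiter.
        waiter = self._waiters.popleft()
        self._waiters.append(waiter)
        return [waiter]
```

With three waiters, four events under wake_all() cause twelve wakeups; under the round-robin policy they cause four, spread across the waiters in turn.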
His initial proposal was to simply wake up one thread waiting on the epoll() set, but there was concern that might break programs that were expecting the current behavior. The wait queue used is associated with a file descriptor, so it is local to the process (and its threads), rather than global. A flag passed to epoll() could change the behavior for a program without affecting other programs that might also be waiting.
Another option that he has tried is to change the wakeup behavior in the scheduler, though he was worried that the scheduler developers would be unhappy with a change like that. When he posted it, though, there was no feedback of that sort. Still, avoiding changes to the wakeup code is desirable.
But epoll() has the ability to nest the file descriptors it is monitoring. That means a set of file descriptors can be constructed that contains descriptors returned from other epoll_create1() calls. In the past, loops could be created that way, though that has been fixed. One could use the nesting capability, coupled with a new flag to epoll_create1(), to add the round-robin feature while restricting the changes to the epoll() code instead of changing the wakeup code.
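Nesting as it exists today can be demonstrated from user space with Python's select.epoll wrapper. The sketch registers a pipe in an inner epoll set, registers the inner set's descriptor in an outer set, and shows that data arriving on the pipe makes the inner descriptor readable in the outer set. It does not use the proposed round-robin flag, which was only under discussion; the function name is illustrative.

```python
import os
import select


def nested_epoll_demo(timeout=1.0):
    """Return (inner_fd, events_before_write, events_after_write)."""
    read_fd, write_fd = os.pipe()
    inner = select.epoll()
    outer = select.epoll()
    try:
        # The inner set watches the pipe; the outer set watches the
        # inner set's own file descriptor.
        inner.register(read_fd, select.EPOLLIN)
        inner_fd = inner.fileno()
        outer.register(inner_fd, select.EPOLLIN)

        before = outer.poll(0)          # nothing is ready yet
        os.write(write_fd, b"x")        # make the pipe readable
        after = outer.poll(timeout)     # the inner epoll fd is now ready
        return inner_fd, before, after
    finally:
        outer.close()
        inner.close()
        os.close(read_fd)
        os.close(write_fd)
```

The outer set reports only that the inner set has something pending; a caller would then poll the inner set to learn which descriptor is actually ready.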
Jeff Layton asked if there would be two flags, one to request the CPU affinity mode and one for the round-robin behavior. But Baron did not think both would be needed. The CPU affinity mode could simply fall back to round-robin behavior if the interrupt did not come in on a CPU that was running a thread waiting on the event.
He moved on to locking, which has shown up in some profiles of epoll() performance. Akamai (where Baron works) has not necessarily run into it, but people don't like global locks, in general, he said. Part of the problem is that the kernel does not know when the sets have file descriptors in common, so it locks everything when manipulating them.
The idea is to break up the locks in the classic way, he said, so that operations are serialized only for sets with common file descriptors. He posted patches a few months ago, but they added three pointer fields to struct file, which was not something other developers were happy with. He plans to switch to only adding a single pointer that points to a structure to hold anything that epoll() needs. It would be allocated when the epoll() file descriptor is created.
In addition, his patches eliminate the runtime checking for loops and excessively deep nesting in the file descriptor sets. Right now those checks are done when calling epoll_wait(), but his patches do that checking when file descriptors are added to the set in epoll_ctl().
Layton asked if all of this work meant that Baron was volunteering to be the epoll() maintainer. Baron was non-committal, but Chris Mason suggested (with a chuckle) that if these patches were accepted, that would more or less happen by default.
Mason said that Facebook is hitting some of these problems, as is Google. Someone said that GlusterFS is hitting them too. Baron said that Akamai would be using his patches in production, so they should get lots of testing.
There are other epoll() patches out there, including those for new system calls from Zheng. Others include a patch that would add a lockless way to enqueue and dequeue events and one that would optimistically wait (briefly) in the kernel for another event rather than immediately go to sleep. The person working on the latter patches, which were targeted at networking, is now working on other things, Matthew Wilcox said, so they could be taken over by someone else if that was of interest.
It would seem that scalability problems with epoll() are cropping up in a number of places, so some fixes are needed. Baron's patches are not running into much in the way of opposition, at least from the assembled filesystem developers, which means they may make their way into the mainline before long.
Copy offload
In the final combined storage-and-filesystem session at the 2015 LSFMM Summit, Zach Brown and Martin Petersen teamed up to describe the state of and plans for supporting copy offload, which is a way of handing the work of copying a file to a filesystem or lower-level storage device, where the task can often be optimized. The functionality has been available in storage devices for eight years or so, Brown said.
The current strategy is to add a new system call, copy_file_range(), that takes two file descriptors with pointers to offsets and lengths, Brown said. As the later discussion indicated, those file descriptors could be for files on different filesystems, but some feel that they should be restricted to a single filesystem. The big difference from earlier proposals is that callers are now required to create the destination file. That avoids some race conditions in the virtual filesystem (VFS) layer.
The remaining contentious parts for the system call are minor, he continued. For example, a flag value for the length could indicate that the entire source file should be copied. There is a "whole world of shit we can argue about", he said, since there are 32 bits worth of flag values available. The contentious piece is on the block side, he said. Petersen has added support, but the device mapper developers did not like the approach he took.
For Btrfs, the system call is a wrapper around the existing ioctl(), though there are some alignment issues still to be worked out. Chris Mason said that for Btrfs there are different options for doing copy offload. Creating a directory subvolume is a constant-time operation that can make a copy of an entire file (using copy on write or COW). Making a file copy directly, which could support a range in the file (again, using COW), is proportional to the number of extents in the file. Brown suggested that under the covers Btrfs could implement the copy as a subvolume creation if the copy is for a whole file.
Ric Wheeler seemed to sum up the feeling of many when he said that "anything that works is better than years of nothing" for copy-offload support.
Petersen said that SCSI support for copy offload has advanced since last year, even though he had said it was done then. It now supports more features. There are some patches that add copy-offload support to the device mapper kcopyd (dm-kcopyd), though he "did not agree with the approach exactly". He has also added support for token-based copy offload, where device-generated tokens are used to identify the data of interest at the storage level. The block and SCSI support for copy offload has just been waiting for a user other than dm-kcopyd, he said.
Brown noted that callers of copy_file_range() could perhaps get an error return if the underlying storage did not support copy offload. That way the caller could decide whether to fall back to a regular copy or not. A flag could be added to the call to do that fallback in the kernel, too.
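The caller-side fallback that Brown described might look roughly like the sketch below. It relies on os.copy_file_range(), the wrapper that Python 3.8 and later provide on Linux, so it reflects an interface that appeared after this session rather than the proposal as it stood at the summit. The set of error codes treated as "offload not available" is an assumption made for this example, as is the function name.

```python
import errno
import os

# Errors taken here to mean "try an ordinary copy instead" (an assumption).
_FALLBACK_ERRNOS = {errno.ENOSYS, errno.EXDEV, errno.EOPNOTSUPP, errno.EINVAL}


def copy_range(src_fd, dst_fd, length):
    """Copy up to 'length' bytes between the descriptors' current offsets.

    Tries copy_file_range() first and falls back to read()/write().
    Returns the number of bytes copied.
    """
    copied = 0
    use_offload = hasattr(os, "copy_file_range")
    while copied < length:
        remaining = length - copied
        if use_offload:
            try:
                n = os.copy_file_range(src_fd, dst_fd, remaining)
            except OSError as exc:
                if exc.errno not in _FALLBACK_ERRNOS:
                    raise
                use_offload = False  # switch to the plain copy loop
                continue
        else:
            buf = os.read(src_fd, min(remaining, 1 << 20))
            n = len(buf)
            view = memoryview(buf)
            while view:  # handle short writes
                view = view[os.write(dst_fd, view):]
        if n == 0:
            break  # end of the source file
        copied += n
    return copied
```

Doing the fallback in the caller, as here, leaves the decision to the application; the in-kernel flag that Brown mentioned would hide it instead.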
The new system call would, at least conceptually, allow copying between files on two different mounted filesystems as long as both support copy offload, but Christoph Hellwig thought that should be left for an add-on patch. All of the existing system calls will only work within a single mountpoint, he said, so making an exception needs to be considered carefully. Wheeler said that being able to do copies between mountpoints is a powerful feature, but Hellwig thought it should wait until someone actually needs that functionality and can provide a good implementation. It is never a problem to relax restrictions on system calls, Hellwig said.
The cross-filesystem copying feature is most important for network filesystems, Hellwig said. Wheeler disagreed, saying that it is also important for local filesystems. Hellwig said there needs to be a well-thought-out interface, so that users don't get locked into ioctl()-based mechanisms. Block-based filesystems could defer to the lower-level copy-offload support, he suggested. There is "more than one way to skin the cat; we just have to find a cat that we can skin", Dave Chinner said with a chuckle.
Step one should be to get the single-mountpoint system call implementation in, Hellwig said. Getting the block-layer support in should be step two. "Anything more fancy can follow". He also thought that token-based copies "make zero sense" from a user-interface perspective. That should be hidden in the lower levels. Finally, there should be an asynchronous interface with a notification when the operation completes.
The sense in the room was that copy-offload support is nearing inclusion after being discussed for several years at LSFMM. We will have to wait and see what gets into the mainline or whether copy offload will be on the agenda at next year's summit in Raleigh, North Carolina.
Page editor: Jonathan Corbet
