Kernel development

Brief items

Kernel release status

The 3.14 kernel is out, released on March 30. "So we had a few fairly late changes that I could have done without, but the changelog from -rc8 is still pretty small, and I'm feeling pretty good about it all." Headline features in this release include user-space lock debugging, the deadline scheduler, event triggers in the tracing subsystem, the zram swap subsystem, and various networking changes including the heavy-hitter filter, the PIE packet scheduler and TCP auto corking. See the KernelNewbies 3.14 page for details.

Stable updates: 3.13.8, 3.10.35, and 3.4.85 were released on March 31. 3.12.16 and 3.2.56 were released on April 2.

Quotes of the week

Yes, this is a pet peeve of mine. Our configuration phase is absolutely *the* single worst part of the kernel, and it's not because our Kconfig language is complex. It's because it scares people away from building their own kernels and testing, because we make it insanely hard to answer the questions, and we seem to actively encourage people to enable features that are pointless and just bloat things and make the build process slower and harder.

Christ, even *I* find our configuration process tedious. I can only imagine how many casual users we scare away.

Linus Torvalds

During last week's Collab summit, Jon Corbet suggested we use the power of social media to improve the Linux kernel patch review process.

We thought this was a great idea, and have been experimenting with a new Facebook group dedicated to patch discussion and review. The new group provides a dramatically improved development workflow, including:

  • One click patch or comment approval
  • Comments enhanced with pictures and video
  • Who has seen your patches and comments
  • Searchable index of past submissions
  • A strong community without anonymous flames

To help capture the group discussion in the final patch submission, we suggest adding a Liked-by: tag to commits that have been through group review.

Chris Mason

So for those that don't use Facebook, we're also setting up a twitter account that tweets patches 140 characters at a time. This should help with review by automatically splitting up patches into manageable 140 character chunks.

Josh Boyer

Merge window coverage to begin next week

The merge window has opened and Linus Torvalds is madly merging code into the mainline. We got a bit behind this week, but will (of course) be covering the merge window, starting in next week's edition. We are sorry for any inconvenience this may cause.

Kernel development news

Ideas for supporting shingled magnetic recording (SMR)

By Jake Edge
April 2, 2014
2014 LSFMM Summit

At the 2014 Linux Storage, Filesystem, and Memory Management (LSFMM) Summit, Dave Chinner and Ted Ts'o jointly led a session that ended up spanning two slots over two days. The topic was, broadly, whether the filesystem or the block layer was the right interface for supporting shingled magnetic recording (SMR) devices. In the end, it ranged a bit more broadly than that.

Zone information API and cache

Ts'o began with a description of a kernel-level C interface to the zone information returned by SMR devices that he has been working on. SMR devices will report the zones that are present on the drive, their characteristics (size, sequential-only, ...), and the location of the write pointer for each sequential-only zone. Ts'o's idea is to cache that information in a compact form in the kernel so that multiple "report zones" commands do not need to be sent to the device. Instead, interested kernel subsystems can query for the sizes of zones and the position of the write pointer in each zone, for example.

The interface for user space would be ioctl(), Ts'o said, though James Bottomley thought a sysfs interface made more sense. Chinner was concerned about having thousands of entries in sysfs, and Ric Wheeler noted that there could actually be tens of thousands of zones in a given device.

The data structures he is using assume that zones are mostly grouped into regions of same-sized zones, Ts'o said. He is "optimizing for sanity", but the interface would support other device layouts. Zach Brown wondered why the kernel needed to cache the information, since that might require snooping the SCSI bus, looking for reset write pointer commands. No one thought snooping the bus was viable, but some thought disallowing raw SCSI access was plausible. Bottomley dumped cold water on that with a reminder that the SCSI generic (sg) layer would bypass Ts'o's cache.
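The article does not show Ts'o's data structures, so the following is only a sketch of the idea of caching zone information compactly: runs of same-sized zones are stored once as a region, and a lookup maps an LBA to its zone and write pointer. All names here (ZoneRegion, ZoneCache, lookup, write_pointer) are hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class ZoneRegion:
    """A run of same-sized zones, described once rather than per zone."""
    start_lba: int      # first LBA covered by the region
    zone_size: int      # blocks per zone
    zone_count: int     # number of zones in the region
    sequential: bool    # True if the zones are sequential-only
    # Write pointers as offsets from each zone start (sequential zones only).
    write_ptrs: list = field(default_factory=list)

    def __post_init__(self):
        if self.sequential and not self.write_ptrs:
            self.write_ptrs = [0] * self.zone_count

    @property
    def end_lba(self):
        return self.start_lba + self.zone_size * self.zone_count


class ZoneCache:
    """In-memory copy of zone data, so "report zones" need not be reissued."""

    def __init__(self, regions):
        self.regions = sorted(regions, key=lambda r: r.start_lba)
        self._starts = [r.start_lba for r in self.regions]

    def lookup(self, lba):
        """Return (region, zone index within the region) for an LBA."""
        i = bisect_right(self._starts, lba) - 1
        if i < 0 or lba >= self.regions[i].end_lba:
            raise ValueError(f"LBA {lba} is outside every known zone")
        region = self.regions[i]
        return region, (lba - region.start_lba) // region.zone_size

    def write_pointer(self, lba):
        """Absolute write-pointer LBA for the zone holding lba, or None."""
        region, idx = self.lookup(lba)
        if not region.sequential:
            return None
        zone_start = region.start_lba + idx * region.zone_size
        return zone_start + region.write_ptrs[idx]
```

With tens of thousands of zones, a handful of regions plus one integer per sequential zone stays small, which is the point of assuming mostly uniform layouts. Bottomley's objection still applies: anything that moves a write pointer behind the cache's back would leave it stale.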

The question of how to handle host-managed devices (where the host must ensure that all writes to sequential zones are sequential) then came up. Ts'o said he has seen terrible one-second latency in host-aware devices (where the host can make mistakes and a translation layer will remap the non-sequential writes—which can lead to garbage collection and terrible latencies), which means that users will want Linux to support host-managed behavior. That should avoid these latencies even on host-aware devices.
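The difference between the two device types can be modeled in a few lines. This is a toy model under stated assumptions, not real device behavior: SequentialZone and its fields are invented names, and the "remapped" counter merely stands in for the translation-layer work that leads to the latencies Ts'o described.

```python
class SequentialZone:
    """Toy model of one sequential zone on an SMR device."""

    def __init__(self, start, size, host_managed):
        self.start = start
        self.size = size
        self.host_managed = host_managed
        self.write_ptr = start   # next LBA that may be written in sequence
        self.remapped = 0        # blocks a translation layer had to remap

    def write(self, lba, nblocks):
        if lba < self.start or lba + nblocks > self.start + self.size:
            raise ValueError("write falls outside the zone")
        if lba == self.write_ptr:
            self.write_ptr += nblocks
            return "sequential"
        if self.host_managed:
            # The host must get it right; the device will not fix things up.
            raise IOError("non-sequential write rejected")
        # Host-aware: accepted, but at the cost of remapping (and later
        # garbage collection) inside the device.
        self.remapped += nblocks
        return "remapped"
```

A host that always writes at the write pointer never triggers the remapping path, which is why host-managed discipline would also avoid the latencies on host-aware devices.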

But, as Chinner pointed out, there are things that have fixed layouts in user space that cannot be changed. For example, mkfs zeroes out the end of the partition, and SMR drives have to be able to work with that, he said. He is "highly skeptical" that host-managed devices will work at all with Linux. Nothing that Linux has today can run on host-managed SMR devices, he said. But those devices will likely be cheaper to produce, so they will be available and users will want support for them. An informal poll of the device makers in the room about the host-managed vs. host-aware question was largely inconclusive.

Ts'o suggested using the device mapper to create a translation layer in the kernel that would support host-managed devices. "We can fix bugs quicker than vendors can push firmware." But, as Chris Mason pointed out, any new device mapper layer won't be available to users for something like three years, while there is a need to support both types of SMR devices "tomorrow". The first session expired at that point, without much in the way of real conclusions.

When it picked up again, Ts'o had shifted gears a bit. There are a number of situations where the block device is "doing magic behind the scenes", for example SMR and thin provisioning with dm-thin. What filesystems have been doing to try to optimize their layout for basic, spinning drives is not sensible in other scenarios. For SSD drives, the translation layer and drives were so fast that filesystems don't need to care about the translation layer and other magic happening in the drive firmware. For SMR and other situations, that may not be true, so there is a need to rethink the filesystem layer somewhat.

Splitting filesystems in two

That was an entrée to Chinner's thoughts about filesystems. He cautioned that he had just started to write things down, and is open to other suggestions and ideas, but he wanted to get feedback on his thinking. A filesystem really consists of two separate layers, Chinner said: a namespace layer and a block allocation layer. Linux filesystems have done a lot of work to optimize the block allocations for spinning devices, but there are other classes of device, SMR and persistent memory for example, where those optimizations fall down.

So, in order to optimize block allocation for all of these different kinds of devices, it would make sense to split out block allocation from namespace handling in filesystems. The namespace portion of filesystems would remain unchanged, and all of the allocation smarts would move to a "smart block device" that would know the characteristics of the underlying device and be able to allocate blocks accordingly.

The filesystem namespace layer would know things like the fact that it would like a set of allocations to be contiguous, but the block allocator could override those decisions based on its knowledge. If it were allocating blocks on an SMR device and recognized that it couldn't put the data in a contiguous location, it would return "nearby" blocks. For spinning media, it would return contiguous blocks, but for persistent memory, "we don't care", so it could just return some convenient blocks. Any of the existing filesystems that do not support copy-on-write (COW) cannot really be optimized for SMR, he said, because you can't overwrite data in sequential zones. That would mean adding COW to ext4 and XFS, Chinner said.
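Chinner had only started writing his ideas down, so no interface exists to quote. As a sketch of the split, assuming a hypothetical allocate(count, want_contiguous) call, a namespace layer might pass a contiguity hint that each device-specific allocator is free to honor or override:

```python
class ContiguousAllocator:
    """Spinning-media policy: satisfy a request with one contiguous extent."""

    def __init__(self, nblocks):
        self.next_free = 0
        self.nblocks = nblocks

    def allocate(self, count, want_contiguous=True):
        if self.next_free + count > self.nblocks:
            raise OSError("no space")
        extent = (self.next_free, count)
        self.next_free += count
        return [extent]


class ScatterAllocator:
    """Persistent-memory policy: locality is irrelevant, take any blocks."""

    def __init__(self, free_blocks):
        self.free = list(free_blocks)

    def allocate(self, count, want_contiguous=True):
        if count > len(self.free):
            raise OSError("no space")
        taken, self.free = self.free[:count], self.free[count:]
        # The contiguity hint is simply overridden here.
        return [(block, 1) for block in taken]


class Namespace:
    """Maps names to extents; knows nothing about the underlying device."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.files = {}

    def create(self, name, nblocks):
        extents = self.allocator.allocate(nblocks, want_contiguous=True)
        self.files[name] = extents
        return extents
```

The Namespace class is identical in both cases; only the allocator changes. An SMR allocator returning "nearby" blocks would be a third implementation of the same call.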

But splitting the filesystem into two pieces means that the on-disk format can change, he said. All the namespace layer cares about is that the metadata it carries is consistent. But Ts'o brought up something that was obviously on the minds of many in the room: how is it different from object-based storage that was going to start taking over fifteen years ago?—but hasn't.

Chinner said that he had no plans to move things like files and inodes down into the block allocation layer, as object-based storage does; there would just be a layer that would allocate and release blocks. He asked: Why do the optimization of block allocation for different types of devices in each filesystem?

Another difference between Chinner's idea and object-based storage is that the metadata stays with the filesystem, unlike moving it down to the device as it is in the object-based model, Bottomley said. Chinner said that he is not looking to allocate an object that he can attach attributes to, just creating allocators that are optimized for a particular type of device. Once that happens, it would make sense to share those allocators with multiple filesystems.

Mason noted that what Chinner was describing was a lot like the FusionIO filesystem DirectFS. Chinner said that he was not surprised; he had looked and did not find much documentation on DirectFS, and he acknowledged that others have come up with these ideas in the past. It is not necessarily new, but he is looking at it as a way to solve some of the problems that have cropped up.

Bottomley asked how to get to "something we can test". Chinner thought it would take six months of work, but there is still lots to do before that work could start. "Should we take this approach?", he asked. Wheeler thought the idea showed promise; it avoids redundancy and takes advantage of the properties of new devices. Others were similarly positive, though they wanted Chinner to firmly keep the reasons that object-based storage failed in his mind as he worked on it. Chinner thought a proof-of-concept should be appearing in six to twelve months' time.

[ Thanks to the Linux Foundation for travel support to attend LSFMM. ]

Data integrity user-space interfaces

By Jake Edge
April 2, 2014
2014 LSFMM Summit

Darrick Wong and Zach Brown led a session at the 2014 LSFMM summit to discuss progress that has been made in the user-space interfaces for data integrity. This feature uses the data integrity field (DIF) defined by SCSI to add cyclic redundancy check (CRC) information to blocks so that corruptions between user space and the disk platter can be detected. That detection is done with the data integrity extension (DIX).
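To make the idea of per-block protection information concrete, here is a sketch of generating and checking it in user space. The details are assumptions drawn from the T10 specification as commonly described, not from the session: an 8-byte tuple per sector holding a 16-bit guard CRC (generator polynomial 0x8BB7), a 16-bit application tag, and a 32-bit reference tag taken from the low bits of the LBA. The function names are hypothetical and this is not Wong's interface.

```python
import struct

T10_DIF_POLY = 0x8BB7  # assumed CRC-16 generator polynomial for the guard tag


def crc16_t10dif(data: bytes) -> int:
    """Bitwise, MSB-first CRC-16 with a zero initial value."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = (crc << 1) ^ T10_DIF_POLY
            else:
                crc <<= 1
            crc &= 0xFFFF
    return crc


def make_tuple(sector: bytes, lba: int, app_tag: int = 0) -> bytes:
    """Build the 8-byte tuple: guard CRC, application tag, reference tag."""
    return struct.pack(">HHI", crc16_t10dif(sector), app_tag, lba & 0xFFFFFFFF)


def verify(sector: bytes, lba: int, pi: bytes) -> bool:
    """Check both the data (guard) and its intended location (reference)."""
    guard, _app_tag, ref = struct.unpack(">HHI", pi)
    return guard == crc16_t10dif(sector) and ref == (lba & 0xFFFFFFFF)
```

The guard catches corrupted data, while the reference tag catches data that arrives intact but at the wrong block. A "write and protect" library call of the kind Petersen mentioned would presumably hide this bookkeeping from applications.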

At last year's meeting, Wong said, he talked about DIF/DIX, but didn't have any code to show. That has changed this year as he has implemented an I/O extension feature that will allow user space to generate DIF information for its blocks. James Bottomley was concerned that the user has to know far too much about the drive to use the interface. It is, he said, the most complex interface we could export for DIF/DIX. But Martin Petersen said that user space would not be expected to use that interface directly, as a library would be written to support "write and protect" and "read and verify" kinds of operations.

The I/O extension uses an "extra" field in the iocb (I/O control block) structure that others have been "eyeing", Ted Ts'o said. Brown noted that any additional users could add their data into the I/O extension itself, though that would require multiple users to coordinate on their extensions, Ts'o said. Kent Overstreet floated the idea of new system calls that added the protection information as a new parameter, but Wong said he was trying to avoid an explosion of system calls.

Ts'o noted that Google (his employer) has an out-of-tree extension that provides "cut to the head of the line" I/O scheduling; it also uses that extra iocb field. It is a bit of a hack, he said, and he didn't think he could navigate the politics to get it upstream. Wong suggested using a flag value in iocb to determine how the extra field was being used; alternatively, Brown suggested extending the I/O extension. Some coordination is needed to structure these kinds of things, Ts'o said, but he didn't seem opposed to the idea overall.

[ Thanks to the Linux Foundation for travel support to attend LSFMM. ]

Copy offload

By Jake Edge
April 2, 2014
2014 LSFMM Summit

At the 2014 LSFMM Summit, held in Napa, California March 24-25, Martin Petersen and Zach Brown gave an update on the status of copy offload, which is a mechanism to handle file copies on the server or storage array without involving the CPU or network. In addition, Hannes Reinecke and Doug Gilbert took the second half of the slot to discuss an additional copy offload option.

The Petersen/Brown talk was titled "Copy Offload: Are We There Yet?" and Petersen tried, unsuccessfully, to short-circuit the whole talk by simply answering the question: "Yes, thank you", he said and started to head back to his seat. But there was clearly more to say about a feature that allows storage devices to handle file copies without any involvement of either the server or the network—at least once the copy has been initiated.

Petersen said that he had been working on the feature for some time. He rewrote it a few times and had to rebase it on top of Hannes Reinecke's vital product data (VPD) work. That last step got rid of most of his code, he said, and led to a working copy offload. The interface is straightforward, just consisting of source and destination devices, source and destination logical block addresses (LBAs), and a number of blocks. Under the covers, it uses the SCSI XCOPY (extended copy) command because that is "supported by everyone". It does not preclude adding more complicated copy offload options later, Petersen said, but he just wanted something "simple that would work".

Depending on the storage device, copy offload can do really large copies instantly, by just updating some references to the data, Ric Wheeler said. Someone asked what Samba's interface would look like. To that, Brown said that a new interface using file descriptors and byte ranges is the next step. It will be a single-buffer-at-a-time system call that handles descriptors rather than devices. It can return partial success, so user space needs to be prepared for that, he said. While he didn't commit to a date, Brown said that the interface would be much simpler now that Petersen had added XCOPY support.
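Since the planned call can return partial success, callers would need a retry loop. The sketch below shows the shape of such a loop; the system call did not exist at the time of the session, so copy_fn is a hypothetical stand-in with an assumed signature, and chunked_copy is a toy copier over in-memory buffers used only to exercise the loop.

```python
def copy_range(copy_fn, src, src_off, dst, dst_off, length):
    """Copy length bytes, continuing after each partial success.

    copy_fn(src, src_off, dst, dst_off, n) stands in for the proposed
    call; it returns how many bytes it copied, possibly fewer than n.
    """
    done = 0
    while done < length:
        n = copy_fn(src, src_off + done, dst, dst_off + done, length - done)
        if n == 0:   # nothing more could be copied, e.g. end of source
            break
        done += n
    return done


def chunked_copy(src, src_off, dst, dst_off, n, limit=4):
    """Toy copier over bytearrays that never moves more than limit bytes."""
    n = min(n, limit, len(src) - src_off)
    dst[dst_off:dst_off + n] = src[src_off:src_off + n]
    return n
```

The caller sees a single total; whether the work was done by an offload engine in one step or in many small pieces is hidden behind the loop.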

LID1 and LID4

Moving on to the token-based copying, Gilbert noted that there are two big players in the copy offload world: VMware, which uses XCOPY (with a one-byte length ID, aka LID1), and Microsoft, which uses ODX (aka LID4 because it has a four-byte length ID). Storage vendors all support XCOPY, but ODX support is growing.

LID4 added a number of improvements to LID1, but it adds lots of complexity and ugly hacks too, Gilbert said. ODX is a Microsoft name for the "lite" portion of the original T10 (SCSI standardization group) document "XCOPYv2: Extended Copy Plus & Lite". ODX is a two-part disk-to-disk token-based copy, he said. It uses a storage-based gather list to populate a "representation of data" (ROD), which can be thought of as a snapshot ID. It also generates a ROD token that can be used to access the data assembled.

Wheeler noted that anyone who has the token value (and access to the storage) can copy the data without any security checks. "If you have the token, you have the data" is the model, Fred Knight said. That bypasses the usual operating system security model, though, which is something to be aware of, Wheeler said.

The lifetimes of the tokens (typically 30-60 seconds) will help reduce problems, Reinecke said. But Knight cautioned that lifetimes vary between implementations. In addition, Reinecke noted that the token is not guaranteed to work throughout the entire lifetime.

Gilbert said that ODX is a "point in time" copy, which sounds something like snapshots, but the 30-60 second lifetime makes them not particularly useful as snapshots. He then gave a demo that created a gather list, wrote a token to a file, used scp to copy the token file to another host, then used the token with his ddpt utility to retrieve the data. As Reinecke summed up, the main idea is to avoid data transfer via the CPU whenever possible. If that can be done efficiently, then Linux should look at supporting it.
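The token model discussed in this session can be sketched as a toy in-memory "array". Nothing here reflects real ODX commands or Gilbert's ddpt utility; TokenStore and its methods are invented for illustration, and time is passed in explicitly so that the example does not need to sleep.

```python
import secrets


class TokenStore:
    """Toy storage array handing out short-lived tokens for gathered data."""

    def __init__(self, lifetime=30.0):
        self.lifetime = lifetime
        self._tokens = {}   # token -> (expiry time, point-in-time data)

    def create_token(self, blocks, now):
        token = secrets.token_hex(16)
        snapshot = b"".join(blocks)   # point-in-time copy of the gather list
        self._tokens[token] = (now + self.lifetime, snapshot)
        return token

    def read(self, token, now):
        expiry, data = self._tokens.get(token, (0.0, None))
        if data is None or now >= expiry:
            self._tokens.pop(token, None)
            raise KeyError("token unknown or expired")
        # Note what is missing: no check of who is presenting the token.
        return data
```

The read() method shows both properties raised in the discussion: possession of the token is the only credential, and the short lifetime limits (but does not remove) the exposure.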

[ Thanks to the Linux Foundation for travel support to attend LSFMM. ]

Error handling

By Jake Edge
April 2, 2014
2014 LSFMM Summit

Hannes Reinecke led two sessions at this year's Linux Storage, Filesystem, and Memory Management (LSFMM) Summit that were concerned with errors in the block layer. The first largely focused on errors that come out of partition scanners, while the second looked at a fundamental rework of the SCSI error-handling path.

Reinecke has added more detailed I/O error codes that are meant to help diagnose problems. One area where he ran into problems was that an EMC driver was returning ENOSPC when it hit the end of the disk during a partition scan. He would rather see that be an ENXIO, which is what the seven kernel partition scanners (and one in user space) return for the end-of-disk condition. So, he has remapped that error to ENXIO in the SCSI code. Otherwise, the thin provisioning code gets confused as it expects ENOSPC only when it hits its limit.

Al Viro was concerned that the remapped error code would make it all the way out to user space and confuse various tools. But Reinecke assured him that the remapped errors stop at the block layer. Being able to distinguish between actual I/O errors and the end-of-disk condition will also allow the partition scanner to stop probing in the presence of I/O errors, he said.
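The effect of the remapping on a partition scanner can be illustrated with a small user-space model. This is not Reinecke's SCSI code; remap_scan_error and scan_partitions are hypothetical names, and the reader callback stands in for whatever actually reads the disk.

```python
import errno


def remap_scan_error(err, at_end_of_disk):
    """Report end-of-disk as ENXIO, leaving ENOSPC to thin provisioning."""
    if at_end_of_disk and err == errno.ENOSPC:
        return errno.ENXIO
    return err


def scan_partitions(read_sector, sectors):
    """Probe sectors in order; stop at end of disk, abort on real errors."""
    found = []
    for sector in sectors:
        try:
            found.append(read_sector(sector))
        except OSError as exc:
            if exc.errno == errno.ENXIO:
                break   # ran off the end of the device: not a failure
            raise       # genuine I/O error: stop probing entirely
    return found
```

Once the two conditions carry different codes, the scanner no longer has to guess whether a failed read means "no more disk" or "broken disk".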

In another session, on day two, Reinecke presented a proposal for recovering from SCSI errors at various levels (LUN, target, bus, ...). In addition, doing resets at some of the levels does not make any sense depending on the kind of error detected, he said. If the target is unreachable, for example, trying to reset the LUN, target, or bus is pointless; instead, a transport reset should be tried and, if that fails, a host reset would be next. This would be the path taken when either a command times out or returns an error.
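The escalation logic can be sketched as a list of levels that is trimmed according to the kind of error. Only the unreachable-target path (transport reset, then host reset) comes from the session; the full ordering used for the reachable case is an assumption, and the names are hypothetical.

```python
# Narrowest reset first; the full ordering is an assumption for illustration.
LEVELS = ["lun", "target", "bus", "transport", "host"]


def recovery_plan(target_reachable):
    """Return the reset levels worth trying, narrowest first."""
    if target_reachable:
        return list(LEVELS)
    # LUN, target, and bus resets cannot help if the target is unreachable.
    return ["transport", "host"]


def recover(target_reachable, try_reset):
    """Escalate until try_reset(level) succeeds; return that level or None."""
    for level in recovery_plan(target_reachable):
        if try_reset(level):
            return level
    return None
```

Skipping pointless levels matters because each attempted reset can disrupt I/O to devices that were working fine.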

There were lots of complaints from those in attendance about resetting more than is absolutely necessary. That disrupts traffic to other LUNs when a single LUN or target has an error, even though the other LUNs are handling I/O just fine. Part of the problem, according to Reinecke, is that the LUN reset command does not time out.

But Roland Dreier noted that one missed I/O can cause a whole storage array to get reset, which can take a minute or more to clear. In addition, once the error handler has been entered, all I/O to the host in question is stopped. In some large fabrics, one dropped packet can lead to no I/O for quite some time, he said. Reinecke disputed that a dropped frame would lead to that, since commands are retried, but agreed that a more serious error could lead to that situation.

Complicating things further, of course, is that storage vendors all do different things for different errors. The recovery process for one vendor may or may not be the same as what is needed for another. In the end, it seemed like there was agreement that Reinecke's changes would make things better than what we have now, which is obviously a step in the right direction.

[ Thanks to the Linux Foundation for travel support to attend LSFMM. ]

A revoke() update and more

By Jake Edge
April 2, 2014
2014 LSFMM Summit

Al Viro gave an update on the long-awaited revoke() system call at the 2014 Linux Storage, Filesystem, and Memory Management (LSFMM) Summit. revoke() is meant to close() any existing file descriptors open for a given pathname so that a process can know that it has exclusive use of the file or device in question. Viro also discussed some work he has been doing to unify the multiple variants of the read() and write() system calls.

Viro started out by saying that revoke() was the less interesting part of his session. It is getting "more or less close to done", he said. We looked at an earlier version of this work a year ago. Files will be able to be declared revokable at open() time. If they are, a counter will track the usage of the file_operations functions at any given time. Once revoke() is called, it waits for all currently active threads to exit the file_operations, and makes sure that no more are allowed to start.
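The counting scheme can be modeled in user space with a condition variable. This is an analogy, not Viro's kernel code: RevokableFile, call(), and the use of threading.Condition are all choices made for the sketch.

```python
import threading


class RevokableFile:
    """User-space model of usage counting around file operations."""

    def __init__(self):
        self._cond = threading.Condition()
        self._active = 0       # operations currently in progress
        self._revoked = False

    def call(self, operation):
        """Run operation() as if it were a file_operations method."""
        with self._cond:
            if self._revoked:
                raise OSError("file has been revoked")
            self._active += 1
        try:
            return operation()
        finally:
            with self._cond:
                self._active -= 1
                self._cond.notify_all()

    def revoke(self):
        with self._cond:
            self._revoked = True    # no new operations may start
            while self._active:     # wait for in-flight ones to leave
                self._cond.wait()
```

The lock taken on every call() is exactly the kind of common-path cost the kernel implementation has to avoid, since most files will never be revoked.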

There are places in procfs and sysfs where something similar is open-coded, Viro said, that could be removed once the revoke() changes go in. One of the keys is to ensure that the common path does not slow down for revoke() since most files will not be revokable. There are several areas that still need work, including poll(), which "provides some complications", and mmap(), which has always been problematic for revoke().

In a bit of an aside, Viro noted that there is a lot of code that is "just plain broke". For example, if a file in debugfs is opened and the underlying code removes the file from the debugfs directory, any read or write operation using the open file descriptor will oops the kernel. Dynamic debugfs is completely broken, Viro said. He hopes that the revoke() code will be in reasonable shape in a couple of cycles—"it's getting there". Dynamic debugfs will be one of the first users, he said.

Viro then moved on to the unification of plain read() and write() with the readv()/writev() variants as well as splice_read() and splice_write(). The regular and vector variants (readv()/writev()) have mostly been combined, he said. It is "not pretty", but it is tolerable. The splice variants got "really messy".

Ideally, the code for all of the variants should look the same all the way down, until you get to the final disposition. But each of the variants has its own view of the data; the splice variants get/put their data into pages, which doesn't fit well with the iovec used by the other two variants (in most implementations, plain read() and write() are translated to an iovec of length one). Creating a new data structure that can hold both user and kernel iovec members, along with struct page for the splice variants may be the way to go, Viro said.

Something that "fell out" of his work in this area is the addition of iov_iter. The iov_shorten() operation tries to recalculate the number of network segments that fall into a given iovec area, but the result is that the iovec gets modified when there are short reads or writes. Worse still, how the iovec gets modified is protocol-dependent, which makes it hard for users. In fact, someone from the CIFS team said that it makes a copy of any iovec before passing it in because it doesn't know what it will get back.

Having it be protocol-dependent is "just wrong", Viro said. He has been getting rid of iov_shorten() calls, as well as other places that shorten iovec arrays. That might allow sendpage() to be removed entirely; protocols that want to be smart can set up an iov_iter, he said.
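The appeal of an iterator is that progress is tracked beside the iovec rather than by rewriting it. The following sketch shows that idea only; IovIter is a hypothetical class, not the kernel's iov_iter, and it works on Python bytes objects rather than user or kernel memory.

```python
class IovIter:
    """Cursor over a sequence of buffers; the sequence is never modified."""

    def __init__(self, iovec):
        self.iovec = iovec   # sequence of bytes-like segments
        self.seg = 0         # index of the current segment
        self.offset = 0      # bytes already consumed in that segment

    def remaining(self):
        return sum(len(b) for b in self.iovec[self.seg:]) - self.offset

    def advance(self, nbytes):
        """Consume nbytes, for example after a short read or write."""
        if nbytes > self.remaining():
            raise ValueError("advance past the end of the iovec")
        while nbytes:
            step = min(len(self.iovec[self.seg]) - self.offset, nbytes)
            self.offset += step
            nbytes -= step
            if self.offset == len(self.iovec[self.seg]):
                self.seg += 1
                self.offset = 0

    def peek(self):
        """Return the bytes not yet consumed (for illustration only)."""
        if self.seg >= len(self.iovec):
            return b""
        head = bytes(self.iovec[self.seg][self.offset:])
        return head + b"".join(bytes(b) for b in self.iovec[self.seg + 1:])
```

A caller such as the CIFS code described above would no longer need a defensive copy, because the array it passed in comes back untouched. A plain write() fits the same model as an iovec of length one.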

[ Thanks to the Linux Foundation for travel support to attend LSFMM. ]

Thin provisioning

By Jake Edge
April 2, 2014
2014 LSFMM Summit

Eric Sandeen, Lukáš Czerner, Mike Snitzer, and Dmitry Monakhov discussed thin provisioning support in Linux at the 2014 LSFMM Summit. Thin provisioning is a way for a storage device to pretend to have more space than it actually does, relying on the administrator to add more storage "just in time". For the most part, Snitzer said, thin provisioning using the device mapper (dm-thin) is pretty stable. But there are some performance issues that they would like to address.

One of the problem areas in terms of performance is that the block allocator from dm-thin is "dumb". Multi-threaded writes end up on the disk with bad ordering that leads to bad read performance. What Snitzer would like to do is to split writes into different lists, one per thin volume, then sort the struct bio entries in the list. He doesn't want to add a full-on bio-based elevator, but does want to get some better locality to improve performance.
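The approach Snitzer described (one list per thin volume, each sorted) is simple to sketch. Here a write is represented as a (volume, sector, payload) tuple, which is an assumption for the example; real struct bio entries carry far more state.

```python
from collections import defaultdict


def sort_bios_per_volume(bios):
    """Group (volume, sector, payload) writes by volume, sorted by sector."""
    per_volume = defaultdict(list)
    for volume, sector, payload in bios:
        per_volume[volume].append((sector, payload))
    return {
        volume: sorted(entries, key=lambda entry: entry[0])
        for volume, entries in per_volume.items()
    }
```

This is much less than a full elevator: there is no merging, no deadlines, and no fairness, only enough ordering to keep each volume's blocks near one another.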

As part of that, Snitzer would like to have a way to ask an XFS or ext4 filesystem about its allocation group boundaries. Those could be used as a hint for block allocation in the thin provisioning code. But Joel Becker wondered why that information was needed, and why the logical block address (LBA) information was not enough. Dave Chinner agreed, noting that the filesystem relies on the LBA information to make its allocation decisions.

Becker suggested that what Snitzer was really after is the "borders at which we stop caring about locality"—basically the distance between two writes that would not be considered "close". Snitzer said that he is looking for something concrete that dm-thin can use. Ted Ts'o thought that both ext4 and XFS could provide some values that would be reasonable for dm-thin to use to determine locality.

Monakhov noted that filesystems spread out their data throughout their volume, which causes problems for dm-thin. The problem, Chinner said, is that the filesystem needs to tell the block layer where the free space is. Chinner said that one dm-thin developer was asking for information on how the filesystem will spread things out, while another was asking that filesystems not spread things out.

There needs to be an automatic way for the filesystem to tell the block layer about free space, Ts'o said. Discard is one mechanism to do so, but Roland Dreier said that most administrators are disabling discard. In addition, support for the TRIM command (which tells devices about unused blocks) has been spotty, Martin Petersen said. Unqueued TRIM didn't work early on, but works now, while queued TRIM support is being added and is not yet working. Unqueued TRIM requires stopping all other activity on the device, so performance suffers; queued TRIM was added to the standard relatively recently to avoid that problem.

Someone said that fstrim (offline discard) is probably the right solution for most workloads. Snitzer said that mount -o discard (online discard) could be used with dm-thin. It passes the discard information to dm-thin, which doesn't (necessarily) pass it down to the storage device. That gives dm-thin the information it needs on free space, however.

Another problem for dm-thin is that fallocate() will reserve space, but that isn't getting passed down to the block layer. The result is that even after a successful fallocate() call, applications can still get ENOSPC—exactly the outcome fallocate() was meant to avoid. Ts'o said that can't be solved without handing the block allocation job to the block layer. But, Chinner said, it is not necessarily right for dm-thin to handle allocation.
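A toy model shows why the reservation does not help. Everything here is invented for illustration (ThinPool and ToyFilesystem are not dm-thin or any real filesystem): the filesystem reserves its own blocks, but the pool underneath only allocates on first write.

```python
import errno


class ThinPool:
    """Physical blocks shared by thin volumes, allocated on first write."""

    def __init__(self, physical_blocks):
        self.free = physical_blocks
        self.mapped = set()   # (volume, virtual block) pairs with backing

    def write(self, volume, block):
        if (volume, block) not in self.mapped:
            if self.free == 0:
                raise OSError(errno.ENOSPC, "thin pool exhausted")
            self.free -= 1
            self.mapped.add((volume, block))

    def discard(self, volume, block):
        """The filesystem reports a block unused; return it to the pool."""
        if (volume, block) in self.mapped:
            self.mapped.remove((volume, block))
            self.free += 1


class ToyFilesystem:
    def __init__(self, virtual_blocks):
        self.unreserved = virtual_blocks
        self.next_block = 0

    def fallocate(self, nblocks):
        """Reserve filesystem blocks only; the pool is never told."""
        if nblocks > self.unreserved:
            raise OSError(errno.ENOSPC, "filesystem full")
        self.unreserved -= nblocks
        first = self.next_block
        self.next_block += nblocks
        return list(range(first, first + nblocks))
```

In this model fallocate() succeeds against the volume's virtual size, yet a later write fails once the pool's physical blocks run out. The discard() method is the other half of the conversation: it is how free-space information flows back down.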

Sandeen noted another problem: filesystems act differently when they run out of space. XFS will keep trying to write any pending metadata, while ext4 and Btrfs will not. Chinner explained that XFS already has the metadata on stable storage in the log, so it retries to see if the administrator wants to add more storage.

More generally, there are different classes of errors for the different filesystems, Chinner said. Some are considered transient errors by some filesystems, which leads to different behavior between them. But, Monakhov noted, "user space goes crazy" when it gets an ENOSPC.

Monakhov went on to suggest (as he had in a lightning talk the previous day) that there be a standard way for the filesystem to report errors to the logs. James Bottomley said the obvious way to do that was with a uevent from the virtual filesystem (VFS). There is general agreement that some kind of event framework is needed, he said, but "we don't know who is going to do it, or when it will be done". As with many of the LSFMM sessions, few conclusions were drawn though some progress was clearly made.

[ Thanks to the Linux Foundation for travel support to attend LSFMM. ]

Block multi-queue status

By Jake Edge
April 2, 2014
2014 LSFMM Summit

The block layer multi-queue work was the subject of a discussion led by Nic Bellinger at this year's Linux Storage, Filesystem, and Memory Management (LSFMM) Summit. One might have expected Jens Axboe and Christoph Hellwig to be part of any discussion of that sort, but Axboe was ill and Hellwig was boycotting the location, Bellinger said. That left it up to him, though Axboe did provide some notes for the block multi-queue work.

From those notes, which Bellinger also provided to me, he said that the initial multi-queue work was merged for the 3.13 kernel. It only supported the virtio_blk driver and "mostly worked". There have been changes since that time, but overall it appears to be architecturally sound.

A basic conversion of the Micron mtip32xx SSD driver has been done. The existing driver is a single queue with shared tags. After the conversion, there are eight queues available. It runs at about 1.8 million I/O operations per second (IOPS), which is about the same as the unpatched driver. It works well on a two-socket system, but falls down on a four-socket machine.

Part of the problem is a lack of tags. The percpu-ida code is not going to cut it for tag assignment. An audience member said they had replaced percpu-ida recently, which eliminated the tags problem. Matthew Wilcox noted another tag problem: the Linux implementation makes them unique per logical unit number (LUN), while the specification says they need only be unique per target. In addition, James Bottomley said, the specification allows for 16-bit tags, rather than the 8-bit tags currently being used.

Bellinger then moved on to his and Hellwig's work on adding multiple queues to the SCSI subsystem. Since 2008, there have been reports of small-block random I/O performance issues in the SCSI core. Most of that is due to cache-line bouncing of the locks. It limits the performance of that type of I/O to 250K IOPS. Getting performance to 1 million IOPS using multiple SCSI hosts was taking up to 1/3 of the CPU on spinlock contention.

So Bellinger used the block multi-queue infrastructure to preallocate SCSI commands, sense buffers, protection information, and requests. His initial prototype had no error handling; any error would oops the system. But he was able to get 1.7 million IOPS out of that prototype code. Hellwig got error handling working and has been driving it to something that could be merged.
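The preallocation idea can be shown in miniature: each queue owns a fixed pool of command slots created up front, so the I/O path neither allocates memory nor touches state shared with other queues. This is a conceptual sketch with invented names (CommandPool, make_queues), and the 96-byte sense buffer size is an illustrative choice, not a claim about the prototype.

```python
class CommandPool:
    """Fixed set of command slots created up front for one queue."""

    def __init__(self, depth):
        # Each slot bundles a tag with its preallocated sense buffer.
        self.slots = [
            {"tag": tag, "sense": bytearray(96)} for tag in range(depth)
        ]
        self.free = list(range(depth))

    def get(self):
        """Take a free slot, or None if the queue is full."""
        if not self.free:
            return None
        return self.slots[self.free.pop()]

    def put(self, cmd):
        """Return a slot to the pool for reuse."""
        self.free.append(cmd["tag"])


def make_queues(nr_queues, depth):
    """One independent pool per queue; nothing is shared between them."""
    return [CommandPool(depth) for _ in range(nr_queues)]
```

Because a pool belongs to one queue, two CPUs submitting on different queues never contend for the same free list, which is the contention the spinlock numbers above describe.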

There are plans for an initial merge, but Bottomley was concerned that Bellinger and Hellwig did not agree on whether the faster IOPS mode was the default case or not, with Bellinger on the side of it being an exception. Bellinger said that there had been no agreement yet on that, which would make merging difficult, Bottomley said.

Converting drivers should be fairly easy, Bellinger said, though Bottomley cautioned that there would need to be a lot of work done on lock elimination in the drivers. There is also a question of per-queue vs. per-host mailboxes, Bottomley said. There is work to do to determine which submission model will work best, he said.

[ Thanks to the Linux Foundation for travel support to attend LSFMM. ]

Large-sector drives

By Jake Edge
April 2, 2014
2014 LSFMM Summit

Supporting large-sector (>4K) drives was the topic of Ric Wheeler's fairly brief session at the 2014 LSFMM Summit. Supporting drives with 4K sectors only came about in the last few years, but 4K sectors are not the end of the story.

Wheeler asked the drive vendor representatives if they chose 4K as their sector size willingly, or simply because it is the largest size supported by operating systems. While no one directly answered the question, it was evident that drive vendors would like more flexibility in sector sizes. While they may not want sectors as large as 256M (jokingly suggested by Martin Petersen), storage arrays have been using 64K and 128K sectors internally for some time, Wheeler said.

Dave Chinner asked which layers would need to change to handle larger sector sizes, and suggested that filesystems and partition-handling code were likely suspects. He also said that the ability to do page-sized I/O is a fundamental assumption throughout the page cache code. Jan Kara mentioned reclaim as another area that makes that assumption. Linux pages are typically 4K in size.

One way to deal with that would be with chunks, as IRIX did with its chunk cache, Chinner said. It was a multi-page buffer cache that was created to handle exactly this case. But, he is not at all sure we want to go down that path. Petersen mentioned that there is another commercial Unix that has a similar scheme.

There could also be a layer that allows for sub-sector reads and writes, though it would have to do read-modify-write cycles to write smaller pieces, which would be slow, Chinner said.
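The cost of such a layer is easy to see in a toy model. The sketch below assumes a hypothetical in-memory `LargeSectorDevice` that can only transfer whole sectors. Any write smaller than a sector must first read the full sector, patch it, and write it back, while a write that covers a whole sector needs no read at all.

```python
class LargeSectorDevice:
    """Hypothetical device that only reads and writes whole sectors."""

    def __init__(self, sector_size, sector_count):
        self.sector_size = sector_size
        self.sectors = [bytes(sector_size) for _ in range(sector_count)]
        self.reads = 0
        self.writes = 0

    def read_sector(self, index):
        self.reads += 1
        return self.sectors[index]

    def write_sector(self, index, data):
        if len(data) != self.sector_size:
            raise ValueError("only whole-sector writes are supported")
        self.writes += 1
        self.sectors[index] = bytes(data)


def write_bytes(dev, offset, data):
    """Write data at an arbitrary byte offset using read-modify-write."""
    pos = 0
    while pos < len(data):
        index, within = divmod(offset + pos, dev.sector_size)
        chunk = min(dev.sector_size - within, len(data) - pos)
        if chunk == dev.sector_size:
            # The write covers the whole sector, so no read is needed.
            dev.write_sector(index, data[pos:pos + chunk])
        else:
            # Partial sector: read it, patch the affected bytes, write it back.
            buf = bytearray(dev.read_sector(index))
            buf[within:within + chunk] = data[pos:pos + chunk]
            dev.write_sector(index, buf)
        pos += chunk
```

In this model, a page-sized (4K) write to a device with 64K sectors turns into a 64K read followed by a 64K write.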

Unlike the advent of 4K drives, the industry is not pushing for support of larger sector sizes immediately, Wheeler said. There is time to solve the problems correctly. The right approach is to handle sectors that are larger than the page size first, then to build on that, Ted Ts'o said.

[ Thanks to the Linux Foundation for travel support to attend LSFMM. ]

Direct I/O status

By Jake Edge
April 2, 2014
2014 LSFMM Summit

Kent Overstreet spoke about his rewrite of the direct I/O (DIO) code in a session at this year's Linux Storage, Filesystem, and Memory Management (LSFMM) Summit. Direct I/O accesses the underlying device directly and avoids data being stored in the page cache.

Overstreet began by noting that the immutable biovec work was now upstream. That allows for more easily splitting a struct bio. The only remaining prerequisite for the DIO rewrite is a generic make_request() that can take arbitrarily sized bios. Once that's there, drivers will need to do the splitting.

So, the kernel can implement a DIO operation by allocating a bio and putting a "bunch of pages into it". It reduces the code size and complexity significantly. A lot of the work currently done to manipulate arrays of pages just goes away. The end goal should be that DIO should just be able to hand a bio to a filesystem and let it do the lookups, he said. In addition, hopefully the lower layers of the buffered I/O code can use the generic make_request() changes as well. Splitting a bio is now no less efficient than allocating two to begin with.
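Why an immutable segment array makes splitting cheap can be shown with a simplified model. This is a sketch only: the `Segment` and `Bio` classes here are illustrative and do not mirror the kernel's struct bio or its iterator. The point is that both halves of a split share the same, never-modified segment list and differ only in a starting offset and a size.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Segment:
    page: int     # hypothetical page identifier
    offset: int   # byte offset within the page
    length: int   # bytes in this segment


@dataclass(frozen=True)
class Bio:
    segments: Tuple[Segment, ...]  # shared and never modified
    start: int                     # bytes to skip from the front of segments
    size: int                      # bytes covered by this bio


def split(bio, nbytes):
    """Split after nbytes; no segment is copied or altered."""
    if not 0 < nbytes < bio.size:
        raise ValueError("split point must fall inside the bio")
    front = Bio(bio.segments, bio.start, nbytes)
    back = Bio(bio.segments, bio.start + nbytes, bio.size - nbytes)
    return front, back


def pieces(bio):
    """Yield (page, offset, length) for the bytes this bio covers."""
    skip, left = bio.start, bio.size
    for seg in bio.segments:
        if left == 0:
            break
        if skip >= seg.length:
            skip -= seg.length
            continue
        take = min(seg.length - skip, left)
        yield (seg.page, seg.offset + skip, take)
        left -= take
        skip = 0
```

A driver that cannot handle an arbitrarily sized request could, in this model, call `split()` repeatedly at its own limit without touching the page list that the submitter built.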

[ Thanks to the Linux Foundation for travel support to attend LSFMM. ]

FedFS, NFS, Samba, and user-space file servers

By Jake Edge
April 2, 2014
2014 LSFMM Summit

Ostensibly, the Linux Storage, Filesystem, and Memory Management (LSFMM) Summit is broken up into three tracks, but for the most part there is enough overlap between the Storage and Filesystem parts that joint sessions between them are the norm. The only session where that wasn't the case was a discussion led by Chuck Lever and Jim Lieb that spanned two slots. It covered user-space file servers and, in particular, FedFS and user-space NFS. As I was otherwise occupied in the Storage-only track, Elena Zannoni sat in on the discussion and provided some detailed notes on what went on.

Lever kicked things off by describing FedFS (short for "Federated Filesystem") as a way for administrators to create a single root namespace from disparate portions of multiple remote filesystems from multiple file servers. For that to work, there are filesystem entities known as "referrals" that contain the necessary information to track down the real file or directory entry on another filesystem. By building up referrals into the hierarchy the administrator wants to present, a coherent "filesystem" made up of files from all over can be created.

Referrals are physically stored in objects known as "junctions". Samba uses symbolic links to store the metadata needed to access the referred-to data, while FedFS uses the extended attributes (xattrs) on an empty directory for that purpose. At last year's meeting, it was agreed to converge on a single format for junctions. Symbolic links would be used, though Samba would still use the linked-to name, while FedFS would use xattrs attached to the link.

Since then, that decision was vetoed by the NFS developers, Lever said. So FedFS will stay with the empty directory to represent junctions. These empty directories are similar to the Windows concept of "reparse points", according to Steve French. It is an empty directory with a special attribute to distinguish it from a normal empty directory.
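As a rough illustration of how a junction stitches a namespace together, consider the following model. It is a sketch with hypothetical names (`Dir`, `resolve`, and the example location string are not taken from FedFS itself): a junction is a directory with no children that carries referral metadata, and a path walk stops at the first junction it meets and reports where the rest of the path has to be looked up.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Dir:
    children: Dict[str, "Dir"] = field(default_factory=dict)
    # Set only on junctions: the locations that hold the real data.
    referral: Optional[List[str]] = None


def resolve(root, path):
    """Walk path from root.

    Returns (locations, remaining_path) if a junction is crossed,
    or (None, "") if the whole path is local.
    """
    node = root
    parts = [p for p in path.split("/") if p]
    for i, name in enumerate(parts):
        node = node.children[name]  # KeyError if the name does not exist
        if node.referral is not None:
            return node.referral, "/".join(parts[i + 1:])
    return None, ""
```

A client that receives a referral would then continue the lookup of the remaining path on one of the listed locations.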

It would be nice to be able to add a new type of inode (or mode bit) to the virtual filesystem (VFS) to support that, French said. But Ted Ts'o noted that doing so would require all filesystems to change to support it.

Lever also explained that a single administrative interface that could manage junctions for both Samba and NFS is needed. In answer to a question from Jeff Layton, Lever said that FedFS was looking for help from the kernel on reparse points (or junctions), as well as performance help to reduce lookups for discovering these referrals and where they go. Lieb noted that the latter would also help the Ganesha user-space NFS file server.

Layton went on to explore why the symlink scheme could not be used, but Trond Myklebust was clear that using symbolic links was too ugly and hacky. It also limited the referral information to a single page, so supporting multiple protocols for a single referral was difficult. It is a "nasty hack" that Samba uses, but it is not sensible to spread it further, Myklebust said. It was agreed that more discussion was needed before any kind of proposal could be made.

The session switched over to Lieb at that point. He had a number of topics where user-space file servers (like Ganesha) needed kernel help. The first is for file-private locks, which may have been solved with Layton's work on that feature. Lieb hopes to see that get merged in the 3.15 merge window. The next step will be to get GNU C library (glibc) patches to support the new style of locks merged.

Another problem for Ganesha is with filtering inotify events. It would like to be able to get events for anything that some other filesystem does to the exported directories, while not getting notified for events generated by its own activities. The inotify events are used to invalidate caches in Ganesha and it is getting swamped by events from its own actions, Lieb said. Patches have been posted, but he would like to see the feature get added.

Dealing with user credentials is another area where Ganesha could use some kernel help. Right now, for many operations, Ganesha must do seven system calls to perform what is (conceptually) one operation. It must call setuid(), setfsgid(), setgroups(), then the operation followed by an unwinding of the three credentials calls. He would like a simpler way, with fewer system calls.
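The shape of that seven-call sequence can be sketched with a small model. Nothing here changes real credentials: `CallLog` is a hypothetical recorder standing in for the kernel, and the call names are simply those listed above. The context manager switches to the user's credentials, runs the operation, and then unwinds.

```python
from contextlib import contextmanager


class CallLog:
    """Hypothetical recorder; it only notes which calls would be made."""

    def __init__(self):
        self.calls = []

    def syscall(self, name, *args):
        self.calls.append(name)


@contextmanager
def as_user(log, uid, gid, groups, server_uid=0, server_gid=0, server_groups=()):
    # Three calls to take on the requesting user's identity...
    log.syscall("setuid", uid)
    log.syscall("setfsgid", gid)
    log.syscall("setgroups", list(groups))
    try:
        yield
    finally:
        # ...and three more to return to the server's own identity.
        log.syscall("setgroups", list(server_groups))
        log.syscall("setfsgid", server_gid)
        log.syscall("setuid", server_uid)


def open_as_user(log, uid, gid, groups, path):
    """One conceptual operation; seven recorded calls."""
    with as_user(log, uid, gid, groups):
        log.syscall("open", path)
```

Every request handled on behalf of a different user pays for the six surrounding calls, which is the overhead Lieb wants to reduce.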

After a long discussion, it became clear that what Lieb was looking for was a way to cache a particular set of credentials in the kernel that could be recalled whenever Ganesha needed to do an operation as that user. Currently, the constant setup and teardown of the credentials information is time-consuming. Ts'o thought he had a solution for that problem and he suggested that he and Lieb finish the discussion outside of the meeting.

The never-ending battle for some sort of enhanced version of readdir() was next up. There is a need to get much more information than what readdir() provides. A number of proposals have been made over the years, including David Howells's xstat() system call. Those proposals have all languished for lack of someone driving the effort. The older patches need to be resurrected, refreshed, and reposted, but it will require finding someone to push them before they will be merged.
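The gap is visible from user space today. In the sketch below (the function name is illustrative), Python's `os.scandir()` wraps the directory-reading interface and yields names, while sizes and timestamps need a separate `stat()` per entry. On Linux, each of those `stat()` calls is an additional system call, so listing a directory of N files with attributes costs on the order of N extra calls.

```python
import os


def list_with_attributes(path):
    """Return sorted (name, size, mtime) tuples for the entries in path."""
    results = []
    with os.scandir(path) as entries:
        for entry in entries:
            # The directory read supplied the name; the rest needs a stat.
            info = entry.stat(follow_symlinks=False)
            results.append((entry.name, info.st_size, info.st_mtime))
    return sorted(results)
```

An enhanced readdir() would let a file server get the same tuples back from the directory read itself.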

The last problem discussed was support for access-control lists (ACLs). There are two kinds of ACLs being used today: POSIX and rich (NFSv4) ACLs. They have different semantics and the question is how the kernel can support both. Currently, the kernel has support for POSIX ACLs, but Samba, NFSv4, and others use the rich ACLs.

One possible solution would be to just add rich ACLs to Linux, essentially sitting parallel to the existing POSIX ACLs. But Al Viro believes it is too complicated to have two similar features with slightly different semantics both in the kernel. There is also some thought that perhaps POSIX ACLs could be emulated by rich ACLs, but it is unclear if that is true. In the end, the kernel needs to do the ACL checking, since races and other problems are present if user space tries to do that job. The slot ended before much in the way of conclusions could be drawn.

[ Thanks to Elena Zannoni for her extensive notes. Thanks, also, to the Linux Foundation for travel support to attend LSFMM. ]

Page editor: Jake Edge

Copyright © 2014, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds