The 2010 Linux Storage and Filesystem Summit, day 1

By Jonathan Corbet
August 9, 2010

The fourth Linux storage and filesystem summit was held August 8 and 9 in Boston, immediately prior to LinuxCon. This time around, a number of developers from the memory management community were present as well. Your editor was also there; what follows are his notes from the first day of the summit.

Testing tools

The first topic of the workshop was "testing and tools," led by Eric Sandeen. The 2009 workshop identified a generic test suite as something that the community would very much like to have. One year later, quite a bit of progress has been made in the form of the xfstests package. As the name suggests, this test suite has its origins in the XFS filesystem, and it is still somewhat specific to XFS. But, with the addition of generic testing last May, about 70 of the 240 tests are now generic. Xfstests is concerned primarily with regression testing; it is not, generally, a performance-oriented test suite. Tests tend to get added when somebody stumbles across a bug and wants to verify that it's fixed - and that it stays fixed. Xfstests also does not have any real facility for the creation of test filesystems; tools like Impressions are best used for that purpose.

About 40 new tests have been added to xfstests over the last year; it is now heavily used in ext4 development. Most tests look for specific bugs; there isn't a whole lot of coverage for extreme situations - millions of files in one directory and such. Those tests just tend to take too long.

It was emphasized that just running xfstests is not enough on its own; tests must be run under most or all reasonable combinations of mount options to get good coverage. Ric Wheeler also pointed out that different types of storage have very different characteristics. Most developers, he fears, tend to test things on their laptops and call the result good. Testing on other types of storage, of course, requires access to the hardware; USB sticks are easy, but not all developers can test on enterprise-class storage arrays.

Tests which exercise more of the virtual memory and I/O paths would also be nice. There is one package which covers much of this ground: FIO, available from kernel.dk. Destructive power failure testing is another useful area which Red Hat (at least) is beginning to do. There has also been some work done using hdparm to corrupt individual sectors on disk to see how filesystems respond. A wishlist item was better latency measurement, with an emphasis on seeing how long I/O requests sit within drivers which do their own queueing. It was suggested that what is really needed is some sort of central site for capturing wishlist ideas for future tests; then, whenever somebody has some time available, those ideas are available.

In an attempt to better engage the memory management developers in the room, it was asked: how can we make tests which cover writeback? The key, it seems, is to choose a workload which is large enough to force writeback, but not so large that it pushes the system into heavy swapping. One simple test is a large tar run; while that is happening, monitor /proc/vmstat to see when writeback kicks in, and when things get bad enough that direct reclaim is invoked. An arguably more representative test can be had with sysbench; again, the key is to tune it so that the shared buffers fit within physical memory.

But, as Nick Piggin pointed out, anything that dirties memory is, in the end, a writeback test. The key is to find ways of making tests which adequately model real-world workloads.

Memory-management testing

Your editor is messing with the timing here, but the session on testing of memory management changes fits well with the above. So please ignore the fact that this session actually happened after lunch.

The question here is simple: how can memory management changes be tested to satisfy everybody? This is a subject which has been coming up for years; memory management changes seem to be especially subject to "have you tested with this other kind of workload?" questions. Developers find this frustrating; it never seems to be possible to do enough testing to satisfy everybody, especially since people asking for testing of different workloads are often unable or unwilling to supply an actual test program.

It was suggested that the real test should be "put the new code on the Google cluster and see if the Internet breaks." There are certain practical difficulties with this approach, however. So the question remains: how can a developer conclude that a memory management change actually works? Especially in a situation where "works" means different things to different people? There is far too wide a variety of workloads to test them all. Beyond that, memory management changes often involve tradeoffs - making one workload better may mean making another one worse. Changes which make life better for everybody are rare.

Still, it was agreed that a standard set of tests would help. Some suggestions were made, including hackbench, netperf, various database benchmarks (pgbench or sysbench, for example), and the "compilebench" test which is popular with kernel developers. There was also some talk of microbenchmarks; Nick Piggin noted that microbenchmarks are terrible when it comes to arguing for the inclusion of a change, but they can be useful for the detection of performance regressions.

Sometimes running a single benchmark is not enough; many memory management problems are only revealed when the system comes under a combination of stresses. And Andrea Arcangeli made the point that, in the end, only one test really matters: how much time does it take the system to complete running a workload which exceeds the amount of physical RAM available?

There was some discussion of the challenges involved in tracking down problems; Mel Gorman stated that the debugability of the virtual memory subsystem is "a mess." Tracepoints can be useful for this purpose, but they are hard to get merged, partly due to Andrew Morton's hostility to tracepoints in general. There is also ongoing concern about the ABI status of tracepoints; what happens when programs (perhaps being run by large customers of enterprise distributions) depend on tracepoints which expose low-level kernel details? Those tracepoints may no longer make sense after the code changes, but breaking them may not be an option.

Filesystem freeze/thaw

The filesystem freeze feature enables a system administrator to suspend writes to a filesystem, allowing it to be backed up or snapshotted while in a consistent state. It had its origins in XFS, but has since become part of the Linux VFS layer. There are a few issues with how freeze works in current kernels, though.

The biggest of these problems is unmounting - what happens when the administrator unmounts a frozen filesystem? In current kernels, the whole thing hangs, leaving the system with an unusable, un-unmountable filesystem - behavior which does not further the Linux World Domination goal. So four possible solutions were proposed:

Simply disallow the unmounting of frozen filesystems. Al Viro stated that this solution is not really an option; there are cases where the unmount cannot be disallowed. When the final process exits the namespace in which the filesystem is mounted is one of those cases. Disallowing unmounts would also break the useful umount -l option, which is meant to work at all times.
Keep the filesystem frozen across the unmount, so that the filesystem would still be frozen after the next mount. The biggest problem here is that there may be changes that the filesystem code needs to write to the device; if the system reboots before that can happen, bad things can result.
Automatically thaw filesystems on unmount.
Add a new ioctl() command which will cause the thawing of an unmounted filesystem.

Al suggested a variant on #3, in the form of a new freeze command. The proper way to handle freeze is to return a file descriptor; as long as that file descriptor is held open, the filesystem remains frozen. This solves the "last process exits" problem because the file descriptor will be closed as the process exits, automatically causing the filesystem to be thawed. Also, as Al noted, the kill command is often the system recovery tool of choice for system administrators, so having a suitably-targeted kill cause a frozen filesystem to be thawed makes sense.

There seemed to be a consensus that the file descriptor approach is the best long-term solution. Meanwhile, though, there are tools based on the older ioctl() commands which will take years to replace in the field. So we might also see an implementation of #4, above, to help in the near future.

Barriers

Contemporary filesystems go to great lengths to avoid losing data - or corrupting things - if the system crashes. To that end, quite a bit of thought goes into writing things to disk in the correct order. As a simple example, operations written to a filesystem journal must make it to the media before the commit record which marks those operations as valid. Otherwise, the filesystem could end up replaying a journal with random data, with an outcome that few people would love.

All of that care is for nothing, though, if the storage device reorders writes on their way to the media. And, of course, reordering is something that storage devices do all the time in the name of increasing performance. The solution, for years now, has been "barrier" operations; all writes issued before a barrier are supposed to complete before any writes issued after the barrier. The problem is that barriers have not always been well supported in the Linux block subsystem, and, when they are supported, they have a significant impact on performance. So, even now, many systems run with barriers disabled.

Barriers have been discussed among the filesystem and storage developers for years; it was hoped that this year, with the memory management developers present as well, some better solutions might be found.

There was some discussion about the various ways of implementing barriers and making them faster. The key point in the discussion, though, was the assertion that barriers are not really the same as the memory barriers they were patterned after. There are, instead, two important aspects to block subsystem barriers: request ordering and forcing data to disk. That led, eventually, to one of the clearest decisions in the first day of the summit: barriers, as such, will be no more. The problem of ordering will be placed entirely in the hands of filesystem developers, who will ensure ordering by simply waiting for operations to complete when needed. There will be no ordering issues as such in the block layer, but block drivers will be responsible for explicitly flushing writes to physical media when needed.

Whether this decision will lead to better-performing and more robust filesystem I/O remains to be seen, but it is a clearer description of the division of responsibilities than has been seen in the past.

Transparent hugepages

At this point, the summit split into three tracks for storage, filesystem, and memory management developers. Your editor followed the memory management track; with luck, we'll eventually have writeups from the other tracks as well.

Andrea Arcangeli presented his transparent hugepages work, starting with a discussion of the advantages of hugepages in general. Hugepages are a feature of many contemporary processors; they allow the memory management subsystem to use larger-than-normal page sizes in parts of the virtual address range. There are a number of advantages to using hugepages in the right places, but it all comes down to performance.

A hugepage takes much of the pressure off the processor's translation lookaside buffer (TLB), speeding memory access. When a TLB miss happens anyway, a 2MB hugepage requires traversing three levels of page table rather than four, saving a memory access and, again, reducing TLB pressure. The result is a doubling of the speed with which initial page faults can be handled, and better application performance in general. There can be some costs, especially when entire hugepages must be cleared or copied; that can wipe out much of the processor's cache. But this cost tends to be overwhelmed by the performance advantages that hugepages bring.

Those advantages, incidentally, are multiplied when hugepages are used on systems hosting virtualized guests. Using hugepages in this situation can eliminate much of the remaining cost of running virtualized systems.

Hugepages on Linux are currently accessed through the hugetlbfs filesystem, which was discussed in great detail by Mel Gorman on LWN earlier this year. There are some real limitations associated with hugetlbfs, though: hugepages are not swappable, they must be reserved at boot time, there is no mixing of page sizes in the same virtual memory area, etc. Many of these problems could be fixed, but, as Andrea put it, hugetlbfs is becoming a sort of secondary - and inferior - Linux virtual memory subsystem. It is time to turn hugepages into first-class citizens in the real Linux VM.

Transparent hugepages eliminate much of the need for hugetlbfs by automatically grouping together sections of a process's virtual address space into hugepages when warranted. They take away the hassles of hugetlbfs and make it possible for the system to use hugepages with no need for administrator intervention or application changes. There seems to be a fair amount of interest in the feature; a Google developer said that the feature is attractive for internal use.

At the core of the patch is a new thread called khugepaged, which is charged with scanning memory and creating hugepages where it makes sense. Other parts of the VM can split those hugepages back into normally-sized pieces when the need arises. Khugepaged works by allocating a hugepage, then using the migration mechanism to copy the contents of the smaller component pages over. There was some talk of trying to defragment memory and "collapse in place" instead, but it doesn't seem worth the effort at this point. The amount of work to be done would be about the same except in the special case where a hugepage had been split and was being grouped back together before much had changed - a situation which is expected to be relatively rare.

Andrea put up a number of benchmarks showing how transparent hugepages improve performance; the all-important kernel compile benchmark (seen as a sort of worst case for hugepages) is 2.5% faster. Various other benchmarks show bigger improvements.

Transparent hugepages, it seems, will be enabled by default in the RHEL 6 kernel. Andrea would really like to get the feature into 2.6.36, but the merge window is already well advanced and it's not clear that things will work out that way. There is still a need to convince Linus that the feature is worthwhile, and, perhaps, some work to be done to enable the feature on SPARC systems.

mmap_sem

The memory map semaphore (mmap_sem) is a reader-writer semaphore which protects the tree of virtual memory area (VMA) structures describing each address space. It is, Nick Piggin says, one of the last nasty locking issues left in the virtual memory subsystem. Like many busy, global locks, mmap_sem can cause scalability problems through cache line bouncing. In this case, though, simple contention for the lock can be a problem; mmap_sem is often held while disk I/O is being performed. With some workloads, the amount of time that mmap_sem is held can slow things down significantly.

Various groups, including developers at HP and Google, have chipped away at the mmap_sem problem in the past, usually by trying to drop the semaphore in various paths. These patches have all run into the same problem, though: Linus hates them. In particular, he seems to dislike the additional complication added to the retry paths which must be followed when things change while the lock is dropped. So none of this work has gotten into the mainline.

There have also been some unfair rwsem proposals aimed at reducing mmap_sem contention; these have run aground over fears of writer starvation.

According to Nick, the real problem is the red-black tree used to track allocated address space; the data structure is cache-unfriendly and requires a global lock for its protection. His idea is to do away with this rbtree and associate VMAs directly with the page table entries, protecting them with the PTE locks. This approach would eliminate much of the locking entirely, since the page tables must be traversable without locks, and solve the mmap_sem problem.

That said, there are some challenges. A VMA associated with a page table entry can cover a maximum of 2MB of address space; larger areas would have to be split into (possibly a large number of) smaller VMAs. It's not clear how this mechanism would then interact with hugepages. The instantiation of large VMAs would require the creation of the full range of PTEs, which is not required now; that could hurt applications with very sparsely-populated memory areas. Growing VMAs would have its own challenges. There is also the issue of free space allocation, a problem which might be solved by preallocating ranges of addresses to each thread sharing an address space. In summary, the list of obstacles to be overcome before this idea becomes practical looks somewhat daunting.

The developers in the room seemed to not be entirely comfortable with this approach, but nobody could come up with a fundamental reason why it would not work. So we'll probably be seeing patches from Nick exploring this idea in the future.

copyfile()

The reflink() system call was originally proposed as a sort of fast copy operation; it would create a new "copy" of a file which shared all of the data blocks. If one of the files were subsequently written to, a copy-on-write operation would be performed so that the other file would not change. LWN readers last heard about this patch last September, when Linus refused to pull it for 2.6.32. Among other things, he didn't like the name.

So now reflink() is back as copyfile(), with some proposed additional features. It would make the same copy-on-write copies on filesystems that support it, but copyfile() would also be able to delegate the actual copy work to the underlying storage device when it makes sense. For example, if a file is being copied on a network-mounted filesystem, it may well make sense to have the server do the actual copy work, eliminating the need to move the data over the network twice. The system call might also do ordinary copies within the kernel if nothing faster is available.

The first question that was asked is: should copyfile() perhaps be an asynchronous interface? It could return a file descriptor which could be polled for the status of the operation. Then, graphical utilities could start a copy, then present a progress bar showing how things were going. Christoph Hellwig was adamant, though, that copyfile() should be a synchronous operation like almost all other Linux system calls; there is no need to create something weird and different here. Progress bars neither justify nor require the creation of asynchronous interfaces.

There was also opposition to the mixing of the old reflink() idea with that of copying a file. There is little perceived value in creating a bad version of cp within the kernel. The two ideas were mixed because it seems that Linus seems to want it that way, but, after this discussion, they may yet be split apart again.

Dirty limits

Jan Kara led a short discussion on the problem of dirty limits. The tuning knob found at /proc/sys/vm/dirty_ratio contains a number representing a percentage of total memory. Any time that the number of dirty pages in the system exceeds that percentage, processes which are actually writing data will be forced to perform some writeback directly. This policy has a couple of useful results: it helps keep memory from becoming entirely filled with dirty pages, and it serves to throttle the processes which are creating dirty pages in the first place.

The default value for dirty_ratio is 20, meaning that 20% of memory can be dirty before processes are conscripted into writeback duties. But that turns out to be too low for a number of applications. In particular, it seems that many Berkeley DB applications exhibit behavior where they dirty a lot of pages all over memory; setting dirty_ratio too low causes a lot of excessive I/O and serious performance issues. For this reason, distributions like RHEL raise this limit to 40% by default.

But 40% is not an ideal number either; it can lead to a lot of wasted memory when the system's workloads are mostly sequential. Lots of dirty pages can also cause fsync() calls to take a very long time, especially with the ext3 filesystem. What's really needed is a way to set this parameter in a more automatic, adaptive way, but exactly how that should be done is not entirely clear.

What is likely to happen in the short term is that a user-space daemon will be written to experiment with various policies for dirty_ratio. Some VM tracepoints can be used to catch events and tune things accordingly. A system which is handling a lot of fsync() calls should probably have a lower value of dirty_ratio, for example. In the absence of reasons to the contrary, the daemon can try to nudge the limit higher and try to see if applications perform better. This kind of heuristic experimentation has its hazards, but there does not seem to be a better method on offer at the moment.

Topology and alignment

There was a brief session on storage device topology issues; unfortunately, it was late in the day and your editor's notes are increasingly fuzzy. Much of the discussion had to do with 4K-sector disks. There are still issues, it seems, with drives which implement strange sector alignments in an attempt to get better performance with some versions of Windows. Linux can cope with these drives, but only if the drives themselves export information on what they are doing. Not all hardware provides that information, unfortunately.

Meanwhile, the amount of software which does make use of the topology information exported through the kernel is increasing. Partitioning tools are getting smarter, and the device mapper now uses this information properly. The readahead code will be tweaked to create properly-aligned requests when possible.

Lightning talks

The last session of the day was dedicated to three lightning talks. The first, by Matthew Wilcox, had to do with merging of git trees. Quite a bit of work in the VM/filesystem/storage area depends on changes made in a number of different trees. Making those trees fit together can be a bit of a challenge. That problem can be solved in linux-next, but those solutions do not necessarily carry over into the mainline, where trees may be pulled in just about any order - or not at all. The result is a lot of work and merge-window scrambling by developers, who are getting a little tired of it.

So, it was asked, is it time for a git tree dedicated to storage as a whole, and a storage maintainer to go with it? The idea was to create something like David Miller's networking tree, which is the merge point for almost all networking-related patches. James Bottomley made the mistake of suggesting that this kind of discussion could not go very far without a volunteer to manage that tree; he was then duly volunteered for the job.

The discussion moved on to how this tree would work, and, in particular, whether its maintainer would become the "overlord of storage," or whether it would just be a more convenient place to work out merge conflicts. If its maintainer is to be a true overlord, a fairly hardline approach will need to be taken with regard to when patches would have to be ready for merging. It's not clear whether the storage community is ready to deal with such a maintainer. So, for the near future, James will run the tree as a merge point to see whether that helps developers get their code into the mainline. If it seems like there is need for a real storage maintainer, that question will be addressed after a development cycle or two.

Dan Magenheimer presented his Cleancache proposal, mostly with an eye toward trying to figure out a way to get it merged. There is still some opposition to it, and its per-filesystem hooks in particular. It's hard to see how those hooks can be avoided, though; Cleancache is not suitable for all filesystems and, thus, may not be a good fit for the VFS layer. The crowd seemed reasonably amenable to merging the patches, but the chief opponent - Christoph Hellwig - was not in the room at the time. So no real conclusions have been reached.

The final lightning talk came from Boaz Harrosh, who talked about "stable pages." Currently, pages which are currently under writeback can be modified by filesystem code. That's potentially a data integrity problem, and it can be fatal in situations where, for example, checksums of page contents are being made. That is why the RAID5 code must copy all pages being written to an array; changing data would break the RAID5 checksums. What, asked Boaz, would break if the ability to change pages under writeback were withdrawn?

The answer seems to be that nothing would break, but that some filesystems might suffer performance impacts. The only way to find out for sure, though, is to try it. As it happens, this is a relatively easy experiment to run, so filesystem developers will probably start playing with it sometime soon.

That was the end of the first day of the summit; reports from the second day will be posted as soon as they are ready.

Index entries for this article
Kernel	Block layer
Kernel	Filesystems/Workshops
Conference	Storage, Filesystem, and Memory-Management Summit/2010

The Linux Storage and Filesystem Summit, day 1

Posted Aug 9, 2010 4:51 UTC (Mon) by neilbrown (subscriber, #359) [Link] (13 responses)

> barriers, as such, will be no more.

Hurray!!!!! I've never liked barriers, as such.

The Linux Storage and Filesystem Summit, day 1

Posted Aug 9, 2010 5:30 UTC (Mon) by dlang (guest, #313) [Link] (11 responses)

on the other hand, forcing the filesystem to figure out how to know when the hardware has completed a write is not very good either.

And if the hardware supports a 'do not reorder across this' barrier, the need to fully flush things to disk before writing the things that would be after the barrier is a significant performance loss

I don't see how pushing the implementation further from the hardware will help, but we'll see what happens.

The Linux Storage and Filesystem Summit, day 1

Posted Aug 9, 2010 7:02 UTC (Mon) by neilbrown (subscriber, #359) [Link] (2 responses)

The hardware has lower latency for some operations, but the filesystem has more knowledge about what is required. Moving decisions away from the hardware can be bad, but moving them closer to the filesystem can be good. Finding the right balance is hard.

As you say, we'll see what happens.

The Linux Storage and Filesystem Summit, day 1

Posted Aug 9, 2010 8:17 UTC (Mon) by dlang (guest, #313) [Link] (1 responses)

the latency is that without the ability to pass the instruction on to the lower levels, the only thing the filesystem can do is the let all the queues empty

The Linux Storage and Filesystem Summit, day 1

Posted Aug 9, 2010 8:25 UTC (Mon) by koverstreet (✭ supporter ✭, #4296) [Link]

Only if there's nothing else to put in the queues, and that's not the case you need to optimize for as much.

The Linux Storage and Filesystem Summit, day 1

Posted Aug 9, 2010 8:13 UTC (Mon) by koverstreet (✭ supporter ✭, #4296) [Link] (7 responses)

Considering that Chris Mason is in favor of this... who is a filesystem author... that should tell you something.

The problem is mainly that current barriers don't really mean any one thing, they're poorly defined... and what filesystems really need is to know when something is on disk.

Also, ordering is really not a simple matter. It's not just a matter of the disk reordering it... it's the queue, and any virtual block devices in between (raid/lvm/caching (I'm the author of bcache, so this has been on my mind lately). Introducing an artificial global ordering on all the ios you see is a pain in the ass... so if the filesystems don't need it, and it isn't needed for performance, that's a lot of complexity you can cut out.

Personally I think being able to specify an ordering on bios you submit would be a useful thing, partly in filesystem complexity and with higher latency storage it should potentially be a performance gain. But do not think it's a simple thing to implement, or necessary - and the current barriers certainly aren't that, so getting rid of them is a good thing.

I was just thinking today that if we are going to try and implement the ability to order in flight bios, probably the way to do it would be to first implement it only in the generic block layer, _maybe_ the io scheduler at first - and come up with some working semantics. The generic block layer could support it regardless of hardware support simply by waiting to queue a bio until it had no outstanding dependencies.

Such an interface could then live or die depending on if filesystems were actually able to make good use of it - if it makes life harder for them, it's probably not worth it. If it did prove useful, the implementation could be extended down the layers till it got to the hardware.

-------------

Just musing for a bit about what that interface might be - here's a guess:
You probably want a bio to be able to depend on multiple bios, the way that looks sanest to me is add two fields:
atomic_t bi_depends /* bio may not be started until 0 */
struct bio *bi_parent /* if not NULL, decrement bi_parent->bi_depends */

You'd then have to add a bit in bi_flags to indicate error on a depending bio - so it was never started: if when completing a bio, before you decrement bi_parent->bi_depends, if (error) set the error flag on the parent.

With that feature I don't know if you could use NCQ - I'm not a SCSI guy - but without it it seems fairly useless, or else dangerous to use; you can't, for example, rewrite your superblock to point to the new btree root if writing the new root failed. I suppose you could use it if you used a log or a journal to write the current btree root, and then pointers to btree nodes contained the checksum of the node they pointed to - and both are good ideas, but there's still plenty of other situations where you wouldn't be able to recover (and in that case you don't really need write ordering anyways).

The Linux Storage and Filesystem Summit, day 1

Posted Aug 10, 2010 17:29 UTC (Tue) by butlerm (subscriber, #13312) [Link] (3 responses)

The straightforward way to use barriers without killing I/O performance is to have a concept of I/O "threads" (typically one or two per mounted filesystem) and make the barriers apply on a per I/O thread basis. Very easy to understand and to implement - when a barrier is issued on one I/O thread requests on other I/O threads are allowed to proceed unimpeded.

The barrier itself can be implemented using a (queued) cache flush operation on devices that do not support barriers, a simple write barrier operation on devices that support a single I/O thread, and an I/O thread specific barrier operation on devices that support multiple I/O threads.

A typical journaled filesystem would normally have a minimum of two I/O threads per mounted filesystem or filesystem group - one for metadata operations and one for data operations. The idea of course is to allow a write barrier operation on the metadata I/O thread to hold up only future metadata writes without affecting outstanding data writes (i.e. writeback) on the data I/O thread.

If something like this is not done, every metadata sync operation (fsync for example) will necessarily require a complete flush of volatile device write caches for the pertinent devices, which is not exactly ideal if there is a considerable amount of data that doesn't really need to be flushed.

The SCSI architecture model allows parallel I/O threads on a _single_ device to be implemented using what they call "task sets", and it would be unfortunate if there were not a way for filesystems to take advantage of this capability, given the potential performance gains possible for any application that issues fsync or fdatasync operations in the presence of significant I/O contention from other threads processes.

In fact it would be ideal to dynamically allocate I/O threads to files on which fdatasync operations are regularly issued, so that the underlying device can write just those blocks to disk (or non-volatile cache) without the need to write anything else (assuming the size of the file/inode has not changed).

The Linux Storage and Filesystem Summit, day 1

Posted Aug 11, 2010 2:28 UTC (Wed) by koverstreet (✭ supporter ✭, #4296) [Link] (2 responses)

Yeah, the idea I sketched out is roughly equivalent (in behavior) to doing it with threads. I should make it explicit that my idea doesn't do anything filesystems can't do themselves.

Threads are unwieldy when you want to express something more complicated than linear dependencies. Like you suggested, if you're just segregating metadata and data that's fine, but the actual dependencies are in practice usually more complex, so - provided you have an easy way of expressing them - it could in theory be a performance gain.

Anyway, if you want to pipeline ios all the way down to the SCSI layer you need a way of expressing dependencies to the block layer, which needs more than threads.

I might have to write an actual patch and see what people think...

The Linux Storage and Filesystem Summit, day 1

Posted Aug 12, 2010 18:21 UTC (Thu) by butlerm (subscriber, #13312) [Link]

The advantage of threads is that they are simple, are already implemented by many existing block devices (SCSI ones at any rate), and allow the optimization of many common cases - journal write before metadata write without a round trip (or worse a cache flush) in between, for example.

They also make it very convenient for a filesystem to gain notification when a series of block writes have been committed to disk without being too involved with the low level details of how that is known to be the case.

On some devices any write barrier is most efficiently translated into a full cache flush, on others completion of a series of writes with force unit access specified. If the block interface does not provide I/O threads with write barriers or the equivalent, presumably a filesystem would be forced to choose one or the other, which would be highly inefficient in a number of cases.

With the proper threaded interface, the lower level device driver can choose how to implement the write barrier most efficiently. SATA devices (which seem to be unusually backward in this regard) probably need a full cache flush. Other devices you can either issue an explicit barrier, or you can efficiently wait for a series of force unit access writes to individually complete. The filesystem shouldn't have to care about what is most efficient for any given device.

The Linux Storage and Filesystem Summit, day 1

Posted Aug 19, 2010 13:13 UTC (Thu) by cypherpunks (guest, #1288) [Link]

I've had a similar idea, which is specifically designed for easy hardware implementation: allow an operation to have a (small integer) tag, then provide a command to "wait for all operations with tag #k to complete".

More generally, you could let every operation have a prerequisite tag that must be completed (you need one reserved tag number which is never used to specify commands with no prerequisites), and have the wait operation be a NOP with a prerequisite.

To merge threads, the wait operation can have a tag #n which differs from the #k it is waiting for. After it is issued, waiting for #n effectively waits for both.

Now, you can merge independent operation streams by doing address translation on tags. And you can compress tag space (down to simple barriers, in the limiting case) by allowing false sharing.

The Linux Storage and Filesystem Summit, day 1

Posted Aug 10, 2010 17:57 UTC (Tue) by butlerm (subscriber, #13312) [Link]

Correction: That should really be SCSI "I_T_L_Q nexus" or execution queue per I/O thread, not a SCSI "task set".

The Linux Storage and Filesystem Summit, day 1

Posted Aug 12, 2010 19:19 UTC (Thu) by butlerm (subscriber, #13312) [Link]

It turns out SCSI only supports ordering within the context of an initiator target (I_T) nexus, which means any given initiator usually only gets one I/O thread per device. The way around that limitation is to use to establish separate connections (I_T nexuses) for each I/O thread, but that probably isn't practical in most cases, even on something like iSCSI.

That is not to say that the SCSI folks shouldn't add real I/O thread support, because a write barrier at the device level (for all practical purposes) is not much more useful than a full cache flush.

The Linux Storage and Filesystem Summit, day 1

Posted Apr 5, 2011 14:55 UTC (Tue) by bredelings (subscriber, #53082) [Link]

>Introducing an artificial global ordering on all the ios you see is a pain
>in the ass...
Sure, what we want is a *partial* ordering, right?

The Linux Storage and Filesystem Summit, day 1

Posted Aug 9, 2010 20:11 UTC (Mon) by masoncl (subscriber, #47138) [Link]

>> barriers, as such, will be no more.

> Hurray!!!!! I've never liked barriers, as such.

Grin, there seems to be a large party dancing around the grave of the ordered barriers code.

But to clarify for the comments below, we do still want to issue cache flushes, the filesystems just promise to use wait_on_buffer/page for everything we care about first.

Most of the time these waits are already there...reiserfs is the biggest exception but that is very easily fixed since it is inside a big if (ordered_barriers) statement.

-chris

The Linux Storage and Filesystem Summit, day 1

Posted Aug 9, 2010 5:20 UTC (Mon) by neilbrown (subscriber, #359) [Link] (5 responses)

> but there does not seem to be a better method on offer at the moment.

As far as I can see, the main reason for setting dirty_ratio below about 50% is to limit that time it takes for "sync" to complete (and fsync on ext3 data=ordered filesystems) (as you go above 50% direct reclaim will trigger significantly more often and slow down memory allocation a lot).

So the tunable should be "how long is sync allowed to take". Then you need an estimate of the throughput of each bdi, and don't allow any bdi to gather more dirty memory than that estimate multiplied by the tunable.

Of course this is much more easily said than done - getting a credible estimate in an efficient manner is non-trivial. You can only really measure throughput during intensive write-out, and that probably happens mostly once dirty_ratio is reached, which is a bit late to be setting dirty_ratio.

I suspect some adaptive thing could be done - the first sync might be too slow, but long term it would sort it self out.

The Linux Storage and Filesystem Summit, day 1

Posted Aug 9, 2010 8:24 UTC (Mon) by koverstreet (✭ supporter ✭, #4296) [Link]

My thought when I read that is that what we really want is some statistics on dirty data - probably index it by length of sequential area (and track the average sequential size), and some heuristics for how likely dirty data is to be redirtied, if someone can come up with useful ones.

The idea being that if you're say, copying iso files there's no point in queuing up a gigabyte's worth - but bdb doing random io should be allowed to use more memory. Especially if you maintained those statistics per process, you'd be in good shape to do that.

Having never looked at the writeback code I've no idea what it does already, but it seems to me once you're keeping track of sequential chunks of dirty data it seems to me it'd be a great idea to write them out roughly in order of sequential size - writing out the isos you're copying before your berkeley db.

The Linux Storage and Filesystem Summit, day 1

Posted Aug 9, 2010 8:56 UTC (Mon) by james_ (guest, #55070) [Link] (3 responses)

As an example of a case where setting the ratio low is advantageous:

We were testing a NAS system recently. Our tests use 54 systems writing to the NAS server. The default value of /proc/sys/vm/dirty_ratio was 40. We saw very bad performance when we applied a large write to the system. The vendors technical support noted that the writes where going to the NAS out of order and that because we had a large number of writes we where defeating the NASs cache forcing the out of order writes to become a read modify write cycle. By dropping the value to for example 2 we saw the NAS system perform.

The Linux Storage and Filesystem Summit, day 1

Posted Aug 9, 2010 9:31 UTC (Mon) by neilbrown (subscriber, #359) [Link] (2 responses)

I completely agree - the case where I have seen the need for a low ratio was also when the writes were going out via NFS. The machine had, I think, 32Gig of RAM, so 40% was a lot. Even 1%, the smallest non-zero setting, took longer to flush than we really wanted.

Problems with out-of-order writes is an interesting twist on that!

The Linux Storage and Filesystem Summit, day 1

Posted Aug 15, 2010 17:52 UTC (Sun) by kleptog (subscriber, #1183) [Link] (1 responses)

I have a another case where the default settings don't work well. A process that is somewhat realtime produces 30MB/s of data writing to disk. Under the default settings the kernel will wait 30 seconds before writing anything (900MB) and if it doesn't get it out in a reasonable way that the process gets stuck because it used up all of the 20% of memory for its data.

The solution is to have the kernel check much more often the amount of data waiting (every second rather than every 5 seconds) and drastically reduce the amount of dirty memory there's allowed to be before write back happens.

Without this the kernel suddenly realises it has more than a gigabyte of data to writeback (20% of 8GB = 1.6GB) and manages to starve other processes trying to get it out. Whereas if it just writebacks small amounts in the background continuously everything goes smoothly. 1% works well, since that's what the storage subsystem can handle quickly.

Pity it's a global setting though, other processes would probably work better with a higher writeback threshold, but you can't pick and choose.

The Linux Storage and Filesystem Summit, day 1

Posted Aug 19, 2010 13:04 UTC (Thu) by cypherpunks (guest, #1288) [Link]

I can't help thinking that the solution looks a lot like a PID controller. That is, page write speed is defined by the sum of three terms: one proportional to the amount of excess dirty pages, one proportional to the integral, and one proportional to the derivative.

The latter is the "feed-forward" term, and helps respond quickly to sudden changes. If the rate of page dirtying increases sharply, the rate of writeback should likewise take a sudden jump.

The Linux Storage and Filesystem Summit, day 1

Posted Aug 9, 2010 12:16 UTC (Mon) by theraphim (subscriber, #25955) [Link] (7 responses)

> Progress bars neither justify nor require the creation of asynchronous interfaces.

LOL.
Seriously.
How one is supposed to figure out the state of the operation?
Let's say I'm trying to copy 500gig in a single file, somewhere on the network.
Is there any way I can monitor this? Am I making any progress during the past 1.5 hours of silence, or I was cut off the net for past hour?

Progress bars can be useful, and they are not always "bars".

The Linux Storage and Filesystem Summit, day 1

Posted Aug 9, 2010 12:43 UTC (Mon) by cesarb (subscriber, #6266) [Link] (4 responses)

> Is there any way I can monitor this?

Just stat() the destination?

The Linux Storage and Filesystem Summit, day 1

Posted Aug 9, 2010 13:58 UTC (Mon) by theraphim (subscriber, #25955) [Link] (2 responses)

There are different ways in which copy can be implemented. Especially if it's a copy handled by some remote filesystem.
Not all of them stat-friendly.
The similarly wrong approach would be to use "df" to see if space is being eaten.

The Linux Storage and Filesystem Summit, day 1

Posted Aug 9, 2010 14:46 UTC (Mon) by rvfh (guest, #31018) [Link] (1 responses)

Maybe duplicate the file descriptor and monitor the offset?

The Linux Storage and Filesystem Summit, day 1

Posted Aug 9, 2010 21:24 UTC (Mon) by theraphim (subscriber, #25955) [Link]

Here you assume you are doing linear copy in a single copying thread. There are cases in which this assumption is untrue, for example, when copying large file from one storage cluster to another you may want to copy chunks in parallel between the chunk servers.

The Linux Storage and Filesystem Summit, day 1

Posted Aug 9, 2010 14:46 UTC (Mon) by sync (guest, #39669) [Link]

A good copy would preallocate the destination file. stat() wouldn't work with this (always 100%).

The Linux Storage and Filesystem Summit, day 1

Posted Aug 10, 2010 5:16 UTC (Tue) by lkundrak (subscriber, #43452) [Link]

How do you monitor progress of, say, fsync()? I mean, how many system calls were designed with progress monitoring in mind?

I'm wondering why a call with less than file granularity was not considered (at least I am not aware of). If you could pick smaller units (multiplies of filesystem blocks?) to clone, you could probably also find more interesting uses apart from being able to reasonably control/monitor progress.

The Linux Storage and Filesystem Summit, day 1

Posted Aug 19, 2010 13:18 UTC (Thu) by renox (guest, #23785) [Link]

If I understood well, this 'copy' is a 'virtual copy' ie a kind of hardlink with COW semantic so this should be a fast operation so in this case you don't really need a progress bar.
For a real copy, the issue is there though.

The Linux Storage and Filesystem Summit, day 1

Posted Aug 9, 2010 16:10 UTC (Mon) by nix (subscriber, #2304) [Link] (1 responses)

a Google developer said that the feature is attractive for internal use

It's attractive for any application that has a heap of >2Mb (for the portions of the heap before the end), which is most of them. It'll probably even speed swapping: swapping in enforced 2Mb chunks is much more suited to modern disks than swapping in tiny little 4K pieces.

The Linux Storage and Filesystem Summit, day 1

Posted Aug 13, 2010 11:18 UTC (Fri) by i3839 (guest, #31386) [Link]

That's not generally true for SSDs, especially if NCQ is supported. When swapping in latency is usually more important than throughput, so you would want to only swap-in what you need. Maximizing thoughput by reading stuff you don't need is slightly silly.

With write-out 8 channels * 4 NAND chips per channel * 4KB page size = 128KB should be enough to get max write throughput (less for SLC). Buf if you're swapping, you're generally also swapping in, and writing a lot data will kill read latency, so you might want to limit the writes anyway.

Typo ?

Posted Aug 9, 2010 18:12 UTC (Mon) by abacus (guest, #49001) [Link] (1 responses)

There has also been some work done using hdparm to corrupt individual sectors in memory [ ... ]

Shouldn't that read "on disk" instead of "in memory" ?

Typo ?

Posted Aug 9, 2010 18:21 UTC (Mon) by rvfh (guest, #31018) [Link]

Please send typos to lwn@lwn.net :-)

For more info: http://lwn.net/op/FAQ.lwn#contact

The Linux Storage and Filesystem Summit, day 1

Posted Aug 9, 2010 20:07 UTC (Mon) by alonz (subscriber, #815) [Link]

WRT the freeze issue: isn't it easier to simply use the fd that issued the ioctl(FIFREEZE) as the “controlling” one?

That is—if this fd is closed, automatically thaw the filesystem?

This may require some magic somewhere in the VFS, but provides Al's new semantics without changing existing code…

The Linux Storage and Filesystem Summit, day 1

Posted Aug 10, 2010 14:59 UTC (Tue) by stevef (guest, #7712) [Link] (1 responses)

Resurrecting the copyfile id sounds great. CIFS has a copyfile equivalent on the wire. SMB2 has a "copy chunk" request (copy all or part of a file by range on the server).

The Linux Storage and Filesystem Summit, day 1

Posted Aug 25, 2010 21:38 UTC (Wed) by joib (subscriber, #8541) [Link]

FWIW, the NFS people are also planning to add something similar to the NFS protocol. See http://tools.ietf.org/html/draft-lentini-nfsv4-server-sid...

The 2010 Linux Storage and Filesystem Summit, day 1

Posted Aug 11, 2010 16:53 UTC (Wed) by ebiederm (subscriber, #35028) [Link]

Did no one talk about fixing our I/O path on swap?

Not long ago I had an emacs session go crazy pushing my laptop deep into
swap. I logged in via ssh and killed the emacs free up about 4G and removing the need for the system to be in swap at all.

Running swapoff -a I only achieved about 1MB/s instead of the 15-30MB/s
that the disk should have been able to achieve.

Grumble. All of the fancy algorithms in the world don't help unless we get the fundamentals right.

The 2010 Linux Storage and Filesystem Summit, day 1

Posted Aug 12, 2010 12:48 UTC (Thu) by jlayton (subscriber, #31672) [Link] (1 responses)

> The final lightning talk came from Boaz Harrosh, who talked about "stable
> pages." Currently, pages which are currently under writeback can be
> modified by filesystem code. That's potentially a data integrity problem,
> and it can be fatal in situations where, for example, checksums of page
> contents are being made.

This would be a boon for network filesystems too. Right now, it's rather nasty to deal with things like signing in NFS and CIFS. We have to take a checksum of the packet contents, but by the time you do that the page contents can change.

This is especially a problem with CIFS as the server will often drop the connection if the packet integrity seems to be compromised, and dropped connections with CIFS can mean the loss of a lot of state (open files, locks, etc).

The 2010 Linux Storage and Filesystem Summit, day 1

Posted Apr 5, 2011 23:13 UTC (Tue) by butlerm (subscriber, #13312) [Link]

This sounds like a good idea, but suppose a file had several pages that were mapped into some processes' memory with PROT_WRITE access. Wouldn't this require the filesystem to do the following?:

(1) temporarily mark those pages as write only while writeback was in progress
(2) catch any page faults during that time
(3) either duplicate the page using a copy on write strategy or hold the faulting process until writeback on that page was complete?