The straightforward way to use barriers without killing I/O performance is to have a concept of I/O "threads" (typically one or two per mounted filesystem) and make the barriers apply on a per I/O thread basis. This is easy to understand and to implement: when a barrier is issued on one I/O thread, requests on other I/O threads are allowed to proceed unimpeded.
The barrier itself can be implemented using a (queued) cache flush operation on devices that do not support barriers, a simple write barrier operation on devices that support a single I/O thread, and an I/O thread specific barrier operation on devices that support multiple I/O threads.
A typical journaled filesystem would normally have a minimum of two I/O threads per mounted filesystem or filesystem group - one for metadata operations and one for data operations. The idea of course is to allow a write barrier operation on the metadata I/O thread to hold up only future metadata writes without affecting outstanding data writes (i.e. writeback) on the data I/O thread.
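To make the per-thread ordering concrete, here is a minimal sketch in Python. It is a toy model, not block layer code: the `BARRIER` marker, the `ready_now` function and the thread names "meta" and "data" are all hypothetical, and the caller is assumed to drop a barrier from the queue once everything ahead of it on that thread has completed.

```python
BARRIER = "barrier"  # hypothetical marker for a barrier request


def ready_now(pending):
    """Return the requests that can be issued without waiting.

    `pending` is a list of (io_thread, op) pairs in submission order.
    A barrier blocks only later requests on its own I/O thread.
    """
    blocked = set()  # I/O threads that have reached a barrier
    ready = []
    for thread, op in pending:
        if thread in blocked:
            continue  # held up behind a barrier on the same thread
        if op == BARRIER:
            blocked.add(thread)
        else:
            ready.append((thread, op))
    return ready


pending = [
    ("meta", "journal write"),
    ("data", "writeback 1"),
    ("meta", BARRIER),
    ("meta", "metadata write"),
    ("data", "writeback 2"),
]

# Two I/O threads: the barrier holds up only the later metadata write.
per_thread = ready_now(pending)

# Same requests on one shared thread: the barrier also holds up writeback 2.
single = ready_now([("all", op) for _, op in pending])
```

With two threads, "writeback 2" is issuable immediately; with a single shared thread it has to wait behind the metadata barrier.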
If something like this is not done, every metadata sync operation (fsync for example) will necessarily require a complete flush of volatile device write caches for the pertinent devices, which is not exactly ideal if there is a considerable amount of data that doesn't really need to be flushed.
The SCSI architecture model allows parallel I/O threads on a _single_ device to be implemented using what they call "task sets", and it would be unfortunate if there were not a way for filesystems to take advantage of this capability, given the potential performance gains possible for any application that issues fsync or fdatasync operations in the presence of significant I/O contention from other threads or processes.
In fact it would be ideal to dynamically allocate I/O threads to files on which fdatasync operations are regularly issued, so that the underlying device can write just those blocks to disk (or non-volatile cache) without the need to write anything else (assuming the size of the file/inode has not changed).
Posted Aug 11, 2010 2:28 UTC (Wed) by koverstreet (subscriber, #4296)
Yeah, the idea I sketched out is roughly equivalent (in behavior) to doing it with threads. I should make it explicit that my idea doesn't do anything filesystems can't do themselves.
Threads are unwieldy when you want to express something more complicated than linear dependencies. Like you suggested, if you're just segregating metadata and data that's fine, but the actual dependencies are in practice usually more complex, so - provided you have an easy way of expressing them - it could in theory be a performance gain.
Anyway, if you want to pipeline I/Os all the way down to the SCSI layer, you need a way of expressing dependencies to the block layer, and that takes more than threads.
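As a rough illustration of dependencies that are not linear, the sketch below describes requests as a dependency graph and groups them into waves that could be in flight together. It uses Python's standard `graphlib` module; the request names and the `issue_waves` helper are made up for the example and do not correspond to any existing block layer interface.

```python
from graphlib import TopologicalSorter


def issue_waves(deps):
    """Group requests into waves that may be in flight at the same time.

    `deps` maps each request to the set of requests that must complete
    before it may be issued.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything unblocked right now
        waves.append(ready)
        sorter.done(*ready)  # pretend the whole wave has completed
    return waves


# Hypothetical requests: the commit record depends on the journal write
# and on data_a, but not on data_b.
deps = {
    "journal": set(),
    "data_a": set(),
    "data_b": set(),
    "commit": {"journal", "data_a"},
    "metadata": {"commit"},
}
waves = issue_waves(deps)
```

Here "commit" waits on one data write but not the other, which a pair of strictly ordered threads cannot express without either over-serializing or adding a third thread.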
I might have to write an actual patch and see what people think...
The Linux Storage and Filesystem Summit, day 1
Posted Aug 12, 2010 18:21 UTC (Thu) by butlerm (subscriber, #13312)
The advantage of threads is that they are simple, are already implemented by many existing block devices (SCSI ones at any rate), and allow the optimization of many common cases - journal write before metadata write without a round trip (or worse a cache flush) in between, for example.
They also make it very convenient for a filesystem to be notified when a series of block writes has been committed to disk, without being too involved in the low level details of how that is known to be the case.
On some devices any write barrier is most efficiently translated into a full cache flush, on others into waiting for completion of a series of writes with force unit access (FUA) specified. If the block interface does not provide I/O threads with write barriers or the equivalent, presumably a filesystem would be forced to choose one or the other, which would be highly inefficient in a number of cases.
With the proper threaded interface, the lower level device driver can choose how to implement the write barrier most efficiently. SATA devices (which seem to be unusually backward in this regard) probably need a full cache flush. On other devices you can either issue an explicit barrier or efficiently wait for a series of force unit access writes to individually complete. The filesystem shouldn't have to care about what is most efficient for any given device.
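A sketch of that division of labour might look like the following. Everything here is hypothetical: the `Capability` values, the command names and `barrier_plan` are invented for illustration and are not a real driver interface. The point is only that the filesystem asks for "a barrier on I/O thread N" and the driver picks the mechanism.

```python
from enum import Enum, auto


class Capability(Enum):
    """Hypothetical classes of device behaviour, for illustration only."""
    FLUSH_ONLY = auto()          # no barriers; needs a full cache flush
    FUA_WRITES = auto()          # can complete force unit access writes
    SINGLE_BARRIER = auto()      # one barrier covering the whole device
    PER_THREAD_BARRIER = auto()  # barrier scoped to one I/O thread


def barrier_plan(capability, io_thread, writes):
    """Return the (command, argument) list a driver might issue.

    The filesystem only asks for "these writes, then a barrier on
    io_thread"; the mechanism is chosen from the device capability.
    """
    if capability is Capability.PER_THREAD_BARRIER:
        return [("write", w) for w in writes] + [("barrier", io_thread)]
    if capability is Capability.SINGLE_BARRIER:
        return [("write", w) for w in writes] + [("barrier", None)]
    if capability is Capability.FUA_WRITES:
        # Each write bypasses the volatile cache; then wait for all of them.
        return [("write_fua", w) for w in writes] + [("wait_all", None)]
    # Fallback: plain writes followed by a full cache flush.
    return [("write", w) for w in writes] + [("cache_flush", None)]


writes = ["journal block 1", "journal block 2"]
sata_like = barrier_plan(Capability.FLUSH_ONLY, 1, writes)
fua_like = barrier_plan(Capability.FUA_WRITES, 1, writes)
```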
Posted Aug 19, 2010 13:13 UTC (Thu) by cypherpunks (guest, #1288)
I've had a similar idea, which is specifically designed for easy hardware implementation: allow an operation to have a (small integer) tag, then provide a command to "wait for all operations with tag #k to complete".
More generally, you could let every operation have a prerequisite tag that must be completed (you need one reserved tag number which is never used to specify commands with no prerequisites), and have the wait operation be a NOP with a prerequisite.
To merge threads, the wait operation can have a tag #n which differs from the #k it is waiting for. After it is issued, waiting for #n effectively waits for both.
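The tag scheme can be modelled in a few lines. This is a toy simulation under assumed semantics, not a hardware design: an operation may start once every earlier operation carrying its prerequisite tag has completed, tag 0 is the reserved "no prerequisite" value, and the names `Op`, `NO_TAG` and `schedule` are hypothetical.

```python
from dataclasses import dataclass

NO_TAG = 0  # reserved tag: "no prerequisite" (and "untagged")


@dataclass(frozen=True)
class Op:
    name: str
    tag: int = NO_TAG     # tag this operation carries
    prereq: int = NO_TAG  # tag that must drain before this may start


def schedule(ops):
    """Group operations into waves that may start together.

    An operation may start once every earlier operation carrying its
    prerequisite tag has completed. Each wave is assumed to complete
    before the next wave is computed.
    """
    done = set()
    waves = []
    while len(done) < len(ops):
        wave = []
        for i, op in enumerate(ops):
            if i in done:
                continue
            earlier = [j for j in range(i) if ops[j].tag == op.prereq]
            if op.prereq == NO_TAG or all(j in done for j in earlier):
                wave.append(i)
        done.update(wave)
        waves.append([ops[i].name for i in wave])
    return waves


ops = [
    Op("a", tag=1),
    Op("b", tag=2),
    Op("nop", tag=2, prereq=1),  # wait for tag 1, but carry tag 2
    Op("c", prereq=2),           # waits for b and nop, hence also for a
    Op("d"),                     # no prerequisite at all
]
waves = schedule(ops)
```

Because the NOP carries tag 2 while waiting on tag 1, anything that later waits on tag 2 ends up waiting for both streams, which is the merge described above.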
Now, you can merge independent operation streams by doing address translation on tags. And you can compress tag space (down to simple barriers, in the limiting case) by allowing false sharing.
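Both tricks can be sketched the same way, again as a toy model with made-up names (`merge`, `compress`, `direct_deps`) and operations written as `(name, tag, prereq)` tuples. Merging offsets one stream's tags so they cannot collide; compression folds tags onto a smaller set, which only ever adds dependencies.

```python
NO_TAG = 0  # reserved tag: "no prerequisite" (and "untagged")


def merge(stream_a, stream_b, offset):
    """Merge two streams by shifting stream_b's tags out of stream_a's range."""
    def shift(t):
        return NO_TAG if t == NO_TAG else t + offset
    return stream_a + [(n, shift(t), shift(p)) for n, t, p in stream_b]


def compress(ops, n_tags):
    """Fold tags onto 1..n_tags; distinct tags may now share a number."""
    def fold(t):
        return NO_TAG if t == NO_TAG else 1 + (t - 1) % n_tags
    return [(n, fold(t), fold(p)) for n, t, p in ops]


def direct_deps(ops):
    """Map each operation to the earlier operations it must wait for."""
    return {
        name: {ops[j][0] for j in range(i)
               if prereq != NO_TAG and ops[j][1] == prereq}
        for i, (name, _tag, prereq) in enumerate(ops)
    }


stream_a = [("a", 1, NO_TAG), ("c", NO_TAG, 1)]  # c waits for tag 1 (a)
stream_b = [("b", 1, NO_TAG), ("d", NO_TAG, 1)]  # d waits for tag 1 (b)

merged = merge(stream_a, stream_b, offset=1)  # b now carries tag 2
exact = direct_deps(merged)
barrier_like = direct_deps(compress(merged, n_tags=1))
```

With a single tag left, "wait for tag 1" means "wait for every earlier tagged operation", i.e. a plain barrier: still correct, just with false sharing between the two streams.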