
The Linux Storage and Filesystem Summit, day 1


Posted Aug 9, 2010 5:30 UTC (Mon) by dlang (✭ supporter ✭, #313)
In reply to: The Linux Storage and Filesystem Summit, day 1 by neilbrown
Parent article: The 2010 Linux Storage and Filesystem Summit, day 1

on the other hand, forcing the filesystem to figure out how to know when the hardware has completed a write is not very good either.

And if the hardware supports a 'do not reorder across this' barrier, having to fully flush everything to disk before writing whatever comes after the barrier is a significant performance loss.

I don't see how pushing the implementation further from the hardware will help, but we'll see what happens.



The Linux Storage and Filesystem Summit, day 1

Posted Aug 9, 2010 7:02 UTC (Mon) by neilbrown (subscriber, #359) [Link]

The hardware has lower latency for some operations, but the filesystem has more knowledge about what is required. Moving decisions away from the hardware can be bad, but moving them closer to the filesystem can be good. Finding the right balance is hard.

As you say, we'll see what happens.

The Linux Storage and Filesystem Summit, day 1

Posted Aug 9, 2010 8:17 UTC (Mon) by dlang (✭ supporter ✭, #313) [Link]

the latency is that without the ability to pass the instruction on to the lower levels, the only thing the filesystem can do is let all the queues empty

The Linux Storage and Filesystem Summit, day 1

Posted Aug 9, 2010 8:25 UTC (Mon) by koverstreet (subscriber, #4296) [Link]

Only if there's nothing else to put in the queues, and that's not the case you need to optimize for as much.

The Linux Storage and Filesystem Summit, day 1

Posted Aug 9, 2010 8:13 UTC (Mon) by koverstreet (subscriber, #4296) [Link]

Considering that Chris Mason is in favor of this... who is a filesystem author... that should tell you something.

The problem is mainly that current barriers don't really mean any one thing, they're poorly defined... and what filesystems really need is to know when something is on disk.

Also, ordering is really not a simple matter. It's not just a matter of the disk reordering it... it's the queue, and any virtual block devices in between (raid/lvm/caching; I'm the author of bcache, so this has been on my mind lately). Introducing an artificial global ordering on all the ios you see is a pain in the ass... so if the filesystems don't need it, and it isn't needed for performance, that's a lot of complexity you can cut out.

Personally I think being able to specify an ordering on bios you submit would be a useful thing, partly for reducing filesystem complexity, and with higher latency storage it should potentially be a performance gain. But I do not think it's a simple thing to implement, or necessary - and the current barriers certainly aren't that, so getting rid of them is a good thing.

I was just thinking today that if we are going to try and implement the ability to order in flight bios, probably the way to do it would be to first implement it only in the generic block layer, _maybe_ the io scheduler at first - and come up with some working semantics. The generic block layer could support it regardless of hardware support simply by waiting to queue a bio until it had no outstanding dependencies.

Such an interface could then live or die depending on if filesystems were actually able to make good use of it - if it makes life harder for them, it's probably not worth it. If it did prove useful, the implementation could be extended down the layers till it got to the hardware.

-------------

Just musing for a bit about what that interface might be - here's a guess:
You probably want a bio to be able to depend on multiple bios; the way that looks sanest to me is to add two fields:
    atomic_t bi_depends;    /* bio may not be started until 0 */
    struct bio *bi_parent;  /* if not NULL, decrement bi_parent->bi_depends */

You'd then have to add a bit in bi_flags to indicate an error on a bio that another bio depends on, so that the dependent bio is never started: when completing a bio with an error, set the error flag on bi_parent before you decrement bi_parent->bi_depends.

With that feature I don't know if you could use NCQ - I'm not a SCSI guy - but without it it seems fairly useless, or else dangerous to use; you can't, for example, rewrite your superblock to point to the new btree root if writing the new root failed. I suppose you could use it if you used a log or a journal to write the current btree root, and then pointers to btree nodes contained the checksum of the node they pointed to - and both are good ideas, but there's still plenty of other situations where you wouldn't be able to recover (and in that case you don't really need write ordering anyways).
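A minimal userspace sketch of the bi_depends / bi_parent idea and the error flag described above. This is only a model under stated assumptions: the struct below is a toy and not the kernel's struct bio, a plain int stands in for atomic_t, and the names bi_dep_error, started, bio_add_dependency(), bio_try_start() and bio_complete() are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: this is NOT the kernel's struct bio. */
struct bio {
    int         bi_depends;   /* bio may not be started until 0 (atomic_t in the proposal) */
    struct bio *bi_parent;    /* if not NULL, decrement bi_parent->bi_depends on completion */
    bool        bi_dep_error; /* hypothetical flag: a bio we depended on failed */
    bool        started;      /* hypothetical: set once the bio has been queued */
};

/* Hypothetical helper: 'parent' must not start before 'child' completes. */
static void bio_add_dependency(struct bio *parent, struct bio *child)
{
    assert(child->bi_parent == NULL); /* this model allows one parent per bio */
    child->bi_parent = parent;
    parent->bi_depends++;
}

/* Queue the bio only if nothing is outstanding and no dependency failed.
 * Returns true if the bio was started by this call. */
static bool bio_try_start(struct bio *bio)
{
    if (bio->started || bio->bi_depends != 0 || bio->bi_dep_error)
        return false;
    bio->started = true;
    return true;
}

/* Completion path: record an error on the parent *before* dropping the
 * count, then try to start the parent. Returns true if the parent started. */
static bool bio_complete(struct bio *bio, int error)
{
    struct bio *parent = bio->bi_parent;

    if (parent == NULL)
        return false;
    if (error)
        parent->bi_dep_error = true;
    parent->bi_depends--;
    return bio_try_start(parent);
}
```

In this model a superblock write that depends on a new btree root is simply never started if the root write completes with an error, which is the case the error flag is meant to cover.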

The Linux Storage and Filesystem Summit, day 1

Posted Aug 10, 2010 17:29 UTC (Tue) by butlerm (subscriber, #13312) [Link]

The straightforward way to use barriers without killing I/O performance is to have a concept of I/O "threads" (typically one or two per mounted filesystem) and make the barriers apply on a per I/O thread basis. Very easy to understand and to implement - when a barrier is issued on one I/O thread requests on other I/O threads are allowed to proceed unimpeded.

The barrier itself can be implemented using a (queued) cache flush operation on devices that do not support barriers, a simple write barrier operation on devices that support a single I/O thread, and an I/O thread specific barrier operation on devices that support multiple I/O threads.
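The three cases in the paragraph above amount to a small capability switch. Here is a hedged sketch of it; every name (barrier_support, barrier_op, blk_dev_model, choose_barrier_op) is made up for illustration and does not correspond to any existing kernel interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of what a device can do. */
enum barrier_support {
    BARRIER_NONE,          /* no barrier command at all */
    BARRIER_SINGLE_THREAD, /* one ordered stream per device */
    BARRIER_PER_THREAD     /* barriers scoped to an I/O thread */
};

/* Hypothetical operations the lower layer could issue for a barrier. */
enum barrier_op {
    OP_QUEUED_CACHE_FLUSH,
    OP_WRITE_BARRIER,
    OP_THREAD_BARRIER
};

struct blk_dev_model {
    enum barrier_support support;
};

/* Pick the cheapest operation that still gives barrier semantics. */
static enum barrier_op choose_barrier_op(const struct blk_dev_model *dev)
{
    assert(dev != NULL);

    switch (dev->support) {
    case BARRIER_PER_THREAD:
        return OP_THREAD_BARRIER;
    case BARRIER_SINGLE_THREAD:
        return OP_WRITE_BARRIER;
    case BARRIER_NONE:
    default:
        /* Unknown or missing support: fall back to the always-safe flush. */
        return OP_QUEUED_CACHE_FLUSH;
    }
}
```

The point of keeping this choice below the filesystem is that the caller only asks for "a barrier on this I/O thread" and never sees which branch was taken.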

A typical journaled filesystem would normally have a minimum of two I/O threads per mounted filesystem or filesystem group - one for metadata operations and one for data operations. The idea of course is to allow a write barrier operation on the metadata I/O thread to hold up only future metadata writes without affecting outstanding data writes (i.e. writeback) on the data I/O thread.
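The per-thread rule can be stated as a small predicate over a submission-ordered queue. This is a sketch under assumptions, not an existing interface: io_thread, io_req and can_dispatch() are hypothetical names, and the queue is modelled as a plain array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Two hypothetical I/O threads per mounted filesystem. */
enum io_thread { IO_METADATA = 0, IO_DATA = 1 };

struct io_req {
    enum io_thread thread;
    bool           is_barrier;
    bool           done;
};

/* May queue[idx] be dispatched now? 'queue' is in submission order.
 * A request waits for any unfinished earlier barrier on its own thread;
 * a barrier waits for every unfinished earlier request on its own thread.
 * Requests on other threads never hold it up. */
static bool can_dispatch(const struct io_req *queue, size_t n, size_t idx)
{
    assert(idx < n);
    const struct io_req *req = &queue[idx];

    for (size_t i = 0; i < idx; i++) {
        const struct io_req *prev = &queue[i];

        if (prev->thread != req->thread || prev->done)
            continue;
        if (prev->is_barrier || req->is_barrier)
            return false;
    }
    return true;
}
```

With a journal write, a barrier and a metadata write queued on the metadata thread, writeback requests on the data thread remain dispatchable throughout, while the metadata write is held until the barrier is done.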

If something like this is not done, every metadata sync operation (fsync for example) will necessarily require a complete flush of volatile device write caches for the pertinent devices, which is not exactly ideal if there is a considerable amount of data that doesn't really need to be flushed.

The SCSI architecture model allows parallel I/O threads on a _single_ device to be implemented using what they call "task sets", and it would be unfortunate if there were not a way for filesystems to take advantage of this capability, given the potential performance gains possible for any application that issues fsync or fdatasync operations in the presence of significant I/O contention from other threads or processes.

In fact it would be ideal to dynamically allocate I/O threads to files on which fdatasync operations are regularly issued, so that the underlying device can write just those blocks to disk (or non-volatile cache) without the need to write anything else (assuming the size of the file/inode has not changed).

The Linux Storage and Filesystem Summit, day 1

Posted Aug 11, 2010 2:28 UTC (Wed) by koverstreet (subscriber, #4296) [Link]

Yeah, the idea I sketched out is roughly equivalent (in behavior) to doing it with threads. I should make it explicit that my idea doesn't do anything filesystems can't do themselves.

Threads are unwieldy when you want to express something more complicated than linear dependencies. Like you suggested, if you're just segregating metadata and data that's fine, but the actual dependencies are in practice usually more complex, so - provided you have an easy way of expressing them - it could in theory be a performance gain.

Anyway, if you want to pipeline ios all the way down to the SCSI layer you need a way of expressing dependencies to the block layer, which needs more than threads.

I might have to write an actual patch and see what people think...

The Linux Storage and Filesystem Summit, day 1

Posted Aug 12, 2010 18:21 UTC (Thu) by butlerm (subscriber, #13312) [Link]

The advantage of threads is that they are simple, are already implemented by many existing block devices (SCSI ones at any rate), and allow the optimization of many common cases - journal write before metadata write without a round trip (or worse a cache flush) in between, for example.

They also make it very convenient for a filesystem to gain notification when a series of block writes have been committed to disk without being too involved with the low level details of how that is known to be the case.

On some devices any write barrier is most efficiently translated into a full cache flush, on others completion of a series of writes with force unit access specified. If the block interface does not provide I/O threads with write barriers or the equivalent, presumably a filesystem would be forced to choose one or the other, which would be highly inefficient in a number of cases.

With the proper threaded interface, the lower level device driver can choose how to implement the write barrier most efficiently. SATA devices (which seem to be unusually backward in this regard) probably need a full cache flush. On other devices you can either issue an explicit barrier, or you can efficiently wait for a series of force unit access writes to individually complete. The filesystem shouldn't have to care about what is most efficient for any given device.

The Linux Storage and Filesystem Summit, day 1

Posted Aug 19, 2010 13:13 UTC (Thu) by cypherpunks (guest, #1288) [Link]

I've had a similar idea, which is specifically designed for easy hardware implementation: allow an operation to have a (small integer) tag, then provide a command to "wait for all operations with tag #k to complete".

More generally, you could let every operation have a prerequisite tag that must be completed (you need one reserved tag number which is never used to specify commands with no prerequisites), and have the wait operation be a NOP with a prerequisite.

To merge threads, the wait operation can have a tag #n which differs from the #k it is waiting for. After it is issued, waiting for #n effectively waits for both.

Now, you can merge independent operation streams by doing address translation on tags. And you can compress tag space (down to simple barriers, in the limiting case) by allowing false sharing.
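A rough model of the tag scheme, assuming per-tag counters of outstanding operations. All names here (NTAGS, TAG_NONE, tag_state, op, op_issue, op_can_start, op_complete) are invented for the sketch. It makes two simplifications: every operation carries a non-reserved tag, and a prerequisite is considered met only when the counter for that tag is zero, so operations issued later with the same tag also delay the waiter. That is conservative but safe, in the same spirit as the false sharing mentioned above.

```c
#include <assert.h>
#include <stdbool.h>

#define NTAGS    8
#define TAG_NONE 0 /* reserved: "no prerequisite"; never used as an operation tag */

struct tag_state {
    unsigned outstanding[NTAGS]; /* issued but not yet completed, per tag */
};

struct op {
    unsigned tag;    /* tag this operation is counted under */
    unsigned prereq; /* tag that must drain first, or TAG_NONE */
};

/* Account for a newly issued operation. */
static void op_issue(struct tag_state *ts, const struct op *op)
{
    assert(op->tag != TAG_NONE && op->tag < NTAGS);
    assert(op->prereq < NTAGS);
    assert(op->tag != op->prereq); /* would wait on itself in this model */
    ts->outstanding[op->tag]++;
}

/* An operation may start once everything under its prerequisite tag is done. */
static bool op_can_start(const struct tag_state *ts, const struct op *op)
{
    return op->prereq == TAG_NONE || ts->outstanding[op->prereq] == 0;
}

/* Completion drops the count for the operation's own tag. */
static void op_complete(struct tag_state *ts, const struct op *op)
{
    assert(ts->outstanding[op->tag] > 0);
    ts->outstanding[op->tag]--;
}
```

A "wait" is then just an operation with no payload: issuing NOPs tagged #3 with prerequisites #1 and #2 merges both streams, and anything with prerequisite #3 starts only after both have drained.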

The Linux Storage and Filesystem Summit, day 1

Posted Aug 10, 2010 17:57 UTC (Tue) by butlerm (subscriber, #13312) [Link]

Correction: That should really be SCSI "I_T_L_Q nexus" or execution queue per I/O thread, not a SCSI "task set".

The Linux Storage and Filesystem Summit, day 1

Posted Aug 12, 2010 19:19 UTC (Thu) by butlerm (subscriber, #13312) [Link]

It turns out SCSI only supports ordering within the context of an initiator target (I_T) nexus, which means any given initiator usually only gets one I/O thread per device. The way around that limitation is to establish separate connections (I_T nexuses) for each I/O thread, but that probably isn't practical in most cases, even on something like iSCSI.

That is not to say that the SCSI folks shouldn't add real I/O thread support, because a write barrier at the device level (for all practical purposes) is not much more useful than a full cache flush.

The Linux Storage and Filesystem Summit, day 1

Posted Apr 5, 2011 14:55 UTC (Tue) by bredelings (subscriber, #53082) [Link]

>Introducing an artificial global ordering on all the ios you see is a pain
>in the ass...
Sure, what we want is a *partial* ordering, right?

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds