The Linux Storage and Filesystem Summit, day 1

Hurray!!!!! I've never liked barriers, as such.
Posted Aug 9, 2010 5:30 UTC (Mon) by dlang (✭ supporter ✭, #313)
And if the hardware supports a 'do not reorder across this' barrier, being forced to fully flush everything to disk before writing anything that comes after the barrier is a significant performance loss.
I don't see how pushing the implementation further from the hardware will help, but we'll see what happens.
Posted Aug 9, 2010 7:02 UTC (Mon) by neilbrown (subscriber, #359)
As you say, we'll see what happens.
Posted Aug 9, 2010 8:13 UTC (Mon) by koverstreet (subscriber, #4296)
The problem is mainly that current barriers don't really mean any one thing; they're poorly defined... and what filesystems really need is to know when something is on disk.
Also, ordering is really not a simple matter. It's not just a matter of the disk reordering it... it's the queue, and any virtual block devices in between (RAID/LVM/caching; I'm the author of bcache, so this has been on my mind lately). Introducing an artificial global ordering on all the I/Os you see is a pain in the ass... so if the filesystems don't need it, and it isn't needed for performance, that's a lot of complexity you can cut out.
Personally I think being able to specify an ordering on bios you submit would be a useful thing, partly for reducing filesystem complexity, and with higher-latency storage it could potentially be a performance gain. But I don't think it's a simple thing to implement, or necessary - and the current barriers certainly aren't that, so getting rid of them is a good thing.
I was just thinking today that if we are going to try to implement the ability to order in-flight bios, probably the way to do it would be to first implement it only in the generic block layer - _maybe_ the I/O scheduler at first - and come up with some working semantics. The generic block layer could support it regardless of hardware support simply by waiting to queue a bio until it had no outstanding dependencies.
Such an interface could then live or die depending on whether filesystems were actually able to make good use of it - if it makes life harder for them, it's probably not worth it. If it did prove useful, the implementation could be extended down the layers till it got to the hardware.
Just musing for a bit about what that interface might be - here's a guess:
You probably want a bio to be able to depend on multiple bios; the way that looks sanest to me is to add two fields:
atomic_t bi_depends;   /* bio may not be started until this reaches 0 */
struct bio *bi_parent; /* if not NULL, decrement bi_parent->bi_depends on completion */
You'd then have to add a bit in bi_flags to indicate that a dependency failed, so a depending bio is never started: when completing a bio, before decrementing bi_parent->bi_depends, set the error flag on the parent if the completion reported an error.
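A minimal sketch of what that completion path might look like, assuming the hypothetical bi_depends/bi_parent fields above plus a hypothetical BIO_DEP_FAILED flag bit - none of this is existing kernel code:

/* Hypothetical completion hook, called when a bio that something
 * depends on finishes.  bi_depends, bi_parent and BIO_DEP_FAILED
 * are assumed additions, not existing kernel fields. */
static void bio_dep_complete(struct bio *bio, int error)
{
        struct bio *parent = bio->bi_parent;

        if (!parent)
                return;

        if (error)
                set_bit(BIO_DEP_FAILED, &parent->bi_flags);

        /* The last dependency to complete gets to start the parent,
         * unless some dependency failed - then fail it instead. */
        if (atomic_dec_and_test(&parent->bi_depends)) {
                if (test_bit(BIO_DEP_FAILED, &parent->bi_flags))
                        bio_endio(parent, -EIO);
                else
                        generic_make_request(parent);
        }
}

Whichever dependency completes last decides whether the parent gets submitted or failed, so no locking beyond the atomic count is needed.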
With that feature I don't know if you could use NCQ - I'm not a SCSI guy - but without it the mechanism seems fairly useless, or else dangerous to use; you can't, for example, rewrite your superblock to point at the new btree root if writing the new root failed. I suppose you could use it if you used a log or a journal to write the current btree root, and pointers to btree nodes contained the checksum of the node they pointed to - both are good ideas - but there are still plenty of other situations where you wouldn't be able to recover (and in those cases you don't really need write ordering anyways).
Posted Aug 10, 2010 17:29 UTC (Tue) by butlerm (subscriber, #13312)
The barrier itself can be implemented using a (queued) cache flush operation on devices that do not support barriers, a simple write barrier operation on devices that support a single I/O thread, and an I/O-thread-specific barrier operation on devices that support multiple I/O threads.
A typical journaled filesystem would normally have a minimum of two I/O threads per mounted filesystem or filesystem group - one for metadata operations and one for data operations. The idea of course is to allow a write barrier operation on the metadata I/O thread to hold up only future metadata writes without affecting outstanding data writes (i.e. writeback) on the data I/O thread.
If something like this is not done, every metadata sync operation (fsync for example) will necessarily require a complete flush of volatile device write caches for the pertinent devices, which is not exactly ideal if there is a considerable amount of data that doesn't really need to be flushed.
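As a rough sketch of the kind of interface this implies - every name here is hypothetical, nothing like this exists in the block layer:

/* Hypothetical per-I/O-thread submission interface.  The point is
 * that a barrier on the metadata stream orders nothing on the data
 * stream, so a metadata sync need not drain data writeback. */
enum io_stream { IO_STREAM_META, IO_STREAM_DATA };

void submit_stream_bio(enum io_stream stream, struct bio *bio);
void stream_barrier(enum io_stream stream); /* orders within one stream only */

/* fsync path: commit journal metadata without touching data writeback */
static void commit_metadata(struct bio *journal_bio)
{
        submit_stream_bio(IO_STREAM_META, journal_bio);
        stream_barrier(IO_STREAM_META); /* IO_STREAM_DATA is unaffected */
}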
The SCSI architecture model allows parallel I/O threads on a _single_ device to be implemented using what it calls "task sets", and it would be unfortunate if there were no way for filesystems to take advantage of this capability, given the potential performance gains for any application that issues fsync or fdatasync operations in the presence of significant I/O contention from other threads and processes.
In fact it would be ideal to dynamically allocate I/O threads to files on which fdatasync operations are regularly issued, so that the underlying device can write just those blocks to disk (or non-volatile cache) without the need to write anything else (assuming the size of the file/inode has not changed).
Posted Aug 11, 2010 2:28 UTC (Wed) by koverstreet (subscriber, #4296)
Threads are unwieldy when you want to express something more complicated than linear dependencies. Like you suggested, if you're just segregating metadata and data that's fine, but the actual dependencies are in practice usually more complex, so - provided you have an easy way of expressing them - it could in theory be a performance gain.
Anyway, if you want to pipeline I/Os all the way down to the SCSI layer you need a way of expressing dependencies to the block layer, and that needs more than threads.
I might have to write an actual patch and see what people think...
Posted Aug 12, 2010 18:21 UTC (Thu) by butlerm (subscriber, #13312)
They also make it very convenient for a filesystem to be notified when a series of block writes has been committed to disk, without being too involved with the low-level details of how that is known to be the case.
On some devices any write barrier is most efficiently translated into a full cache flush; on others, into completion of a series of writes with force unit access (FUA) specified. If the block interface does not provide I/O threads with write barriers or the equivalent, presumably a filesystem would be forced to choose one or the other, which would be highly inefficient in a number of cases.
With a proper threaded interface, the lower-level device driver can choose how to implement the write barrier most efficiently. SATA devices (which seem to be unusually backward in this regard) probably need a full cache flush. On other devices you can either issue an explicit barrier or efficiently wait for a series of force-unit-access writes to complete individually. The filesystem shouldn't have to care about what is most efficient for any given device.
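A sketch of that driver-side choice; the capability flags and names are made up for illustration, not real driver fields:

/* Hypothetical: pick the cheapest mechanism a device offers for
 * implementing a write barrier. */
struct device_caps {
        bool has_ordered_tags; /* SCSI-style ordered tags / task sets */
        bool has_fua;          /* force-unit-access writes */
};

enum barrier_impl { BARRIER_ORDERED_TAG, BARRIER_FUA_DRAIN, BARRIER_FLUSH };

static enum barrier_impl pick_barrier_impl(const struct device_caps *caps)
{
        if (caps->has_ordered_tags)
                return BARRIER_ORDERED_TAG; /* explicit device-level barrier */
        if (caps->has_fua)
                return BARRIER_FUA_DRAIN;   /* wait for FUA writes one by one */
        return BARRIER_FLUSH;               /* e.g. typical SATA: full flush */
}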
Posted Aug 19, 2010 13:13 UTC (Thu) by cypherpunks (guest, #1288)
More generally, you could let every operation carry a prerequisite tag that must be completed first (you need one reserved tag number, never assigned to a real command, to mark commands with no prerequisites), and have the wait operation be a NOP with a prerequisite.
To merge threads, the wait operation can have a tag #n which differs from the #k it is waiting for. After it is issued, waiting for #n effectively waits for both.
Now, you can merge independent operation streams by doing address translation on tags. And you can compress tag space (down to simple barriers, in the limiting case) by allowing false sharing.
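A sketch of what that might look like as a command format; NO_PREREQ and the struct are invented for illustration:

/* Hypothetical command format for the tag/prerequisite scheme.
 * Tag 0 is the reserved "no prerequisite" value. */
#define NO_PREREQ 0

struct tagged_cmd {
        unsigned int tag;    /* tag this command completes */
        unsigned int prereq; /* tag that must complete first, or NO_PREREQ */
};

/* A "wait" is a NOP that carries a prerequisite: after issuing
 * { .tag = n, .prereq = k }, waiting on tag n effectively waits on
 * both n and k, which is how two streams are merged. */
static struct tagged_cmd make_wait(unsigned int n, unsigned int k)
{
        struct tagged_cmd c = { .tag = n, .prereq = k };
        return c;
}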
Posted Aug 12, 2010 19:19 UTC (Thu) by butlerm (subscriber, #13312)
That is not to say that the SCSI folks shouldn't add real I/O thread support; they should, because a write barrier at the device level is (for all practical purposes) not much more useful than a full cache flush.
Posted Aug 9, 2010 20:11 UTC (Mon) by masoncl (subscriber, #47138)
> Hurray!!!!! I've never liked barriers, as such.
Grin, there seems to be a large party dancing around the grave of the ordered barriers code.
But to clarify for the comments below: we do still want to issue cache flushes; the filesystems just promise to use wait_on_buffer/page for everything we care about first.
Most of the time these waits are already there... reiserfs is the biggest exception, but that is very easily fixed since it is inside a big if (ordered_barriers) statement.
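The pattern being described, roughly - issue_cache_flush() is a stand-in for whatever flush primitive applies (e.g. blkdev_issue_flush(), whose signature has varied between kernel versions), and error handling is simplified:

/* Sketch of the wait-then-flush pattern: write the buffers, wait on
 * each, then flush the device cache instead of relying on an
 * ordered barrier. */
static int commit_buffers(struct buffer_head **bhs, int nr,
                          struct block_device *bdev)
{
        int i, err = 0;

        for (i = 0; i < nr; i++)
                submit_bh(WRITE, bhs[i]);      /* queue the writes */

        for (i = 0; i < nr; i++) {
                wait_on_buffer(bhs[i]);        /* wait for completion */
                if (!buffer_uptodate(bhs[i]))
                        err = -EIO;
        }

        if (!err)
                err = issue_cache_flush(bdev); /* stand-in flush primitive */
        return err;
}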