XFS: the filesystem of the future?
Posted Jan 23, 2012 18:40 UTC (Mon) by wazoox (subscriber, #69624)
Posted Jan 29, 2012 3:47 UTC (Sun) by sbergman27 (guest, #10767)
I hope you don't connect to the Internet with that security-hole-ridden kernel you're running. You should reboot after kernel updates.
Posted Jan 30, 2012 8:03 UTC (Mon) by youareretarded (guest, #82640)
Posted Jan 23, 2012 21:46 UTC (Mon) by dgc (subscriber, #6611)
"Consumer storage" violates the write ordering guarantees that these filesystems require to have journal recovery work because they have volatile write caches. That's why we have write barriers and use them by default on these filesystems these days. XFS was the first filesystem to enable them by default, another reason it was always slower on metadata intensive workloads than ext3/4.
"server storage" doesn't violate write ordering in an effort to improve performance, so XFS has always worked fine and performed well on that class of storage.
Posted Jan 31, 2012 17:36 UTC (Tue) by Cato (subscriber, #7643)
As far as I can tell, some consumer drives do lie about when writes have been flushed, and write-back caching is the default anyway.
Some relevant links:
http://brad.livejournal.com/2116715.html - disk testing tool
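(The core idea behind that kind of testing tool can be sketched in a few lines of C - this is not Brad's actual tool, just the same principle: only claim a record durable after fsync() succeeds, then cut power mid-run and check that every claimed record survived:)

    /* Sketch of a lying-drive test: write sequenced records, fsync each,
     * and print the last sequence number the kernel reported durable.
     * After a power cut mid-run, every printed number must be on disk;
     * a missing one means the drive acknowledged a flush it hadn't done. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("testfile.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) { perror("open"); return 1; }

        char buf[64];
        for (long seq = 0; ; seq++) {       /* run until the power is cut */
            int len = snprintf(buf, sizeof(buf), "%ld\n", seq);
            if (write(fd, buf, len) != len) { perror("write"); return 1; }
            if (fsync(fd) != 0) { perror("fsync"); return 1; }
            /* Only after fsync() succeeds do we claim durability. */
            printf("durable: %ld\n", seq);
        }
    }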
Posted Jan 31, 2012 19:30 UTC (Tue) by dlang (✭ supporter ✭, #313)
Posted Feb 2, 2012 21:19 UTC (Thu) by dgc (subscriber, #6611)
Any device with a volatile write cache tells the OS that a write IO has completed before it is actually written to stable storage. IO completion is supposed to mean "the IO is complete", so any device with a volatile write cache is effectively lying to the OS about the completion status of the IO - the write is not yet on stable storage. Pretty much all consumer devices ship with a volatile write cache enabled by default for performance reasons.
Barriers and cache flushes were introduced to give filesystems a mechanism to force such drives to order writes correctly, the way the filesystem wants. The original barrier mechanism was "cache flush, write, cache flush", which, depending on the workload, could make the drive slower than not caching in the first place. More recently we just use the FUA mechanism if the drive supports it, and that has negligible performance overhead.
> As far as I can tell, some consumer drives do lie about when writes
> have been flushed, and write back caching is the default anyway
If drives lie about cache flush or FUA completion on volatile writeback caches, then that's a bug in the disk firmware.
FWIW, the difference with server storage (SAS drives) is that most ship with the volatile write cache turned off by default. They don't need it for performance because the SCSI/SAS protocol is much more efficient than SATA, so in most cases a write cache isn't necessary. You can turn it on, but you don't need to in order to reach full disk performance....
Indeed, it's not just filesystems that don't like volatile write caches. If you turn on volatile write caching on disks behind a RAID controller, the disks will now violate the write ordering guarantees that the RAID controller relies on to maintain data safety (exactly the same as for filesystems). You still lose data or corrupt filesystems on power loss in this case, even though the OS and RAID controller are behaving correctly.
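(As an application-level illustration of the write-through behaviour described above: on Linux, opening a file with O_DSYNC asks the kernel to make each write durable before returning, and on devices that advertise FUA the block layer may satisfy that with FUA writes rather than full cache flushes - whether it actually does depends on the kernel version and the device. A minimal sketch:)

    /* Sketch: O_DSYNC makes each write() return only once the data is
     * on stable storage; on devices that support FUA the block layer
     * can use a FUA write instead of a full cache flush (this depends
     * on kernel version and device support, so treat it as illustrative). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("journal.dat", O_WRONLY | O_CREAT | O_DSYNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        const char rec[] = "commit record\n";
        /* Each write() returns only after the data is durable. */
        if (write(fd, rec, sizeof(rec) - 1) < 0) perror("write");

        close(fd);
        return 0;
    }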
Posted Feb 3, 2012 4:34 UTC (Fri) by raven667 (subscriber, #5198)
I think you are correct on every other point, but I don't think this is right. SATA is pretty much the SCSI protocol, as is SAS; they are only slightly incompatible for marketing rather than technical reasons. The big performance difference historically between consumer (IDE) and enterprise (SCSI) drives was tagged command queuing, which is now very common in SATA drives as well, although it wasn't in early SATA implementations. A tagged command queue allows the drive to implement an elevator, which is a big win over a naive implementation without one.
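(To illustrate why queue depth matters: with several commands outstanding, the drive can service them in position order rather than arrival order. A toy C sketch of such an elevator, with made-up LBA numbers:)

    /* Toy elevator: given a queue of outstanding LBAs, service them in
     * ascending order from the current head position, then sweep back.
     * Real drives also account for rotational position, but the idea
     * is the same: more outstanding commands means better ordering. */
    #include <stdio.h>
    #include <stdlib.h>

    static int cmp_lba(const void *a, const void *b)
    {
        long x = *(const long *)a, y = *(const long *)b;
        return (x > y) - (x < y);
    }

    int main(void)
    {
        long head = 5000;                              /* current head position */
        long queue[] = { 9000, 100, 5200, 7300, 400 }; /* outstanding LBAs */
        size_t n = sizeof(queue) / sizeof(queue[0]);

        qsort(queue, n, sizeof(queue[0]), cmp_lba);

        /* One sweep upward from the head, then wrap to the low LBAs. */
        for (size_t i = 0; i < n; i++)
            if (queue[i] >= head) printf("service %ld\n", queue[i]);
        for (size_t i = 0; i < n; i++)
            if (queue[i] < head) printf("service %ld\n", queue[i]);
        return 0;
    }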
Posted Feb 3, 2012 12:35 UTC (Fri) by Jonno (subscriber, #49613)
The SCSI command set is generally considered "better" than the ATA command set, though the difference isn't quite as large as the grandparent suggests. Write caches are still beneficial for SCSI (including SAS) performance, but the difference is not as large with SCSI as with ATA. That, along with the fact that the average enterprise customer is more concerned about reliability than the average home user, is why most SAS drives have the write cache disabled by default, while most SATA drives have it enabled by default.
Posted Feb 3, 2012 17:30 UTC (Fri) by raven667 (subscriber, #5198)
Posted Feb 3, 2012 14:08 UTC (Fri) by quanstro (guest, #77996)
SATA and SAS send the same data in the same-size FISes/frames to the drive. Neither is wire-speed limited; they're spin/seek limited - physics limited. Could you please explain the mechanism whereby SAS is going to be faster than SATA?
Posted Feb 3, 2012 19:00 UTC (Fri) by raven667 (subscriber, #5198)
That's the power of branding, replacing rational thought with mental shortcuts which put things in "good" or "bad" boxes.
Posted Feb 4, 2012 13:00 UTC (Sat) by zomonto (guest, #82108)
No, libata always disables FUA by default. You can enable it with a kernel parameter, though.
Posted Feb 8, 2012 13:09 UTC (Wed) by yungchin (guest, #72949)
Dave, I was wondering: given the optimisations you discussed in the talk, where lots of merging and reordering now happens before anything is sent to the I/O scheduler, do you still expect much performance improvement from these hardware caches? (Or - that's of course the hidden question here - should we from now on happily disable them, at least for most use cases?) Thanks.
Posted Feb 1, 2012 11:33 UTC (Wed) by Cato (subscriber, #7643)
Posted Feb 1, 2012 19:28 UTC (Wed) by dlang (✭ supporter ✭, #313)
Posted Feb 1, 2012 20:13 UTC (Wed) by raven667 (subscriber, #5198)
Posted Feb 1, 2012 22:20 UTC (Wed) by magila (subscriber, #49627)
Posted Feb 2, 2012 1:39 UTC (Thu) by dlang (✭ supporter ✭, #313)
Most consumer drives don't have these problems, but a few have been found to have them.
Unfortunately you cannot just assume that newer drives will not have the problem. On the database mailing lists you see a couple of drive models every year where someone runs across the problem yet again.
Posted Feb 2, 2012 2:40 UTC (Thu) by magila (subscriber, #49627)
I'd be rather surprised if that were the case. The code that handles cache flushing isn't something that usually changes between models. If a manufacturer's firmware had a bug in that area, I'd expect to see it across the board, not just randomly popping up periodically on different SKUs.
Posted Feb 2, 2012 13:11 UTC (Thu) by cladisch (✭ supporter ✭, #50193)
You won't get any manufacturer to admit it, but this is not a bug, it's a feature (to get higher benchmark numbers).
Posted Feb 2, 2012 17:38 UTC (Thu) by magila (subscriber, #49627)