As for the checksum, what about something like software RAID-5, where the parity block is as large as the block being written back?
Posted May 12, 2011 6:07 UTC (Thu) by djwong (subscriber, #23506)
Posted May 12, 2011 13:00 UTC (Thu) by smurf (subscriber, #17840)
Posted May 12, 2011 13:28 UTC (Thu) by jzbiciak (✭ supporter ✭, #5246)
Now, on a separate note: One thing that wasn't clear to me was why this blocking only applies to file backed pages. Wouldn't anonymous pages headed toward swap also be subject to this if swap was on an integrity-checked volume?
Posted May 12, 2011 20:01 UTC (Thu) by djwong (subscriber, #23506)
Stable pages, possible corner cases
Posted May 12, 2011 18:07 UTC (Thu) by davecb (subscriber, #1574)
In a previous life I was involved in performance measurement of coalescing disk writes, and we found that a very large number of sequential writes could be coalesced into single writes, and then adjacent blocks coalesced into larger single writes. This paid off particularly well when a disk was being handed writes at or beyond its capacity, by removing unneeded writes. I think I still have the graphs somewhere (;-))
I'll comment on the non-sequential case in a sec, after I look at my archive.
Posted May 12, 2011 18:49 UTC (Thu) by jzbiciak (✭ supporter ✭, #5246)
That's what makes it so hard (at least for me) to reason about what workloads would get hurt, since there's not a simple, immediate relationship between "application dirtied a page" and "page got scheduled for writeback." You need both of those things to happen *and* the application must subsequently try to dirty the page further before you hit the page-block.
I guess you could get some negative interactions more immediately if a 'write' call scheduled a writeback for part of a page, and then the app immediately resumed filling the rest of the page. Still, I don't think a write() syscall triggers an immediate writeback on most calls. Take a look at 'queue_io' around line 277.
Only the oldest dirtied pages get flushed, as I read that.
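The "only the oldest dirtied pages get flushed" rule can be sketched as a cutoff test: an inode is only moved to the writeback queue if it was dirtied before some age threshold. This is an illustrative model only, not the kernel's actual queue_io() code; all names here are hypothetical.

```c
#include <stddef.h>

/* Hypothetical model of the age cutoff that queue_io() applies:
 * an inode is queued for writeback only if it was first dirtied
 * at or before the cutoff timestamp.  Freshly redirtied data is
 * left alone until it ages. */
struct fake_inode {
    unsigned long dirtied_when;  /* jiffies-style timestamp of first dirtying */
};

/* Returns nonzero if this inode is old enough to be queued for writeback. */
static int should_queue(const struct fake_inode *inode,
                        unsigned long older_than_this)
{
    return inode->dirtied_when <= older_than_this;
}
```

Under this model, a page the application is actively dirtying keeps getting its flush deferred, which is why the "dirty, scheduled for writeback, then dirtied again" window is narrower than it first appears.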
Posted May 12, 2011 19:23 UTC (Thu) by davecb (subscriber, #1574)
Stable pages - is this "racy"?
Posted May 12, 2011 18:35 UTC (Thu) by davecb (subscriber, #1574)
Delaying, duplicating or COWing allows us to survive or avoid the data changing while the I/O is queued, which is a pretty long time compared to anything happening in main memory. The speed difference gives us a relatively large period in which a program can race ahead of the disk.
If the purpose is to validate the disk write, one would want to do the checksum as late as possible before the write, and verify it either as part of hardware write or via a read-after-write step. That keeps the time period tiny.
If the purpose is to validate it from end to end, I suspect you need more than one check. One check would need to be done as the data is queued, to be sure it made it to the queue ok, which would need to be amended if the page in queue is coalesced with a later write. In the latter case you have a new, amended checksum to check as-or-after the write.
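The "checksum as late as possible, verify after the write" idea above can be sketched as follows. The checksum here is a trivial additive one purely for illustration (real integrity schemes use CRCs or similar), and the "device" is just a second buffer; all names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Toy checksum, standing in for a real CRC. */
static uint32_t toy_checksum(const unsigned char *buf, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}

/* "Write": compute the check value at the last possible moment,
 * then hand the buffer to the device. */
static uint32_t write_with_checksum(unsigned char *dev,
                                    const unsigned char *src, size_t len)
{
    uint32_t sum = toy_checksum(src, len);   /* latest possible point */
    memcpy(dev, src, len);
    return sum;
}

/* Read-after-write verification step. */
static int verify_after_write(const unsigned char *dev, size_t len,
                              uint32_t expected)
{
    return toy_checksum(dev, len) == expected;
}
```

The point of the late checksum is that the window between "checksum computed" and "data on the wire" shrinks to nearly nothing, which is exactly the tiny time period described above.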
Alas, I'm not following the main list these days, so I'm unclear of the fine details of the requirements you face!
Posted May 15, 2011 20:44 UTC (Sun) by giraffedata (subscriber, #1954)
The race is between Linux and the disk drive. No matter when Linux computes the checksum, if the data in the buffer changes while the disk drive is transferring the data from the buffer to itself, Linux cannot ensure that the checksum the disk drive gets is correct for the rest of the data that the disk drive gets.
It's always been pretty dicey to have the disk drive get a mixture of older and newer data for a single write, but we've always arranged things so that in the cases where that can happen, it doesn't matter that we end up with garbage. But it's a lot harder to ignore a checksum mismatch, which is designed to indicate lower-level corruption.
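The race described here can be made concrete with a toy model: the checksum is computed over the buffer, then the "drive" copies the buffer out byte by byte, and the application redirties a byte the drive hasn't read yet, so the drive-side check fails. Everything here is an illustrative model, not kernel code.

```c
#include <stdint.h>
#include <string.h>

/* Toy checksum, standing in for a real CRC. */
static uint32_t toy_sum(const unsigned char *buf, size_t len)
{
    uint32_t s = 0;
    for (size_t i = 0; i < len; i++)
        s += buf[i];
    return s;
}

/* The "drive" copies the buffer byte by byte; halfway through, a
 * racing writer changes a byte the drive has not read yet, so the
 * data the drive receives no longer matches the checksum it was
 * handed.  Returns nonzero only if the drive-side check passes. */
static int transfer_with_race(unsigned char *drive, unsigned char *buf,
                              size_t len, uint32_t expected_sum)
{
    for (size_t i = 0; i < len; i++) {
        if (i == len / 2)
            buf[len - 1] ^= 0xFF;   /* racing modification */
        drive[i] = buf[i];
    }
    return toy_sum(drive, len) == expected_sum;
}
```

No matter when the checksum is computed, a modification that lands mid-transfer produces this mismatch, which is why stabilizing the page (by blocking, copying, or COW) is needed at all.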
Posted May 12, 2011 18:46 UTC (Thu) by dlang (✭ supporter ✭, #313)
If you have a page that's being modified 1000 times a second, you don't want to make 1000 copies per second and try to write them all out.
But while the system is working to write the first copy, you can allow the second copy to be modified many times; only when you select that page for writeout (and are ready to compute the checksum on it) do you set COW.
This gets the modifications to disk as quickly as the disk will support, while keeping only one copy of the page (in addition to the one in the process of being written out to disk).
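The scheme proposed here can be modeled roughly as follows: a page is redirtied in place any number of times, and only when it is selected for writeout (the point where the checksum would be computed) does it become copy-on-write, so at most one extra copy ever exists. All names are hypothetical, and this is a sketch of the idea, not kernel code.

```c
#include <stdlib.h>

enum page_state { PAGE_DIRTY, PAGE_WRITEBACK_COW };

struct fake_page {
    enum page_state state;
    int generation;   /* bumped on every modification */
};

/* Modifying a merely-dirty page is free (redirty in place); modifying
 * a page under writeback allocates the single extra copy and leaves
 * the in-flight original stable. */
static struct fake_page *modify(struct fake_page *p)
{
    if (p->state == PAGE_WRITEBACK_COW) {
        struct fake_page *copy = malloc(sizeof *copy);
        copy->state = PAGE_DIRTY;
        copy->generation = p->generation + 1;
        return copy;         /* writer now works on the copy */
    }
    p->generation++;
    return p;                /* same page, redirtied in place */
}

static void select_for_writeout(struct fake_page *p)
{
    p->state = PAGE_WRITEBACK_COW;   /* checksum would be computed here */
}
```

In this model, the 1000 modifications per second all land on the same page until writeout is selected, and only a racing modification after that point pays for a copy.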
Posted May 12, 2011 19:54 UTC (Thu) by djwong (subscriber, #23506)
Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds