
Stable pages

Posted May 12, 2011 2:54 UTC (Thu) by nirbheek (subscriber, #54111)
Parent article: Stable pages

This seems a bit odd to me.

The copy-on-writeback solution, which takes too much memory, and the wait-on-writeback solution, which causes too much latency, are two extremes for this problem.

Wouldn't an intermediate solution such as copy-on-page-modify be better? That way, when a page that's under writeback needs to be modified, a copy is made, that copy is modified, and the modified page can be marked for writeback as well.

Another solution could be to copy only the checksum value instead of the whole page.

I'm probably missing something here, and I'd like to know what that is :)


Stable pages

Posted May 12, 2011 3:12 UTC (Thu) by jzbiciak (subscriber, #5246) [Link]

I came to suggest COW also. From the outside, it seems like a happy middle ground.

As for the checksum, what about something like software RAID-5, where the parity block is as large as the block being written back?

Stable pages

Posted May 12, 2011 6:07 UTC (Thu) by djwong (subscriber, #23506) [Link]

Yes, I'm working towards a COW solution. However, I first need to quantify the impact of wait-on-writeback on a wider variety of workloads so that I have a better idea of what I'd be changing and what good that would do. :)

Stable pages

Posted May 12, 2011 13:00 UTC (Thu) by smurf (subscriber, #17840) [Link]

One kind of workload that's affected negatively would be any low-latency process which writes to disk.
When I do that, in order to guarantee that the main program responds immediately I lock the whole application in memory and use a separate writing thread.
But if you lock a couple of my process's pages while writing, that lock will affect unrelated data structures which simply happen to be on the same memory page. I can then no longer guarantee that my main task won't block on random memory writes. That's not acceptable.

Stable pages

Posted May 12, 2011 13:28 UTC (Thu) by jzbiciak (subscriber, #5246) [Link]

Well, for one, you could allocate your write buffers in dedicated pages with "memalign". That might not be a bad idea anyway.

Now, on a separate note: One thing that wasn't clear to me was why this blocking only applies to file backed pages. Wouldn't anonymous pages headed toward swap also be subject to this if swap was on an integrity-checked volume?

Stable pages

Posted May 12, 2011 20:01 UTC (Thu) by djwong (subscriber, #23506) [Link]

I _think_ the swap cache tries to erase all the mappings to a particular page just prior to swapping the page out to disk, and doesn't write the page if it can't. I'm not 100% sure, however, that there isn't a race between the page being mapped back in while the swapout is in progress, so I'll check.

Stable pages, possible corner cases

Posted May 12, 2011 18:07 UTC (Thu) by davecb (subscriber, #1574) [Link]

Perhaps I'm misunderstanding, but won't a series of small sequential writes trigger wait-on-writeback? Or does this not apply to appending to a file-backed page?

In a previous life I was involved in measuring the performance of coalescing disk writes, and we found that a very large number of sequential writes could be coalesced into single writes, and then adjacent blocks coalesced into larger single writes. This paid off particularly well when a disk was being handed writes at or beyond its capacity, by removing unneeded writes. I think I still have the graphs somewhere (;-))

I'll comment on the non-sequential case in a sec, after I look at my archive.


Stable pages, possible corner cases

Posted May 12, 2011 18:49 UTC (Thu) by jzbiciak (subscriber, #5246) [Link]

It seems like it should, but only if the page starts getting flushed to disk during the series of writes. Dirty pages don't get flushed to disk immediately unless there's memory pressure, too many dirty pages, the pages have been sitting around a while, or you've asked for them to be flushed. All those thresholds are tunable via the vm.dirty_* sysctls.

That's what makes it so hard (at least for me) to reason about what workloads would get hurt, since there's not a simple, immediate relationship between "application dirtied a page" and "page got scheduled for writeback." You need both of those things to happen *and* the application must subsequently try to dirty the page further before you hit the page-block.

I guess you could get some negative interactions more immediately if a 'write' call scheduled a writeback for part of a page, and then the app immediately resumed filling the rest of the page. Still, I don't think a write() syscall triggers an immediate writeback on most calls. Take a look at 'queue_io' in fs/fs-writeback.c, around line 277.

Only the oldest dirtied pages get flushed, as I read that.

Stable pages, possible corner cases

Posted May 12, 2011 19:23 UTC (Thu) by davecb (subscriber, #1574) [Link]

Excellent, thanks! --dave

Stable pages - is this "racy" ?

Posted May 12, 2011 18:35 UTC (Thu) by davecb (subscriber, #1574) [Link]

I had a look at the paper the work I measured was based on, and wonder if we're really looking at a race condition: we take a checksum, queue the data for I/O, and compare the data during or after the I/O to see whether an error has occurred.

Delaying, duplicating or COWing allows us to survive or avoid the data changing while the I/O is queued, which is a pretty long time compared to anything happening in main memory. The speed difference gives us a relatively large period in which a program can race ahead of the disk.

If the purpose is to validate the disk write, one would want to do the checksum as late as possible before the write, and verify it either as part of hardware write or via a read-after-write step. That keeps the time period tiny.

If the purpose is to validate it from end to end, I suspect you need more than one check. One check would need to be done as the data is queued, to be sure it made it to the queue ok, which would need to be amended if the page in queue is coalesced with a later write. In the latter case you have a new, amended checksum to check as-or-after the write.

Alas, I'm not following the main list these days, so I'm unclear of the fine details of the requirements you face!


Stable pages - is this "racy" ?

Posted May 15, 2011 20:44 UTC (Sun) by giraffedata (subscriber, #1954) [Link]

The race is between Linux and the disk drive. No matter when Linux computes the checksum, if the data in the buffer changes while the disk drive is transferring the data from the buffer to itself, Linux cannot ensure that the checksum the disk drive gets is correct for the rest of the data that the disk drive gets.

It's always been pretty dicey to have the disk drive get a mixture of older and newer data for a single write, but we've always arranged it so that in the cases where that can happen, it doesn't matter that we end up with garbage. But it's a lot harder to ignore a checksum mismatch, which is designed to indicate lower-level corruption.

Stable pages

Posted May 12, 2011 18:46 UTC (Thu) by dlang (subscriber, #313) [Link]

the COW can be further optimized by not turning on COW until the system is ready to start processing the page

if you have a page that's being modified 1000 times a second, you don't want to have 1000 copies/sec to try and write out.

but while the system is working to write the first copy, you can allow the second copy to be modified many different times, and only when you select that page for writeout (and are ready to do the checksum on it), do you set COW.

this will get the modifications to disk as quickly as the disk will support it, but will only have one copy of the page (in addition to what's in the process of being written out to disk)

Stable pages

Posted May 12, 2011 19:54 UTC (Thu) by djwong (subscriber, #23506) [Link]

I was actually thinking that instead of doing the writeback wait, we could instead (ab)use the page migration code to remap all processes' mappings to a different page.

Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds