
Stable pages

By Jonathan Corbet
May 11, 2011
When a process writes to a file-backed page in memory (through either a memory mapping or with the write() system call), that page is marked dirty and must eventually be written to its backing store. The writeback code, when it gets around to that page, will mark the page read-only, set the "under writeback" page flag, and queue the I/O operation. The write-protection of the page is not there to prevent changes to the page; its purpose is to detect further writes which would require that another writeback be done. Current kernels will, in most situations, allow a process to modify a page while the writeback operation is in progress.
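The sequence above is roughly what a filesystem's writepage path does. Below is a heavily simplified sketch in kernel style; clear_page_dirty_for_io(), set_page_writeback(), and end_page_writeback() are the real kernel helpers, but my_writepage(), my_submit_page_io(), and my_end_io() are hypothetical stand-ins for a filesystem's own code, not code from any actual filesystem.

	/* Simplified, hypothetical ->writepage() path illustrating the
	 * steps described above; my_submit_page_io() stands in for a
	 * filesystem's real I/O submission and is not a kernel function. */
	#include <linux/fs.h>
	#include <linux/mm.h>
	#include <linux/pagemap.h>
	#include <linux/writeback.h>

	static int my_writepage(struct page *page, struct writeback_control *wbc)
	{
		/* Write-protect the page and clear its dirty bit; a later
		 * write will fault and re-dirty it, signalling that another
		 * writeback pass is needed. */
		if (!clear_page_dirty_for_io(page))
			return 0;	/* somebody else already cleaned it */

		/* Mark the page as being under writeback. */
		set_page_writeback(page);
		unlock_page(page);

		/* Queue the actual I/O (hypothetical helper). */
		my_submit_page_io(page);
		return 0;
	}

	/* When the I/O completes, the completion handler clears the flag. */
	static void my_end_io(struct page *page)
	{
		end_page_writeback(page);
	}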

Most of the time, that works just fine. In the worst case, the second write to the page will happen before the first writeback I/O operation begins; in that case, the more recently written data will also be written to disk in the first I/O operation and a second, redundant disk write will be queued later. Either way, the data gets to its backing store, which is the real intent.

There are cases where modifying a page that is under writeback is a bad idea, though. Some devices can perform integrity checking, meaning that the data written to disk is checksummed by the hardware and compared against a pre-write checksum provided by the kernel. If the data changes after the kernel calculates its checksum, that check will fail, causing a spurious write error. Software RAID implementations can be tripped up by changing data as well. As a result of problems like this, developers working in the filesystem area have been convinced for a while that the kernel needs to support "stable pages" which are guaranteed not to change while they are under writeback.
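To see why modification during writeback matters, consider a user-space analogue of the integrity-check failure: a checksum computed over a buffer becomes stale the moment the buffer changes, so verification against the original checksum fails. This is only an illustration of the principle with a toy checksum; it is not how the block layer's integrity checksums are actually computed.

	/* Illustration of the integrity-check race: the "device" verifies a
	 * checksum taken before the data was modified.  Purely a user-space
	 * analogue of the problem described above. */
	#include <stdint.h>
	#include <stdio.h>
	#include <string.h>

	static uint32_t checksum(const unsigned char *buf, size_t len)
	{
		uint32_t sum = 0;
		for (size_t i = 0; i < len; i++)
			sum = (sum << 1) ^ buf[i];	/* toy checksum */
		return sum;
	}

	int main(void)
	{
		unsigned char page[4096];
		memset(page, 0xaa, sizeof(page));

		/* The kernel computes the checksum when it queues the write... */
		uint32_t queued_sum = checksum(page, sizeof(page));

		/* ...but the application modifies the page before the I/O
		 * actually happens. */
		page[100] = 0x55;

		/* The "device" now sees a mismatch and reports a write error. */
		if (checksum(page, sizeof(page)) != queued_sum)
			printf("spurious integrity error: data changed in flight\n");
		return 0;
	}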

When LWN looked at stable pages in February, Darrick Wong had just posted a patch aimed at solving this problem. In situations where integrity checking was in use, the kernel would make a copy of each page before beginning a writeback operation. Since nobody in user space knew about the copy, it was guaranteed to remain unmolested for the duration of the write operation. This patch solved the problem for the integrity checking case, but all of those copy operations were expensive. Given that providing stable pages in all situations was seen as desirable, that cost was considered to be too high.

So Darrick has come back with a new patch set which takes a different - and simpler - approach. In short, with this patch, any attempt to write to a page which is under writeback will simply wait until the writeback completes. There is no need to copy pages or engage in other tricks, but there may be a cost to this approach as well.

As noted above, a page will be marked read-only when it is written back; there is also a page flag which indicates that writeback is in progress. So all of the pieces are there to trap writes to pages under writeback. To make it even easier, the VFS layer already has a callback (page_mkwrite()) to notify filesystems that a read-only page is being made writable; all Darrick really needed to do was to change how those page_mkwrite() callbacks operate in the presence of writeback.

Some filesystems do not provide page_mkwrite() at all; for those, Darrick created a generic empty_page_mkwrite() function which locks the page, waits for any writeback to complete, then returns the locked page. More complicated filesystems do have page_mkwrite() handlers, though, so Darrick had to add similar functionality for ext2, ext4, and FAT. Btrfs has implemented stable pages internally for some time, so no changes were required there. Ext3 turns out to have some complicated interactions with the journal layer which make a stable page implementation hard; since invasive changes to ext3 are not welcomed at this point, that filesystem may never get stable page support.
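A generic handler like the one described above would look roughly like the following. This is a sketch based on the description of empty_page_mkwrite(), not the patch itself; lock_page(), wait_on_page_writeback(), and the VM_FAULT_* return codes are the standard kernel primitives, while the truncation check is an illustrative assumption.

	/* Sketch of a generic page_mkwrite() handler that blocks the writing
	 * process until any writeback of the page has finished.  Based on
	 * the description above, not on the actual patch. */
	#include <linux/mm.h>
	#include <linux/pagemap.h>

	static int sketch_page_mkwrite(struct vm_area_struct *vma,
				       struct vm_fault *vmf)
	{
		struct page *page = vmf->page;

		lock_page(page);

		/* If the page no longer belongs to this file (e.g. it was
		 * truncated away), make the caller retry the fault. */
		if (page->mapping != vma->vm_file->f_mapping) {
			unlock_page(page);
			return VM_FAULT_NOPAGE;
		}

		/* The key step: sleep until any writeback in progress
		 * completes, so the page stays stable on the wire. */
		wait_on_page_writeback(page);

		/* Return with the page locked; the caller re-dirties it. */
		return VM_FAULT_LOCKED;
	}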

There have been concerns expressed that this approach could slow down applications which repeatedly write to the same part of a file. Before this change, writeback would not slow down subsequent writes; afterward, those writes will wait for writeback to complete. Darrick ran some benchmarks to test this case and found a performance degradation of up to 12%. This slowdown is unwelcome, but there also seems to be a consensus that there are very few applications which would actually run into this problem. Repetitively rewriting data is a relatively rare pattern; indeed, the developers involved are saying that they don't even know of a real-world case they can test.

Lack of awareness of applications which would be adversely affected by this change does not mean that they don't exist, of course. This is the kind of change which can create real problems a few years down the line when the code is finally shipped by distributors and deployed by users; by then, it's far too late to go back. If there are applications which would react poorly to this change, it would be good to get the word out now. Otherwise the benefits of stable pages are likely to cause them to be adopted in most settings.

Index entries for this article
Kernel/Data integrity
Kernel/Stable pages



Stable pages

Posted May 12, 2011 2:54 UTC (Thu) by nirbheek (subscriber, #54111) [Link] (12 responses)

This seems a bit odd to me.

The copy-on-writeback solution, which takes too much memory, and the wait-on-writeback solution, which causes too much latency, are two extreme solutions to this problem.

Wouldn't an intermediate solution such as copy-on-page-modify be better? That way, when a page that's under writeback needs to be modified, a copy is made, that copy is modified, and the modified page can be marked for writeback as well.

Another solution could be to copy only the checksum value instead of the whole page.

I'm probably missing something here, and I'd like to know what that is :)

Stable pages

Posted May 12, 2011 3:12 UTC (Thu) by jzbiciak (guest, #5246) [Link] (11 responses)

I came to suggest COW also. From the outside, it seems like a happy middle ground.

As for the checksum, what about something like software RAID-5, where the parity block is as large as the block being written back?

Stable pages

Posted May 12, 2011 6:07 UTC (Thu) by djwong (subscriber, #23506) [Link] (8 responses)

Yes, I'm working towards a COW solution. However, I first need to quantify the impact of wait-on-writeback on a wider variety of workloads so that I have a better idea of what I'd be changing and what good that would do. :)

Stable pages

Posted May 12, 2011 13:00 UTC (Thu) by smurf (subscriber, #17840) [Link] (2 responses)

One kind of workload that's affected negatively would be any low-latency process which writes to disk.
When I do that, in order to guarantee that the main program responds immediately, I lock the whole application in memory and use a separate writing thread.
But if you lock a couple of my process's pages when writing, that lock will affect unrelated data structures which simply happen to be on the same memory page. I can thus no longer guarantee that my main task will not block on random memory writes. That's not acceptable.

Stable pages

Posted May 12, 2011 13:28 UTC (Thu) by jzbiciak (guest, #5246) [Link] (1 responses)

Well, for one, you could allocate your write buffers in dedicated pages with "memalign". That might not be a bad idea anyway.
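For the record, the memalign suggestion amounts to something like the sketch below: give the writer thread its own page-aligned, page-sized buffers so that a stall on a page under writeback cannot touch unrelated data structures. posix_memalign() and sysconf() are standard interfaces; the buffer size and usage are just an example.

	/* Allocate a write buffer in its own page so that blocking on that
	 * page during writeback cannot affect unrelated data structures. */
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	int main(void)
	{
		long page_size = sysconf(_SC_PAGESIZE);
		void *buf;

		/* Page-aligned and a whole number of pages long, so nothing
		 * else shares these pages. */
		if (posix_memalign(&buf, page_size, page_size) != 0) {
			fprintf(stderr, "posix_memalign failed\n");
			return 1;
		}

		printf("write buffer at %p, page size %ld\n", buf, page_size);
		free(buf);
		return 0;
	}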

Now, on a separate note: One thing that wasn't clear to me was why this blocking only applies to file-backed pages. Wouldn't anonymous pages headed toward swap also be subject to this if swap was on an integrity-checked volume?

Stable pages

Posted May 12, 2011 20:01 UTC (Thu) by djwong (subscriber, #23506) [Link]

I _think_ the swap cache tries to erase all the mappings to a particular page just prior to swapping the page out to disk, and doesn't write the page if it can't. I'm not 100% sure, however, that there isn't a race between the page being mapped back in while the swapout is in progress, so I'll check.

Stable pages, possible corner cases

Posted May 12, 2011 18:07 UTC (Thu) by davecb (subscriber, #1574) [Link] (2 responses)

Perhaps I'm misunderstanding, but won't a series of small sequential writes trigger wait-on-writeback? Or does this not apply to appending to a file-backed page?

In a previous life I was involved in the performance measurement of coalescing disk writes, and we found a very large number of sequential writes could be coalesced into single writes, and then adjacent blocks coalesced into larger single writes. This paid off particularly well when a disk was being handed writes at or beyond its capacity, by removing unneeded writes. I think I still have the graphs somewhere (;-))

I'll comment on the non-sequential case in a sec, after I look at my archive.

--dave

Stable pages, possible corner cases

Posted May 12, 2011 18:49 UTC (Thu) by jzbiciak (guest, #5246) [Link] (1 responses)

It seems like it should, but only if the page starts getting flushed to disk during the series of writes. Dirty pages don't get flushed to disk immediately unless there's memory pressure, too many dirty pages, they've been sitting around awhile, or you've asked them to be flushed. All those thresholds are defined throughout here:

http://lxr.free-electrons.com/source/mm/page-writeback.c

That's what makes it so hard (at least for me) to reason about what workloads would get hurt, since there's not a simple, immediate relationship between "application dirtied a page" and "page got scheduled for writeback." You need both of those things to happen *and* the application must subsequently try to dirty the page further before you hit the page-block.

I guess you could get some negative interactions more immediately if a 'write' call scheduled a writeback for part of a page, and then the app immediately resumed filling the rest of the page. Still, I don't think a write() syscall triggers an immediate writeback on most calls. Take a look at 'queue_io' around line 277:

http://lxr.free-electrons.com/source/fs/fs-writeback.c

Only the oldest dirtied pages get flushed, as I read that.
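The thresholds referred to above are exposed as sysctls under /proc/sys/vm/; the file names in this little reader are the real tunables, while the program itself is just a convenient way to see the current values.

	/* Print the writeback-related tunables that control when dirty pages
	 * are flushed; these are the standard /proc/sys/vm sysctls. */
	#include <stdio.h>

	int main(void)
	{
		const char *tunables[] = {
			"/proc/sys/vm/dirty_background_ratio",
			"/proc/sys/vm/dirty_ratio",
			"/proc/sys/vm/dirty_expire_centisecs",
			"/proc/sys/vm/dirty_writeback_centisecs",
		};

		for (int i = 0; i < 4; i++) {
			FILE *f = fopen(tunables[i], "r");
			char buf[64];

			if (f && fgets(buf, sizeof(buf), f))
				printf("%s: %s", tunables[i], buf);
			if (f)
				fclose(f);
		}
		return 0;
	}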

Stable pages, possible corner cases

Posted May 12, 2011 19:23 UTC (Thu) by davecb (subscriber, #1574) [Link]

Excellent, thanks! --dave

Stable pages - is this "racy" ?

Posted May 12, 2011 18:35 UTC (Thu) by davecb (subscriber, #1574) [Link] (1 responses)

I had a look at the paper the work I measured was based on, and wonder if we're really looking at a race condition: we take a checksum, queue the data for I/O and compare the data as part of or after the I/O to see if an error has occurred.

Delaying, duplicating or COWing allows us to survive or avoid the data changing while the I/O is queued, which is a pretty long time compared to anything happening in main memory. The speed difference gives us a relatively large period in which a program can race ahead of the disk.

If the purpose is to validate the disk write, one would want to do the checksum as late as possible before the write, and verify it either as part of hardware write or via a read-after-write step. That keeps the time period tiny.

If the purpose is to validate it from end to end, I suspect you need more than one check. One check would need to be done as the data is queued, to be sure it made it to the queue ok, which would need to be amended if the page in queue is coalesced with a later write. In the latter case you have a new, amended checksum to check as-or-after the write.

Alas, I'm not following the main list these days, so I'm unclear of the fine details of the requirements you face!

--dave

Stable pages - is this "racy" ?

Posted May 15, 2011 20:44 UTC (Sun) by giraffedata (guest, #1954) [Link]

The race is between Linux and the disk drive. No matter when Linux computes the checksum, if the data in the buffer changes while the disk drive is transferring the data from the buffer to itself, Linux cannot ensure that the checksum the disk drive gets is correct for the rest of the data that the disk drive gets.

It's always been pretty dicey to have the disk drive get a mixture of older and newer data for a single write, but we've always arranged it so that in the cases where that can happen, it doesn't matter that we end up with garbage. But it's a lot harder to ignore a checksum mismatch, which is designed to indicate lower level corruption.

Stable pages

Posted May 12, 2011 18:46 UTC (Thu) by dlang (guest, #313) [Link] (1 responses)

the COW can be further optimized by not turning on COW until the system is ready to start processing the page

if you have a page that's being modified 1000 times a second, you don't want to have 1000 copies/sec to try and write out.

but while the system is working to write the first copy, you can allow the second copy to be modified many different times, and only when you select that page for writeout (and are ready to do the checksum on it), do you set COW.

this will get the modifications to disk as quickly as the disk will support it, but will only have one copy of the page (in addition to what's in the process of being written out to disk)

Stable pages

Posted May 12, 2011 19:54 UTC (Thu) by djwong (subscriber, #23506) [Link]

I was actually thinking that instead of doing the writeback wait, we could instead (ab)use the page migration code to remap all processes' mappings to a different page.

Stable pages

Posted May 12, 2011 6:08 UTC (Thu) by djwong (subscriber, #23506) [Link] (1 responses)

Just to pick nits, it's Darrick with an 'a' not an 'e'. :)

Stable pages

Posted May 12, 2011 6:17 UTC (Thu) by jake (editor, #205) [Link]

> Just to pick nits, it's Darrick with an 'a' not an 'e'. :)

That seems a bit more serious than a 'nit', sorry about that, fixed now.

jake

Stable pages

Posted May 12, 2011 10:32 UTC (Thu) by dgm (subscriber, #49227) [Link]

I cannot tell for sure, but a good candidate for such behavior could be a B-tree index in a database under heavy write load, where pages holding the tree structure are modified repeatedly as information goes in. Maybe SQLite while populating a new table?

Pathological corner cases

Posted May 12, 2011 13:06 UTC (Thu) by alex (subscriber, #1355) [Link]

I suspect RRD files might trip up on this. You are repeatedly dirtying a page as you step through each write in the round robin database before eventually reaching the next page boundary.

However performance is currently sucky enough that heavy RRD users are already using the caching daemon to ameliorate the effect.

Stable pages

Posted May 13, 2011 1:32 UTC (Fri) by smithj (guest, #38034) [Link] (1 responses)

As for applications which might be affected, what about shred? Obviously you are quickly writing over the same blocks over and over again.

Then again, I doubt many people consider shred to be performance-critical.

Stable pages

Posted May 13, 2011 2:04 UTC (Fri) by njs (subscriber, #40338) [Link]

I believe the point of shred is to write to the same disk block repeatedly, not write to the same memory block repeatedly and then flush the final result out to disk?

Stable pages

Posted May 13, 2011 5:27 UTC (Fri) by bazsi (guest, #63084) [Link] (2 responses)

I'm modifying a page in place at a high rate to keep the internal state of syslog-ng intact even in the case of a crash:

syslog-ng is following a logfile, and its current position is kept in a file-backed memory region. If the daemon crashes, the position remains there, so we can continue where we left off.

syslog-ng can update that page 100k/sec (especially if there are multiple such files and multiple threads reading), and I'm sure it's not waiting for writeback all the time, but that would probably negatively affect this and similar workloads.
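The pattern described here is roughly a small MAP_SHARED mapping of a state file that is updated on every message; a minimal sketch is below. The file name and the single uint64_t position field are placeholders for illustration, not syslog-ng's actual on-disk state format.

	/* Minimal sketch of keeping a file position in a file-backed shared
	 * mapping so it survives a crash of the process.  File name and
	 * layout are placeholders, not syslog-ng's actual state format. */
	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("position.state", O_RDWR | O_CREAT, 0644);
		if (fd < 0 || ftruncate(fd, sizeof(uint64_t)) < 0)
			return 1;

		uint64_t *pos = mmap(NULL, sizeof(*pos), PROT_READ | PROT_WRITE,
				     MAP_SHARED, fd, 0);
		if (pos == MAP_FAILED)
			return 1;

		/* Every update dirties the same page; if that page is under
		 * writeback, a stable-pages kernel would block here. */
		for (int i = 0; i < 100000; i++)
			*pos += 1;

		munmap(pos, sizeof(*pos));
		close(fd);
		return 0;
	}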

Stable pages

Posted May 13, 2011 20:06 UTC (Fri) by dlang (guest, #313) [Link] (1 responses)

if the modified page isn't getting written out, you aren't getting the safety that you are looking for.

Stable pages

Posted May 17, 2011 20:29 UTC (Tue) by bazsi (guest, #63084) [Link]

it gets written out eventually. it's a dirty page after all. even if the process exits.

the only thing to prevent that is an OS level crash.


Copyright © 2011, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds