The trouble with stable pages
Some storage hardware can transmit and store checksums along with data; those checksums can provide assurance that the data written to (or read from) disk matches what the processor thought it was writing. If the data in a page changes after the calculation of the checksum, though, that data will appear to be corrupted when the checksum is verified later on. Volatile data can also create problems on RAID devices and with filesystems implementing advanced features like data compression. For all of these reasons, the stable pages feature was added to ext4 for the 3.0 release (some other filesystems, btrfs included, have had stable pages for some time). With this feature, pages under writeback are marked as not being writable; any process attempting to write to such a page will block until the writeback completes. It is a relatively simple change that makes system behavior more deterministic and predictable.
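As a rough illustration of the mechanism, the sketch below (not the actual ext4 code; the function name is invented, though lock_page(), wait_on_page_writeback(), and VM_FAULT_LOCKED are real kernel interfaces) shows how a filesystem's ->page_mkwrite() handler can provide stable pages: a process taking a write fault on a file-backed page is simply put to sleep until any in-flight writeback of that page completes.

```c
#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Illustrative sketch only; not the actual ext4 implementation.
 * ->page_mkwrite() is called when a process takes a write fault on a
 * file-backed page; refusing to return until writeback finishes is
 * what keeps the page "stable" while it is in flight.
 */
static int sketch_page_mkwrite(struct vm_area_struct *vma,
			       struct vm_fault *vmf)
{
	struct page *page = vmf->page;

	lock_page(page);

	/* The stable-pages wait: sleep until any writeback I/O completes. */
	wait_on_page_writeback(page);

	/* ... check that the page is still mapped, mark it dirty, etc. ... */

	return VM_FAULT_LOCKED;
}
```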
That was the thought, anyway, and things do work out that way most of the time. But, occasionally, as described by Ted Ts'o, processes performing writes can find themselves blocked for lengthy periods of time (multiple seconds). Occasional latency spikes are not the sort of deterministic behavior the developers were after; they also leave users unamused.
In a general sense, it is not hard to imagine what may be going on after seeing this kind of problem report. The system in question is very busy, with many processes contending for the available I/O bandwidth. One process is happily minding its own business while appending to its log file. At some point, though, the final page in that log file is submitted for writeback; it then becomes unwritable. As soon as our hapless process tries to add another line to the file, it will be blocked waiting for that writeback to complete. Since the disks are contended and the I/O queues are long, that wait can go on for some time. By the time the process is allowed to proceed, it has suffered an extensive, unexpected period of latency.
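The effect is easy enough to observe from user space. A rough test along these lines (the file name and the 100ms threshold are arbitrary choices) simply times each append and complains about the slow ones; whether it actually shows multi-second stalls will naturally depend on the filesystem, the kernel configuration, and how badly the disks are oversubscribed.

```c
/* Rough user-space test: append lines to a file and report any write()
 * call that stalls for more than 100ms.  File name and threshold are
 * arbitrary. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	int fd = open("test.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
	const char line[] = "another log line\n";
	struct timespec t0, t1;

	while (fd >= 0) {
		clock_gettime(CLOCK_MONOTONIC, &t0);
		if (write(fd, line, sizeof(line) - 1) < 0)
			break;
		clock_gettime(CLOCK_MONOTONIC, &t1);

		double ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
			    (t1.tv_nsec - t0.tv_nsec) / 1e6;
		if (ms > 100)
			printf("write() stalled for %.0f ms\n", ms);
		usleep(10000);		/* roughly 100 appends per second */
	}
	return 0;
}
```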
Ted's proposed solution was to only implement stable pages if the data integrity features are built into the kernel. That fix is unlikely to be merged in that form for a few reasons. Many distributor kernels are likely to have the feature enabled, but it will actually be used on relatively few systems. As noted above, there are other places where changing data in pages under writeback can create problems. So the real solution may be some sort of runtime switch - perhaps a filesystem mount option - indicating when stable pages are needed.
It is also possible that the real problem is somewhere else. Chris Mason expressed discomfort with the idea of only using stable pages where they are strictly needed.
According to Chris, writeback latencies simply should not be seen on the scale of multiple seconds; he would like to see some effort put into figuring out why that is happening. Then, perhaps, the real problem could be fixed. But it may be that the real problem is simply that the system's resources are heavily oversubscribed and the I/O queues are long. In that case, a real fix may be hard to come by.
Boaz Harrosh suggested avoiding writeback on the final pages of any files that have been modified in the last few seconds. That might help in the "appending to a log file" case, but will not avoid unpredictable latency resulting from modification of the file at any location other than the end. People have suggested that pages modified while under writeback could be copied, allowing the modification to proceed immediately and not interfere with the writeback. That solution, though, requires more memory (perhaps during a time when the system is desperately trying to free memory) and copying pages is not free. Another option, suggested by Ted, would be to add a callback to be invoked by the block layer just before a page is passed on to the device; that callback could calculate checksums and mark the page unwritable only for the (presumably much shorter) time that it is actually under I/O.
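To make that last idea slightly more concrete, here is a purely hypothetical sketch of what such a hook might look like; none of these names exist in the mainline kernel, and a real interface, if one is ever merged, could take a very different shape.

```c
/*
 * Hypothetical sketch of the "checksum at submission time" idea; these
 * names are invented for illustration.  The block layer would invoke
 * the hook immediately before handing a page to the device, so the
 * page only needs to stay stable for the duration of the transfer
 * itself rather than for its entire wait in the I/O queue.
 */
struct page;			/* kernel page descriptor */

struct stable_write_hook {
	/* Called by the block layer just before device I/O starts. */
	void (*prepare_page)(struct page *page, void *private);
	void *private;
};

static void integrity_prepare_page(struct page *page, void *private)
{
	/* Compute the data-integrity checksum for the page here... */
	/* ...and write-protect the page only until this I/O completes,
	 * rather than from the moment it entered the queue. */
}
```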
Other solutions certainly exist. The first step, though, would appear to be to get a real handle on the problem so that solutions are written with an understanding of where the latency is actually coming from. Then, perhaps, we can have a stable pages implementation that provides stable data with stable latency in all situations.
Posted Mar 15, 2012 10:32 UTC (Thu)
by intgr (subscriber, #39733)
[Link] (2 responses)
Really, why wasn't that obvious at the time the patch was written? While Wu Fengguang is working hard to improve interactivity during heavy write I/O, other developers are hell-bent on adding new nasty behavior. Here's another instance of pretty much the same problem: https://lwn.net/Articles/467328/
The whole point of writeback is that user space shouldn't have to wait behind slow disks. But suddenly, now, it's OK to make user space wait for the whole I/O queue to clear in common cases, for the sake of obscure new features.
Posted Mar 15, 2012 13:06 UTC (Thu)
by Spudd86 (subscriber, #51683)
[Link] (1 responses)
Basically, btrfs needs to know that the checksum is right because it will check the checksum when the page is next read from disk, so if it can't do that it WILL report spurious errors and potentially lose data depending on how you or your app react.
Posted Mar 15, 2012 14:00 UTC (Thu)
by intgr (subscriber, #39733)
[Link]
Well, btrfs users had to deal with this from day one; there's no regression there.
But most users are using ext3/4, and most of them certainly aren't using compression or hardware block checksums -- hence obscure.
Posted Mar 15, 2012 12:07 UTC (Thu)
by slashdot (guest, #22014)
[Link] (3 responses)
Why doesn't Linux just copy-on-write the page under writeback instead of waiting for the block device?!?
If I correctly understand it, the current behavior is simply unacceptable and totally broken: it means that a program modifying mmapped memory that fully fits in RAM will randomly block waiting for I/O!
Posted Mar 15, 2012 13:10 UTC (Thu)
by Spudd86 (subscriber, #51683)
[Link] (2 responses)
Posted Mar 15, 2012 13:54 UTC (Thu)
by slashdot (guest, #22014)
[Link] (1 responses)
With a single program, the worst that can happen is that the COW operation itself blocks because no pages are available, which is no worse than blocking on disk access.
Also, the number of additional pages is bounded by the number of pages under writeback, which should in turn be bounded by a value proportional to the number of simultaneous requests the hardware supports; that number is small, so it shouldn't be an issue even with multiple programs.
And of course, systems with huge RAID arrays supporting bazillions of simultaneous requests are also likely to have huge amounts of RAM.
Posted Mar 15, 2012 14:01 UTC (Thu)
by slashdot (guest, #22014)
[Link]
Posted Mar 15, 2012 15:22 UTC (Thu)
by sbohrer (guest, #61058)
[Link]
https://lkml.org/lkml/2011/9/15/191
The stalls are unacceptable to us, and so far my solution has been to revert this patch in our kernels. I would love to see some progress on fixing this issue for real.
Posted Mar 15, 2012 17:24 UTC (Thu)
by davecb (subscriber, #1574)
[Link] (4 responses)
The latter sounds like the logically proper place for the checksum calculation, and might end up entirely inside the code which knows whether checksums are necessary, not in a callback at all.
--dave
Posted Mar 17, 2012 21:32 UTC (Sat)
by giraffedata (guest, #1954)
[Link] (3 responses)
I think there are any number of other ways a writer of file pages might want only a consistent set of data to get hardened to disk, so a checksum-computing callback isn't a very general solution.
Apparently, the way it works with stable pages is that something locks out the page from getting scheduled for writeback while the page is being updated and having its checksum calculated. So it sounds like a better solution is to have that thing lock out the page not from being scheduled, but from having I/O actually started. The page could move through the I/O queue while being locked and updated, but when it reaches the head of the queue, if it is locked (in the middle of an update) at that moment, the scheduler starts something else instead, while the locked page otherwise retains its position at the head of the queue.
You don't want to waste your time writing out a page that's just going to get dirty again immediately anyway.
Posted Mar 17, 2012 22:53 UTC (Sat)
by davecb (subscriber, #1574)
[Link] (2 responses)
A minor niggle about deferring writes of locked pages: you need to delay not just the locked page but also any that depend upon it. When updating files, for example, you need to write the file data, the inode data and then the directory (if the file is new). Delaying the file write until after the inode write breaks the critical ordering we depend upon for consistency.
Of course, one might also change the logic to achieve consistency during writes by something other than critical orderings at this low a level: a good commit log of both metadata and data would allow us to enthusiastically reorder writes so much we could start risking starvation (;-))
--dave
Posted Mar 17, 2012 23:49 UTC (Sat)
by giraffedata (guest, #1954)
[Link] (1 responses)
Where such ordering is required, it must be implemented today with write barriers, because otherwise the device driver, not to mention the device, is free to do I/Os from the queue in any order it pleases. But I don't think anyone would be updating a page that is scheduled for I/O and is in front of a write barrier - it would defeat the purpose.
Posted Mar 18, 2012 0:03 UTC (Sun)
by davecb (subscriber, #1574)
[Link]
--dave (exceedingly fallible) c-b
Posted Mar 22, 2012 18:35 UTC (Thu)
by Zizzle (guest, #67739)
[Link]
It seems to update DBs as you type or click in the main UI thread. So it's laggy often.
Sure, it's a crappy design, and they're working on fixing it, but stable pages could make it much worse.
Posted Mar 26, 2012 6:17 UTC (Mon)
by mfedyk (guest, #55303)
[Link]
I wonder if they reverted support for stable pages, made some other change, or introduced regressions to their enterprise customers.