The trouble with stable pages
Some storage hardware can transmit and store checksums along with data; those checksums can provide assurance that the data written to (or read from) disk matches what the processor thought it was writing. If the data in a page changes after the calculation of the checksum, though, that data will appear to be corrupted when the checksum is verified later on. Volatile data can also create problems on RAID devices and with filesystems implementing advanced features like data compression. For all of these reasons, the stable pages feature was added to ext4 for the 3.0 release (some other filesystems, btrfs included, have had stable pages for some time). With this feature, pages under writeback are marked as not being writable; any process attempting to write to such a page will block until the writeback completes. It is a relatively simple change that makes system behavior more deterministic and predictable.
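As a rough illustration of the mechanism, the sketch below (not the actual ext4 code; the function name is invented, though lock_page(), wait_on_page_writeback(), and VM_FAULT_LOCKED are real kernel interfaces) shows how a filesystem's ->page_mkwrite() handler can provide stable pages: a process taking a write fault on a file-backed page is simply put to sleep until any in-flight writeback of that page completes.

```c
#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Illustrative sketch only; not the actual ext4 implementation.
 * ->page_mkwrite() is called when a process takes a write fault on a
 * file-backed page; refusing to return until writeback finishes is
 * what keeps the page "stable" while it is in flight.
 */
static int sketch_page_mkwrite(struct vm_area_struct *vma,
			       struct vm_fault *vmf)
{
	struct page *page = vmf->page;

	lock_page(page);

	/* The stable-pages wait: sleep until any writeback I/O completes. */
	wait_on_page_writeback(page);

	/* ... check that the page is still mapped, mark it dirty, etc. ... */

	return VM_FAULT_LOCKED;
}
```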
That was the thought, anyway, and things do work out that way most of the time. But, occasionally, as described by Ted Ts'o, processes performing writes can find themselves blocked for lengthy periods of time (multiple seconds). Occasional latency spikes are not the sort of deterministic behavior the developers were after; they also leave users unamused.
In a general sense, it is not hard to imagine what may be going on after seeing this kind of problem report. The system in question is very busy, with many processes contending for the available I/O bandwidth. One process is happily minding its own business while appending to its log file. At some point, though, the final page in that log file is submitted for writeback; it then becomes unwritable. As soon as our hapless process tries to add another line to the file, it will be blocked waiting for that writeback to complete. Since the disks are contended and the I/O queues are long, that wait can go on for some time. By the time the process is allowed to proceed, it has suffered an extensive, unexpected period of latency.
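The effect is easy enough to observe from user space. A rough test along these lines (the file name and the 100ms threshold are arbitrary choices) simply times each append and complains about the slow ones; whether it actually shows multi-second stalls will naturally depend on the filesystem, the kernel configuration, and how badly the disks are oversubscribed.

```c
/* Rough user-space test: append lines to a file and report any write()
 * call that stalls for more than 100ms.  File name and threshold are
 * arbitrary. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	int fd = open("test.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
	const char line[] = "another log line\n";
	struct timespec t0, t1;

	while (fd >= 0) {
		clock_gettime(CLOCK_MONOTONIC, &t0);
		if (write(fd, line, sizeof(line) - 1) < 0)
			break;
		clock_gettime(CLOCK_MONOTONIC, &t1);

		double ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
			    (t1.tv_nsec - t0.tv_nsec) / 1e6;
		if (ms > 100)
			printf("write() stalled for %.0f ms\n", ms);
		usleep(10000);		/* roughly 100 appends per second */
	}
	return 0;
}
```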
Ted's proposed solution was to only implement stable pages if the data integrity features are built into the kernel. That fix is unlikely to be merged in that form for a few reasons. Many distributor kernels are likely to have the feature enabled, but it will actually be used on relatively few systems. As noted above, there are other places where changing data in pages under writeback can create problems. So the real solution may be some sort of runtime switch - perhaps a filesystem mount option - indicating when stable pages are needed.
It is also possible that the real problem is somewhere else. Chris Mason expressed discomfort with the idea of only using stable pages where they are strictly needed.
According to Chris, writeback latencies simply should not be seen on the scale of multiple seconds; he would like to see some effort put into figuring out why that is happening. Then, perhaps, the real problem could be fixed. But it may be that the real problem is simply that the system's resources are heavily oversubscribed and the I/O queues are long. In that case, a real fix may be hard to come by.
Boaz Harrosh suggested avoiding writeback on the final pages of any files that have been modified in the last few seconds. That might help in the "appending to a log file" case, but will not avoid unpredictable latency resulting from modification of the file at any location other than the end. People have suggested that pages modified while under writeback could be copied, allowing the modification to proceed immediately and not interfere with the writeback. That solution, though, requires more memory (perhaps during a time when the system is desperately trying to free memory) and copying pages is not free. Another option, suggested by Ted, would be to add a callback to be invoked by the block layer just before a page is passed on to the device; that callback could calculate checksums and mark the page unwritable only for the (presumably much shorter) time that it is actually under I/O.
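To make that last idea slightly more concrete, here is a purely hypothetical sketch of what such a hook might look like; none of these names exist in the mainline kernel, and a real interface, if one is ever merged, could take a very different shape.

```c
/*
 * Hypothetical sketch of the "checksum at submission time" idea; these
 * names are invented for illustration.  The block layer would invoke
 * the hook immediately before handing a page to the device, so the
 * page only needs to stay stable for the duration of the transfer
 * itself rather than for its entire wait in the I/O queue.
 */
struct page;			/* kernel page descriptor */

struct stable_write_hook {
	/* Called by the block layer just before device I/O starts. */
	void (*prepare_page)(struct page *page, void *private);
	void *private;
};

static void integrity_prepare_page(struct page *page, void *private)
{
	/* Compute the data-integrity checksum for the page here... */
	/* ...and write-protect the page only until this I/O completes,
	 * rather than from the moment it entered the queue. */
}
```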
Other solutions certainly exist. The first step, though, would appear to be to get a real handle on the problem so that solutions are written with an understanding of where the latency is actually coming from. Then, perhaps, we can have a stable pages implementation that provides stable data with stable latency in all situations.
Posted Mar 15, 2012 10:32 UTC (Thu)
by intgr (subscriber, #39733)
[Link] (2 responses)
Really, why wasn't that obvious at the time the patch was written? While Wu Fengguang is working hard to improve interactivity during heavy write I/O, other developers are hell-bent on adding new nasty behavior. Here's another instance of pretty much the same problem: https://lwn.net/Articles/467328/
The whole point of writeback is that user space shouldn't have to wait behind slow disks. But suddenly, now, it's OK to make user space wait for the whole I/O queue to clear in common cases, for the sake of obscure new features.
Posted Mar 15, 2012 13:06 UTC (Thu)
by Spudd86 (subscriber, #51683)
[Link] (1 responses)
Basically, btrfs needs to know that the checksum is right because it will check the checksum when the page is next read from disk, so if it can't do that it WILL report spurious errors and potentially lose data depending on how you or your app react.
Posted Mar 15, 2012 14:00 UTC (Thu)
by intgr (subscriber, #39733)
[Link]
Well, btrfs users had to deal with this from day one; there's no regression there.
But most users are using ext3/4, and most of them certainly aren't using compression or hardware block checksums -- hence obscure.
Posted Mar 15, 2012 12:07 UTC (Thu)
by slashdot (guest, #22014)
[Link] (3 responses)
Why doesn't Linux just copy-on-write the page under writeback instead of waiting for the block device?!?
If I correctly understand it, the current behavior is simply unacceptable and totally broken: it means that a program modifying mmapped memory that fully fits in RAM will randomly block waiting for I/O!
Posted Mar 15, 2012 13:10 UTC (Thu)
by Spudd86 (subscriber, #51683)
[Link] (2 responses)
Posted Mar 15, 2012 13:54 UTC (Thu)
by slashdot (guest, #22014)
[Link] (1 responses)
With a single program, the worst that can happen is that the COW operation itself blocks because no pages are available, which is no worse than blocking on disk access.
Also, the number of additional pages is bounded by the number of pages under writeback, which should in turn be bounded by a value proportional to the number of simultaneous requests the hardware supports; that number is small, so it shouldn't be an issue even with multiple programs.
And of course, systems with huge RAID arrays supporting bazillions of simultaneous requests are also likely to have huge amounts of RAM.
Posted Mar 15, 2012 14:01 UTC (Thu)
by slashdot (guest, #22014)
[Link]
Posted Mar 15, 2012 15:22 UTC (Thu)
by sbohrer (guest, #61058)
[Link]
https://lkml.org/lkml/2011/9/15/191
The stalls are unacceptable to us, and so far my solution has been to revert this patch in our kernels. I would love to see some progress on fixing this issue for real.
Posted Mar 15, 2012 17:24 UTC (Thu)
by davecb (subscriber, #1574)
[Link] (4 responses)
The latter sounds like the logically proper place for the checksum calculation, and might end up entirely inside the code which knows whether checksums are necessary, not in a callback at all.
--dave
Posted Mar 17, 2012 21:32 UTC (Sat)
by giraffedata (guest, #1954)
[Link] (3 responses)
I think there are any number of other ways a writer of file pages might want only a consistent set of data to get hardened to disk, so a checksum-computing callback isn't a very general solution.
Apparently, the way it works with stable pages is that something locks out the page from getting scheduled for writeback while the page is being updated and having its checksum calculated. So it sounds like a better solution is to have that thing lock out the page not from being scheduled, but from having I/O actually started. The page could move through the I/O queue while being locked and updated, but when it reaches the head of the queue, if it is locked (in the middle of an update) at that moment, the scheduler starts something else instead, while the locked page otherwise retains its position at the head of the queue.
You don't want to waste your time writing out a page that's just going to get dirty again immediately anyway.
Posted Mar 17, 2012 22:53 UTC (Sat)
by davecb (subscriber, #1574)
[Link] (2 responses)
A minor niggle about deferring writes of locked pages: you need to delay not just the locked page but also any that depend upon it. When updating files, for example, you need to write the file data, the inode data and then the directory (if the file is new). Delaying the file write until after the inode write breaks the critical ordering we depend upon for consistency.
Of course, one might also change the logic to achieve consistency during writes by something other than critical orderings at this low a level: a good commit log of both metadata and data would allow us to enthusiastically reorder writes so much we could start risking starvation (;-))
--dave
Posted Mar 17, 2012 23:49 UTC (Sat)
by giraffedata (guest, #1954)
[Link] (1 responses)
Where such ordering is required, it must be implemented today with write barriers, because otherwise the device driver, not to mention the device, is free to do I/Os from the queue in any order it pleases. But I don't think anyone would be updating a page that is scheduled for I/O and is in front of a write barrier - it would defeat the purpose.
Posted Mar 18, 2012 0:03 UTC (Sun)
by davecb (subscriber, #1574)
[Link]
--dave (exceedingly fallible) c-b
Posted Mar 22, 2012 18:35 UTC (Thu)
by Zizzle (guest, #67739)
[Link]
It seems to update DBs as you type or click in the main UI thread. So it's laggy often.
Sure, it's a crappy design, and they're working on fixing it, but stable pages could make it much worse.
Posted Mar 26, 2012 6:17 UTC (Mon)
by mfedyk (guest, #55303)
[Link]
I wonder if they reverted support for stable pages, made some other change, or introduced regressions to their enterprise customers.