The trouble with stable pages
Some storage hardware can transmit and store checksums along with data; those checksums can provide assurance that the data written to (or read from) disk matches what the processor thought it was writing. If the data in a page changes after the calculation of the checksum, though, that data will appear to be corrupted when the checksum is verified later on. Volatile data can also create problems on RAID devices and with filesystems implementing advanced features like data compression. For all of these reasons, the stable pages feature was added to ext4 for the 3.0 release (some other filesystems, btrfs included, have had stable pages for some time). With this feature, pages under writeback are marked as not being writable; any process attempting to write to such a page will block until the writeback completes. It is a relatively simple change that makes system behavior more deterministic and predictable.
That was the thought, anyway, and things do work out that way most of the time. But, occasionally, as described by Ted Ts'o, processes performing writes can find themselves blocked for lengthy periods (multiple seconds) of time. Occasional latency spikes are not the sort of deterministic behavior the developers were after; they also leave users unamused.
In a general sense, it is not hard to imagine what may be going on after seeing this kind of problem report. The system in question is very busy, with many processes contending for the available I/O bandwidth. One process is happily minding its own business while appending to its log file. At some point, though, the final page in that log file is submitted for writeback; it then becomes unwritable. As soon as our hapless process tries to add another line to the file, it will be blocked waiting for that writeback to complete. Since the disks are contended and the I/O queues are long, that wait can go on for some time. By the time the process is allowed to proceed, it has suffered an extensive, unexpected period of latency.
Ted's proposed solution was to only implement stable pages if the data integrity features are built into the kernel. That fix is unlikely to be merged in that form for a few reasons. Many distributor kernels are likely to have the feature enabled, but it will actually be used on relatively few systems. As noted above, there are other places where changing data in pages under writeback can create problems. So the real solution may be some sort of runtime switch - perhaps a filesystem mount option - indicating when stable pages are needed.
It is also possible that the real problem is somewhere else. Chris Mason expressed discomfort with the idea of only using stable pages where they are strictly needed:
According to Chris, writeback latencies simply should not be seen on the scale of multiple seconds; he would like to see some effort put into figuring out why that is happening. Then, perhaps, the real problem could be fixed. But it may be that the real problem is simply that the system's resources are heavily oversubscribed and the I/O queues are long. In that case, a real fix may be hard to come by.
Boaz Harrosh suggested avoiding writeback on the final pages of any files that have been modified in the last few seconds. That might help in the "appending to a log file" case, but will not avoid unpredictable latency resulting from modification of the file at any location other than the end. People have suggested that pages modified while under writeback could be copied, allowing the modification to proceed immediately and not interfere with the writeback. That solution, though, requires more memory (perhaps during a time when the system is desperately trying to free memory) and copying pages is not free. Another option, suggested by Ted, would be to add a callback to be invoked by the block layer just before a page is passed on to the device; that callback could calculate checksums and mark the page unwritable only for the (presumably much shorter) time that it is actually under I/O.
Other solutions certainly exist. The first step, though, would appear to
be to get a real handle on the problem so that solutions are written with
an understanding of where the latency is actually coming from. Then,
perhaps, we can have a stable pages implementation that provides stable
data with stable latency in all situations.
| Index entries for this article | |
|---|---|
| Kernel | Stable pages |
