In defense of per-BDI writeback
Chris Mason has tried to provide that justification with a combination of benchmark results and explanations. The benchmarks show a clear - and large - performance improvement from the use of per-BDI writeback. That is good, but does not, by itself, justify the switch to per-BDI writeback; Andrew had suggested that the older code was slower as the result of performance regressions introduced over time by other changes. If the 2.6.31 code could be fixed, the performance improvement could be (re)gained without replacing the entire subsystem.
What Chris is saying is that the old, pdflush-based method could not be fixed. The fundamental problem with pdflush is that it would back off when the backing device appeared to be congested. But congestion is easy to cause, and no other part of the system backs off in the same way, so pdflush could end up doing no writeback at all for significant periods of time. Forcing all other writers to back off in the face of congestion might improve things, but that would be a big change, and it would not address the other problem: congestion-based backoff can defeat attempts by filesystem code and the block layer to write large, contiguous segments to disk.
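To make that backoff pattern concrete, here is a minimal, self-contained userspace sketch of the idea. It is not the kernel's pdflush code: the `device` structure, the chunk size, and the `writeback_pass()` helper are all invented for illustration, and the sleep merely stands in for the kernel's congestion_wait().

```c
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Illustrative stand-in for a backing device; not a kernel structure. */
struct device {
    const char *name;
    bool congested;     /* set when the device's queue looks full */
    long dirty_pages;   /* pages waiting to be written back */
};

/* One pass of a pdflush-style loop: back off from congested devices. */
static void writeback_pass(struct device *devs, int ndevs)
{
    for (int i = 0; i < ndevs; i++) {
        if (devs[i].congested)
            continue;   /* nothing at all is written for this device */

        /* Write back a bounded chunk, then move on to the next device. */
        long chunk = devs[i].dirty_pages < 1024 ? devs[i].dirty_pages : 1024;
        devs[i].dirty_pages -= chunk;
        printf("%s: wrote %ld pages\n", devs[i].name, chunk);
    }
}

int main(void)
{
    struct device devs[] = {
        { "sda", false, 8192 },
        { "sdb", true,  8192 },  /* congested: skipped on every pass */
    };

    for (int pass = 0; pass < 4; pass++) {
        writeback_pass(devs, 2);
        sleep(1);   /* crude stand-in for congestion_wait() */
    }
    return 0;
}
```

In this toy, the congested device sees no writeback at all, and even the uncongested one is only ever fed bounded chunks - roughly how congestion-based backoff can defeat attempts to write large, contiguous segments.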
As it happens, there is a more general throttling mechanism already built
into the block layer: the finite number of outstanding requests allowed for
any specific device. Once requests are exhausted, threads generating block
I/O operations are forced to wait until request slots become free again.
Pdflush cannot use this mechanism, though: it must perform writeback to
multiple devices at once, so blocking on request allocation for one device
would stall writeback for all of the others. A per-device writeback thread
can block there safely, since doing so will not affect I/O to any other
device. The per-BDI patch
creates these per-device threads and, as a result, it is able to keep
devices busier. That, it seems, is why the old writeback code needed to be
replaced instead of patched.
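A correspondingly rough sketch of the per-device alternative follows. The blocking submit_request() below merely stands in for the block layer's request allocation (which really does make callers wait for a free slot); the structure, the slot count, and the chunk size are again invented for illustration. The point is just that a flusher thread dedicated to one device can block on that device's queue without delaying writeback anywhere else.

```c
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

/* Hypothetical per-device state: a fixed pool of request slots. */
struct device {
    const char *name;
    sem_t slots;              /* counts free request slots */
    long dirty_pages;
};

/* Waits for a free request slot; sem_wait() blocks when none are left,
 * which is how the block layer throttles writers. */
static void submit_request(struct device *dev, long pages)
{
    sem_wait(&dev->slots);
    printf("%s: queued %ld pages\n", dev->name, pages);
    /* In this toy the "I/O" completes instantly, so the slot is released
     * immediately; a real driver would do this on I/O completion. */
    sem_post(&dev->slots);
}

/* One flusher thread per device: blocking here stalls nobody else. */
static void *flusher(void *arg)
{
    struct device *dev = arg;

    while (dev->dirty_pages > 0) {
        long chunk = dev->dirty_pages < 4096 ? dev->dirty_pages : 4096;
        submit_request(dev, chunk);
        dev->dirty_pages -= chunk;
    }
    return NULL;
}

int main(void)
{
    struct device devs[2] = {
        { .name = "sda", .dirty_pages = 16384 },
        { .name = "sdb", .dirty_pages = 16384 },
    };
    pthread_t threads[2];

    for (int i = 0; i < 2; i++) {
        sem_init(&devs[i].slots, 0, 128);   /* e.g. 128 outstanding requests */
        pthread_create(&threads[i], NULL, flusher, &devs[i]);
    }
    for (int i = 0; i < 2; i++)
        pthread_join(threads[i], NULL);
    return 0;
}
```

Because such a thread never has to back off, it can keep handing its device large chunks and simply stalls when the device itself cannot keep up.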
| Index entries for this article | |
|---|---|
| Kernel | Block layer/Writeback |
| Kernel | Memory management/Writeback |
Posted Oct 1, 2009 8:33 UTC (Thu) by axboe (subscriber, #904)
Posted Oct 1, 2009 14:54 UTC (Thu) by peter_w_morreale (guest, #30066)
The old writeback code traversed super blocks in order, skipping over those currently congested and without regard to the throughput of the devices backing the supers. Recall that the old writeback code/pdflush indiscriminately issued writes until the dirty-memory threshold was reached.
This could have led (and probably did lead) to performance penalties for applications referencing the *fast* devices while improving the performance of apps on the slow devices. It certainly led to unfairness issues wrt who dirties memory and who cleans it.
Consider the following kludged example to illustrate the point. Two apps, both dirtying pages at the same rate, one app backed by a "fast" device, the other by a "slow" device. Both apps are contributing to the dirty page count at the same rate, so now pdflush and writeback kick in.
Since the slow device will remain in a "congested" state longer (since it is "slow"), the faster device will eventually account for more cleaning of pages than the slow device.
This has two effects:
1) Dirty pages for the app on the slow device potentially stay in memory longer and have a better chance of being re-referenced without I/O.
2) Dirty pages for the "fast" device are more likely to be written out and consequently require an I/O for re-reference.
So we wind up penalizing the app on the fast storage device. In theory at least. :-)
I haven't looked at the per-BDI code, but with it, it is now possible to apply fairness to ensure that each device cleans its share of dirty pages. (Whether that is a good thing or not, I don't know; it's just that it enables the capability.)
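A toy simulation of the kludged example above makes the asymmetry visible. Every number here (dirtying rates, the congestion duty cycle, the per-pass limits) is made up, and the loop only mimics the "skip congested devices" behavior described in the article:

```c
#include <stdio.h>

/* Toy model: two apps dirty pages at the same rate; writeback skips
 * whichever device is currently "congested".  All numbers are made up. */
int main(void)
{
    long dirty_fast = 0, dirty_slow = 0;      /* dirty pages per device  */
    long cleaned_fast = 0, cleaned_slow = 0;  /* pages written back      */

    for (int tick = 0; tick < 1000; tick++) {
        /* Both apps dirty pages at the same rate. */
        dirty_fast += 100;
        dirty_slow += 100;

        /* The slow device is congested most of the time (say 9 ticks
         * out of 10), so the old writeback loop skips it. */
        int slow_congested = (tick % 10) != 0;

        /* Writeback pass: clean from whichever devices aren't congested. */
        long c = dirty_fast < 150 ? dirty_fast : 150;   /* fast device rate */
        dirty_fast -= c;
        cleaned_fast += c;

        if (!slow_congested) {
            c = dirty_slow < 50 ? dirty_slow : 50;      /* slow device rate */
            dirty_slow -= c;
            cleaned_slow += c;
        }
    }

    printf("fast device: %ld pages cleaned, %ld still dirty\n",
           cleaned_fast, dirty_fast);
    printf("slow device: %ld pages cleaned, %ld still dirty\n",
           cleaned_slow, dirty_slow);
    return 0;
}
```

After 1000 ticks the fast device's pages have all been cleaned promptly (so re-referencing them needs I/O), while most of the slow device's dirty pages are still sitting in memory - effects 1) and 2) above.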
Posted Oct 3, 2009 15:00 UTC (Sat) by anton (subscriber, #25547) (2 responses)
I do not think that the problem was in the flash device, because it
was originally new (no need to shuffle old data around), the slowdown
occurred pretty soon (not only near the end), and various measures
taken at the host end helped (like invoking sync, or writing the data in
smaller batches with syncing in between).
I had a similar experience when trying to fill my 8GB ogg player with
music, except that this device was slow to begin with (3MB/s when
writing a few hundred MB), but filling it up still should not have
taken 8 hours (280KB/s).
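For what it's worth, the "smaller batches with syncing in between" workaround might look roughly like the sketch below; the 8MB batch size and the use of fdatasync() are arbitrary choices for illustration, not what cp or the kernel actually does.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BATCH (8 * 1024 * 1024)   /* flush after every 8MB; arbitrary */

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
        return 1;
    }

    int in = open(argv[1], O_RDONLY);
    int out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (in < 0 || out < 0) {
        perror("open");
        return 1;
    }

    char buf[64 * 1024];
    ssize_t n;
    size_t since_sync = 0;

    while ((n = read(in, buf, sizeof(buf))) > 0) {
        if (write(out, buf, n) != n) {
            perror("write");
            return 1;
        }
        since_sync += n;
        if (since_sync >= BATCH) {
            fdatasync(out);       /* force writeback before dirtying more */
            since_sync = 0;
        }
    }

    fdatasync(out);
    close(in);
    close(out);
    return 0;
}
```

The idea is simply to keep the backlog of dirty data small, so that background writeback never has much to drain at once.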
Posted Oct 11, 2009 6:30 UTC (Sun) by mfedyk (guest, #55303) (1 response)
If I copied files with cp or mv, I noticed a marked improvement in throughput compared to the gnome file manager.
Try it again with mv or cp and see if there is a difference.
Posted Oct 11, 2009 12:59 UTC (Sun) by anton (subscriber, #25547)
> The fundamental problem with pdflush is that it would back off when the backing device appeared to be congested.

That might explain the huge slowdowns I saw (on Linux 2.6.19 and 2.6.27) when writing several GB to flash devices. One was a pretty fast 8GB SD card (SDHC class 6, i.e. >6MB/s on a certain workload; I typically saw >10MB/s when writing several hundred MB), yet it took several hours to fill up; I no longer remember whether the system also suffered in other ways. Calling sync now and then seemed to help, but the whole thing still took a very long time.
I did use cp.