By Jonathan Corbet
September 15, 2010
Writeback is the process of writing dirty memory pages (i.e. those which
have been modified by applications) back to persistent storage, saving the
data and potentially freeing the pages for other use. System performance
is heavily dependent on getting writeback right; poorly-done writeback can
lead to poor I/O rates and extreme memory pressure. Over the last year, it
has become increasingly clear that the Linux kernel is not doing writeback
as well as it should; several developers have been putting time into
improving the situation. The
dynamic dirty throttling limits
patch from Wu Fengguang demonstrates a new, relatively complex approach
to making writeback better.
One of the key concepts behind writeback handling is that processes which
are contributing the most to the problem should be the ones to suffer the most for it. In
the kernel, this suffering is managed through a call to
balance_dirty_pages(), which is meant to throttle a process's
memory-dirtying behavior until the situation improves. That throttling is
done in a straightforward way: the process is given a shovel and told to
start digging. In other words, a process which has been tossed into
balance_dirty_pages() is put to work finding dirty pages and
arranging to have them written to disk. Once a certain number of pages
have been cleaned, the process is allowed to get back to the vital task of
creating more dirty pages.
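In rough outline, the old scheme looks something like the sketch below. The
helper functions and the threshold are invented for illustration; the real
code, in mm/page-writeback.c, is rather more involved:

    #include <stdbool.h>

    /* Hypothetical stand-ins for the real writeback machinery. */
    extern bool over_dirty_threshold(void);
    extern long write_some_dirty_pages(long nr);  /* returns pages cleaned */

    /*
     * A sketch of the pre-patch behavior: the process which dirtied
     * the pages is made to clean pages itself before it may continue.
     * The target is an arbitrary example, not the kernel's real value.
     */
    static void balance_dirty_pages_old(void)
    {
        const long target = 32;    /* illustrative page count */
        long cleaned = 0;

        while (over_dirty_threshold() && cleaned < target)
            cleaned += write_some_dirty_pages(target - cleaned);
    }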
There are some problems with cleaning pages in this way, many of which have
been covered elsewhere. But one of the key ones is that it tends to
produce seeky I/O traffic. When writeback is handled normally in the
background, the kernel does its best to clean substantial numbers of pages
of the same file at the same time. Since filesystems work hard to lay out
file blocks contiguously whenever possible, writing all of a file's pages
together should cause a relatively small number of head seeks, improving
I/O bandwidth. As soon as balance_dirty_pages() gets into the
act, though, the block layer is suddenly confronted with writeback from
multiple sources; that can only lead to a seekier I/O pattern and reduced
bandwidth. So, when the system is under memory pressure and very much
needs optimal performance from its block devices, it goes into a mode which
makes that performance worse.
Fengguang's 17-part patch makes a number of changes, starting with removing
any direct writeback work from balance_dirty_pages(). Instead,
the offending process simply goes to sleep for a while, secure in the
knowledge that writeback is being handled by other parts of the system.
That should lead to better I/O performance, but also to more predictable
and controllable pauses for memory-intensive applications.
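The new scheme reduces, in essence, to a simple sleep; something like this
sketch, where all of the names are illustrative and the pause calculation
(described next) happens elsewhere:

    #include <stdbool.h>

    extern bool over_dirty_threshold(void);
    extern void sleep_ms(long ms);

    /*
     * A sketch of the post-patch behavior: rather than issuing I/O
     * itself, the dirtying process simply pauses while the flusher
     * threads do the writeback.
     */
    static void balance_dirty_pages_new(long pause_ms)
    {
        if (over_dirty_threshold())
            sleep_ms(pause_ms);
    }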
Much of the rest of the patch series is aimed at improving that pause
calculation. It adds a new mechanism for estimating the actual bandwidth
of each backing device - something the kernel currently does not have a
good handle on. Using that information, combined with the number of pages
that the kernel would like to see written out before allowing a dirtying
process to continue, a reasonable pause duration can be calculated. That
pause is not allowed to exceed 200ms.
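The core arithmetic might look something like the following; the names are
invented, but the relationship (pages to write, divided by the estimated
bandwidth, capped at 200ms) follows the description above:

    /*
     * A sketch of the pause calculation: the time needed to write the
     * desired number of pages at the device's estimated bandwidth,
     * capped at 200ms.
     */
    #define MAX_PAUSE_MS 200

    static long compute_pause_ms(long pages_to_write, long bw_pages_per_sec)
    {
        long pause;

        if (bw_pages_per_sec <= 0)
            return MAX_PAUSE_MS;   /* no estimate yet; be conservative */

        pause = pages_to_write * 1000 / bw_pages_per_sec;
        return pause > MAX_PAUSE_MS ? MAX_PAUSE_MS : pause;
    }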
The patch set tries to be smarter than that, though. 200ms is a long time
to pause a process which is trying to get some work done. On the other
hand, without a bit of care, it is also possible to pause processes for a
very short period of time, which is bad for throughput. For this patch
set, it was decided that optimal pauses would be between 10ms and 100ms.
This range is achieved by maintaining a separate
"nr_dirtied_pause" limit for every process; if the number of
dirtied pages for that process is below the limit, it is not forced to
pause. Any time balance_dirty_pages() calculates a pause time of
less than 10ms, the limit is raised; if, instead, the pause turns out to be
over 100ms, the limit is cut in half. The desired result is a
pause within the selected range which tends quickly toward the 10ms end
when memory pressure drops.
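The adaptation reduces to a simple feedback rule, roughly like this sketch.
The size of the upward step is a guess on our part; the description above
only says that the limit is raised:

    /*
     * A sketch of the per-process nr_dirtied_pause adaptation: pauses
     * under 10ms raise the limit, pauses over 100ms cut it in half.
     */
    static long adjust_nr_dirtied_pause(long limit, long last_pause_ms)
    {
        if (last_pause_ms < 10)
            limit += limit / 4;    /* hypothetical increment */
        else if (last_pause_ms > 100)
            limit /= 2;

        return limit > 0 ? limit : 1;
    }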
Another change made by this patch series is to try to come up with a global
estimate of the memory pressure on the system. When normal memory scanning
encounters dirty pages, the pressure estimate is increased. If, instead,
the kswapd process on the most memory-stressed node in the system
goes idle, then the estimate is decreased. This estimate is then used to
adjust the throttling limits applied to processes; when the system is under
heavy memory pressure, memory-dirtying processes will be put on hold sooner
than they otherwise would be.
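In skeletal form, the estimate might be maintained along these lines; the
variable and both step sizes are invented for illustration:

    /*
     * A sketch of the global pressure estimate: reclaim running into
     * dirty pages pushes the estimate up, while an idle kswapd on the
     * most stressed node pulls it back down.
     */
    static int dirty_pressure;

    static void note_dirty_page_scanned(void)
    {
        dirty_pressure++;          /* reclaim is hitting dirty pages */
    }

    static void note_kswapd_idle(void)
    {
        if (dirty_pressure > 0)
            dirty_pressure--;      /* the worst-off node has caught up */
    }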
There is one other important change made in this patch set. Filesystem
developers have been complaining for a while that the core memory
management code tells them to write back too little memory at a time. On a
fast device, overly small writeback requests will fail to keep the device
busy, resulting in suboptimal performance. So some filesystems (xfs and
ext4) actually ignore the amount of requested writeback; they will write
back many more pages than they were asked to do. That can improve
performance, but it is not without its problems; in particular, sending
massive write operations to slow devices can stall the system for
unacceptably long times.
Once this patch set is in place, there's a better way to calculate the best
writeback size. The system now knows what kind of bandwidth it can expect
from each device; using that information, it can size its requests to keep
the device busy for one second at a time. Throttling limits are also based
on this one-second number; if there are not enough dirty pages in the
system for one second of I/O activity, the backing device is probably not
being used to its full capacity and the number of dirty pages should be
allowed to increase. In summary: the bandwidth estimation allows the
kernel to scale dirty limits and I/O sizes to make the best use of all of
the devices in the system, regardless of any specific device's performance
characteristics.
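Both rules fall out of the bandwidth estimate directly; a minimal sketch,
with illustrative names, might read:

    #include <stdbool.h>

    /*
     * A sketch of the one-second heuristic: size writeback requests to
     * keep a device busy for about a second, and let the dirty limit
     * grow when there is less than a second's worth of dirty pages.
     */
    static long writeback_chunk_pages(long bw_pages_per_sec)
    {
        return bw_pages_per_sec;   /* one second's worth of I/O */
    }

    static bool should_raise_dirty_limit(long nr_dirty, long bw_pages_per_sec)
    {
        return nr_dirty < bw_pages_per_sec;
    }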
Getting this code into the mainline could take a while, though. It is a
complicated set of changes to core code which is already complex; as such,
it will be hard for others to review. There have been some concerns raised
about the specifics of some of the heuristics. A large amount of
performance testing will also be required to get this kind of change
merged. So we may have to wait for a while yet, but better writeback
should be coming eventually.