Improving readahead

By Jonathan Corbet
February 3, 2010

Readahead is the process of speculatively reading file data into the page cache in the hope that it will be useful to an application in the near future. When readahead works well, it can significantly improve the performance of I/O bound applications by avoiding the need for those applications to wait for data and by increasing I/O transfer size. On the other hand, readahead risks making performance worse as well: if it guesses wrong, scarce memory and I/O bandwidth will be wasted on data which will never be used. So, as is the case with memory management in general, readahead algorithms are both performance-critical and heavily based on heuristics.

As is also generally the case with such code, few people dare to wander into the readahead logic; it tends to be subtle and quick to anger. One of those who dare is Wu Fengguang, who has worked on readahead a few times over the years. His latest contribution is this set of patches which tries to improve readahead performance in the general case while also making it more responsive to low-memory situations.

The headline feature of this patch set is an increase in the maximum readahead size from 128KB to 512KB. Given the size of today's files and storage devices, 512KB may well seem a bit small. But there are costs to readahead, including the amount of memory required to store the data and the amount of I/O bandwidth required to read it. If a larger readahead buffer causes other useful data to be paged out, it could cause a net loss in system performance even if all of the readahead data proves to be useful. Larger readahead operations will occupy the storage device for longer, causing I/O latencies to increase. And one should remember that there can be a readahead buffer associated with every open file descriptor - of which there can be thousands - in the system. Even a small increase in the amount of readahead can have a large impact on the behavior of the system.

The 512K number was reached by way of an extensive series of benchmark runs using both rotating and solid-state storage devices. With rotating disks, bumping the maximum readahead size to 512KB nearly tripled I/O throughput with a modest increase in I/O latency; any further increases, while increasing throughput again, caused latency increases that were deemed to be unacceptable. On solid-state devices the throughput increase was less (on a percentage basis) but still significant.

These numbers hold for a device with reasonable performance, though. A typical USB thumb drive, not being a device with reasonable performance, can run into real trouble with an increased readahead size. To address this problem, the patch set puts a cap on the readahead window size for small devices. For a 2MB device (assuming such a thing can be found), readahead is limited to 4KB; for a 2GB drive, the limit is 128KB. Only at 32GB does the full 512KB readahead window take effect.

This heuristic is not perfect. Jens Axboe protested that some solid-state devices are relatively small in capacity, but they can be quite fast. Such devices may not perform as well as they could with a larger readahead size.

Another part of this patch set is the "context readahead" code which tries to prevent the system from performing more readahead than its memory can handle. For a typical file stream with no memory contention, the contents of the page cache can be visualized (within your editor's poor drawing skills) like this:

Here, we are looking at a representation of a stream of pages containing the file's data; the green pages are those which are in the page cache at the moment. Several recently-consumed pages behind the offset have not yet been evicted, and the full readahead window is waiting for the application to get around to consuming it.

If memory is tight, though, we could find a situation more like this:

Because the system is scrambling for memory, it has been much more aggressive about evicting this file's pages from the page cache. There is much less history there, but, more importantly, a number of pages which were brought in via readahead have been pushed back out before the application was able to actually make use of them. This sort of thrashing behavior is harmful to system performance; the readahead occupied memory when it was needed elsewhere, and that data will have to be read a second time in the near future. Clearly, when this sort of behavior is seen, the system should be doing less readahead.

Thrashing behavior is easily detected; if pages which have already been read in via readahead are missing when the application tries to actually read them, things are going amiss. When that happens, the code will get an estimate of the amount of memory it can safely use by counting the number of history pages (those which have already been consumed by the application) which remain in the page cache. If some history remains, the number of history pages is taken as a guess for what the size of the readahead window should be.

If, instead, there's no history at all, the readahead size is halved. In this case, the readahead code will also carefully shift any readahead pages which are still in memory to the head of the LRU list, making it less likely that they will be evicted immediately prior to their use. The file descriptor will be marked as "thrashed," causing the kernel to continue to use the history size as a guide for the readahead window size in the future. That, in turn, will cause the window to expand and contract as memory conditions warrant.

Readahead changes can be hard to get into the mainline. The heuristics can be tricky, and, as Linus has noted, it can be easy to optimize the system for a subset of workloads:

The problem is, it's often easier to test/debug the "good" cases, ie the cases where we _want_ read-ahead to trigger. So that probably means that we have a tendency to read-ahead too aggressively, because those cases are the ones where people can most easily look at it and say "yeah, this improves throughput of a 'dd bs=8192'".

The stated goal of this patch set is to make readahead more aggressive by increasing the maximum size of the readahead window. But, in truth, much of the work goes in the other direction, constraining the readahead mechanism in situations where too much readahead can do harm. Whether these new heuristics reliably improve performance will not be known until a considerable amount of benchmarking has been done.

Index entries for this article
Kernel	Memory management/Readahead
Kernel	Readahead

Improving readahead

Posted Feb 4, 2010 2:40 UTC (Thu) by NCunningham (guest, #6457) [Link] (1 responses)

I have similar issues in TuxOnIce. The approach I've used, which might also be useful here, is (among other things) to throttle readahead based on the rate at which data has been read so far. Perhaps a throughput-based limit might be worth a look here, too? (Particularly in the context of Jens' objection).

Improving readahead

Posted Feb 10, 2010 23:40 UTC (Wed) by ccurtis (guest, #49713) [Link]

My first thought was the following. Is this the same thing you are saying?

Set read-ahead pages to 'x'
if( offset != 0 && recent_pages_still_in_ram < 'x' )
set read-ahead = recent_pages_still_in_ram * 'y'

'y' could be some scale value, such as '1' (no scaling) or 0.5.

The idea here being that if this file handle is having pages ejected, the read-ahead is limited to the number of pages still in ram that fit within the read-ahead window.

So, at first read from offset 0, the full window is available. If half of those pages are ejected before they are referenced, the pages still in RAM value is now 1/2 of the read ahead window, so that next read-ahead window is then limited to 1/2 size. If 1/2 of the pages read are also ejected from the head of the file, the read-ahead window would then be 1/4 of the maximum window size (because 1/4 were ejected from the front, and 1/2 from the end - leaving us 1/2 way into the window with only the most recent 1/4 still in RAM).

Referencing our editor's drawings, using this algorithm in the first example would only read ahead 7 blocks into the window (instead of 13 because there are 7 prior to the current offset) and 3 (instead of 2) in the second example.

In the example file usage pattern like RRD, the seek to the middle of the file would only read ahead 1 block. But even if given a 'dd' pattern from the middle of a file, the read-ahead window would at least grow exponentially (1, 2, 4, 8, ...) until the full window size was saturated, somewhat similar to TCP's slow-start.

This presumes, of course, that it is cheap to determine contiguous blocks of memory in RAM prior to the current offset. But I wonder if something like that would be a simple (and workable) compromise. The maximum read-ahead size could then grow to be very large, helping the SSD scenario.

Pathological cases

Posted Feb 5, 2010 10:41 UTC (Fri) by alex (subscriber, #1355) [Link] (2 responses)

One pathological case I cam across recently is the handling of RRD files.
Updates to RRD files tend to be to the header and one other block in the
file. Most of the time any readahead is a complete waste of time from an
efficient I/O point of view.

Do the readahead heuristics handle the case where the readahead is never
actually used by the process? Just turning down the global readahead knob is
a rather clunky solution as I'm sure it's useful for a lot of other cases
(spooling programs up for example).

Pathological cases

Posted Feb 5, 2010 11:03 UTC (Fri) by farnz (subscriber, #17727) [Link]

That sort of situation is what posix_fadvise is meant for. If you know in advance that readahead will be wasted, posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM); tells the kernel not to bother.

In general, if you know that you have an abnormal access pattern ahead of time, the right call to posix_fadvise won't hurt you, and can help.

Pathological cases

Posted Feb 5, 2010 14:10 UTC (Fri) by corbet (editor, #1) [Link]

Yes, the readahead code tracks hits and stops trying when the program skips around a lot - that happens now. One other little heuristic in this patch set that I didn't work into the article is a tweak which suppresses readahead when the application seeks to the beginning of the file. That should nicely take care of the "update the header" case described here.

Improving readahead

Posted Feb 5, 2010 22:26 UTC (Fri) by giraffedata (guest, #1954) [Link] (4 responses)

When readahead works well, it can significantly improve the performance of I/O bound applications by avoiding the need for those applications to wait for data ...

This is poorly worded. It should say the speedup is from allowing the application to overlap computing and I/O. It doesn't eliminate the wait so much as allow the application, without any fancy threading, to do something while it waits.

and by increasing I/O transfer size

And another equally important effect: it allows the system to order the reads more efficiently (which is very important in a filesystem that has seek time).

Improving readahead

Posted Feb 5, 2010 22:58 UTC (Fri) by corbet (editor, #1) [Link] (3 responses)

Oh come on. If the data has been read ahead, the application need not wait for it. If it's read through a larger-than-otherwise I/O operation, then the wait is truly eliminated.

Improving readahead

Posted Feb 6, 2010 3:03 UTC (Sat) by giraffedata (guest, #1954) [Link] (2 responses)

I guess I wasn't clear, or I totally misparsed your sentence:

When readahead works well, it can significantly improve the performance of I/O bound applications by avoiding the need for those applications to wait for data and by increasing I/O transfer size

You say readahead improves application performance by doing A and B. B is increasing transfer size, so I assumed A is something separate. Not having to wait for I/O isn't a reason for higher performance; it's essentially the definition (like saying a car arrives at its destination sooner because it goes faster). So I asked myself what wait-reducing effect you were referring to, and I figured you meant that when an application does processing while the I/O is going on, when it gets around to requesting the I/O, there's no waiting left to do. Well, an clearer way to explain that performance-improving effect is to say the application can overlap processing and I/O.

Note that this effect isn't present if the application doesn't have any processing to do. Readahead won't help that application -- won't reduce the wait for I/O -- except by one of the other effects. That's obvious when you refer to the overlap, but not when you just say it reduces wait time.

So much for the language analysis. The key thing for people to remember about readhead is that it improves application performance by reducing waiting time for I/O in these three ways:

the application can overlap processing and I/O without explicit threading.
the system can do larger I/Os.
the system can order I/Os better.

Improving readahead

Posted Feb 7, 2010 2:19 UTC (Sun) by dmag (guest, #17775) [Link] (1 responses)

>> it can significantly improve the performance of I/O bound applications by avoiding the need for those applications to wait for data

> Note that this effect isn't present if the application doesn't have any processing to do.

Nope. Two reasons:

1) Most people want to write simple single-threaded programs that do synchronous I/O. That means they only issue one request at a time. If their app is I/O bound, then there will always be some latency between the end of one request and the beginning of the next. The disk will be 100% idle during that latency (barring other access). Read-ahead means the disk is busy instead.

2) When *other* applications have the CPU, the disk could go idle too. Readahead could be still issuing requests on behalf of the app during that time.

In theory, a sophisticated (async reads) app might help replicate #1, and a big enough I/O request buffer might help #2. But let's face it, sync code is much easier to write and debug.

Improving readahead

Posted Feb 7, 2010 3:36 UTC (Sun) by giraffedata (guest, #1954) [Link]

Note that this effect isn't present if the application doesn't have any processing to do.
Nope. Two reasons:
...there will always be some latency between the end of one request and the beginning of the next

OK, I did oversimplify when I referred to an application with no processing to do, as that's impossible. To be more precise, I could say to the extent that an application has no processing to do, this overlap effect of readahead doesn't produce any application speedup. There will always be some speedup.

When *other* applications have the CPU, the disk could go idle too. Readahead could be still issuing requests on behalf of the app during that time

I guess you're talking about time the application is waiting for the CPU, which the readahead overlaps as well. Yes, I missed that. But in an I/O-bound application, this should be negligible, shouldn't it? With decent CPU allocation/scheduling, the process ought to get scheduled really soon after its read completes and keep the CPU until it starts the next one.