
What happened to disk performance in 2.6.39

By Jonathan Corbet
January 31, 2012
Herbert Poetzl recently reported an interesting performance problem. His SSD-equipped laptop could read data at about 250MB/s with the 2.6.38 kernel, but performance dropped to 25-50MB/s on anything more recent. An order-of-magnitude performance drop is just not the sort of benefit that most people look forward to when upgrading their kernel, so this report quickly gained the attention of a number of developers. The resolution of the problem turned out to be simple, but it offers an interesting view of how high-performance disk I/O works in the kernel.

An explanation of the problem requires just a bit of background, and, in particular, the definition of a couple of terms. "Readahead" is the process of speculatively reading file data into memory with the idea that an application is likely to want it soon. Reasonable performance when reading a file sequentially depends on proper readahead; that is the only way to ensure that reading and consuming the data can be done in parallel. Without readahead, applications will spend more time than necessary waiting for data to be read from disk.
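
For a rough illustration from the other side of the system-call boundary (this is not the kernel's internal machinery), an application can ask for the same kind of prefetching explicitly with the readahead() system call, which starts filling the page cache and returns without waiting for the I/O to complete; the kernel's readahead logic does the equivalent automatically, and adaptively, for sequential readers:

    /* Userspace illustration of the prefetching idea; the kernel's own
     * readahead machinery does this automatically for sequential readers. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    static void prefetch_start(const char *path)
    {
            int fd = open(path, O_RDONLY);
            if (fd < 0)
                    return;

            /* Ask the kernel to begin reading the first 256KB into the page
             * cache; the call returns without waiting for the I/O to finish. */
            readahead(fd, 0, 256 * 1024);

            /* Subsequent reads of that range can then be satisfied from
             * memory rather than waiting on the device. */
            close(fd);
    }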

"Plugging," instead, is the process of stopping I/O request submissions to the low-level device for a period of time. The motivation for plugging is to allow a number of I/O requests to accumulate; that lets the I/O scheduler sort them, merge adjacent requests, and apply any sort of fairness policy that might be in effect. Without plugging, I/O requests would tend to be smaller and more scattered across the device, reducing performance even on solid-state disks.

Now imagine that we have a process about to start reading through a long file, as indicated by your editor's unartistic rendering here:

[Bad art]

Once the application starts reading from the beginning of the file, the kernel will set about filling the first readahead window (which is 128KB with larger files) and submit I/O for the second window, so the situation will look something like this:

[Reading begins]

Once the application reads past 128KB into the file, the data it needs will hopefully be in memory. The readahead machinery starts up again, initiating I/O for the window starting at 256KB; that yields a situation that looks something like this:

[Next window]

This process continues indefinitely, with the kernel always working to stay ahead of the application so that the data is in memory by the time the application gets around to reading it.

The 2.6.39 kernel saw some significant changes to how plugging is handled, with the result that the plugging and unplugging of queues is now explicitly managed in the I/O submission code. So, starting with 2.6.39, the readahead code will plug the request queue before it submits a batch of read operations, then unplug the queue at the end. The function that handles basic buffered file I/O (generic_file_aio_read()) also now does its own plugging. And that is where the problems begin.
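
Concretely, that explicit management is done with the on-stack plugging API added in 2.6.39, blk_start_plug() and blk_finish_plug(); a minimal sketch of a submitter batching read requests under it (not an excerpt from the actual kernel source) might look like:

    /* Minimal sketch of 2.6.39-era on-stack plugging; not actual kernel
     * code.  Requests submitted between blk_start_plug() and
     * blk_finish_plug() accumulate in a per-task list so that the I/O
     * scheduler can sort and merge them before they reach the device. */
    #include <linux/bio.h>
    #include <linux/blkdev.h>
    #include <linux/fs.h>

    static void submit_read_batch(struct bio **bios, int nr)
    {
            struct blk_plug plug;
            int i;

            blk_start_plug(&plug);                  /* hold requests back */

            for (i = 0; i < nr; i++)
                    submit_bio(READ, bios[i]);      /* queued, not yet dispatched */

            blk_finish_plug(&plug);                 /* unplug: flush the batch to the device */
    }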

Imagine a process that is doing large (1MB) reads. As the first large read gets into generic_file_aio_read(), that function will plug the request queue and start working through the file pages already in memory. When it gets to the end of the first readahead window (at 128KB), the readahead code will be invoked as described above. But there's a problem: the queue is still plugged by generic_file_aio_read(), which is still working on that 1MB read request, so the I/O operations submitted by the readahead code are not passed on to the hardware; they just sit in the queue.
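
In rough pseudocode (invented names like for_each_page_in_request are for illustration only; the real code lives in mm/filemap.c and is considerably more involved), the problematic sequence looked something like this:

    /* Heavily simplified pseudocode of the problematic 2.6.39 buffered-read
     * path; not the actual kernel source. */
    ssize_t generic_file_aio_read(/* ... */)
    {
            struct blk_plug plug;

            blk_start_plug(&plug);          /* plugged for the whole 1MB read */

            for_each_page_in_request {
                    /* Crossing a readahead boundary submits I/O for the next
                     * window, but those requests just pile up behind the plug. */
                    if (crossing_readahead_boundary)
                            page_cache_async_readahead(/* ... */);

                    if (!PageUptodate(page))
                            /* The reader blocks here; only then does the block
                             * layer flush the plug and start the readahead I/O. */
                            wait_for_page_to_be_read(page);

                    /* copy the now-uptodate page to the user's buffer */
            }

            blk_finish_plug(&plug);         /* the explicit unplug comes too late */
    }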

So, when the application gets to the end of the second readahead window, we see a situation like this:

[Bummer]

At this point, the application needs data that has not yet been read, so everything comes to a stop. Blocking will finally cause the queue to be unplugged, allowing the readahead I/O requests to be executed at last, but it is too late: the application has to wait for I/O that could have been started much earlier. That wait is enough to hammer performance, even on solid-state devices.

The fix is to simply remove the top-level plugging in generic_file_aio_read() so that readahead-originated requests can get through to the hardware. Developers who have been able to reproduce the slowdown report that this patch makes the problem go away, so this issue can be considered solved. Look for this fix to appear in a stable kernel release sometime soon.
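
In spirit, the change is just the removal of that outer plug; as an illustration (this is not the exact upstream patch), it amounts to something like:

    ssize_t generic_file_aio_read(/* ... */)
    {
    -       struct blk_plug plug;
    -
    -       blk_start_plug(&plug);
            /* ... per-page read loop, unchanged; readahead-submitted I/O now
             * reaches the device as soon as it is submitted ... */
    -       blk_finish_plug(&plug);
            return retval;
    }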



What happened to disk performance in 2.6.39

Posted Feb 2, 2012 9:15 UTC (Thu) by alankila (subscriber, #47141) [Link]

Plugging sounds like another voodoo feature of dubious utility.

Why can't plugging occur naturally, in the sense that if the device is already busy with I/O, the additional I/O sits in the queue and gets adjacent reads merged and other such transformations applied? But if there's no utilization of the device, it would seem that it would probably be better just to submit the I/O right away.

What happened to disk performance in 2.6.39

Posted Feb 2, 2012 20:45 UTC (Thu) by dlang (✭ supporter ✭, #313) [Link]

the problem is the queue size

if the queue size is not large enough, then you can't fit enough requests into the queue to have them available to combine later.

If the queue size is too large, then a new process making a request will not get its request serviced until everything ahead of it in the queue gets processed (unless you have some fairness mechanism to avoid putting the new process's request at the end of the queue).

I don't like the concept of plugging, but it seems to be a hack that tends to work.

as an example.

In rsyslog, when the ability to process multiple messages from the queue at once was added (so that multiple messages could be inserted into a database in a single transaction, for example), we discussed delaying the pulling of the first message from the queue to give the queue a chance to build up several messages that could then be handled more efficiently (in one pass), but we decided not to do this because the process ended up being self-regulating.

If the messages arrive slowly enough, they are handled one at a time.

If the messages arrive faster than that, some messages queue up while the prior messages are handled, and then the backlog gets processed all at once (up to a limit).

This is very good for latency, but the trade-off is that the output is doing far more work than it would need to do if the work was batched more. As the load builds up, the output ramps up its utilisation in the most inefficient mode (one message at a time); then, once the output is saturated, processing becomes more efficient, handling more messages per pass while keeping the output at maximum utilisation.

Networks have the same type of problem (the too-large-buffer situation is what's called bufferbloat); the answer there seems to be to put in a more complex queuing engine (SFQ seems to be the winner right now) that prioritizes packets from new or sparse connections ahead of heavy connections.

I wonder if a similar approach could work for disk I/O? If this would allow for significantly larger queue sizes without the latency problems that usually come with large queues, it may give almost the same long-term effect as plugging, without the problems that plugging introduces.

What happened to disk performance in 2.6.39

Posted Feb 3, 2012 4:54 UTC (Fri) by raven667 (subscriber, #5198) [Link]

> Networks have the same type of problem

It would be interesting to see more sharing of notes between the network and disk I/O systems, because some of the problems they solve are broadly similar. I/O throughput and contention behaviors are a matter of science, and I'm sure they share a lot of math.

What happened to disk performance in 2.6.39

Posted Feb 20, 2012 23:35 UTC (Mon) by jmm82 (guest, #59425) [Link]

Networking does have this same concept built into TCP, called Nagle's algorithm.

What happened to disk performance in 2.6.39

Posted Feb 3, 2012 13:05 UTC (Fri) by alankila (subscriber, #47141) [Link]

Plugging or not, I'm pretty sure there are still queues involved just the same. Reading the other links in this article, it seems that plugging goes away as soon as the system determines that it has any work in its internal queues to do; therefore it's strictly a "first request" optimization. In any case, it doesn't seem to improve throughput (because it gets disabled) and it worsens latency (because it delays first-request service time), so it sounds useless to me in every case.

Disk schedulers already use their own variant of fair queueing; afaik CFQ gives each process its chance to do some disk transactions when its turn comes around. In this it is fairly similar to SFQ, which arranges outbound network traffic into a number of pre-existing queues and submits the head element of each queue (if any) in turn, giving all flows a fairly equal chance to progress.

What happened to disk performance in 2.6.39

Posted Feb 4, 2012 21:04 UTC (Sat) by giraffedata (subscriber, #1954) [Link]

I've always been a fierce opponent of queue plugging. I'm not saying there's no case where it's good, but everywhere I see it, it's based on the misconception that capacity matters when you're not using it all. I'm talking about the principle that a 10,000 liter tank is no better than a 5,000 liter tank for an application that never stores more than 2,000 liters.

Sending small scattered I/Os to a disk drive is not a problem as long as the drive is keeping up with it, and if the drive isn't, then your queue is building up anyway, without a plug.

I've seen plugging used to overcome a defect in the thing serving the queue wherein it improperly speed-matches. I think this is what's going on with the network "buffer bloat" issue. I saw it more simply in a disk storage server that thought it was doing its client a favor by accepting I/Os as fast as the client could send them and sticking them in a buffer, then passing them one by one, FIFO, to the disk arms. The server was essentially lying and saying it had capacity when it was really overloaded.

This was fixed with queue plugging in the client, but later fixed better just by making the client send ahead enough work to overwhelm the server's buffer and make it admit that it couldn't keep up.

dlang, in your defense of an application of queue plugging:

> the trade-off is that the output is doing far more work than it would need to do if the work was batched more

you omit an important factor: what is wrong with the output doing more work than it otherwise would? In many cases, that doesn't make any difference.

What happened to disk performance in 2.6.39

Posted Feb 6, 2012 2:48 UTC (Mon) by dlang (✭ supporter ✭, #313) [Link]

I agree that most of the time it really doesn't matter that the resource is working a little harder than it would need to be. But that is the only justification I can see for plugging.

the resource being busier can make it take more power.

the resource being busier could cause added latency for a new request.

there are probably other ways that the resource being busier can cost, even if it's not completely maxed out.

but overall I agree that these are probably not significant in almost all conditions.

I think the biggest problem is that large queues have not been handled sanely in the past, which has made "large queue" == "high latency" in many people's minds

what's needed is a large queue to gather possible work, but then smart management of that queue.

In the case of disk I/O, that smart management has to do with combining work that's in the queue but not adjacent, prioritizing reads over writes (except for writes with sync dependencies), elevator reordering, etc.

If you have a raid array it can mean trying to schedule work so that different spindles can be working at the same time.

if you have a SSD or raid array, it can mean trying to do things in larger blocks (stripe size and alignment, eraseblock size and alignment)

In the case of network buffers, it has to do with prioritizing interactive, traffic-management, and blocking packets ahead of bulk transfers, and dropping packets that you aren't going to be able to get through before they become worthless (which is not just bulk-transfer packets that will arrive after a retry has already been sent, but also VoIP packets that have been delayed too much).

As processors get faster compared to the I/O, it becomes possible to spend more effort in smart queue management while still keeping up with the I/O channel.

What happened to disk performance in 2.6.39

Posted Feb 2, 2012 10:08 UTC (Thu) by jezuch (subscriber, #52988) [Link]

I've just run a test and my SSD can read at full speed. I'm running 3.2.2 so I guess the fix is already included in at least one stable release?

Also, it is interesting that it took so long to discover this problem. Does that mean that people don't really care about speed of sequential file access? :) [Well, I know that I do care more about random access performance, that's why I bought the SSD in the first place. And I also know that sequential read performance used to be a major selling point for manufacturers of drives with horrendous random access performance.]

What happened to disk performance in 2.6.39

Posted Feb 2, 2012 14:13 UTC (Thu) by corbet (editor, #1) [Link]

Did you test with large reads? That is the failure case. If you do smaller operations, the plug gets pulled between them.

What happened to disk performance in 2.6.39

Posted Feb 2, 2012 12:57 UTC (Thu) by petkan (subscriber, #54713) [Link]

One wonders what exactly the patch author(s) had been testing to miss an order-of-magnitude slowdown. It looks like the issue either shows up rarely, or we've been neglecting block I/O performance for about a year.

The fix is not present in 3.2.2, and I am not finding it in Greg's 3.2.3-pre patches.

What happened to disk performance in 2.6.39

Posted Feb 4, 2012 20:49 UTC (Sat) by giraffedata (subscriber, #1954) [Link]

Actually, I wonder what the patch author intended to accomplish (and what his testing presumably demonstrated he did accomplish). The article mentions larger, less-scattered I/Os, but in a situation where the queue tends to be empty, I can't see where that would accomplish anything. When I look at this change, I just see a frontal assault on doing stuff in parallel with the I/O turnaround time.

What happened to disk performance in 2.6.39

Posted Feb 3, 2012 18:15 UTC (Fri) by lonely_bear (subscriber, #2726) [Link]

I am running 3.2.2, and just happened to play an audio CD; it chokes frequently after a while. After switching to 2.6.38.4 (my stock Slackware kernel), it plays smoothly. Will try the fixed kernel and see what happens.

Regressions galore?

Posted Feb 9, 2012 22:58 UTC (Thu) by blujay (guest, #39961) [Link]

Is it just me or is Linux suffering more needless regressions than ever before? I'm getting the impression that some devs are writing code that's too smart for its own good. I honestly think that 10 years ago my Debian system running the then-current kernel had better interactive performance, especially while swapping, than 2.6 or 3.0 have today. Yeah, I know a bunch of interactivity-related patches have been added during this time--but in the end, I remember using systems with less than half as much memory as I have now and swapping taking less time and apps blocking less. I don't remember my cursor movement lagging back then--now it happens whenever swapping happens. Is the kernel's desktop suitability on the decline? :(

Regressions galore?

Posted Feb 10, 2012 12:59 UTC (Fri) by jospoortvliet (subscriber, #33164) [Link]

Kernel 3.2 is certainly noticeably better than 3.1 and 3.0 with regard to interactivity, especially in low-memory/swap situations and under high I/O. But you're right that older Linux versions had better behavior in many situations - and worse in others, however. Like how often it happened that one high-CPU task would hog your system and make your mouse cursor or even music skip. That rarely happens these days, with the exception of cases with heavy swapping.

Regressions galore?

Posted Apr 5, 2012 15:55 UTC (Thu) by Andrew_Cady (guest, #83993) [Link]

Maybe your memory usage has grown faster than the speed of your swap disk.

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds