
The CFQ "low latency" mode

One of the changes slipped into the 2.6.32-rc3 release was the addition of a "low latency" mode for the CFQ I/O scheduler. Normally the scheduler will try to delay many new I/O requests for a short time in the hope that they can be joined with other requests which may come shortly thereafter. This behavior will minimize disk seeks and maximize I/O request size, so it is clearly good for throughput. But the addition of delays can be a problem if the overriding goal is to complete the operation as quickly as possible.

The new mode (initially called "desktop" before being renamed "low_latency") is enabled by default; it can be adjusted by setting the iosched/low_latency attribute associated with each block device in sysfs. When set, some of the delays for "synchronous operations" (reads, generally) no longer happen. The result should be more responsive I/O and, one would hope, happier users.
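For example (assuming a drive that shows up as sda; the device name is purely illustrative), the knob can be inspected and changed from a shell:

```shell
# Hypothetical device name; substitute your own (e.g. sdb, vda)
DEV=sda
KNOB=/sys/block/$DEV/queue/iosched/low_latency

# Show the current setting (1 = on, the default) if the device uses CFQ
if [ -f "$KNOB" ]; then
    cat "$KNOB"
fi

# To disable it (requires root), e.g. on a throughput-oriented server:
# echo 0 > "$KNOB"
```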

Note: please see the comments for a description of this change which is more, um, accurate. Your editor blames the Death Flu that his kids brought home.



What the 'low_latency' knob does

Posted Oct 8, 2009 7:11 UTC (Thu) by axboe (subscriber, #904) [Link] (1 responses)

The description of the 'low_latency' mode isn't very accurate unfortunately, perhaps Jon didn't have his coffee before reading over it :-). Let me attempt to rectify that.

The low_latency knob doesn't impact delays or merging; one of the key aspects of getting low latency for a series of operations (like starting your firefox while other IO is happening) is actually making sure we get the delays right. If we take the classic case of reader vs. writer, the writer dirties pages far faster than writeback can clean them, so we always have tons of dirty pages waiting to be written. The typical reader, however, issues dependent reads that are serialized by each other: when one read finishes, another will be issued by the reader very shortly. Achieving good throughput and latency for the reader in CFQ is accomplished by briefly waiting for another IO when one has completed. In CFQ, this is called idling.

The two primary changes in behaviour for CFQ in -rc3 are letting seeky IO also idle, even if the hardware does command queuing (which most hardware does these days), and limiting the damage that async IO can do while sync IO is also happening. With the 'low_latency' knob switched on, CFQ will only slowly build up a queue depth of async writes. This greatly helps reduce the impact that a writer has on the system's interactiveness, since when we only slightly miss a sync idle window, the amount of async writeback sent to the device will be limited by the time since that last sync IO.
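As a rough conceptual model of the ramp-up idea (this is not the actual kernel code; the depth limit, the ramp window, and the function below are all invented for illustration), the permitted async queue depth might scale with the time since the last sync IO:

```python
# Conceptual model only -- not the actual CFQ implementation. The idea:
# the longer it has been since the last synchronous request completed,
# the more asynchronous writeback we allow in flight at once.
MAX_ASYNC_DEPTH = 32   # assumed device queue depth
RAMP_MS = 200          # assumed ramp-up window after the last sync IO

def allowed_async_depth(ms_since_last_sync: float) -> int:
    """Scale the permitted async queue depth with the time since the
    last sync IO, so a barely-missed idle window costs little."""
    frac = min(ms_since_last_sync / RAMP_MS, 1.0)
    return max(1, int(frac * MAX_ASYNC_DEPTH))

print(allowed_async_depth(10))   # just after a sync IO: depth stays tiny
print(allowed_async_depth(500))  # sync side long idle: full depth allowed
```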

The end result is that the desktop experience should be less impacted by background IO activity. It's also worth mentioning that the 'low_latency' setting defaults to on.

What the 'low_latency' knob does

Posted Oct 8, 2009 12:44 UTC (Thu) by Yenya (subscriber, #52846) [Link]

Do you think the low-latency knob could help also for situations like ordinary filesystem traffic versus background md-RAID resync?

I had a pretty bad experience with CFQ on my FTP server (ftp.linux.cz, SW RAID-5 over 8x 1TB SATA drives): the resync of the array with CFQ took about 3 days, with overall system responsiveness being pretty bad. With the deadline iosched (which is what I am running now) it takes less than a day, and even then the system latency for things like typing commands in an ssh session is good (read: no noticeable change versus the fully reconstructed array).

-Yenya

The CFQ "low latency" mode

Posted Oct 8, 2009 23:20 UTC (Thu) by giraffedata (guest, #1954) [Link] (7 responses)

> Normally the scheduler will try to delay many new I/O requests for a short time in the hope that they can be joined with other requests which may come shortly thereafter. This behavior will minimize disk seeks and maximize I/O request size, so it is clearly good for throughput.

No, that's not clear at all. Minimizing disk seeks and maximizing I/O request size is clearly good for disk efficiency -- minimizing disk utilization for a given workload -- but for throughput to be meaningful, utilization has to be about 100%. When that's the case, I/O backs up into the Linux I/O queue so that no extra delays are necessary in order to join requests with other requests.

You just can't improve throughput by deliberately letting the disk sit idle.

The CFQ "low latency" mode

Posted Oct 9, 2009 7:12 UTC (Fri) by Yenya (subscriber, #52846) [Link] (6 responses)

> You just can't improve throughput by deliberately letting the disk sit idle.

In fact, you can. Think of avoiding some seeks by issuing sequential operations with shorter-than-seek-time delays in between.

The CFQ "low latency" mode

Posted Oct 9, 2009 15:48 UTC (Fri) by giraffedata (guest, #1954) [Link] (5 responses)

> > You just can't improve throughput by deliberately letting the disk sit idle.
> In fact, you can. Think of avoiding some seeks by issuing sequential operations with shorter-than-seek-time delays in between.

You'll have to be more specific.

It sounds like you're talking about a strategy for improving response time for a bursty workload, whereas throughput is meaningful only for a non-bursty unlimited supply of work.

The CFQ "low latency" mode

Posted Oct 12, 2009 7:56 UTC (Mon) by Yenya (subscriber, #52846) [Link] (4 responses)

No, you can also increase the _throughput_ (= the number of sectors handled in a given, large period of time) by adding pauses shorter than the seek time. To be more specific, let's have two readers, A and B, each reading from its own part of the disk, [A] and [B] respectively. For the sake of simplicity, let's assume that two subsequent operations within the area [A] or within the area [B] do not require a seek and are fast, while a read from the area [A] followed by a read from the area [B] requires a seek, which is much slower. Then it is definitely better, from the throughput point of view, to issue the operations in the following order:

[A]-pause-[A]-seek-[B]-pause-[B]-seek-[A]-pause-[A]-...

than the "no-pause" variant of

[A]-seek-[B]-seek-[A]-seek-[B]-seek-...

It is not a bursty workload or a response-time-critical workload. It is an "unlimited supply of work" batch workload by my definition. And it has higher throughput with the pauses added than without them.
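The argument can be checked with back-of-the-envelope numbers; the timings below (0.5 ms per sequential read, 8 ms per seek, 2 ms idle pause) are assumed purely for illustration:

```python
SEEK_MS = 8.0     # cost of moving between areas [A] and [B] (assumed)
READ_MS = 0.5     # cost of a sequential read within one area (assumed)
PAUSE_MS = 2.0    # idle wait for the reader's next dependent read (assumed)

def throughput(reads_per_visit):
    """Reads completed per second when the scheduler serves
    `reads_per_visit` reads from one area before seeking to the other."""
    # Each visit costs one seek, then the reads separated by idle pauses.
    visit_ms = (SEEK_MS + reads_per_visit * READ_MS
                + (reads_per_visit - 1) * PAUSE_MS)
    return reads_per_visit / visit_ms * 1000.0

no_pause = throughput(1)    # [A]-seek-[B]-seek-[A]-...: one read per seek
with_pause = throughput(4)  # [A]-pause-[A]-...-seek-[B]: 4 reads per seek

print(f"no pauses:   {no_pause:.0f} reads/s")    # ~118 reads/s
print(f"with pauses: {with_pause:.0f} reads/s")  # 250 reads/s
assert with_pause > no_pause  # idling wins when PAUSE_MS << SEEK_MS
```

With these figures the paused schedule completes roughly twice as many reads per second, because each 2 ms pause replaces an 8 ms seek.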

The CFQ "low latency" mode

Posted Oct 13, 2009 0:22 UTC (Tue) by giraffedata (guest, #1954) [Link] (1 responses)

OK, I'll buy that. Letting the disk sit idle can improve the throughput capacity for a limited workload like that (limited not because there are times when there is no work available, but because there are only two streams and each apparently doesn't want to have more than one I/O outstanding at a time).

What I was thinking is that when people ask about disk throughput (capacity), it's usually on a system that drives the disk a lot harder than that -- i.e. the disk's basic capacity is in question. That means requesters throw large amounts of I/O at the disk and the speed is then determined by how quickly the disk can move the I/Os through. In the A-B scenario you describe, I would ask about the disk's response time, not its throughput, because it's the waiting for a response that governs the speed of this system.

The CFQ "low latency" mode

Posted Oct 13, 2009 18:06 UTC (Tue) by dlang (guest, #313) [Link]

It's not necessarily as different as you are making it out to be.

Remember that seeks are _expensive_: in the time taken by one avoided seek, you can transfer a LOT of data.

So throughput optimizations like this can be relevant to the disk's total response capabilities.

The CFQ "low latency" mode

Posted Oct 15, 2009 14:50 UTC (Thu) by guest (guest, #2027) [Link] (1 responses)

That only works if both readers issue requests without waiting for results. That's not how programs usually work: if they issue a read request, they wait for it to complete before sending the next read.

Anyway, if you have such workloads and you do *not* pause, what happens? You perform the first seek to A, read A, seek to B, read B, and in the meantime more requests for A have arrived. If it's only one, you still seem to be fast enough despite seeking - just seek back to A and go on. If the seek takes too long, multiple requests should have been queued already, and you can coalesce them and handle them with one seek.

IMHO, letting a disk stay idle when there's work to do is wrong!

The CFQ "low latency" mode

Posted Oct 15, 2009 17:57 UTC (Thu) by efexis (guest, #26355) [Link]

"letting a disk stay idle when there's work to do is wrong!"

Except that the disk is as good as idle while it's seeking... you can't read or write while it's happening.

I already know this to be true; I come across it on a server I partly manage, which tries to schedule backups for several sites all at once. The disk thrashes, the system grinds to a halt, and it takes forever to finish.

So, I wrote a small bash script that, when the load gets high, sends a STOP signal to all the backup processes and then sends just one of them a CONT signal, so only that one process is running. Every few seconds it will STOP that task and CONT a different one. The backups complete in a much shorter time, and system responsiveness is much better while it's happening. Why? Because the heads don't move away from the current reader's position as often, even though each process follows that same issue-read, wait, process, issue-read, wait, process read pattern. So, with the amount of time the drive spends seeking reduced, the SAME drive is able to complete the SAME amount of work in LESS time, with LESS effect on the rest of the system.

This is just a fact, it's real, it works, as much as it may sound counter-intuitive to you, the numbers don't lie.
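A minimal sketch of such a rotation script (the process name 'backup-job' and the 5-second slice are invented for illustration; the original script also used a load-average trigger, omitted here):

```shell
#!/bin/sh
# Round-robin the backup jobs with SIGSTOP/SIGCONT so that only one
# performs I/O at a time, reducing head seeking between their files.
PIDS=$(pgrep -f backup-job || true)

# Pause every backup process first
for pid in $PIDS; do kill -STOP "$pid"; done

while [ -n "$PIDS" ]; do
    for pid in $PIDS; do
        kill -CONT "$pid" 2>/dev/null   # let exactly one job run...
        sleep 5
        kill -STOP "$pid" 2>/dev/null   # ...then pause it again
    done
    # Keep only the jobs that are still alive
    PIDS=$(for p in $PIDS; do
               if kill -0 "$p" 2>/dev/null; then echo "$p"; fi
           done)
done
```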


Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds