Explicit block device plugging

Posted Apr 17, 2011 7:05 UTC (Sun) by walex (subscriber, #69836)
Parent article: Explicit block device plugging

This comment about plugging is amazingly misguided because it attacks the one good point about plugging: that it does improve device utilization.

The reason is that "giraffedata" seems to be entirely unaware that there are devices called "disks" with extremely high and variable latencies; given these, bunching IO requests may allow the elevator to minimize latencies enough that the pauses imposed by plugging are worthwhile.

There are two very big problems with plugging, or rather with its current implementation, which make it a tremendously stupid thing as a result:

* Putting it in the block layer is extremely bad, because there are storage devices that don't have high and variable latencies and for which plugging is counterproductive. If plugging makes any sense it should be in the device drivers.

* Plugging quantizes the flow of IO requests, making them essentially synchronous with the periodic expiry of the plugging timer. The resulting bunching, which is indeed the intended scheduling effect, can have bad consequences for page cache usage and limits the bandwidth usable by the device in common cases (long, nearly contiguous writes).

Plugging was introduced IIRC as a way to cheat on some common benchmark.

Plugging is a gross mistake that should be entirely removed from the page cache, and perhaps turned into some kind of scheduling library available to device drivers for use when the relevant device latency profiles might conceivably benefit (almost never actually).

Explicit block device plugging

Posted Apr 17, 2011 9:44 UTC (Sun) by neilbrown (subscriber, #359)

Your comments make it sound like you don't understand how Linux plugging works - though maybe I misunderstand you...

There is no unplugging timer. There was a timer in the previous code that would unplug after 3ms, but that was mainly a 'just in case' measure. Normally the unplug would happen much earlier. If it was only the timer that triggered an unplug then you are right, performance would be terrible.

Also plugging was only relevant when the device was idle. If the queue was not empty it would not get plugged. So with long, nearly continuous writes, the queue would only be plugged once at the very beginning.

The purpose of plugging is primarily about latency, not bandwidth. If bandwidth is an issue, your queue will not be empty, so the time it takes for a request to get to the front of the queue is long enough for any other related requests to be seen and merged so there is no point in plugging (at least the old style - the new style still brings benefits).

The new plugging code is a bit different. Rather than only plugging idle devices it doesn't really plug devices at all. It plugs request submitters instead. (So the new code is a lot closer to the page cache than the old code).

So when a process submits a request, it gets queued in the process, not in the device. Once a process has submitted all that it wants to submit (or whenever it schedules - to avoid deadlocks), the requests queued in the process are released.

So you get a similar effect on the starting transient - a large number of pre-sorted requests gets handled all at once. However there are other advantages as well in terms of lock contention.
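
(For readers who want to see the shape of the new interface, below is a minimal sketch of a submitter using the on-stack plug API, blk_start_plug()/blk_finish_plug() from <linux/blkdev.h>. The batch helper and the pre-built bios are invented for illustration, and submit_bio() is shown with the two-argument form used by kernels of this era.)

    #include <linux/blkdev.h>
    #include <linux/bio.h>
    #include <linux/fs.h>

    /* Illustrative helper, not from any real driver: submit a batch of
     * already-prepared bios under one on-stack plug.  While the plug is
     * held, the requests collect on the submitting task; they are sorted,
     * merged and handed to the device when the plug is released (or
     * earlier, if the task has to sleep - which avoids deadlocks). */
    static void submit_write_batch(struct bio **bios, int nr)
    {
        struct blk_plug plug;
        int i;

        blk_start_plug(&plug);          /* requests now queue on this task */
        for (i = 0; i < nr; i++)
            submit_bio(WRITE, bios[i]);
        blk_finish_plug(&plug);         /* push the whole batch to the driver */
    }

Because the whole batch goes down at once, the device's queue lock is taken once per batch rather than once per request, which is the lock-contention advantage mentioned above.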

I think it is very hard to imagine plugging slowing down even a very fast, low-latency device. The page cache has a bunch of pages that it wants to perform IO on and it assembles them into a big chunk and sends them all to the device. It is true that the first request won't get there quite as soon, but unless the device is as fast as memory, it will still get there fast enough that you probably cannot measure the difference.

And if a device is as fast as memory, then it probably shouldn't be under the request_queue code at all - a driver more like the umem.c driver might be appropriate. It just takes bios directly and turns them into DMA descriptors. But even that uses plugging so it can start a chain of DMAs at once rather than just one.

I've probably rambled a bit there, but I really think you aren't being fair to plugging. It may not be perfect but I would need a lot more evidence before I could see any justification for it being a gross mistake.

Explicit block device plugging

Posted Apr 27, 2011 5:26 UTC (Wed) by dlang (subscriber, #313)

I'm also puzzled about this issue.

we had a similar discussion in rsyslog when we introduced the capability to write log entries to databases in batches rather than individually. my initial proposal was to try and queue a set number of entries to write at once (with a timer to make sure they get written _reasonably_ soon in any case), but it was pointed out that if you just write what's ready, and let everything else queue up in the meantime, the size of the writes auto-tunes itself.

I.e. by always writing whatever's pending in the queue (up to a max) whenever you are able to write, you achieve both low latency for the initial writes (and low load) and high efficiency under heavy load (because the queue backs up while you are doing the 'inefficient' small writes). This auto-tunes for the lowest available latency.
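
(A toy user-space sketch of that auto-tuning pattern - the names, timings and fixed message count are invented, and this is not rsyslog's actual code: the writer always drains whatever has queued up since its last write, so batches stay small when things are quiet and grow by themselves under load.)

    /* Toy sketch: one producer queues "log messages"; the writer drains
     * whatever is pending each time it is free, so the batch size tunes
     * itself to the load.  Build with: cc -pthread batch.c */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define NMSGS 100              /* total messages for the demo run */
    #define QMAX  1024             /* upper bound on a single batch */

    static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
    static int pending;            /* messages queued but not yet written */

    static void *producer(void *arg)
    {
        for (int i = 0; i < NMSGS; i++) {
            pthread_mutex_lock(&lock);
            pending++;                     /* a log message "arrives" */
            pthread_cond_signal(&ready);
            pthread_mutex_unlock(&lock);
            usleep(1000);                  /* roughly one message per ms */
        }
        return NULL;
    }

    static void *writer(void *arg)
    {
        int written = 0;

        while (written < NMSGS) {
            int batch;

            pthread_mutex_lock(&lock);
            while (pending == 0)
                pthread_cond_wait(&ready, &lock);
            batch = pending < QMAX ? pending : QMAX;
            pending -= batch;              /* take everything queued so far */
            pthread_mutex_unlock(&lock);

            printf("writing batch of %d\n", batch);
            usleep(5000);                  /* pretend the write takes 5ms */
            written += batch;
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t p, w;

        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&w, NULL, writer, NULL);
        pthread_join(p, NULL);
        pthread_join(w, NULL);
        return 0;
    }

With these made-up timings the first batch is a single message, and once the writer is busy the later batches settle around five - the auto-tuning behaviour described above.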

I see two costs in doing this.

1. more device actions than an optimally batched mode

2. more CPU cycles used to process the small batches

If the system is idle enough, these don't matter. I could see them becoming an issue if they either cost more power or use a resource that could otherwise be used by another process (CPU cycles or bus bandwidth).

Has anyone tried just doing away with plugging and seeing what the results are? (Especially on anything that measures more than how short the device's utilized time can be.)

Explicit block device plugging

Posted Apr 29, 2011 7:04 UTC (Fri) by andresfreund (subscriber, #69562)

> we had a similar discussion in rsyslog when we introduced the capability to write log entries to databases in batches rather than individually. my initial proposal was to try and queue a set number of entries to write at once (with a timer to make sure they get written _reasonably_ soon in any case), but it was pointed out that if you just write what's ready, and let everything else queue up in the meantime, the size of the writes auto-tunes itself.
If your piece of code does two writes to a normal rotating-media disk without plugging - as far as I understand it - the first write will cause disk activity which will take up to 15ms for an idle disk.
Some microseconds later your code will submit a second page. Unfortunately that will have to wait in some queue until the disk has finished writing the first block.
Which means you will need ~30ms in the worst case.
On the other hand, if you plugged the device before doing those writes, it will sort those writes into disk order and the disk will be able to do them in one rotation. Which means it's ~15ms.

Explicit block device plugging

Posted Apr 29, 2011 7:27 UTC (Fri) by dlang (subscriber, #313)

Almost correct, if you have just a couple of writes.

If the second write arrives 15ms after the first and the device has been plugged, then with plugging it takes 30ms to get both writes on disk (15ms of delay + 15ms of activity). Without plugging each write is on disk 15ms after it was submitted (15ms of activity for the first one, then 15ms of activity for the second one).

If the data is submitted in a shorter period of time, then the two writes may finish faster with plugging, but if they arrive further apart (and the device remains plugged) all you are doing is delaying how long it takes for the first one to get to disk.

So you should never plug for longer than it would take to write the first block to disk, but figuring out how long that will be is hard, so the timer to unplug is set long.

But if you are writing a larger amount, then during the 15ms while the disk is processing the first write, multiple additional requests will queue up and be able to be combined.

And if they don't, then the disk is active for 30ms instead of 15ms, but nothing else had a need for it, so why do you care? If anything else had a need for the disk, it would have generated additional requests, and those would have been combined with the second request - instead of the first and second being combined and the third being processed independently (possibly with plugging of its own).

Yes, some mobile users may care in an attempt to save power, but in that case they really want much higher latency, on the order of seconds or tens of seconds, to avoid spinning up the drive, so that's not the relevant use case.

If this were being done under application control (because, after all, the application is the only thing that knows what's going to happen in the future) I could see it. But you are trying to have the kernel guess whether there is going to be more activity in the near future.

Case 1: if there is no activity in the near future, the plug just delayed the write.

If there is only a tiny bit of activity in the near future, each request can be treated independently, as if there were no activity in the future.

Case 2: if there is a lot of activity, the second request will get delayed by the time it takes to process the first request instead of by the time the plug is in place.

Is the prediction of future activity really good enough that having the kernel correctly guess that case 2 will happen is worth the complexity, the time spent managing the plugs, and the increased latency for the first activity?

Explicit block device plugging

Posted Apr 29, 2011 7:40 UTC (Fri) by dlang (subscriber, #313)

Or possibly a better way of putting it:

Assume that each disk action takes 15ms and data arrives every 10ms.

With plugging (unplug when a second item arrives, or after at most 15ms) you write:

2 blocks starting at 10ms finishing at 25ms
2 blocks starting at 30ms finishing at 45ms
2 blocks starting at 50ms finishing at 65ms
2 blocks starting at 70ms finishing at 85ms

Without plugging you write:

1 block starting at 0ms finishing at 15ms
1 block starting at 15ms finishing at 30ms
2 blocks starting at 30ms finishing at 45ms
1 block starting at 45 ms finishing at 60ms
2 blocks starting at 60ms finishing at 75ms
1 block starting at 75ms finishing at 90ms

Does this really make a difference? Yes, in the second case the disk is busy continuously rather than having a 5ms pause between activities - but does that matter?

Say the data arrives twice as fast (every 5ms).

With plugging:
2 blocks starting at 5ms finishing at 20ms
3 blocks starting at 20ms finishing at 35ms
3 blocks starting at 35ms finishing at 50ms

Without plugging:
1 block starting at 0ms finishing at 15ms
3 blocks starting at 15ms finishing at 30ms
3 blocks starting at 30ms finishing at 45ms

Where is the gain?

Explicit block device plugging

Posted Apr 29, 2011 9:29 UTC (Fri) by neilbrown (subscriber, #359)

While I don't disagree with your logic, I do disagree with its relevance.

In the linux kernel, plugging is not timer based.
(There was a timer in the previous implementation, but it was only a last-ditch unplug in case there were bugs: slow is better than frozen).

In the old code a device would plug whenever it wanted to, which was typically when a new request arrived for an empty queue. It would then unplug as soon as some thread started waiting for a request on that device to complete. I think it also would unplug explicitly in some cases after submitting lots of writes that were expected to be synchronous, but I'm not 100% certain.

So in the read case, for example, a read syscall would submit a request to read a page, then another request to read the next page (because it was an 8K read), then maybe a few more requests to read ahead some more pages, then wait for that first read to complete. Waiting for the read-ahead requests maybe isn't critical, but waiting for that second page would reduce latency. Now to be fair, if the two pages were adjacent on the disk they would probably have been combined into a single request before being submitted, and if they aren't then maybe keeping them together isn't so important. But as soon as you get 3 non-adjacent pages in the read, there is a real possible gain from sorting before starting IO.

The new plugging code is quite different. The unplug happens when the thread submitting requests has finished submitting a bunch of requests. It is explicit rather than using the heuristic of 'unplug when someone waits' (hence the title of the article). This means it happens a little bit sooner - there is never any timer delay at all.

Rather than thinking of it as 'plugging' it is probably best to think of it as early-aggregation. hch has suggested that this be even more explicit. i.e. the thread generates a collection of related requests (quite possibly several files full of writes in the write-back case) and submits them all to the device at once. Not only does this clearly give a good opportunity to sort requests - more importantly it means we only take the per-device lock once for a large number of requests. If multiple threads are writing to a device concurrently, this will reduce lock contention making it useful even when the device queue is fairly full (when normal plugging would not apply at all).

The equivalent logic in a 'syslogd' style program would be to simply always service read requests before write requests.

So when a log message comes in, it is queued to be written.
Before you actually write it, though, you check whether another log message is ready to be read from some incoming socket. If it is, you read it and queue it. You only write when there is nothing to be read, or your queue is full.
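
(A small, entirely made-up poll()-based sketch of that loop: it keeps reading and queueing as long as input is available, and only writes the queue out when nothing is readable or the queue is full.)

    /* Toy sketch of "service reads before writes"; not from any real
     * syslogd.  Read and queue while input is available, flush the queue
     * only when there is nothing to read or the queue is full. */
    #include <poll.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define QMAX  1024
    #define MSGSZ  256

    static char queue[QMAX][MSGSZ];
    static int  qlen;

    static void flush_queue(int out_fd)
    {
        for (int i = 0; i < qlen; i++)     /* write out everything queued */
            write(out_fd, queue[i], strlen(queue[i]));
        qlen = 0;
    }

    static void serve(int in_fd, int out_fd)
    {
        struct pollfd pfd = { .fd = in_fd, .events = POLLIN };

        for (;;) {
            /* Block only if there is nothing to write anyway; otherwise
             * just peek to see whether more input is ready right now. */
            int n = poll(&pfd, 1, qlen ? 0 : -1);

            if (n > 0 && qlen < QMAX &&
                (pfd.revents & (POLLIN | POLLHUP | POLLERR))) {
                ssize_t got = read(in_fd, queue[qlen], MSGSZ - 1);
                if (got <= 0)
                    break;                 /* EOF or error */
                queue[qlen][got] = '\0';
                qlen++;
            } else if (qlen > 0) {
                flush_queue(out_fd);       /* nothing left to read: write now */
            }
        }
        flush_queue(out_fd);               /* drain whatever is left */
    }

    int main(void)
    {
        serve(STDIN_FILENO, STDOUT_FILENO);    /* e.g. feed it through a pipe */
        return 0;
    }

Under light load every message goes out almost immediately; under heavy load the reads win and the queue batches up on its own - the same auto-tuning behaviour described in the earlier comment.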

I agree that having a timed unplug event doesn't make much sense.

Explicit block device plugging

Posted Apr 29, 2011 14:57 UTC (Fri) by dlang (subscriber, #313)

Note that in my example, the timer was only used to indicate the maximum amount of time to wait for the next item to be submitted.

How can the kernel know when the application has finished submitting a bunch of requests?

Or is it that the application submits one request, but something in the kernel breaks it into a bunch of requests that all get submitted at once, and plugging is an attempt to allow the kernel to recombine them? (But that doesn't match your comment about sorting 3 non-adjacent requests being a win - how can one action by an application generate 3 non-adjacent requests?)

I'm obviously missing something here.

If the application is doing multiple read/write commands, I don't see how the kernel can possibly know how soon the next activity will be submitted after each command is run.

If the application is doing something with a single command, it seems like it shouldn't be broken up to begin with, so there would be no need to plug to try to combine the pieces again.

Explicit block device plugging

Posted Apr 29, 2011 22:58 UTC (Fri) by neilbrown (subscriber, #359)

The actual times between plug and unplug are typically microseconds (I suspect). The old timeout was set at 3 milliseconds, and that was very slow. Microseconds are almost nothing compared to device IO times.

Actions of the application and requests to devices are fairly well disconnected thanks to the page cache. An app writes to the page cache and the page cache doesn't even think about writing to the device for typically 30 seconds. Of course if the app calls fsync, that expedites things. So a partial answer to "how can the kernel know when the application has finished submitting a bunch of requests" is "the application calls 'fsync' - if it cares".

On the read side, the page cache performs read-ahead so that hopefully every read request can be served from cache - and certainly the device gets large read requests even if the app is making lots of small read requests.

Also, the kernel does break things into a bunch of requests which then need to be sorted. If a file is not contiguous on disk, then you need at least one request for each separate chunk. Plugging allows this sorting to happen before the first request is started.

There is a good reason why the page cache submits lots of individual requests rather than a single list with lots of requests. Every request requires an allocation. When memory gets tight (which affects writes more than reads) it could be that I cannot allocate memory for another request until the previous ones have been submitted and completed. So we submit the requests individually, but combine them at a level a little lower down, and 'unplug' that queue either when all have been submitted or when the thread 'schedules' - which it will typically only do if it blocks on a memory allocation.

So there are two distinct things here that could get confused.

Firstly there is the page cache which deliberately delays writes and expedites reads to allow large requests independent of the request size used by the application.

Then there is the fact that the page cache sends smallish requests to the device, but tends to send a lot in quick succession. These need to be combined when possible, but also flushed as soon as there is any sign of any complication. This last is what "plugging" does.
