The pernicious USB-stick stall problem
Artem S. Tashkinov recently encountered a problem that will be familiar to at least some readers: write a large amount of data to a slow storage device (a USB stick, say), and the entire system proceeds to just hang, possibly for minutes. This time around, though, Artem made an interesting observation: the system would stall when running with a 64-bit kernel, but no such problem was experienced when using a 32-bit kernel on the same hardware. One might normally expect the block I/O subsystem to be reasonably well isolated from details like the word length of the processor, but, in this case, one would be surprised.
The problem
Linus was quick to understand what was going on here. It all comes down to the problem of matching the rate at which a process creates dirty memory to the rate at which that memory can be written to the underlying storage device. If a process is allowed to dirty a large amount of memory, the kernel will find itself committed to writing a chunk of data that might take minutes to transfer to persistent storage. All that data clogs up the I/O queues, possibly delaying other operations. And, as soon as somebody calls sync(), things stop until that entire queue is written. It's a storage equivalent to the bufferbloat problem.
The developers responsible for the memory management and block I/O subsystems are not entirely unaware of this problem. To prevent it from happening, they have created a set of tweakable knobs under /proc/sys/vm to control what happens when processes create a lot of dirty pages. These knobs are:
- dirty_background_ratio specifies a percentage of memory; when at least that percentage is dirty, the kernel will start writing those dirty pages back to the backing device. So, if a system has 1000 pages of memory and dirty_background_ratio is set to 10% (the default), writeback will begin when 100 pages have been dirtied.
- dirty_ratio specifies the percentage at which processes that are dirtying pages are made to wait for writeback. If it is set to 20% (again, the default) on that 1000-page system, a process dirtying pages will be made to wait once the 200th page is dirtied. This mechanism will, thus, slow the dirtying of pages while the system catches up.
- dirty_background_bytes works like dirty_background_ratio except that the limit is specified as an absolute number of bytes.
- dirty_bytes is the equivalent of dirty_ratio except that, once again, it is specified in bytes rather than as a percentage of total memory.
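For reference, the current settings can be inspected at any time with sysctl or by reading the files in /proc/sys/vm directly; this quick check (standard sysctl usage, not something from the article) shows the defaults described above:

sysctl vm.dirty_background_ratio vm.dirty_ratio vm.dirty_background_bytes vm.dirty_bytes
# vm.dirty_background_ratio = 10    <- background writeback starts at 10% dirty
# vm.dirty_ratio = 20               <- writers are throttled at 20% dirty
# vm.dirty_background_bytes = 0     <- zero means the byte-based limit is not in use
# vm.dirty_bytes = 0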
Setting these limits too low can affect performance: temporary files that will be deleted immediately will end up being written to persistent storage, and smaller I/O operations can lead to lower I/O bandwidth and worse on-disk placement. Setting the limits too high, instead, can lead to the sort of overbuffering described above.
The attentive reader may well be wondering: what happens if the administrator sets both dirty_ratio and dirty_bytes, especially if the values don't agree? The way things work is that either the percentage-based or byte-based limit applies, but not both. The one that applies is simply the one that was set last; so, for example, setting dirty_background_bytes to some value will cause dirty_background_ratio to be set to zero and ignored.
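That last-writer-wins behavior is easy to see from a root shell (a quick illustration, not taken from the article):

sysctl vm.dirty_background_ratio                        # 10 on a default system
sysctl -w vm.dirty_background_bytes=$((16*1024*1024))   # switch to a 16MB byte-based limit
sysctl vm.dirty_background_ratio                        # now 0: the ratio has been cleared and is ignored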
Two other details are key to understanding the behavior described by Artem: (1) by default, the percentage-based policy applies, and (2) on 32-bit systems, that ratio is calculated relative to the amount of low memory — the memory directly addressable by the kernel, not the full amount of memory in the system. In almost all 32-bit systems, only the first ~900MB of memory fall into the low-memory region. So on any current system with a reasonable amount of memory, a 64-bit kernel will implement dirty_background_ratio and dirty_ratio differently than a 32-bit system will. For Artem's 16GB system, the 64-bit dirty_ratio limit would be 3.2GB; the 32-bit system, instead, sets the limit at about 180MB.
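The arithmetic behind those two numbers is simple to check (the ~900MB figure is the low-memory approximation used above):

echo "$((20 * 16 * 1024 / 100)) MB"   # 64-bit: 20% of 16GB of RAM        -> 3276 MB, about 3.2GB
echo "$((20 * 900 / 100)) MB"         # 32-bit: 20% of ~900MB low memory  -> 180 MB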
The (huge) difference between these two limits is immediately evident when writing a lot of data to a slow storage device. The lower limit does not allow anywhere near as much dirty data to accumulate before throttling the process doing the writing, with much better results for the user of the system (unless said user wanted to give up in disgust and go for beer, of course).
Workarounds and fixes
When the problem is that clearly understood, one can start to talk about solutions. Linus suggested that anybody running into this kind of problem can work around it now by setting dirty_background_bytes and dirty_bytes to reasonable values. But it is generally agreed that the default values on 64-bit systems just don't make much sense on contemporary systems. In fact, according to Linus, the percentage-based limits have outlived their usefulness in general:
Things have changed.
Thus, he suggested, the defaults should be changed to use the byte-based limits; either that, or the percentage-based limits could be deemed to apply only to the first 1GB of memory.
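Until the defaults change, the workaround amounts to setting the byte-based knobs by hand. A hedged example of making that persistent is shown below; the 16MB/48MB values are the ones that recur in the comments on this article, not a recommendation from the article itself, and the right numbers depend on the hardware:

# /etc/sysctl.d/99-writeback.conf -- illustrative values only
vm.dirty_background_bytes = 16777216   # start background writeback once 16MB is dirty
vm.dirty_bytes = 50331648              # throttle writers once 48MB is dirty

# apply without rebooting:
sysctl --system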
Of course, it would be nicer to have smarter behavior in the kernel. The
limit that applies to a slow USB device may not be appropriate for a
high-speed storage array. The kernel has logic now that tries to estimate
the actual writeback speeds achievable with each attached device; with that
information, one could try to limit dirty pages based on the amount of time
required to write them all out. But, as Mel Gorman noted, this approach is "not that trivial to implement".
Andreas Dilger argued that the whole idea of building up large amounts of dirty data before starting I/O is no longer useful. The Lustre filesystem, he said, will start I/O with 8MB or so of dirty data; he thinks that kind of policy (applied on a per-file basis) could solve a lot of problems with minimal complexity. Dave Chinner, however, sees a more complex world where that kind of policy will not work for a wide range of workloads.
Dave, instead, suggests that the kernel focus on implementing two fundamental policies: "writeback caching" (essentially how things work now) and "writethrough caching," where much lower limits apply and I/O starts sooner. Writeback would be used for most workloads, but writethrough makes sense for slow devices or sequential streaming I/O patterns. The key, of course, is enabling the kernel to figure out which policy should apply in each case without the need for user intervention. There are some obvious indications, including various fadvise() calls or high-bandwidth sequential I/O, but, doubtless, there would be details to be worked out.
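Nothing here is settled policy, but a user can already approximate writethrough behavior for a single large copy with standard dd options (an illustration, not something proposed in the thread; oflag=direct requires the target filesystem to support direct I/O):

dd if=big.iso of=/mnt/usb/big.iso bs=4M oflag=direct     # bypass the page cache entirely
dd if=big.iso of=/mnt/usb/big.iso bs=4M conv=fdatasync   # cache normally, but flush all data before dd exits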
In the short term, though, we're most likely to see relatively simple
fixes. Linus has posted a patch limiting
the percentage-based calculations to the first 1GB of memory. This kind of
change could conceivably be merged for 3.13; fancier solutions, obviously,
will take longer.
| Index entries for this article | |
|---|---|
| Kernel | Memory management/Writeback |
Posted Nov 7, 2013 8:04 UTC (Thu)
by eru (subscriber, #2753)
[Link] (2 responses)
> but writethrough makes sense for slow devices

And also for all removable devices, because with them it is common you want to flush all pending writes and unmount.
Posted Nov 7, 2013 13:38 UTC (Thu)
by renox (guest, #23785)
[Link] (1 responses)
> And also for all removable devices, because with them it is common you want to flush all pending writes and unmount.

*all* removable devices?? What you describe is common for USB key, sure, but what about USB HDD? I'm sure that some users have such an HDD plugged in all the time and they would object to the performance degradation..
Posted Nov 8, 2013 3:39 UTC (Fri)
by eru (subscriber, #2753)
[Link]
> I'm sure that some users have such an HDD plugged in all the time and they would object to the performance degradation.

I am myself one of those, finding that the easiest way to expand storage on my old home PC. But I think that is still exceptional, and could be solved by passing a parameter when mounting (in fstab? I have those drives mounted the traditional way, instead of hot-plugging), or later via some tuning command.
Posted Nov 7, 2013 8:34 UTC (Thu)
by nix (subscriber, #2304)
[Link] (1 responses)
But... compare it to what we have now, which is *catastrophically* wrong by default, easily giving you a situation where you can have an amount of data that can take ten minutes to write out and jamming the entire system into a stall until it's done. In order to be as bad as what we have now, a rate-estimation-based system would have to be hit with a device which goes from above-SSD speeds to USB-key speeds on the fly -- and how likely is that?
In this case, I'd say, the perfect is the enemy of the good. What we have now is bad: simpleminded rate-estimation would be better, even if not perfect. Go for that first, pile in the complexity later, and throw away those horrible old knobs. (Whichever was written to last wins?! Ugh!)
Posted Nov 7, 2013 14:48 UTC (Thu)
by jzbiciak (guest, #5246)
[Link]
> a device which goes from above-SSD speeds to USB-key speeds on the fly

I guess it depends on whether there are devices that have their own in-built caching and can absorb quite a few writes until they slow down dramatically. They could exhibit rather bimodal behavior based on the size of the incoming writes.

Also, where does NFS fit in the picture? There, performance may fluctuate quite a bit as well, although I don't know if it's affected by this particular set of knobs. (Seems like it ought to be.) With NFS, you have the combined effects of the buffering on the NFS server as well as all the other people on the network vying for the same bandwidth.

My gut feel tells me any simple-minded rate estimator should also have a fairly quick adaptation rate so it tracks any workload-dependent behavior and other variations in media performance. I.e., it should probably represent recent history (the last several seconds), more than long-term history.
Posted Nov 7, 2013 8:43 UTC (Thu)
by johannbg (guest, #65743)
[Link]
Posted Nov 7, 2013 16:17 UTC (Thu)
by luto (guest, #39314)
[Link] (2 responses)

> The one that applies is simply the one that was set last; so, for example, setting dirty_background_bytes to some value will cause dirty_background_ratio to be set to zero and ignored.

Seriously? I bet that no one (especially the sysctl tool) gets this right.
Posted Nov 7, 2013 21:43 UTC (Thu)
by nybble41 (subscriber, #55106)
[Link]
In any case the percentage should be based on the same numbers--total RAM, not low memory--regardless of the word size.
Posted Nov 8, 2013 6:42 UTC (Fri)
by neilbrown (subscriber, #359)
[Link]
Guess what. The kernel doesn't even get it right!! Almost but not quite.
There is a global variable "ratelimit_pages" which is effectively a granularity - we only do the expensive tests every "ratelimit_pages" pages.
This gets updated whenever you set dirty_ratio or dirty_bytes. It is set to dirty_thresh / (num_online_cpus() * 32)
However if you set "dirty_bytes" and then "dirty_ratio", the second calculation of ratelimit_pages will be based on the old "dirty_bytes" value, not the new "dirty_ratio" value.
It's a minor bug, but it confirms your assertion that this is an easy interface to get wrong.
[dirty_ratio_handler() should set "vm_dirty_bytes = 0" *before* the call to writeback_set_ratelimit()]
Posted Nov 7, 2013 23:43 UTC (Thu)
by jhhaller (guest, #56103)
[Link] (3 responses)
My question is whether writing dirty pages back more quickly will have a big effect on COW filesystem performance. Copying a large file to a COW filesystem may trigger more COW actions than before.
Posted Nov 11, 2013 0:45 UTC (Mon)
by giraffedata (guest, #1954)
[Link] (2 responses)
fsync is a rather primitive way to cause the system not to keep memory needlessly dirty.

Lots of fadvise and madvise flags have been developed over the years; doesn't one of them get the writeback happening immediately? I created a TV recorder based on a ca 2004 Linux kernel that used msync(MS_ASYNC) for that purpose (yes, I made it write the file via mmap just so I could use msync).
Of course, if the issue is that you think the recorder might not actually be able to keep up with the data arriving, and want to make sure when that happens it just drops TV data and doesn't cripple the rest of the system too, then you do need something synchronous like fsync.
Posted Nov 11, 2013 1:59 UTC (Mon)
by jhhaller (guest, #56103)
[Link] (1 responses)
Posted Nov 11, 2013 4:16 UTC (Mon)
by giraffedata (guest, #1954)
[Link]
Yes, I was not responding to the point. The actual point lost me, because I don't see the connection between the high-volume writing and reading you described and copy-on-write and the stalling of the playback. But I also don't know anything about mythtv or btrfs specifically.
I don't think there's anything inherent in COW that means if you flush a large file write sooner that you make more copies of tree data or some group of blocks, but I may have just totally missed the scenario you have in mind.
Posted Nov 8, 2013 16:12 UTC (Fri)
by ssam (guest, #46587)
[Link] (2 responses)
cat /proc/sys/vm/dirty_background_ratio
cat /proc/sys/vm/dirty_ratio
(I get 10 and 20 on fedora), and then change them to smaller numbers with something like
echo 2 > /proc/sys/vm/dirty_background_ratio
echo 5 > /proc/sys/vm/dirty_ratio
in your /etc/rc.d/rc.local
Posted Nov 9, 2013 4:07 UTC (Sat)
by naptastic (guest, #60139)
[Link]
Posted Nov 11, 2013 0:34 UTC (Mon)
by giraffedata (guest, #1954)
[Link]
I don't see how that will make things noticeably better. In fact, it will probably make it worse.

With the defaults, your system can write up to .2M (where M is the size of your memory) to the USB stick before the system slows to a crawl. Remember that not just the process writing to the stick must wait for writeback before it can dirty more pages - all processes must. With your proposed numbers, the crawl happens for writes to the stick as small as .05M.
Lowering dirty_background_ratio will give your system a probably imperceptible head start (and thus earlier finish) on writing all that data to the USB stick. The headstart will be the amount of time it takes to buffer .08M of writes (10% - 2%).
These workarounds using global parameters can help only in carefully constructed cases. To prevent a slow write to USB from affecting non-USB-writing processes, the kernel would need some kind of memory allocation scheme that distinguishes processes or distinguishes write speed of backing devices.
Posted Nov 9, 2013 4:25 UTC (Sat)
by naptastic (guest, #60139)
[Link]
* - dirty_background_ratio is a signed 32-bit value. If our "fixed value" were 1MiB, you could specify up to 2TiB of buffer space in the future. When we get to 128-bit architectures, we might want to increase the size of dirty_background_ratio to accommodate larger buffers.
Posted Nov 9, 2013 6:58 UTC (Sat)
by Nagilum (guest, #93411)
[Link] (2 responses)

vm.dirty_background_ratio=0

or some other very low number (0..5). Personally I see no reason to delay starting to write dirty data out other than power saving. There is usually only a very slim chance that data will be written that will be deleted right away again so delaying starting to flush the data to disk makes very little sense to me.

If you have multiple writers it may also help with the performance if you have a higher value here but high for me is something like 5.
Posted Nov 15, 2013 13:40 UTC (Fri)
by Wol (subscriber, #4433)
[Link] (1 responses)
You're obviously not a developer (or gentoo user). I have a huge (20/30Gb ramdisk) for temp precisely because I quite often have gigs of data that gets created and deleted pretty quick. What's the point of writing it to disk when my system has plenty of ram?
Cheers,
Wol
Posted Nov 15, 2013 14:03 UTC (Fri)
by Nagilum (guest, #93411)
[Link]

Anyway if nothing else is waiting for IO on that disk then it still wouldn't bother you very much since it won't block anything.

Anyhow you have your use-case solved.
Posted Nov 9, 2013 8:54 UTC (Sat)
by marcH (subscriber, #57642)
[Link] (2 responses)
Good analogy. It stops at the cure though: with TCP/IP it's all about dropping packets! Back-pressure is extremely rare in networking because of Head Of Line blocking.
Speaking of Head Of Line blocking, I suspect the queues involved in this article don't make the difference between users, do they? In other words, someone writing a lot to a slow device will considerably slow other users, correct?
(Yes, I do realize USB sticks don't tend to have a lot of concurrent users :-)
Posted Nov 9, 2013 15:30 UTC (Sat)
by raven667 (subscriber, #5198)
[Link] (1 responses)
Posted Nov 11, 2013 5:32 UTC (Mon)
by jzbiciak (guest, #5246)
[Link]
Posted Nov 14, 2013 10:02 UTC (Thu)
by callegar (guest, #16148)
[Link]
Posted Nov 14, 2013 19:29 UTC (Thu)
by chojrak11 (guest, #52056)
[Link] (3 responses)
Posted Nov 14, 2013 19:34 UTC (Thu)
by apoelstra (subscriber, #75205)
[Link] (2 responses)
One result of this is that average-case data transfer on Windows is noticeably slower than on Linux, which is probably why the kernel folks are loath to copy such a solution.
Posted Nov 14, 2013 19:51 UTC (Thu)
by khim (subscriber, #9252)
[Link] (1 responses)

This is classic case of "perfect is the enemy of good". Yes, Windows commits everything to USB stick right away, yes it's slow and inefficient, but it also means that you can actually use USB sticks without worry! With Linux small operations are extremely fast, but try to copy few gigs of data to USB stick and be ready to use a different computer for a few minutes (or hours) because system will be totally hosed.
Posted Nov 14, 2013 21:48 UTC (Thu)
by hummassa (subscriber, #307)
[Link]
Ah, and in my job I see hundreds of USB drives becoming totally hosed every year, by way users writing them on a windows machine and removing without unmounting. Don't do that, even on Windows.
Posted Nov 15, 2013 11:09 UTC (Fri)
by jezuch (subscriber, #52988)
[Link] (1 responses)
I remember there was an issue with page locking, I think, some time ago that was solved with stable pages. But it still doesn't make much sense.
Posted Jan 8, 2014 10:26 UTC (Wed)
by jospoortvliet (guest, #33164)
[Link]
Posted Nov 15, 2013 21:02 UTC (Fri)
by HenrikH (subscriber, #31152)
[Link] (1 responses)
Posted Nov 17, 2013 4:25 UTC (Sun)
by mathstuf (subscriber, #69389)
[Link]
- Heavy output on any local terminal (switching workspaces or to another tmux window makes it work better, so closer to the X side of the I/O pipeline; the switch might not take effect for 20–30 seconds though)
- rdesktop (I have to hide Windows' TTY window(s) during a build since its speed affects my local machine even when on another, hidden, desktop locally)
Posted Jun 12, 2015 20:41 UTC (Fri)
by evultrole (guest, #103116)
[Link] (1 responses)
Got the problem fixed with a custom udev rule.
/usr/lib/udev/rules.d/81-udisks_maxsect.rules
SUBSYSTEMS=="scsi", ATTR{max_sectors}=="240", ATTR{max_sectors}="32678"
My hangs disappeared after a reboot.
Posted Jun 13, 2015 8:04 UTC (Sat)
by paulj (subscriber, #341)
[Link]
Posted Nov 7, 2018 21:11 UTC (Wed)
by sourcejedi (guest, #45153)
[Link]
"The entire system proceeds to just hang" - I think this is misleading :-(. Artem didn't report this, and I don't see any other evidence presented for it.
I am hopeful that it is prevented, or at least mitigated, by the "No-I/O dirty throttling" code that you reported on in 2011 :-). This throttles write() calls to control both the size of the overall writeback cache, and the amount of writeback cache *for the specific backing device*.
Artem did not report the entire system hanging while it flushes cached writes to a USB stick. His report only complained the "sync" command could take up to "dozens of minutes".
In his followup message, Artem reported "the server almost stalls and other IO requests take a lot more time to complete even though `mysqldump` is run with `ionice -c3`". But this was not the USB-stick problem. It happened after creating a 10GB file on an *internal* disk.
I'm not saying there isn't a bufferbloat-style problem. But I can't find any evidence here that excessive writeback cache on one BDI is delaying writes to other BDIs. At least in the simple case you described.
I wrote a StackExchange post about this here.
Posted Apr 5, 2020 16:10 UTC (Sun)
by abdulla95 (guest, #138076)
[Link]
echo $((16*1024*1024)) > /proc/sys/vm/dirty_background_bytes
echo $((48*1024*1024)) > /proc/sys/vm/dirty_bytes
Didn't help me. The performance did improve but it would still lag. (I have a 1TB HDD and 8GB RAM)
My question is, is using a hack to go around this a good thing? Like `ionice`, `rsync`, `pv`? I have seen these being thrown around in the internet. And I have used rsync and it works.
Posted Mar 5, 2021 14:23 UTC (Fri)
by kolay.ne (guest, #145247)
[Link] (2 responses)

Still having this issue
Posted Apr 18, 2021 13:19 UTC (Sun)
by LaurentD (guest, #151713)
[Link] (1 responses)
That said, I concur. 2021 already and still seeing the issue here as well. The below does not seem to help much.
echo $((16*1024*1024)) > /proc/sys/vm/dirty_background_bytes
echo $((48*1024*1024)) > /proc/sys/vm/dirty_bytes
I wonder: how have other OSs addressed the issue?
Posted Apr 20, 2021 0:42 UTC (Tue)
by flussence (guest, #85566)
[Link]
More serious answer: has anyone benchmarked the bufferbloat in writing directly to a USB stick compared to spinning up a VM, handing the USB device to that, exposing the stick as an NFS share and writing that way? I honestly wouldn't be surprised if the latter works better.
Posted Aug 20, 2021 20:10 UTC (Fri)
by xmready (guest, #153808)
[Link] (2 responses)

I keep imagining a new user going to linux and facing this problem, having to wait 2 hours to unmount his usb stick 2.0 and having to go out looking for a manual solution on google

In my case i am not a new user but i had this problem and i ended up finding a solution and now i can unmount the pendrive as soon as the copy progress bar ends (just like it is in windows)
Posted Aug 21, 2021 1:25 UTC (Sat)
by pizza (subscriber, #46)
[Link] (1 responses)
Patches welcome!
Posted Jul 12, 2023 19:43 UTC (Wed)
by juliano_vs (guest, #166031)
[Link]
correction:

create the file:
/etc/udev/rules.d/60-usb-dirty-pages-udev.rules
with the following content:
ACTION=="add", KERNEL=="sd[a-z]", SUBSYSTEM=="block", ENV{ID_USB_TYPE}=="disk", RUN+="/usr/bin/bash -c 'echo 1 > /sys/block/%k/bdi/strict_limit; echo 16777216 > /sys/block/%k/bdi/max_bytes'"
then restart your machine !

After so many years this should already have a definitive solution in the kernel and not need manual intervention by the user. It's this kind of thing that keeps new users away from the system