It is a rare event, but it is no fun when it strikes. Plug in a slow storage device - a USB stick or a music player, for example - and run something like rsync to move a lot of data to that device. The operation takes a while, which is unsurprising; more surprising is when random processes begin to stall. In the worst cases, the desktop can lock up for minutes at a time; that, needless to say, is not the kind of interactive response that most users are looking for. The problem can strike in seemingly arbitrary places; the web browser freezes, but a network audio stream continues to play without a hiccup. Everything unblocks eventually, but, by then, the user is on their third beer and contemplating the virtues of proprietary operating systems. One might be forgiven for thinking that the system should work a little better than that.
Numerous people have reported this sort of behavior in recent times; your editor has seen it as well. But it is hard to reproduce, which means it has been hard to track down. It is also entirely possible that there is more than one bug causing this kind of behavior. In any case, there should now be one less bug of this type if Mel Gorman's patch proves to be effective. But a few developers are wondering if, in some cases, the cure is worse than the disease.
The problem Mel found appears to go somewhat like this. A process (that web browser, say) is doing its job when it incurs a page fault. This is normal; the whole point of contemporary browsers sometimes seems to be to stress-test virtual memory management systems. The kernel will respond by grabbing a free page to slot into the process's address space. But, if the transparent huge pages feature is built into the kernel (and most distributors do enable this feature), the page fault handler will attempt to allocate a huge page instead. With luck, there will be a huge page just waiting for this occasion, but that is not always the case; in particular, if there is a process dirtying a lot of memory, there may be no huge pages available. That is when things start to go wrong.
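The fault-time path is easy to poke at from user space. The following is a minimal sketch, assuming a kernel built with transparent huge page support: it maps anonymous memory, hints with madvise(MADV_HUGEPAGE), touches every page to trigger the fault path described above, then totals the AnonHugePages lines from /proc/self/smaps. Whether the touch actually produces huge pages depends on the THP policy and on how fragmented memory happens to be at that moment.

    /* Minimal sketch: ask for THP backing at fault time and see what we got.
     * Assumes CONFIG_TRANSPARENT_HUGEPAGE; the result depends on the THP
     * policy and on memory fragmentation at the time of the fault. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    #define LEN (64UL << 20)        /* 64MB of anonymous memory */

    int main(void)
    {
        char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* Hint that this range should be backed by huge pages if possible. */
        if (madvise(p, LEN, MADV_HUGEPAGE))
            perror("madvise(MADV_HUGEPAGE)");

        memset(p, 0x5a, LEN);       /* touch every page: this is the fault path */

        /* Count how much of this process ended up on huge pages. */
        FILE *f = fopen("/proc/self/smaps", "r");
        char line[256];
        long kb = 0, v;
        while (f && fgets(line, sizeof(line), f))
            if (sscanf(line, "AnonHugePages: %ld kB", &v) == 1)
                kb += v;
        if (f) fclose(f);
        printf("AnonHugePages backing this process: %ld kB\n", kb);
        return 0;
    }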
Once upon a time, one just had to assume that, once the system had been running for a while, large chunks of physically-contiguous memory would simply not exist. Virtual memory management tends to fragment such chunks quickly. So it is a bad idea to assume that huge pages will just be sitting there waiting for a good home; the kernel has to take explicit action to cause those pages to exist. That action is compaction: moving pages around to defragment the free space and bring free huge pages into existence. Without compaction, features like transparent huge pages would simply not work in any useful way.
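Conceptually, compaction is defragmentation of a zone's page array: movable pages are migrated toward one end so that contiguous runs of free base pages appear at the other. The toy model below is purely illustrative - none of the names or structures correspond to kernel code - but it shows why a zone with plenty of free memory can still have no huge page to offer until pages are moved.

    /* Toy model of memory compaction: migrate used pages toward one end of a
     * "zone" so that a contiguous run of free pages - a huge page - appears at
     * the other end.  Illustrative only; no kernel data structures involved. */
    #include <stdio.h>

    #define ZONE_PAGES 32
    #define HUGE_PAGES 8            /* pretend a huge page is 8 base pages */

    /* Move every used page ('U') to the front, leaving the free pages ('.')
     * contiguous at the end - the rough effect of a compaction pass. */
    static void compact(char *zone)
    {
        int dst = 0;
        for (int src = 0; src < ZONE_PAGES; src++)
            if (zone[src] == 'U')
                zone[dst++] = 'U';
        while (dst < ZONE_PAGES)
            zone[dst++] = '.';
    }

    static int largest_free_run(const char *zone)
    {
        int best = 0, run = 0;
        for (int i = 0; i < ZONE_PAGES; i++) {
            run = (zone[i] == '.') ? run + 1 : 0;
            if (run > best)
                best = run;
        }
        return best;
    }

    int main(void)
    {
        /* A fragmented zone: half the pages are free, none of them adjacent. */
        char zone[ZONE_PAGES + 1];
        for (int i = 0; i < ZONE_PAGES; i++)
            zone[i] = (i % 2) ? '.' : 'U';
        zone[ZONE_PAGES] = '\0';

        printf("before:  %s  largest free run: %d (huge page needs %d)\n",
               zone, largest_free_run(zone), HUGE_PAGES);
        compact(zone);
        printf("after:   %s  largest free run: %d\n",
               zone, largest_free_run(zone));
        return 0;
    }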
A lot of the compaction work is done in the background. But current kernels will also perform "synchronous compaction" when an attempt to allocate a huge page would fail due to lack of availability. The process attempting to perform that allocation gets put to work migrating pages in an attempt to create the huge page it is asking for. This operation is not free in the best of times, but it should not be causing multi-second (or multi-minute) stalls. That is where the USB stick comes in.
If a lot of data is being written to a slow storage device, memory will quickly be filled with dirty pages waiting to be written out. That, in itself, can be a problem, which is why the recently-merged I/O-less dirty throttling code tries hard to keep pages for any single device from taking too much memory. But writeback to a slow device plays poorly with compaction; the memory management code cannot migrate a page that is being written back until the I/O operation completes. When synchronous compaction encounters such a page, it will go to sleep waiting for the I/O on that page to complete. If the page is headed to a slow device, and it is far back on a queue of many such pages, that sleep can go on for a long time.
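That buildup is easy to watch. The sketch below simply samples the standard Dirty and Writeback fields of /proc/meminfo once per second; run alongside a large copy to a USB stick, it shows dirty data piling up and then draining at the device's pace (the one-second interval is arbitrary).

    /* Watch dirty and under-writeback page counts while copying to a slow
     * device.  Reads the Dirty:/Writeback: fields of /proc/meminfo once per
     * second; run it in another terminal during the rsync. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        for (;;) {
            FILE *f = fopen("/proc/meminfo", "r");
            if (!f) { perror("/proc/meminfo"); return 1; }

            char line[128];
            long dirty = -1, writeback = -1;
            while (fgets(line, sizeof(line), f)) {
                sscanf(line, "Dirty: %ld kB", &dirty);
                sscanf(line, "Writeback: %ld kB", &writeback);
            }
            fclose(f);

            printf("Dirty: %8ld kB   Writeback: %8ld kB\n", dirty, writeback);
            sleep(1);
        }
    }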
One should not forget that producing a single huge page can involve migrating hundreds of ordinary pages (on x86-64, a 2MB huge page is assembled from 512 4KB base pages). So once that long sleep completes, the job is far from done; the process stuck performing compaction may find itself at the back of the writeback queue quite a few times before it can finally get its page fault resolved. Only then will it be able to resume executing the code that the user actually wanted run - until the next page fault happens and the whole mess starts over again.
Mel's fix is a simple one-liner: if a process is attempting to allocate a transparent huge page, synchronous compaction should not be performed. In such a situation, Mel figured, it is far better to just give the process an ordinary page and let it continue running. The interesting thing is that not everybody seems to agree with him.
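The shape of the change is easiest to see as a different answer to the question "may this allocation sleep in synchronous compaction?" The toy model below is not the kernel patch itself: the flag name echoes the __GFP_NO_KSWAPD flag that transparent huge page allocations carried in kernels of that era, but the value and the surrounding structure are invented for illustration.

    /* Toy model (userspace, not kernel code) of the decision described above:
     * skip synchronous compaction when the allocation is a THP fault.  The
     * flag name mirrors the kernel's, but the value is illustrative only. */
    #include <stdbool.h>
    #include <stdio.h>

    #define GFP_NO_KSWAPD  (1u << 0)   /* illustrative bit; THP faults set it */

    /* Before the fix: the allocation slow path always migrates synchronously. */
    static bool sync_migration_old(unsigned int gfp_mask)
    {
        (void)gfp_mask;
        return true;
    }

    /* After the fix: never block a THP fault on synchronous compaction. */
    static bool sync_migration_new(unsigned int gfp_mask)
    {
        return !(gfp_mask & GFP_NO_KSWAPD);
    }

    int main(void)
    {
        unsigned int thp_fault = GFP_NO_KSWAPD;  /* huge-page fault */
        unsigned int normal    = 0;              /* ordinary allocation */

        printf("THP fault: old=%d new=%d\n",
               sync_migration_old(thp_fault), sync_migration_new(thp_fault));
        printf("other:     old=%d new=%d\n",
               sync_migration_old(normal), sync_migration_new(normal));
        return 0;
    }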
Andrew Morton was the first to object, saying "Presumably some people would prefer to get lots of huge pages for their 1000-hour compute job, and waiting a bit to get those pages is acceptable." David Rientjes, presumably thinking of Google's throughput-oriented tasks, said that there are times when the latency is entirely acceptable, but that some tasks really want to get huge pages at fault time. Mel's change makes it that much less likely that processes will be allocated huge pages in response to faults; David does not appear to see that as a good thing.
One could (and Mel did) respond that the transparent huge page mechanism does not only work at fault time. The kernel will also try to replace small pages with huge pages in the background while the process is running; that mechanism should bring more huge pages into use - for longer-running processes, at least - even if they are not available at fault time. In cases where that is not enough, there has been talk of adding a new knob to allow the system administrator to request that synchronous compaction be used. The actual semantics of such a knob are not clear; one could argue that if huge page allocations are that much more important than latency, the system should perform more aggressive page reclaim as well.
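For reference, the THP policy knobs that already exist live under sysfs; the existing "defrag" file already expresses a related policy, and any new synchronous-compaction knob would presumably appear alongside it. A small sketch that just prints them, assuming a kernel built with transparent huge page support:

    /* Print the transparent huge page tunables that already exist in sysfs.
     * Paths as exposed by kernels built with CONFIG_TRANSPARENT_HUGEPAGE. */
    #include <stdio.h>

    static void dump(const char *path)
    {
        FILE *f = fopen(path, "r");
        char buf[256];

        if (!f) {
            printf("%-55s (not present)\n", path);
            return;
        }
        if (fgets(buf, sizeof(buf), f))
            printf("%-55s %s", path, buf);
        fclose(f);
    }

    int main(void)
    {
        dump("/sys/kernel/mm/transparent_hugepage/enabled");
        dump("/sys/kernel/mm/transparent_hugepage/defrag");
        dump("/sys/kernel/mm/transparent_hugepage/khugepaged/defrag");
        return 0;
    }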
Andrea Arcangeli commented that he does not like how Mel's change causes failures to use huge pages at fault time; he would rather find a way to keep synchronous compaction from stalling instead. Some ideas for doing that are being thrown around, but no solution has been found as of this writing.
Such details can certainly be worked out over time. Meanwhile, if Mel's patch turns out to be the best fix, the decision on merging should be clear enough. Given a choice between (1) a system that continues to be responsive during heavy I/O to slow devices and (2) random, lengthy lockups in such situations, one might reasonably guess that most users would choose the first alternative. Barring complications, one would expect this patch to go into the mainline fairly soon, and possibly into the stable tree shortly thereafter.
Huge pages, slow drives, and long delays
Posted Nov 17, 2011 2:12 UTC (Thu) by nybble41 (subscriber, #55106) [Link]
In my opinion, in-core operations should never be forced to wait on disk I/O unless it's necessary to prevent the entire operation from failing. On the other hand, there is definite value in allocating a hugepage up front, so it might make sense to put some effort toward locating a candidate hugepage which *can* be migrated rather than immediately falling back to individual pages.
If it's possible to try another hugepage, or fall back to individual pages, these options should come first.
Huge pages, slow drives, and long delays
Posted Nov 17, 2011 2:21 UTC (Thu) by naptastic (subscriber, #60139) [Link]
Why not tie this behavior to the kernel preemption setting? If it's set to anything higher than voluntary, then Mel's change (and perhaps some others?) should be in place; if it's set to no preemption, then go ahead and stall processes while you're making room for some hugepages.
It's not server vs desktop
Posted Nov 17, 2011 8:10 UTC (Thu) by khim (subscriber, #9252) [Link]
Actually it does not always make sense on a server either. If you have some batch-processing operation (the slocate indexer on a desktop, map-reduce on a server), then it's OK to wait for the compaction - even if it'll take a few minutes.
But if you need a response right away (most desktop operations, or a live request in the server case), then latency is paramount.
It's not server vs desktop
Posted Nov 17, 2011 13:21 UTC (Thu) by mennucc1 (subscriber, #14730) [Link]
It's not server vs desktop
Posted Nov 17, 2011 21:26 UTC (Thu) by lordsutch (guest, #53) [Link]
It's not server vs desktop
Posted Nov 18, 2011 2:16 UTC (Fri) by naptastic (subscriber, #60139) [Link]
I see two questions. First: can we infer from a process's niceness or scheduler class whether it would prefer waiting for a hugepage or taking what's available now? Second: are memory compaction passes preemptible? Is this the behavior you're looking for?
1. A low-priority, SCHED_IDLE process (1) tries to allocate memory. The kernel starts compacting memory to provide it with a hugepage.
2. A higher-priority, SCHED_FIFO process (2) becomes runnable and tries to allocate. Because it's higher priority, the kernel puts the request for (1) on the back burner. Because (2) is SCHED_FIFO, the kernel doesn't wait for compaction but just gives it what's available now.
3. With that request satisfied, the kernel goes back to compacting in order to satisfy (1)'s needs.
As someone who only uses -rt kernels, this is the behavior I think I would want. The network can get hugepages, and it can wait for them; but jackd and friends had better get absolute preferential treatment for memory Right Now.
Huge pages, slow drives, and long delays
Posted Nov 17, 2011 3:09 UTC (Thu) by smoogen (subscriber, #97) [Link]
Huge pages, slow drives, and long delays
Posted Nov 17, 2011 7:42 UTC (Thu) by iq-0 (subscriber, #36655) [Link]
Writeback caching for USB sticks
Posted Nov 17, 2011 10:15 UTC (Thu) by epa (subscriber, #39769) [Link]
Or at least, allow only a fixed, small amount of dirty pages for removable and slow devices (no more than can be written out within one second, say).
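A mechanism along these lines already exists: each backing device has a max_ratio knob under /sys/class/bdi/ that caps its share of the dirty-page pool. A sketch of setting it follows; it requires root, and the "8:16" device number is a placeholder to be replaced with the major:minor number of the actual stick (see /sys/class/bdi/).

    /* Sketch: cap how much dirty memory a single slow device may accumulate,
     * using the per-backing-device max_ratio knob.  Requires root; "8:16" is
     * a placeholder major:minor number - substitute your own device's. */
    #include <stdio.h>

    int main(void)
    {
        const char *knob = "/sys/class/bdi/8:16/max_ratio";
        FILE *f = fopen(knob, "w");

        if (!f) {
            perror(knob);
            return 1;
        }
        /* Allow this device roughly 1% of the global dirty threshold. */
        fprintf(f, "1\n");
        fclose(f);
        return 0;
    }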
Writeback caching for USB sticks
Posted Dec 15, 2011 13:43 UTC (Thu) by hpro (subscriber, #74751) [Link]
On a related note, have you ever tried to do some useful work on files on a stick mounted 'sync'? It is quite painful, I assure you.
Huge pages, slow drives, and long delays
Posted Nov 17, 2011 10:31 UTC (Thu) by michaeljt (subscriber, #39183) [Link]
Ha-Ha-Only-Serious
Posted Nov 17, 2011 11:09 UTC (Thu) by CChittleborough (subscriber, #60775) [Link]
[T]he whole point of contemporary browsers sometimes seems to be to stress-test virtual memory management systems.
Can I submit this as a quote of the week?
This really is a classic "ha-ha-only-serious" quip. For me (and, I guess, most people) the browser is the only thing that uses large amounts of memory, so any memory-related misconfiguration shows up as browser slowdowns, freezes and crashes. (With Firefox 3 and Firefox 4, the OOM killer used to startle me once or twice a month.) Is there a good HOWTO on this topic?
Ha-Ha-Only-Serious
Posted Nov 17, 2011 20:11 UTC (Thu) by nevets (subscriber, #11875) [Link]
the browser is the only thing that uses large amounts of memory
You obviously don't use Evolution.
Ha-Ha-Only-Serious
Posted Nov 17, 2011 20:51 UTC (Thu) by chuckles (subscriber, #41964) [Link]
You obviously don't use Evolution.
Ha-Ha-Only-Serious
Posted Nov 17, 2011 21:10 UTC (Thu) by nevets (subscriber, #11875) [Link]
I still use it, and I sometimes wish they would stop working on it, as they keep making it harder to use after every update. I guess they have the GNOME mindset too.
I never cared much for mutt. I do like alpine, but too many people send me HTML crap that I need to read, and tbird always screws up patches I try to send.
Evolution seems to work the best with IMAP, and it's trivial to send sane patches.
Ha-Ha-Only-Serious
Posted Nov 17, 2011 23:33 UTC (Thu) by mathstuf (subscriber, #69389) [Link]
I didn't care for mutt at first either, mainly because the bindings were crazy and inconsistent. I've gotten it to be pretty well vim-ish, but it's not a 1:1 translation. It is certainly more consistent though, and that makes it much better for me.
> I do like alpine
I couldn't get the hang of alpine. Too much not-mail clutter around.
> but too many people send me html crap that I need to read
For HTML, I have bindings to toggle plain-text or w3m -dump viewing for HTML mails (Ctrl-U and Ctrl-I, respectively). It works fairly well; not perfect, but better than any web interface.
Ha-Ha-Only-Serious
Posted Nov 18, 2011 0:40 UTC (Fri) by nevets (subscriber, #11875) [Link]
I like to have a preview screen. I move mail all the time by dragging a message over to a folder with the mouse. I do wish Evolution had better keyboard shortcuts, as I probably could move messages faster by typing. I did with alpine. But mutt still seems hackish to me, and I never got past that. I've been using it for LKML for a few years now, and I still don't care much for it.
Evolution is big and slow, and I need to kill it as often as I do my browsers, but other than that, it works well for me. If it had better key bindings, I would dare call it an awesome application.
Ha-Ha-Only-Serious
Posted Nov 18, 2011 7:53 UTC (Fri) by mathstuf (subscriber, #69389) [Link]
I could see a 'hackish' feeling, but the bindings were the biggest part of that for me. Moving messages is = and then I have tab completion and (session-local) history in the prompt. Tagging files with searches works wonders for doing batch processing of a sort, too.
I don't imagine we'll convince each other, as I find gut feelings hard to explain (and to uproot, besides), and I've been using mutt happily for over a year since moving from KMail.
Ha-Ha-Only-Serious
Posted Nov 24, 2011 20:40 UTC (Thu) by jospoortvliet (subscriber, #33164) [Link]
Until they came out with the Akonadi-based KMail2 - now it's as slow as Evolution, if not slower, and it eats lots of RAM. Now I have to wait until they fix it and muddle through in the meantime; or give up and go back to webmail :(
Ha-Ha-Only-Serious
Posted Dec 3, 2011 23:25 UTC (Sat) by Los__D (guest, #15263) [Link]
Now I have to wait until they fix it and muddle through in the mean time; or give up and go back to webmail :(
In that case, I'd give RoundCube a try. It has worked perfectly for me by keeping much of the desktop mail client feel. Certainly, that feel sometimes breaks down, but in general I'm very pleased, especially compared to my old webmail client, SquirrelMail.
SquirrelMail is way more configurable though, if you need that (although RoundCube might be more configurable than I think; currently I'm just running the default Synology DSM version).
Evolution
Posted Nov 17, 2011 21:36 UTC (Thu) by shane (subscriber, #3335) [Link]
* Ability to specify only a part of the message as "preformatted text", so it doesn't get word-wrapped. Useful for pasting console output into support messages, for example.
* Nice "bullet list"/"number list" support.
* Pretty good IMAP support.
* Multiple-language spell-checking in a single mail.
It's bloated and slow, but newer versions actually do fix lots of bugs.
Evolution
Posted Nov 17, 2011 23:44 UTC (Thu) by mathstuf (subscriber, #69389) [Link]
> Nice "bullet list"/"number list" support.
I tend to just do ASCII lists in vim myself.
> Pretty good IMAP support.
I use offlineimap here.
> Multiple-language spell-checking in a single mail.
I'm not bilingual (or anything more for that matter), so I've not had to solve this one.
Evolution
Posted Nov 18, 2011 1:59 UTC (Fri) by mgedmin (subscriber, #34497) [Link]
Stress-testers
Posted Nov 18, 2011 12:27 UTC (Fri) by CChittleborough (subscriber, #60775) [Link]
You're right, I don't use Evolution. I use (and highly recommend) fastmail.fm, via their excellent webmail interface. So my mail program is just as memory-hungry as Firefox, because it is Firefox.
Slightly more on topic: Mozilla has been working on reducing memory usage in recent versions of Firefox, with good success AFAICT. So we may need to find another stress-tester for the memory subsystem ... :-)
Stress-testers
Posted Nov 24, 2011 20:42 UTC (Thu) by jospoortvliet (subscriber, #33164) [Link]
Huge pages, slow drives, and long delays
Posted Nov 17, 2011 19:19 UTC (Thu) by ejr (subscriber, #51652) [Link]
But it's a server / HPC system. No browsers running, and no USB sticks hanging off. I wouldn't mind having to set a sysctl at boot... Also, our heap space often is allocated once. Dynamic compaction is less interesting, but doesn't seem to be hurting performance. I should measure that sometime.
Optimize for desktop
Posted Nov 17, 2011 20:32 UTC (Thu) by Velmont (guest, #46433) [Link]
sysctl-knob, yes, yes, yes.
I hate that this is getting pushback. I hope it's the "yes, we need this, but in another way" kind and not the "no, we don't want this unless it doesn't hurt us at all" kind.
Really, really. The HPC crowd, Google, et al. have the resources to configure their systems. Let them.
Optimize for desktop
Posted Nov 17, 2011 21:01 UTC (Thu) by Ben_P (guest, #74247) [Link]
USB sticks
Posted Nov 17, 2011 23:11 UTC (Thu) by dougg (subscriber, #1894) [Link]
And I suspect the performance degradation is non-linear: hit it with too many writes and you will be punished accordingly (and then some). Any efficiencies upstream (e.g. huge pages) may just make things worse.
Huge pages, slow drives, and long delays
Posted Nov 18, 2011 13:39 UTC (Fri) by foom (subscriber, #14868) [Link]
We tried experimentally enabling the THP feature in the latest kernel but it made things *much* slower, which seemed really mysterious. Now, knowing that the kernel is unnecessarily blocking processes on writing some pages to disk explains the whole issue. (Note: no USB sticks here; normal hard drives are already slow enough for me...).
So yeah, that argument about how it's desirable for long-running server workloads? Not so much.
Huge pages, slow drives, and long delays
Posted Nov 19, 2011 20:56 UTC (Sat) by Ben_P (guest, #74247) [Link]
Why not the obvious solution?
Posted Nov 24, 2011 13:18 UTC (Thu) by slashdot (guest, #22014) [Link]
In addition, the number of pages actually submitted to the hardware simultaneously needs to be limited, or they must be copied to a compact bounce buffer before being submitted to the hardware, unless they are already contiguous.
Also, with a suitable IOMMU, it should be possible to move pages even while they are actively read by the hardware.
The fix proposed instead seems to make no sense, as it doesn't fix the issue and introduces unpredictable performance degradation.
Huge pages, slow drives, and long delays
Posted Nov 25, 2011 7:33 UTC (Fri) by set (guest, #4788) [Link]
I have a USB stick with poor write speed, to which I periodically copy ~16GB of linear data. If I just left the machine alone, it would struggle along at a rate of perhaps 1-3MB/s, with occasional pauses where almost no data moved. If, however, I tried to use a browser, not only would the browser hang for up to minutes, but other applications would hang as well, not updating their windows, AND the copy process really went to hell, trickling out a few KB/s and requiring hours to complete. Oddly, the only way I could get things going again after it entered this I/O hell state was to just cat gigabytes of disk files. (This only got the copy going again - it was no help for the hanging applications.)
This seems the definition of unacceptable behaviour on what is an otherwise capable dual-core amd64 box with 4GB of RAM. Eventually, after trying many things to tune swap/page parameters, and even formatting the USB sticks so that their data clusters start on 128KB boundaries, etc. (before that, the I/O was even slower...), I finally just rebuilt the kernel without transparent huge pages. I wish I had heard earlier that they might be at fault, but for me it was just a guess.
So, unless some fix is implemented, it would seem to me that transparent huge pages is simply an option that should at least be documented as being potentially completely destructive to desktop usage.
Huge pages, slow drives, and long delays
Posted Nov 25, 2011 19:45 UTC (Fri) by jospoortvliet (subscriber, #33164) [Link]
Huge pages, slow drives, and long delays
Posted Nov 29, 2011 9:21 UTC (Tue) by Thomas (subscriber, #39963) [Link]
Thanks to Jonathan for the very enlightening article!
Huge pages, slow drives, and long delays
Posted Nov 26, 2011 12:43 UTC (Sat) by mcfrisk (guest, #40131) [Link]
Lately I tried to get rid of the problem with ext4 options like noatime, data=writeback, etc., which did speed the system up a bit but did not remove the lockups completely. Hope this really gets resolved.
Huge pages, slow drives, and long delays
Posted Dec 13, 2011 22:11 UTC (Tue) by khc (guest, #45209) [Link]
Huge pages, slow drives, and long delays
Posted Dec 24, 2011 0:20 UTC (Sat) by tristangrimaux (guest, #26831) [Link]
The elegant solution would be to make sure synchronous operations are quick and do not wait for ANY device to signal; they should only reserve pages and set flags, then return without further ado.
Huge pages, slow drives, and long delays
Posted Jan 21, 2012 5:22 UTC (Sat) by swmike (subscriber, #57335) [Link]
My workaround is "watch -n 5 sync" when I copy files. Makes the problem go away completely.