Huge pages, slow drives, and long delays
Numerous people have reported this sort of behavior in recent times; your editor has seen it as well. But it is hard to reproduce, which means it has been hard to track down. It is also entirely possible that there is more than one bug causing this kind of behavior. In any case, there should now be one less bug of this type if Mel Gorman's patch proves to be effective. But a few developers are wondering if, in some cases, the cure is worse than the disease.
The problem Mel found appears to go somewhat like this. A process (that web browser, say) is doing its job when it incurs a page fault. This is normal; the whole point of contemporary browsers sometimes seems to be to stress-test virtual memory management systems. The kernel will respond by grabbing a free page to slot into the process's address space. But, if the transparent huge pages feature is built into the kernel (and most distributors do enable this feature), the page fault handler will attempt to allocate a huge page instead. With luck, there will be a huge page just waiting for this occasion, but that is not always the case; in particular, if there is a process dirtying a lot of memory, there may be no huge pages available. That is when things start to go wrong.
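The effect can be observed from user space: the sketch below maps an anonymous region, requests huge pages with madvise(MADV_HUGEPAGE), faults the memory in, then reads the AnonHugePages counter from /proc/self/smaps. Whether any huge pages actually show up depends on the kernel's THP configuration and on how fragmented memory happens to be - which is exactly the problem at hand.

    /* Sketch: observe fault-time THP from user space on a THP-enabled kernel.
     * Build with: cc -std=gnu99 -o thp-test thp-test.c */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    #define SZ (16UL * 1024 * 1024)          /* room for several 2MB huge pages */

    int main(void)
    {
        char *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        madvise(p, SZ, MADV_HUGEPAGE);       /* ask for THP even in "madvise" mode */

        for (unsigned long i = 0; i < SZ; i += 4096)
            p[i] = 1;                        /* fault every base page in */

        /* AnonHugePages shows how much of the mapping ended up on huge pages;
         * under fragmentation or memory pressure it can legitimately be zero. */
        FILE *f = fopen("/proc/self/smaps", "r");
        char line[256];
        long kb = 0;
        while (f && fgets(line, sizeof(line), f))
            if (!strncmp(line, "AnonHugePages:", 14))
                kb += atol(line + 14);
        if (f)
            fclose(f);
        printf("AnonHugePages total: %ld kB\n", kb);
        return 0;
    }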
Once upon a time, one just had to assume that, once the system had been running for a while, large chunks of physically-contiguous memory would simply not exist. Virtual memory management tends to fragment such chunks quickly. So it is a bad idea to assume that huge pages will just be sitting there waiting for a good home; the kernel has to take explicit action to cause those pages to exist. That action is compaction: moving pages around to defragment the free space and bring free huge pages into existence. Without compaction, features like transparent huge pages would simply not work in any useful way.
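The core idea is simple enough to model: a "migrate scanner" walks up from the bottom of a memory zone looking for movable pages, a "free scanner" walks down from the top looking for free pages, and movable pages are shifted upward until a contiguous free region forms at the bottom. The toy program below is a deliberate oversimplification - real compaction has to cope with unmovable pages, per-pageblock state, locking, and much more - but it shows the effect.

    /* Toy model of memory compaction (not kernel code): a migrate scanner walks
     * up from the low end looking for used, movable slots; a free scanner walks
     * down from the high end looking for free slots; used slots are moved upward
     * until the low end becomes one contiguous free region. */
    #include <stdio.h>

    #define NSLOTS 16
    enum slot { FREE, USED };

    static void compact(enum slot mem[NSLOTS])
    {
        int migrate = 0, freep = NSLOTS - 1;

        while (migrate < freep) {
            if (mem[migrate] != USED) { migrate++; continue; }
            if (mem[freep] != FREE)   { freep--;   continue; }
            /* "Migrate" the page: copy its contents and repoint all mappings
             * (both elided here); the old slot becomes free. */
            mem[freep--] = USED;
            mem[migrate++] = FREE;
        }
    }

    static void show(const char *tag, const enum slot mem[NSLOTS])
    {
        printf("%-7s", tag);
        for (int i = 0; i < NSLOTS; i++)
            putchar(mem[i] == USED ? '#' : '.');
        putchar('\n');
    }

    int main(void)
    {
        enum slot mem[NSLOTS] = { USED, FREE, USED, FREE, FREE, USED, FREE, USED,
                                  FREE, USED, FREE, FREE, USED, FREE, USED, FREE };

        show("before", mem);
        compact(mem);
        show("after", mem);    /* the free slots now form one contiguous run */
        return 0;
    }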
A lot of the compaction work is done in the background. But current kernels will also perform "synchronous compaction" when an attempt to allocate a huge page would fail due to lack of availability. The process attempting to perform that allocation gets put to work migrating pages in an attempt to create the huge page it is asking for. This operation is not free in the best of times, but it should not be causing multi-second (or multi-minute) stalls. That is where the USB stick comes in.
If a lot of data is being written to a slow storage device, memory will quickly be filled with dirty pages waiting to be written out. That, in itself, can be a problem, which is why the recently-merged I/O-less dirty throttling code tries hard to keep pages for any single device from taking too much memory. But writeback to a slow device plays poorly with compaction; the memory management code cannot migrate a page that is being written back until the I/O operation completes. When synchronous compaction encounters such a page, it will go to sleep waiting for the I/O on that page to complete. If the page is headed to a slow device, and it is far back on a queue of many such pages, that sleep can go on for a long time.
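Schematically, the difference between asynchronous and synchronous compaction when such a page is encountered looks like the toy model below; it is a simplification, but it captures why the waits add up when the device is slow and the queue of dirty pages ahead of it is deep.

    /* Schematic model (not kernel code) of the stall: in synchronous mode, a
     * page under writeback cannot be migrated until its I/O finishes, so the
     * compacting task waits; in asynchronous mode it simply skips the page. */
    #include <stdbool.h>
    #include <stdio.h>

    struct page {
        bool under_writeback;
        long io_remaining_ms;    /* time until the device completes the write */
    };

    /* Returns milliseconds spent blocked while trying to migrate this page. */
    static long try_migrate(const struct page *p, bool sync)
    {
        if (!p->under_writeback)
            return 0;                   /* migrated immediately (cost elided) */
        if (!sync)
            return 0;                   /* async: give up on this page, move on */
        return p->io_remaining_ms;      /* sync: wait for the device, however slow */
    }

    int main(void)
    {
        /* Dirty pages headed for a slow USB stick; each sits behind the
         * previous writes in the queue, so the waits grow. */
        struct page pages[4] = {
            { true, 2000 }, { true, 4000 }, { false, 0 }, { true, 6000 },
        };
        long sync_stall = 0, async_stall = 0;

        for (int i = 0; i < 4; i++) {
            sync_stall  += try_migrate(&pages[i], true);
            async_stall += try_migrate(&pages[i], false);
        }
        printf("synchronous compaction stalled %ld ms, asynchronous %ld ms\n",
               sync_stall, async_stall);
        return 0;
    }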
One should not forget that producing a single huge page can involve migrating hundreds of ordinary pages - 512 4KB pages for each 2MB huge page on x86-64. So once that long sleep completes, the job is far from done; the process stuck performing compaction may find itself at the back of the writeback queue quite a few times before it can finally get its page fault resolved. Only then will it be able to resume executing the code that the user actually wanted run - until the next page fault happens and the whole mess starts over again.
Mel's fix is a simple one-liner: if a process is attempting to allocate a transparent huge page, synchronous compaction should not be performed. In such a situation, Mel figured, it is far better to just give the process an ordinary page and let it continue running. The interesting thing is that not everybody seems to agree with him.
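The patch itself is not reproduced here, so the code below is only a paraphrase of the idea: THP fault-time allocations were, in kernels of this era, marked with the __GFP_NO_KSWAPD flag, so "skip synchronous compaction for THP faults" comes down to keying the decision on that flag rather than always going synchronous.

    /* A paraphrase of the reported one-liner, not the literal patch: should the
     * allocator's slow path use synchronous compaction for this request?
     * Assumption: THP fault-time allocations carry a recognizable GFP flag
     * (__GFP_NO_KSWAPD played that role in kernels of this era). */
    #include <stdbool.h>
    #include <stdio.h>

    #define GFP_NO_KSWAPD (1u << 0)    /* stand-in for the kernel's __GFP_NO_KSWAPD */

    static bool sync_compaction_before(unsigned int gfp_mask)
    {
        (void)gfp_mask;
        return true;                   /* everyone waits, THP faults included */
    }

    static bool sync_compaction_after(unsigned int gfp_mask)
    {
        return !(gfp_mask & GFP_NO_KSWAPD);    /* THP faults fall back instead */
    }

    int main(void)
    {
        unsigned int thp_fault = GFP_NO_KSWAPD;
        unsigned int other_high_order = 0;

        printf("THP fault:        before=%d after=%d\n",
               sync_compaction_before(thp_fault), sync_compaction_after(thp_fault));
        printf("other high-order: before=%d after=%d\n",
               sync_compaction_before(other_high_order),
               sync_compaction_after(other_high_order));
        return 0;
    }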
Andrew Morton was the first to object, saying "Presumably some people would prefer to get lots of huge pages for their 1000-hour compute job, and waiting a bit to get those pages is acceptable." David Rientjes, presumably thinking of Google's throughput-oriented tasks, said that there are times when the latency is entirely acceptable, but that some tasks really want to get huge pages at fault time. Mel's change makes it that much less likely that processes will be allocated huge pages in response to faults; David does not appear to see that as a good thing.
One could (and Mel did) respond that the transparent huge page mechanism does not only work at fault time. The kernel will also try to replace small pages with huge pages in the background while the process is running; that mechanism should bring more huge pages into use - for longer-running processes, at least - even if they are not available at fault time. In cases where that is not enough, there has been talk of adding a new knob to allow the system administrator to request that synchronous compaction be used. The actual semantics of such a knob are not clear; one could argue that if huge page allocations are that much more important than latency, the system should perform more aggressive page reclaim as well.
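It is worth noting that THP already exposes a related control, /sys/kernel/mm/transparent_hugepage/defrag, which accepts "always", "madvise", or "never" on kernels of this era and determines whether fault-time allocations are allowed to work for their huge pages at all. Changing it is just a write to that sysfs file, as in the sketch below (run as root).

    /* Sketch: switch the existing THP defrag policy. Requires root and a kernel
     * with transparent huge pages; accepted values vary by kernel version. */
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        const char *mode = argc > 1 ? argv[1] : "madvise";
        FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/defrag", "w");

        if (!f) {
            perror("defrag");
            return 1;
        }
        fprintf(f, "%s\n", mode);
        fclose(f);
        return 0;
    }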
Andrea Arcangeli commented that he does not like how Mel's change causes failures to use huge pages at fault time; he would rather find a way to keep synchronous compaction from stalling instead. Some ideas for doing that are being thrown around, but no solution has been found as of this writing.
Such details can certainly be worked out over time. Meanwhile, if Mel's patch turns out to be the best fix, the decision on merging should be clear enough. Given a choice between (1) a system that continues to be responsive during heavy I/O to slow devices and (2) random, lengthy lockups in such situations, one might reasonably guess that most users would choose the first alternative. Barring complications, one would expect this patch to go into the mainline fairly soon, and possibly into the stable tree shortly thereafter.
Index entries for this article:
Kernel: Huge pages
Kernel: Memory management/Huge pages
Posted Nov 17, 2011 2:12 UTC (Thu)
by nybble41 (subscriber, #55106)
[Link]
In my opinion, in-core operations should never be forced to wait on disk I/O unless it's necessary to prevent the entire operation from failing. On the other hand, there is definite value in allocating a hugepage up front, so it might make sense to put some effort toward locating a candidate hugepage which *can* be migrated rather than immediately falling back to individual pages.
If it's possible to try another hugepage, or fall back to individual pages, these options should come first.
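One hypothetical form of that idea - purely illustrative, not something the kernel does - would be to reject any candidate block containing a page under writeback and keep scanning, falling back to ordinary pages only when no clean candidate remains:

    /* Hypothetical illustration of the suggestion above - nothing like this is
     * claimed to exist in the kernel: before compacting a candidate huge-page-
     * sized block, reject it if any of its pages are under writeback, and only
     * fall back to ordinary pages when no candidate qualifies. */
    #include <stdbool.h>
    #include <stdio.h>

    #define PAGES_PER_HUGEPAGE 512     /* 2MB / 4KB on x86-64 */

    struct page { bool under_writeback; };

    static bool block_is_migratable(const struct page *blk)
    {
        for (int i = 0; i < PAGES_PER_HUGEPAGE; i++)
            if (blk[i].under_writeback)
                return false;          /* would stall on I/O: skip this candidate */
        return true;
    }

    /* Returns the index of a compactable candidate block, or -1 for
     * "give up and use ordinary 4KB pages". */
    static int pick_candidate(const struct page *mem, int nblocks)
    {
        for (int b = 0; b < nblocks; b++)
            if (block_is_migratable(mem + b * PAGES_PER_HUGEPAGE))
                return b;
        return -1;
    }

    int main(void)
    {
        static struct page mem[4 * PAGES_PER_HUGEPAGE];   /* all clean initially */
        mem[3].under_writeback = true;                    /* block 0 has a page in flight */

        int b = pick_candidate(mem, 4);
        if (b < 0)
            printf("no clean candidate: fall back to ordinary pages\n");
        else
            printf("compact candidate block %d\n", b);
        return 0;
    }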
Posted Nov 17, 2011 2:21 UTC (Thu)
by naptastic (guest, #60139)
[Link] (4 responses)
Why not tie this behavior to the kernel preemption setting? If it's set to anything higher than voluntary, then Mel's change (and perhaps some others?) should be in place; if it's set to no preemption, then go ahead and stall processes while you're making room for some hugepages.
Posted Nov 17, 2011 8:10 UTC (Thu)
by khim (subscriber, #9252)
[Link] (3 responses)
Actually it does not always make sense on servers either. If you have some batch-processing operation (the slocate indexer on a desktop, map-reduce on a server) then it's OK to wait for the compaction - even if it'll take a few minutes. But if you need a response right away (most desktop operations, a live request in the server's case) then latency is paramount.
Posted Nov 17, 2011 13:21 UTC (Thu)
by mennucc1 (guest, #14730)
[Link] (2 responses)
Posted Nov 17, 2011 21:26 UTC (Thu)
by lordsutch (guest, #53)
[Link] (1 responses)
Posted Nov 18, 2011 2:16 UTC (Fri)
by naptastic (guest, #60139)
[Link]
I see two questions. First: can we infer from a process's niceness or scheduler class whether it would prefer waiting for a hugepage or taking what's available now? Second: are memory compaction passes preemptible? Is this the behavior you're looking for?
1. A low-priority, sched_idle process (1) tries to allocate memory. The kernel starts compacting memory to provide it with a hugepage.
2. A higher-priority, sched_fifo process (2) becomes runnable and tries to allocate. Because it's higher priority, the kernel puts the request for (1) on the back burner. Because (2) is sched_fifo, the kernel doesn't wait for compaction but just gives it what's available now.
3. With that request satisfied, the kernel goes back to compacting in order to satisfy (1)'s needs.
As someone who only uses -rt kernels, this is the behavior I think I would want. The network can get hugepages, and it can wait for them; but jackd and friends better get absolute preferential treatment for memory Right Now.
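A hypothetical policy along these lines - again, purely illustrative - might key off the task's scheduling class, for example:

    /* Hypothetical policy - not something the kernel implements: let a task's
     * scheduling class decide whether a page fault may stall in synchronous
     * compaction while hunting for a huge page. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdbool.h>
    #include <stdio.h>

    static bool may_stall_for_hugepage(int policy)
    {
        switch (policy) {
        case SCHED_IDLE:               /* background/batch work: waiting is fine */
        case SCHED_BATCH:
            return true;
        case SCHED_FIFO:               /* latency-critical: take 4KB pages now */
        case SCHED_RR:
            return false;
        default:                       /* SCHED_OTHER: follow the global default */
            return false;
        }
    }

    int main(void)
    {
        int policy = sched_getscheduler(0);    /* this process's scheduling class */

        printf("policy %d: %s\n", policy,
               may_stall_for_hugepage(policy) ? "wait for compaction"
                                              : "fall back to small pages");
        return 0;
    }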
Posted Nov 17, 2011 3:09 UTC (Thu)
by smoogen (subscriber, #97)
[Link]
Posted Nov 17, 2011 7:42 UTC (Thu)
by iq-0 (subscriber, #36655)
[Link]
Posted Nov 17, 2011 10:15 UTC (Thu)
by epa (subscriber, #39769)
[Link] (1 responses)
Or at least, allow at most a fixed, small number of dirty pages for removable and slow devices (say, only as many as can be written back within one second).
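The kernel does expose a per-device knob in roughly this spirit: every backing device has a max_ratio file under /sys/class/bdi/ that limits how much of the global dirty-page allowance it may consume. A sketch of clamping a slow device down to 1%:

    /* Sketch: cap how much of the global dirty-page allowance one slow device
     * may consume, via its per-device writeback knob. The "8:16" name is only
     * an example; use the real device's major:minor numbers from /sys/class/bdi. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/sys/class/bdi/8:16/max_ratio", "w");

        if (!f) {
            perror("max_ratio");
            return 1;
        }
        fprintf(f, "1\n");    /* at most 1% of the dirty limit for this device */
        fclose(f);
        return 0;
    }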
Posted Dec 15, 2011 13:43 UTC (Thu)
by hpro (subscriber, #74751)
[Link]
On a related note, have you ever tried to do some useful work on files on a stick mounted 'sync'? It is quite painful, I assure you.
Posted Nov 17, 2011 10:31 UTC (Thu)
by mjthayer (guest, #39183)
[Link]
Posted Nov 17, 2011 11:09 UTC (Thu)
by CChittleborough (subscriber, #60775)
[Link] (13 responses)
> [T]he whole point of contemporary browsers sometimes seems to be to stress-test virtual memory management systems.
Can I submit this as a quote of the week?
This really is a classic "ha-ha-only-serious" quip. For me (and, I guess, most people) the browser is the only thing that uses large amounts of memory, so any memory-related misconfiguration shows up as browser slowdowns, freezes and crashes. (With Firefox 3 and Firefox 4, the OOM killer used to startle me once or twice a month.) Is there a good HOWTO on this topic?
Posted Nov 17, 2011 20:11 UTC (Thu)
by nevets (subscriber, #11875)
[Link] (12 responses)
> the browser is the only thing that uses large amounts of memory
You obviously don't use Evolution.
Posted Nov 17, 2011 20:51 UTC (Thu)
by chuckles (guest, #41964)
[Link] (9 responses)
> You obviously don't use Evolution.
Does anyone? I didn't realize it was still being worked on.
Posted Nov 17, 2011 21:10 UTC (Thu)
by nevets (subscriber, #11875)
[Link] (5 responses)
I still use it, and I sometimes wish they would stop working on it, as they keep making it harder to use after every update. I guess they have the gnome mind set too.
I never cared much for mutt. I do like alpine, but too many people send me html crap that I need to read, and tbird always screws up patches I try to send.
Evolution seems to work the best with imap and it's trivial to send sane patches.
Posted Nov 17, 2011 23:33 UTC (Thu)
by mathstuf (subscriber, #69389)
[Link] (4 responses)
I didn't at first either, mainly because the bindings were crazy and inconsistent. I've gotten it to be pretty well vim-ish, but it's not a 1:1 translation. It is certainly more consistent though and that makes it much better for me.
> I do like alpine
I couldn't get the hang of alpine. Too much not-mail clutter around.
> but too many people send me html crap that I need to read
For HTML, I have bindings to toggle plain text or w3m -dump viewing for HTML mails (Ctrl-U and Ctrl-I, respectively). It works fairly well; not perfect, but better than any web interface.
Posted Nov 18, 2011 0:40 UTC (Fri)
by nevets (subscriber, #11875)
[Link] (3 responses)
I like to have a preview screen. I move mail all the time by dragging a message over to a folder with the mouse. I do wish Evolution had better keyboard shortcuts, as I probably could move messages faster with typing. I did with alpine. But mutt still seems hackish to me, and I never got past it. I've been using it for LKML for a few years now, and I still don't care much for it.
Evolution is big and slow, and I need to kill it as often as I do my browsers, but other than that, it works well for me. If it had better key bindings, I would dare call it an awesome application.
Posted Nov 18, 2011 7:53 UTC (Fri)
by mathstuf (subscriber, #69389)
[Link]
I could see a 'hackish' feeling, but the bindings were the biggest part of that for me. Moving messages is = and then I have tab completion and (session-local) history in the prompt. Tagging files with searches works wonders too to do batch processing of a sort.
I don't imagine we'll convince each other, as I find gut feelings are hard to explain (and to uproot, besides), and I've been using mutt happily for over a year since moving from KMail.
Posted Nov 24, 2011 20:40 UTC (Thu)
by jospoortvliet (guest, #33164)
[Link] (1 responses)
Until they came out with an Akonadi-based KMail2 - now it's as slow as Evolution, if not slower, and it eats lots of RAM. Now I have to wait until they fix it and muddle through in the meantime; or give up and go back to webmail :(
Posted Dec 3, 2011 23:25 UTC (Sat)
by Los__D (guest, #15263)
[Link]
> Now I have to wait until they fix it and muddle through in the meantime; or give up and go back to webmail :(
In that case, I'd give RoundCube a try. It has worked perfectly for me by keeping much of the desktop mail client feel. Certainly, that feel sometimes breaks down, but in general I'm very pleased, especially compared to my old webmail client, SquirrelMail. SquirrelMail is way more configurable though, if you need that (although RoundCube might be more configurable than I think; currently I'm just running the default Synology DSM version).
Posted Nov 17, 2011 21:36 UTC (Thu)
by shane (subscriber, #3335)
[Link] (2 responses)
* Nice "bullet list"/"number list" support.
* Pretty good IMAP support.
* Multiple-language spell-checking in a single mail.
* Ability to specify only a part of the message as "preformatted text", so they don't get word-wrapped. Useful for pasting console output in support messages, for example.
It's bloated and slow, but newer versions actually do fix lots of bugs.
Posted Nov 17, 2011 23:44 UTC (Thu)
by mathstuf (subscriber, #69389)
[Link] (1 responses)
> Ability to specify only a part of the message as "preformatted text", so they don't get word-wrapped.
Do you mean when writing mails or viewing them? Mutt takes whatever $EDITOR gives back, verbatim. For viewing, I have bindings to toggle between 78 width and $COLUMNS-2 (300+ here which is a little much for a default wrap boundary).
> Nice "bullet list"/"number list" support.
I tend to just do ASCII lists in vim myself.
> Pretty good IMAP support.
I use offlineimap here.
> Multiple-language spell-checking in a single mail.
I'm not bilingual (or anything more for that matter), so I've not had to solve this one.
Posted Nov 18, 2011 1:59 UTC (Fri)
by mgedmin (subscriber, #34497)
[Link]
Posted Nov 18, 2011 12:27 UTC (Fri)
by CChittleborough (subscriber, #60775)
[Link] (1 responses)
You're right, I don't use Evolution. I use (and highly recommend) fastmail.fm, via their excellent webmail interface. So my mail program is just as memory-hungry as Firefox, because it is Firefox.
Slightly more on topic: Mozilla has been working on reducing memory usage in recent versions of Firefox, with good success AFAICT. So we may need to find another stress-tester for the memory subsystem ... :-)
Posted Nov 24, 2011 20:42 UTC (Thu)
by jospoortvliet (guest, #33164)
[Link]
Posted Nov 17, 2011 19:19 UTC (Thu)
by ejr (subscriber, #51652)
[Link]
But it's a server / HPC system. No browsers running, and no USB sticks hanging off. I wouldn't mind having to set a sysctl at boot... Also, our heap space often is allocated once. Dynamic compaction is less interesting, but doesn't seem to be hurting performance. I should measure that sometime.
Posted Nov 17, 2011 20:32 UTC (Thu)
by Velmont (guest, #46433)
[Link] (1 responses)
sysctl-knob, yes, yes, yes.
I hate that this is getting pushback. I hope it's the "yes, we need this, but in another way" kind and not the "no, we don't want this unless it doesn't hurt us at all" kind.
Really, really. The HPC-crowd, Google et al; they have the resources to configure their systems. Let them.
Posted Nov 17, 2011 21:01 UTC (Thu)
by Ben_P (guest, #74247)
[Link]
Posted Nov 17, 2011 23:11 UTC (Thu)
by dougg (guest, #1894)
[Link]
And I suspect the performance degradation is non-linear: hit it with too many writes and you will be punished accordingly (and then some). Any efficiencies upstream (e.g. huge pages) may just make things worse.
Posted Nov 18, 2011 13:39 UTC (Fri)
by foom (subscriber, #14868)
[Link] (1 responses)
We tried experimentally enabling the THP feature in the latest kernel but it made things *much* slower, which seemed really mysterious. Now, knowing that the kernel is unnecessarily blocking processes on writing some pages to disk explains the whole issue. (Note: no USB sticks here; normal hard drives are already slow enough for me...).
So yeah, that argument about how it's desirable for long-running server workloads? Not so much.
Posted Nov 19, 2011 20:56 UTC (Sat)
by Ben_P (guest, #74247)
[Link]
Posted Nov 24, 2011 13:18 UTC (Thu)
by slashdot (guest, #22014)
[Link]
In addition, the number of pages actually submitted to the hardware simultaneously needs to be limited, or they must be copied to a compact bounce buffer before being submitted to the hardware, unless they are already contiguous.
Also, with a suitable IOMMU, it should be possible to move pages even while they are actively read by the hardware.
The fix proposed instead seems to make no sense, as it doesn't fix the issue and introduces unpredictable performance degradation.
Posted Nov 25, 2011 7:33 UTC (Fri)
by set (guest, #4788)
[Link] (2 responses)
I have a USB stick with poor write speed, to which I periodically copy ~16GB of linear data. If I just left the machine alone, it would struggle along at perhaps 1-3MB/s, with occasional pauses where almost no data moved. If, however, I tried to use a browser, not only would the browser hang for up to minutes, but other applications would hang too, not updating their windows, AND the copy process really went to hell, trickling out a few KB/s and requiring hours to complete. Oddly, the only way I could get things going again after it entered this I/O hell state was to just cat gigabytes of disk files (this just got the copy going again - no help for the hanging applications).
This seems the definition of unacceptable behaviour on what is an otherwise capable dual-core amd64 box with 4GB of RAM. Eventually, after trying many things to tune swap and page parameters, and even reformatting the USB sticks to align their data clusters on 128K boundaries, etc. (before that, the I/O was even slower...), I finally just rebuilt the kernel without transparent huge pages. I wish I had heard earlier that they might be at fault, but for me that was just a guess.
So, unless some fix is implemented, it would seem to me that the transparent huge pages feature is simply an option that should at least be documented as being potentially completely destructive to desktop usage.
Posted Nov 25, 2011 19:45 UTC (Fri)
by jospoortvliet (guest, #33164)
[Link]
Posted Nov 29, 2011 9:21 UTC (Tue)
by Thomas (subscriber, #39963)
[Link]
Thanks to Jonathan for the very enlightening article!
Posted Nov 26, 2011 12:43 UTC (Sat)
by mcfrisk (guest, #40131)
[Link]
Lately I tried to get rid of the problem with ext4 options like noatime, data=writeback etc., which did speed the system up a bit but did not remove the lockups completely. Hope this really gets resolved.
Posted Dec 13, 2011 22:11 UTC (Tue)
by khc (guest, #45209)
[Link]
Posted Dec 24, 2011 0:20 UTC (Sat)
by tristangrimaux (guest, #26831)
[Link]
The elegant thing would be to make sure synchronous operations are quick and do not wait for ANY device to signal; they should only reserve and set flags, and then return without further ado.
Posted Jan 21, 2012 5:22 UTC (Sat)
by swmike (guest, #57335)
[Link]
My workaround is "watch -n 5 sync" when I copy files. Makes the problem go away completely.