
Huge pages, slow drives, and long delays

By Jonathan Corbet
November 14, 2011
It is a rare event, but it is no fun when it strikes. Plug in a slow storage device - a USB stick or a music player, for example - and run something like rsync to move a lot of data to that device. The operation takes a while, which is unsurprising; more surprising is when random processes begin to stall. In the worst cases, the desktop can lock up for minutes at a time; that, needless to say, is not the kind of interactive response that most users are looking for. The problem can strike in seemingly arbitrary places; the web browser freezes, but a network audio stream continues to play without a hiccup. Everything unblocks eventually, but, by then, the user is on their third beer and contemplating the virtues of proprietary operating systems. One might be forgiven for thinking that the system should work a little better than that.

Numerous people have reported this sort of behavior in recent times; your editor has seen it as well. But it is hard to reproduce, which means it has been hard to track down. It is also entirely possible that there is more than one bug causing this kind of behavior. In any case, there should now be one less bug of this type if Mel Gorman's patch proves to be effective. But a few developers are wondering if, in some cases, the cure is worse than the disease.

The problem Mel found appears to go somewhat like this. A process (that web browser, say) is doing its job when it incurs a page fault. This is normal; the whole point of contemporary browsers sometimes seems to be to stress-test virtual memory management systems. The kernel will respond by grabbing a free page to slot into the process's address space. But, if the transparent huge pages feature is built into the kernel (and most distributors do enable this feature), the page fault handler will attempt to allocate a huge page instead. With luck, there will be a huge page just waiting for this occasion, but that is not always the case; in particular, if there is a process dirtying a lot of memory, there may be no huge pages available. That is when things start to go wrong.
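
The ordering described above can be pictured with a tiny user-space toy; it is an illustration of the fallback logic, not the kernel's fault-path code, and the names in it are invented for the example:

    /* toy_fault.c - illustrates the allocation order described above:
     * try for a huge page first, fall back to an ordinary page.
     * Not kernel code; alloc_huge()/alloc_base() are stand-ins. */
    #include <stdio.h>
    #include <stdbool.h>

    #define HUGE_PAGE_SIZE (2UL << 20)   /* 2MB huge page */
    #define BASE_PAGE_SIZE (4UL << 10)   /* 4KB base page */

    static bool huge_page_available = false;  /* pretend memory is fragmented */

    static unsigned long alloc_huge(void)
    {
        /* In the kernel, failure here is what triggers compaction. */
        return huge_page_available ? HUGE_PAGE_SIZE : 0;
    }

    static unsigned long alloc_base(void)
    {
        return BASE_PAGE_SIZE;   /* single pages are almost always there */
    }

    int main(void)
    {
        /* Handle a fault: prefer a huge page, fall back to a base page. */
        unsigned long got = alloc_huge();
        if (!got)
            got = alloc_base();
        printf("fault satisfied with a %lu-byte page\n", got);
        return 0;
    }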

Once upon a time, one just had to assume that, once the system had been running for a while, large chunks of physically-contiguous memory would simply not exist. Virtual memory management tends to fragment such chunks quickly. So it is a bad idea to assume that huge pages will just be sitting there waiting for a good home; the kernel has to take explicit action to cause those pages to exist. That action is compaction: moving pages around to defragment the free space and bring free huge pages into existence. Without compaction, features like transparent huge pages would simply not work in any useful way.
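
Compaction can be thought of as two scanners working toward each other within a memory zone: one looks for movable in-use pages, the other for free pages, and the in-use pages are migrated into the free slots until a contiguous free run appears. The toy below models that idea on an array; it is a cartoon of the algorithm, not the kernel's mm/compaction.c:

    /* toy_compact.c - cartoon of compaction: migrate used "pages" toward
     * one end of a zone until the free pages form a contiguous run. */
    #include <stdio.h>

    #define ZONE_PAGES 16

    int main(void)
    {
        /* 1 = in use (movable), 0 = free; deliberately fragmented. */
        int zone[ZONE_PAGES] = {1,0,1,1,0,0,1,0,1,1,0,1,0,0,1,0};

        int migrate = 0;                /* scans upward for used pages   */
        int free_slot = ZONE_PAGES - 1; /* scans downward for free pages */

        while (migrate < free_slot) {
            if (zone[migrate] == 0)   { migrate++;   continue; }
            if (zone[free_slot] == 1) { free_slot--; continue; }
            /* "Migrate": copy the page's contents up, free the old slot. */
            zone[free_slot--] = 1;
            zone[migrate++] = 0;
        }

        for (int i = 0; i < ZONE_PAGES; i++)
            putchar(zone[i] ? 'U' : '.');
        putchar('\n');   /* the free pages are now contiguous at the low end */
        return 0;
    }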

A lot of the compaction work is done in the background. But current kernels will also perform "synchronous compaction" when an attempt to allocate a huge page would fail due to lack of availability. The process attempting to perform that allocation gets put to work migrating pages in an attempt to create the huge page it is asking for. This operation is not free in the best of times, but it should not be causing multi-second (or multi-minute) stalls. That is where the USB stick comes in.

If a lot of data is being written to a slow storage device, memory will quickly be filled with dirty pages waiting to be written out. That, in itself, can be a problem, which is why the recently-merged I/O-less dirty throttling code tries hard to keep pages for any single device from taking too much memory. But writeback to a slow device plays poorly with compaction; the memory management code cannot migrate a page that is being written back until the I/O operation completes. When synchronous compaction encounters such a page, it will go to sleep waiting for the I/O on that page to complete. If the page is headed to a slow device, and it is far back on a queue of many such pages, that sleep can go on for a long time.
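
The crucial difference between the two compaction modes is what happens when the migration scanner hits a page that is under writeback. The model below captures the behavior described above - synchronous migration sleeps until the I/O completes, asynchronous migration skips the page - with a sleep() standing in for a slow USB stick; it is not the kernel's migrate_pages() code:

    /* toy_migrate.c - why synchronous compaction stalls: it waits for
     * writeback to finish, while asynchronous compaction skips the page. */
    #include <stdio.h>
    #include <stdbool.h>
    #include <unistd.h>

    struct page {
        bool under_writeback;
        unsigned secs_until_io_done;   /* how far back in the device queue */
    };

    /* Returns true if the page could be migrated. */
    static bool migrate_one(struct page *p, bool sync)
    {
        if (p->under_writeback) {
            if (!sync)
                return false;             /* async: give up, look elsewhere */
            sleep(p->secs_until_io_done); /* sync: the faulting process sleeps */
        }
        /* ...copy contents to the new location, update mappings... */
        return true;
    }

    int main(void)
    {
        struct page dirty = { .under_writeback = true, .secs_until_io_done = 5 };

        printf("async: %s\n", migrate_one(&dirty, false) ? "migrated" : "skipped");
        printf("sync: waiting for the USB stick...\n");
        migrate_one(&dirty, true);     /* this is where the desktop "hangs" */
        printf("sync: migrated after the I/O completed\n");
        return 0;
    }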

One should not forget that producing a single huge page can involve migrating hundreds of ordinary pages. So once that long sleep completes, the job is far from done; the process stuck performing compaction may find itself at the back of the writeback queue quite a few times before it can finally get its page fault resolved. Only then will it be able to resume executing the code that the user actually wanted run - until the next page fault happens and the whole mess starts over again.

Mel's fix is a simple one-liner: if a process is attempting to allocate a transparent huge page, synchronous compaction should not be performed. In such a situation, Mel figured, it is far better to just give the process an ordinary page and let it continue running. The interesting thing is that not everybody seems to agree with him.
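
In terms of the toy models above, the change amounts to clearing the "synchronous" flag when the allocation is a transparent-huge-page fault; the sketch below is a reconstruction of that behavior, not the actual patch (which presumably identifies THP allocations by their GFP flags inside the page allocator's slow path):

    /* toy_fix.c - the shape of the fix: THP faults use only asynchronous
     * compaction and fall back to a base page rather than stalling.
     * Purely illustrative; not kernel code. */
    #include <stdio.h>
    #include <stdbool.h>

    static bool compact(bool sync)
    {
        /* Pretend every candidate page is stuck in writeback to a slow
         * device: synchronous compaction would block here for a long time,
         * asynchronous compaction simply reports failure. */
        if (sync) {
            printf("  (process sleeps until the slow device catches up)\n");
            return true;
        }
        return false;
    }

    static const char *allocate(bool thp_fault)
    {
        /* Before the patch, sync was true even for THP faults; after it,
         * THP faults never wait for synchronous compaction. */
        bool sync = !thp_fault;

        if (compact(sync))
            return "huge page";
        return "base page";     /* fall back and keep the process running */
    }

    int main(void)
    {
        printf("THP page fault gets a %s\n", allocate(true));
        printf("other high-order allocation gets a %s\n", allocate(false));
        return 0;
    }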

Andrew Morton was the first to object, saying "Presumably some people would prefer to get lots of huge pages for their 1000-hour compute job, and waiting a bit to get those pages is acceptable." David Rientjes, presumably thinking of Google's throughput-oriented tasks, said that there are times when the latency is entirely acceptable, but that some tasks really want to get huge pages at fault time. Mel's change makes it that much less likely that processes will be allocated huge pages in response to faults; David does not appear to see that as a good thing.

One could (and Mel did) respond that the transparent huge page mechanism does not only work at fault time. The kernel will also try to replace small pages with huge pages in the background while the process is running; that mechanism should bring more huge pages into use - for longer-running processes, at least - even if they are not available at fault time. In cases where that is not enough, there has been talk of adding a new knob to allow the system administrator to request that synchronous compaction be used. The actual semantics of such a knob are not clear; one could argue that if huge page allocations are that much more important than latency, the system should perform more aggressive page reclaim as well.
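
It is worth noting that some knobs already exist: transparent huge page behavior is controlled through sysfs, and the "defrag" file there (mentioned in a comment below) governs how hard THP allocations will try to defragment memory. A quick way to inspect the current settings is something like the following; the paths are those used by THP-enabled kernels, though the set of accepted values varies between versions:

    /* show_thp.c - print the current transparent huge page settings. */
    #include <stdio.h>

    static void show(const char *path)
    {
        char buf[128];
        FILE *f = fopen(path, "r");

        if (!f) {
            printf("%s: not available\n", path);
            return;
        }
        if (fgets(buf, sizeof(buf), f))
            printf("%s: %s", path, buf);
        fclose(f);
    }

    int main(void)
    {
        show("/sys/kernel/mm/transparent_hugepage/enabled");
        show("/sys/kernel/mm/transparent_hugepage/defrag");
        return 0;
    }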

Andrea Arcangeli commented that he does not like how Mel's change causes failures to use huge pages at fault time; he would rather find a way to keep synchronous compaction from stalling instead. Some ideas for doing that are being thrown around, but no solution has been found as of this writing.

Such details can certainly be worked out over time. Meanwhile, if Mel's patch turns out to be the best fix, the decision on merging should be clear enough. Given a choice between (1) a system that continues to be responsive during heavy I/O to slow devices and (2) random, lengthy lockups in such situations, one might reasonably guess that most users would choose the first alternative. Barring complications, one would expect this patch to go into the mainline fairly soon, and possibly into the stable tree shortly thereafter.




Huge pages, slow drives, and long delays

Posted Nov 17, 2011 2:12 UTC (Thu) by nybble41 (subscriber, #55106) [Link]

How hard would it be to locate another potential hugepage when a subpage cannot be migrated immediately?

In my opinion, in-core operations should never be forced to wait on disk I/O unless it's necessary to prevent the entire operation from failing. On the other hand, there is definite value in allocating a hugepage up front, so it might make sense to put some effort toward locating a candidate hugepage which *can* be migrated rather than immediately falling back to individual pages.

If it's possible to try another hugepage, or fall back to individual pages, these options should come first.

Huge pages, slow drives, and long delays

Posted Nov 17, 2011 2:21 UTC (Thu) by naptastic (guest, #60139) [Link] (4 responses)

It sounds like another case of desktop users needing different things than server administrators.

Why not tie this behavior to the kernel preemption setting? If it's set to anything higher than voluntary, then Mel's change (and perhaps some others?) should be in place; if it's set to no preemption, then go ahead and stall processes while you're making room for some hugepages.

It's not server vs desktop

Posted Nov 17, 2011 8:10 UTC (Thu) by khim (subscriber, #9252) [Link] (3 responses)

Actually it does not always make sense on servers either. If you have some batch-processing operation (the slocate indexer on a desktop, map-reduce on a server), then it's OK to wait for the compaction - even if it'll take a few minutes.

But if you need a response right away (most desktop operations, live requests in the server's case), then latency is paramount.

It's not server vs desktop

Posted Nov 17, 2011 13:21 UTC (Thu) by mennucc1 (guest, #14730) [Link] (2 responses)

So the decision should depend on niceness. Usual processes should not wait for compaction. Nice processes should. Just my 2€.

It's not server vs desktop

Posted Nov 17, 2011 21:26 UTC (Thu) by lordsutch (guest, #53) [Link] (1 responses)

But isn't part of the problem that, even if you're only compacting when nice processes try to get a huge page, non-nice processes may queue up waiting for the compaction to complete before they can do things, due to resource contention? (For example, a non-nice process may want to read from the USB drive to play back video while you're doing a nice'd backup of $HOME to it.)

It's not server vs desktop

Posted Nov 18, 2011 2:16 UTC (Fri) by naptastic (guest, #60139) [Link]

Hmm.

I see two questions. First: can we infer from a process's niceness or scheduler class whether it would prefer waiting for a hugepage or taking what's available now? Second: are memory compaction passes preemptible? Is this the behavior you're looking for?

1. A low-priority, sched_idle process (1) tries to allocate memory. The kernel starts compacting memory to provide it with a hugepage.
2. A higher-priority, sched_fifo process (2) becomes runnable and tries to allocate. Because it's higher priority, the kernel puts the request for (1) on the back burner. Because (2) is sched_fifo, the kernel doesn't wait for compaction but just gives it what's available now.
3. With that request satisfied, the kernel goes back to compacting in order to satisfy (1)'s needs.

As someone who only uses -rt kernels, this is the behavior I think I would want. The network can get hugepages, and it can wait for them; but jackd and friends better get absolute preferential treatment for memory Right Now.

Huge pages, slow drives, and long delays

Posted Nov 17, 2011 3:09 UTC (Thu) by smoogen (subscriber, #97) [Link]

This somehow reminds me of Emacs garbage collection, which in the bad old days would cause you to spend 2-3 minutes watching the window freeze if you had a really slow system. I wonder how closely garbage collection and compaction match up, and whether various fixes used in GC'd languages could be applied here.

Huge pages, slow drives, and long delays

Posted Nov 17, 2011 7:42 UTC (Thu) by iq-0 (subscriber, #36655) [Link]

Wouldn't it be logical to always try a non-blocking attempt first and, if that's not easily possible, fall back to normal page allocation? And perhaps one could hint, using madvise() or so, that one is willing to wait for huge page allocation, no matter what...
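
A per-mapping hint of that sort does exist: madvise() with MADV_HUGEPAGE marks a range as a candidate for transparent huge pages, though it only expresses a preference and does not appear to force waiting for compaction. A minimal sketch, assuming a THP-enabled kernel and a libc that defines MADV_HUGEPAGE:

    /* madvise_huge.c - mark an anonymous mapping as a THP candidate. */
    #define _DEFAULT_SOURCE      /* MAP_ANONYMOUS, MADV_HUGEPAGE on glibc */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define LEN (4UL << 20)      /* 4MB: room for two 2MB huge pages */

    int main(void)
    {
        void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return EXIT_FAILURE;
        }

        /* Only a hint: whether huge pages are actually used still depends
         * on availability and policy (and on 2MB alignment in practice). */
        if (madvise(p, LEN, MADV_HUGEPAGE) != 0)
            perror("madvise(MADV_HUGEPAGE)");

        ((char *)p)[0] = 1;      /* touch it; the fault may use a huge page */

        munmap(p, LEN);
        return 0;
    }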

Writeback caching for USB sticks

Posted Nov 17, 2011 10:15 UTC (Thu) by epa (subscriber, #39769) [Link] (1 responses)

Perhaps slow devices such as USB sticks should not get writeback caching. Writes to them should happen synchronously - that would also avoid the user-unfriendly extra step of unmounting the stick before unplugging it. If you ask to copy a file to a USB stick, the OS shouldn't lie when reporting that the operation has completed; it should only say it's done when it's done.

Or at least allow only a fixed, small amount of dirty pages for removable and slow devices (no more than can be written out within one second, say).

Writeback caching for USB sticks

Posted Dec 15, 2011 13:43 UTC (Thu) by hpro (subscriber, #74751) [Link]

Isn't that just backwards? The whole point of the writeback cache is to spare the user/process from having to wait for the slow write to hit disk?

On a related note, have you ever tried to do some useful work on files on a stick mounted 'sync'? It is quite painful, I assure you.

Huge pages, slow drives, and long delays

Posted Nov 17, 2011 10:31 UTC (Thu) by mjthayer (guest, #39183) [Link]

I'm sure that this is a hopelessly naive thought, but presumably not all of the pages which are in cache waiting to be written out to the USB stick are actually actively being written at any given time. Can't they be moved elsewhere in the meantime? Or in a similar vein, the process faulting memory in could initially be given small pages and they could be compacted after the fact as huge pages became available.

Ha-Ha-Only-Serious

Posted Nov 17, 2011 11:09 UTC (Thu) by CChittleborough (subscriber, #60775) [Link] (13 responses)

[T]he whole point of contemporary browsers sometimes seems to be to stress-test virtual memory management systems.
Can I submit this as a quote of the week?

This really is a classic "ha-ha-only-serious" quip. For me (and, I guess, most people) the browser is the only thing that uses large amounts of memory, so any memory-related misconfiguration shows up as browser slowdowns, freezes and crashes. (With Firefox 3 and Firefox 4, the OOM killer used to startle me once or twice a month.) Is there a good HOWTO on this topic?

Ha-Ha-Only-Serious

Posted Nov 17, 2011 20:11 UTC (Thu) by nevets (subscriber, #11875) [Link] (12 responses)

the browser is the only thing that uses large amounts of memory

You obviously don't use Evolution.

Ha-Ha-Only-Serious

Posted Nov 17, 2011 20:51 UTC (Thu) by chuckles (guest, #41964) [Link] (9 responses)

You obviously don't use Evolution.

Does anyone? I didn't realize it was still being worked on.

Ha-Ha-Only-Serious

Posted Nov 17, 2011 21:10 UTC (Thu) by nevets (subscriber, #11875) [Link] (5 responses)

You make me chuckle, chuckles.

I still use it, and I sometimes wish they would stop working on it, as they keep making it harder to use after every update. I guess they have the gnome mindset too.

I never cared much for mutt. I do like alpine, but too many people send me html crap that I need to read, and tbird always screws up patches I try to send.

Evolution seems to work the best with imap and it's trivial to send sane patches.

Ha-Ha-Only-Serious

Posted Nov 17, 2011 23:33 UTC (Thu) by mathstuf (subscriber, #69389) [Link] (4 responses)

> I never cared much for mutt.

I didn't at first either, mainly because the bindings were crazy and inconsistent. I've gotten it to be pretty well vim-ish, but it's not a 1:1 translation. It is certainly more consistent though and that makes it much better for me.

> I do like alpine

I couldn't get the hang of alpine. Too much not-mail clutter around.

> but too many people send me html crap that I need to read

For HTML, I have bindings to toggle plain text or w3m -dump viewing for HTML mails (Ctrl-U and Ctrl-I, respectively). It works fairly well - not perfect, but better than any web interface.

Ha-Ha-Only-Serious

Posted Nov 18, 2011 0:40 UTC (Fri) by nevets (subscriber, #11875) [Link] (3 responses)

The thing is, I actually like the GUI part of Evolution. mutt just looks damn ugly. I use it to read LKML but that's it. I place reading email up there with web browsing, and I personally think they both do better with a GUI interface than plain text.

I like to have a preview screen. I move mail all the time by dragging a message over to a folder with the mouse. I do wish Evolution had better keyboard shortcuts, as I could probably move messages faster by typing; I did with alpine. But mutt still seems hackish to me, and I never got past it. I've been using it for LKML for a few years now, and I still don't care much for it.

Evolution is big and slow, and I need to kill it as often as I do my browsers, but other than that, it works well for me. If it had better key bindings, I would dare call it an awesome application.

Ha-Ha-Only-Serious

Posted Nov 18, 2011 7:53 UTC (Fri) by mathstuf (subscriber, #69389) [Link]

I agree that the default mutt colorscheme is...less than pleasant. Of course, any UI with many elements tends to do poorly with just 8 colors. I converted the vim colorscheme I use (neverland-darker) over to it[1].

I could see a 'hackish' feeling, but the bindings were the biggest part of that for me. Moving messages is = and then I have tab completion and (session-local) history in the prompt. Tagging messages with searches works wonders too for doing batch processing of a sort.

I don't imagine we'll convince each other, as I find gut feelings are hard to explain and uproot; besides, I've been using mutt happily for over a year since moving from KMail.

[1]http://blipper.dev.benboeckel.net/files/mutt.png

Ha-Ha-Only-Serious

Posted Nov 24, 2011 20:40 UTC (Thu) by jospoortvliet (guest, #33164) [Link] (1 responses)

A few months ago I would've said: try KMail. Much faster than Evolution, highly efficient keyboard shortcuts and layout options, nice security integration, etc.

Until they came out with an Akonadi-based KMail2 - now it's as slow as, if not slower than, Evolution, and it eats lots of RAM. Now I have to wait until they fix it and muddle through in the meantime, or give up and go back to webmail :(

Ha-Ha-Only-Serious

Posted Dec 3, 2011 23:25 UTC (Sat) by Los__D (guest, #15263) [Link]

Now I have to wait until they fix it and muddle through in the mean time; or give up and go back to webmail :(

In that case, I'd give RoundCube a try. It has worked perfectly for me by keeping much of the desktop mail client feel. Certainly, that feel sometimes breaks down, but in general I'm very pleased, especially compared to my old webmail client, SquirrelMail.

SquirrelMail is way more configurable though, if you need that (although RoundCube might be more configurable than I think; currently I'm just running the default Synology DSM version).

Evolution

Posted Nov 17, 2011 21:36 UTC (Thu) by shane (subscriber, #3335) [Link] (2 responses)

I use Evolution. It has a few features I haven't found in other e-mail clients:

* Ability to specify only a part of the message as "preformatted text", so it doesn't get word-wrapped. Useful for pasting console output in support messages, for example.
* Nice "bullet list"/"number list" support.
* Pretty good IMAP support.
* Multiple-language spell-checking in a single mail.

It's bloated and slow, but newer versions actually do fix lots of bugs.

Evolution

Posted Nov 17, 2011 23:44 UTC (Thu) by mathstuf (subscriber, #69389) [Link] (1 responses)

> Ability to specify only a part of the message as "preformatted text", so they don't get word-wrapped. Useful for pasting console output in support messages, for example.
Do you mean when writing mails or viewing them? Mutt takes whatever $EDITOR gives back, verbatim. For viewing, I have bindings to toggle between 78 width and $COLUMNS-2 (300+ here which is a little much for a default wrap boundary).

> Nice "bullet list"/"number list" support.
I tend to just do ASCII lists in vim myself.

> Pretty good IMAP support.
I use offlineimap here.

> Multiple-language spell-checking in a single mail.
I'm not bilingual (or anything more for that matter), so I've not had to solve this one.

Evolution

Posted Nov 18, 2011 1:59 UTC (Fri) by mgedmin (subscriber, #34497) [Link]

Vim does multilingual spell-checking.

Stress-testers

Posted Nov 18, 2011 12:27 UTC (Fri) by CChittleborough (subscriber, #60775) [Link] (1 responses)

You're right, I don't use Evolution. I use (and highly recommend) fastmail.fm, via their excellent webmail interface. So my mail program is just as memory-hungry as Firefox, because it is Firefox.

Slightly more on topic: Mozilla has been working on reducing memory usage in recent versions of Firefox, with good success AFAICT. So we may need to find another stress-tester for the memory subsystem ... :-)

Stress-testers

Posted Nov 24, 2011 20:42 UTC (Thu) by jospoortvliet (guest, #33164) [Link]

Don't worry, Firefox still has no problem eating over 500 MB of RAM for just a few web pages :D

Huge pages, slow drives, and long delays

Posted Nov 17, 2011 19:19 UTC (Thu) by ejr (subscriber, #51652) [Link]

Allocating huge pages transparently and immediately gives a large performance boost on the Graph500 benchmark. The boost was big enough that other people compared results on our system (at Georgia Tech) with their own results and thought their hardware was broken.

But it's a server / HPC system. No browsers running, and no USB sticks hanging off. I wouldn't mind having to set a sysctl at boot... Also, our heap space often is allocated once. Dynamic compaction is less interesting, but doesn't seem to be hurting performance. I should measure that sometime.

Optimize for desktop

Posted Nov 17, 2011 20:32 UTC (Thu) by Velmont (guest, #46433) [Link] (1 responses)

Optimize for desktop/laptop users. The server people already know how to tweak their system for what they want.

sysctl-knob, yes, yes, yes.

I hate that this is getting pushback. I hope it's the kind of "yes, we need this, but in another way" and not the "no, we don't want this unless it doesn't hurt us at all".

Really, really. The HPC crowd, Google, et al. have the resources to configure their systems. Let them.

Optimize for desktop

Posted Nov 17, 2011 21:01 UTC (Thu) by Ben_P (guest, #74247) [Link]

A quick Google search didn't turn up anything: can sysctl parameters be cgroup-specific? If we should be separating applications into cgroups anyway, this seems like a nice place for this type of mm tuning.

USB sticks

Posted Nov 17, 2011 23:11 UTC (Thu) by dougg (guest, #1894) [Link]

Unless one uses industrial-grade flash devices (very unlikely), anything using low-cost flash (e.g. USB sticks) can be extremely horrible with large scattered writes. And it gets worse if the file system is anything other than VFAT. Think seconds per write operation.

And I suspect the performance degradation is non-linear: hit it with too many writes and you will be punished accordingly (and then some). Any efficiencies upstream (e.g. huge pages) may just make things worse.

Huge pages, slow drives, and long delays

Posted Nov 18, 2011 13:39 UTC (Fri) by foom (subscriber, #14868) [Link] (1 responses)

Awesome timing on this article, this is almost certainly exactly what's slowing down a big server workload I run when transparent hugepages are enabled.

We tried experimentally enabling the THP feature in the latest kernel but it made things *much* slower, which seemed really mysterious. Now, knowing that the kernel is unnecessarily blocking processes on writing some pages to disk explains the whole issue. (Note: no USB sticks here; normal hard drives are already slow enough for me...).

So yeah, that argument about how it's desirable for long-running server workloads? Not so much.

Huge pages, slow drives, and long delays

Posted Nov 19, 2011 20:56 UTC (Sat) by Ben_P (guest, #74247) [Link]

Are long-running server workloads considered to be those that hold large chunks of memory for a long time? Almost any Java application would fall under this due to its own internal memory management. It does seem very plausible that there are some server loads that actually malloc/free per request, or something similar, which could make this problem seem worse.

Why not the obvious solution?

Posted Nov 24, 2011 13:18 UTC (Thu) by slashdot (guest, #22014) [Link]

The obvious solution is to allow the mm code to move pages that are queued for I/O, temporarily blocking the actual hardware submission, instead of waiting until the hardware completes the I/O.

In addition, the number of pages actually submitted to the hardware simultaneously needs to be limited, or they must be copied to a compact bounce buffer before submission to the hardware, unless they are already contiguous.

Also, with a suitable IOMMU, it should be possible to move pages even while they are actively read by the hardware.

The fix proposed instead seems to make no sense, as it doesn't fix the issue and introduces unpredictable performance degradation.

Huge pages, slow drives, and long delays

Posted Nov 25, 2011 7:33 UTC (Fri) by set (guest, #4788) [Link] (2 responses)

Well, just a comment as a desktop user that had no trouble reproducing this problem:

I have a USB stick with poor write speed, to which I periodically copy ~16GB of linear data. If I just left the machine alone, it would struggle along at a rate of perhaps 1MB/s - 3MB/s, with occasional pauses where almost no data moved. If, however, I were to try to use a browser, not only would the browser hang for up to minutes at a time, but other applications would hang too, not updating their windows, AND the copy process really went to hell, trickling out a few KB/s and requiring hours to complete. Oddly, the only way I could get things going again after it entered this I/O-hell state was to just cat gigabytes of disk files (this just got the copy going again - no help for the hanging applications).

This seems the definition of unacceptable behaviour on what is an otherwise capable dual-core amd64 box with 4GB of RAM. Eventually, after trying many things to tune swap and paging settings, and even formatting the USB sticks to start their data clusters on 128KB boundaries, etc. (before that, the I/O was even slower...), I finally just rebuilt the kernel without transparent huge pages. I wish I had heard they might be at fault earlier, but for me that was just a guess.

So, unless some fix is implemented, it would seem to me that transparent huge pages is simply an option that should at least be documented as being potentially completely destructive to desktop usage.

Huge pages, slow drives, and long delays

Posted Nov 25, 2011 19:45 UTC (Fri) by jospoortvliet (guest, #33164) [Link]

What I wonder is whether this can also hit you on an internal SSD which has performance issues... My system frequently stalls like this when an app tries to allocate memory.

Huge pages, slow drives, and long delays

Posted Nov 29, 2011 9:21 UTC (Tue) by Thomas (subscriber, #39963) [Link]

Exactly the same behaviour here on my laptop at work, with a 32GB SD card connected via a built-in MMC/SD card reader. After having written more than 8GB of a 16GB git repository via rsync to the SD card, I get lock-ups for minutes (system load > 8 according to xosview). I'll have to try disabling this huge pages feature and see whether it helps.

Thanks to Jonathan for the very enlightening article!

Huge pages, slow drives, and long delays

Posted Nov 26, 2011 12:43 UTC (Sat) by mcfrisk (guest, #40131) [Link]

Now this sounds exactly like what I've been suffering from. Using a Thinkpad T60 with a SAMSUNG MMCRE64G5MXP-0VB SSD. Music playing, video editing, web browsing, etc. were smooth while a kernel compilation ran in the background - until link time came and the whole machine froze for minutes. Sometimes the music continued to play, but even the keyboard LEDs were not reacting to caps lock or num lock presses. After a few minutes it was over, with nothing in the logs indicating a problem.

Lately I tried to get rid of the problem with ext4 options like noatime, data=writeback, etc., which did speed the system up a bit but did not remove the lockups completely. I hope this really gets resolved.

Huge pages, slow drives, and long delays

Posted Dec 13, 2011 22:11 UTC (Tue) by khc (guest, #45209) [Link]

Isn't that what /sys/kernel/mm/transparent_hugepage/defrag is supposed to control?

Huge pages, slow drives, and long delays

Posted Dec 24, 2011 0:20 UTC (Sat) by tristangrimaux (guest, #26831) [Link]

Mel Gorman's patch is slick, but it is not elegant.

The elegant fix would be to make sure synchronous operations are quick and do not wait for ANY device to signal; they should only reserve pages, set flags, and then return without further ado.

Huge pages, slow drives, and long delays

Posted Jan 21, 2012 5:22 UTC (Sat) by swmike (guest, #57335) [Link]

I don't understand it when people say this is hard to reproduce. I have posted several times to the linux-mm list how to reproduce this problem. I can do it 100% of the time. On my Ubuntu laptop (currently running 11.10): unplug the power, put in a USB stick with FAT or NTFS, and start writing to it. As soon as the process has written about 1/4 to 1/2 of my physical memory, my browser starts randomly blocking. Happens every time.

My workaround is "watch -n 5 sync" when I copy files. Makes the problem go away completely.


Copyright © 2011, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds