
Time to thrash the 2.6 VM?

Those who have been watching kernel development for a little while will remember the fun that came with the 2.4.10 release, when Linus replaced the virtual memory subsystem with a new implementation by Andrea Arcangeli. The 2.4 kernel did end up with a stable VM some releases thereafter, but many developers were upset that such a major change would be merged that far into a stable series, especially since many of them were not convinced that the previous VM was beyond repair.

The 2.4 changes are long past, but the memories are fresh enough that when Andrea put forward a set of VM changes which, while they are for 2.4, are said to be applicable to 2.6 as well, people took notice. Andrea's goals this time are a little more focused; he is concerned with the performance of systems with at least 32GB of installed memory and hundreds of processes with shared mappings of large files. This, of course, is the sort of description that might fit a high-end database server.

Andrea has found three problems which make those massive servers fail to function well. The first has to do with how 2.4 performs swapout; it works by scanning each process's virtual address space, and unmapping pages that it would like to make free. When a page's mapping count reaches zero, it gets kicked out of main memory. The problem is that this algorithm performs poorly in situations where many processes have the same, large file mapped. The VM will start by unmapping the entire file for the first process, then another, and so on. Only when it has passed through all of the processes mapping the file can it actually move pages out of main memory. Meanwhile, all of those processes are incurring minor page faults and remapping the pages. With enough memory and processes, the VM subsystem is almost never able to actually free anything.
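The failure mode described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the 2.4 swapout code: the function name, the `refault_delay` parameter, and the rule that a process remaps the whole file a fixed number of steps after being unmapped are all invented for illustration.

```python
def virtual_scan_pass(num_procs, num_pages, refault_delay):
    """Toy model of one per-process swapout pass over a shared file.

    Each step unmaps the whole file from one process. A process that was
    unmapped `refault_delay` steps ago touches the file again and remaps
    it (minor faults). Pages can only be freed once no process maps them.
    Returns (pages_freed, minor_faults).
    """
    mapped = [True] * num_procs        # does process p still map the file?
    unmapped_at = [None] * num_procs   # step at which p was unmapped
    minor_faults = 0

    for step in range(num_procs):
        # Previously unmapped processes fault their pages back in.
        for p in range(num_procs):
            if not mapped[p] and step - unmapped_at[p] >= refault_delay:
                mapped[p] = True
                minor_faults += num_pages

        # The scanner unmaps the file from the next process.
        mapped[step] = False
        unmapped_at[step] = step

        # Mapping count reached zero: the pages can finally leave memory.
        if not any(mapped):
            return num_pages, minor_faults

    return 0, minor_faults


print(virtual_scan_pass(4, 1000, 10))    # few processes: (1000, 0)
print(virtual_scan_pass(100, 1000, 10))  # many processes: (0, 90000)
```

In this model, once the number of processes exceeds the refault delay, the scanner never sees every mapping gone at the same time, so it does a great deal of work and frees nothing.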

This is the problem that the reverse-mapping VM (rmap) was added to 2.5 to solve. By working directly with physical pages and following pointers to the page tables which map them, the VM subsystem can quickly free pages for other use. Andrea is critical of rmap, however; with his scenario of 32GB of memory and hundreds of processes, the rmap infrastructure grows to a point where the system collapses. Instead, for his patches, he has implemented a variant of the object-based reverse mapping scheme. Object-based reverse mapping works by following the links from the object (a shared file, say) which backs the shared memory to the processes that map it; in this way it is able to dispense with the per-page rmap structures in many situations. There are some concerns about pathological performance issues with the object-based approach, but those problems do not seem to arise in real-world use.
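The difference between the two bookkeeping schemes can be sketched as follows. This is a simplified illustration, not kernel code: `Mapping`, `SharedFile`, `find_ptes`, `pte_chain_entries`, and the 4KB page size are all assumptions made for the example.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed page size for this sketch


@dataclass
class Mapping:
    pid: int          # process holding the mapping
    vm_start: int     # virtual address where the mapping begins
    file_offset: int  # file offset (bytes) that appears at vm_start
    length: int       # size of the mapping in bytes


class SharedFile:
    """Object-based scheme: the file keeps one record per mapping."""

    def __init__(self):
        self.mappings = []

    def mmap(self, pid, vm_start, file_offset, length):
        self.mappings.append(Mapping(pid, vm_start, file_offset, length))

    def find_ptes(self, page_index):
        """Return (pid, virtual address) for every mapping of this file page.

        The address is computed from the mapping record, so no per-page
        reverse-mapping entry is needed. The cost is a linear walk over
        all mappings of the file.
        """
        offset = page_index * PAGE_SIZE
        hits = []
        for m in self.mappings:
            if m.file_offset <= offset < m.file_offset + m.length:
                hits.append((m.pid, m.vm_start + (offset - m.file_offset)))
        return hits


def pte_chain_entries(num_procs, file_pages):
    """Per-page scheme: one reverse-mapping entry per (process, page) pair."""
    return num_procs * file_pages


f = SharedFile()
f.mmap(1, 0x40000000, 0, 8 * PAGE_SIZE)
f.mmap(2, 0x50000000, 4 * PAGE_SIZE, 4 * PAGE_SIZE)
print(f.find_ptes(5))  # both processes map file page 5
print(f.find_ptes(2))  # only process 1 maps file page 2
```

As a hypothetical data point, a 1GB file (262,144 pages of 4KB) mapped by 300 processes would need 78,643,200 per-page entries under `pte_chain_entries`, against 300 mapping records in the object-based sketch. The linear walk in `find_ptes` is where the search cost mentioned above would show up if a file had a very large number of mappings.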

The second problem is a simple bug in the swapout code. When shared memory is unmapped and set up for swap, the actual I/O to write it out to the swap file is not started right away. By the time the system gets around to actually performing I/O, there is a huge pile of pages waiting to be shoved out, and an I/O storm results. Even then, the way the kernel tracks this memory means that it takes a long time to notice that it is free even after it has been written to swap. This problem is fixed by taking frequent breaks to actually shove dirty memory out to disk.
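The effect of taking frequent breaks to write pages out can be shown with a very small model. It is a sketch only: `peak_io_burst` and its `flush_every` parameter are invented names, and the model counts queued pages rather than simulating real I/O.

```python
def peak_io_burst(num_dirty_pages, flush_every=None):
    """Return the largest number of pages submitted for I/O at once.

    flush_every=None models the deferred behaviour: every page is queued
    and nothing is written until the scan is over. A positive integer
    models pausing to start writeout after that many pages are queued.
    """
    queued = 0
    peak = 0
    for _ in range(num_dirty_pages):
        queued += 1
        if flush_every is not None and queued >= flush_every:
            peak = max(peak, queued)  # submit this batch
            queued = 0
    # Whatever is still queued at the end goes out in one final batch.
    return max(peak, queued)


print(peak_io_burst(10000))      # deferred: one burst of 10000 pages
print(peak_io_burst(10000, 32))  # batched: bursts of at most 32 pages
```

The total amount of data written is the same in both cases; what changes is that the batched variant never builds up the huge backlog that produces the I/O storm.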

Andrea's final problem came about when he tried to copy a large file while all those database processes were running. It turns out that the system was swapping out the shared database memory (which was dirty and in use) rather than the data from the file just copied (which was clean). Tweaking the memory freeing code to make it prefer clean cache pages over dirty pages straightened this problem out, at the cost of a certain amount of unfairness.
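A preference for clean pages can be sketched as a simple reordering of the reclaim candidates. This is an illustration under assumptions, not Andrea's patch: `pick_victims`, the `(name, dirty)` tuples, and the page names are hypothetical.

```python
def pick_victims(lru_pages, count):
    """Choose `count` pages to reclaim, preferring clean pages.

    lru_pages is a list of (name, dirty) tuples, oldest first. Clean pages
    can be dropped without any I/O, so they are taken first; dirty pages
    are considered only when the clean ones run out. LRU order is kept
    within each group.
    """
    clean = [p for p in lru_pages if not p[1]]
    dirty = [p for p in lru_pages if p[1]]
    return (clean + dirty)[:count]


pages = [
    ("db0", True),     # shared database memory: dirty and in use
    ("db1", True),
    ("copy0", False),  # page cache from the file copy: clean
    ("copy1", False),
]
print(pick_victims(pages, 2))  # [('copy0', False), ('copy1', False)]
```

A strict oldest-first policy would have picked `db0` and `db1` here. The unfairness mentioned above is visible in the sketch: older dirty pages stay in memory while newer clean pages are thrown away.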

With these patches, Andrea claims, the 2.4 kernel can handle heavy loads on large systems that would immediately lock up a 2.6 system. So he is going to start looking toward 2.6, with an eye toward beefing it up for this sort of load. Andrew Morton has indicated that he might accept some of this work - but not yet:

We need to understand that right now, 2.6.x is 2.7-pre. Once 2.7 forks off we are more at liberty to merge nasty highmem hacks which will die when 2.6 is end-of-lined.

I plan to merge the 4g split immediately after 2.7 forks. I wouldn't be averse to objrmap for file-backed mappings either - I agree that the search problems which were demonstrated are unlikely to bite in real life.

The "4g split" is Ingo Molnar's 4GB user-space patch which makes more low memory available to the kernel, but at a performance cost. Before Andrew merges any other patches, however, he wants to see a convincing demonstration of why the current VM patches are not enough for large loads. The 2.6 "stable" kernel may well see some significant virtual memory work, but, with luck, it will not be subjected to a 2.4.10-like abrupt switch.



Time to thrash the 2.6 VM?

Posted Mar 4, 2004 19:39 UTC (Thu) by jwb (guest, #15467) [Link]

I boggle when certain kernel developers claim that the current VM is useful on medium workloads. On my 8GB SMP machines, I often see userspace making no progress for minutes at a stretch, as seen in this chart:

http://saturn5.com/~jwb/prime-starve.png

The current VM can spin and spin doing nothing for a horribly long time. I'm glad AA is working on the problem. RvR, on the other hand, *is* the problem and I wish they would stop letting him screw around with the VM code.

That's the way I see it from a sysadmin/user perspective, anyway.

Time to thrash the 2.6 VM?

Posted Mar 4, 2004 19:58 UTC (Thu) by crimsun (subscriber, #13750) [Link]

It's wrong to blame one person or any group for filibustering vm improvements. VM itself, as with any core OS component, is a very elusive bullseye: there is no best implementation for the average workload. 2.6 does a decent approximation. Sure, there's room for improvement - and the code is freely available for free modification and distribution.

Time to thrash the 2.6 VM?

Posted Mar 4, 2004 20:45 UTC (Thu) by jwb (guest, #15467) [Link]

I suppose it would be better to blame the process, but RvR's style of magically tweaking constants strikes me as unscientific. I'd be happier with an OSDL project which analyzed any proposed VM changes (or any kernel change) on a variety of workloads: kernel compiles on 128MB machines, interactive loads on 256MB machines, DVD burns on 1GB machines, web serving on 1 and 2GB machines, database loads on 1, 2, 4, 8, 16, and 32GB machines, and so forth.

My enthusiasm for AA's changes stems from his proven history of having insight into how the kernel is really behaving, and his willingness to acknowledge the current brokenness. RvR's 2.4 VM through 2.4.9 was a disaster. 2.4.10 was the climax, after which the kernel's behavior was so very much more reasonable. And I never forget that RvR invented the OOM killer, probably the worst decision for Linux operability ever made.

Time to thrash the 2.6 VM?

Posted Mar 5, 2004 13:26 UTC (Fri) by Johbe (guest, #249) [Link]

I'm having this problem right now. We run a Squid proxy for about 1500 clients on an SMP machine with approx 4 gigs of RAM. Someone suggested I try out Rik's rmap patch and that's what I'm doing right now.

The problem I'm experiencing occasionally is a complete lockup: kswapd eats *all* CPU on the machine and it becomes unusable, locked down to ssh and everything for 10-15 minutes, then it starts running again as usual.

I've been using the rmap patch for 3 days now, and so far we haven't experienced this problem. It might still show up, since it takes a while after a reboot until it hits. We'll see. But I still have some faith in it.

Time to thrash the 2.6 VM?

Posted Mar 5, 2004 19:58 UTC (Fri) by jzbiciak (✭ supporter ✭, #5246) [Link]

Just a quick question: Are these 8GB machines 32-bit machines running w/ highmem, or 64-bit machines running flat memory?

Historically, it seems like highmem has always been a challenge to get to work well for Linux.

Time to thrash the 2.6 VM?

Posted Mar 5, 2004 21:42 UTC (Fri) by jwb (guest, #15467) [Link]

They are 64-bit machines running a 32-bit kernel with high memory. How's that for confusion?

I think Andrea gets too much flak about 2.4.10

Posted Mar 6, 2004 18:59 UTC (Sat) by chip (subscriber, #8258) [Link]

Andrea still describes 2.4.10 as a success -- it's right there in lkml this month -- and as a regular user of his kernel patches, I know I'm a satisfied customer.

I think Andrea gets too much flak about 2.4.10

Posted Mar 7, 2004 23:48 UTC (Sun) by garloff (subscriber, #319) [Link]

2.4.10 itself was not great. It had some rough edges.
But after a few more revisions, 2.4 VM worked much better than before.

Copyright © 2004, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds