Stabilizing per-VMA locking
The mmap_lock controls access to a process's address space. If a process has a lot of things going on at the same time — a large number of threads, for example — contention for the mmap_lock can become a serious performance problem. Handling page faults, which can involve changing page-table entries or expanding a virtual memory area (VMA), has traditionally required the mmap_lock, meaning that many threads generating page faults concurrently can slow a process down.
Developers have been working for years to address this problem, often pursuing one form or another of speculative page-fault handling, where the work is done without the protection of the lock. If contention occurred during the handling of a fault, the kernel would drop the work it had done before making it visible and start over. This work never got to the point of being merged, though. Eventually, Suren Baghdasaryan's per-VMA locking work was merged instead. Since a page fault happens within a specific VMA, handling it should really only require serialized access to that one VMA rather than to the entire address space. Per-VMA locking is, thus, a finer-grained version of mmap_lock that allows for more parallelism in the execution of specific tasks.
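To make the mechanism a bit more concrete, here is a minimal, hedged sketch of what a page-fault handler can do once per-VMA locking is available; it is loosely patterned on the 6.4 x86 fault path, with the architecture-specific checks and most error handling omitted, and the function name is invented purely for illustration:

```c
/*
 * Simplified, illustrative sketch of per-VMA-locked fault handling; the
 * real code lives in the architecture fault handlers and mm/memory.c.
 */
#include <linux/mm.h>

static vm_fault_t handle_fault_sketch(struct mm_struct *mm,
				      unsigned long address,
				      unsigned int flags,
				      struct pt_regs *regs)
{
	struct vm_area_struct *vma;
	vm_fault_t fault;

	/* Fast path: find and read-lock only the faulting VMA, under RCU. */
	vma = lock_vma_under_rcu(mm, address);
	if (vma) {
		fault = handle_mm_fault(vma, address,
					flags | FAULT_FLAG_VMA_LOCK, regs);
		vma_end_read(vma);
		if (!(fault & VM_FAULT_RETRY))
			return fault;	/* handled without touching mmap_lock */
	}

	/* Slow path: fall back to the traditional whole-address-space lock. */
	mmap_read_lock(mm);
	vma = find_vma(mm, address);	/* start-address/stack checks omitted */
	if (!vma) {
		mmap_read_unlock(mm);
		return VM_FAULT_SIGSEGV;
	}
	fault = handle_mm_fault(vma, address, flags, regs);
	/* On VM_FAULT_RETRY or VM_FAULT_COMPLETED the lock was already dropped. */
	if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)))
		mmap_read_unlock(mm);
	return fault;
}
```

The point of the fast path is that faults in different VMAs no longer contend with each other, or with most address-space changes, on a single process-wide lock.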
After a few iterations, this work was pulled into the mainline during the 6.4 merge window. Entry into the mainline tends to increase the amount of testing a change is subject to, but none of that testing turned up any problems with per-VMA locking, which saw no fixup changes during the remainder of the 6.4 development cycle. Per-VMA locking, it seemed, was a done deal.
Trouble arises
A mainline kernel release gets even more testing, though. The 6.4 release happened on June 25; it took four days for Jiri Slaby to report problems with the Go language build. On July 2, Jacob Young reported "frequent but random crashes" and narrowed the problem down to a simple test case that could reproduce it. In both cases, investigation quickly pointed a finger at the per-VMA locking work. There were, it seems, cases where unexpected concurrent access was corrupting data structures within the kernel.
Random crashes and data corruption are not the sort of experience users are looking for when they upgrade to a new "stable" kernel release. For extra fun, the StackRot vulnerability led to the merging of some significant changes to how stack expansion is handled — changes that had not seen public review and which were quickly shipped in the 6.4.1 stable update. That work introduced another VMA-locking bug in 6.4 kernels, having added another case that needed explicit serialization.
Baghdasaryan responded quickly to all of these problems, posting a patch to simply disable the feature until the problems were worked out. That patch needed some tweaking of its own, but it was never applied to the mainline in any form. Memory-management maintainer Andrew Morton went into "wait-a-few-days-mode" in the hopes that a proper fix for the problem would emerge. Greg Kroah-Hartman, meanwhile, said that nothing could be done for the stable kernels until some patch landed in the mainline kernel. As a result, the 6.4 kernel lacked any sort of workaround for this serious bug.
On July 5, Thorsten Leemhuis (working in his regression-tracking role) wondered how long the wait would be, noting that the faster-moving community distributions would already be starting to ship the 6.4 kernel. Morton answered that he would send the fixes that existed at the time, but that did not actually happen for a few more days, leading Leemhuis to conclude that he needed to start sounding the alarm sooner on serious regressions like this one. Had he done so in this case, he thought, the problem might have been addressed more quickly.
Morton sent a patch series to Linus Torvalds on July 8 that included, among other things, the disabling of per-VMA locking. Torvalds, though, undid that change when merging the set, because three other patches had gone straight into the mainline by then:
- The first one added some locking to the new stack-expansion code, addressing the problem that had been introduced by the security fixes.
- The next ensured that newly created VMAs are locked before they are made visible to the rest of the kernel. The problem it fixed did not affect any released kernels, but it was a latent bug that would have been exposed by the planned expansion of per-VMA locking in the future.
- Finally, the third patch fixed the problem that was actually being reported. If a process called fork(), then incurred a page fault on a VMA while that VMA was being copied into the child process, the result could be corruption of that VMA. The fix, sketched below, is to explicitly lock each VMA before beginning the copy, slowing down fork()-heavy workloads by as much as 5%.
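The core of that third fix is small: during fork(), each parent VMA is marked as write-locked before it is copied, so a concurrent fault on that VMA cannot take the per-VMA fast path and falls back to the mmap_lock, which fork() already holds. The sketch below is illustrative only; the function name is invented, and the real dup_mmap() in kernel/fork.c does far more than this:

```c
/*
 * Illustrative sketch of the fork()-time fix: write-lock each parent VMA
 * before copying it, so that concurrent per-VMA-locked faults back off
 * onto the mmap_lock, which this code already holds.
 */
#include <linux/mm.h>

static int dup_mmap_sketch(struct mm_struct *mm, struct mm_struct *oldmm)
{
	struct vm_area_struct *mpnt, *tmp;
	VMA_ITERATOR(vmi, oldmm, 0);
	int retval = 0;

	mmap_write_lock(oldmm);
	for_each_vma(vmi, mpnt) {
		/*
		 * The fix: mark this VMA as being modified.  A fault that
		 * finds it via lock_vma_under_rcu() will now fail to
		 * read-lock it and retry under the mmap_lock instead.
		 */
		vma_start_write(mpnt);

		tmp = vm_area_dup(mpnt);	/* make the child's copy */
		if (!tmp) {
			retval = -ENOMEM;
			break;
		}
		tmp->vm_mm = mm;
		/* ... copy page tables and link tmp into the child's tree ... */
	}
	mmap_write_unlock(oldmm);
	return retval;
}
```

Taking that write lock on every VMA in the parent is where the reported slowdown of up to 5% for fork()-heavy workloads comes from.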
It's worth noting that Torvalds was opposed to the idea of disabling per-VMA locking; instead he insisted that, if it could not be fixed, it should be removed entirely:
I seriously believe that if the per-vma locking is so broken that it needs to be disabled in a development kernel, we should just admit failure, and revert it all. And not in a "revert it for a later attempt" kind of way.
So it would be a "revert it because it added insurmountable problems that we couldn't figure out" thing that implies *not* trying it again in that form at all, and much soul-searching before somebody decides that they have a more maintainable model for it all.
By all appearances, the fixes (which were included in the 6.4.3 stable update on July 11) are sufficient, and per-VMA locking is now stable. There should, thus, be no need to revert and soul-search in this case. That said, it's worth keeping in mind that this work looked stable before the 6.4 release as well.
Closing thoughts
The per-VMA locking work made fundamental changes to a core kernel subsystem, moving some page-fault handling outside of a sprawling lock, the coverage of which is probably not fully understood by anybody. The fact that it turned out to have some subtle bugs should not be especially surprising. It is hard to make this kind of change to the kernel without running into trouble somewhere, but this is also precisely the sort of change that developers need to be able to make if Linux is to adapt to current needs.
In theory, avoiding subtle locking problems is one of the advantages of using a language like Rust. Whether, say, a VMA abstraction written in Rust could truly ensure that accesses use proper locking while maintaining performance has not yet been proven in the kernel context, though.
There is a case to be made that this regression could have been handled better. Perhaps there should be a way for an immediate "disable the offending feature" patch to ship in a stable release, even if that change has not gone into the mainline, and even if a better fix is expected in the near future. Kernel developers often make the point that the newest kernels are the best ones that the community knows how to make and that users should not hesitate to upgrade to them. Responding more quickly when an upgrade turns out to be a bad idea would help to build confidence in that advice.
Meanwhile, though, developers did respond quickly to the problem and proper fixes that allowed the feature to stay in place were found. The number of users exposed to the bug was, hopefully, relatively small. There is, in other words, a case to be made that this problem was handled reasonably well and that we will all get the benefits of faster memory management as a result of this work.
| Index entries for this article | |
|---|---|
| Kernel | Memory management/mmap_sem |
| Kernel | Releases/6.4 |
Posted Jul 13, 2023 21:02 UTC (Thu)
by dskoll (subscriber, #1630)
[Link]
Anecdote: Firefox began crashing occasionally (1-2x per day) when I switched to kernel 6.4. Since I upgraded to 6.4.3 two days ago, it has not crashed. I don't know for sure it was the kernel, but it seems likely.
Posted Jul 13, 2023 22:23 UTC (Thu)
by Paf (subscriber, #91811)
[Link] (1 responses)
It seems so obvious that the truth of “newest is best” is more complicated than this (I’m not saying those involved don’t know this, just it helps me to say it directly). The basic logic of it is obviously correct, but at the same time, it’s common that new things have new problems not yet found through greater exposure. I don’t really have an answer, maybe it’s to hugely increase testing of kernels before they come out. But it seems likely there will still be a sweet spot with stable kernels that are just a bit older base with more fixes, whatever we may say about the best we know how to produce. :/
Posted Jul 14, 2023 10:25 UTC (Fri)
by farnz (subscriber, #17727)
[Link]
It's complicated. The newest kernel is, by definition, the best kernel the developers know how to produce, but that does not necessarily make it the best for your workload, because it may have new bugs that matter to you, while the old bugs don't matter to you.
And one component of "please use the latest" is that bug reports become less valuable the longer the gap between bug introduced, and bug found; if you report a bug that's not present in 6.4, but is present in 6.5-rc1, not only are there fewer commits to consider that could possibly have exposed the bug, but the people who will fix your bug have the context around why something works the way it does in recent memory. If you report a bug as added in 5.12-rc1 and still present in 6.5-rc1, but not in 5.11, you're still keeping the commit count down, but now you're asking the developers to remember why changes were made in the 5.12 time frame, and what the intended effect was.
Posted Jul 13, 2023 22:34 UTC (Thu)
by ringerc (subscriber, #3071)
[Link] (4 responses)
Interesting that it got so far, given that postgres has been a useful tool to exercise other MM kernel issues in the past.
Posted Jul 14, 2023 4:02 UTC (Fri)
by andresfreund (subscriber, #69562)
[Link] (1 responses)
Posted Jul 14, 2023 12:30 UTC (Fri)
by sam_c (subscriber, #139836)
[Link]
Posted Jul 14, 2023 19:40 UTC (Fri)
by willy (subscriber, #9762)
[Link] (1 responses)
Posted Jul 15, 2023 10:37 UTC (Sat)
by ringerc (subscriber, #3071)
[Link]
Posted Jul 14, 2023 2:16 UTC (Fri)
by jlbec (subscriber, #121932)
[Link] (4 responses)
Significant VM changes in major releases that take time and consternation to sort out are time-honored tradition.
Posted Jul 14, 2023 4:54 UTC (Fri)
by alison (subscriber, #63752)
[Link]
Posted Jul 15, 2023 0:25 UTC (Sat)
by Paf (subscriber, #91811)
[Link] (2 responses)
Posted Jul 15, 2023 0:40 UTC (Sat)
by corbet (editor, #1)
[Link] (1 responses)
The VM switch in 2.4 seemed insane at the time as well. It did seem to stabilize a lot of the persistent problems that had been plaguing that kernel, though; it's been a long time since I've heard anybody who thinks it was the wrong decision.
Posted Jul 16, 2023 19:30 UTC (Sun)
by Paf (subscriber, #91811)
[Link]
Posted Jul 14, 2023 6:26 UTC (Fri)
by mb (subscriber, #50428)
[Link] (3 responses)
> locking while maintaining performance has not yet been proven in the kernel context, though.
Well, I think that such performance critical paths would either use safe code or they would be marked unsafe {} and use some lockless performance tricks.
The unsafe-block by itself would probably help to raise some eyebrows when doing such fundamental locking changes.
Posted Jul 14, 2023 7:06 UTC (Fri)
by sima (subscriber, #160698)
[Link] (2 responses)
I think in this case here it was more an issue of not clearly understanding the ownership model of vma structs, and not so much a missed case that the compiler could have spotted for you. Where rust might help is that it allows you to roll out the ownership model first (in the type system) without changing the locking, and then once you're fairly confident that you got it right, you flip the switch (and watch things blow up if you missed something).
Posted Jul 14, 2023 14:05 UTC (Fri)
by mb (subscriber, #50428)
[Link]
>But for that to work you need to understand your ownership model clearly, otherwise you can't encode it in the type system.
If you don't understand the ownership model, you can't write correct C code either. The resulting code may be correct by luck, at best.
>I think in this case here it was more an issue of not clearly understanding the ownership model of vma structs
Exactly.
Because it was not enforced and not even encouraged by the language to think about it.
>not so much a missed case that the compiler could have spotted for you.
I'm not talking about that kind of thing.
I'm talking about Rust encouraging you to think about ownership. This is many levels above a compiler run.
Programming Rust puts your brain into a completely different mode of thinking. And that's what makes the real difference.
Posted Jul 20, 2023 14:02 UTC (Thu)
by khim (subscriber, #9252)
[Link]
And if you couldn't do that, then your program could not be compiled and thus wouldn't have any bugs. The question is not whether Rust may prevent such bugs. It absolutely could do that; it's designed for that and it works well. The question is whether the price is too high and whether code that would satisfy the Rust compiler would also satisfy the performance demands of kernel developers. That is still an open question, that's true.
Posted Jul 21, 2023 12:06 UTC (Fri)
by jeremyhetzler (subscriber, #127663)
[Link]
Posted Jul 21, 2023 15:09 UTC (Fri)
by mirabilos (subscriber, #84359)
[Link]
Slowing down fork-heavy workloads in favour of thread-heavy workloads by THIS MUCH? Is that workable?
Many many standard Unix tools don't use any threads at all, while the standard workflow uses loops and pipes and is therefore fork-heavy.