Stabilizing per-VMA locking
The mmap_lock controls access to a process's address space. If a process has a lot of things going on at the same time — a large number of threads, for example — contention for the mmap_lock can become a serious performance problem. Handling page faults, which can involve changing page-table entries or expanding a virtual memory area (VMA), has traditionally required the mmap_lock, meaning that many threads generating page faults concurrently can slow a process down.
Developers have been working for years to address this problem, often pursuing one form or another of speculative page-fault handling, where the work is done without the protection of the lock. If contention occurred during the handling of a fault, the kernel would drop the work it had done before making it visible and start over. This work never got to the point of being merged, though. Eventually, Suren Baghdasaryan's per-VMA locking work was merged instead. Since a page fault happens within a specific VMA, handling it should really only require serialized access to that one VMA rather than to the entire address space. Per-VMA locking is, thus, a finer-grained version of mmap_lock that allows for more parallelism in the execution of specific tasks.
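To make the mechanism a bit more concrete, here is a minimal, hedged sketch of what a page-fault handler can do once per-VMA locking is available; it is loosely patterned on the 6.4 x86 fault path, with the architecture-specific checks and most error handling omitted, and the function name is invented purely for illustration:

```c
/*
 * Simplified, illustrative sketch of per-VMA-locked fault handling; the
 * real code lives in the architecture fault handlers and mm/memory.c.
 */
#include <linux/mm.h>

static vm_fault_t handle_fault_sketch(struct mm_struct *mm,
				      unsigned long address,
				      unsigned int flags,
				      struct pt_regs *regs)
{
	struct vm_area_struct *vma;
	vm_fault_t fault;

	/* Fast path: find and read-lock only the faulting VMA, under RCU. */
	vma = lock_vma_under_rcu(mm, address);
	if (vma) {
		fault = handle_mm_fault(vma, address,
					flags | FAULT_FLAG_VMA_LOCK, regs);
		vma_end_read(vma);
		if (!(fault & VM_FAULT_RETRY))
			return fault;	/* handled without touching mmap_lock */
	}

	/* Slow path: fall back to the traditional whole-address-space lock. */
	mmap_read_lock(mm);
	vma = find_vma(mm, address);	/* start-address/stack checks omitted */
	if (!vma) {
		mmap_read_unlock(mm);
		return VM_FAULT_SIGSEGV;
	}
	fault = handle_mm_fault(vma, address, flags, regs);
	/* On VM_FAULT_RETRY or VM_FAULT_COMPLETED the lock was already dropped. */
	if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)))
		mmap_read_unlock(mm);
	return fault;
}
```

The point of the fast path is that faults in different VMAs no longer contend with each other, or with most address-space changes, on a single process-wide lock.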
After a few iterations, this work was pulled into the mainline during the 6.4 merge window. Entry into the mainline tends to increase the amount of testing a change is subject to, but none of that testing turned up any problems with per-VMA locking, which saw no fixup changes during the remainder of the 6.4 development cycle. Per-VMA locking, it seemed, was a done deal.
Trouble arises
A mainline kernel release gets even more testing, though. The 6.4 release happened on June 25; it took four days for Jiri Slaby to report problems with the Go language build. On July 2, Jacob Young reported "frequent but random crashes" and narrowed the problem down to a simple test case that could reproduce it. In both cases, investigation quickly pointed a finger at the per-VMA locking work. There were, it seems, cases where unexpected concurrent access was corrupting data structures within the kernel.
Random crashes and data corruption are not the sort of experience users are looking for when they upgrade to a new "stable" kernel release. For extra fun, the StackRot vulnerability led to the merging of some significant changes to how stack expansion is handled — changes that had not seen public review and which were quickly shipped in the 6.4.1 stable update. That work introduced another VMA-locking bug in 6.4 kernels, having added another case that needed explicit serialization.
Baghdasaryan responded quickly to all of these problems, posting a patch to simply disable the feature until the problems were worked out. That patch needed some tweaking of its own, but it was never applied to the mainline in any form. Memory-management maintainer Andrew Morton went into "wait-a-few-days-mode" in the hopes that a proper fix for the problem would emerge. Greg Kroah-Hartman, meanwhile, said that nothing could be done for the stable kernels until some patch landed in the mainline kernel. As a result, the 6.4 kernel lacked any sort of workaround for this serious bug.
On July 5, Thorsten Leemhuis (working in his regression-tracking role) wondered how long the wait would be, noting that the faster-moving community distributions would already be starting to ship the 6.4 kernel. Morton answered that he would send the fixes that existed at the time, but that did not actually happen for a few more days, leading Leemhuis to conclude that he needed to start sounding the alarm sooner on serious regressions like this one. Had he done so in this case, he thought, the problem might have been addressed more quickly.
Morton sent a patch series to Linus Torvalds on July 8 that included, among other things, the disabling of per-VMA locking. Torvalds, though, undid that change when merging the set, because three other patches had gone straight into the mainline by then:
- The first one added some locking to the new stack-expansion code, addressing the problem that had been introduced by the security fixes.
- The next ensured that newly created VMAs are locked before they are made visible to the rest of the kernel. The problem it fixed did not affect any released kernels, but it was a latent bug that would have been exposed by the planned expansion of per-VMA locking in the future.
- Finally, the third patch fixed the problem that was actually being reported. If a process called fork(), then incurred a page fault on a VMA while that VMA was being copied into the child process, the result could be corruption of that VMA. The fix, sketched below, is to explicitly lock each VMA before beginning the copy, slowing down fork()-heavy workloads by as much as 5%.
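The core of that third fix is small: during fork(), each parent VMA is marked as write-locked before it is copied, so a concurrent fault on that VMA cannot take the per-VMA fast path and falls back to the mmap_lock, which fork() already holds. The sketch below is illustrative only; the function name is invented, and the real dup_mmap() in kernel/fork.c does far more than this:

```c
/*
 * Illustrative sketch of the fork()-time fix: write-lock each parent VMA
 * before copying it, so that concurrent per-VMA-locked faults back off
 * onto the mmap_lock, which this code already holds.
 */
#include <linux/mm.h>

static int dup_mmap_sketch(struct mm_struct *mm, struct mm_struct *oldmm)
{
	struct vm_area_struct *mpnt, *tmp;
	VMA_ITERATOR(vmi, oldmm, 0);
	int retval = 0;

	mmap_write_lock(oldmm);
	for_each_vma(vmi, mpnt) {
		/*
		 * The fix: mark this VMA as being modified.  A fault that
		 * finds it via lock_vma_under_rcu() will now fail to
		 * read-lock it and retry under the mmap_lock instead.
		 */
		vma_start_write(mpnt);

		tmp = vm_area_dup(mpnt);	/* make the child's copy */
		if (!tmp) {
			retval = -ENOMEM;
			break;
		}
		tmp->vm_mm = mm;
		/* ... copy page tables and link tmp into the child's tree ... */
	}
	mmap_write_unlock(oldmm);
	return retval;
}
```

Taking that write lock on every VMA in the parent is where the reported slowdown of up to 5% for fork()-heavy workloads comes from.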
It's worth noting that Torvalds was opposed to the idea of disabling per-VMA locking; instead he insisted that, if it could not be fixed, it should be removed entirely:
I seriously believe that if the per-vma locking is so broken that it needs to be disabled in a development kernel, we should just admit failure, and revert it all. And not in a "revert it for a later attempt" kind of way.
So it would be a "revert it because it added insurmountable problems that we couldn't figure out" thing that implies *not* trying it again in that form at all, and much soul-searching before somebody decides that they have a more maintainable model for it all.
By all appearances, the fixes (which were included in the 6.4.3 stable update on July 11) are sufficient, and per-VMA locking is now stable. There should, thus, be no need to revert and soul-search in this case. That said, it's worth keeping in mind that this work looked stable before the 6.4 release as well.
Closing thoughts
The per-VMA locking work made fundamental changes to a core kernel subsystem, moving some page-fault handling outside of a sprawling lock, the coverage of which is probably not fully understood by anybody. The fact that it turned out to have some subtle bugs should not be especially surprising. It is hard to make this kind of change to the kernel without running into trouble somewhere, but this is also precisely the sort of change that developers need to be able to make if Linux is to adapt to current needs.
In theory, avoiding subtle locking problems is one of the advantages of using a language like Rust. Whether, say, a VMA abstraction written in Rust could truly ensure that accesses use proper locking while maintaining performance has not yet been proven in the kernel context, though.
There is a case to be made that this regression could have been handled better. Perhaps there should be a way for an immediate "disable the offending feature" patch to ship in a stable release, even if that change has not gone into the mainline, and even if a better fix is expected in the near future. Kernel developers often make the point that the newest kernels are the best ones that the community knows how to make and that users should not hesitate to upgrade to them. Responding more quickly when an upgrade turns out to be a bad idea would help to build confidence in that advice.
Meanwhile, though, developers did respond quickly to the problem and proper fixes that allowed the feature to stay in place were found. The number of users exposed to the bug was, hopefully, relatively small. There is, in other words, a case to be made that this problem was handled reasonably well and that we will all get the benefits of faster memory management as a result of this work.
| Index entries for this article | |
|---|---|
| Kernel | Memory management/mmap_sem |
| Kernel | Releases/6.4 |
Posted Jul 13, 2023 21:02 UTC (Thu)
by dskoll (subscriber, #1630)
[Link]
Anecdote: Firefox began crashing occasionally (1-2x per day) when I switched to kernel 6.4. Since I upgraded to 6.4.3 two days ago, it has not crashed. I don't know for sure it was the kernel, but it seems likely.
Posted Jul 13, 2023 22:23 UTC (Thu)
by Paf (subscriber, #91811)
[Link] (1 responses)
It seems so obvious that the truth of “newest is best” is more complicated than this (I’m not saying those involved don’t know this, just it helps me to say it directly). The basic logic of it is obviously correct, but at the same time, it’s common that new things have new problems not yet found through greater exposure. I don’t really have an answer, maybe it’s to hugely increase testing of kernels before they come out. But it seems likely there will still be a sweet spot with stable kernels that are just a bit older base with more fixes, whatever we may say about the best we know how to produce. :/
Posted Jul 14, 2023 10:25 UTC (Fri)
by farnz (subscriber, #17727)
[Link]
It's complicated. The newest kernel is, by definition, the best kernel the developers know how to produce, but that does not necessarily make it the best for your workload, because it may have new bugs that matter to you, while the old bugs don't matter to you.
And one component of "please use the latest" is that bug reports become less valuable the longer the gap between bug introduced, and bug found; if you report a bug that's not present in 6.4, but is present in 6.5-rc1, not only are there fewer commits to consider that could possibly have exposed the bug, but the people who will fix your bug have the context around why something works the way it does in recent memory. If you report a bug as added in 5.12-rc1 and still present in 6.5-rc1, but not in 5.11, you're still keeping the commit count down, but now you're asking the developers to remember why changes were made in the 5.12 time frame, and what the intended effect was.
Posted Jul 13, 2023 22:34 UTC (Thu)
by ringerc (subscriber, #3071)
[Link] (4 responses)
Interesting that it got so far, given that postgres has been a useful tool to exercise other MM kernel issues in the past.
Posted Jul 14, 2023 4:02 UTC (Fri)
by andresfreund (subscriber, #69562)
[Link] (1 responses)
Posted Jul 14, 2023 12:30 UTC (Fri)
by sam_c (subscriber, #139836)
[Link]
Posted Jul 14, 2023 19:40 UTC (Fri)
by willy (subscriber, #9762)
[Link] (1 responses)
Posted Jul 15, 2023 10:37 UTC (Sat)
by ringerc (subscriber, #3071)
[Link]
Posted Jul 14, 2023 2:16 UTC (Fri)
by jlbec (subscriber, #121932)
[Link] (4 responses)
Significant VM changes in major releases that take time and consternation to sort out are time-honored tradition.
Posted Jul 14, 2023 4:54 UTC (Fri)
by alison (subscriber, #63752)
[Link]
Posted Jul 15, 2023 0:25 UTC (Sat)
by Paf (subscriber, #91811)
[Link] (2 responses)
Posted Jul 15, 2023 0:40 UTC (Sat)
by corbet (editor, #1)
[Link] (1 responses)
The VM switch in 2.4 seemed insane at the time as well. It did seem to stabilize a lot of the persistent problems that had been plaguing that kernel, though; it's been a long time since I've heard anybody who thinks it was the wrong decision.
Posted Jul 16, 2023 19:30 UTC (Sun)
by Paf (subscriber, #91811)
[Link]
Posted Jul 14, 2023 6:26 UTC (Fri)
by mb (subscriber, #50428)
[Link] (3 responses)
> locking while maintaining performance has not yet been proven in the kernel context, though.
Well, I think that such performance critical paths would either use safe code or they would be marked unsafe {} and use some lockless performance tricks.
The unsafe-block by itself would probably help to raise some eyebrows when doing such fundamental locking changes.
Posted Jul 14, 2023 7:06 UTC (Fri)
by sima (subscriber, #160698)
[Link] (2 responses)
I think in this case here it was more an issue of not clearly understanding the ownership model of vma structs, and not so much a missed case that the compiler could have spotted for you. Where rust might help is that it allows you to roll out the ownership model first (in the type system) without changing the locking, and then once you're fairly confident that you got it right, you flip the switch (and watch things blow up if you missed something).
Posted Jul 14, 2023 14:05 UTC (Fri)
by mb (subscriber, #50428)
[Link]
>But for that to work you need to understand your ownership model clearly, otherwise you can't encode it in the type system.
If you don't understand the ownership model, you can't write correct C code either. The resulting code may be correct by luck, at best.
>I think in this case here it was more an issue of not clearly understanding the ownership model of vma structs
Exactly.
Because it was not enforced and not even encouraged by the language to think about it.
>not so much a missed case that the compiler could have spotted for you.
I'm not talking about that kind of thing.
I'm talking about Rust encouraging you to think about ownership. This is many levels above a compiler run.
Programming Rust puts your brain into a completely different mode of thinking. And that's what makes the real difference.
Posted Jul 20, 2023 14:02 UTC (Thu)
by khim (subscriber, #9252)
[Link]
And if you couldn't do that, then your program could not be compiled and thus wouldn't have any bugs. The question is not whether Rust may prevent such bugs. It absolutely could do that; it's designed for that and it works well. The question is whether the price is too high and whether code that would satisfy the Rust compiler would also satisfy the performance demands of kernel developers. That is still an open question, that's true.
Posted Jul 21, 2023 12:06 UTC (Fri)
by jeremyhetzler (subscriber, #127663)
[Link]
Posted Jul 21, 2023 15:09 UTC (Fri)
by mirabilos (subscriber, #84359)
[Link]
Slowing down fork-heavy workloads in favour of thread-heavy workloads by THIS MUCH? Is that workable?
Many many standard Unix tools don't use any threads at all, while the standard workflow uses loops and pipes and is therefore fork-heavy.