Avoiding page reference-count overflows
changes all over" and highlighted a number of the areas that had been touched. One thing that was not mentioned there was the addition of four patches fixing a security-related issue in the core memory-management subsystem. The vulnerability is sufficiently difficult to exploit that almost nobody should feel the need to rush out a kernel update, but it is still interesting to look at as a demonstration of how things can go wrong.
One of the many things crammed into struct page is a field called _refcount; it is an atomic_t variable that counts the number of references to the page in question. As an atomic_t, it's a signed, 32-bit quantity. As long as this count is nonzero, references to the page exist and it cannot be reused for other purposes; once it drops to zero, the page can be freed.
As with any counter in the kernel, _refcount should be manipulated with care lest it overflow. If one looks at the definition of get_page(), one will see this check:
    VM_BUG_ON_PAGE(page_ref_count(page) <= 0, page);
If the reference count is incremented beyond the maximum positive value that can be held in a 32-bit signed variable (2,147,483,647), it will wrap around to the largest-magnitude negative value (-2,147,483,648). That, of course, wanders into undefined behavior, so the compiler is entitled to set it to 0xdeadbeef and erase the disk drives instead if it so chooses. With reasonable compilers and a two's complement representation, though, one can expect the count to go negative when it overflows; that is what the above test is checking for.
There are a couple of interesting things to note about this test. One is that it happens before _refcount is incremented, so it will only trigger on the second overflow. That is not a problem, though, since the system will behave as expected — even with negative reference counts — as long as the count is not incremented all the way back to zero. The other thing to note is that, while VM_BUG_ON_PAGE() will crash the kernel when the CONFIG_DEBUG_VM configuration option is selected, it does precisely nothing in production builds. So, on most systems, an overflow of the page reference count will go undetected. That is potentially bad: if the reference count can be incremented to the point where it returns to zero, the page will be freed while a vast number of references remain, and a variety of use-after-free vulnerabilities will result. Good things do not generally ensue after an event like that.
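To make that concrete, here is a minimal user-space sketch of the check-before-increment pattern; refcount and get_page_sketch() are illustrative stand-ins, not the kernel's code. Signed overflow is formally undefined behavior in C, but the wrap to INT_MIN assumed in the comments is what typical two's-complement systems produce:

    /* User-space illustration of a check that happens before the
     * increment.  The first overflowing call is silent; only the
     * second call sees a non-positive count. */
    #include <limits.h>
    #include <stdio.h>

    static int refcount = INT_MAX;  /* one increment away from overflow */

    static void get_page_sketch(void)
    {
        if (refcount <= 0)          /* the VM_BUG_ON_PAGE() condition */
            fprintf(stderr, "BUG: refcount %d\n", refcount);
        refcount++;                 /* the check precedes the increment */
    }

    int main(void)
    {
        get_page_sketch();          /* silent; count wraps to INT_MIN */
        get_page_sketch();          /* the check fires: count is negative */
        printf("final refcount = %d\n", refcount);
        return 0;
    }

The first overflowing call goes unnoticed; only the second one sees a non-positive count, matching the behavior described above.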
That said, memory-management developers have had good reason to be unworried about this eventuality. Overflowing the reference count to zero would require creating a full four-billion references to the page in question (the negative reference counts for the second two-billion references would look weird, but things would still work as intended), and that is not an easy thing to do. As it turns out, though, Jann Horn has figured out how to do it, and the merge commit for the fixes makes clear that it was indeed not easy: the exploit requires, among other things, on the order of 140GB of memory and a specially crafted filesystem.
Most of us are unlikely to give such resources to an attacker, even when asked nicely. But a bug is a bug, and the developers who were privy to the information about Horn's exploit decided to fix it. The result was four commits (from Linus Torvalds and Matthew Wilcox) making this particular hole even harder to exploit.
The "specially crafted filesystem" mentioned above is necessary because it is not easy for an attacker running in user space to create large numbers of references to a given page. Simply creating a lot of mappings using mmap() or fork() would run into other limits long before the reference count overflowed. One way that does seem to work, though, is to create large numbers of direct-I/O requests, each of which acquires a reference while the operation is underway. Should the filesystem be extremely slow to finish those operations — even slower than VFAT — the references could eventually add up to the magic number.
The first step toward preventing such exploits was to create a macro to check whether a given reference count is getting close to going negative:
    #define page_ref_zero_or_close_to_overflow(page) \
            ((unsigned int) page_ref_count(page) + 127u <= 127u)
The VM_BUG_ON_PAGE() check shown above was then changed to use this macro. On systems where VM_BUG_ON_PAGE() does anything at all, this new test should prevent reference counts from going negative, thus stopping things far short of incrementing all the way to zero.
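The unsigned arithmetic is worth a moment of study: casting the count to unsigned int and adding 127 makes the comparison hold exactly when the count lies in the range [-127, 0], that is, when it is zero or has recently wrapped negative. This user-space sketch, where close_to_overflow() is a local stand-in for the kernel macro, shows which values are caught:

    /* Demonstrate the range caught by the unsigned wraparound test. */
    #include <stdio.h>

    static int close_to_overflow(int count)
    {
        return (unsigned int) count + 127u <= 127u;
    }

    int main(void)
    {
        int samples[] = { 1, 0, -1, -127, -128, 2147483647 };

        for (unsigned int i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
            printf("%11d -> %s\n", samples[i],
                   close_to_overflow(samples[i]) ? "caught" : "not caught");
        return 0;
    }

As the comment thread below notes, the name is arguably backward: the test catches counts that have already wrapped past the maximum, not counts that are merely approaching it.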
That said, if CONFIG_DEBUG_VM is set on a target system, an attacker could still use this overflow for denial-of-service attacks. To prevent that, Torvalds created a new try_get_page() function that will refuse to acquire a reference if the count is close to overflowing. Then, get_user_pages() was changed to use try_get_page() and to fail the entire operation if the needed references cannot be acquired. With those changes in place, direct-I/O requests that threaten to overflow a reference count will simply fail, and the attacker will be left looking for other good uses for 140GB of free memory. Another possible exploit, using pipes, was closed off in a similar fashion by Wilcox.
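A sketch of the resulting pattern, again with user-space stand-ins rather than the real kernel implementation: ordinary get_page() callers keep the unconditional increment, while paths that user space can drive arbitrarily, such as get_user_pages(), use the failable variant and back out when the count looks dangerous:

    /* Failable reference acquisition in the style of try_get_page(). */
    #include <stdbool.h>
    #include <stdio.h>

    static int refcount = 1;

    static bool try_get_page_sketch(void)
    {
        if (refcount <= 0)      /* count has wrapped: refuse the reference */
            return false;
        refcount++;
        return true;
    }

    int main(void)
    {
        if (!try_get_page_sketch())
            fprintf(stderr, "no reference taken; fail the I/O request\n");
        else
            printf("refcount is now %d\n", refcount);
        return 0;
    }

A failure here propagates back out of get_user_pages(), so the offending direct-I/O request fails cleanly rather than pushing the count onward toward zero.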
While these changes were added late in the release cycle like an urgent fix, this clearly isn't a vulnerability that has a lot of people worried. In fact, it was first discussed in January on the closed kernel security list, but then the developers involved simply forgot about it for a while. That lapse has now been rectified, which is a good thing; one never knows when somebody might discover an easier way to force the page reference count to overflow. When that happens, the patches merged for 5.1-rc5 should be able to prevent a severe compromise of the system.
Posted Apr 16, 2019 1:52 UTC (Tue) by neilbrown (subscriber, #359) [Link] (14 responses)
And if refcount_t isn't suitable here, is it suitable anywhere?
Posted Apr 16, 2019 4:52 UTC (Tue) by eru (subscriber, #2753) [Link] (5 responses)
Allowing it to be decremented from the saturated value would also be dangerous, because the page might then get freed prematurely. It seems to me a reference count overflow simply must be handled as an error; otherwise you introduce rare bugs.
Posted Apr 16, 2019 10:49 UTC (Tue) by LtWorf (subscriber, #124958) [Link] (1 responses)
Does anyone know more?
Posted Apr 16, 2019 12:57 UTC (Tue) by corbet (editor, #1) [Link]
Once a refcount_t value hits its max, it will no longer be changed — it gets "stuck". Any resource controlled by that count will then never be freed. A resource leak is considered to be less threatening than a premature free. More information in this article and this one.
Posted Apr 16, 2019 12:54 UTC (Tue) by willy (subscriber, #9762) [Link] (2 responses)
And that's what this patch does. Once the refcount reaches 2^31, "safe" increments are still allowed but "potentially exploitable" ones fail.
Posted Apr 16, 2019 14:41 UTC (Tue) by nivedita76 (subscriber, #121790) [Link] (1 responses)
Posted Apr 16, 2019 15:08 UTC (Tue) by willy (subscriber, #9762) [Link]
The macro is misnamed, I now realise.
Posted Apr 16, 2019 12:47 UTC (Tue) by willy (subscriber, #9762) [Link]
Posted Apr 16, 2019 17:07 UTC (Tue) by willy (subscriber, #9762) [Link] (1 responses)
I think it's a question of tradeoffs (as so much of programming is). Getting an object that uses a refcount_t always succeeds. It may have the side-effect of making the object indestructible, but there's no new rarely-executed, security-critical code-path to test.
get_user_pages() can already fail, so all users should be prepared for that. If not, they're currently a security hole -- consider one thread which calls munmap() on addresses that another thread is passing to a syscall.
refcount_t isn't the appropriate fix for this problem, but it is generally a better approach than a silently overflowing/wrapping atomic_t.
Posted Apr 17, 2019 0:33 UTC (Wed) by neilbrown (subscriber, #359) [Link]
I wonder if a refcount_get_may_fail() could be useful elsewhere.
Posted Apr 17, 2019 7:43 UTC (Wed) by mm7323 (subscriber, #87386) [Link] (4 responses)
But then I looked up the refcount_t overflow protection and noted it was first designed by PaXTeam with work to mainline it being done by Kees along with impressive benchmarking results.
The reputation of PaXTeam needs no comment, though I admit an unhealthy amount of admiration for Kees for continuing excellent work in the face of past fierce and unkind put-downs from the Linux project leader in chief and friction elsewhere.
While we now have 'nice Linus', he himself had a hand in this patch set, so perhaps the reason for not using or adapting refcount_t is one of personal politics rather than technical merit.
I may well be blind to something here (and certainly I'm not close enough to this sub-system) but personally I just don't buy that refcount_t couldn't be used here, or slightly improved for use. The refcount_t saturating behaviour sounds right, as does producing a call stack on first saturation of the counter and then never decrementing the count again. Killing the calling process on first saturation of a count may not be suitable in case a call originates from a kernel thread itself, but I'm sure that could be adapted if needed (or not already) - with possible benefits and uses in other parts of the kernel too.
Hopefully a later comment will clearly explain the technical limitation of refcount_t which could only be overcome by adding a new esoteric mechanism.
Posted Apr 18, 2019 6:49 UTC (Thu) by willy (subscriber, #9762) [Link] (3 responses)
You should write for The Register.... It is, however, unworthy of an LWN comment.
For the record, the original idea of using the negative reference count in this way was mine. I didn't consider switching to refcount_t for a second, not because I have a long-standing problem with Spender, but because it would be a ridiculous amount of change for a fix. And as my earlier comment said, it would have had significant downsides.
All the things you ask to be explained have already been explained in earlier responses; I'm not going to type them out again.
Posted Apr 19, 2019 7:44 UTC (Fri) by marcH (subscriber, #57642) [Link]
Posted Apr 19, 2019 9:45 UTC (Fri) by mm7323 (subscriber, #87386) [Link] (1 responses)
> You should write for The Register....
If you are going to tell me where I should write, may I respectfully suggest that as a leading linux kernel developer, perhaps you should read the CoC and at least choose a more respectful tone in your replies. This isn't the lkml and serves a more diverse audience. It's reasonable to expect different levels of understanding here.
> It is, however, unworthy of an LWN comment.
That is just your opinion; I don't believe you speak on behalf of LWN.
> it would be a ridiculous amount of change for a fix.
Having read and re-read the article and comments, that seems to be the crux of the argument for making a new mechanism. Apart from "refcount_t isn't the appropriate fix for this problem", I don't see where any critical technical limitation of refcount_t is stated. Short-cutting the work to make this change might have been the best option here, but it's surely creating technical debt. For a "vulnerability [that] is sufficiently difficult to exploit that almost nobody should feel the need to rush out a kernel update", I hope it is worth it. Thanks for all your hard work all the same.
Posted Apr 22, 2019 23:17 UTC (Mon) by jschrod (subscriber, #1646) [Link]
> That is just your opinion; I don't believe you speak on behalf of LWN.
Of course he didn't speak on behalf of LWN -- but he didn't claim to, either. He spoke on behalf of us long-time LWN readers who don't want to see ad-hominem attacks, or comments making negative personal assumptions about why kernel developers implemented a change in a certain way, especially when those developers have explicitly explained their reasons here in this LWN thread. Please abstain from this kind of comment in the future. It's not welcome here. Thank you.
Posted Apr 18, 2019 20:16 UTC (Thu) by naptastic (guest, #60139) [Link] (2 responses)
Why not just panic if a page's refcount < -1?
(I'm assuming <-1 and not ≤-1 based on Willy's comment)
As Willy pointed out earlier, this change makes the refcount have four ranges instead of two (or one, depending on how you count). That seems like a lot of added complexity (hidden behind a macro) to accommodate a use case... actually, can anyone suggest when it would be useful to have 2^32 or more references to a page? I can't think of a good use case, but what do I know? :-)
(Also, who's Willy?)
Regardless, congratulations and thank you to all the developers who identified and fixed this before it became more than a theoretical problem.
Posted Apr 19, 2019 15:07 UTC (Fri) by willy (subscriber, #9762) [Link] (1 responses)
It's generally considered unfriendly to panic ... we tend to prefer BUG_ON() which just kills the task.
> can anyone suggest when it would be useful to have 2^32 or more references to a page?
I don't think it's ever useful to have 2^32 actual references to a page. The four ranges are really there to help the system behave well in the presence of an attack.
That said, with huge pages, we often increment refcounts by the number of base pages in the huge page. If we supported 1GB THPs, that would be 2^18 references per task. So with 2^14 tasks mapping the same page, we'd overflow in an entirely legitimate manner. Fortunately, we only support 2MB THPs today, so this doesn't yet apply. (hugetlbfs handles refcounts differently from THP)
> (Also, who's Willy?)
Matthew Wilcox. I've been using 'willy' as my online nickname since about 1994.
Posted Apr 19, 2019 15:21 UTC (Fri) by naptastic (guest, #60139) [Link]
Posted May 14, 2019 18:44 UTC (Tue) by mcortese (guest, #52099) [Link]
> ...That, of course, wanders into undefined behavior, so the compiler is entitled to set it to 0xdeadbeef and erase the disk drives instead if it so chooses...
What am I missing?