Avoiding page reference-count overflows
changes all over" and highlighted a number of the areas that had been touched. One thing that was not mentioned there was the addition of four patches fixing a security-related issue in the core memory-management subsystem. The vulnerability is sufficiently difficult to exploit that almost nobody should feel the need to rush out a kernel update, but it is still interesting to look at as a demonstration of how things can go wrong.
One of the many things crammed into struct page is a field called _refcount; it is an atomic_t variable that counts the number of references to the page in question. As an atomic_t, it's a signed, 32-bit quantity. As long as this count is nonzero, references to the page exist and it cannot be reused for other purposes; once it drops to zero, the page can be freed.
As with any counter in the kernel, _refcount should be manipulated with care lest it overflow. If one looks at the definition of get_page(), one will see this check:
    VM_BUG_ON_PAGE(page_ref_count(page) <= 0, page);
If the reference count is incremented beyond the maximum positive value that can be held in a 32-bit signed variable (2,147,483,647), it will wrap around to the largest-magnitude negative value (-2,147,483,648). That, of course, wanders into undefined behavior, so the compiler is entitled to set it to 0xdeadbeef and erase the disk drives instead if it so chooses. With reasonable compilers and a two's complement representation, though, one can expect the count to go negative when it overflows; that is what the above test is checking for.
There are a couple of interesting things to note about this test. One is that it happens before _refcount is incremented, so it will only trigger on the second overflow. That is not a problem, though, since the system will behave as expected — even with negative reference counts — as long as the count is not incremented all the way back to zero. The other thing to note is that, while VM_BUG_ON_PAGE() will crash the kernel when the CONFIG_DEBUG_VM configuration option is selected, it does precisely nothing in production builds. So, on most systems, an overflow of the page reference count will go undetected. That is potentially bad: if the reference count can be incremented to the point where it returns to zero, the page will be freed while a vast number of references remain, and a variety of use-after-free vulnerabilities will result. Good things do not generally ensue after an event like that.
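To make that concrete, here is a minimal user-space sketch of the check-before-increment pattern; refcount and get_page_sketch() are illustrative stand-ins, not the kernel's code. Signed overflow is formally undefined behavior in C, but the wrap to INT_MIN assumed in the comments is what typical two's-complement systems produce:

    /* User-space illustration of a check that happens before the
     * increment.  The first overflowing call is silent; only the
     * second call sees a non-positive count. */
    #include <limits.h>
    #include <stdio.h>

    static int refcount = INT_MAX;  /* one increment away from overflow */

    static void get_page_sketch(void)
    {
        if (refcount <= 0)          /* the VM_BUG_ON_PAGE() condition */
            fprintf(stderr, "BUG: refcount %d\n", refcount);
        refcount++;                 /* the check precedes the increment */
    }

    int main(void)
    {
        get_page_sketch();          /* silent; count wraps to INT_MIN */
        get_page_sketch();          /* the check fires: count is negative */
        printf("final refcount = %d\n", refcount);
        return 0;
    }

The first overflowing call goes unnoticed; only the second one sees a non-positive count, matching the behavior described above.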
That said, memory-management developers have had good reason to be unworried about this eventuality. Overflowing the reference count to zero would require creating a full four-billion references to the page in question (the negative reference counts for the second two-billion references would look weird, but things would still work as intended), and that is not an easy thing to do. As it turns out, though, Jann Horn has figured out how to do it, and the merge commit for the fixes makes clear that it was indeed not easy: the exploit requires, among other things, on the order of 140GB of memory and a specially crafted filesystem.
Most of us are unlikely to give such resources to an attacker, even when asked nicely. But a bug is a bug, and the developers who were privy to the information about Horn's exploit decided to fix it. The result was four commits (from Linus Torvalds and Matthew Wilcox) making this particular hole even harder to exploit.
The "specially crafted filesystem" mentioned above is necessary because it is not easy for an attacker running in user space to create large numbers of references to a given page. Simply creating a lot of mappings using mmap() or fork() would run into other limits long before the reference count overflowed. One way that does seem to work, though, is to create large numbers of direct-I/O requests, each of which acquires a reference while the operation is underway. Should the filesystem be extremely slow to finish those operations — even slower than VFAT — the references could eventually add up to the magic number.
The first step toward preventing such exploits was to create a macro to check whether a given reference count is getting close to going negative:
    #define page_ref_zero_or_close_to_overflow(page) \
            ((unsigned int) page_ref_count(page) + 127u <= 127u)
The VM_BUG_ON_PAGE() check shown above was then changed to use this macro. On systems where VM_BUG_ON_PAGE() does anything at all, this new test should prevent reference counts from going negative, thus stopping things far short of incrementing all the way to zero.
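The unsigned arithmetic is worth a moment of study: casting the count to unsigned int and adding 127 makes the comparison hold exactly when the count lies in the range [-127, 0], that is, when it is zero or has recently wrapped negative. This user-space sketch, where close_to_overflow() is a local stand-in for the kernel macro, shows which values are caught:

    /* Demonstrate the range caught by the unsigned wraparound test. */
    #include <stdio.h>

    static int close_to_overflow(int count)
    {
        return (unsigned int) count + 127u <= 127u;
    }

    int main(void)
    {
        int samples[] = { 1, 0, -1, -127, -128, 2147483647 };

        for (unsigned int i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
            printf("%11d -> %s\n", samples[i],
                   close_to_overflow(samples[i]) ? "caught" : "not caught");
        return 0;
    }

As the comment thread below notes, the name is arguably backward: the test catches counts that have already wrapped past the maximum, not counts that are merely approaching it.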
That said, if CONFIG_DEBUG_VM is set on a target system, an attacker could still use this overflow for denial-of-service attacks. To prevent that, Torvalds created a new try_get_page() function that will refuse to acquire a reference if the count is close to overflowing. Then, get_user_pages() was changed to use try_get_page() and to fail the entire operation if the needed references cannot be acquired. With those changes in place, direct-I/O requests that threaten to overflow a reference count will simply fail, and the attacker will be left looking for other good uses for 140GB of free memory. Another possible exploit, using pipes, was closed off in a similar fashion by Wilcox.
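A sketch of the resulting pattern, again with user-space stand-ins rather than the real kernel implementation: ordinary get_page() callers keep the unconditional increment, while paths that user space can drive arbitrarily, such as get_user_pages(), use the failable variant and back out when the count looks dangerous:

    /* Failable reference acquisition in the style of try_get_page(). */
    #include <stdbool.h>
    #include <stdio.h>

    static int refcount = 1;

    static bool try_get_page_sketch(void)
    {
        if (refcount <= 0)      /* count has wrapped: refuse the reference */
            return false;
        refcount++;
        return true;
    }

    int main(void)
    {
        if (!try_get_page_sketch())
            fprintf(stderr, "no reference taken; fail the I/O request\n");
        else
            printf("refcount is now %d\n", refcount);
        return 0;
    }

A failure here propagates back out of get_user_pages(), so the offending direct-I/O request fails cleanly rather than pushing the count onward toward zero.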
While these changes were added late in the release cycle like an urgent fix, this clearly isn't a vulnerability that has a lot of people worried. In fact, it was first discussed in January on the closed kernel security list, but then the developers involved simply forgot about it for a while. That lapse has now been rectified, which is a good thing; one never knows when somebody might discover an easier way to force the page reference count to overflow. When that happens, the patches merged for 5.1-rc5 should be able to prevent a severe compromise of the system.
Posted Apr 16, 2019 1:52 UTC (Tue) by neilbrown (subscriber, #359) [Link] (14 responses)
And if refcount_t isn't suitable here, is it suitable anywhere?
Posted Apr 16, 2019 4:52 UTC (Tue) by eru (subscriber, #2753) [Link] (5 responses)
Allowing it to be decremented from the saturated value would also be dangerous, because the page might then get freed prematurely. It seems to me a reference count overflow simply must be handled as an error; otherwise you introduce rare bugs.
Posted Apr 16, 2019 10:49 UTC (Tue) by LtWorf (subscriber, #124958) [Link] (1 responses)
Does anyone know more?
Posted Apr 16, 2019 12:57 UTC (Tue) by corbet (editor, #1) [Link]
Once a refcount_t value hits its max, it will no longer be changed — it gets "stuck". Any resource controlled by that count will then never be freed. A resource leak is considered to be less threatening than a premature free. More information in this article and this one.
Posted Apr 16, 2019 12:54 UTC (Tue) by willy (subscriber, #9762) [Link] (2 responses)
And that's what this patch does. Once the refcount reaches 2^31, "safe" increments are still allowed but "potentially exploitable" ones fail.
Posted Apr 16, 2019 14:41 UTC (Tue) by nivedita76 (subscriber, #121790) [Link] (1 responses)
Posted Apr 16, 2019 15:08 UTC (Tue) by willy (subscriber, #9762) [Link]
The macro is misnamed, I now realise.
Posted Apr 16, 2019 12:47 UTC (Tue) by willy (subscriber, #9762) [Link]
Posted Apr 16, 2019 17:07 UTC (Tue) by willy (subscriber, #9762) [Link] (1 responses)
I think it's a question of tradeoffs (as so much of programming is). Getting an object that uses a refcount_t always succeeds. It may have the side-effect of making the object indestructible, but there's no new rarely-executed, security-critical code-path to test.
get_user_pages() can already fail, so all users should be prepared for that. If not, they're currently a security hole -- consider one thread which calls munmap() on addresses that another thread is passing to a syscall.
refcount_t isn't the appropriate fix for this problem, but it is generally a better approach than a silently overflowing/wrapping atomic_t.
Posted Apr 17, 2019 0:33 UTC (Wed) by neilbrown (subscriber, #359) [Link]
I wonder if a refcount_get_may_fail() could be useful elsewhere.
Posted Apr 17, 2019 7:43 UTC (Wed) by mm7323 (subscriber, #87386) [Link] (4 responses)
But then I looked up the refcount_t overflow protection and noted it was first designed by PaXTeam with work to mainline it being done by Kees along with impressive benchmarking results.
The reputation of PaXTeam needs no comment, though I admit an unhealthy amount of admiration for Kees for continuing excellent work in the face of past fierce and unkind put-downs from the Linux project leader in chief and friction elsewhere.
While we now have 'nice Linus', he himself had a hand in this patch set, so perhaps the reason for not using or adapting refcount_t is one of personal politics rather than technical merit.
I may well be blind to something here (and certainly I'm not close enough to this sub-system) but personally I just don't buy that refcount_t couldn't be used here, or slightly improved for use. The refcount_t saturating behaviour sounds right, as does producing a call stack on first saturation of the counter and then never decrementing the count again. Killing the calling process on first saturation of a count may not be suitable in case a call originates from a kernel thread itself, but I'm sure that could be adapted if needed (or not already) - with possible benefits and uses in other parts of the kernel too.
Hopefully a later comment will clearly explain the technical limitation of refcount_t which could only be overcome by adding a new esoteric mechanism.
Posted Apr 18, 2019 6:49 UTC (Thu) by willy (subscriber, #9762) [Link] (3 responses)
You should write for The Register.... It is, however, unworthy of an LWN comment.
For the record, the original idea of using the negative reference count in this way was mine. I didn't consider switching to refcount_t for a second, not because I have a long-standing problem with Spender, but because it would be a ridiculous amount of change for a fix. And as my earlier comment said, it would have had significant downsides.
All the things you ask to be explained have already been explained in earlier responses; I'm not going to type them out again.
Posted Apr 19, 2019 7:44 UTC (Fri) by marcH (subscriber, #57642) [Link]
Posted Apr 19, 2019 9:45 UTC (Fri) by mm7323 (subscriber, #87386) [Link] (1 responses)
> You should write for The Register....
If you are going to tell me where I should write, may I respectfully suggest that as a leading linux kernel developer, perhaps you should read the CoC and at least choose a more respectful tone in your replies. This isn't the lkml and serves a more diverse audience. It's reasonable to expect different levels of understanding here.
> It is, however, unworthy of an LWN comment.
That is just your opinion; I don't believe you speak on behalf of LWN.
> it would be a ridiculous amount of change for a fix.
Having read and re-read the article and comments, that seems to be the crux of the argument for making a new mechanism. Apart from "refcount_t isn't the appropriate fix for this problem", I don't see where any critical technical limitation of refcount_t is stated. Short-cutting the work to make this change might have been the best option here, but it's surely creating technical debt. For a "vulnerability [that] is sufficiently difficult to exploit that almost nobody should feel the need to rush out a kernel update", I hope it is worth it. Thanks for all your hard work all the same.
Posted Apr 22, 2019 23:17 UTC (Mon) by jschrod (subscriber, #1646) [Link]
> That is just your opinion; I don't believe you speak on behalf of LWN.
Of course he didn't speak on behalf of LWN -- but he didn't claim to, either. He spoke on behalf of us long-time LWN readers who don't want to see ad-hominem attacks, or comments making negative personal assumptions about why kernel developers implemented a change in a certain way, especially when those developers have explicitly explained their reasons here in this LWN thread. Please abstain from this kind of comment in the future. It's not welcome here. Thank you.
Posted Apr 18, 2019 20:16 UTC (Thu) by naptastic (guest, #60139) [Link] (2 responses)
Why not just panic if a page's refcount < -1?
(I'm assuming <-1 and not ≤-1 based on Willy's comment)
As Willy pointed out earlier, this change makes the refcount have four ranges instead of two (or one, depending on how you count). That seems like a lot of added complexity (hidden behind a macro) to accommodate a use case... actually, can anyone suggest when it would be useful to have 2^32 or more references to a page? I can't think of a good use case, but what do I know? :-)
(Also, who's Willy?)
Regardless, congratulations and thank you to all the developers who identified and fixed this before it became more than a theoretical problem.
Posted Apr 19, 2019 15:07 UTC (Fri) by willy (subscriber, #9762) [Link] (1 responses)
It's generally considered unfriendly to panic ... we tend to prefer BUG_ON() which just kills the task.
> can anyone suggest when it would be useful to have 2^32 or more references to a page?
I don't think it's ever useful to have 2^32 actual references to a page. The four ranges are really there to help the system behave well in the presence of an attack.
That said, with huge pages, we often increment refcounts by the number of base pages in the huge page. If we supported 1GB THPs, that would be 2^18 references per task. So with 2^14 tasks mapping the same page, we'd overflow in an entirely legitimate manner. Fortunately, we only support 2MB THPs today, so this doesn't yet apply. (hugetlbfs handles refcounts differently from THP)
> (Also, who's Willy?)
Matthew Wilcox. I've been using 'willy' as my online nickname since about 1994.
Posted Apr 19, 2019 15:21 UTC (Fri) by naptastic (guest, #60139) [Link]
Posted May 14, 2019 18:44 UTC (Tue) by mcortese (guest, #52099) [Link]
> ...That, of course, wanders into undefined behavior, so the compiler is entitled to set it to 0xdeadbeef and erase the disk drives instead if it so chooses...
What am I missing?