Another try for getrandom() in the vDSO
Random data is used in no end of applications, including modeling of natural phenomena, generation of identifiers like UUIDs, and game play; it's how NetHack somehow summons those three balrogs all in one corner of a dark room. Security-related operations, such as the generation of nonces and keys, also make heavy use of random data — and depend heavily on that data actually being random. Some applications need so much random data that they spend a significant amount of time in the getrandom() system call.
One possible solution to this problem is to generate random numbers in user space, perhaps seeded by random data from the kernel; many developers have taken that route over the years. But, as Jason A. Donenfeld explains in the cover letter to his patch series, that approach is not ideal. The kernel has the best view of the amount of entropy in the system and of what is needed to generate truly random data. It is also aware of events, such as virtual-machine forks, that can compromise a random-number generator and make reseeding necessary. He concluded:
The simplest statement you could make is that userspace RNGs that expand a getrandom() seed at some point T1 are nearly always *worse*, in some way, than just calling getrandom() every time a random number is desired.
Always calling getrandom() ensures the best random data, but the associated performance problem remains. Moving that function into the vDSO can help to address that problem.
getrandom() in the vDSO
The vDSO is a special mechanism provided to accelerate tasks that require some kernel involvement, but which can otherwise be carried out just as well in user space. It contains code and data provided in user space directly by the kernel in a memory area mapped into every thread's address space. The classic vDSO function is gettimeofday(), which returns the current system time as kept by the kernel. This function can be implemented as a system call, but that will slow down applications that frequently query the time, of which there are many. So the Linux vDSO includes an implementation of gettimeofday(); that implementation can simply read a time variable in memory shared with the kernel and return it to the caller, avoiding the need to make a system call.
getrandom() is a similar sort of function; it reads data from the kernel and returns it to user space. So a vDSO implementation of getrandom() might make sense. Such an implementation must be done carefully, though; it should return data that is just as random as a direct call into the kernel would, and it must be robust against the types of events (a fork, for example) that could compromise the state of a thread's random-number generation.
In Donenfeld's implementation, user-space programs will continue to call getrandom() as usual, with no changes needed. Under the hood, though, the C library, which provides the getrandom() wrapper for the system call, must change significantly.
State-area allocation
The random-number generator works on some state data stored in memory. When random data is needed, a pseudo-random-number generator creates it from that state, mutating the state in the process. Every thread must have its own state, and care must be taken to avoid exposing that state during events like process forks, core dumps, virtual-machine forks, or checkpointing. That state should be reseeded with random data regularly, and specifically at any time when its content might have been compromised.
The vDSO implementation of getrandom() requires that the memory to be used for this state be allocated by the kernel. So the first thing that the C library must do is to allocate this state storage for as many threads as it thinks are likely to run. That is done with a new system call:
    struct vgetrandom_alloc_args {
        u64 flags;
        u64 num;
        u64 size_per_each;
        u64 bytes_allocated;
    };

    void *vgetrandom_alloc(struct vgetrandom_alloc_args *args, size_t args_len);
The structure pointed to by args describes the allocation request, while args_len is sizeof(*args); that allows the structure to be extended in a compatible way if needed in the future. Within that structure, flags must currently be zero, and num is the number of thread-state areas that the kernel is being requested to allocate. On a successful return, num will be set to the number of areas actually allocated, size_per_each describes the size of a state area, and bytes_allocated is the total amount of memory that was allocated. The return value will point to the base of the allocated area.
The allocated area is ordinary anonymous memory, except that it is specially marked within the kernel with a number of virtual-memory-area flags. The VM_WIPEONFORK flag causes its contents to be zeroed if the process forks (so that the two processes do not generate the same stream of random numbers), VM_DONTDUMP keeps its contents out of core dumps, and VM_NORESERVE prevents swap space from being reserved for it. Donenfeld also added a new flag for this area: VM_DROPPABLE allows the memory-management subsystem to simply reclaim the memory if need be; since this is anonymous memory, accessing it after it is reclaimed will cause a new, zero-filled page to be allocated. The result is memory that should remain private, but which can be zeroed (or reclaimed, which has the same effect) by the kernel at any time.
Generating random data
The kernel also shares some memory with the vDSO containing this structure:
    struct vdso_rng_data {
        u64 generation;
        u8 is_ready;
    };
This structure is used by the vDSO version of getrandom(), which has this prototype:
    ssize_t vgetrandom(void *buffer, size_t len, unsigned int flags,
                       void *opaque_state, size_t opaque_len);
The first three arguments mirror getrandom(), describing the amount of random data needed and whether the call should block waiting for the kernel's random-number generator to be ready. The final two, instead, describe one of the state areas allocated by vgetrandom_alloc(). This function's job is to provide the same behavior that getrandom() would.
It starts by looking at the is_ready field in the shared structure; if the kernel's random-number generator is not yet ready, vgetrandom() will just call getrandom() to handle the request. Once the random-number generator has initialized, though, that fallback will no longer be necessary. So the next thing to do is to compare the generation count (which tracks the number of times that the kernel's random-number generator has been reseeded) with a generation count stored in the state area. If the two don't match, then the state area must be reseeded with random data obtained from the kernel.
When the state area is first allocated, it is zeroed, so the generation number found there will be zero, which will never match the kernel's generation number; that will cause the state area to be seeded on the first call to vgetrandom(). The same thing will happen if this area has been cleared by the kernel, as the result of a fork (VM_WIPEONFORK) or the memory being reclaimed (VM_DROPPABLE), for example. So the kernel is able to clear that memory at any time in the knowledge that vgetrandom() will do the right thing.
Once the state area is known to be in a good condition, vgetrandom() uses it to generate the requested random data using the same algorithm used within the kernel itself. Doing this calculation securely is a bit tricky; if the process forks or core-dumps while it is underway, any data kept on the stack could be exposed. So vgetrandom() has to use an implementation of the ChaCha20 stream cipher that uses no stack at all. The patch series only includes an x86 implementation of this cipher; other architectures seem certain to follow.
As a final step before returning the generated data to the caller, vgetrandom() checks the generation number one more time. If, for example, the state area was wiped by the kernel while the call was executing, the generation-number check will fail. In such cases, vgetrandom() will throw away its work and start over.
Donenfeld described the end result of this work as "pretty stellar (around 15x for uint32_t generation)" and noted happily that "it seems to be working".
Prospects
LWN last looked at this work at the beginning of 2023. At that time, there were a number of objections, many of which were focused on the VM_DROPPABLE changes to the memory-management subsystem, which included some tricky, x86-specific code. When version 15 of the patch series was posted several months later, VM_DROPPABLE remained, but the logic had been simplified considerably in the hope of addressing those concerns, seemingly successfully. There does not appear to be anybody arguing against the inclusion of this series now.
As of the current version (20), this work has been added to linux-next for wider testing; if all goes well, it could go upstream as soon as the 6.11 merge window later this month. "If all goes well", of course, includes passing muster with Linus Torvalds, who has not commented this time around; he was not thrilled with previous versions, though. Should the mainline merge happen, the work to integrate the needed changes into the C libraries can begin. The end result will be a significant internal change, but the only thing that users should notice is that their programs run faster.
Index entries for this article
Kernel: Random numbers
Kernel: Security/Random number generation
Linus commenting....
Posted Jul 4, 2024 19:38 UTC (Thu) by knewt (subscriber, #32124) [Link] (1 responses)

The "deconflicting new syscall numbers for 6.11" thread on lkml: https://lore.kernel.org/lkml/CAHk-=wiGk+1eNy4Vk6QsEgM=Ru3...

Linus commenting....
Posted Jul 5, 2024 8:51 UTC (Fri) by skissane (subscriber, #38675) [Link]

https://lore.kernel.org/all/CAHk-=win2mesMNEfL-KZQ_jk1YH8...

Make vm fork information available instead?
Posted Jul 4, 2024 19:54 UTC (Thu) by mb (subscriber, #50428) [Link] (6 responses)

Wouldn't it be better to make fork information more easily available to userspace? A userspace RNG could react on it and it could potentially have more uses beyond RNG. This all sounds like it should be solved at a lower level.

Make vm fork information available instead?
Posted Jul 4, 2024 22:12 UTC (Thu) by bluca (subscriber, #118303) [Link] (4 responses)

https://lore.kernel.org/lkml/e1c03136-b873-1f1d-8b06-d918...
https://lore.kernel.org/lkml/e09ce9fd-14cb-47aa-a22d-d295...

Make vm fork information available instead?
Posted Jul 5, 2024 4:49 UTC (Fri) by donald.buczek (subscriber, #112892) [Link] (3 responses)

Make vm fork information available instead?
Posted Jul 5, 2024 15:24 UTC (Fri) by WolfWings (subscriber, #56790) [Link] (2 responses)

Make vm fork information available instead?
Posted Jul 5, 2024 15:29 UTC (Fri) by bluca (subscriber, #118303) [Link] (1 responses)

Make vm fork information available instead?
Posted Jul 5, 2024 15:32 UTC (Fri) by bluca (subscriber, #118303) [Link]

Make vm fork information available instead?
Posted Jul 5, 2024 10:53 UTC (Fri) by fw (subscriber, #26023) [Link]

It does not cover VM forks, though. People disagree whether this is a problem.

End goal
Posted Jul 4, 2024 22:39 UTC (Thu) by comex (subscriber, #71521) [Link] (7 responses)

I suppose it’s enough for some cases. Suppose you need to sign a message with a cryptographic scheme that requires a random nonce, where it’s very bad to sign two messages with the same (or perhaps related) nonce. Then you can ensure that your code generates the nonce with a single call to getrandom(), *after* reading in or otherwise determining the message to be signed. If the VM forks before or during the call to getrandom(), you’ll get a fresh new nonce (theoretically). If it forks after, then you’ll get the same nonce, but it’ll be used to sign the same message, which isn’t (necessarily) insecure.

But what about more complex protocols, particularly interactive ones (e.g. TLS)? Is a single call to getrandom() a flexible enough API? It feels like programs need a way to say “abort this connection if a VM fork has happened at any time since the connection started”. Perhaps even something that’s atomic with send() somehow.

Disclaimer: Not a cryptography expert.

End goal
Posted Jul 4, 2024 23:56 UTC (Thu) by josh (subscriber, #17465) [Link] (6 responses)

End goal
Posted Jul 5, 2024 4:17 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (5 responses)

The only case I can think of where this could possibly work is when the application is in cahoots with the VM management plane and intentionally causes VM forks to occur as part of its request handling logic. But in that case, this is trivial: At worst, you have to design your request handling logic to ensure that the fork does not happen at the same time as some delicate crypto code is running (e.g. take a lock). Which you were maybe already doing anyway, since some code in this genre uses a userspace CSPRNG instead of getrandom (for the performance reasons cited in the article), and that absolutely requires you to be aware of forking and mitigate it.

End goal
Posted Jul 5, 2024 5:51 UTC (Fri) by comex (subscriber, #71521) [Link] (4 responses)

As for why the VM forks in the first place, well, as one possibility, it could be a desktop VM which the user manually chose to fork (while some service was talking to the network in the background). Some desktop VM software offers cloning as an option. Or even without cloning, the risks seem similar if the VM is just restored from a snapshot.

Admittedly, waiting for a desktop VM to be forked/restored seems like a pretty niche thing for an attacker to do, but not completely unrealistic. I'm sure there are people who make a habit of regularly restoring their VMs from snapshot.

End goal
Posted Jul 5, 2024 20:28 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (3 responses)

That would require the application to be originally deployed in a broken state where it randomly drops TCP connections for no apparent reason. Maybe there are some people who do that, but I wouldn't want to work there.

End goal
Posted Jul 5, 2024 20:43 UTC (Fri) by comex (subscriber, #71521) [Link] (2 responses)

End goal
Posted Jul 5, 2024 23:28 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (1 responses)

End goal
Posted Jul 5, 2024 23:31 UTC (Fri) by NYKevin (subscriber, #129325) [Link]

droppable sounds great
Posted Jul 5, 2024 5:00 UTC (Fri) by ringerc (subscriber, #3071) [Link] (3 responses)

That sounds like something that would be exceedingly handy for garbage collected languages' object reuse lists, various in-app caches, etc.

droppable sounds great
Posted Jul 5, 2024 7:12 UTC (Fri) by smurf (subscriber, #17840) [Link]

droppable sounds great
Posted Jul 5, 2024 8:15 UTC (Fri) by jtaylor (subscriber, #91739) [Link] (1 responses)

I'm not quite clear on the difference, I assume VM_DROPPABLE can have memory reset to zero on read at any time while MADV_FREE can only be zeroed after the madvise once?

droppable sounds great
Posted Jul 5, 2024 9:01 UTC (Fri) by skissane (subscriber, #38675) [Link]

The difference in semantics with MADV_FREE is explained in the comments on the last file in the patch (mm/rmap.c):

1. "Unlike MADV_FREE mappings, VM_DROPPABLE ones can be dropped even if they've been dirtied."
2. "Unlike MADV_FREE mappings, VM_DROPPABLE ones never get swap backed on failure to drop."

Known number of threads - bad assumption?
Posted Jul 5, 2024 7:44 UTC (Fri) by taladar (subscriber, #68407) [Link] (1 responses)

Known number of threads - bad assumption?
Posted Jul 5, 2024 11:28 UTC (Fri) by corbet (editor, #1) [Link]

You can always allocate more space for more threads, so that should not be a problem.

Not sure this solves more headaches than it creates
Posted Jul 5, 2024 11:15 UTC (Fri) by neverpanic (subscriber, #99747) [Link]

Linus insists on actual users speaking up, but I'm convinced those do exist. Heck, we looked at the OpenSSL RNG setup and thought about ripping out their three-tiered primary-per-process and public/private-per-thread DRBG tree and just replacing it, or parts of it, with getrandom(). Now, we considered this because of FIPS certification and the requirements it imposes on chained DRBGs, all of which become much simpler if you just don't chain.

But, and this brings me to the two points I want to make: Merging this patchset in Linux does not solve the issue of a slow RNG on other platforms, so cryptographic libraries with cross platform support (i.e., essentially all of them) will still keep a multi-level setup, seed from the kernel and then run a user space DRBG, or just continue accepting that they are slow. So yeah, it's great if this would exist, but anybody who takes this seriously and isn't just optimizing for a single platform already has to deal with this in user space anyway, so there really isn't as big of a gain as the author claims.

And second, the author claims this would use the exact same DRBG algorithm as used by the kernel, but completely ignores that the kernel will use a different DRBG in FIPS mode (because ChaCha20 isn't FIPS-compliant). This is also the source of the headache I have with the proposal. What's currently in the kernel sort of works for us, but if this lands we need to take a closer look to either figure out how to use a CTR-DRBG in user space instead of the ChaCha20 one, plus fulfill the DRBG chaining requirements that NIST requires, or figure out a way to completely disable this.

For these reasons, I don't think this should land. Just the generation counter that Linus proposed, that sounds like a good improvement for user space to do the right thing™ on fork or VM cloning, but the rest of the stuff is possibly an improvement for the top 0.5% of cloud load balancers, and pointless for everybody else.

RFC 7539
Posted Aug 2, 2024 19:48 UTC (Fri) by steve1440 (subscriber, #166333) [Link]

https://datatracker.ietf.org/doc/html/rfc8439