Another try for getrandom() in the vDSO
Random data is used in no end of applications, including modeling of natural phenomena, generation of identifiers like UUIDs, and game play; it's how NetHack somehow summons those three balrogs all in one corner of a dark room. Security-related operations, such as the generation of nonces and keys, also make heavy use of random data — and depend heavily on that data actually being random. Some applications need so much random data that they spend a significant amount of time in the getrandom() system call.
One possible solution to this problem is to generate random numbers in user space, perhaps seeded by random data from the kernel; many developers have taken that route over the years. But, as Jason A. Donenfeld explains in the cover letter to his patch series, that approach is not ideal. The kernel has the best view of the amount of entropy in the system and of what is needed to generate truly random data. It is also aware of events, such as virtual-machine forks, that can compromise a random-number generator and make reseeding necessary. He concluded:
The simplest statement you could make is that userspace RNGs that expand a getrandom() seed at some point T1 are nearly always *worse*, in some way, than just calling getrandom() every time a random number is desired.
Always calling getrandom() ensures the best random data, but the associated performance problem remains. Moving that function into the vDSO can help to address that problem.
getrandom() in the vDSO
The vDSO is a special mechanism provided to accelerate tasks that require some kernel involvement, but which can otherwise be carried out just as well in user space. It contains code and data provided in user space directly by the kernel in a memory area mapped into every thread's address space. The classic vDSO function is gettimeofday(), which returns the current system time as kept by the kernel. This function can be implemented as a system call, but that will slow down applications that frequently query the time, of which there are many. So the Linux vDSO includes an implementation of gettimeofday(); that implementation can simply read a time variable in memory shared with the kernel and return it to the caller, avoiding the need to make a system call.
getrandom() is a similar sort of function; it reads data from the kernel and returns it to user space. So a vDSO implementation of getrandom() might make sense. Such an implementation must be done carefully, though; it should return data that is just as random as a direct call into the kernel would, and it must be robust against the types of events (a fork, for example) that could compromise the state of a thread's random-number generation.
In Donenfeld's implementation, user-space programs will continue to call getrandom() as usual, with no changes needed. Under the hood, though, the C library, which provides the getrandom() wrapper for the system call, must change significantly.
State-area allocation
The random-number generator works on some state data stored in memory. When random data is needed, a pseudo-random-number generator creates it from that state, mutating the state in the process. Every thread must have its own state, and care must be taken to avoid exposing that state during events like process forks, core dumps, virtual-machine forks, or checkpointing. That state should be reseeded with random data regularly, and specifically at any time when its content might have been compromised.
The vDSO implementation of getrandom() requires that the memory to be used for this state be allocated by the kernel. So the first thing that the C library must do is to allocate this state storage for as many threads as it thinks are likely to run. That is done with a new system call:
    struct vgetrandom_alloc_args {
        u64 flags;
        u64 num;
        u64 size_per_each;
        u64 bytes_allocated;
    };

    void *vgetrandom_alloc(struct vgetrandom_alloc_args *args, size_t args_len);
The structure pointed to by args describes the allocation request, while args_len is sizeof(*args); that allows the structure to be extended in a compatible way if needed in the future. Within that structure, flags must currently be zero, and num is the number of thread-state areas that the kernel is being requested to allocate. On a successful return, num will be set to the number of areas actually allocated, size_per_each describes the size of a state area, and bytes_allocated is the total amount of memory that was allocated. The return value will point to the base of the allocated area.
The allocated area is ordinary anonymous memory, except that it is specially marked within the kernel with a number of virtual-memory-area flags. The VM_WIPEONFORK flag causes its contents to be zeroed if the process forks (so that the two processes do not generate the same stream of random numbers), VM_DONTDUMP keeps its contents out of core dumps, and VM_NORESERVE prevents swap space from being reserved for it. Donenfeld also added a new flag for this area: VM_DROPPABLE allows the memory-management subsystem to simply reclaim the memory if need be; since this is anonymous memory, accessing it after it is reclaimed will cause a new, zero-filled page to be allocated. The result is memory that should remain private, but which can be zeroed (or reclaimed, which has the same effect) by the kernel at any time.
Generating random data
The kernel also shares some memory with the vDSO containing this structure:
    struct vdso_rng_data {
        u64 generation;
        u8 is_ready;
    };
This structure is used by the vDSO version of getrandom(), which has this prototype:
    ssize_t vgetrandom(void *buffer, size_t len, unsigned int flags,
                       void *opaque_state, size_t opaque_len);
The first three arguments mirror getrandom(), describing the amount of random data needed and whether the call should block waiting for the kernel's random-number generator to be ready. The final two, instead, describe one of the state areas allocated by vgetrandom_alloc(). This function's job is to provide the same behavior that getrandom() would.
It starts by looking at the is_ready field in the shared structure; if the kernel's random-number generator is not yet ready, vgetrandom() will just call getrandom() to handle the request. Once the random-number generator has initialized, though, that fallback will no longer be necessary. So the next thing to do is to compare the generation count (which tracks the number of times that the kernel's random-number generator has been reseeded) with a generation count stored in the state area. If the two don't match, then the state area must be reseeded with random data obtained from the kernel.
When the state area is first allocated, it is zeroed, so the generation number found there will be zero, which will never match the kernel's generation number; that will cause the state area to be seeded on the first call to vgetrandom(). The same thing will happen if this area has been cleared by the kernel, as the result of a fork (VM_WIPEONFORK) or the memory being reclaimed (VM_DROPPABLE), for example. So the kernel is able to clear that memory at any time in the knowledge that vgetrandom() will do the right thing.
Once the state area is known to be in a good condition, vgetrandom() uses it to generate the requested random data using the same algorithm used within the kernel itself. Doing this calculation securely is a bit tricky; if the process forks or core-dumps while it is underway, any data kept on the stack could be exposed. So vgetrandom() has to use an implementation of the ChaCha20 stream cipher that uses no stack at all. The patch series only includes an x86 implementation of this cipher; other architectures seem certain to follow.
As a final step before returning the generated data to the caller, vgetrandom() checks the generation number one more time. If, for example, the state area was wiped by the kernel while the call was executing, the generation-number check will fail. In such cases, vgetrandom() will throw away its work and start over.
Donenfeld described the end result of this work as "pretty stellar (around 15x for uint32_t generation)" and noted happily that "it seems to be working".
Prospects
LWN last looked at this work at the beginning of 2023. At that time, there were a number of objections, many of which were focused on the VM_DROPPABLE changes to the memory-management subsystem, which included some tricky, x86-specific code. When version 15 of the patch series was posted several months later, VM_DROPPABLE remained, but the logic had been simplified considerably in the hope of addressing those concerns, seemingly successfully. There does not appear to be anybody arguing against the inclusion of this series now.
As of the current version (20), this work has been added to linux-next for wider testing; if all goes well, it could go upstream as soon as the 6.11 merge window later this month. "If all goes well", of course, includes passing muster with Linus Torvalds, who has not commented this time around; he was not thrilled with previous versions, though. Should the mainline merge happen, the work to integrate the needed changes into the C libraries can begin. The end result will be a significant internal change, but the only thing that users should notice is that their programs run faster.
Index entries for this article
Kernel: Random numbers
Kernel: Security/Random number generation
Linus commenting....
Posted Jul 4, 2024 19:38 UTC (Thu) by knewt (subscriber, #32124) [Link] (1 responses)

The "deconflicting new syscall numbers for 6.11" thread on lkml: https://lore.kernel.org/lkml/CAHk-=wiGk+1eNy4Vk6QsEgM=Ru3...

Linus commenting....
Posted Jul 5, 2024 8:51 UTC (Fri) by skissane (subscriber, #38675) [Link]

https://lore.kernel.org/all/CAHk-=win2mesMNEfL-KZQ_jk1YH8...

Make vm fork information available instead?
Posted Jul 4, 2024 19:54 UTC (Thu) by mb (subscriber, #50428) [Link] (6 responses)

Wouldn't it be better to make fork information more easily available to userspace? A userspace RNG could react on it and it could potentially have more uses beyond RNG. This all sounds like it should be solved at a lower level.

Make vm fork information available instead?
Posted Jul 4, 2024 22:12 UTC (Thu) by bluca (subscriber, #118303) [Link] (4 responses)

https://lore.kernel.org/lkml/e1c03136-b873-1f1d-8b06-d918...
https://lore.kernel.org/lkml/e09ce9fd-14cb-47aa-a22d-d295...

Make vm fork information available instead?
Posted Jul 5, 2024 4:49 UTC (Fri) by donald.buczek (subscriber, #112892) [Link] (3 responses)

Make vm fork information available instead?
Posted Jul 5, 2024 15:24 UTC (Fri) by WolfWings (subscriber, #56790) [Link] (2 responses)

Make vm fork information available instead?
Posted Jul 5, 2024 15:29 UTC (Fri) by bluca (subscriber, #118303) [Link] (1 responses)

Make vm fork information available instead?
Posted Jul 5, 2024 15:32 UTC (Fri) by bluca (subscriber, #118303) [Link]

Make vm fork information available instead?
Posted Jul 5, 2024 10:53 UTC (Fri) by fw (subscriber, #26023) [Link]

It does not cover VM forks, though. People disagree whether this is a problem.

End goal
Posted Jul 4, 2024 22:39 UTC (Thu) by comex (subscriber, #71521) [Link] (7 responses)

I suppose it’s enough for some cases. Suppose you need to sign a message with a cryptographic scheme that requires a random nonce, where it’s very bad to sign two messages with the same (or perhaps related) nonce. Then you can ensure that your code generates the nonce with a single call to getrandom(), *after* reading in or otherwise determining the message to be signed. If the VM forks before or during the call to getrandom(), you’ll get a fresh new nonce (theoretically). If it forks after, then you’ll get the same nonce, but it’ll be used to sign the same message, which isn’t (necessarily) insecure.

But what about more complex protocols, particularly interactive ones (e.g. TLS)? Is a single call to getrandom() a flexible enough API? It feels like programs need a way to say “abort this connection if a VM fork has happened at any time since the connection started”. Perhaps even something that’s atomic with send() somehow.

Disclaimer: Not a cryptography expert.

End goal
Posted Jul 4, 2024 23:56 UTC (Thu) by josh (subscriber, #17465) [Link] (6 responses)

End goal
Posted Jul 5, 2024 4:17 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (5 responses)

The only case I can think of where this could possibly work is when the application is in cahoots with the VM management plane and intentionally causes VM forks to occur as part of its request handling logic. But in that case, this is trivial: At worst, you have to design your request handling logic to ensure that the fork does not happen at the same time as some delicate crypto code is running (e.g. take a lock). Which you were maybe already doing anyway, since some code in this genre uses a userspace CSPRNG instead of getrandom (for the performance reasons cited in the article), and that absolutely requires you to be aware of forking and mitigate it.

End goal
Posted Jul 5, 2024 5:51 UTC (Fri) by comex (subscriber, #71521) [Link] (4 responses)

As for why the VM forks in the first place, well, as one possibility, it could be a desktop VM which the user manually chose to fork (while some service was talking to the network in the background). Some desktop VM software offers cloning as an option. Or even without cloning, the risks seem similar if the VM is just restored from a snapshot.

Admittedly, waiting for a desktop VM to be forked/restored seems like a pretty niche thing for an attacker to do, but not completely unrealistic. I'm sure there are people who make a habit of regularly restoring their VMs from snapshot.

End goal
Posted Jul 5, 2024 20:28 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (3 responses)

That would require the application to be originally deployed in a broken state where it randomly drops TCP connections for no apparent reason. Maybe there are some people who do that, but I wouldn't want to work there.

End goal
Posted Jul 5, 2024 20:43 UTC (Fri) by comex (subscriber, #71521) [Link] (2 responses)

End goal
Posted Jul 5, 2024 23:28 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (1 responses)

End goal
Posted Jul 5, 2024 23:31 UTC (Fri) by NYKevin (subscriber, #129325) [Link]

droppable sounds great
Posted Jul 5, 2024 5:00 UTC (Fri) by ringerc (subscriber, #3071) [Link] (3 responses)

That sounds like something that would be exceedingly handy for garbage collected languages' object reuse lists, various in-app caches, etc.

droppable sounds great
Posted Jul 5, 2024 7:12 UTC (Fri) by smurf (subscriber, #17840) [Link]

droppable sounds great
Posted Jul 5, 2024 8:15 UTC (Fri) by jtaylor (subscriber, #91739) [Link] (1 responses)

I'm not quite clear on the difference, I assume VM_DROPPABLE can have memory reset to zero on read at any time while MADV_FREE can only be zeroed after the madvise once?

droppable sounds great
Posted Jul 5, 2024 9:01 UTC (Fri) by skissane (subscriber, #38675) [Link]

The difference in semantics with MADV_FREE is explained in the comments on the last file in the patch (mm/rmap.c):

1. "Unlike MADV_FREE mappings, VM_DROPPABLE ones can be dropped even if they've been dirtied."
2. "Unlike MADV_FREE mappings, VM_DROPPABLE ones never get swap backed on failure to drop."

Known number of threads - bad assumption?
Posted Jul 5, 2024 7:44 UTC (Fri) by taladar (subscriber, #68407) [Link] (1 responses)

Known number of threads - bad assumption?
Posted Jul 5, 2024 11:28 UTC (Fri) by corbet (editor, #1) [Link]

You can always allocate more space for more threads, so that should not be a problem.

Not sure this solves more headaches than it creates
Posted Jul 5, 2024 11:15 UTC (Fri) by neverpanic (subscriber, #99747) [Link]

Linus insists on actual users speaking up, but I'm convinced those do exist. Heck, we looked at the OpenSSL RNG setup and thought about ripping out their three-tiered primary-per-process and public/private-per-thread DRBG tree and just replacing it, or parts of it, with getrandom(). Now, we considered this because of FIPS certification and the requirements it imposes on chained DRBGs, all of which become much simpler if you just don't chain.

But, and this brings me to the two points I want to make: Merging this patchset in Linux does not solve the issue of a slow RNG on other platforms, so cryptographic libraries with cross platform support (i.e., essentially all of them) will still keep a multi-level setup, seed from the kernel and then run a user space DRBG, or just continue accepting that they are slow. So yeah, it's great if this would exist, but anybody who takes this seriously and isn't just optimizing for a single platform already has to deal with this in user space anyway, so there really isn't as big of a gain as the author claims.

And second, the author claims this would use the exact same DRBG algorithm as used by the kernel, but completely ignores that the kernel will use a different DRBG in FIPS mode (because ChaCha20 isn't FIPS-compliant). This is also the source of the headache I have with the proposal. What's currently in the kernel sort of works for us, but if this lands we need to take a closer look to either figure out how to use a CTR-DRBG in user space instead of the ChaCha20 one, plus fulfill the DRBG chaining requirements that NIST requires, or figure out a way to completely disable this.

For these reasons, I don't think this should land. Just the generation counter that Linus proposed, that sounds like a good improvement for user space to do the right thing™ on fork or VM cloning, but the rest of the stuff is possibly an improvement for the top 0.5% of cloud load balancers, and pointless for everybody else.

RFC 7539
Posted Aug 2, 2024 19:48 UTC (Fri) by steve1440 (subscriber, #166333) [Link]

https://datatracker.ietf.org/doc/html/rfc8439