Fixing getrandom()

By Jake Edge
September 27, 2019

A report of a boot hang in the 5.3 series has led to an enormous, somewhat contentious thread on the linux-kernel mailing list. The proximate cause was some changes that made the ext4 filesystem do less I/O early in the boot phase, incidentally causing fewer interrupts, but the underlying issue was the getrandom() system call, which was blocking until the /dev/urandom pool was initialized, as designed. Since the system in question was not gathering enough entropy due to the lack of unpredictable interrupt timings, the call would hang more or less forever. That has called into question the design and implementation of getrandom().

Ahmed S. Darwish reported the original problem and tracked it down to the GNOME Display Manager (GDM), which handles graphical logins. It turns out that GDM was calling getrandom() in order to generate the "MIT magic cookie" that is used for authorization by the X Window System. As was pointed out by several in the mega-thread, using cryptographic-strength random numbers for the cookie (or much of anything in terms of X Window security) is well beyond the pale—a much weaker random number generator could have been used with no loss of security. Darwish noted that the call "only" requests a small number of random bytes (five calls requesting 16 bytes each) but, as Theodore Y. Ts'o said, that doesn't matter: by default getrandom() will not return anything until the cryptographic random number generator (CRNG) is initialized—which requires entropy.

When Darwish originally bisected the problem, he pinpointed an ext4 commit that had the effect of reducing the amount of disk I/O that was being done early in the boot process. That performance enhancement also, unfortunately, turned out to reduce the amount of entropy gathered on Darwish's laptop—to the point it would not boot. That change has been reverted for now.

getrandom()

Back in 2014, getrandom() was added at least partly in response to a complaint from the LibreSSL project that Linux lacked a way to get random numbers in the face of file-descriptor exhaustion. The "approved" mechanism was to read from /dev/urandom, but if an attacker arranged that all of the file descriptors were already open, that method could fail. So getrandom() was created to provide a way to get random numbers without a file descriptor or, even, a visible /dev/urandom (e.g. from a container or chroot()). In fact, getrandom() was intentionally designed to block until the /dev/urandom pool is initialized; prior to getrandom() there was no way for user space to be sure that enough entropy had been gathered to properly initialize the pool. Since the behavior of /dev/urandom is part of the kernel ABI, it could not change; a new system call was under no such constraints, of course.
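As a minimal sketch of the difference described above, the following Python compares the two mechanisms. It assumes a Linux system and Python 3.6 or later, where os.getrandom() wraps the system call; the helper names are illustrative and not taken from any project mentioned here.

```python
import os


def random_bytes_via_syscall(n):
    """Ask the kernel directly: no file descriptor or device node is needed.

    With flags=0 this blocks until the CRNG has been initialized.
    """
    return os.getrandom(n, 0)


def random_bytes_via_device(n, path="/dev/urandom"):
    """The older mechanism: needs a free file descriptor and a visible
    device node, and does not block even if the pool is uninitialized."""
    with open(path, "rb", buffering=0) as dev:
        return dev.read(n)


if __name__ == "__main__":
    print(random_bytes_via_syscall(16).hex())
    print(random_bytes_via_device(16).hex())
```

The second helper is the one that can fail under file-descriptor exhaustion or inside a chroot() without /dev/urandom; the first cannot fail that way, but it can block early in boot.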

Initializing the CRNG requires 512 bits of estimated entropy, or 4096 interrupts using the current calculations, which Ts'o said were conservatively chosen. getrandom() is clearly documented to block until that happens, but that has not stopped user space from sometimes using it incorrectly. Ts'o said that is going to be a problem moving forward:

Ultimately, though, we need to find *some* way to fix userspace's assumptions that they can always get high quality entropy in early boot, or we need to get over people's distrust of Intel and RDRAND. Otherwise, future performance improvements in any part of the system which reduces the number of interrupts is always going to potentially result in somebody's misconfigured system or badly written applications to fail to boot. :-(

Linus Torvalds noted that the RDRAND instruction does not exist everywhere, so it is no panacea. He is also concerned that problems stemming from fewer interrupts will only get worse:

The interrupt thing is only going to get worse as disks turn into ssd's and some of them end up using polling rather than interrupts.. So we're likely to see _fewer_ interrupts in the future, not more.

Error return

Ts'o suggested adding a kind of "fail safe" flag that would let callers request that getrandom() only block for, say, two minutes; after that, the best available random numbers would be returned. But Torvalds believes that blocking by default is simply wrong. He said that any new flag should request the blocking behavior explicitly so that unthinking users get what they expect. Or perhaps an error could be returned:

An alternative might be to make getrandom() just return an error instead of waiting. Sure, fill the buffer with "as random as we can" stuff, but then return -EINVAL because you called us too early.

Several seemed in agreement with that approach; Darwish posted an RFC patch along those lines. Alexander E. Patrakov reworked the commit message, but also complained about the idea of returning an error and forcing user space to deal with the problem ("the whole result looks like shifting the responsibility/blame without achieving anything useful"). Ts'o clearly thinks it is a bad idea, overall, but somewhat waspishly further modified the patch so that the blocking behavior was configurable at build time (or via a kernel command-line parameter). Darwish took that one step further and increased the length of the commit message again, adding even more background and details. In addition, at Torvalds's request, the -EINVAL return was removed, so that getrandom() effectively reverted to the same behavior as reading from /dev/urandom: callers get the "best" randomness available at the time of the call.

Lennart Poettering disagreed with that approach, calling it "sticking your head in the sand" by providing bad random numbers to potentially sensitive key-generation operations early in the boot process. He suggested that the problem is not in the kernel at all and that it should be solved in user space:

Let the people putting together systems deal with this. Let them provide a creditable hw rng, and let them pay the price if they don't.

Part of the problem may be that once the GNU C Library (glibc) got around to adding a wrapper for getrandom(), an OpenBSD-like getentropy() call was also added. However the OpenBSD version does not block, while the glibc version is implemented using getrandom() and, thus, can block indefinitely in the early boot process. Developers calling getentropy() might well be unaware of this little "gotcha"—though it is documented in the man page. As Torvalds and others mentioned in the thread, another problem is that once the system has blocked waiting for entropy, said entropy is likely to never arrive. User space needs to cause things to happen (e.g. keys pressed, disks accessed) to produce the interrupts necessary to get the CRNG initialized.

Yet another part of the problem that Torvalds sees is that there are (at least) two different kinds of users of getrandom() who are passing 0 for the flags value (which he calls "getrandom(0)"): those that actually want/need to block in order to get random numbers only after the CRNG has been initialized and those who are just after "good" random numbers and didn't think too hard about it. Callers of glibc's getentropy() could also fall into that latter category.
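For callers in the second group, something like the "fail safe" behavior Ts'o floated can be approximated in user space with the existing GRND_NONBLOCK flag. The following is a rough sketch under that assumption, not a description of any proposed kernel patch; the function name, the polling interval, and the fallback to reading /dev/urandom are all choices made for illustration.

```python
import os
import time


def getrandom_with_deadline(n, timeout=120.0, poll_interval=0.1):
    """Return (data, crng_ready).

    Polls getrandom() in non-blocking mode until `timeout` seconds have
    passed. After that it falls back to /dev/urandom, which does not block
    but may return weak output if the CRNG is still uninitialized.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            return os.getrandom(n, os.GRND_NONBLOCK), True
        except BlockingIOError:
            # EAGAIN: the CRNG has not been initialized yet.
            if time.monotonic() >= deadline:
                break
            time.sleep(poll_interval)
    with open("/dev/urandom", "rb", buffering=0) as dev:
        return dev.read(n), False
```

Returning the crng_ready flag alongside the data lets a caller that generates long-lived keys refuse the weak result, while a caller that only needs a cookie can carry on. Note that sleeping in a loop does nothing to produce entropy, which is the deadlock concern raised in the thread.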

Limiting delays

Unsurprisingly, Torvalds was not in favor of a configuration option; his first solution was to limit the wait time of getrandom() to 15 seconds on the first call when the CRNG is not initialized, reducing the delay on each subsequent call so that the maximum possible delay is 30 seconds. The code returns -EAGAIN in that case so that user space can detect it. In a comment in the code (repeated in the email message), he said: "Just asking for blocking random numbers is completely and fundamentally wrong, and the kernel will not play that game."
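The 30-second bound follows from simple arithmetic if each wait is half the previous one. The article gives only the 15-second first wait and the 30-second maximum; the halving schedule below is an assumption that is consistent with both numbers, not a statement of what the patch does.

```python
def cumulative_waits(first_wait=15.0, calls=10):
    """Total time spent waiting after each of `calls` timed-out calls,
    assuming every timeout is half as long as the one before it."""
    totals = []
    wait, total = first_wait, 0.0
    for _ in range(calls):
        total += wait
        totals.append(total)
        wait /= 2
    return totals


totals = cumulative_waits()
# The running total approaches, but never reaches, 2 * first_wait = 30 seconds.
for call_number, total in enumerate(totals, start=1):
    print(f"after call {call_number}: waited {total} seconds in total")
```

Under this assumption the total after n timed-out calls is 30 * (1 - 2**-n) seconds, a geometric series that stays below 30 however many calls are made.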

That set off another huge sub-thread. Poettering once again said that the problem should not be solved by the kernel in a "never trust userspace" fashion. Darwish posted another version of his patch set that proposed a getrandom2() system call, which used new flag names to be ever more explicit about the intentions of the caller. But there are still plenty of flag bits available for getrandom(), Torvalds said, so introducing a new system call seems unnecessary.

Instead, he suggested reworking the flag values to better represent what was being asked for. The GRND_EXPLICIT flag would be used to indicate that user space "knows what it is doing", so if it explicitly asks to block forever, that will be honored. The GRND_SECURE and GRND_INSECURE values would ask for blocking and non-blocking behavior respectively, but both would also set the GRND_EXPLICIT bit. The patch left the getrandom(0) case alone, but Torvalds has plans for that as well:

In particular, this still leaves the semantics of that nasty "getrandom(0)" as the same "blocking urandom" that it currently is. But now it's a separate case, and we can make that perhaps do the timeout, or at least the warning.

And the new cases are defined to *not* warn. In particular, GRND_INSECURE very much does *not* warn about early urandom access when crng isn't ready. Because the whole point of that new mode is that the user knows it isn't secure.
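The way the proposed flags nest can be illustrated with a small sketch. The bit values for the three new names are invented here purely for illustration, since the article does not give the numbers used in the patch; only the relationship (both GRND_SECURE and GRND_INSECURE include the GRND_EXPLICIT bit) comes from the description above.

```python
# Hypothetical bit values: assumptions for illustration only.
GRND_EXPLICIT = 0x0004   # assumed value: "user space knows what it is doing"
_SECURE_BIT = 0x0008     # assumed value
_INSECURE_BIT = 0x0010   # assumed value

# Both new modes carry the EXPLICIT bit, as described in the article.
GRND_SECURE = _SECURE_BIT | GRND_EXPLICIT
GRND_INSECURE = _INSECURE_BIT | GRND_EXPLICIT


def describe(flags):
    """Classify a flags value the way the proposal distinguishes callers."""
    if flags == 0:
        return "legacy getrandom(0): intent unknown, may warn or time out"
    if not flags & GRND_EXPLICIT:
        return "legacy flags: existing behavior"
    if flags & _INSECURE_BIT:
        return "explicit: never block, no warning"
    if flags & _SECURE_BIT:
        return "explicit: block until the CRNG is ready"
    return "explicit: other"


if __name__ == "__main__":
    for value in (0, GRND_SECURE, GRND_INSECURE):
        print(hex(value), "->", describe(value))
```

The point of the design is that a zero flags value becomes distinguishable from a deliberate request, so the kernel can treat the ambiguous case differently without affecting callers that stated their intent.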

Ts'o wondered why the getrandom(0) case was not simply mapped to the same behavior as GRND_SECURE, which would effectively be the same as it is today, but Torvalds was adamant that was the wrong approach; he is concerned that Ts'o is getting overly caught up in what Torvalds sees as theoretical attacks and is missing the real getrandom(0) problem. Torvalds intends the patch to be backported to the stable kernels, so any change to getrandom(0) will be in a separate, mainline-only patch.

Jitter entropy

Patrakov asked about using "jitter entropy" as is done by the haveged entropy daemon. Using haveged is being suggested by some distributions as a way to ensure that there is enough entropy early in the system boot. He noted that the technique is controversial as some are concerned that it is not truly random data. Torvalds said that he is one of the skeptics, but thought it might provide a solution to the current mess:

I would perhaps be willing to just put my foot down, and say "ok, we'll solve the 'getrandom(0)' issue by just saying that if that blocks too much, we'll do the jitter entropy thing".

Making absolutely nobody happy, but working in practice. And maybe encouraging the people who don't like jitter entropy to use GRND_SECURE instead.

But getrandom() has been in the kernel for five years and in glibc for more than two years, so it is clearly part of the kernel ABI. The behavior that some want when they call getrandom(0) should not be arbitrarily changed to provide "bad" random numbers in a way that breaks user-space programs. That was the upshot of Andy Lutomirski's argument in the thread. He agreed that the getrandom() call was poorly thought out before it was added, but that should not change now:

There are programs that call getrandom(0) *today* that expect secure output. openssl does a horrible dance in which it calls getentropy() if available and falls back to syscall(__NR_getrandom, buf, buflen, 0) otherwise. We can't break this use case. Changing the semantics of getrandom(0) out from under them seems like the worst kind of ABI break -- existing applications will *appear* to continue working but will, in fact, become insecure.

Lutomirski also believes it is a "straight up kernel bug" that blocking in getrandom(0) early in the boot deadlocks the system by waiting for entropy. He suggested actively fixing that problem: "How about we make getrandom() (probably actually wait_for_random_bytes()) do something useful to try to seed the RNG if the system is otherwise not doing IO." Torvalds is in agreement with that, though he seems to be leaning toward the jitter-entropy stopgap:

And yes, we'll have to block - at least for a time - to get some entropy. But at some point we either start making entropy up, or we say "0 means jitter-entropy for ten seconds".

That will _work_, but it will also make the security-people nervous, which is just one more hint that they should move to GRND_SECURE[_BLOCKING].

The goal is to ensure that callers are really aware that they are asking to block (and potentially deadlock) waiting for the CRNG to be properly initialized. In the ambiguous default case, that may well not be the case, so Torvalds is determined to find a way to make that not block:

So we absolutely _will_ come up with some way 0 ends the wait. Whether it's _just_ a timeout, or whether it's jitter-entropy or whatever, it will happen.

He is concerned about the amount of time it might take to gather enough jitter entropy to initialize the CRNG, however. He suggested that he was willing to block as long as 15 seconds, but thought that might require some kind of accelerated jitter-entropy technique. Patrakov said that acceleration was not needed as the existing technique can generate plenty of entropy in two seconds. In addition, as had also been noted elsewhere in the thread, Matthew Garrett pointed out that the Zircon kernel for the Fuchsia operating system initializes its CRNG using jitter entropy, which may lend some credibility to the technique.

ABI

In a departure from his usual stance, Torvalds seems fairly unconcerned about changing the kernel ABI in this case. He said that any breakage from changing getrandom(0) to time out was theoretical, but that the boot deadlock problem was real. In order for the generation of keys to fail under that scheme, he said, they would have to be generated at boot on idle machines that are not doing anything that would allow entropy to be collected. As Garrett noted, though, that is the exact scenario for which the getrandom(0) behavior was designed. Torvalds does not see that kind of key generation as anything other than a hypothetical, it seems.

The main difference between the proposals from Torvalds and Lutomirski is whether to provide some way for getrandom() callers to block, possibly forever. Torvalds is willing to have that as a non-default option to getrandom(), while Lutomirski would prefer to simplify getrandom() (though the patch text calls it "getentropy()") and to remove all of the machinery behind the /dev/random blocking pool. The net effect would be that users who truly need today's getrandom(0) behavior could still get it by reading /dev/random.

The thread is long and twisty; Torvalds's final decision is not yet clear. It does seem that something will be done to getrandom(0), but whether it times out or switches to jitter entropy in the problematic case is unclear. It does also seem that the blocking random number pool's days are numbered, as well, based on Torvalds's statements in the thread. But the final shape of those changes is not yet apparent.

It would seem that, once again, the kernel development community has failed in the design of an API/ABI. According to Torvalds and others, the default for getrandom() should never have been "block forever", but that information comes five years too late. API/ABI review is an area that the kernel has struggled with over the years; hopefully situations like this will provide enough incentive to take some extra time (and do some testing, though that probably would not have mattered here) before committing to an ABI that has to be supported, for the most part, anyway, forever.


Fixing getrandom()

Posted Sep 27, 2019 16:25 UTC (Fri) by jcm (subscriber, #18262) [Link] (7 responses)

There’s a UEFI random number protocol that ought to seed the kernel PRNG on boot - was the system using UEFI, and had it subsequently run out of entropy?

Fixing getrandom()

Posted Sep 27, 2019 17:53 UTC (Fri) by mjg59 (subscriber, #23239) [Link]

This is only used if the system is booted via the EFI boot stub - grub still defaults to jumping directly into the main kernel entry point, so on a lot of distributions this is skipped.

Fixing getrandom()

Posted Sep 27, 2019 17:55 UTC (Fri) by patrakov (subscriber, #97174) [Link] (4 responses)

This is what I would call "shifting the responsibility" and reject on this basis. Exactly the same logic as described in the article for getrandom() also applies to the UEFI random number protocol: where does it get the seed? If it is fake, then it is a very roundabout way to return fake entropy. If it consults some hardware, then the kernel can do the same.

Fixing getrandom()

Posted Sep 27, 2019 18:37 UTC (Fri) by jem (subscriber, #24231) [Link] (3 responses)

A new seed is written before the computer is rebooted. This way the accumulated entropy is not flushed at restart.

Fixing getrandom()

Posted Sep 27, 2019 19:05 UTC (Fri) by walters (subscriber, #7396) [Link] (2 responses)

Yeah though, actually *crediting* it is a different step: https://github.com/systemd/systemd/issues/4271 (and crediting is highly relevant to this discussion)

Fixing getrandom()

Posted Sep 29, 2019 20:05 UTC (Sun) by NYKevin (subscriber, #129325) [Link] (1 responses)

I recommend clicking through to that bug report. This is more complicated than I had imagined, because in cases where people take images of live systems, you really shouldn't credit any "stored" entropy at all (because it's been duplicated umpteen times into other instances of the same image, so it's no longer unpredictable). But you can't know that someone imaged the system, so how do you square that circle?

Fixing getrandom()

Posted Sep 29, 2019 20:27 UTC (Sun) by patrakov (subscriber, #97174) [Link]

This is exactly why I say "don't". Too many bugs stem from our desire to achieve the impossible instead of giving up immediately.

OTOH, jitter entropy will definitely help here, up to the point of making it completely unneeded to save entropy between reboots.

Fixing getrandom()

Posted Oct 1, 2019 20:54 UTC (Tue) by kmeyer (subscriber, #50720) [Link]

Trusting the BIOS to provide good entropy seems like more of a mistake than jitter-entropy.

Fixing getrandom()

Posted Sep 27, 2019 18:03 UTC (Fri) by flussence (guest, #85566) [Link] (8 responses)

> “Let the people putting together systems deal with this. Let them provide a creditable hw rng, and let them pay the price if they don't.”

I can see the hysterical tech tabloid headlines already: “systemd announces business plan to brick all old systems unless you purchase an expensive security dongle”.

Fixing getrandom()

Posted Sep 27, 2019 18:09 UTC (Fri) by jccleaver (guest, #127418) [Link] (7 responses)

> I can see the hysterical tech tabloid headlines already: “systemd announces business plan to brick all old systems unless you purchase an expensive security dongle”.

It's worth pointing out that half this problem is actually *caused* by having moved everything into systemd. If you needed entropy in early but post-initramfs boot and needed to be sure it was there, it was trivial enough to put some sort of arbitrary shell action way up in the script to do it.

Fixing getrandom()

Posted Sep 28, 2019 16:36 UTC (Sat) by mads (subscriber, #55377) [Link]

But it would be trivial to add that to systemd too, and why does it have to be a shell script? (not that systemd can't use bash, but why should it)

Fixing getrandom()

Posted Sep 29, 2019 9:26 UTC (Sun) by mezcalero (subscriber, #45103) [Link] (5 responses)

systemd reads a random seed off disk just fine for you, no need to write any script for that. Problem with the approach is that it's waaaay too late: we can only credit a random seed read from disk when we can also update it on disk, so that it is never reused. This means /var needs to be writable, which is really later during boot, long after we already needed entropy, and long after the initrd.

Hence, no, systemd is not causing this, systemd does what it can, but it can't magically create entropy where there is none.

Or to say this differently: that "arbitrary shell script" you are envisioning, what is it supposed to do? Where would it derive entropy from where neither the kernel nor systemd do or could do it at least as good?

if you care, have a look here, about the approach systemd takes to help you with the general problem: https://systemd.io/RANDOM_SEEDS.html

Lennart

Fixing getrandom()

Posted Sep 29, 2019 19:37 UTC (Sun) by flussence (guest, #85566) [Link]

So, I had a lingering question that the last paragraph of that page answers sufficiently.

I've got a system where the NVRAM is probably fine, but it has a broken EFI implementation (AMI), where nobody bothered to implement deallocating deleted vars, so eventually it'd start returning -ENOSPC for every write operation. Me naively leaving pstore panic logging enabled soon flushed that out (followed by real panic at efibootmgr failing, and a day of downtime trying to figure out what went wrong and tearing the room up to get at a CMOS jumper).

The kernel help text for EFI features could use a gentle reminder that yes, EFI firmware *is* written by the same nincompoops as the bad old BIOSes of the 90s, and should be equally mistrusted.

Fixing getrandom()

Posted Sep 30, 2019 19:30 UTC (Mon) by wahern (subscriber, #37304) [Link] (3 responses)

Who cares if it's reused? The same 32-byte random seed cryptographically mixed with a non-repeating nonce, like the system clock, would have the same strength as a new 32-byte random seed, presuming the seed remains confidential. Long-term it's better to change the seed in case confidentiality was unknowingly violated or the nonce repeats (clock reset), but as a practical matter those aren't prerequisites to having a reasonably sane and secure system, especially considering the alternative--a multiplicity of userland CSPRNGs all scavenging for entropy independently.
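A minimal sketch of the mixing described in this comment, assuming HMAC-SHA256 as the mixing function (the comment does not name one) and a nanosecond clock reading as the nonce; the function name is hypothetical.

```python
import hashlib
import hmac
import time


def derive_boot_seed(stored_seed, nonce):
    """Mix a long-lived secret seed with a per-boot nonce.

    The output differs on every boot as long as the nonce never repeats,
    and stays unpredictable as long as the stored seed remains secret.
    """
    return hmac.new(stored_seed, nonce, hashlib.sha256).digest()


if __name__ == "__main__":
    seed = bytes(32)  # placeholder: a real seed would be 32 secret random bytes
    nonce = time.time_ns().to_bytes(8, "big")
    print(derive_boot_seed(seed, nonce).hex())
```

The sketch also shows where the argument is fragile: if the clock is reset and the nonce repeats, the derived seed repeats too.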

Fixing getrandom()

Posted Oct 1, 2019 17:28 UTC (Tue) by alonz (subscriber, #815) [Link] (2 responses)

You're assuming that there is a reasonably-initialized system clock at the point where entropy is required – this is just as wrong as any of the other assumptions regarding entropy.

Fixing getrandom()

Posted Oct 1, 2019 19:07 UTC (Tue) by wahern (subscriber, #37304) [Link] (1 responses)

I am assuming it, but I think it's a reasonable assumption--that there are adequate sources available to minimize the chance of a repeating nonce. The system clock was just an example. And of course there are systems where the assumption can't hold at all. So what? How many systems do there need to be to hold back everybody else? 1%? 0.1%? 0.01%? 0%?

Some systems are just hopelessly broken when it comes to entropy. And that can't be fixed. But those systems are increasingly (and at this point likely *entirely*), small, embedded systems. It was always the responsibility of the designers of those systems to either make sure there's an entropy source available or design their firmware so that it wasn't necessary (i.e. no sshd generating a new private key on first boot). Are we going to let them hold back the inevitable *forever*? At some point we have to hold the stragglers' feet to the fire and cut our losses on the installed base--most of which would never upgrade, anyhow, and are unlikely to even be using getrandom(2) in the first place.

With the prevalence of not only RDRAND and similar on-chip sources, but also many other sources (e.g. Intel QuickAssist provided a hardware generator on the NIC controller since *before* it was even branded QuickAssist, EFI provides randomness, which in some cases comes from a hardware source--but that's a QoI issue), it's time to make the switch over to assuming (*loudly* assuming) that strong entropy is available at boot or will be available very shortly after boot (see CPU jitter hack as a last-ditch effort). Almost all of userland already makes this assumption, and has for quite some time, rightly or wrongly; now the ball is in the kernel's court to make good on that assumption to the best of its ability.

This *will* happen eventually, the only question is how long we'll wring our hands over misplaced concern for embedded platforms that are and were fundamentally broken. It's been almost 15 years since the VIA C3 included an on-chip RNG. Embedded designers have had ample warning about the necessity of providing strong entropy for a long time.

Fixing getrandom()

Posted Oct 1, 2019 20:23 UTC (Tue) by wahern (subscriber, #37304) [Link]

Also, just to be clear, the context for the boot seed was systemd. The overlap of embedded systems lacking both hardware entropy such as RDRAND and a reliable system clock but still running systemd is likely not very large. But then you also need to discount that by the odds of consecutive boots where systemd couldn't re-save a seed. *And* you need to discount it further by the odds the system was doing anything security critical. *And* you need to discount this by the odds that such a scenario would be distinguishable and exploitable.

Can this scenario exist? Sure. Does it exist? We should assume so. The only question is what's the risk, and does that risk outweigh the risk of not improving other aspects of the system's randomness semantics with the consequence that software will attempt to compensate *poorly*. And, again, what's that relative risk within the context of embedded system + systemd - RNG - clock?

Fixing getrandom()

Posted Sep 27, 2019 18:37 UTC (Fri) by cesarb (subscriber, #6266) [Link] (2 responses)

> In order for the generation of keys to fail under that scheme, he said, they would have to be generated at boot on idle machines that are not doing anything that would allow entropy to be collected.

Isn't "idle machines that are not doing anything else" exactly the situation in the first boot of a newly-installed distribution, which is when the long-term ssh host keys (which do need strong random numbers) are usually generated?

Fixing getrandom()

Posted Oct 4, 2019 7:14 UTC (Fri) by kmeyer (subscriber, #50720) [Link] (1 responses)

1. The installation process generates a lot of IO, sufficient to seed the CSPRNG. It should emit a seed that can be used by the next boot (much like any other reboot entropy save).
2. Optionally, the installer can also generate and write out sshd host keys. There's not a lot of reason to wait until first boot for that.

Fixing getrandom()

Posted Oct 4, 2019 11:36 UTC (Fri) by Jandar (subscriber, #85683) [Link]

The devices in question aren't installed individually. One image of a single install is put on thousands or millions of them, so generating host keys at installation time is the worst thing to do.

Fixing getrandom()

Posted Sep 27, 2019 18:44 UTC (Fri) by mgedmin (subscriber, #34497) [Link] (3 responses)

Anyone remember https://factorable.net/?

Fixing getrandom()

Posted Sep 28, 2019 12:07 UTC (Sat) by corsac (subscriber, #49696) [Link] (2 responses)

Agreed. I didn't read the whole thread, but I'm surprised not to see any reference to “Mining your Ps and Qs” from Usenix Security 2012 (https://www.usenix.org/conference/usenixsecurity12/techni... and https://factorable.net/weakkeys12.extended.pdf)

Fixing getrandom()

Posted Sep 28, 2019 19:44 UTC (Sat) by patrakov (subscriber, #97174) [Link] (1 responses)

This site was indeed mentioned in the discussion. However, a return of the factorable.net issue for systems that do not properly restore the random seed is preferable, in my viewpoint, to breaking almost all systems.

Fixing getrandom()

Posted Sep 29, 2019 9:01 UTC (Sun) by corsac (subscriber, #49696) [Link]

The factorable.net problem is not only for systems not restoring the random seed. Most of them are devices booting for the first time and generating long-term keys (ssh etc.) with a really similar entropy state (because of an unseeded RNG). It's likely there are a *lot* of devices like these being shipped every day, even more so with cloud VMs, so breaking those is a really bad idea I think.

Inject entropy from the disk using the bootloader and similar problem fixed in Python

Posted Sep 27, 2019 21:07 UTC (Fri) by vstinner (subscriber, #42675) [Link]

"Since the system in question was not gathering enough entropy due to the lack of unpredictable interrupt timings, (...)"

OpenBSD fixed this boot issue: their bootloader loads entropy from disk, and the installer collects enough entropy. It doesn't cover all cases (read-only livecd, systems with no entropy source to feed new entropy, etc.), but it fixes the "lack of entropy at boot" issue for the common case.

In 2015, a systemd script used Python to compute a hash. Python blocked on getrandom() at boot.

When the bug was reported, a very long discussion started on the bug tracker and continued on the python-dev mailing list. A new mailing list was created just to discuss this bug :-)

In the end, we decided to fall back on /dev/urandom if getrandom() blocks, to initialize the "secret hash seed". This secret is used to randomize the dictionary hash function, to reduce the risk of a denial-of-service attack on dictionaries.

Moreover, the os.urandom() function has been modified on Linux (and Solaris: systems providing getrandom() syscall/function) to block (on purpose) until the system collected enough entropy.

Calling os.getrandom(1, os.GRND_NONBLOCK) can be used to check if getrandom() is going to block or not. Some people asked for this feature, but I'm not sure that it's really used in practice.

PEP 524 (https://www.python.org/dev/peps/pep-0524/) describes the issue and the fix.

Python was an early adopter of getrandom() syscall, before it was exposed as a function in the glibc ;-)

Python keeps a file descriptor open on /dev/urandom for best performance. Some badly written applications close the file descriptor by mistake, so Python detects if the file descriptor changed (comparing st_dev and st_ino) to work around application bugs. Moreover, there is no lock on the file descriptor for best performance, which requires detecting when two threads open the file "at the same time".
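The st_dev/st_ino comparison can be sketched as follows. This is an illustration of the technique only, not CPython's actual code; the class and method names are made up, and the thread-safety problem mentioned above is deliberately left out.

```python
import os


class CachedDevice:
    """Keep one descriptor open and notice if it stops referring to the
    file that was originally opened."""

    def __init__(self, path):
        self.path = path
        self._open()

    def _open(self):
        self.fd = os.open(self.path, os.O_RDONLY)
        st = os.fstat(self.fd)
        # Device and inode numbers together identify the underlying file.
        self.identity = (st.st_dev, st.st_ino)

    def still_valid(self):
        try:
            st = os.fstat(self.fd)
        except OSError:
            return False  # the descriptor was closed behind our back
        return (st.st_dev, st.st_ino) == self.identity

    def read(self, n):
        if not self.still_valid():
            # The old number now belongs to other code, so it is not closed
            # here; a fresh descriptor is opened instead.
            self._open()
        return os.read(self.fd, n)


if __name__ == "__main__":
    dev = CachedDevice("/dev/urandom")
    print(dev.read(8).hex())
```

Without the check, a descriptor number that was closed and then reused by an unrelated open() would silently feed that file's contents to callers expecting random bytes.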

So well, getrandom() avoids all these issues.

Fixing getrandom()

Posted Sep 28, 2019 5:08 UTC (Sat) by ncm (guest, #165) [Link] (7 responses)

I am always amazed when people insist that truly random data is a scarce resource. Most machines nowadays have at least one of a CCD camera, microphone, or radio receiver. All are excellent sources of enough truly random noise (along with possibly deterministic bits, which are harmless) for almost any purpose.

Fixing getrandom()

Posted Sep 28, 2019 6:00 UTC (Sat) by alonz (subscriber, #815) [Link]

The actual scarce resource (in my opinion 😏) is random data that can be trusted by a truly-paranoid person. (Whether the paranoia is justified or not is a different question; I would expect the smart paranoid to use a hardware RNG, not trust the off-the-shelf randomness from a general-purpose computer + OS).

For most uses, a simple userspace solution that runs very early in the boot sequence and credits some environment noise as entropy should be sufficient. This would solve even the “initial SSHD seed” concerns — however it is easily broken by distributors / packagers who might remove it in the name of “faster boot”.

Fixing getrandom()

Posted Sep 29, 2019 6:04 UTC (Sun) by edeloget (subscriber, #88392) [Link]

Well, for the past ten years I have been developing specialized distributions for network devices, and they tend to not have any of the above. And yes, SSH keys generated on the first boot are really a pain (actually less of a pain these days because newer CPUs tend to propose a hwrng but hey, this is an industry where you are routinely dealing with severely out-of-date CPUs...).

Fixing getrandom()

Posted Sep 30, 2019 8:26 UTC (Mon) by anton (subscriber, #25547) [Link] (3 responses)

I do not know if a microphone or radio are good random sources, but a camera is. The resolution of camera sensors is high enough that the randomness of the photons coming in is reflected in the raw sensor output (and it is a lot for a (not too) bright picture). However, that means that the sensor must be on and receive significant light on booting, and you need a way to get the raw data (transformation into JPEG usually tries to get rid of the noise that we want for the RNG).

Fixing getrandom()

Posted Sep 30, 2019 11:59 UTC (Mon) by excors (subscriber, #95769) [Link] (1 responses)

I think you can get a decent amount of entropy even if there is no light on the camera, because of thermal noise in the sensor. That's probably safer than receiving significant light, because if the sensor gets saturated then you'll get a pure white image with no entropy. Phone cameras don't have physical shutters but the sensors often have some non-exposed pixels around the edge (for black level calibration etc) and you could probably use those.

> you need a way to get the raw data (transformation into JPEG usually tries to get rid of the noise that we want for the RNG)

It's not just the JPEG compression - the Android camera API is happy to give you uncompressed YUV but that still wouldn't be raw enough. You'd want the (typically) 10-bit Bayer data directly from the sensor, before the ISP has tried to make it look pretty (doing noise reduction, adjusting levels in a way that might saturate the noise out of existence, smoothing the image, etc). And you probably want to manually configure the sensor to maximise noise (long exposure, high gain, disable binning, etc). Android provides enough control to let applications request that, but I don't know how many of the camera drivers implement it fully, so it's probably not a very portable approach.

Fixing getrandom()

Posted Sep 30, 2019 12:49 UTC (Mon) by anton (subscriber, #25547) [Link]

Thermal noise is relatively small compared to photon noise if the sensor receives significant photons, but may be enough for initializing the RNG. And of course you don't want to have so much brightness and so much exposure that the sensor saturates, but you can recognize anything approaching saturation, and then use a shorter exposure time if too many pixels are saturated. Combining high gain with long exposure will give more thermal noise in darkness, but produce saturation if there is light.
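To make the saturation check concrete, here is a small Python sketch, assuming the raw frame is already available as a list of 10-bit sample values (the Bayer data mentioned above). The function names and the 1% threshold are made up for illustration. Hashing only condenses whatever noise is in the frame; it does not estimate how much entropy is actually there.

```python
import hashlib


def saturated_fraction(pixels, max_value=1023):
    """Fraction of samples at (or one step below) full scale."""
    if not pixels:
        raise ValueError("empty frame")
    clipped = sum(1 for p in pixels if p >= max_value - 1)
    return clipped / len(pixels)


def seed_from_frame(pixels, max_value=1023, max_saturated=0.01):
    """Return a 32-byte seed, or None if the frame is too saturated."""
    if saturated_fraction(pixels, max_value) > max_saturated:
        return None  # caller should shorten the exposure and retry
    # Serialize each sample as two little-endian bytes, then hash.
    raw = b"".join(p.to_bytes(2, "little") for p in pixels)
    return hashlib.sha256(raw).digest()
```

A caller would capture a frame, call `seed_from_frame()`, and on `None` reduce the exposure time and try again.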

Fixing getrandom()

Posted Oct 3, 2019 10:40 UTC (Thu) by NRArnot (subscriber, #3033) [Link]

As long as you can crank up the gain high enough to input "hiss" or "static" (electronic shot noise or cosmic background radiation) then even a microphone jack with no microphone plugged in, or a radio with no aerial, is a source of physical randomness. I don't know if the common on-board microphone sockets in the PC world can be used this way. I guess the problem for the truly paranoid is how to tell whether what is coming in has been deviously compromised so as to only look random.

Personally I'd go with a boot parameter "paranoia = n" (maybe the current and maximum value is 11, with a nod to Spinal Tap). 10 would allow use of the random number generator on the CPU chip if there is one, and thereby solve all the problems other than the possibility that (insert conspiracy theory here).

Fixing getrandom()

Posted Nov 19, 2021 16:51 UTC (Fri) by Lawless-M (guest, #155377) [Link]

My VMs do not have cameras, radios or microphones

Fixing getrandom()

Posted Sep 28, 2019 5:52 UTC (Sat) by josh (subscriber, #17465) [Link] (2 responses)

> As Garrett noted, though, that is the exact scenario for which the getrandom(0) behavior was designed. Torvalds does not see that kind of key generation as anything other than a hypothetical, it seems.

Many distributions (both live and initial-boot) generate SSH keys on boot. They do this *today*. That's not a hypothetical, that's a case that Debian folks have been discussing for a while now, where systems take forever to boot. This is still a bug today, if you don't have a hardware random number generator.

Fixing getrandom()

Posted Oct 4, 2019 7:21 UTC (Fri) by kmeyer (subscriber, #50720) [Link] (1 responses)

It doesn't fix the issue for live systems (are there many of those without RDRAND?), but for installed initial-boot systems: why not have the installer write out a random seed and optionally also sshd host keys?

Fixing getrandom()

Posted Oct 4, 2019 9:23 UTC (Fri) by zdzichu (subscriber, #17118) [Link]

Because host keys need to be unique. The installer is used once to create a template. This template is then used a number of times to create virtual machines.
The template cannot contain pregenerated host keys, because every VM would have the same key.
Using the installer every time a new VM is created is not feasible; the installation process takes too much time. Creating a new VM is something that should take no more than a few seconds.

Fixing getrandom()

Posted Sep 28, 2019 6:09 UTC (Sat) by alonz (subscriber, #815) [Link] (1 responses)

In my opinion, a better solution would be to remove the automatic collection of entropy from the kernel at boot time, and require userspace to provide randomness (or to explicitly start the kernel randomness collection). And - make getrandom() (and /dev/{u,}random) return an error if they have no randomness to provide.

This would mean that a userspace that doesn't initialize randomness early enough will just fail, loudly and deterministically. So even the folks who try to “optimize boot time” by just removing boot-time items without thinking won't be able to build a broken system that boots but isn't secure.

Fixing getrandom()

Posted Oct 4, 2019 7:31 UTC (Fri) by kmeyer (subscriber, #50720) [Link]

I don't see what possible benefit removing kernel entropy collection might offer. Many kernel entropy sources simply aren't available to userspace. Additionally, the kernel is the logical place to pool entropy for the system; it outlives all ephemeral userspace programs.

As to the rest, the APIs you ask for already exist.

> make getrandom() (and /dev/{u,}random) return an error if they have no randomness to provide.

getrandom(1, GRND_NONBLOCK) ⇒ -1/EAGAIN; poll(/dev/random, POLLIN, 0) ⇒ 0.

> This would mean that a userspace that doesn't initialize randomness early enough will just fail, loudly and deterministically.

All you need to do is add one of the above checks and a printf() to your userspace init process to produce the loud warning.
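For illustration, the two checks above can be sketched in Python, whose `os.getrandom()` and `select.poll()` wrap the same system calls. The function names and the warning text are assumptions; an init process written in C would make the equivalent calls directly.

```python
import os
import select


def crng_ready() -> bool:
    """True if getrandom() would return without blocking."""
    try:
        os.getrandom(1, os.GRND_NONBLOCK)
    except BlockingIOError:  # EAGAIN: the pool is not initialized yet
        return False
    return True


def dev_random_readable(path: str = "/dev/random") -> bool:
    """Zero-timeout poll() for POLLIN; True if the device reports ready."""
    fd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
    try:
        poller = select.poll()
        poller.register(fd, select.POLLIN)
        return bool(poller.poll(0))
    finally:
        os.close(fd)


if __name__ == "__main__":
    if not crng_ready():
        print("WARNING: kernel CRNG is not initialized yet")
```

Neither check consumes a meaningful amount of randomness, so it can run on every boot.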

> So even the folks who try to “optimize boot time” by just removing boot-time items without thinking won't be able to build a broken system that boots but isn't secure.

That is the status quo with correct use of getrandom().

Fixing getrandom()

Posted Sep 28, 2019 9:57 UTC (Sat) by dd9jn (✭ supporter ✭, #4459) [Link] (3 responses)

So in Libgcrypt we have been urged to move to getrandom to avoid some issues in the early boot phase and to get better performance in almost all cases. Weakening getrandom now is a no-go and would reflect badly on the security consciousness of _some_ kernel folks.

Using Stephan Müller's jitter based entropy generator inside the kernel is by any means the Right Thing to do - even if it is for now only a fallback. In Libgcrypt's Windows version we already use it because on Windows the JitterRNG is the only non-external-hardware RNG which has been approved by Germany's BSI for use in restricted communication at the VS-NfD level. On Linux getrandom has been evaluated as fine but nevertheless we mix some entropy from the JitterRNG into our own entropy pool. Right, we also use RDRAND in addition and that is technically okay. But because RDRAND can't be evaluated the evaluation of Libgcrypt assumes that RDRAND adds 0 bits of entropy to the pool.

Fixing getrandom()

Posted Sep 28, 2019 19:52 UTC (Sat) by patrakov (subscriber, #97174) [Link] (2 responses)

Could you please submit some documentary evidence of this approval? Asking because it would be the perfect response to the last paragraph in https://lore.kernel.org/lkml/CAHk-=whz7Okts01ygAP6GZWBvCV...

Fixing getrandom()

Posted Sep 28, 2019 20:38 UTC (Sat) by joib (subscriber, #8541) [Link]

Probably the BSI cares only about x86(-64) Windows on CPUs supported by currently maintained Windows versions, so they can assume the presence of TSC? So it doesn't really address what seemed to be Linus's main objection.

(Though in my non-expert opinion, it seems having a jitter entropy generator in the kernel for supported targets would be the least bad approach of those discussed here. Those few that run unsupported targets are hopefully sufficiently clueful that they can use a hw RNG, haveged, or maybe they don't need early boot random numbers anyway.)

Fixing getrandom()

Posted Oct 2, 2019 0:54 UTC (Wed) by mangix (guest, #126006) [Link]

FWIW OpenWrt master (and 19.07) uses this. It's faster than haveged.

Fixing getrandom()

Posted Sep 29, 2019 7:29 UTC (Sun) by patrakov (subscriber, #97174) [Link] (4 responses)

The story has received an update: Linus Torvalds posted a patch to get the entropy from the timing of schedule() calls. Very similar to jitter entropy.

https://lore.kernel.org/lkml/CAHk-=wgjC01UaoV35PZvGPnrQ81...
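To give a feel for the general idea of harvesting timing jitter, here is a userspace Python sketch. It is neither the posted kernel patch nor Stephan Müller's JitterRNG; the function names, the sample count, and the use of `time.sleep(0)` as a stand-in for yielding to the scheduler are all assumptions made for illustration, and the code makes no claim about how much entropy the deltas contain.

```python
import hashlib
import time


def timing_deltas(samples: int = 256) -> list:
    """Collect differences between successive high-resolution clock reads."""
    deltas = []
    previous = time.perf_counter_ns()
    for _ in range(samples):
        time.sleep(0)  # give the scheduler a chance to run something else
        now = time.perf_counter_ns()
        deltas.append(now - previous)
        previous = now
    return deltas


def fold_deltas(deltas) -> bytes:
    """Condense the deltas into 32 bytes with SHA-256."""
    h = hashlib.sha256()
    for d in deltas:
        h.update(d.to_bytes(8, "little"))
    return h.digest()
```

Running `fold_deltas(timing_deltas())` twice on the same machine will normally give different digests, which is the unpredictability such schemes rely on.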

Fixing getrandom()

Posted Sep 30, 2019 10:54 UTC (Mon) by joib (subscriber, #8541) [Link] (3 responses)

Seems this approach was merged:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/...

Fixing getrandom()

Posted Sep 30, 2019 11:57 UTC (Mon) by patrakov (subscriber, #97174) [Link] (2 responses)

Yes, but with a few phrases in the commit message that I do not necessarily agree with (or maybe should interpret as sarcasm, because then it makes perfect sense). Let me quote the problematic sentence.

"""
While this was triggered by what is arguably a user space bug with GDM/gnome-session asking for secure randomness during early boot, when they didn't even need any such truly secure thing, the issue ends up being that our "getrandom()" interface is prone to that kind of confusion, because people don't think very hard about whether they want to block for sufficient amounts of entropy.
"""

If things as late as GDM/gnome-session are still "early boot", then which service does not count as early boot? See the problem?

Fixing getrandom()

Posted Sep 30, 2019 13:19 UTC (Mon) by Otus (subscriber, #67685) [Link]

> If things as late as GDM/gnome-session are still "early boot", then which service does not count as early boot? See the problem?

From the point of view of the random pools, before this change, anything before the user gets a login screen is early boot. That's when you start getting more than a trickle of entropy.

Fixing getrandom()

Posted Oct 1, 2019 2:46 UTC (Tue) by NYKevin (subscriber, #129325) [Link]

Well, if that's too early, that means you're plausibly running all kinds of nonsense, like Python.

Fixing getrandom()

Posted Oct 1, 2019 9:39 UTC (Tue) by ceplm (subscriber, #41334) [Link] (6 responses)

Fixing getrandom() means removing it, or providing getnormalrandom() (which reads random data from /dev/urandom without any other checks). The moment there is a silly method with a nice kernel call and a reasonable method that is even slightly more complicated to use, everybody uses the first one even in situations where they shouldn’t. Data from /dev/random is not more random or somehow sprinkled with pixie dust. It is exactly the same, but sometimes /dev/random blocks for reasons which don’t matter to almost anybody (https://www.2uo.de/myths-about-urandom/).

Fixing getrandom()

Posted Oct 1, 2019 9:49 UTC (Tue) by ceplm (subscriber, #41334) [Link] (3 responses)

I was working on the mistaken assumption that getrandom() gets data from /dev/random. It doesn’t, it is actually designed well, and this is just a bug, which needs to be fixed, no more fuss about it.

Fixing getrandom()

Posted Oct 1, 2019 16:53 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

Using /dev/random is a bad idea in general. There's no reason it's more secure than /dev/urandom and it can lead to large delays.

Fixing getrandom()

Posted Oct 4, 2019 6:50 UTC (Fri) by kmeyer (subscriber, #50720) [Link] (1 responses)

If you have a device no one can use, why have it? So I'd say it's the design of Linux /dev/random that's at fault. Linux could drop their "entropy draining" concept tomorrow and have a /dev/random like BSD /dev/[u]random and everyone would be just as happy. There is no academic or practical basis in the "entropy draining" model Linux's /dev/random espouses.

Fixing getrandom()

Posted Oct 10, 2019 20:28 UTC (Thu) by nix (subscriber, #2304) [Link]

Entropy draining is still useful, but not for blocking reads -- more, so that expensive methods of *accumulating* entropy only need to run when people have actually been reading from /dev/random in the first place (right now, adding entropy to a full pool will block until the pool drains a bit). These expensive methods are not theoretical: I have one attached to the machine I'm typing this on right now. Adding entropy non-stop will max out a core...

Fixing getrandom()

Posted Oct 1, 2019 17:05 UTC (Tue) by zlynx (guest, #2285) [Link] (1 responses)

What do you do about the fact that IoT devices built with identical hardware and identical software boot up and produce one of *maybe* 4,000 or so *identical* /dev/urandom streams? It's pretty much based on temperature which can *slightly* modify the interrupt rates vs clock cycles.

At any rate, it's a real problem. Because if they do something during boot such as generate their own SSH or SSL certificates the attacker only has to guess a few possibilities.

Fixing getrandom()

Posted Oct 1, 2019 19:02 UTC (Tue) by ceplm (subscriber, #41334) [Link]

Hold on! I haven’t said it cannot be a real problem (although, couldn’t that first start of the ssh daemon be delayed a bit until after the whole system is up; systemd should be able to do something like that, shouldn’t it?). I just said a) there is something wrong with getrandom() defaulting to the blocking version (sshd is certainly a minority here, so it should be calling getrandom() with some specific options to make sure it gets lovely perfect random numbers), b) there is something wrong with gdm, when it doesn’t expect to be run rather early with possibly not enough entropy for something which really doesn’t require super-random numbers (it should certainly call getrandom(0) because it really doesn’t matter that much how superspecial those random numbers are), c) well, those IoT devices should know about this situation and somehow collect more entropy themselves (mixing in MAC addresses, their serial numbers; yes, it is predictable for the given device, but possibly less predictable for an anonymous device in general? Not sure.)