Python's os.urandom() in the absence of entropy

By Jonathan Corbet
July 10, 2016
Python applications, like those written in other languages, often need to obtain random data for purposes ranging from cryptographic key generation to initialization of scientific models. For years, the standard way of getting that data has been a call to os.urandom(), which is documented to "return a string of n random bytes suitable for cryptographic use." An enhancement in Python 3.5 caused a subtle change in how os.urandom() behaves on Linux systems, leading to some long, heated discussions about how randomness should be obtained in Python programs. When the dust settles, Python benevolent dictator for life (BDFL) Guido van Rossum will have the unenviable task of choosing between two competing proposals.

Blocking os.urandom()

Traditionally, os.urandom() has been implemented on Linux by opening /dev/urandom and reading the requested amount of data. This interface is non-blocking; it will not wait if the amount of entropy in the system's entropy pool is low. The implementation of /dev/urandom is such that the quality of the random data it returns will be high even if the entropy pool is depleted — with one possible exception. Immediately after the system boots, when the entropy pool will contain little or no entropy, /dev/urandom may return relatively predictable data. In most systems, this window of poor randomness is only open for a few seconds at most, but exceptions do exist.
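As a rough illustration of that traditional approach, the sketch below opens the device and reads until the requested number of bytes has arrived. The function name read_urandom() and its path parameter are invented for this example; this is not CPython's actual implementation, which is written in C.

```python
import os


def read_urandom(n, path="/dev/urandom"):
    """Return n bytes read from the kernel's non-blocking random device.

    Hypothetical helper for illustration only.
    """
    if n < 0:
        raise ValueError("negative argument not allowed")
    fd = os.open(path, os.O_RDONLY)
    try:
        chunks = []
        remaining = n
        # os.read() may return fewer bytes than requested, so loop.
        while remaining > 0:
            chunk = os.read(fd, remaining)
            if not chunk:
                raise OSError("unexpected end of file on " + path)
            chunks.append(chunk)
            remaining -= len(chunk)
    finally:
        os.close(fd)
    return b"".join(chunks)
```

Nothing in this code waits for the entropy pool; early in boot, the read succeeds regardless of whether the pool has been initialized.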

In the Python 3.5.0 release, os.urandom() was changed to use the relatively new getrandom() system call on Linux. Unless it has been called with the GRND_NONBLOCK flag, getrandom() will wait, if need be, for the system entropy pool to be initialized. os.urandom() does not supply that flag, meaning that it can block if the entropy pool has not yet accumulated enough randomness. It seems like a relatively small change, with the prospect of being sure of living up to the "suitable for cryptographic use" promise in compensation. But one need only look at Python issue 26839 to see that the implications are not quite as simple as one would expect.
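The effect of the flag can be sketched from Python itself, assuming Python 3.6 or later on Linux, where the system call is exposed as os.getrandom(). The helper name try_getrandom() is made up for this example, and returning None on failure is a choice made here for brevity.

```python
import errno
import os


def try_getrandom(n, block=True):
    """Return n random bytes from getrandom(), or None if unavailable.

    Hypothetical helper. With block=True the call waits for the entropy
    pool to be initialized; with block=False it passes GRND_NONBLOCK and
    gives up immediately instead of waiting.
    """
    if not hasattr(os, "getrandom"):
        return None  # not Linux, or a Python older than 3.6
    flags = 0 if block else os.GRND_NONBLOCK
    try:
        return os.getrandom(n, flags)
    except BlockingIOError:
        # Raised with GRND_NONBLOCK when the pool is not yet initialized.
        return None
    except OSError as exc:
        # Old kernels (ENOSYS) or sandboxes (EPERM) may lack the call.
        if exc.errno in (errno.ENOSYS, errno.EPERM):
            return None
        raise
```

Calling try_getrandom(16) with the default block=True is the case that can stall an early-boot script; try_getrandom(16, block=False) returns None in the same situation.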

It turns out that, in some distribution configurations, Python scripts are run at the very beginning of the user-space bootstrap process. If the entropy pool is not yet ready, those scripts will block until entropy-pool initialization is complete. If there is nothing else going on, and especially if the system is booting as a virtualized guest, it may take a long time to accumulate enough entropy to proceed. In the incident that led to the bug report, the boot process simply hung for 90 seconds until systemd lost patience and killed the blocking process. That kind of behavior, created in the search for unpredictable random numbers, has quite predictable effects in the form of unhappy users.

From this sprang the bug-tracker entry referenced above, which turned into a fierce discussion on the wisdom of the API change and whether os.urandom() should return crypto-quality randomness at any cost. The discussion spilled over onto the python-dev list when 3.5 release manager Larry Hastings despaired of reaching any sort of consensus and asked Van Rossum to simply rule on the matter. The resulting thread led many participants to question whether they wanted to continue following the list at all but, in the end, it did come to some useful conclusions.

If one is doing cryptographically sensitive work early in the bootstrap process — generating an SSH host key, for example — then blocking the boot almost certainly makes sense. The consequences of the alternative — generating weak keys — can be severe. In this case, though, it turns out that such high-quality randomness was not needed. Nobody was generating keys; instead, Python was initializing its own internal random-number generator and setting up dictionary randomization to defend against hash-collision attacks. These internal calls were (inadvertently) changed when os.urandom() was changed, but there seems to be a rough consensus that they do not need blocking behavior.

So the proper fix for the observed boot hang is to do these internal initializations without blocking on the entropy pool. For Python 3.5, the os.urandom() change will also be partially reverted, in that the function will, once again, be non-blocking. It will call getrandom() with the GRND_NONBLOCK flag and, if that call fails, fall back to reading /dev/urandom as before. With these fixes in place, the blocking part of the change is effectively reverted and the immediate problem has been solved.
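The fallback logic described here can be approximated in pure Python, again assuming a Python that exposes os.getrandom() (3.6 or later on Linux). The name urandom_nonblocking() is invented for this sketch; the real fix lives in CPython's C code and differs in detail.

```python
import os


def urandom_nonblocking(n):
    """Try getrandom(GRND_NONBLOCK); on failure, read /dev/urandom.

    Hypothetical helper illustrating the fallback order only.
    """
    if hasattr(os, "getrandom"):
        try:
            data = os.getrandom(n, os.GRND_NONBLOCK)
            if len(data) == n:
                return data
        except OSError:
            # Covers BlockingIOError (pool not ready) as well as kernels
            # where the system call is missing or not permitted.
            pass
    # The default buffered reader keeps reading until it has n bytes
    # (or reaches end of file, which the device never reports).
    with open("/dev/urandom", "rb") as dev:
        return dev.read(n)
```

The function never waits: if the pool is not ready, it quietly returns whatever /dev/urandom produces, which is exactly the behavior the security-minded developers object to.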

Blocking or exceptions?

That still leaves open the issue of how os.urandom() should behave; developers who are concerned about security are adamant that it should not return data when the entropy pool is not yet ready. So there is still pressure on Van Rossum (and the community as a whole) to specify blocking behavior starting with the upcoming 3.6 release. Python's benevolent dictator seems inclined to downplay the issue:

The problem with security experts is that they're always right when they say you shouldn't do something. The only truly secure computer is one that's disconnected and buried 6 feet under the ground. There's always a scenario through which an attacker could exploit a certain behavior. And there's always the possibility that the computer that's thus compromised is guarding a list of Chinese dissidents, or a million credit card numbers, or the key Apple uses to sign iPhone apps. But much more likely it just has my family photos and 100 cloned GitHub projects.

It became clear in the discussion, though, that opposition to returning questionable randomness from os.urandom() is strong. It seems likely that, in 3.6 and later releases, os.urandom() will no longer return data drawn from an uninitialized entropy pool. The question of how it will behave is, as yet, unresolved, though. In the end, Van Rossum asked the proponents of two different approaches to write up their ideas as Python enhancement proposals (PEPs); he will then choose between the two.

The first approach, favored by Victor Stinner, is to simply make os.urandom() blocking and be done with it — after ensuring that Python itself uses non-blocking behavior during its initialization. Changing to blocking behavior is arguably an incompatible change in a longstanding Python API but, as Stinner points out: "First of all, no user complained yet that 'os.urandom()' blocks. This point is currently theoretical." As long as the problems with starting Python itself are resolved, the thinking goes, there should not be problems for other users.

The alternative comes from Nick Coghlan. With this proposal, os.urandom() will raise a BlockingIOError exception if random data cannot be had without blocking. Adding a new exception to an established API has its own hazards; no existing code will be expecting that exception, so surprising explosions might result. But, for a problem that should only be possible during the bootstrap process, Coghlan believes that this is the best approach:

The hard part is then knowing that you *need* to wait. If you're silently getting more-predictable-than-you-expected random data, you may never realise. If your system hangs, you might eventually figure it out, but only after a likely frustrating debugging effort. By contrast, if your application fails with "BlockingIOError: system random number generator not ready", then you can search for that on the internet, see the above snippet for "How to wait for the system random number generator to be ready on Linux" and stick that into your code.

This proposal also envisions adding a function to the in-development "secrets" module, secrets.wait_for_system_rng(), that would simply block until the system's entropy pool is fully initialized and ready. The small (possibly nonexistent) body of code that breaks with unhandled BlockingIOError exceptions could call this function to ensure the availability of strong random data from os.urandom().
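How calling code might look under this proposal can be sketched with stand-ins. Both urandom_or_raise() and wait_for_system_rng() below are hypothetical functions written for this example, built on os.getrandom() (Python 3.6 or later on Linux); they are not the proposed os.urandom() or secrets API.

```python
import os


def urandom_or_raise(n):
    """Stand-in for the proposed os.urandom() behavior (hypothetical).

    os.getrandom() with GRND_NONBLOCK raises BlockingIOError by itself
    when the entropy pool has not been initialized.
    """
    return os.getrandom(n, os.GRND_NONBLOCK)


def wait_for_system_rng():
    """Stand-in for the proposed secrets.wait_for_system_rng().

    A blocking one-byte getrandom() call returns only once the pool is
    initialized; the byte itself is discarded.
    """
    os.getrandom(1)


def get_key_material(n):
    """Ask for random bytes; wait explicitly only if the pool is not ready."""
    try:
        return urandom_or_raise(n)
    except BlockingIOError:
        wait_for_system_rng()
        return urandom_or_raise(n)
```

The design point is that the wait becomes an explicit, searchable step in the application rather than a silent hang or silently weak data.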

It is not clear when a decision between these two proposals will be made. It is worth noting, though, that Coghlan has indicated that he is happy enough with Stinner's proposal that he can support it should that be the one that is accepted in the end. So the discussion may have been long and painful, but the end result should be strong random data in Python in a way that the community as a whole is able to agree upon. Hopefully that means everybody can rest and prepare for the inevitable debate over whether this change should be backported to Python 2.



Python's os.urandom() in the absence of entropy

Posted Jul 10, 2016 15:01 UTC (Sun) by fandingo (guest, #67019) [Link]

Isn't "urandom" in os.urandom() supposed to mean something? The name is clearly a reference to /dev/urandom, so it should behave the same as its namesake. I don't think there's anything more to it. Urandom means nonblocking. If they want to switch out the syscalls they use to getrandom(), that's fine but keep the same behavior.

The os module isn't designed to do anything fancy or add complexities on top of syscalls. It's just there to make them available. The less opinion the run time injects into this sort of thing the better.

Focus this effort on the secrets module. That's where people expect them to invest the time on these thorny issues.

Python's os.urandom() in the absence of entropy

Posted Jul 10, 2016 17:00 UTC (Sun) by Otus (subscriber, #67685) [Link]

> The name is clearly a reference to /dev/urandom, so it should behave the same as its namesake.

Arguably, /dev/urandom itself should not behave as it does. If it weren't for strict API guarantees preventing changes, it should similarly block in early boot until sufficient entropy exists, like getrandom().

Python's os.urandom() in the absence of entropy

Posted Jul 10, 2016 22:05 UTC (Sun) by SLi (subscriber, #53131) [Link]

How would it differ from /dev/random then?

Python's os.urandom() in the absence of entropy

Posted Jul 10, 2016 22:23 UTC (Sun) by josh (subscriber, #17465) [Link]

/dev/urandom and /dev/random both use a CSPRNG to produce cryptographically secure random data. Both require a seed, after which they'll generate an infinite amount of cryptographically secure random data, unpredictable and suitable for all purposes. /dev/random blocks based on an incorrect understanding of how CSPRNGs work, and should never be used for any purpose; /dev/urandom only blocks when it hasn't been seeded yet, and works fine as long as it has been seeded. getrandom behaves the way urandom always should have worked.

Python's os.urandom() in the absence of entropy

Posted Jul 10, 2016 22:53 UTC (Sun) by SLi (subscriber, #53131) [Link]

I do not think /dev/urandom blocks, ever, based on a cursory review of both urandom(4) and urandom_read() in drivers/char/random.c.

Python's os.urandom() in the absence of entropy

Posted Jul 10, 2016 23:17 UTC (Sun) by josh (subscriber, #17465) [Link]

I edited that part of my comment before posting it, but forgot that LWN's "publish comment" submits the version you last previewed without any edits; you have to hit "preview comment" again before submitting.

That should have said: /dev/urandom never blocks; it should block only when it hasn't been seeded yet, as it works fine as long as it has been seeded. getrandom behaves the way urandom should always have worked.

Python's os.urandom() in the absence of entropy

Posted Jul 11, 2016 5:57 UTC (Mon) by Otus (subscriber, #67685) [Link]

> How would it differ from /dev/random then?

/dev/random keeps an entropy tally and can block whenever that goes low. /dev/urandom only needs initial entropy and should (but does not) only block early when it lacks even that.

Python's os.urandom() in the absence of entropy

Posted Jul 10, 2016 22:14 UTC (Sun) by arjan (subscriber, #36785) [Link]

sadly on most systems there IS a hardware RNG that's good enough for urandom.... but it's not being used (enough) to avoid this problem.

Python's os.urandom() in the absence of entropy

Posted Jul 11, 2016 10:21 UTC (Mon) by keeperofdakeys (subscriber, #82635) [Link]

Arguably, the first thing the initramfs should do after mounting the root file system is reading the random seed stored on disk, and initialising the random pool (IIRC this is how openbsd works). The problem with this is you still have to wait for the init system to start before the seed can be rewritten.

Python's os.urandom() in the absence of entropy

Posted Jul 11, 2016 14:17 UTC (Mon) by hkario (subscriber, #94864) [Link]

this is exactly what basically every single desktop or server Linux distribution does

the problem is with embedded systems and as you pointed out, when you run inside initrd image

Python's os.urandom() in the absence of entropy

Posted Jul 15, 2016 23:58 UTC (Fri) by tytso (✭ supporter ✭, #9993) [Link]

The reason why /dev/urandom doesn't block, ever, is that if we ever made this change, it would be backwards incompatible, and it *would* cause certain userspace setups to never boot. The zero day kernel tester reported, when I tried this as a quick check, that both Ubuntu and Cerowrt would hang forever at boot if I tried making this change in the zero-day's VM environment.

getrandom(2) is a new system call interface (now, if only glibc would get off their collective backsides and support it), so I could make it block until the entropy pool is initialized. That way, we don't break backwards compatibility. The problem is that there are userspace programs that use /dev/urandom for various non-cryptographic use cases, and we don't want to break them --- especially if it means breaking a distribution in early boot.

Python's os.urandom() in the absence of entropy

Posted Jul 10, 2016 22:22 UTC (Sun) by cardoe (guest, #72234) [Link]

Something similar recently came up in Rust as well. A few programs I wrote that were started up in an early boot environment were blocking for long enough to raise the ire of systemd. On the Rust side it was nearly impossible to use any of the standard crates (think of Python Packages). After talking to some of the Rust maintainers the solution they wanted me to implement was to not cause the internal libstd bits to block but any users of the random crate would get the blocking implementation. The result is any users of HashMap are not guaranteed to have a cryptographically secure seed. For the relevant pull request see: Rust Lang PR 33086 and the original issue: Rust Lang Issue 32953. Honestly I know my use case is special but it matters to me (and clearly others from this post). You don't always need cryptographically secure numbers for every case that you need a random value. And acting like the world is black and white with regards to this is what gives the security folks a bad rap.

Python's os.urandom() in the absence of entropy

Posted Jul 10, 2016 22:39 UTC (Sun) by arjan (subscriber, #36785) [Link]

well the blocking behavior makes people write code that just doesn't use random inputs anymore. which is bad.

there is a lot of entropy in most modern systems, from the prng instructions in modern CPUs to the timestamp at the time of interrupts... yes you can argue that you don't want to trust any of that for generating long lasting keys (e.g. /dev/random), and I appreciate that a lot. But for almost everything else, throwing a lot of the "some entropy, just not provable how much" things together will be better than having people stop using random numbers (or worse, implement a custom PRNG per app) in their code.

(the provably perfect /dev/random is one that never returns anything. it's also provably the most useless one.... )

Python's os.urandom() in the absence of entropy

Posted Jul 11, 2016 3:03 UTC (Mon) by mathstuf (subscriber, #69389) [Link]

For systemd at least, why not have a target like network-wait-online.target but for seeding the RNG like random-wait-seed.target? Anything which needs it can then just add an After= and Requires= on it.

Python's os.urandom() in the absence of entropy

Posted Jul 11, 2016 11:21 UTC (Mon) by cortana (subscriber, #24596) [Link]

Does systemd-random-seed.service provide this?

Python's os.urandom() in the absence of entropy

Posted Jul 11, 2016 11:23 UTC (Mon) by cortana (subscriber, #24596) [Link]

... if so, then unless a service uses DefaultDependencies=no then it already gets put in the right place, since that service is Before sysinit.target.

Python's os.urandom() in the absence of entropy

Posted Jul 11, 2016 12:29 UTC (Mon) by mathstuf (subscriber, #69389) [Link]

No, it just brings in the previous boot's random seed from the disk, therefore is also not available for containers or first-boot anyways. It does not wait for sufficient entropy (I'd expect it to have "wait" in its name and be a target otherwise). Also, I don't expect it would be in DefaultDependencies anyways since it would otherwise introduce noticeable boot lag in all services, even those which do not actually need it (like ssh-keygen and similar services).

https://www.freedesktop.org/software/systemd/man/systemd-...

Python's os.urandom() in the absence of entropy

Posted Jul 11, 2016 12:32 UTC (Mon) by mathstuf (subscriber, #69389) [Link]

Oh, the reason for being a target is so that things like hardware randomness harvest daemons have an explicit thing to do Before= and WantedBy= on in case there are different ways of getting other randomness. I think of it as a checkpoint in the boot graph rather than a single task, hence a target.

Python's os.urandom() in the absence of entropy

Posted Jul 10, 2016 23:43 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link]

Ugh. I'm _so_ glad I'm not using Python for anything major anymore.

As for the whole random/urandom stuff - attacks on random number generators range from "impossible" to "requires a lab to pull off in a controlled situation". There's simply no reason whatsoever to use /dev/random.

Waiting _once_ until there's enough initial entropy in urandom pool makes sense, but 128 bits are plenty enough for that and you can get them quickly enough.

Python's os.urandom() in the absence of entropy

Posted Jul 10, 2016 23:59 UTC (Sun) by droundy (subscriber, #4559) [Link]

Waiting _once_ for some entropy is what the change to python which caused the trouble did.

Wrong place

Posted Jul 11, 2016 5:18 UTC (Mon) by ncm (subscriber, #165) [Link]

The language is the wrong place to solve this problem. Generally the system has plenty of entropy available at the time it shuts down, so if it can stir a pot on disk once in a while while it's up, it can draw on that pot at startup, and deliver it to programs in any language.

Even if you boot from a LiveCD and so have no place to put a file, RTCs can store a hundred bits (or so) across power-cycles. Even the act of getting bits out of the RTC can generate some entropy, because of mismatched clocks.

Wrong place

Posted Jul 11, 2016 10:27 UTC (Mon) by keeperofdakeys (subscriber, #82635) [Link]

Operating systems already do this, they save a seed at shutdown, and reseed the random pool during bootup (also saving a new seed). This is usually done in the init system, so there is a small amount of time during startup where there may not be enough entropy.

Wrong place

Posted Jul 13, 2016 20:49 UTC (Wed) by akkornel (subscriber, #75292) [Link]

The only issue there is that the normal way this is done is essentially something like this:

cat /var/lib/random_pool_from_shutdown > /dev/urandom

That will mix data into the pool, but it does not actually credit any entropy. So when you do `cat /proc/sys/kernel/random/entropy_avail`, it won't show any increase.

If you do want to add random data to the pool _and_ credit entropy, you have to do an ioctl. The random(4) man page has the details!
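For readers who want to see the shape of that ioctl, here is a sketch in Python. It assumes the RNDADDENTROPY request and struct rand_pool_info layout from <linux/random.h>; the numeric request value shown is the one used on x86 and ARM and may differ on other architectures. The helper names are invented for this example, and the ioctl itself needs root privileges.

```python
import fcntl
import struct

# _IOW('R', 0x03, int[2]) as encoded on x86 and ARM (assumption: other
# architectures may encode ioctl numbers differently).
RNDADDENTROPY = 0x40085203


def pack_rand_pool_info(seed, entropy_bits):
    """Build a struct rand_pool_info: entropy_count, buf_size, then data."""
    return struct.pack("ii", entropy_bits, len(seed)) + seed


def credit_entropy(seed, entropy_bits, path="/dev/urandom"):
    """Mix seed into the pool and credit entropy_bits (requires root)."""
    with open(path, "wb") as dev:
        fcntl.ioctl(dev, RNDADDENTROPY, pack_rand_pool_info(seed, entropy_bits))


def entropy_avail():
    """Read the kernel's current entropy estimate, in bits."""
    with open("/proc/sys/kernel/random/entropy_avail") as f:
        return int(f.read())
```

Crediting more bits than the seed really contains defeats the purpose, so the entropy_bits argument should be a conservative estimate.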

Python's os.urandom() in the absence of entropy

Posted Jul 11, 2016 7:55 UTC (Mon) by lkundrak (subscriber, #43452) [Link]

Sounds like there should be a randomness-ready.target in systemd that would be reached when getrandom() no longer blocks and services that need the pool to be ready should depend on it?

Python's os.urandom() in the absence of entropy

Posted Jul 11, 2016 21:57 UTC (Mon) by flussence (subscriber, #85566) [Link]

Why would blocking a billion layers of abstraction up the stack (and causing a thundering herd) be better than blocking in the syscall where it's needed?

VirtIO RNG

Posted Jul 11, 2016 9:47 UTC (Mon) by tialaramex (subscriber, #21167) [Link]

So far as I can see what people need to be doing is insisting upon Virtio RNG.

The imaginary box in which your virtualised computer exists doesn't have any reliable source of actual random numbers. So you should suck entropy from the host, which does. As I understand it VirtIO RNG isn't in terribly good shape, but it's not a network or disk, it only needs to successfully move a few random bits from the host to kick things off in early boot, so it doesn't need to be a work of entrancing efficiency so long as it isn't full of security holes.


Copyright © 2016, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds