LWN: Comments on "Pitchforks for RDSEED" https://lwn.net/Articles/961121/ This is a special feed containing comments posted to the individual LWN article titled "Pitchforks for RDSEED". en-us Tue, 16 Sep 2025 05:06:14 +0000 Tue, 16 Sep 2025 05:06:14 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Pitchforks for RDSEED https://lwn.net/Articles/963514/ https://lwn.net/Articles/963514/ DanilaBerezin <div class="FormattedComment"> But the process isn't bringing down the machine, a failing `RDSEED` -- which is kernel functionality -- is. I think a warning is warranted in this case, especially when the only known cause for this is a deliberate attack on the randomness subsystem. If panic-on-warn is enabled and this warning causes the system to crash, that would be the admin's fault.<br> </div> Sat, 24 Feb 2024 15:14:39 +0000 Pitchforks for RDSEED https://lwn.net/Articles/961905/ https://lwn.net/Articles/961905/ jcm <div class="FormattedComment"> There are (shockingly) architectures beyond x86. I know! It's surprising, isn't it ;) But anyway, this isn't a unique problem. However, on at least one other architecture we do have some platform implementation guidance around rate of reseed and consideration for denial of service from bad actors sharing a socket.<br> </div> Tue, 13 Feb 2024 04:49:19 +0000 Pitchforks for RDSEED https://lwn.net/Articles/961684/ https://lwn.net/Articles/961684/ mathstuf <div class="FormattedComment"> <span class="QuotedText">&gt; If this is occurring after a retry loop of 10 (which, if I did my math right, has less than a 0.001% chance of happening if a 'normal' RDSEED has a failure rate of 30%)</span><br> <p> That sounds like you're assuming events are independent. 
I feel like there's some level of dependence when one is right after another (i.e., failures will cluster).<br> </div> Sat, 10 Feb 2024 15:26:28 +0000 Pitchforks for RDSEED https://lwn.net/Articles/961604/ https://lwn.net/Articles/961604/ zx2c4 <div class="FormattedComment"> This didn't make it in time for publication, but here's the latest patch on the matter: <a href="https://lore.kernel.org/all/20240209164946.4164052-1-Jason@zx2c4.com/">https://lore.kernel.org/all/20240209164946.4164052-1-Jaso...</a><br> </div> Fri, 09 Feb 2024 16:54:59 +0000 Instruction failure https://lwn.net/Articles/961544/ https://lwn.net/Articles/961544/ corbet The RD* instructions do not fail silently; the CPU sets the carry flag to indicate whether they succeed or not. Fri, 09 Feb 2024 14:31:57 +0000 Pitchforks for RDSEED https://lwn.net/Articles/961542/ https://lwn.net/Articles/961542/ spacefrogg <div class="FormattedComment"> Then the CPUs are actually broken. You must guard your entropy state or it's useless to you.<br> <p> Daniel J. Bernstein has an excellent write-up about how you could potentially mis-use an entropy source to force your encryption scheme to leak your private key alongside your ciphertext while making it look perfectly fine. I just can't find it right now. (It was concentrating on why encryption schemes should not mindlessly access entropy.)<br> <p> The argument is quite simple: You do know how the encryption scheme distributes bits. 
If you control the entropy bits and the encryption scheme, you can hide the private key in the entropy bits and make the ciphertext partially predictable, enough to recover the private key and thus the original message.<br> </div> Fri, 09 Feb 2024 14:31:30 +0000 Pitchforks for RDSEED https://lwn.net/Articles/961536/ https://lwn.net/Articles/961536/ khim <font class="QuotedText">&gt; I have worked in the field of hardware true random number generation and I can tell you with confidence: However you build a system that contains a TRNG, you will always find a significant amount of instances in which the TRNG will fail at some point.</font> <p>Nope. Or, rather: technically yes, practically no.</p> <p>You are forgetting one fundamental truth: there's a non-zero probability (around 10⁻¹² or something like that) that <b>anything at all</b> would fail in a computer. Even a simple <code>jz test_passed</code> may jump to the wrong place once in a while.</p> <font class="QuotedText">&gt; Conclusion: Every TRNG that does not fail from time to time is already failing.</font> <p>Why should it fail significantly more often than once per 10¹² calls, though? <b>That</b> is the question and I don't see why that would be the case.</p> <p>We ignore probabilities smaller than these (simply because we don't really have much choice) and if we do that, then making a non-fallible (in a practical sense) TRNG should be possible.</p> Fri, 09 Feb 2024 14:22:22 +0000 Pitchforks for RDSEED https://lwn.net/Articles/961527/ https://lwn.net/Articles/961527/ DemiMarie <p>The problem is that there is usually very little software can do to handle that failure. </p><p> Many random number generation APIs are infallible: they promise to always succeed. <i>Always.</i> The simplest way for software to handle this is to busy-spin until success. 
If the RNG is working, this succeeds with probability 1, and the likelihood of more retry loops being needed decreases exponentially with time.</p> <p>I’d much rather have RDRAND use a CSPRNG like Fortuna that never runs out of entropy once seeded, block the boot until the RNG <i>is</i> seeded, and reseed with entropy from the TRNG whenever possible. Fortuna is designed to be secure even if the new entropy is malicious, so one can feed raw output from the TRNG without any further conditioning. Entropy estimation is then only needed to determine when, and if, RDRAND needs to block. There would be no error returns for software to check: if the TRNG is not working for long enough, you get a machine check exception.</p> Fri, 09 Feb 2024 14:01:12 +0000 Pitchforks for RDSEED https://lwn.net/Articles/961528/ https://lwn.net/Articles/961528/ dullfire <div class="FormattedComment"> I was actually just thinking about that.<br> <p> I think you MIGHT be able to maintain that probability estimate, if there's a change (possibly adding delays) you can make so that the next RDSEED is mostly unrelated to the first. Also note that it isn't necessary to try accounting for things like another thread attempting to drain entropy (since that would be an attack, in which case a warning, or panic if panic_on_warn, is a perfectly sane response).<br> <p> IF that's possible[0], then you just need to pick a loop count that makes the likelihood of successive failures unreasonably small. <br> <p> <p> Although, honestly I think the sanest course of action would simply be to use dedicated hardware (that requires privilege to access) in the non-cloud case. In my humble opinion the whole notion of confidential cloud compute is intractable, so I have no proposed solutions for it.<br> <p> <p> <p> <p> [0] I think it should be, but have no proof. 
<br> </div> Fri, 09 Feb 2024 13:58:57 +0000 Pitchforks for RDSEED https://lwn.net/Articles/961526/ https://lwn.net/Articles/961526/ epa <div class="FormattedComment"> Oh, I completely misunderstood the article in that case. By the instruction ‘failing’ I thought it meant generating some kind of fault or trap, and not giving any value back. If it appears to work, but the values are no longer random enough for practical use (they have become predictable), then that is a much harder kind of failure to detect and handle. <br> </div> Fri, 09 Feb 2024 13:43:15 +0000 Pitchforks for RDSEED https://lwn.net/Articles/961525/ https://lwn.net/Articles/961525/ zx2c4 These are unprivileged CPU instructions. This isn't a kernel scheduler issue. Fri, 09 Feb 2024 13:28:43 +0000 Pitchforks for RDSEED https://lwn.net/Articles/961524/ https://lwn.net/Articles/961524/ spacefrogg <div class="FormattedComment"> Thinking about it further, entropy is like CPU time. Once consumed by a process, you cannot get it back (as opposed to memory), but you have to let time pass. So, an entropy-access scheduler seems to be one solution. Processes would need to announce to the kernel that they want to be part of the list of entropy users and get a scheduled and limited amount of RDSEED calls. Obviously, this needs close monitoring of the system's entropy state, which might be hard to do and may end up being only as good as limiting calls to x/sec.<br> </div> Fri, 09 Feb 2024 12:59:09 +0000 Pitchforks for RDSEED https://lwn.net/Articles/961523/ https://lwn.net/Articles/961523/ Wol <div class="FormattedComment"> And I think you need to study some statistics. There's a reason buses come in threes ...<br> <p> Cheers,<br> Wol<br> </div> Fri, 09 Feb 2024 12:52:35 +0000 Pitchforks for RDSEED https://lwn.net/Articles/961522/ https://lwn.net/Articles/961522/ spacefrogg <div class="FormattedComment"> I don't think you understand the relations. You cannot, be assured. 
(At least not by way of waving your hands.)<br> <p> If you really must and invest a lot of money, the best you can currently achieve is the quality level (very high), speed (kbit/s), chip size (10mm^2) and usage intervals (&lt;100 per day) of smartcards.<br> </div> Fri, 09 Feb 2024 12:47:21 +0000 Pitchforks for RDSEED https://lwn.net/Articles/961519/ https://lwn.net/Articles/961519/ spacefrogg <div class="FormattedComment"> True, I was not arguing against that. It came across in the article that Dave Hansen suggested considering failing TRNGs to signify broken hardware, which is not a proper way to look at the issue.<br> <p> Cross-domain fairness seems to be hard to achieve when I think about it, especially because entropy is such a scarce resource. Access to the entropy source should be exclusive to the kernel(s). Applications should have to make do with derived PRNG values. But I am just thinking out loud...<br> </div> Fri, 09 Feb 2024 12:42:03 +0000 Pitchforks for RDSEED https://lwn.net/Articles/961518/ https://lwn.net/Articles/961518/ smurf <div class="FormattedComment"> Sure – but you can make this number of instances less significant than, well, any arbitrary probability you desire. Simply retry often enough so that it only happens once per century, assuming that a billion CPUs do nothing but call RDSEED.<br> </div> Fri, 09 Feb 2024 12:30:38 +0000 Pitchforks for RDSEED https://lwn.net/Articles/961516/ https://lwn.net/Articles/961516/ zx2c4 <div class="FormattedComment"> The issue isn't that it fails but that one trust domain can potentially force failures in another. If it would "eventually succeed", one could try again and again until success. But if a different domain can induce perpetual failure, then we've now got two problems. So solutions include making RDRAND keep generating stream output without fail, or imposing some sort of cross-domain fairness with regard to failures. 
The threads on LKML discuss this.<br> </div> Fri, 09 Feb 2024 12:18:15 +0000 Pitchforks for RDSEED https://lwn.net/Articles/961510/ https://lwn.net/Articles/961510/ spacefrogg <div class="FormattedComment"> I have worked in the field of hardware true random number generation and I can tell you with confidence: However you build a system that contains a TRNG, you will always find a significant number of instances in which the TRNG will fail at some point.<br> <p> Any expectation that the OS can just assume (or even assert) that it is an error that a TRNG fails at some point is severely misguided.<br> <p> I can even deliver a mathematical argument if you desire. Any functionality test of a TRNG must restrict itself to using a limited amount of TRNG output to draw its conclusion. Otherwise it would never finish. Any TRNG must eventually produce a sequence, by chance(!), that contradicts the functionality test. Otherwise it would not be completely random, because it would specifically avoid the test patterns. Conclusion: Every TRNG that does not fail from time to time is already failing.<br> <p> Yes, you can shift the probabilities and so on, but you cannot avoid the fundamental mechanics. By delivering billions of CPUs each running for thousands of hours, you will find those instances.<br> </div> Fri, 09 Feb 2024 11:37:41 +0000 Pitchforks for RDSEED https://lwn.net/Articles/961500/ https://lwn.net/Articles/961500/ taladar <div class="FormattedComment"> I don't think you can calculate the probability of repeated failures in the retry loop like that. It is not as if they are independent events. 
If entropy is exhausted by one RDSEED instruction, it is much less likely to have been restored to a usable level when the very next CPU instruction is another RDSEED than when an RDSEED occurs after many other instructions.<br> </div> Fri, 09 Feb 2024 10:42:24 +0000 Pitchforks for RDSEED https://lwn.net/Articles/961496/ https://lwn.net/Articles/961496/ adobriyan <div class="FormattedComment"> 5950X (microcode 0xa20120e) doesn't seem to fail rdseed.<br> </div> Fri, 09 Feb 2024 10:12:04 +0000 Pitchforks for RDSEED https://lwn.net/Articles/961493/ https://lwn.net/Articles/961493/ james <blockquote> ...the host controls a guest's scheduling and could, in theory, interfere with a retry loop as well. If retries are forced to happen when the random-number generator is exhausted, they will never succeed. </blockquote> Or the host could just never let the guest run? Same effect? Fri, 09 Feb 2024 09:44:58 +0000 Pitchforks for RDSEED https://lwn.net/Articles/961464/ https://lwn.net/Articles/961464/ Jannes <div class="FormattedComment"> I understood this as an *unprivileged* process being able to bring down the entire machine. 
That's probably not the admin's intention.<br> <p> A misbehaving app should just crash itself, not bring down the entire kernel and thereby DoS all other apps.<br> </div> Fri, 09 Feb 2024 00:08:57 +0000 Pitchforks for RDSEED https://lwn.net/Articles/961409/ https://lwn.net/Articles/961409/ corsac <div class="FormattedComment"> Seems that on my older system (i5-5200U) rdseed never fails<br> </div> Thu, 08 Feb 2024 19:24:17 +0000 Pitchforks for RDSEED https://lwn.net/Articles/961408/ https://lwn.net/Articles/961408/ vstinner <div class="FormattedComment"> <span class="QuotedText">&gt; he observed that RDSEED failed nearly 30% of the time</span><br> <p> "Result on my i7-11850H: RDRAND: 100.00%, RDSEED: 29.26%"<br> <p> So RDRAND has a success rate of 100% (and 0% failure), and RDSEED has a success rate of 29.26%, which means around 70% failure, no?<br> </div> Thu, 08 Feb 2024 19:24:10 +0000 panic_on_warn https://lwn.net/Articles/961401/ https://lwn.net/Articles/961401/ corbet Yes, the system is behaving as configured in that setting. Still, developers need to be (and are) aware that issuing warnings can have that effect in a fairly wide swath of deployed systems and consider warnings carefully. Thu, 08 Feb 2024 18:29:09 +0000 Pitchforks for RDSEED https://lwn.net/Articles/961400/ https://lwn.net/Articles/961400/ dullfire <div class="FormattedComment"> <span class="QuotedText">&gt; Many systems run with panic_on_warn enabled; on such systems, a warning will cause a system crash. That would turn a random-number-generator failure into a denial-of-service problem.</span><br> <p> Sorry, but this is a bad statement. If the system is running with panic_on_warn, then the system has explicitly been told to panic (effectively go offline) on a warning event. Which means that it can't be a "denial-of-service" when it is behaving exactly the way the admins[0] requested. Additionally, panic_on_warn turns any ability to generate warnings into a DoS by that definition. 
<br> <p> Would you call an admin ssh-ing in and running "sudo reboot" a "denial of service"? If so, that makes the term so broad as to be useless.<br> <p> Furthermore: If this is occurring after a retry loop of 10 (which, if I did my math right, has less than a 0.001% chance of happening if a 'normal' RDSEED has a failure rate of 30%), then most likely the best case is that someone is simply probing your system (or your host) for weaknesses. In the worst case, there's an actual attack in progress. An immediate panic might be the correct response.<br> <p> <p> [0] Alternatively, a non-"admin" managed to control your kernel command line (or equivalent), but if that is the case, you have other, very different, and much worse problems.<br> </div> Thu, 08 Feb 2024 18:27:13 +0000