RDRAND-based output is (too) biased #228
In the
Interestingly, the rdrand crate source code says that the bit pattern is all-ones when an AMD CPU fails, yet it also checks for all-zeroes without explaining why. getrandom does the same, also without any explanation. I dug into the systemd patch at poettering/systemd@c6372be, which also checks both values and gives a clearer explanation of the reasoning:
So the value 0 is being screened out "just in case." Doesn't sound good to me. I think that what might be not too terrible for systemd in that particular circumstance may be not quite good enough for a general purpose |
I try not to comment on this repo too much (since I passed off maintainership), but I do remember a little about this issue.
Biased, yes, and you are right that this was initially only considered in the case of
I wish we could just go to the AMD website and find a listing of all possible failure cases. Alas, we lack comprehensive sources of information, and similar failures might later be discovered in future CPUs, so is there much more we can do?
Apparently this also affects some of the latest AMD CPUs, but in a harder-to-detect way: collisions on Ryzen 5900X. (Perhaps the better answer is never to trust RDRAND on an AMD CPU? Not to say there aren't potential issues with other vendors.)
Thanks @dhardy. I remember that second systemd issue now. Here's an important comment from that issue:
I don't know if that claim is accurate, but if it is accurate then it explains why systemd is comfortable with using a non-uniform RNG. The |
I think we should introduce separate code paths for affected families and everyone else. I have proposed it previously, but we chose the current approach for simplicity's sake. It would be nice to have more info on the 0x17 family.
This hardware bug is really scary, so I think "just in case" is perfectly fine here. I think it's better to have a small bias than potentially zeroed nonces/keys. Personally, I think that on affected families it may be worth going even further and checking that we do not encounter collisions, by calling the step function twice and checking that it returns a different result each time.
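The double-call check suggested above could look roughly like the following. This is only a sketch: `passes_repetition_check` is a name made up for illustration, and `step` is a hypothetical stand-in for one RDRAND attempt (a closure returning `Some(word)` on reported success and `None` on reported failure), not the crate's actual step function.

```rust
/// Calls `step` twice and reports whether both attempts succeeded and
/// produced different words. `step` is a hypothetical stand-in for one
/// RDRAND attempt; it is an assumption of this sketch, not a real API.
fn passes_repetition_check<F>(mut step: F) -> bool
where
    F: FnMut() -> Option<u64>,
{
    match (step(), step()) {
        // Two successful reads: a stuck generator would repeat itself.
        (Some(first), Some(second)) => first != second,
        // Any reported failure counts as not passing.
        _ => false,
    }
}
```

A generator stuck at a constant, for example `passes_repetition_check(|| Some(u64::MAX))`, fails the check. Note that a healthy generator producing uniform, independent 64-bit words would also fail it with probability 1/2^64 per pair, so a check of this kind rejects some legitimate output as well.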
Why are only all-zeroes and all-ones scary, and no other values? My understanding is that AMD documented and/or somebody observed that all-ones occurs in the case of the specific AMD failure; did AMD document, or did somebody observe, that the all-zero value occurs for these failures?
False positives would occur. (This type of test is documented in NIST/FIPS standards and implementation guidance, FYI.) As I mentioned in #230, I don't think it is a good idea to default to using RDRAND for any target. What I would prefer is that there is no purely-RDRAND-based implementation in |
Some context on why that check is the way it is. The check is basically "did the implementation forget to set the CF flag". Both AMD and Intel document what value is put in the destination register on failure. For AMD it is all 1s (as discussed above), for Intel it is all 0s (see section 5.2 of Intel's DRNG whitepaper). So we just check for both. This is obviously not an ideal state, as ideally the
I agree on many of the things in #230, but removing RDRAND completely is unlikely. Many users of this crate (myself included) rely on this functionality and its simple ergonomics, and I wouldn't want to break them. |
I was looking into this issue today, and I think we should just mimic what the Linux Kernel does here:
This would also make our code almost match that in BoringSSL as well, except for them disabling RDRAND on It turns out it was added in response to the Zen 2 RDRAND issues (Phoronix, Reddit post). As far as I can tell, a bad BIOS caused RDRAND to always fail, but still signal failure correctly via |
See getrandom/src/rdrand.rs, lines 33 to 42 in 30308ae.
The condition
if el != 0 && el != !0
tests that the value returned by RDRAND, after it reports success, is neither zero nor all-one bits, i.e. never usize::MIN or usize::MAX. Consequently, when the RDRAND implementation is being used, getrandom::getrandom will never return a result in which any single word is zero or all-one bits, where a word is a 4-byte chunk on 32-bit x86 or an 8-byte chunk on 64-bit x86. In uniform output, each word would take one of these two values with probability 2/2^N. As a result, any use of getrandom::getrandom returns results that are wrongly biased; 2 of the 2^N possible word values are rejected on an N-bit platform.
32-bit x86 support for the RDRAND feature was added in PR #134, after PR #48. The analysis of the bias at the time the code was implemented was based on the probability 2/2^64, which is correct when this code runs on a 64-bit CPU but not when it runs on a 32-bit CPU. I suspect that when PR #134 was under consideration, the difference in the bias was perhaps overlooked. The bias is notably worse on 32-bit platforms, since 2/2^32 is larger than 2/2^64 by a factor of 2^32.
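To make the effect of the condition concrete, here is a minimal sketch of a retry loop with the same filter, plus the per-word rejection probability. It is not the crate's code: `filtered_word`, `rejection_probability`, and `retry_limit` are names invented for illustration, the word type is fixed to `u64`, and `step` is a hypothetical stand-in for one RDRAND attempt (`Some(word)` when success is reported, `None` otherwise).

```rust
/// Models the filtering loop. `step` is a hypothetical stand-in for one
/// RDRAND attempt and `retry_limit` is an illustrative parameter; neither
/// is taken from the crate.
fn filtered_word<F>(mut step: F, retry_limit: usize) -> Option<u64>
where
    F: FnMut() -> Option<u64>,
{
    for _ in 0..retry_limit {
        if let Some(el) = step() {
            // Same shape as the condition under discussion: all-zero and
            // all-one words are treated as suspect and never returned.
            if el != 0 && el != !0 {
                return Some(el);
            }
        }
    }
    None
}

/// Probability that a uniformly random `word_bits`-bit word is one of the
/// two rejected values, i.e. 2 / 2^word_bits. Requires `word_bits < 128`.
fn rejection_probability(word_bits: u32) -> f64 {
    2.0 / (1u128 << word_bits) as f64
}
```

With this model, a source that only ever yields `0` or `u64::MAX` exhausts the retries and returns `None`, so those two values can never reach the caller. Comparing `rejection_probability(32)` with `rejection_probability(64)` shows the factor of 2^32 between the 32-bit and 64-bit cases.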
Outside of cryptography, I find the present solution particularly unfortunate because if getrandom's output is used for randomized testing of a 64-bit/32-bit function on x86_64/x86 on a target for which the RDRAND implementation is used, the important boundary conditions of the values usize::MIN and usize::MAX will never be reached, ever!
In the context of implementing cryptographic functions, and especially keygen and nonce generation, I suspect any user of this crate is likely to get pestered about this bias and will have to address it in some way.
In any case, I think the present solution is weird enough that it is worth having more eyes on it and more discussion, and also I hope we could find a better solution.
Incidentally, BoringSSL's code has this to say:
It seems like Google has reason to believe that some family 0x17 models may also be bad. So the comment indicating the problem is only with families 14h-16h may be worth updating, at least.