-
Notifications
You must be signed in to change notification settings - Fork 753
Convert OPENSSL_memcmp
to pure rust
#1433
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
OPENSSL_memcmp
to pure rustOPENSSL_memcmp
to pure rust
Thanks for your work on this. I do want to rewrite this code in Rust. I also am interested in improving the performance. Here are my thoughts.
This is a strong statement to make, both for the existing code and for the new code. I'm not sure it's justified. The existing code doesn't use
between A longer-term goal of my goal is to find a way where we feel comfortable implementing those primitives in Rust. Regarding the performance and auto-vectorization, if you look at the callers of the function, and the callers of the wrapper One minor complication is that It seems likely that |
I am not sure that we have to be blocked on that. I feel like we could get good results regardless of that.
I don't think this is a given either. I wouldn't be surprised if, for a common application, we could actually have a net reduction in object code, if we are able to communicate the length information for the input arrays appropriately. |
Looking at that code in boringssl, it looks to me more like a compiler trick to leave it alone, and not optimise the value, that could probably easily be done by using volatile memory writes/reads.
This looks a bit tricky, it seems you can't use some "array generic" (afaik) that just fills in the size at compile time. If there's a few predetermined sizes, we could make a macro which generates these functions with those sizes instead, those would create alike functions that the compiler can reason about independently. Also, thanks for looking at this! |
@ShadowJonathan I can't spend a lot of time on this at this exact moment, but I sketched some ideas at #1434. No guarantees it is correct, but it shows some ideas I have had. The difficulty with emulating |
Thank you, i'll take a look at the Asm it generates, and/or see how the compiler works with it. |
My understanding is that the BoringSSL developers were told by the Clang developers to use that trick, and that it would continue to work. I don't know to what extent Googlers are ensuring it works. |
We're planning to have a Rust and/or assembly solution everywhere there is C, but not planning to have a tracking issue for each C -> Rust conversion. |
Uh oh!
There was an error while loading. Please reload this page.
When I was browsing the codebase somewhat to see how to take a crack at #25, I encountered
OPENSSL_memcmp
, which to me looked like a random small C function with a lot of binding code around it.Seeing as it was an easy target, and i understood the implications and methods of constant-time comparison pretty well (just
OR
-update a value and only after the loop take a look at it), I decided to take a crack at it.Here's a godbolt with the results, i'll take the rest of the issue explaining why this is possible, with much performance benefits;
https://godbolt.org/z/q6Ws57aTz
I first compared the Assembly of the C function, and took some look at how it did, its a simple 3-block function with a loop in the 2nd one;
However, I was immediately puzzled as to why rust had a more complex call graph on opt-level=2/3:
This was a bit daunting, as more blocks = more jumps = possible early/shorter exits, which wouldn't guarantee constant-time.
However, looking at opt-level=1 made the function look a bit more like the C equivalent;
With only an early exit if
rdi
/len
== 0, otherwise this was still performing the same way,.It turns out that opt-level 2+ introduced LLVM loop vectorization, disabling that gets this;
Still somewhat daunting, but some echoes from the opt-level 1 graph come back to this.
Turns out, the
LLB0_9
and10
blocks in that graph are actually performing more efficient comparisons over a quad then one-by-one byte comparisons that the previous one did, if(len & 3) != 0
, and performs the last 1-3 operations if there's any leftovers.This basically speeds up the function depending on
len
, but not depending on the contents of the memory addresses.The vectorization loops perform somewhat of the same comparison, only
len >= 32
, with a loop block that takes 64 bytes at a time, after which it steps down to 32-byte and 16-byte blocks, and if there's any "leftovers" still, to the byte-by-byte loop.(Upon closer look, it seems to do this actually rather inefficiently, when encountering a 32-byte block, it'll step to the 16-byte one first, before doing byte-by-byte. Similarly, it'll not do the 64 byte block if there's not 16 bytes extra in front of it. I wonder if this is a performance trade-off.)
All in all, studying this assembly (and learning a lot from it in the process) makes me confident of the following;
This'll depend on rust-lang/rust#74265, as i'm using that nightly feature to get more readable and consistent assembly codegen.
The text was updated successfully, but these errors were encountered: