Convert OPENSSL_memcmp to pure rust #1433

Closed
ShadowJonathan opened this issue Dec 9, 2021 · 7 comments

Comments

@ShadowJonathan

ShadowJonathan commented Dec 9, 2021

When I was browsing the codebase to see how to take a crack at #25, I came across OPENSSL_memcmp, which looked to me like a small C function with a lot of binding code around it.

Seeing as it was an easy target, and I understood the implications and method of constant-time comparison fairly well (OR-accumulate a difference value and only inspect it after the loop), I decided to take a crack at it.

Here's a Godbolt link with the results; I'll spend the rest of the issue explaining why this is possible, and why it even brings performance benefits:

https://godbolt.org/z/q6Ws57aTz

#![feature(slice_ptr_get)]

pub unsafe fn constant_time_cmp(len: usize, a: *const [u8], b: *const [u8]) -> bool {
    let mut deviation: u8 = 0;

    // OR-accumulate byte differences; inspect the accumulator only after the loop.
    for i in 0..len {
        deviation |= a.get_unchecked(i).read() ^ b.get_unchecked(i).read();
    }

    deviation == 0
}
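For reference, the same OR-accumulation can also be written in safe Rust over slices. This is my own sketch (not code from the crate, and not the exact function from the Godbolt link, which uses raw slice pointers to get cleaner codegen):

```rust
/// Constant-time equality over equal-length slices: OR-accumulate
/// byte differences and inspect the accumulator only after the loop,
/// so there are no data-dependent branches or early exits.
pub fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    assert_eq!(a.len(), b.len());
    let mut deviation: u8 = 0;
    for (x, y) in a.iter().zip(b.iter()) {
        deviation |= x ^ y;
    }
    deviation == 0
}
```

The bounds checks disappear because `zip` makes the iteration lengths equal, so this compiles to much the same loop as the raw-pointer version.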

I first looked at the assembly of the C function; it's a simple 3-block function with a loop in the second block:

[screenshot: call graph of the C OPENSSL_memcmp assembly]

However, I was immediately puzzled as to why Rust had a more complex call graph at opt-level=2/3:

[screenshot: the more complex Rust call graph at opt-level=2/3]

This was a bit daunting, as more blocks = more jumps = possible early/shorter exits, which would no longer guarantee constant time.

However, looking at opt-level=1 made the function look a bit more like the C equivalent:

[screenshot: Rust call graph at opt-level=1]

With only an early exit if rdi/len == 0, this was otherwise still performing the same way.

It turns out that opt-level=2+ introduces LLVM loop vectorization; disabling that gives this:

[screenshot: Rust call graph at opt-level=2 with loop vectorization disabled]

Still somewhat daunting, but some echoes of the opt-level=1 graph come back here.

It turns out the LBB0_9 and LBB0_10 blocks in that graph actually perform more efficient comparisons a quad at a time rather than the one-by-one byte comparisons the previous graph did, and, if (len & 3) != 0, perform the last 1-3 byte operations on any leftovers.

This basically speeds up the function depending on len, but not depending on the contents of the memory addresses.

The vectorization loops perform much the same comparison, but only when len >= 32, with a loop block that takes 64 bytes at a time, after which it steps down to 32-byte and 16-byte blocks, and, if there are still any "leftovers", to the byte-by-byte loop.

(Upon closer look, it seems to do this rather inefficiently: when encountering a 32-byte block, it steps to the 16-byte one first before going byte-by-byte. Similarly, it won't use the 64-byte block unless there are 16 extra bytes in front of it. I wonder if this is a performance trade-off.)
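To illustrate the word-at-a-time idea the optimizer discovers, here's a sketch of my own (not the generated code, and not anything from the crate) that XORs 8-byte chunks into a single accumulator and mops up the 1-7 leftover bytes afterwards:

```rust
use std::convert::TryInto;

/// Sketch of word-at-a-time constant-time comparison: process 8-byte
/// chunks, then the remaining 0-7 bytes, OR-ing every difference into
/// one accumulator that is only inspected after both loops finish.
pub fn constant_time_eq_words(a: &[u8], b: &[u8]) -> bool {
    assert_eq!(a.len(), b.len());
    let split = a.len() - a.len() % 8;
    let (a_chunks, a_rest) = a.split_at(split);
    let (b_chunks, b_rest) = b.split_at(split);

    let mut acc: u64 = 0;
    for (x, y) in a_chunks.chunks_exact(8).zip(b_chunks.chunks_exact(8)) {
        acc |= u64::from_ne_bytes(x.try_into().unwrap())
            ^ u64::from_ne_bytes(y.try_into().unwrap());
    }
    for (x, y) in a_rest.iter().zip(b_rest.iter()) {
        acc |= u64::from(x ^ y);
    }
    acc == 0
}
```

Like the byte-by-byte version, the number of iterations depends only on the length, never on where (or whether) the inputs differ.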


All in all, studying this assembly (and learning a lot from it in the process) makes me confident of the following:

  • We can replace the C memcmp function with a pure-Rust one.
  • It does not violate constant-time guarantees.
  • We actually get a performance boost from it, more so depending on the CPU it runs on.
  • ...at the cost of a bit more assembly.

This'll depend on rust-lang/rust#74265, as I'm using that nightly feature to get more readable and consistent assembly codegen.

@ShadowJonathan ShadowJonathan changed the title Move OPENSSL_memcmp to pure rust Convert OPENSSL_memcmp to pure rust Dec 9, 2021
@briansmith
Owner

Thanks for your work on this. I do want to rewrite this code in Rust. I also am interested in improving the performance. Here are my thoughts.

  • It does not violate constant-time.

This is a strong statement to make, both for the existing code and for the new code. I'm not sure it's justified. The existing code doesn't use value_barrier_w. My goal is to have all code that needs to make a constant-time decision use value_barrier_w or a related constant_time_ function for the decision. In particular, for this code:

    let result = unsafe { OPENSSL_memcmp(a.as_ptr(), b.as_ptr(), a.len()) };
    match result {
        0 => Ok(()),
        _ => Err(error::Unspecified),
    }

between let result and match, value_barrier_w should be used.

A longer-term goal is to find a way where we feel comfortable implementing those primitives in Rust.

Regarding the performance and auto-vectorization: if you look at the callers of the function, and the callers of the wrapper verify_slices_are_equal, you can see that in many cases it would be better to let the compiler optimize the loop/vectorization based on the length of the array, since the exact (or at least rough) length of the array is known at compile time. For example, for ring::aead the array is always 16/core::mem::size_of::<usize>() machine words, so the ideal comparison function is just a few machine-word operations with no branching and, in particular, no need to decide between vectorized and non-vectorized code paths. Even in the cases where the length isn't known exactly at compile time, if we had a way of letting the compiler know what the maximum length is and/or that the length is a multiple of 4/8 bytes, that would help the optimization a lot.
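As an illustration of the fixed-length case described above, const generics let the compiler see the exact length, so the loop bound is a compile-time constant it can fully unroll or vectorize. This is a sketch with a hypothetical name, not ring's actual API:

```rust
/// When the length is a compile-time constant (e.g. a 16-byte AEAD tag),
/// the compiler knows the loop bound and can unroll the whole comparison
/// into a few machine-word operations with no length branching.
pub fn constant_time_eq_n<const N: usize>(a: &[u8; N], b: &[u8; N]) -> bool {
    let mut deviation: u8 = 0;
    for i in 0..N {
        deviation |= a[i] ^ b[i];
    }
    deviation == 0
}
```

At a call site like `constant_time_eq_n(&tag_a, &tag_b)` with `[u8; 16]` arguments, N is inferred as 16 and each monomorphized copy is optimized for its own fixed size.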

One minor complication is that OPENSSL_memcmp has a caller in curve25519.c that is non-Rust code. It's a solvable complication.

It seems likely that ring::aead is the user of this code that would benefit most from optimizing and inlining, so I suggest we use the AEAD benchmarks to verify the real-world effect of making performance improvements here, to start with.

@briansmith
Owner

This'll depend on rust-lang/rust#74265, as I'm using that nightly feature to get more readable and consistent assembly codegen.

I am not sure that we have to be blocked on that. I feel like we could get good results regardless of that.

...at the cost of a bit more assembly.

I don't think this is a given either. I wouldn't be surprised if, for a common application, we could actually have a net reduction in object code, if we are able to communicate the length information for the input arrays appropriately.

@ShadowJonathan
Author

ShadowJonathan commented Dec 10, 2021

The existing code doesn't use value_barrier_w. My goal is to have all code that needs to make a constant-time decision use value_barrier_w or a related constant_time_ function for the decision.

Looking at that code in BoringSSL, it looks to me more like a compiler trick to make it leave the value alone and not optimize it away, which could probably also be achieved by using volatile memory writes/reads.

Regarding the performance and auto-vectorization, if you look at the callers of the function, and the callers of the wrapper verify_slices_are_equal, you can see that in many cases it would be better to let the compiler optimize the loop/vectorization based on the length of the array, as the exact length of the array is known at compile time, or the rough length of the array is known at compile time.

This looks a bit tricky; it seems you can't use some kind of "array generic" (afaik) that just fills in the size at compile time.

If there are a few predetermined sizes, we could write a macro which generates these functions for those sizes instead; that would create similar functions the compiler can reason about independently.
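The macro idea could look something like this (hypothetical names, a sketch only):

```rust
// Hypothetical macro that stamps out a constant-time comparison function
// for each predetermined size, so every generated function has a
// compile-time-known loop bound the compiler can optimize independently.
macro_rules! constant_time_eq_fixed {
    ($name:ident, $len:expr) => {
        pub fn $name(a: &[u8; $len], b: &[u8; $len]) -> bool {
            let mut deviation: u8 = 0;
            for i in 0..$len {
                deviation |= a[i] ^ b[i];
            }
            deviation == 0
        }
    };
}

constant_time_eq_fixed!(constant_time_eq_16, 16);
constant_time_eq_fixed!(constant_time_eq_32, 32);
```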


Also, thanks for looking at this!

@briansmith
Owner

@ShadowJonathan I can't spend a lot of time on this at this exact moment, but I sketched some ideas in #1434. No guarantees it is correct, but it shows some ideas I have had. The difficulty with emulating value_barrier_w in Rust is the one I don't have a great solution for. I am aware of the technique of (ab)using volatile reads/writes, but I'm skeptical about it. The implementation of value_barrier_w is quite suspect too, though my understanding is that at least some people in the Clang community put effort into ensuring it will work.

@ShadowJonathan
Author

Thank you, I'll take a look at the assembly it generates, and/or see how the compiler works with it.

@briansmith
Owner

Looking at that code in BoringSSL, it looks to me more like a compiler trick to make it leave the value alone and not optimize it away, which could probably also be achieved by using volatile memory writes/reads.

My understanding is that the BoringSSL developers were told by the Clang developers to use that trick, and that it would continue to work. I don't know to what extent Googlers are ensuring it works.

@briansmith
Owner

We're planning to have a Rust and/or assembly solution everywhere there is C, but not planning to have a tracking issue for each C -> Rust conversion.
