Skip to content

Hook up sha2 instruction and detect sha-supporting cpu #9

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

sakridge
Copy link

@sakridge sakridge commented Oct 23, 2019

Newer AMD ryzen-based cpus support sha2 hardware instructions. This hooks up that instruction and detects using the relevant cpuid flag.

This gives about 2.7x performance improvement on AMD Threadripper 1900x and 2950x.

@sakridge
Copy link
Author

@newpavlov are you taking new PRs here?

@tarcieri
Copy link
Member

@sakridge I can potentially review/merge your code (which at first glance looks fine to me), but we need @newpavlov to cut another release

@sakridge
Copy link
Author

@tarcieri that would be awesome. thanks

@sakridge
Copy link
Author

Any update on this?

@tarcieri
Copy link
Member

@sakridge I can review more thoroughly soon

@newpavlov
Copy link
Member

@sakridge
Sorry for the late reply! I was somewhat overwhelmed with work at my university, so I didn't have much time on open-source projects.

I think we can do it directly in the sha2 (and sha1 can be updated similiarly) crate using SIMD intrinsics. I have a draft for SHA-1 on my PC, but changes were blocked on updating block-buffer with a method with the input2 method, but unfortunately the implementation has broken MSRV promise, so I had to yank versions which contained it. So I though about doing it together with the next minor version update.

Unfortunately without this method internal state of a hash function will not be kept in SIMD registers (I guess because noalias is still broken on LLVM...), which significantly regresses the final performance.

@sakridge
Copy link
Author

@sakridge
Sorry for the late reply! I was somewhat overwhelmed with work at my university, so I didn't have much time on open-source projects.

Hi, no problem, thanks for the reply.

I think we can do it directly in the sha2 (and sha1 can be updated similiarly) crate using SIMD intrinsics. I have a draft for SHA-1 on my PC, but changes were blocked on updating block-buffer with a method with the input2 method, but unfortunately the implementation has broken MSRV promise, so I had to yank versions which contained it. So I though about doing it together with the next minor version update.

It sounds like we cannot for now given all the obstacles that you listed. Maybe we should go with something like this change for now which is ready than wait for all those things to be fixed. Can you also upload your change to a branch in whatever state it is in?

Unfortunately without this method internal state of a hash function will not be kept in SIMD registers (I guess because noalias is still broken on LLVM...), which significantly regresses the final performance.

My benchmarking of this change shows much higher performance on ryzen-based parts. Can you give an example where it is regressed?

@dignifiedquire
Copy link
Member

To help push the intrinsic versio forward I created a version of input2 that works on the current MSRV: RustCrypto/utils#36

@dignifiedquire
Copy link
Member

I have also created a pure rust port here: RustCrypto/hashes#90

@sakridge
Copy link
Author

I have also created a pure rust port here: RustCrypto/hashes#90

@dignifiedquire I tried your pure rust PR, it seems about 30% slower than this version, at least for the case I tested.

@dignifiedquire
Copy link
Member

@dignifiedquire I tried your pure rust PR, it seems about 30% slower than this version, at least for the case I tested.

Interesting, how did you test?
I think in general though the goal should to not be depending on a C compiler, so for absolute speed I would argue an assembly version is probably better, even if the rust intrinsics are somewhat slower.

@tarcieri
Copy link
Member

I am definitely a huge fan of "pure Rust" as well, and would honestly prefer slightly slower core::arch implementations (or ones which leverage e.g. packed_simd) to binary blobs.

To a certain extent this has some risks around constant time operation. But let me be blunt: ASM blobs have failed in practice over and over and over. They are inscrutable. They are a pain in the ass to link. And they are, for the most part, a fixed object which rarely see improvements due to the inscrutability, except when they have catastrophic security breakages.

As far as asm blobs go, ring is the go-to crate in the Rust ecosystem, and I don't see any need for this project to compete with that (and perhaps in saying that, I'm also saying the concept of asm-hashes is misguided in general).

The goal of this org is "cryptographic algorithms written in pure Rust" and that is definitely my personal preference.

@dignifiedquire
Copy link
Member

I am curious where you are getting the 30% difference from, these are numbers I am getting on a AMD 3970X:

Pure rust

test bench1_10    ... bench:           8 ns/iter (+/- 0) = 1250 MB/s
test bench2_100   ... bench:          54 ns/iter (+/- 0) = 1851 MB/s
test bench3_1000  ... bench:         506 ns/iter (+/- 1) = 1976 MB/s
test bench4_10000 ... bench:       5,014 ns/iter (+/- 26) = 1994 MB/s

Asm (this PR)

test bench1_10    ... bench:           8 ns/iter (+/- 0) = 1250 MB/s
test bench2_100   ... bench:          52 ns/iter (+/- 0) = 1923 MB/s
test bench3_1000  ... bench:         493 ns/iter (+/- 1) = 2028 MB/s
test bench4_10000 ... bench:       4,882 ns/iter (+/- 10) = 2048 MB/s

@sakridge
Copy link
Author

sakridge commented Mar 5, 2020

I am curious where you are getting the 30% difference from, these are numbers I am getting on a

I had tested a bit of a pathological case.
https://github.com/solana-labs/solana/blob/1f2aaf3f988023118c085e562a98ea570f6b947e/core/benches/poh.rs#L17

That is the benchmark, it's essentially a loop of hashing:

for _ in 0..n {
   new_hash = hash(current_hash.as_ref());
   current_hash = new_hash;
}

Maybe not an important case for most people, but our project is relying on this being as fast as possible. I was testing on AMD 2950x.

@tarcieri
Copy link
Member

It looks like sha256_x64.c could potentially be expressed in Rust using core::arch?

@newpavlov
Copy link
Member

Will close in favor of RustCrypto/hashes#167.

@newpavlov newpavlov closed this Jun 11, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants