Use v8.4A-based x1 Keccak-f1600 on Apple CPUs #153

Merged (1 commit) on Sep 24, 2024

hanno-becker (Contributor)

Previously, we used the lazy-rotation scalar assembly. Since barrel shifting comes at a performance penalty on Apple CPUs (see https://dougallj.github.io/applecpu/firestorm.html), that implementation is slower than the standard C implementation.
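
For readers unfamiliar with the term, here is a minimal sketch in portable C of what "lazy rotation" means. It is an illustration only, not the assembly from this repository; the names `lazy_lane`, `lazy_value`, and `xor_lazy` are hypothetical. A lane is stored unrotated together with a pending rotation amount, and the rotation is applied only when the lane is consumed. On AArch64 the consumer can fold that rotation into the XOR through the shifted-register operand form (`eor xd, xn, xm, ror #n`), which is the operand form that this PR describes as costly on Apple cores.

```c
#include <stdint.h>

/* Rotate a 64-bit lane right by n bits; n is reduced modulo 64. */
static inline uint64_t ror64(uint64_t x, unsigned n)
{
    n &= 63;
    return n ? (x >> n) | (x << (64 - n)) : x;
}

/* Hypothetical model of a lazily rotated lane:
 * the logical value is ror64(raw, pending). */
typedef struct {
    uint64_t raw;     /* bits as stored, rotation not yet applied */
    unsigned pending; /* rotation still owed to this lane */
} lazy_lane;

/* Materialise the logical value (what an eager scheme stores directly). */
static inline uint64_t lazy_value(lazy_lane l)
{
    return ror64(l.raw, l.pending);
}

/* Consume a lazy lane in an XOR. This models a single
 * `eor acc, acc, raw, ror #pending` with a shifted-register operand. */
static inline uint64_t xor_lazy(uint64_t acc, lazy_lane l)
{
    return acc ^ ror64(l.raw, l.pending);
}
```

The saving is one explicit rotate instruction per lane, paid for with a shifted operand on every consumer. That trade is only a win when the shifted operand is free.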

Moreover, the standard C implementation is in turn slower than the x2-batched Neon implementation when the latter is restricted to one lane.

This commit therefore changes the default Keccak-f1600 implementation on Apple CPUs to be a 1-fold Neon-based Keccak using SHA3 instructions.
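
As background on why the SHA3 instructions suit Keccak-f1600, here is a hedged scalar model in C of the four Armv8 SHA3 extension operations (EOR3, RAX1, XAR, BCAX), applied to single 64-bit lanes. This is not the code added by this PR: the real instructions operate on 128-bit Neon registers, and the helpers `theta_d` and `chi_row` are hypothetical names that only show how the theta and chi steps map onto these operations. The state is assumed to be indexed as `A[x + 5*y]`.

```c
#include <stdint.h>

static inline uint64_t rol64(uint64_t x, unsigned n)
{
    n &= 63;
    return n ? (x << n) | (x >> (64 - n)) : x;
}

static inline uint64_t ror64(uint64_t x, unsigned n)
{
    n &= 63;
    return n ? (x >> n) | (x << (64 - n)) : x;
}

/* EOR3: three-way XOR. */
static inline uint64_t eor3(uint64_t a, uint64_t b, uint64_t c)
{
    return a ^ b ^ c;
}

/* RAX1: XOR with the second operand rotated left by one. */
static inline uint64_t rax1(uint64_t a, uint64_t b)
{
    return a ^ rol64(b, 1);
}

/* XAR: XOR, then rotate right by an immediate. */
static inline uint64_t xar(uint64_t a, uint64_t b, unsigned n)
{
    return ror64(a ^ b, n);
}

/* BCAX: bit clear and XOR, a ^ (b & ~c). */
static inline uint64_t bcax(uint64_t a, uint64_t b, uint64_t c)
{
    return a ^ (b & ~c);
}

/* Theta: column parities C[x] via EOR3, then
 * D[x] = C[x-1] ^ rol(C[x+1], 1) via RAX1. */
static void theta_d(const uint64_t A[25], uint64_t D[5])
{
    uint64_t C[5];
    for (int x = 0; x < 5; x++)
        C[x] = eor3(A[x], A[x + 5], eor3(A[x + 10], A[x + 15], A[x + 20]));
    for (int x = 0; x < 5; x++)
        D[x] = rax1(C[(x + 4) % 5], C[(x + 1) % 5]);
}

/* Chi on one row: r[x] ^= ~r[x+1] & r[x+2], one BCAX per lane. */
static void chi_row(uint64_t r[5])
{
    uint64_t t[5];
    for (int x = 0; x < 5; x++)
        t[x] = bcax(r[x], r[(x + 2) % 5], r[(x + 1) % 5]);
    for (int x = 0; x < 5; x++)
        r[x] = t[x];
}
```

In this model the rotations in theta (RAX1) and in rho (XAR) are part of the instruction itself, so no separate shifted operand is needed.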

@hanno-becker hanno-becker requested review from a team and mkannwischer September 23, 2024 05:19
@mkannwischer mkannwischer added the benchmark this PR should be benchmarked in CI label Sep 24, 2024

@mkannwischer mkannwischer left a comment

This looks good to me! Thank you @hanno-becker!
I tested this on my x86 machine and looked at the RPi CI.

Can you please post some M1 performance figures in this PR before and after this change? That could be useful for debugging in the future. It's unfortunate that we don't have automated M1 benchmarking.

Afterwards we can merge.

fips202/asm/asm.h (review thread resolved)
hanno-becker (Contributor, Author)

ML-KEM-512 (after)         ML-KEM-512 (before)    
keypair cycles=24992       keypair cycles=29791   
encaps cycles=27965        encaps cycles=37524    
decaps cycles=33610        decaps cycles=43361    
                                                  
ML-KEM-768 (after)         ML-KEM-768 (before)    
keypair cycles=44971       keypair cycles=52351   
encaps cycles=46128        encaps cycles=53451    
decaps cycles=53993        decaps cycles=61258    
                                                  
ML-KEM-1024 (after)        ML-KEM-1024 (before)   
keypair cycles=66607       keypair cycles=73948   
encaps cycles=68540        encaps cycles=76446    
decaps cycles=79471        decaps cycles=87363    
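
To read the figures above as relative savings, a small helper along these lines can be used. It is a sketch only; `reduction_percent` is a hypothetical name and not part of the repository's benchmarking code.

```c
/* Percentage of cycles saved going from `before` to `after`.
 * Computed in double so that a regression (after > before)
 * yields a negative result rather than wrapping around. */
static double reduction_percent(unsigned long before, unsigned long after)
{
    return 100.0 * ((double)before - (double)after) / (double)before;
}
```

For example, `reduction_percent(29791, 24992)` for ML-KEM-512 keypair comes out at roughly 16%, and `reduction_percent(37524, 27965)` for ML-KEM-512 encaps at roughly 25%.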


@hanno-becker hanno-becker enabled auto-merge (rebase) September 24, 2024 08:41
Signed-off-by: Hanno Becker <[email protected]>