Use v8.4A-based x1 Keccak-f1600 on Apple CPUs #153

hanno-becker · 2024-09-23T05:19:54Z

Previously, we used the lazy-rotation scalar assembly. Since barrel shifting comes at a performance penalty on Apple CPUs (see https://dougallj.github.io/applecpu/firestorm.html), that implementation is slower than the standard C implementation.

Moreover, the standard C implementation is in turn slower than the x2-batched Neon implementation, even when the latter is restricted to one lane.

This commit therefore changes the default Keccak-f1600 implementation on Apple CPUs to be a 1-fold Neon-based Keccak using SHA3 instructions.
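For readers unfamiliar with the Armv8.4-A SHA3 extension, the following is a minimal portable C sketch of what its four instructions (EOR3, RAX1, XAR, BCAX) compute on a single 64-bit lane. The `model_*` and `rol64` names are hypothetical and chosen for illustration only; this is not the assembly added by this PR. It is a reference model that shows how RAX1 and XAR combine an XOR with a rotation in one instruction, so no shifted-operand form is needed.

```c
#include <stdint.h>

/* Rotate a 64-bit lane left by n bits (n is reduced mod 64). */
static inline uint64_t rol64(uint64_t x, unsigned n)
{
    n &= 63;
    return n ? (x << n) | (x >> (64 - n)) : x;
}

/* EOR3: three-way XOR, useful for the column parities in theta. */
static inline uint64_t model_eor3(uint64_t a, uint64_t b, uint64_t c)
{
    return a ^ b ^ c;
}

/* RAX1: XOR the first operand with the second rotated left by one. */
static inline uint64_t model_rax1(uint64_t a, uint64_t b)
{
    return a ^ rol64(b, 1);
}

/* XAR: XOR two operands, then rotate the result right by imm bits. */
static inline uint64_t model_xar(uint64_t a, uint64_t b, unsigned imm)
{
    /* A right rotation by imm equals a left rotation by 64 - imm. */
    return rol64(a ^ b, 64 - (imm & 63));
}

/* BCAX: a XOR (b AND NOT c), i.e. one lane of the chi step. */
static inline uint64_t model_bcax(uint64_t a, uint64_t b, uint64_t c)
{
    return a ^ (b & ~c);
}
```

In the real instructions each operation acts on both 64-bit halves of a 128-bit vector register; a 1-fold (x1) implementation only needs the result for one Keccak state.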

mkannwischer

This looks good to me! Thank you @hanno-becker!
I tested this on my x86 machine and looked at the RPi CI.

Can you please post some M1 performance figures in this PR before and after this change? That could be useful for debugging in the future. It's unfortunate that we don't have automated M1 benchmarking.

Afterwards we can merge.

fips202/asm/asm.h

hanno-becker · 2024-09-24T08:41:28Z

ML-KEM-512 (after)         ML-KEM-512 (before)    
keypair cycles=24992       keypair cycles=29791   
encaps cycles=27965        encaps cycles=37524    
decaps cycles=33610        decaps cycles=43361    
                                                  
ML-KEM-768 (after)         ML-KEM-768 (before)    
keypair cycles=44971       keypair cycles=52351   
encaps cycles=46128        encaps cycles=53451    
decaps cycles=53993        decaps cycles=61258    
                                                  
ML-KEM-1024 (after)        ML-KEM-1024 (before)   
keypair cycles=66607       keypair cycles=73948   
encaps cycles=68540        encaps cycles=76446    
decaps cycles=79471        decaps cycles=87363
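To read the figures above as relative improvements, a small helper like the following could be used. It is only a sketch: the function name `cycle_reduction` is hypothetical, and the inputs are the cycle counts quoted in the table.

```c
/* Fraction of cycles saved relative to the "before" measurement.
 * Example input: ML-KEM-512 keypair, before=29791, after=24992. */
static double cycle_reduction(unsigned long before, unsigned long after)
{
    return ((double)before - (double)after) / (double)before;
}
```

For example, `cycle_reduction(29791, 24992)` gives the saving for ML-KEM-512 key generation as a fraction of the previous cycle count; multiply by 100 for a percentage.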

hanno-becker requested review from a team and mkannwischer September 23, 2024 05:19

hanno-becker force-pushed the apple_keccak_x1 branch from 15bcf03 to c2649c5 Compare September 23, 2024 18:47

mkannwischer added the "benchmark" label (this PR should be benchmarked in CI) Sep 24, 2024

mkannwischer approved these changes Sep 24, 2024

hanno-becker enabled auto-merge (rebase) September 24, 2024 08:41

hanno-becker force-pushed the apple_keccak_x1 branch from c2649c5 to 4740cd7 Compare September 24, 2024 08:41

hanno-becker merged commit 09d6442 into main Sep 24, 2024

hanno-becker deleted the apple_keccak_x1 branch September 24, 2024 08:41
