Commit

optimize neon loadu_128/storeu_128 (#384)
vld1q_u8 and vst1q_u8 have no alignment requirements, so loadu_128 and storeu_128 can use them directly instead of memcpy.

This improves performance on Oracle Cloud's VM.Standard.A1.Flex by 1.15% on a 16*1024 input, from approximately 13920 nanoseconds down to approximately 13800 nanoseconds.
divinity76 authored Mar 12, 2024

1 parent 5b9af1c commit 58bea0b
Showing 1 changed file with 2 additions and 4 deletions.
6 changes: 2 additions & 4 deletions c/blake3_neon.c
@@ -10,14 +10,12 @@
 
 INLINE uint32x4_t loadu_128(const uint8_t src[16]) {
   // vld1q_u32 has alignment requirements. Don't use it.
-  uint32x4_t x;
-  memcpy(&x, src, 16);
-  return x;
+  return vreinterpretq_u32_u8(vld1q_u8(src));
 }
 
 INLINE void storeu_128(uint32x4_t src, uint8_t dest[16]) {
   // vst1q_u32 has alignment requirements. Don't use it.
-  memcpy(dest, &src, 16);
+  vst1q_u8(dest, vreinterpretq_u8_u32(src));
 }
 
 INLINE uint32x4_t add_128(uint32x4_t a, uint32x4_t b) {
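
One way to sanity-check the change is a round trip through both helpers at a deliberately misaligned offset. The sketch below is not part of the commit. The `roundtrip_ok` helper and the non-NEON fallback type are hypothetical, added only so the example also builds on machines without NEON. `static inline` stands in for the project's `INLINE` macro.

```c
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

// NEON path: byte-wise load and store, as in the diff above.
static inline uint32x4_t loadu_128(const uint8_t src[16]) {
  return vreinterpretq_u32_u8(vld1q_u8(src));
}

static inline void storeu_128(uint32x4_t src, uint8_t dest[16]) {
  vst1q_u8(dest, vreinterpretq_u8_u32(src));
}
#else
// Hypothetical stand-in (an assumption of this sketch, not BLAKE3 code)
// so the example compiles on targets without NEON.
typedef struct { uint32_t lane[4]; } uint32x4_t;

static inline uint32x4_t loadu_128(const uint8_t src[16]) {
  uint32x4_t x;
  memcpy(&x, src, 16);
  return x;
}

static inline void storeu_128(uint32x4_t src, uint8_t dest[16]) {
  memcpy(dest, &src, 16);
}
#endif

// Hypothetical helper: returns 1 if 16 bytes survive a load/store round
// trip when source and destination both start `offset` bytes into their
// buffers, and 0 otherwise (including when `offset` exceeds 16).
static int roundtrip_ok(size_t offset) {
  uint8_t in[32], out[32];
  if (offset > 16) return 0;  // keep the 16-byte window inside the buffers
  for (size_t i = 0; i < sizeof in; i++) in[i] = (uint8_t)(i * 7 + 1);
  memset(out, 0, sizeof out);
  storeu_128(loadu_128(in + offset), out + offset);
  return memcmp(in + offset, out + offset, 16) == 0;
}
```

Calling `roundtrip_ok` with odd offsets such as 1 or 3 exercises addresses that are not 4-byte aligned, which is the case the original comments about `vld1q_u32` and `vst1q_u32` warn about.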
