The following code runs in 11 ns vs. 15 ns for the current version, which uses vbfmlaltq_f32. Does it make sense to use vbfdotq_f32 instead? I'm not sure this code is correct; how do we typically test the results?
Should we add an unaligned vector size to the benchmarks?
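SIMSIMD_PUBLIC void simsimd_dot_bf16_neon(simsimd_bf16_t const* a, simsimd_bf16_t const* b, simsimd_size_t n,
                                          simsimd_distance_t* result) {
    float32x4_t ab_vec = vdupq_n_f32(0);
    // Main loop: 8 bf16 pairs per iteration. vbfdotq_f32 adds the products of
    // each adjacent pair of bf16 lanes into one of the 4 f32 accumulator lanes.
    while (n >= 8) {
        bfloat16x8_t a_vec = vld1q_bf16((simsimd_bf16_for_arm_simd_t const*)a);
        bfloat16x8_t b_vec = vld1q_bf16((simsimd_bf16_for_arm_simd_t const*)b);
        ab_vec = vbfdotq_f32(ab_vec, a_vec, b_vec);
        n -= 8;
        a += 8;
        b += 8;
    }
    // Remainder: copy the last n < 8 elements into zero-padded buffers, so the
    // padding lanes contribute 0 to the dot product.
    if (n) {
        simsimd_bf16_t a_tail[8] = {0}, b_tail[8] = {0};
        for (simsimd_size_t i = 0; i < n; ++i) a_tail[i] = a[i], b_tail[i] = b[i];
        bfloat16x8_t a_vec = vld1q_bf16((simsimd_bf16_for_arm_simd_t const*)a_tail);
        bfloat16x8_t b_vec = vld1q_bf16((simsimd_bf16_for_arm_simd_t const*)b_tail);
        ab_vec = vbfdotq_f32(ab_vec, a_vec, b_vec);
    }
    *result = vaddvq_f32(ab_vec);
}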
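To reason about whether vbfdotq_f32 can replace the current instruction, it may help to model what each one computes per f32 lane. Below is a portable scalar sketch (plain C, no NEON), assuming the current kernel pairs vbfmlalbq_f32 (even bf16 lanes) with vbfmlaltq_f32 (odd bf16 lanes) and that bf16 values are stored as raw uint16_t; all `*_model` helper names are hypothetical. Per lane, BFDOT adds the same two products in one instruction that the BFMLALB/BFMLALT pair adds in two, which is a plausible source of the speedup. The model ignores intermediate rounding, which the real instructions may handle differently (check the Arm reference for BFDOT and BFMLALB/BFMLALT), so it only checks lane layout and tail handling, not bit-exact results.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// bf16 is the upper 16 bits of an IEEE-754 binary32, so widening is a 16-bit shift.
static float bf16_to_f32(uint16_t x) {
    uint32_t bits = (uint32_t)x << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

// vbfmlalbq_f32 model: lane i += a[2i] * b[2i] (even, "bottom" bf16 lanes).
static void bfmlalb_model(float acc[4], const uint16_t a[8], const uint16_t b[8]) {
    for (int i = 0; i < 4; ++i) acc[i] += bf16_to_f32(a[2 * i]) * bf16_to_f32(b[2 * i]);
}

// vbfmlaltq_f32 model: lane i += a[2i+1] * b[2i+1] (odd, "top" bf16 lanes).
static void bfmlalt_model(float acc[4], const uint16_t a[8], const uint16_t b[8]) {
    for (int i = 0; i < 4; ++i) acc[i] += bf16_to_f32(a[2 * i + 1]) * bf16_to_f32(b[2 * i + 1]);
}

// vbfdotq_f32 model: lane i += a[2i]*b[2i] + a[2i+1]*b[2i+1], both of the above in one step.
static void bfdot_model(float acc[4], const uint16_t a[8], const uint16_t b[8]) {
    for (int i = 0; i < 4; ++i)
        acc[i] += bf16_to_f32(a[2 * i]) * bf16_to_f32(b[2 * i]) +
                  bf16_to_f32(a[2 * i + 1]) * bf16_to_f32(b[2 * i + 1]);
}

// Horizontal sum of the 4 lanes, the role vaddvq_f32 plays in the kernel.
static float hsum4(const float acc[4]) { return acc[0] + acc[1] + acc[2] + acc[3]; }

// Whole-kernel model: full 8-element blocks via bfdot_model, then a zero-padded tail.
static float dot_bf16_model(const uint16_t *a, const uint16_t *b, size_t n) {
    float acc[4] = {0};
    for (; n >= 8; n -= 8, a += 8, b += 8) bfdot_model(acc, a, b);
    if (n) {
        uint16_t a_tail[8] = {0}, b_tail[8] = {0}; // 0x0000 is +0.0 in bf16
        for (size_t i = 0; i < n; ++i) a_tail[i] = a[i], b_tail[i] = b[i];
        bfdot_model(acc, a_tail, b_tail);
    }
    return hsum4(acc);
}
```

For example, with both inputs holding 1 through 8 as bf16 (0x3F80 through 0x4100), the BFMLALB/BFMLALT pair, the BFDOT model, and `dot_bf16_model` all sum to 204, and the first 3 elements alone give 14 through the tail path.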
I'm not sure this code is correct - how do we typically test the results?
@MarkReedZ, the C++ benchmarks log the accuracy delta compared to the serial baseline, and the Python tests will fail if this instruction does something weird. Let's run both.
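A minimal sketch of that kind of accuracy check, assuming bf16 values stored as raw uint16_t. It is not SimSIMD's actual benchmark or test code, and every function name here is hypothetical: a double-precision serial baseline, an f32 candidate standing in for the NEON kernel (it uses the same lane assignment as vbfdotq_f32), and the worst relative error over every length from 1 to `max_n`. Lengths that are not multiples of 8 exercise the remainder path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// bf16 is the upper 16 bits of an IEEE-754 binary32, so widening is a 16-bit shift.
static float bf16_to_f32(uint16_t x) {
    uint32_t bits = (uint32_t)x << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

// Serial baseline: widen to double and accumulate in double precision.
static double dot_bf16_serial_f64(const uint16_t *a, const uint16_t *b, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) sum += (double)bf16_to_f32(a[i]) * (double)bf16_to_f32(b[i]);
    return sum;
}

// Hypothetical candidate: f32 products accumulated into 4 f32 lanes, element i landing in
// lane (i / 2) % 4 as with vbfdotq_f32. Being scalar, it needs no padding for the tail.
static float dot_bf16_candidate(const uint16_t *a, const uint16_t *b, size_t n) {
    float acc[4] = {0};
    for (size_t i = 0; i < n; ++i) acc[(i / 2) % 4] += bf16_to_f32(a[i]) * bf16_to_f32(b[i]);
    return acc[0] + acc[1] + acc[2] + acc[3];
}

// Worst relative error of the candidate against the baseline over every n in [1, max_n].
static double max_relative_error(const uint16_t *a, const uint16_t *b, size_t max_n) {
    double worst = 0.0;
    for (size_t n = 1; n <= max_n; ++n) {
        double expected = dot_bf16_serial_f64(a, b, n);
        double diff = (double)dot_bf16_candidate(a, b, n) - expected;
        if (diff < 0) diff = -diff;
        double scale = expected < 0 ? -expected : expected;
        double err = scale > 0 ? diff / scale : diff; // absolute error when the baseline is 0
        if (err > worst) worst = err;
    }
    return worst;
}
```

In a real harness the candidate would be the NEON kernel itself, and the acceptable tolerance is a judgment call that depends on input magnitudes and lengths.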
@MarkReedZ, I'm not sure I've used that instruction before. If it doesn't change the compilation settings or CPU-capability requirements, sure! Otherwise, we can add a note everywhere vbfmlaltq_f32 is used. But for #163 it probably still makes sense.