Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using vbfdotq_f32 for dot_bf16_neon is faster? #167

Closed
MarkReedZ opened this issue Sep 8, 2024 · 3 comments
Closed

Using vbfdotq_f32 for dot_bf16_neon is faster? #167

MarkReedZ opened this issue Sep 8, 2024 · 3 comments

Comments

@MarkReedZ
Copy link
Contributor

The following code runs in 11ns vs 15ns for the current version using vbfmlaltq_f32. Does it make sense to use vbfdotq_32? I'm not sure this code is correct - how do we typically test the results?

Should we add an unaligned vector size to the benchmarks?

SIMSIMD_PUBLIC void simsimd_dot_bf16_neon(simsimd_bf16_t const* a, simsimd_bf16_t const* b, simsimd_size_t n,
                                          simsimd_distance_t* result) {
    float32x4_t ab_vec = vdupq_n_f32(0);

    while (n >= 8) {
        bfloat16x8_t a_vec = vld1q_bf16((simsimd_bf16_for_arm_simd_t const*)a);
        bfloat16x8_t b_vec = vld1q_bf16((simsimd_bf16_for_arm_simd_t const*)b);

        ab_vec = vbfdotq_f32(ab_vec, a_vec, b_vec);

        n -= 8;
        a += 8;
        b += 8;
    }
    // TODO handle the remainder
    *result = vaddvq_f32(ab_vec);
}
@ashvardanian
Copy link
Owner

ashvardanian commented Sep 8, 2024

I'm not sure this code is correct - how do we typically test the results?

@MarkReedZ, C++ benchmarks will log the accuracy delta compared to serial baseline. Python tests will fail if this instruction does something weird. Let's run those two.

@ashvardanian
Copy link
Owner

Does it make sense to use vbfdotq_32?

@MarkReedZ, I'm not sure if I've used that instruction before. If it doesn't affect compilation settings and CPU-capability requirements, sure! Otherwise, we can add a note everywhere vbfmlaltq_f32 is used. But for #163 it probably still makes sense.

@MarkReedZ
Copy link
Contributor Author

PR here: #169

Run on AWS t4g, and c7g instances with gcc 12/13 only.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants