Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Uneven performance of blst_p1s_mult_pippenger #235

Open
chfast opened this issue Oct 31, 2024 · 5 comments · May be fixed by #246
Open

Uneven performance of blst_p1s_mult_pippenger #235

chfast opened this issue Oct 31, 2024 · 5 comments · May be fixed by #246

Comments

@chfast
Copy link

chfast commented Oct 31, 2024

When benchmarking blst_p1s_mult_pippenger I noticed sudden increases in performance at number of points: 64, 128, 256 and further.

g1msm_perf

@dot-asm
Copy link
Collaborator

dot-asm commented Nov 1, 2024

And what's the issue? :-) But on a more serious note, the keyword is that the tangent becomes more and more moderate, and it depends on how you slice scalars depending on amount of inputs, which is a balancing act. The "scalar-slicing" procedure is prone to rounding errors, which is why the curve is bound to have breaks. Now with this in mind, what's the issue? That the breaks are too big?

@chfast
Copy link
Author

chfast commented Nov 4, 2024

I just wanted to notify that decision how to slice scalars depending on the number of inputs may be improved. E.g. currently, for our data it is faster to compute MSM for 65 points than for 55.

@dot-asm
Copy link
Collaborator

dot-asm commented Nov 5, 2024

For the record, performance for such small amounts of inputs has never been subject to such close scrutiny, let alone single-thread performance[!]. The latter is because even single-board computers are multi-core this time and day. But anyway, try to modify pippenger_window_size() in src/multi_scalar.c by adding npoints += 8; after size_t wbits; declaration.

@chfast
Copy link
Author

chfast commented Dec 9, 2024

I tried the suggestion (npoints += 8) but it's effect is limited.

I got the best results by:

  • increasing the bits by 1 from what is originally computed,
      size_t r = wbits>12 ? wbits-3 : (wbits>4 ? wbits-2 : (wbits ? 2 : 1));
      return r + 1;
  • decreasing the threshold of the fallback to mult_wbits by 4, for p1 this is 64 → 16
    if ((npoints * sizeof(ptype##_affine) * 8 * 3) <= SCRATCH_LIMIT / 4)

blst multi scalar parameters analysis

@dot-asm
Copy link
Collaborator

dot-asm commented Dec 18, 2024

Thanks! Could you double-check #246?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants