Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement faster inversion #28

Closed
kilic opened this issue Feb 16, 2023 · 1 comment
Closed

Implement faster inversion #28

kilic opened this issue Feb 16, 2023 · 1 comment

Comments

@kilic
Copy link
Collaborator

kilic commented Feb 16, 2023

Upcoming optimized msm method privacy-scaling-explorations/halo2#40 heavily uses base field inversion. Currently inversion is an exponentiation with (p-2). It can be optimized by replacing with one of these

@mratsim
Copy link
Contributor

mratsim commented Jun 15, 2023

The current state-of-the-art is based on either Bernstein-Yang or Pornin's GCD.

Overview

Modular inverse can be computed either via Fermat's little theorem or a variant of Extended Euclid Algorithm to solve GCD (for example using binary shift/add/sub instead the canonical with quotients/division)

Binary Extended Euclid Algorithm (bEEA)

I'll use the algorithm by Niels Möller as the basis of my analysis (Niles is author of the GNU Nettle cryptographic library and a significant contributor to GMP, in particular the constant-time modular inverse).

Algorithm 5 in
Fast Software Polynomial Multiplication on ARM Processors Using the NEON Engine
Danilo Camara, Conrado P. L. Gouvea, Julio Lopez, and Ricardo Dahab
https://link.springer.com/content/pdf/10.1007%2F978-3-642-40588-4_10.pdf

Input: integer x, odd integer n, x < n
Output: x−1 (mod n)
1:   function ModInv(x, n)
2:   (a, b, u, v) ← (x, n, 1, 1)
3:   ℓ ← ⌊log2 n⌋ + 1            ⮚ number of bits in n
4:   for i ← 0 to 2ℓ − 1 do
5:     odd ← a & 1
6:     if odd and a ≥ b then
7:       a ← a − b
8:     else if odd and a < b then
9:       (a, b, u, v) ← (b − a, a, v, u)
10:    a ← a >> 1
11:    if odd then u ← u − v
12:    if u < 0 then u ← u + n
13:    if u & 1 then u ← u + n
14:    u ← u >> 1
15:  return v

image

Constant-time algorithms make it easier to analyze speed because they always take the worst case time. As you can see, the worst case grows with iteration of 2l with l the bitsize of the modulus so 254 for BN254/BN256/alt_bn128.

Fermat's Little Theorem (FLT)

Using Little Fermat's Theorem, we can use exponentiation by p-2 and rely on a highly optimized modular multiplication.
However multiplication has an asymptotic complexity of O(n²) meaning as the prime modulus gets bigger, FLT gets slower relative to bEEA.

Addition chains can reduce the number of multiplications but not of squaring, In particular BN254 has a quite high hamming weight for a cryptographic prime.

Benching my old transition (2020) to addition chains, the speedup was only 30% (mratsim/constantine#80 )
image

The bottleneck

Both traditional EEA and FLT have a bottleneck, each operations are done on the full integer width (254 bits).

The big innovation in Bernstein-Yang and Pornin is that instead of working on the full integer width, you only work on:

  • the bottom 62 bits (Bernstein-Yang)
  • or top 31 and bottom 31 bits (Pornin)
    and those are enough to make all the if/else branch decisions of EEA.

Then you propagate those transitions based on 62 or 31/31 on the full integer width (254 bits) once every 31 or 62 iterations.
Hence, they are asymptotically about 254/62 = 4.1x faster.

References

For implementation details and gotchas, see mratsim/constantine#172 (comment)

Pornin's writeup: https://research.nccgroup.com/2020/09/28/faster-modular-inversion-and-legendre-symbol-and-an-x25519-speed-record/

Bernstein-Yang inversion:

Pornin's inversion:

Benchmarks

Both 255-bit Pornin's constant-time inversion from BLST and 255-bit BY's inversion from Constantine are in the 1100 to 1300ns on my machine:
image

For BN254 specifically, Bernstein-Yang variable-time
image

And Gnark's Pornin inversion (variable-time):
image

Note that BLST uses Assembly and Constantine/Gnark do not.

Legendre symbol

Both BY and Pornin's GCD method can be adapted to compute the Legendre symbol, see:

Legendre symbol is a bottleneck in Hashing to curve of BN254 (#47)

@davidnevadoc davidnevadoc changed the title Impelement faster inversion Implement faster inversion Aug 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

No branches or pull requests

2 participants