Implement faster inversion #28

kilic · 2023-02-16T15:39:25Z

The upcoming optimized MSM method (privacy-scaling-explorations/halo2#40) relies heavily on base field inversion. Currently, inversion is an exponentiation by (p-2). It can be optimized by replacing it with one of these:

Montgomery inversion
using addition chains like this https://github.com/ChihChengLiang/modexp


mratsim · 2023-06-15T08:21:30Z

The current state-of-the-art is based on either Bernstein-Yang or Pornin's GCD.

Overview

Modular inverse can be computed either via Fermat's little theorem or via a variant of the Extended Euclidean Algorithm used to compute the GCD (for example, a binary variant using shifts/additions/subtractions instead of the canonical one with quotients/divisions).
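For reference, the canonical quotient-based variant mentioned above can be sketched in Python as follows. This is only an illustration: the name `modinv_euclid` is made up for this example, and the code is variable-time because the number of loop iterations depends on the inputs.

```python
def modinv_euclid(x, n):
    """Inverse of x modulo n via the classic extended Euclidean algorithm.

    Invariant: t0 * x ≡ r0 (mod n) and t1 * x ≡ r1 (mod n).
    """
    r0, r1 = n, x % n
    t0, t1 = 0, 1
    while r1 != 0:
        q = r0 // r1                    # full-width division at every step
        r0, r1 = r1, r0 - q * r1
        t0, t1 = t1, t0 - q * t1
    if r0 != 1:
        raise ValueError("x is not invertible modulo n")
    return t0 % n
```

The division `r0 // r1` is the step that the binary variants replace with shifts and subtractions.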

Binary Extended Euclid Algorithm (bEEA)

I'll use the algorithm by Niels Möller as the basis of my analysis (Niels is the author of the GNU Nettle cryptographic library and a significant contributor to GMP, in particular to the constant-time modular inverse).

Algorithm 5 in
Fast Software Polynomial Multiplication on ARM Processors Using the NEON Engine
Danilo Camara, Conrado P. L. Gouvea, Julio Lopez, and Ricardo Dahab
https://link.springer.com/content/pdf/10.1007%2F978-3-642-40588-4_10.pdf

Input: integer x, odd integer n, x < n
Output: x⁻¹ (mod n)
function ModInv(x, n)
  (a, b, u, v) ← (x, n, 1, 0)
  ℓ ← ⌊log2 n⌋ + 1            ⮚ number of bits in n
  for i ← 0 to 2ℓ − 1 do
    odd ← a & 1
    if odd and a ≥ b then
      a ← a − b
    else if odd and a < b then
      (a, b, u, v) ← (b − a, a, v, u)
    a ← a >> 1
    if odd then u ← u − v
    if u < 0 then u ← u + n
    if u & 1 then u ← u + n
    u ← u >> 1
  return v

Constant-time algorithms make it easier to analyze speed because they always take the worst-case time. As you can see, the loop always runs 2ℓ iterations, where ℓ is the bit size of the modulus: ℓ = 254, hence 508 iterations, for BN254/BN256/alt_bn128.
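A direct Python transcription of the loop may help to see the fixed iteration count. This is a sketch under a few assumptions: the name `modinv_moller` is made up here, `v` starts at 0 so that the invariant b ≡ v·x (mod n) holds for the initial b = n, and Python big integers and branches are not constant-time, so only the control flow mirrors the constant-time algorithm.

```python
def modinv_moller(x, n):
    """Binary extended Euclid with a fixed number of iterations (2 * bitsize of n).

    Invariants: a ≡ u * x (mod n) and b ≡ v * x (mod n).
    Requires n odd, 0 <= x < n and gcd(x, n) = 1.
    """
    assert n & 1 and 0 <= x < n
    a, b, u, v = x, n, 1, 0
    for _ in range(2 * n.bit_length()):
        odd = a & 1
        if odd and a >= b:
            a -= b
        elif odd and a < b:
            a, b, u, v = b - a, a, v, u
        a >>= 1                 # a is even at this point
        if odd:
            u -= v
        if u < 0:
            u += n
        if u & 1:               # make u even so it can be halved modulo n
            u += n
        u >>= 1
    return v
```

Every iteration touches the full-width values `a`, `b`, `u` and `v`, which is the cost the later sections try to avoid.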

Fermat's Little Theorem (FLT)

Using Fermat's Little Theorem, we can use exponentiation by p-2 and rely on a highly optimized modular multiplication.
However, multiplication has an asymptotic complexity of O(n²) in the size n of the modulus, so as the prime modulus gets bigger, FLT gets slower relative to bEEA.

Addition chains can reduce the number of multiplications but not the number of squarings. In particular, BN254 has quite a high Hamming weight for a cryptographic prime.
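As a rough illustration of where the cost comes from, the sketch below inverts via FLT and counts the operations of a plain left-to-right square-and-multiply. The helper names are made up for this example, `P` is the BN254 (alt_bn128) base field modulus, and the count applies to textbook binary exponentiation, not to any particular addition chain.

```python
# BN254 / alt_bn128 base field modulus
P = 21888242871839275222246405745257275088696311157297823662689037894645226208583

def modinv_flt(x, p):
    """Inverse of x modulo a prime p via Fermat's little theorem: x^(p-2) mod p."""
    return pow(x, p - 2, p)

def square_and_multiply_cost(e):
    """(squarings, multiplications) of left-to-right binary exponentiation by e.

    One squaring per bit after the leading one, one multiplication per
    additional set bit (Hamming weight minus one).
    """
    return e.bit_length() - 1, bin(e).count("1") - 1

squarings, multiplications = square_and_multiply_cost(P - 2)
print(squarings, multiplications)
```

An addition chain can only lower the second number. The first stays at roughly the bit size of the exponent.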

When I benchmarked my old (2020) transition to addition chains, the speedup was only 30% (mratsim/constantine#80).

The bottleneck

Both traditional EEA and FLT share a bottleneck: every operation is done on the full integer width (254 bits).

The big innovation in Bernstein-Yang and Pornin is that instead of working on the full integer width, you only work on:

the bottom 62 bits (Bernstein-Yang)
or top 31 and bottom 31 bits (Pornin)
and those are enough to make all the if/else branch decisions of EEA.

You then apply the accumulated transitions, computed from those 62 or 31/31 bits, to the full-width integers (254 bits) only once every 62 or 31 iterations, respectively.
Hence, they are asymptotically about 254/62 = 4.1x faster.
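The batching idea can be sketched in Python along the lines of the Bernstein-Yang "divsteps" formulation, loosely following the structure described in the secp256k1 safegcd document referenced below. This is an illustrative variable-time sketch, not any library's implementation: the names `divsteps_matrix` and `modinv_by` are made up here, and the division by 2^N modulo m is done with a precomputed inverse of 2^N, which is a simplification chosen for this example.

```python
N = 62  # number of divsteps batched between two full-width updates

def divsteps_matrix(delta, f_lo, g_lo):
    """Run N divsteps looking only at the low N bits of f and g.

    Returns the new delta and a matrix (u, v, q, r) such that
    2^N * f_new = u*f + v*g and 2^N * g_new = q*f + r*g.
    """
    u, v, q, r = 1, 0, 0, 1
    for _ in range(N):
        if delta > 0 and g_lo & 1:
            delta, f_lo, g_lo = 1 - delta, g_lo, (g_lo - f_lo) >> 1
            u, v, q, r = 2 * q, 2 * r, q - u, r - v
        elif g_lo & 1:
            delta, g_lo = 1 + delta, (g_lo + f_lo) >> 1
            u, v, q, r = 2 * u, 2 * v, q + u, r + v
        else:
            delta, g_lo = 1 + delta, g_lo >> 1
            u, v = 2 * u, 2 * v
    return delta, (u, v, q, r)

def modinv_by(x, m):
    """Inverse of x modulo an odd m with gcd(x, m) = 1 (variable-time sketch).

    Invariants: d * x ≡ f (mod m) and e * x ≡ g (mod m).
    """
    mask = (1 << N) - 1
    inv_2n = pow((m + 1) // 2, N, m)    # (m + 1) / 2 is the inverse of 2 mod odd m
    delta, f, g, d, e = 1, m, x % m, 0, 1
    while g != 0:
        # All branch decisions come from the low N bits only.
        delta, (u, v, q, r) = divsteps_matrix(delta, f & mask, g & mask)
        # Full-width work happens once per batch of N divsteps.
        f, g = (u * f + v * g) >> N, (q * f + r * g) >> N
        d, e = (u * d + v * e) * inv_2n % m, (q * d + r * e) * inv_2n % m
    return d * f % m                    # f is +1 or -1 here; this fixes the sign
```

The inner loop only manipulates word-sized values (plus the small matrix), and the four full-width multiply-accumulate steps run once per batch, which is where the speedup estimated above comes from.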

References

For implementation details and gotchas, see mratsim/constantine#172 (comment)

Pornin's writeup: https://research.nccgroup.com/2020/09/28/faster-modular-inversion-and-legendre-symbol-and-an-x25519-speed-record/

Bernstein-Yang inversion:

References:
- Original Bernstein-Yang paper, https://eprint.iacr.org/2019/266
- Formal verification by Hvass-Aranha-Spitters, https://eprint.iacr.org/2021/549
Implementations:
- In formally-verified fiat-crypto:
  - https://github.com/mit-plv/fiat-crypto/blob/e612a32/src/Arithmetic/BYInv.v
  - https://github.com/mratsim/constantine/blob/c02e6bdf845898742d661bbfe048092bd32a0271/formal_verification/bls12_381_q_64.c#L2623
- In Relic by Aranha: https://github.com/relic-toolkit/relic/blob/6d29b27/src/fp/relic_fp_inv.c#L553
- In Bitcoin's secp256k1:
  - doc: https://github.com/bitcoin-core/secp256k1/blob/0775283/doc/safegcd_implementation.md
  - impl: https://github.com/bitcoin-core/secp256k1/blob/0775283/src/modinv64_impl.h
- In Constantine: https://github.com/mratsim/constantine/blob/151f284/constantine/math/arithmetic/limbs_exgcd.nim

Pornin's inversion:

References:
- https://eprint.iacr.org/2020/972.pdf
  - https://github.com/pornin/bingcd
Discussion on SIMD optimization [Optim] SIMD for Pornin's GCD inverse supranational/blst#62
Implementations:

Benchmarks

Both the 255-bit Pornin constant-time inversion from BLST and the 255-bit BY inversion from Constantine are in the 1100 to 1300 ns range on my machine:

For BN254 specifically, Bernstein-Yang variable-time

And Gnark's Pornin inversion (variable-time):

Note that BLST uses Assembly and Constantine/Gnark do not.

Legendre symbol

Both BY and Pornin's GCD method can be adapted to compute the Legendre symbol, see:

Legendre symbol is a bottleneck in Hashing to curve of BN254 (#47)
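To illustrate why a GCD-style loop can replace an exponentiation here, the sketch below compares Euler's criterion with the classic Euclid-style Jacobi symbol algorithm. This is the textbook algorithm, not the BY or Pornin variant, and the function names are made up for this example. For a prime modulus the Jacobi symbol equals the Legendre symbol.

```python
def legendre_euler(a, p):
    """Legendre symbol via Euler's criterion: a^((p-1)/2) mod p, for an odd prime p."""
    e = pow(a, (p - 1) // 2, p)
    return -1 if e == p - 1 else e      # e is 0, 1 or p - 1

def jacobi(a, n):
    """Jacobi symbol (a/n) for odd n > 0, using only shifts, swaps and reductions."""
    assert n > 0 and n % 2 == 1
    a %= n
    result = 1
    while a != 0:
        while a % 2 == 0:
            a >>= 1
            if n % 8 in (3, 5):         # (2/n) = -1 when n ≡ 3 or 5 (mod 8)
                result = -result
        a, n = n, a                     # quadratic reciprocity
        if a % 4 == 3 and n % 4 == 3:
            result = -result
        a %= n
    return result if n == 1 else 0
```

The sign updates depend only on the low bits of the operands, which is the property that lets the batched GCD methods track the symbol alongside their transitions.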

kilic moved this to Icebox in zkEVM Community Edition Feb 16, 2023

kilic added this to zkEVM Community Edition Feb 16, 2023

mratsim mentioned this issue Jun 11, 2024

Towards state-of-the-art multi-scalar-muls #163

Open

6 tasks

davidnevadoc changed the title ~~Impelement faster inversion~~ Implement faster inversion Aug 11, 2023

mratsim mentioned this issue Aug 29, 2023

Fast modular inverse - 9.4x acceleration #83

Merged

han0110 closed this as completed in #83 Oct 19, 2023

davidnevadoc moved this to ✅ Done in zkEVM Community Edition Jan 15, 2024
