Implement CIOS for ARM F::mul #134

sragss · 2024-02-02T02:25:03Z

Implements CIOS (Coarsely Integrated Operand Scanning) for 256-bit Montgomery field multiplication, specifically the fast variant (algorithm 2). These changes are particularly relevant on ARM, where the x86 BMI2 / AVX512 assembly backend is not available. There does not appear to be a NEON equivalent of MULX / ADOX / ADCX.
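For readers unfamiliar with the technique, here is a minimal sketch of textbook CIOS for four 64-bit limbs. It is not the code from this PR: the fast variant referenced above adds a precondition on the modulus and is not reproduced here. The names `mac`, `neg_inv` and `mont_mul` are hypothetical, and the modulus is passed as a plain argument instead of coming from the field macro.

```rust
/// Number of 64-bit limbs in a 256-bit element (little-endian limb order).
const N: usize = 4;

/// Computes a + b * c + carry and returns (low word, high word).
/// The result always fits in 128 bits.
fn mac(a: u64, b: u64, c: u64, carry: u64) -> (u64, u64) {
    let r = (a as u128) + (b as u128) * (c as u128) + (carry as u128);
    (r as u64, (r >> 64) as u64)
}

/// Computes -p0^{-1} mod 2^64 for an odd p0 using Newton iteration.
fn neg_inv(p0: u64) -> u64 {
    let mut x = p0; // p0 * p0 = 1 mod 8, so x is correct to 3 bits
    for _ in 0..5 {
        // Each step doubles the number of correct low bits.
        x = x.wrapping_mul(2u64.wrapping_sub(p0.wrapping_mul(x)));
    }
    x.wrapping_neg()
}

/// Textbook CIOS: returns a * b * R^{-1} mod p with R = 2^256.
/// Assumes p is odd and a, b < p (hypothetical helper, not the PR's code).
fn mont_mul(a: &[u64; N], b: &[u64; N], p: &[u64; N]) -> [u64; N] {
    let inv = neg_inv(p[0]);
    let mut t = [0u64; N + 2];
    for i in 0..N {
        // Multiplication step: t += a * b[i].
        let mut c = 0u64;
        for j in 0..N {
            (t[j], c) = mac(t[j], a[j], b[i], c);
        }
        let (s, c2) = t[N].overflowing_add(c);
        t[N] = s;
        t[N + 1] = c2 as u64;

        // Reduction step: add m * p so the low word becomes zero, then shift
        // right by one word.
        let m = t[0].wrapping_mul(inv);
        let (_, mut c) = mac(t[0], m, p[0], 0);
        for j in 1..N {
            (t[j - 1], c) = mac(t[j], m, p[j], c);
        }
        let (s, c2) = t[N].overflowing_add(c);
        t[N - 1] = s;
        t[N] = t[N + 1] + c2 as u64;
    }

    // Final conditional subtraction: the running value is below 2p.
    let mut r = [0u64; N];
    let mut borrow = false;
    for j in 0..N {
        let (d, b1) = t[j].overflowing_sub(p[j]);
        let (d, b2) = d.overflowing_sub(borrow as u64);
        r[j] = d;
        borrow = b1 || b2;
    }
    if t[N] != 0 || !borrow {
        r
    } else {
        [t[0], t[1], t[2], t[3]]
    }
}
```

A convenient way to sanity-check such a routine is a modulus whose Montgomery constants are easy to derive by hand. With p = 2^255 - 19, R = 2^256 reduces to 38, so multiplying any a < p by 38 should return a unchanged.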

Speeds up multiplication by 7-15%, roughly reaching parity with Arkworks on ARM. See sragss/speedy-fields to benchmark. cargo run --release gives more consistent results than cargo bench due to matching elements.

Currently bn256::test::test_consistent_hash_to_curve fails because it attempts to multiply a number larger than both the Montgomery radix and the field modulus. This results in a 1-bit difference between implementations. Specifically:

let lhs: [u64; 4] = [
    0x338ea436d0dbaca2,
    0x4439f510abdd1b04,
    0x5ee775ab94064231,
    0xf04dfd6e270d4842,
];

I'm not sure if this was intended to be supported. Other libraries (such as Arkworks) do not handle numbers outside of the range of the modulus. Any elements created via F::new() undergo field multiplication (by R2) where they're brought into the proper range.

This can be handled by checking whether the inputs are greater than or equal to the field modulus and subtracting the modulus in advance, but performing the check has a non-zero cost (2-5%).

let mut lhs = self.0;
if bigint_geq(&lhs, &$modulus.0) {
    let mut borrow = 0;
    (lhs[0], borrow) = sbb(lhs[0], $modulus.0[0], borrow);
    (lhs[1], borrow) = sbb(lhs[1], $modulus.0[1], borrow);
    (lhs[2], borrow) = sbb(lhs[2], $modulus.0[2], borrow);
    (lhs[3], borrow) = sbb(lhs[3], $modulus.0[3], borrow);
}
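The snippet above relies on two helpers whose definitions are not shown in this thread. A self-contained sketch of how they might look follows; the signatures of `sbb` and `bigint_geq` are assumptions inferred from the call sites (the borrow is carried as an all-ones mask), and `reduce_once` is a hypothetical wrapper added only for illustration.

```rust
/// Subtract with borrow: computes a - b - borrow_in.
/// Assumption: the borrow is represented as a mask, 0 or u64::MAX.
fn sbb(a: u64, b: u64, borrow: u64) -> (u64, u64) {
    let ret = (a as u128).wrapping_sub((b as u128) + ((borrow >> 63) as u128));
    // On underflow the high 64 bits are all ones, which yields the mask.
    (ret as u64, (ret >> 64) as u64)
}

/// Returns true if lhs >= rhs, with limbs in little-endian order.
/// Compares from the most significant limb down.
fn bigint_geq(lhs: &[u64; 4], rhs: &[u64; 4]) -> bool {
    for i in (0..4).rev() {
        if lhs[i] > rhs[i] {
            return true;
        }
        if lhs[i] < rhs[i] {
            return false;
        }
    }
    true // all limbs equal
}

/// Subtracts the modulus once if x >= modulus (hypothetical wrapper).
fn reduce_once(mut x: [u64; 4], modulus: &[u64; 4]) -> [u64; 4] {
    if bigint_geq(&x, modulus) {
        let mut borrow = 0;
        for i in 0..4 {
            (x[i], borrow) = sbb(x[i], modulus[i], borrow);
        }
        // x >= modulus, so the subtraction cannot underflow.
        debug_assert_eq!(borrow, 0);
    }
    x
}
```

Note that a single subtraction only brings x below the modulus when x is less than twice the modulus.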

duguorong009 · 2024-02-05T03:33:32Z

Thanks for the PR! @sragss

I think we should add the input range check in the mul function as you suggested.
Even though it comes with some cost, I believe it is currently the only way to introduce the CIOS algorithm to the repo.

(Normally, I use the F::from_raw function for input range check.)

What do you think? @CPerezz @davidnevadoc

davidnevadoc

Really clean and easy to follow, thanks for the improvement!
LGTM 👍

src/arithmetic.rs

src/derive/field.rs

Forgot to check the bit discrepancy for operands outside the normal range.

sragss · 2024-02-07T20:12:20Z

Fixed comments @davidnevadoc – Let me know what you think about the operands outside the normal range.

We can add the reduction before, or can update the bn256::test::test_consistent_hash_to_curve test which uses the hash_to_curve outside the modulus range. I suspect this test has bad data due to random generation by a script (~25% chance of generating out of the range) rather than intentional usage.

davidnevadoc · 2024-02-08T18:11:07Z

Regarding the out-of-range operands, the approach I prefer is to control the ways in which we create field elements and then assume they are in the appropriate range in all operations.
For this concrete case, the problem was that the non-asm version of from_u512 was performing a multiplication on an unchecked field element that could be any 256-bit value.

halo2curves/src/derive/field.rs

Line 159 in 3c43d3c

$field(lower_256) * $r2 + $field(upper_256) * $r3

The asm version, on the other hand, was using Self::montgomery_reduction to handle this conversion from bytes to a valid field element.
I think this should be the approach for the non-asm version as well. I propose a solution along these lines: 359619a

I have modified the montgomery_reduction function to have a non-asm version that performs the check @sragss suggested and then multiplies by the appropriate R.

Let me know what you think and feel free to add the change if you like it.
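As a side note on the from_u512 line quoted above, the identity it relies on can be written out. This assumes the conventional definitions, which are not spelled out in this thread: $R = 2^{256}$, `$r2` holds $R^2 \bmod p$, `$r3` holds $R^3 \bmod p$, and `*` is Montgomery multiplication $a \otimes b = a\,b\,R^{-1} \bmod p$. For a 512-bit input $x = \mathrm{lo} + 2^{256}\,\mathrm{hi}$:

```latex
\begin{aligned}
\mathrm{lo} \otimes R^2 &\equiv \mathrm{lo}\cdot R^2 \cdot R^{-1} \equiv \mathrm{lo}\cdot R \pmod p,\\
\mathrm{hi} \otimes R^3 &\equiv \mathrm{hi}\cdot R^3 \cdot R^{-1} \equiv (2^{256}\,\mathrm{hi})\cdot R \pmod p,\\
\mathrm{lo} \otimes R^2 + \mathrm{hi} \otimes R^3 &\equiv (\mathrm{lo} + 2^{256}\,\mathrm{hi})\cdot R \equiv x\,R \pmod p.
\end{aligned}
```

Under these assumptions the sum is the Montgomery form of $x$, provided the multiplication routine handles operands $\mathrm{lo}$ and $\mathrm{hi}$ that may be as large as $2^{256} - 1$.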

sragss · 2024-02-08T21:14:14Z

Agree with your approach – added those changes. All tests are passing now and we don't have the slowdown from adding the check to mul.

chfast · 2024-02-15T20:02:02Z

src/derive/field.rs

+                        (t[N - 1], _) = adc(c_2, c, 0);
+                    }
+
+                    if bigint_geq(&t, &$modulus.0) {


It seems the bigint_geq procedure and its usage may be suboptimal. First of all, notice that $t \ge m \iff \neg(t < m)$. You can implement "less-than" as the subtraction $t - m$: if it overflows (borrow is 1), then $t < m$.
Then notice that you are actually computing $t - m$ right after. The common pattern is to compute the subtraction speculatively and use its result if it hasn't overflowed.

(tmp, borrow) = t - m
t = borrow ? t : tmp
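The pseudocode above can be sketched in Rust roughly as follows. The helper names `sub_limbs` and `reduce_speculative` are hypothetical, and this version uses a plain `if` for the final select where a real implementation might prefer a branch-free one.

```rust
/// Computes a - b over four little-endian limbs.
/// Returns the wrapped difference and whether a final borrow occurred.
fn sub_limbs(a: &[u64; 4], b: &[u64; 4]) -> ([u64; 4], bool) {
    let mut out = [0u64; 4];
    let mut borrow = false;
    for i in 0..4 {
        let (d, b1) = a[i].overflowing_sub(b[i]);
        let (d, b2) = d.overflowing_sub(borrow as u64);
        out[i] = d;
        borrow = b1 || b2;
    }
    (out, borrow)
}

/// Speculative reduction: always compute t - m, then keep t if the
/// subtraction borrowed (t < m) and the difference otherwise.
fn reduce_speculative(t: [u64; 4], m: &[u64; 4]) -> [u64; 4] {
    let (tmp, borrow) = sub_limbs(&t, m);
    if borrow {
        t
    } else {
        tmp
    }
}
```

The comparison and the subtraction share one pass over the limbs, so no separate bigint_geq call is needed.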

sragss · 2024-02-15

Nice catch!

I believe this strategy would only save instructions when bigint_geq == true. It would cost some instructions when bigint_geq == false, as sbb gets broken out into a few instructions on ARM, whereas the 4 u64 less-than comparisons should be a single instruction each.

* impl CIOS
* more details
* add Fast CIOS for bn256
* rolled Fast CIOS
* clean comment
* geq for last line in bigint_geq
* update comment to include WORD_SIZE
* mod in montomgery
* cargo fmt
* cargo clippy

---------

Co-authored-by: sragss <[email protected]>

sragss added 5 commits February 1, 2024 11:34

impl CIOS

6658895

more details

f0aeaaa

add Fast CIOS for bn256

c1ddc89

rolled Fast CIOS

2317b12

clean comment

ba0f861

sragss changed the title ~~Implement CIOS for non-asm mul~~ Implement CIOS for ARM F::mul Feb 2, 2024

sragss marked this pull request as ready for review February 2, 2024 02:56

duguorong009 requested review from CPerezz and davidnevadoc and removed request for CPerezz February 5, 2024 03:28

davidnevadoc previously approved these changes Feb 5, 2024

src/arithmetic.rs (outdated, resolved)

src/derive/field.rs (outdated, resolved)

sragss added 2 commits February 7, 2024 12:08

geq for last line in bigint_geq

eff8411

update comment to include WORD_SIZE

cdbfe79

sragss and others added 4 commits February 8, 2024 13:09

mod in montomgery

5e88184

cargo fmt

84fd4b0

cargo clippy

be50bfc

Merge branch 'main' into main

dad1114

duguorong009 requested a review from davidnevadoc February 9, 2024 01:21

duguorong009 approved these changes Feb 9, 2024

davidnevadoc approved these changes Feb 9, 2024

davidnevadoc added this pull request to the merge queue Feb 9, 2024

Merged via the queue into privacy-scaling-explorations:main with commit 9fff22c Feb 9, 2024
11 checks passed

chfast reviewed Feb 15, 2024

davidnevadoc mentioned this pull request Apr 7, 2024

hash to curve suite #146

Merged

chfast mentioned this pull request Apr 17, 2024

Implement Fast CIOS for Montgomery modular multiplication ethereum/evmone#869

Open
