Implement CIOS for ARM F::mul #134

Merged
merged 11 commits into from
Feb 9, 2024

@sragss sragss commented Feb 2, 2024

Implements CIOS (Coarsely Integrated Operand Scanning) for 256-bit Montgomery field multiplication, specifically the fast variant (algorithm 2). These changes are particularly relevant on ARM, where we have neither x86 BMI2 / AVX512 nor the associated assembly backend. There does not appear to be a NEON equivalent of MULX / ADOX / ADCX.
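
For readers unfamiliar with CIOS, here is a minimal sketch of the textbook loop. It is not the fast variant this PR implements and not the PR's macro code. The helper names mac, adc and sbb echo the ones quoted in this thread, but their signatures, the 0-or-1 borrow convention, and the neg_inv and mont_mul functions are assumptions made for illustration.

```rust
/// Computes a + b * c + carry, returning (low word, high word).
fn mac(a: u64, b: u64, c: u64, carry: u64) -> (u64, u64) {
    let r = a as u128 + (b as u128) * (c as u128) + carry as u128;
    (r as u64, (r >> 64) as u64)
}

/// Computes a + b + carry, returning (sum, carry out).
fn adc(a: u64, b: u64, carry: u64) -> (u64, u64) {
    let r = a as u128 + b as u128 + carry as u128;
    (r as u64, (r >> 64) as u64)
}

/// Computes a - b - borrow, returning (difference, borrow out as 0 or 1).
fn sbb(a: u64, b: u64, borrow: u64) -> (u64, u64) {
    let r = (a as u128).wrapping_sub(b as u128 + borrow as u128);
    (r as u64, (r >> 127) as u64)
}

/// Returns -m0^{-1} mod 2^64 for odd m0, via Newton iteration.
fn neg_inv(m0: u64) -> u64 {
    let mut inv = 1u64; // correct modulo 2 because m0 is odd
    for _ in 0..6 {
        // Each step doubles the number of correct low bits (1 up to 64).
        inv = inv.wrapping_mul(2u64.wrapping_sub(m0.wrapping_mul(inv)));
    }
    inv.wrapping_neg()
}

/// Textbook CIOS: returns a * b * 2^(-64 * N) mod m, for a, b < m and odd m.
/// Limbs are stored least-significant first; inv must equal neg_inv(m[0]).
fn mont_mul<const N: usize>(a: &[u64; N], b: &[u64; N], m: &[u64; N], inv: u64) -> [u64; N] {
    let mut t = [0u64; N];
    let mut t_hi = 0u64; // word N of the accumulator
    for i in 0..N {
        // Multiplication step: t += a * b[i].
        let mut c = 0u64;
        for j in 0..N {
            (t[j], c) = mac(t[j], a[j], b[i], c);
        }
        let (t_n, t_n1) = adc(t_hi, c, 0);
        // Reduction step: add k * m so the low word becomes zero,
        // then shift the accumulator down by one word.
        let k = t[0].wrapping_mul(inv);
        let (_, mut c) = mac(t[0], k, m[0], 0);
        for j in 1..N {
            (t[j - 1], c) = mac(t[j], k, m[j], c);
        }
        (t[N - 1], c) = adc(t_n, c, 0);
        t_hi = t_n1 + c;
    }
    // Final conditional subtraction: the accumulator is below 2m here.
    let mut r = [0u64; N];
    let mut borrow = 0u64;
    for j in 0..N {
        (r[j], borrow) = sbb(t[j], m[j], borrow);
    }
    if t_hi != 0 || borrow == 0 { r } else { t }
}
```

The sketch keeps an extra accumulator word so that it works for any odd modulus; it makes no claim about how the fast variant in this PR handles carries.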

Speeds up F::mul by 7-15%, roughly reaching parity with Arkworks on ARM. See sragss/speedy-fields to benchmark; cargo run --release gives more consistent results than cargo bench due to matching elements.

Currently bn256::test::test_consistent_hash_to_curve fails because it attempts to multiply a number larger than both the Montgomery radix and the field modulus. This results in a 1-bit difference between implementations. Specifically:

let lhs: [u64; 4] = [
    0x338ea436d0dbaca2,
    0x4439f510abdd1b04,
    0x5ee775ab94064231,
    0xf04dfd6e270d4842,
];

I'm not sure if this was intended to be supported. Other libraries (such as Arkworks) do not handle numbers outside of the range of the modulus. Any elements created via F::new() undergo field multiplication (by R2) where they're brought into the proper range.

This can be handled by checking whether the inputs are greater than or equal to the field modulus and subtracting it in advance, but performing the check has a non-zero cost (2-5%).

// Pre-reduce the left operand: if lhs >= modulus, subtract the modulus once.
let mut lhs = self.0;
if bigint_geq(&lhs, &$modulus.0) {
    let mut borrow = 0;
    (lhs[0], borrow) = sbb(lhs[0], $modulus.0[0], borrow);
    (lhs[1], borrow) = sbb(lhs[1], $modulus.0[1], borrow);
    (lhs[2], borrow) = sbb(lhs[2], $modulus.0[2], borrow);
    // The final borrow is not needed because lhs >= modulus here.
    (lhs[3], _) = sbb(lhs[3], $modulus.0[3], borrow);
}
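
The snippet above calls bigint_geq without showing it. A minimal sketch of such a comparison follows, assuming four limbs stored least-significant first; this is a hypothetical stand-in, not the function from the PR.

```rust
/// Hypothetical stand-in for the PR's bigint_geq: returns true if lhs >= rhs,
/// with limbs stored least-significant first.
fn bigint_geq(lhs: &[u64; 4], rhs: &[u64; 4]) -> bool {
    // Scan from the most significant limb; the first difference decides.
    for i in (0..4).rev() {
        if lhs[i] != rhs[i] {
            return lhs[i] > rhs[i];
        }
    }
    true // all limbs are equal
}
```

Scanning from the top limb means most comparisons finish after a single limb, which is why a check like this is cheap but not free.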

@sragss sragss changed the title Implement CIOS for non-asm mul Implement CIOS for ARM F::mul Feb 2, 2024
@sragss sragss marked this pull request as ready for review February 2, 2024 02:56
@duguorong009 duguorong009 requested review from CPerezz and davidnevadoc and removed request for CPerezz February 5, 2024 03:28
@duguorong009

Thanks for the PR! @sragss

I think we should add the input range check in the mul function as you suggested.
Even though it comes with some cost, I believe it is the only way to introduce the CIOS algo to the repo, at the moment.

(Normally, I use the F::from_raw function for input range check.)

What do you think? @CPerezz @davidnevadoc

davidnevadoc previously approved these changes Feb 5, 2024

@davidnevadoc davidnevadoc left a comment

Really clean and easy to follow, thanks for the improvement!
LGTM 👍

Review threads on src/arithmetic.rs and src/derive/field.rs (outdated, resolved)
@davidnevadoc davidnevadoc dismissed their stale review February 6, 2024 11:39

Forgot to check the bit discrepancy for operands outside the normal range.

@sragss

sragss commented Feb 7, 2024

Fixed comments @davidnevadoc – Let me know what you think about the operands outside the normal range.

We can either add the reduction beforehand or update the bn256::test::test_consistent_hash_to_curve test, which exercises hash_to_curve with values outside the modulus range. I suspect this test has bad data due to random generation by a script (~25% chance of generating a value out of range) rather than intentional usage.

@davidnevadoc

Regarding the out-of-range operands, the approach I prefer is to control the ways in which we create field elements and then assume they are in the appropriate range in all operations.
In this concrete case, the problem was that the non-asm version of from_u512 was multiplying an unchecked field element that could be any 256-bit value.

$field(lower_256) * $r2 + $field(upper_256) * $r3

The asm version, on the other hand, was using Self::montgomery_reduction to handle this conversion from bytes to a valid field element.
I think this should be the approach for the non-asm version as well. I propose a solution along these lines: 359619a

I have modified the montgomery_reduction function to have a non-asm version that performs the check @sragss suggested and then multiplies by the appropriate R.

Let me know what you think and feel free to add the change if you like it.
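
Why the non-asm expression quoted above yields a Montgomery-form value can be worked out in a few lines. This is a sketch under stated assumptions: MontMul(a, b) denotes $a b R^{-1} \bmod p$ with $R = 2^{256}$, lo and hi are the lower and upper 256 bits of the 512-bit input $x$, and $r2 and $r3 are read as $R^2 \bmod p$ and $R^3 \bmod p$ (the names suggest this, but the thread does not confirm it).

```latex
\begin{aligned}
x &= \mathrm{lo} + 2^{256}\,\mathrm{hi} = \mathrm{lo} + R\,\mathrm{hi},\\
\operatorname{MontMul}(\mathrm{lo}, R^2) &\equiv \mathrm{lo}\cdot R^2\cdot R^{-1} \equiv \mathrm{lo}\cdot R \pmod p,\\
\operatorname{MontMul}(\mathrm{hi}, R^3) &\equiv \mathrm{hi}\cdot R^3\cdot R^{-1} \equiv \mathrm{hi}\cdot R^2 \pmod p,\\
\operatorname{MontMul}(\mathrm{lo}, R^2) + \operatorname{MontMul}(\mathrm{hi}, R^3) &\equiv (\mathrm{lo} + R\,\mathrm{hi})\,R \equiv x\,R \pmod p.
\end{aligned}
```

The identity only uses congruences modulo $p$. It says nothing about whether a given multiplication routine tolerates lo or hi being at least $p$, which is the range question discussed in this thread.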

@sragss

sragss commented Feb 8, 2024

Agree with your approach – added those changes. All tests are passing now and we don't have the slowdown from adding the check to mul.

@davidnevadoc davidnevadoc added this pull request to the merge queue Feb 9, 2024
Merged via the queue into privacy-scaling-explorations:main with commit 9fff22c Feb 9, 2024
11 checks passed
(t[N - 1], _) = adc(c_2, c, 0);
}

if bigint_geq(&t, &$modulus.0) {

The bigint_geq procedure and its usage seem suboptimal. First, note that $t \ge m \iff \lnot(t < m)$. You can implement "less-than" as the subtraction $t - m$: if it underflows (borrow is 1), then $t < m$.
Second, you are computing $t - m$ right afterwards anyway. The common pattern is to compute the subtraction speculatively and use its result only if it did not underflow.

(tmp, borrow) = t - m
t = borrow ? t : tmp
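
The pseudocode above could be sketched in Rust as follows. The function name reduce_once, the sbb signature with a 0-or-1 borrow, and the mask-based select are assumptions for illustration, not code from this PR.

```rust
/// Computes a - b - borrow, returning (difference, borrow out as 0 or 1).
fn sbb(a: u64, b: u64, borrow: u64) -> (u64, u64) {
    let r = (a as u128).wrapping_sub(b as u128 + borrow as u128);
    (r as u64, (r >> 127) as u64)
}

/// Returns t - m if t >= m, otherwise t. Limbs are least-significant first.
fn reduce_once(t: &[u64; 4], m: &[u64; 4]) -> [u64; 4] {
    // Speculatively compute tmp = t - m and record the final borrow.
    let mut tmp = [0u64; 4];
    let mut borrow = 0u64;
    for i in 0..4 {
        (tmp[i], borrow) = sbb(t[i], m[i], borrow);
    }
    // borrow == 1 means t < m, so keep t; otherwise keep the difference.
    let mask = 0u64.wrapping_sub(borrow); // all ones if borrow == 1, else zero
    let mut out = [0u64; 4];
    for i in 0..4 {
        out[i] = (t[i] & mask) | (tmp[i] & !mask);
    }
    out
}
```

The mask select replaces the ternary with straight-line code. Whether this beats a separate comparison on a given ARM core is the trade-off raised in the reply, and would need benchmarking.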

Contributor Author

Nice catch!

I believe this strategy would only save instructions when bigint_geq == true. It would cost some instructions when bigint_geq == false, as sbb gets broken out into a few instructions on ARM, whereas the 4 u64 LTs should be a single instruction each.

@davidnevadoc davidnevadoc mentioned this pull request Apr 7, 2024
jonathanpwang pushed a commit to axiom-crypto/halo2curves that referenced this pull request Apr 19, 2024
* impl CIOS

* more details

* add Fast CIOS for bn256

* rolled Fast CIOS

* clean comment

* geq for last line in bigint_geq

* update comment to include WORD_SIZE

* mod in montomgery

* cargo fmt

* cargo clippy

---------

Co-authored-by: sragss <[email protected]>
jonathanpwang pushed a commit to axiom-crypto/halo2curves that referenced this pull request Apr 24, 2024
4 participants