Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MSM optimisations: CycloneMSM #130

Merged
merged 6 commits into from
Apr 12, 2024

Conversation

kilic
Copy link
Collaborator

@kilic kilic commented Jan 23, 2024

Batch affine addition with scheduler approach as in CycloneMSM is implemented. This PR also should cover some part of the optimisation suggestions in here

Batch addition looks like below for example for 16 additions.

b0' = b0 + p0
b1' = b1 + p1
...
b15' = b15 + p15

And in affine coordinates we use a shared inversion for all add operation so that batch affine addition becomes cheaper than jacobian addition. In scheduler technique we update multiple buckets at once. Natuurally same buckets in the BATCH_SIZE range shouldn't be selected twice in order not to miss the update. So if we hit the same bucket index we delay this particular update.

Other notes about the implementation:

  • Even though dev: add name for lookup halo2#29 and pse/halo2/#40 already uses batch addition this PR should be much easier to review and reason about.
  • Parallelization is applied on a range of windows rather than splitting MSM into number_of_threads chunks. This approach also appears in gnark-crypto and ashWhiteHat-PR
  • Window size is selected as it was selected in previous implementation
  • Batch addition size is hardcoded to 64. This comes from experiments. Around 60 and 80 I got the best and close results.
  • Greedy scheduler approach is implemented. It means we keep track of two bucket sets one is affine and other is jacobian. In the batch_size window if a bucket is not already selected by slice of scalar (booth index) we add the point into the affine scheduler. Otherwise point goes to corresponding jacobian bucket.
  • Sadly we have to use Coordinates API to keep current msm API and make it also work for pasta_curves. it means:
    • We have to copy base points into newly introduced Affine struct and for each base point a useless is_on_curve check runs
    • Similarly is_on_curve check runs in aggregation phase for each affine bucket
    • 20% efficiency loss against curve specific implementation.

If one wants to play with the implementation in a place that is more experimental I'd recommend to see https://github.com/kilic/cyclone-msm-expreriment

With benchmarks on M1 it seems like we achieve 30-40% gain even with coordinates API. Also we might want to file an RFC for zkcrypto to cheaper access to coordinates of CurveAffine without is_on_curve check in order to get ~20% more gain. Leaving the related issue and PR at zkcrypto side.

current k=16 ................................................................91.589ms
cyclone k=16 ..............................................................  55.399ms

current k=17 ................................................................156.792ms
cyclone k=17 ..............................................................  111.048ms

current k=18 ................................................................292.118ms
cyclone k=18 ..............................................................  199.327ms

current k=19 ................................................................571.160ms
cyclone k=19 ..............................................................  357.218ms

current k=20 ................................................................1.013s
cyclone k=20 ..............................................................  684.233ms

current k=21 ................................................................1.860s
cyclone k=21 ..............................................................  1.299s

current k=22 ................................................................3.623s
cyclone k=22 ..............................................................  2.651

@han0110 han0110 self-requested a review January 23, 2024 09:57
@jonathanpwang
Copy link
Contributor

Can you use something like this to get around the is_on_curve? https://github.com/axiom-crypto/halo2curves/blob/aa4d981060a16eab93ed7b977adc1b4fc3e203fb/src/arithmetic.rs#L76

@kilic
Copy link
Collaborator Author

kilic commented Jan 23, 2024

Can you use something like this to get around the is_on_curve? https://github.com/axiom-crypto/halo2curves/blob/aa4d981060a16eab93ed7b977adc1b4fc3e203fb/src/arithmetic.rs#L76

@jonathanpwang It would work for curves in halo2curves but not for pasta_curves and also we might need expose CurveAffineExt as CurveAffine

@jonathanpwang
Copy link
Contributor

jonathanpwang commented Jan 23, 2024

ah, it is always pasta curves that's the problem...

You could have a default implementation of CurveAffineExt that falls back to the coordinates() function?

Copy link
Contributor

@han0110 han0110 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Found 2 low-hanging fruits that speed up another ~10% on my machine (i9-13900K 24 cores).

Also it might be worth mentioning that if there is multiple same points in bases, there is a high probability to panic in batch_add, perhaps we could fallback to add into the jacobian bucket when batch_add fails, or we could leave this to user to make sure the bases shouldn't contain same point multiple times (but still there could have chance to panic).

src/msm.rs Outdated Show resolved Hide resolved
src/msm.rs Outdated Show resolved Hide resolved
}

fn contains(&self, buck_idx: usize) -> bool {
self.set.iter().any(|sch| sch.buck_idx == buck_idx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems buck_idx == 0 would never be added because the default buck_idx of SchedulePoint is also 0.

I tried to set it as max to avoid this like:

impl Default for SchedulePoint {
    fn default() -> Self {
        Self {
            base_idx: 0,
            buck_idx: usize::MAX,
            sign: false,
        }
    }
}

but it actually doesn't affect the performance.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it never hits buck_idx == 0 since in the main loop we check as if buck_idx != 0 { ... . So do you still think I should make that change?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm but it seems we have set it to buck_idx - 1 after we've checked it's non-zero here https://github.com/privacy-scaling-explorations/halo2curves/pull/130/files#diff-ebe254da862cf489fe020d422527386871313f19c242195dfd55c4f1ac06b6e5R389

I think it's fine to keep it as is, since I didn't observe performance difference

@kilic
Copy link
Collaborator Author

kilic commented Jan 24, 2024

Also it might be worth mentioning that if there is multiple same points in bases, there is a high probability to panic in batch_add

@han0110 do you think we should keep old implementation and prefix new one as independent_msm or something like it?

Copy link
Contributor

@han0110 han0110 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you think we should keep old implementation and prefix new one as independent_msm or something like it?

Sounds like a good idea, just in case in certain environments the older one is better.

}

fn contains(&self, buck_idx: usize) -> bool {
self.set.iter().any(|sch| sch.buck_idx == buck_idx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm but it seems we have set it to buck_idx - 1 after we've checked it's non-zero here https://github.com/privacy-scaling-explorations/halo2curves/pull/130/files#diff-ebe254da862cf489fe020d422527386871313f19c242195dfd55c4f1ac06b6e5R389

I think it's fine to keep it as is, since I didn't observe performance difference

@mratsim
Copy link
Contributor

mratsim commented Jan 30, 2024

do you think we should keep old implementation and prefix new one as independent_msm or something like it?

Sounds like a good idea, just in case in certain environments the older one is better.

What do you mean by environment, x86 vs ARM?

*t = acc * (buckets[*buck_idx].y() - bases[*base_idx].y);
} else {
*t = acc * (buckets[*buck_idx].y() + bases[*base_idx].y);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this handle the case Q = -P or P = infinity?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No it does not it expects independent points. I aggree that we should change the name

acc *= *z;
}

acc = acc.invert().unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it doesn't, we might want to change the name to batch_add_nonexceptional otherwise we will likely divide by 0 and propagate 0 everywhere.

src/msm.rs Outdated
acc += bases[coeff_idx];
pub fn best_multiexp<C: CurveAffine>(coeffs: &[C::Scalar], bases: &[C]) -> C::Curve {
// TODO: consider adjusting it with emprical data?
let batch_size = 64;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is likely to be too small for large batches

Assuming c with the number of buckets being 2^(c-1), I use the formula: 4c² - 16*c - 128:

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4c² - 16*c - 128 makes this implementation much slower. I think it is because our parallelazing methods are different?

with k = 20

  • batch_size = 4c² - 16*c - 128 takes 618.459ms
  • batch_size = 64 takes 887.797ms

Copy link
Contributor

@mratsim mratsim Feb 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the numbers swapped? because 618.459ms is 69.7% of 887.797ms hence the compute is faster.

Copy link
Collaborator Author

@kilic kilic Feb 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it copy pasted wrongly

let bases_local: Vec<_> = bases.par_iter().map(Affine::from).collect();

// number of windows
let number_of_windows = C::Scalar::NUM_BITS as usize / c + 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parallelism for a 256 bits scalar will be limited to 256 / 16 = 16 cores here. We need another layer of parallelism which can easily be added by having schedulers be responsible for a range.

This can be added as a refinement, see https://github.com/mratsim/constantine/blob/58d8d2c/constantine/math/elliptic/ec_multi_scalar_mul_scheduler.nim#L243-L250

  Scheduler*[NumNZBuckets, QueueLen: static int, EC, ECaff] = object
    points:                        ptr UncheckedArray[ECaff]
    buckets*:                      ptr Buckets[NumNZBuckets, EC, ECaff]
    start, stopEx:                 int32                # Bucket range
    numScheduled, numCollisions:   int32
    collisionsMap:                 BigInt[NumNZBuckets] # We use a BigInt as a bitmap, when all you have is an axe ...
    queue:                         array[QueueLen, ScheduledPoint]
    collisions:                    array[QueueLen, ScheduledPoint]

The start, stopEx: int32 # bucket range fields

Then you can create as many schedulers as there are cores, and because the number of buckets is 2^(c-1) so often in the thousands to millions https://github.com/mratsim/constantine/blob/58d8d2c/constantine/math/elliptic/ec_multi_scalar_mul_scheduler.nim#L115-L133 even with conservative c and can benefit from system with hundreds of cores.

Then we only need to change the if buck_idx != 0 { to if scheduler.start < buck_idx < scheduler.stop {

This will make each threads read the full data but it's linear reads so speed is OK.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mratsim I tried to implement this strategy but results are not good and close to serial implementation. Probably I'm missing something. I didn't want to pollute this PR so you can give some feedbaack at there kilic/cyclone-msm-expreriment#1

// jacobian buckets for already scheduled points
let mut j_bucks = vec![Bucket::<C>::None; 1 << (c - 1)];

// schedular for affine addition
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my benchmarks, affine is only worth it starting from c = 9 or ~1024 points.

for (base_idx, coeff) in coeffs.iter().enumerate() {
let buck_idx = get_booth_index(w, c, coeff.as_ref());

if buck_idx != 0 {
Copy link
Contributor

@mratsim mratsim Jan 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change here to if scheduler.start < buck_idx < scheduler.stop {, and have as many schedulers as there are threads, and partition the start/stop.

@han0110
Copy link
Contributor

han0110 commented Jan 31, 2024

do you think we should keep old implementation and prefix new one as independent_msm or something like it?

Sounds like a good idea, just in case in certain environments the older one is better.

What do you mean by environment, x86 vs ARM?

I mean in certain environment we might not want the possibility for the msm to panic.

@davidnevadoc
Copy link
Contributor

Ready to merge?
cc @kilic

@kilic kilic added this pull request to the merge queue Apr 12, 2024
Merged via the queue into privacy-scaling-explorations:main with commit 8af4f1e Apr 12, 2024
11 checks passed
@ed255
Copy link
Member

ed255 commented Apr 16, 2024

do you think we should keep old implementation and prefix new one as independent_msm or something like it?

Sounds like a good idea, just in case in certain environments the older one is better.

Recently I've been thinking about optimizations for commitments to small values, and I just found a case where the old implementation is faster.

My machine is AMD Ryzen 5 3600 6-Core Processor

I ran the same benchmark but the scalars were only 1 bit values (this would be the case for committing selector columns for example).

End:     cyclone k=14 ..............................................................11.269ms
End:     older k=14 ................................................................10.967ms
End:     cyclone k=15 ..............................................................19.787ms
End:     older k=15 ................................................................8.992ms
End:     cyclone k=16 ..............................................................34.523ms
End:     older k=16 ................................................................16.949ms
End:     cyclone k=17 ..............................................................54.059ms
End:     older k=17 ................................................................32.522ms
End:     cyclone k=18 ..............................................................85.195ms
End:     older k=18 ................................................................39.849ms
End:     cyclone k=19 ..............................................................142.293ms
End:     older k=19 ................................................................68.671ms
End:     cyclone k=20 ..............................................................245.669ms
End:     older k=20 ................................................................102.556ms
End:     cyclone k=21 ..............................................................481.905ms
End:     older k=21 ................................................................191.431ms
End:     cyclone k=22 ..............................................................966.360ms
End:     older k=22 ................................................................309.965ms

This is the patch I applied:

--- a/src/msm.rs
+++ b/src/msm.rs
@@ -481,7 +481,7 @@ mod test {
     use ff::{Field, PrimeField};
     use group::{Curve, Group};
     use pasta_curves::arithmetic::CurveAffine;
-    use rand_core::OsRng;
+    use rand_core::{OsRng, RngCore};
 
     #[test]
     fn test_booth_encoding() {
@@ -537,8 +537,10 @@ mod test {
         C::Curve::batch_normalize(&points[..], &mut affine_points[..]);
         let points = affine_points;
 
+        let bits = 1;
+        let max_val = 2u64.pow(bits);
         let scalars = (0..1 << max_k)
-            .map(|_| C::Scalar::random(OsRng))
+            .map(|_| C::Scalar::from(OsRng.next_u64() % max_val))
             .collect::<Vec<_>>();
 
         for k in min_k..=max_k {

For reference here is the benchmark on my machine with the original test:

End:     cyclone k=14 ..............................................................29.378ms
End:     older k=14 ................................................................57.678ms
End:     cyclone k=15 ..............................................................34.141ms
End:     older k=15 ................................................................64.278ms
End:     cyclone k=16 ..............................................................63.279ms
End:     older k=16 ................................................................113.628ms
End:     cyclone k=17 ..............................................................101.339ms
End:     older k=17 ................................................................202.703ms
End:     cyclone k=18 ..............................................................199.750ms
End:     older k=18 ................................................................407.586ms
End:     cyclone k=19 ..............................................................379.565ms
End:     older k=19 ................................................................745.613ms
End:     cyclone k=20 ..............................................................738.133ms
End:     older k=20 ................................................................1.346s
End:     cyclone k=21 ..............................................................1.276s
End:     older k=21 ................................................................2.465s
End:     cyclone k=22 ..............................................................2.665s
End:     older k=22 ................................................................4.919s

jonathanpwang pushed a commit to axiom-crypto/halo2curves that referenced this pull request Apr 19, 2024
* impl msm with batch addition

* bring back multiexp serial

* parallelize coeffs to repr

Co-authored-by: Han <[email protected]>

* parallelize bases to affine

Co-authored-by: Han <[email protected]>

* add missing dependency

* bring back old implementation

postfix new one as `_independent_points`

---------

Co-authored-by: Han <[email protected]>
jonathanpwang pushed a commit to axiom-crypto/halo2curves that referenced this pull request Apr 24, 2024
* impl msm with batch addition

* bring back multiexp serial

* parallelize coeffs to repr

Co-authored-by: Han <[email protected]>

* parallelize bases to affine

Co-authored-by: Han <[email protected]>

* add missing dependency

* bring back old implementation

postfix new one as `_independent_points`

---------

Co-authored-by: Han <[email protected]>
@mratsim mratsim mentioned this pull request Apr 29, 2024
alexander-camuto added a commit to zkonduit/halo2 that referenced this pull request Aug 26, 2024
- Leverage cyclone msm
privacy-scaling-explorations/halo2curves#130
- Leverage improved FFT implementations 
- Much improved parallelism for mv-lookup and permutation commitment
calcs
- ASM in h2curves

Results: 30-80% reduction in proving time for benchmark circuits
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants