MSM optimisations: CycloneMSM #130

kilic · 2024-01-23T09:54:04Z

Batch affine addition with the scheduler approach from CycloneMSM is implemented. This PR should also cover some of the optimisation suggestions in here

For example, a batch of 16 additions looks like this:

b0' = b0 + p0
b1' = b1 + p1
...
b15' = b15 + p15

In affine coordinates we use a single shared inversion for all the additions in a batch, so batch affine addition becomes cheaper than Jacobian addition. With the scheduler technique we update multiple buckets at once. Naturally, the same bucket must not be selected twice within a BATCH_SIZE range, otherwise one of the updates would be lost. So if we hit the same bucket index again, we delay that particular update.
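The shared inversion mentioned above is usually done with Montgomery's batch inversion trick. The following is a minimal sketch of that trick only, not the code in this PR: it works over a toy prime field with a hypothetical modulus `P`, and the helper names `mul`, `pow`, `invert` and `batch_invert` are made up for illustration. Inputs are assumed to be already reduced modulo `P`.

```rust
// Toy prime modulus, a stand-in for the curve's base field (assumption).
const P: u64 = 1_000_000_007;

fn mul(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) % P as u128) as u64
}

fn pow(mut base: u64, mut exp: u64) -> u64 {
    let mut acc = 1;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = mul(acc, base);
        }
        base = mul(base, base);
        exp >>= 1;
    }
    acc
}

/// Single inversion via Fermat's little theorem; `None` for zero.
fn invert(a: u64) -> Option<u64> {
    if a % P == 0 {
        None
    } else {
        Some(pow(a, P - 2))
    }
}

/// Inverts every element in place using exactly one field inversion.
/// Returns `false` and leaves `values` untouched if any element is zero.
fn batch_invert(values: &mut [u64]) -> bool {
    // prefix[i] = values[0] * ... * values[i - 1]
    let mut prefix = Vec::with_capacity(values.len());
    let mut acc = 1u64;
    for v in values.iter() {
        prefix.push(acc);
        acc = mul(acc, *v);
    }
    // One inversion of the full product.
    let mut inv = match invert(acc) {
        Some(inv) => inv,
        None => return false,
    };
    // Walk backwards, peeling one factor off the inverse at each step.
    for (v, p) in values.iter_mut().zip(prefix.iter()).rev() {
        let v_inv = mul(inv, *p);
        inv = mul(inv, *v);
        *v = v_inv;
    }
    true
}
```

With this shape, n inversions cost one inversion plus about 3n multiplications, which is why a batch of affine additions can beat the same number of Jacobian additions.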

Other notes about the implementation:

- Even though dev: add name for lookup halo2#29 and pse/halo2/#40 already use batch addition, this PR should be much easier to review and reason about.
- Parallelization is applied over a range of windows rather than by splitting the MSM into number_of_threads chunks. This approach also appears in gnark-crypto and ashWhiteHat-PR.
- Window size is selected the same way as in the previous implementation.
- Batch addition size is hardcoded to 64. This comes from experiments: around 60 and 80 I got the best results, and they were close to each other.
- A greedy scheduler approach is implemented. We keep track of two bucket sets, one affine and one Jacobian. Within the batch_size window, if a bucket has not already been selected by a scalar slice (Booth index), we add the point to the affine scheduler. Otherwise the point goes to the corresponding Jacobian bucket.
- Sadly, we have to use the Coordinates API to keep the current msm API and make it work for pasta_curves as well. This means:
  - We have to copy base points into a newly introduced Affine struct, and a useless is_on_curve check runs for each base point.
  - Similarly, an is_on_curve check runs in the aggregation phase for each affine bucket.
  - About 20% efficiency loss against a curve-specific implementation.
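The greedy scheduling rule in the notes above can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: the `Scheduler` and `Slot` types are hypothetical, bucket and base points are reduced to plain indices, and the actual batch addition is replaced by a counter.

```rust
/// Where a point ends up after being offered to the scheduler.
#[derive(Debug, PartialEq)]
enum Slot {
    Affine,
    Jacobian,
}

struct Scheduler {
    batch_size: usize,
    /// (base_idx, buck_idx) pairs waiting for the next batch affine addition.
    pending: Vec<(usize, usize)>,
    flushed_batches: usize,
}

impl Scheduler {
    fn new(batch_size: usize) -> Self {
        Scheduler {
            batch_size,
            pending: Vec::with_capacity(batch_size),
            flushed_batches: 0,
        }
    }

    fn contains(&self, buck_idx: usize) -> bool {
        self.pending.iter().any(|&(_, b)| b == buck_idx)
    }

    /// Greedy rule: a bucket already in the current batch sends the point to
    /// the Jacobian side, otherwise the point joins the affine batch.
    fn offer(&mut self, base_idx: usize, buck_idx: usize) -> Slot {
        if self.contains(buck_idx) {
            return Slot::Jacobian;
        }
        self.pending.push((base_idx, buck_idx));
        if self.pending.len() == self.batch_size {
            self.flush();
        }
        Slot::Affine
    }

    fn flush(&mut self) {
        if !self.pending.is_empty() {
            // A real implementation would run the batch affine addition here.
            self.pending.clear();
            self.flushed_batches += 1;
        }
    }
}
```

The linear `contains` scan is cheap for small batches such as 64; a bitmap over bucket indices is a common alternative for larger ones.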

If one wants to play with the implementation in a more experimental setting, I'd recommend looking at https://github.com/kilic/cyclone-msm-expreriment

Benchmarks on an M1 show a 30-40% gain even with the Coordinates API. We might also want to file an RFC for zkcrypto asking for cheaper access to the coordinates of a CurveAffine without the is_on_curve check, in order to get ~20% more gain. Leaving the related issue and PR at zkcrypto side.

current k=16 ................................................................91.589ms
cyclone k=16 ..............................................................  55.399ms

current k=17 ................................................................156.792ms
cyclone k=17 ..............................................................  111.048ms

current k=18 ................................................................292.118ms
cyclone k=18 ..............................................................  199.327ms

current k=19 ................................................................571.160ms
cyclone k=19 ..............................................................  357.218ms

current k=20 ................................................................1.013s
cyclone k=20 ..............................................................  684.233ms

current k=21 ................................................................1.860s
cyclone k=21 ..............................................................  1.299s

current k=22 ................................................................3.623s
cyclone k=22 ..............................................................  2.651s

jonathanpwang · 2024-01-23T17:27:03Z

Can you use something like this to get around the is_on_curve? https://github.com/axiom-crypto/halo2curves/blob/aa4d981060a16eab93ed7b977adc1b4fc3e203fb/src/arithmetic.rs#L76

kilic · 2024-01-23T20:04:49Z

> Can you use something like this to get around the is_on_curve? https://github.com/axiom-crypto/halo2curves/blob/aa4d981060a16eab93ed7b977adc1b4fc3e203fb/src/arithmetic.rs#L76

@jonathanpwang It would work for curves in halo2curves but not for pasta_curves, and we might also need to expose CurveAffineExt as CurveAffine.

jonathanpwang · 2024-01-23T22:25:30Z

ah, it is always pasta curves that's the problem...

You could have a default implementation of CurveAffineExt that falls back to the coordinates() function?

han0110

LGTM! Found 2 low-hanging fruits that speed up another ~10% on my machine (i9-13900K 24 cores).

It might also be worth mentioning that if the same point appears multiple times in bases, there is a high probability of a panic in batch_add. Perhaps we could fall back to adding into the Jacobian bucket when batch_add fails, or we could leave it to the user to make sure the bases don't contain the same point multiple times (though there would still be a chance of a panic).

han0110 · 2024-01-24T09:00:23Z

src/msm.rs

+    }
+
+    fn contains(&self, buck_idx: usize) -> bool {
+        self.set.iter().any(|sch| sch.buck_idx == buck_idx)


It seems buck_idx == 0 would never be added because the default buck_idx of SchedulePoint is also 0.

I tried to set it as max to avoid this like:

impl Default for SchedulePoint {
    fn default() -> Self {
        Self {
            base_idx: 0,
            buck_idx: usize::MAX,
            sign: false,
        }
    }
}

but it actually doesn't affect the performance.

I think it never hits buck_idx == 0, since in the main loop we check if buck_idx != 0 { ... }. So do you still think I should make that change?

hmm but it seems we have set it to buck_idx - 1 after we've checked it's non-zero here https://github.com/privacy-scaling-explorations/halo2curves/pull/130/files#diff-ebe254da862cf489fe020d422527386871313f19c242195dfd55c4f1ac06b6e5R389

I think it's fine to keep it as is, since I didn't observe performance difference
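To make the concern in this thread concrete, here is a small sketch of how a zero default can shadow a real bucket index in a fixed-size schedule set. It assumes the set is a fixed-size array pre-filled with a default entry; the `Set` type, its size of 4 and the `with_empty_marker` constructor are hypothetical and only mirror the field names quoted above.

```rust
#[derive(Clone, Copy)]
struct SchedulePoint {
    base_idx: usize,
    buck_idx: usize,
    sign: bool,
}

/// Fixed-size schedule set whose unused slots hold a marker bucket index.
struct Set {
    slots: [SchedulePoint; 4],
}

impl Set {
    fn with_empty_marker(marker: usize) -> Self {
        let empty = SchedulePoint {
            base_idx: 0,
            buck_idx: marker,
            sign: false,
        };
        Set { slots: [empty; 4] }
    }

    /// Same scan as the `contains` in the hunk above.
    fn contains(&self, buck_idx: usize) -> bool {
        self.slots.iter().any(|s| s.buck_idx == buck_idx)
    }
}
```

With a marker of 0 an empty set already reports that it contains bucket 0, so points for that bucket would always be treated as collisions. With `usize::MAX` as the marker the empty set reports nothing, at the cost of reserving one index value.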

kilic · 2024-01-24T12:10:05Z

> Also it might be worth mentioning that if there is multiple same points in bases, there is a high probability to panic in batch_add

@han0110 do you think we should keep old implementation and prefix new one as independent_msm or something like it?

han0110

> do you think we should keep old implementation and prefix new one as independent_msm or something like it?

Sounds like a good idea, just in case in certain environments the older one is better.

han0110 · 2024-01-24T12:39:03Z

src/msm.rs

+    }
+
+    fn contains(&self, buck_idx: usize) -> bool {
+        self.set.iter().any(|sch| sch.buck_idx == buck_idx)


hmm but it seems we have set it to buck_idx - 1 after we've checked it's non-zero here https://github.com/privacy-scaling-explorations/halo2curves/pull/130/files#diff-ebe254da862cf489fe020d422527386871313f19c242195dfd55c4f1ac06b6e5R389

I think it's fine to keep it as is, since I didn't observe performance difference

mratsim · 2024-01-30T13:59:57Z

> > do you think we should keep old implementation and prefix new one as independent_msm or something like it?
>
> Sounds like a good idea, just in case in certain environments the older one is better.

What do you mean by environment, x86 vs ARM?

mratsim · 2024-01-30T14:02:40Z

src/msm.rs

+            *t = acc * (buckets[*buck_idx].y() - bases[*base_idx].y);
+        } else {
+            *t = acc * (buckets[*buck_idx].y() + bases[*base_idx].y);
+        }


Does this handle the case Q = -P or P = infinity?

No, it does not. It expects independent points. I agree that we should change the name.

mratsim · 2024-01-30T14:04:14Z

src/msm.rs

+        acc *= *z;
+    }
+
+    acc = acc.invert().unwrap();


If it doesn't, we might want to change the name to batch_add_nonexceptional otherwise we will likely divide by 0 and propagate 0 everywhere.
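The failure mode discussed here comes from the slope denominator `x_bucket - x_base` of affine addition, which is zero exactly when the two points share an x-coordinate (Q = P or Q = -P). The sketch below illustrates this on a toy prime field; the modulus `P` and the helpers `denominators`, `product` and `exceptional` are hypothetical and only look at x-coordinates. The point at infinity has no affine coordinates and would need separate handling that is not shown.

```rust
// Toy prime modulus, a stand-in for the curve's base field (assumption).
const P: u64 = 1_000_000_007;

fn sub(a: u64, b: u64) -> u64 {
    (a + P - b) % P
}

fn mul(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) % P as u128) as u64
}

/// Slope denominators `x_bucket - x_base` for each scheduled addition.
fn denominators(pairs: &[(u64, u64)]) -> Vec<u64> {
    pairs
        .iter()
        .map(|&(x_bucket, x_base)| sub(x_bucket, x_base))
        .collect()
}

/// The running product that a shared inversion would have to invert.
fn product(values: &[u64]) -> u64 {
    values.iter().fold(1, |acc, v| mul(acc, *v))
}

/// Indices of additions whose denominator is zero (Q = P or Q = -P).
fn exceptional(pairs: &[(u64, u64)]) -> Vec<usize> {
    denominators(pairs)
        .iter()
        .enumerate()
        .filter(|(_, d)| **d == 0)
        .map(|(i, _)| i)
        .collect()
}
```

A single exceptional pair zeroes the whole product, so the shared inversion fails for every addition in the batch, not only for the offending one.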

mratsim · 2024-01-30T14:17:14Z

src/msm.rs

-                    acc += bases[coeff_idx];
+pub fn best_multiexp<C: CurveAffine>(coeffs: &[C::Scalar], bases: &[C]) -> C::Curve {
+    // TODO: consider adjusting it with emprical data?
+    let batch_size = 64;


This is likely to be too small for large batches

Assuming c with the number of buckets being 2^(c-1), I use the formula: 4c² - 16*c - 128:

https://github.com/mratsim/constantine/blob/58d8d2c/constantine/math/elliptic/ec_multi_scalar_mul_scheduler.nim#L260-L263

For 2^10 = 1024 points, c = 9 (in Constantine, the formula for c gives a result lower than the logarithm to account for memory bandwidth limits), which gives a batch size of 52.

For 2^26 = 67M points, c = 16, batch size of 640
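The two data points above can be checked with a direct transcription of the formula. The function name `batch_size` is hypothetical, and the sketch returns the raw value without clamping; note that the formula is zero at c = 8 and negative below that, so a real implementation would need a lower bound that is not specified in this thread.

```rust
/// Batch size as a function of the window size `c`: 4c² - 16c - 128.
/// Signed arithmetic is used because the value is negative for small `c`.
fn batch_size(c: i64) -> i64 {
    4 * c * c - 16 * c - 128
}
```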

4c² - 16*c - 128 makes this implementation much slower. I think it is because our parallelization methods are different?

with k = 20

batch_size = 4c² - 16*c - 128 takes 618.459ms

batch_size = 64 takes 887.797ms

Are the numbers swapped? Because 618.459ms is 69.7% of 887.797ms, hence the compute is faster.

Yes, they were copy-pasted wrongly.

mratsim · 2024-01-30T14:28:12Z

src/msm.rs

+    let bases_local: Vec<_> = bases.par_iter().map(Affine::from).collect();
+
+    // number of windows
+    let number_of_windows = C::Scalar::NUM_BITS as usize / c + 1;


Parallelism for a 256 bits scalar will be limited to 256 / 16 = 16 cores here. We need another layer of parallelism which can easily be added by having schedulers be responsible for a range.

This can be added as a refinement, see https://github.com/mratsim/constantine/blob/58d8d2c/constantine/math/elliptic/ec_multi_scalar_mul_scheduler.nim#L243-L250

Scheduler*[NumNZBuckets, QueueLen: static int, EC, ECaff] = object
  points: ptr UncheckedArray[ECaff]
  buckets*: ptr Buckets[NumNZBuckets, EC, ECaff]
  start, stopEx: int32 # Bucket range
  numScheduled, numCollisions: int32
  collisionsMap: BigInt[NumNZBuckets] # We use a BigInt as a bitmap, when all you have is an axe ...
  queue: array[QueueLen, ScheduledPoint]
  collisions: array[QueueLen, ScheduledPoint]

Note the start, stopEx: int32 # Bucket range fields.

Then you can create as many schedulers as there are cores. Because the number of buckets is 2^(c-1), it is often in the thousands to millions (https://github.com/mratsim/constantine/blob/58d8d2c/constantine/math/elliptic/ec_multi_scalar_mul_scheduler.nim#L115-L133) even with a conservative c, so this can benefit from systems with hundreds of cores.

Then we only need to change the if buck_idx != 0 { to if scheduler.start < buck_idx < scheduler.stop {

This will make each thread read the full data, but the reads are linear so speed is OK.
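A minimal sketch of the bucket-range partitioning suggested above, assuming half-open ranges and an even split; the helper names `partition` and `owner` are hypothetical. Rust has no chained comparison such as `start < buck_idx < stop`, so the membership test is written with `Range::contains`. The sketch also ignores the special meaning of bucket index 0 in the main loop.

```rust
use std::ops::Range;

/// Splits `num_buckets` into `num_schedulers` contiguous half-open ranges.
/// `num_schedulers` must be non-zero; trailing ranges may be empty.
fn partition(num_buckets: usize, num_schedulers: usize) -> Vec<Range<usize>> {
    let chunk = (num_buckets + num_schedulers - 1) / num_schedulers;
    (0..num_schedulers)
        .map(|i| {
            let start = (i * chunk).min(num_buckets);
            let end = ((i + 1) * chunk).min(num_buckets);
            start..end
        })
        .collect()
}

/// Index of the scheduler responsible for `buck_idx`, if any.
fn owner(ranges: &[Range<usize>], buck_idx: usize) -> Option<usize> {
    ranges.iter().position(|r| r.contains(&buck_idx))
}
```

Each thread would then scan all coefficients and keep only those whose bucket index falls inside its own range, which matches the "every thread reads the full data" trade-off described above.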

@mratsim I tried to implement this strategy, but the results are not good and are close to the serial implementation. Probably I'm missing something. I didn't want to pollute this PR, so you can give some feedback over there: kilic/cyclone-msm-expreriment#1

mratsim · 2024-01-30T14:29:45Z

src/msm.rs

+        // jacobian buckets for already scheduled points
+        let mut j_bucks = vec![Bucket::<C>::None; 1 << (c - 1)];
+
+        // schedular for affine addition


In my benchmarks, affine is only worth it starting from c = 9 or ~1024 points.
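The `1 << (c - 1)` bucket count in the hunk above follows from signed (Booth) window digits, which lie in the range [-2^(c-1), 2^(c-1)] so that only the magnitude selects a bucket. The following is a simplified sketch of such an encoding on a `u64` scalar, not the PR's `get_booth_index`: the names `booth_digit` and `recompose` are hypothetical, and `recompose` exists only to check that the digits add back up to the scalar.

```rust
/// Signed Booth digit of window `w` for window size `c` (1 <= c <= 32).
/// Each window reads `c + 1` bits: its own `c` bits plus the bit just below,
/// with an implicit zero below bit 0.
fn booth_digit(scalar: u64, w: u32, c: u32) -> i64 {
    // Shifting left by one places the "bit below" at the bottom of the slice.
    let wide = (scalar as u128) << 1;
    let slice = ((wide >> (w * c)) & ((1u128 << (c + 1)) - 1)) as i64;
    // Ceiling division by two gives the magnitude for non-negative digits.
    let half = (slice + 1) >> 1;
    if slice & (1i64 << c) == 0 {
        half
    } else {
        // Top bit set: the digit is negative.
        half - (1i64 << c)
    }
}

/// Recombines the digits: sum of digit_w * 2^(w * c) over all windows.
fn recompose(scalar: u64, c: u32) -> i128 {
    let windows = 64 / c + 1;
    (0..windows)
        .map(|w| (booth_digit(scalar, w, c) as i128) << (w * c))
        .sum()
}
```

A digit of 0 means no addition for that window, and a negative digit means the negated base is added to bucket `|digit|`.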

mratsim · 2024-01-30T14:32:21Z

src/msm.rs

+        for (base_idx, coeff) in coeffs.iter().enumerate() {
+            let buck_idx = get_booth_index(w, c, coeff.as_ref());
+
+            if buck_idx != 0 {


Change here to if scheduler.start < buck_idx < scheduler.stop {, and have as many schedulers as there are threads, and partition the start/stop.

han0110 · 2024-01-31T03:24:14Z

> > > do you think we should keep old implementation and prefix new one as independent_msm or something like it?
> >
> > Sounds like a good idea, just in case in certain environments the older one is better.
>
> What do you mean by environment, x86 vs ARM?

I mean that in certain environments we might not want the possibility of the msm panicking.

davidnevadoc · 2024-04-12T10:48:30Z

Ready to merge?
cc @kilic

ed255 · 2024-04-16T13:22:34Z

> > do you think we should keep old implementation and prefix new one as independent_msm or something like it?
>
> Sounds like a good idea, just in case in certain environments the older one is better.

Recently I've been thinking about optimizations for commitments to small values, and I just found a case where the old implementation is faster.

My machine is AMD Ryzen 5 3600 6-Core Processor

I ran the same benchmark but the scalars were only 1 bit values (this would be the case for committing selector columns for example).

End:     cyclone k=14 ..............................................................11.269ms
End:     older k=14 ................................................................10.967ms
End:     cyclone k=15 ..............................................................19.787ms
End:     older k=15 ................................................................8.992ms
End:     cyclone k=16 ..............................................................34.523ms
End:     older k=16 ................................................................16.949ms
End:     cyclone k=17 ..............................................................54.059ms
End:     older k=17 ................................................................32.522ms
End:     cyclone k=18 ..............................................................85.195ms
End:     older k=18 ................................................................39.849ms
End:     cyclone k=19 ..............................................................142.293ms
End:     older k=19 ................................................................68.671ms
End:     cyclone k=20 ..............................................................245.669ms
End:     older k=20 ................................................................102.556ms
End:     cyclone k=21 ..............................................................481.905ms
End:     older k=21 ................................................................191.431ms
End:     cyclone k=22 ..............................................................966.360ms
End:     older k=22 ................................................................309.965ms

This is the patch I applied:

--- a/src/msm.rs
+++ b/src/msm.rs
@@ -481,7 +481,7 @@ mod test {
     use ff::{Field, PrimeField};
     use group::{Curve, Group};
     use pasta_curves::arithmetic::CurveAffine;
-    use rand_core::OsRng;
+    use rand_core::{OsRng, RngCore};
 
     #[test]
     fn test_booth_encoding() {
@@ -537,8 +537,10 @@ mod test {
         C::Curve::batch_normalize(&points[..], &mut affine_points[..]);
         let points = affine_points;
 
+        let bits = 1;
+        let max_val = 2u64.pow(bits);
         let scalars = (0..1 << max_k)
-            .map(|_| C::Scalar::random(OsRng))
+            .map(|_| C::Scalar::from(OsRng.next_u64() % max_val))
             .collect::<Vec<_>>();
 
         for k in min_k..=max_k {

For reference here is the benchmark on my machine with the original test:

End:     cyclone k=14 ..............................................................29.378ms
End:     older k=14 ................................................................57.678ms
End:     cyclone k=15 ..............................................................34.141ms
End:     older k=15 ................................................................64.278ms
End:     cyclone k=16 ..............................................................63.279ms
End:     older k=16 ................................................................113.628ms
End:     cyclone k=17 ..............................................................101.339ms
End:     older k=17 ................................................................202.703ms
End:     cyclone k=18 ..............................................................199.750ms
End:     older k=18 ................................................................407.586ms
End:     cyclone k=19 ..............................................................379.565ms
End:     older k=19 ................................................................745.613ms
End:     cyclone k=20 ..............................................................738.133ms
End:     older k=20 ................................................................1.346s
End:     cyclone k=21 ..............................................................1.276s
End:     older k=21 ................................................................2.465s
End:     cyclone k=22 ..............................................................2.665s
End:     older k=22 ................................................................4.919s

* impl msm with batch addition
* bring back multiexp serial
* parallelize coeffs to repr
  Co-authored-by: Han <[email protected]>
* parallelize bases to affine
  Co-authored-by: Han <[email protected]>
* add missing dependency
* bring back old implementation
  postfix new one as `_independent_points`

---------

Co-authored-by: Han <[email protected]>

- Leverage cyclone msm privacy-scaling-explorations/halo2curves#130
- Leverage improved FFT implementations
- Much improved parallelism for mv-lookup and permutation commitment calcs
- ASM in h2curves

Results: 30-80% reduction in proving time for benchmark circuits

han0110 self-requested a review January 23, 2024 09:57

han0110 reviewed Jan 24, 2024

han0110 approved these changes Jan 24, 2024

mratsim reviewed Jan 30, 2024

mratsim approved these changes Feb 9, 2024

kilic and others added 6 commits February 19, 2024 11:53

impl msm with batch addition

6fd13c1

bring back multiexp serial

8a160e5

parallelize coeffs to repr

75a3cda

Co-authored-by: Han <[email protected]>

parallelize bases to affine

4011ed2

Co-authored-by: Han <[email protected]>

add missing dependency

c6c291a

bring back old implementation

a359481

postfix new one as `_independent_points`

kilic force-pushed the batch-add-msm branch from e4cb0fd to a359481 Compare February 19, 2024 08:54

mratsim mentioned this pull request Mar 4, 2024

feat: multiexp_serial skips doubling when all bits are zero privacy-scaling-explorations/halo2#202

Closed

duguorong009 mentioned this pull request Mar 9, 2024

Analyze and select optimizations to port from C++ port of Halo2 by kroma-network/tachyon privacy-scaling-explorations/halo2#293

Closed

kilic added this pull request to the merge queue Apr 12, 2024

Merged via the queue into privacy-scaling-explorations:main with commit 8af4f1e Apr 12, 2024
11 checks passed

mratsim mentioned this pull request Apr 29, 2024

Cyclone MSM panic #153

Closed

davidnevadoc mentioned this pull request Jul 11, 2024

Add edge case handling for batch_add #169

Merged

jonathanpwang mentioned this pull request Aug 13, 2024

feat: use halo2curves cycloneMSM axiom-crypto/halo2#36

Merged

alexander-camuto mentioned this pull request Aug 23, 2024

chore: performance improvements zkonduit/halo2#14

Merged
