-
Notifications
You must be signed in to change notification settings - Fork 144
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MSM optimisations: CycloneMSM #130
Conversation
Can you use something like this to get around the |
@jonathanpwang It would work for curves in |
ah, it is always pasta curves that's the problem... You could have a default implementation of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Found 2 low-hanging fruits that speed up another ~10% on my machine (i9-13900K 24 cores).
Also it might be worth mentioning that if there is multiple same points in bases
, there is a high probability to panic in batch_add
, perhaps we could fallback to add into the jacobian bucket when batch_add
fails, or we could leave this to user to make sure the bases
shouldn't contain same point multiple times (but still there could have chance to panic).
} | ||
|
||
fn contains(&self, buck_idx: usize) -> bool { | ||
self.set.iter().any(|sch| sch.buck_idx == buck_idx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems buck_idx == 0
would never be added because the default buck_idx
of SchedulePoint
is also 0
.
I tried to set it as max to avoid this like:
impl Default for SchedulePoint {
fn default() -> Self {
Self {
base_idx: 0,
buck_idx: usize::MAX,
sign: false,
}
}
}
but it actually doesn't affect the performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it never hits buck_idx == 0
since in the main loop we check as if buck_idx != 0 { ...
. So do you still think I should make that change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm but it seems we have set it to buck_idx - 1
after we've checked it's non-zero here https://github.com/privacy-scaling-explorations/halo2curves/pull/130/files#diff-ebe254da862cf489fe020d422527386871313f19c242195dfd55c4f1ac06b6e5R389
I think it's fine to keep it as is, since I didn't observe performance difference
@han0110 do you think we should keep old implementation and prefix new one as independent_msm or something like it? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you think we should keep old implementation and prefix new one as independent_msm or something like it?
Sounds like a good idea, just in case in certain environments the older one is better.
} | ||
|
||
fn contains(&self, buck_idx: usize) -> bool { | ||
self.set.iter().any(|sch| sch.buck_idx == buck_idx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm but it seems we have set it to buck_idx - 1
after we've checked it's non-zero here https://github.com/privacy-scaling-explorations/halo2curves/pull/130/files#diff-ebe254da862cf489fe020d422527386871313f19c242195dfd55c4f1ac06b6e5R389
I think it's fine to keep it as is, since I didn't observe performance difference
What do you mean by environment, x86 vs ARM? |
*t = acc * (buckets[*buck_idx].y() - bases[*base_idx].y); | ||
} else { | ||
*t = acc * (buckets[*buck_idx].y() + bases[*base_idx].y); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this handle the case Q = -P or P = infinity?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No it does not it expects independent points. I aggree that we should change the name
acc *= *z; | ||
} | ||
|
||
acc = acc.invert().unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it doesn't, we might want to change the name to batch_add_nonexceptional
otherwise we will likely divide by 0 and propagate 0 everywhere.
src/msm.rs
Outdated
acc += bases[coeff_idx]; | ||
pub fn best_multiexp<C: CurveAffine>(coeffs: &[C::Scalar], bases: &[C]) -> C::Curve { | ||
// TODO: consider adjusting it with emprical data? | ||
let batch_size = 64; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is likely to be too small for large batches
Assuming c
with the number of buckets being 2^(c-1), I use the formula: 4c² - 16*c - 128
:
- https://github.com/mratsim/constantine/blob/58d8d2c/constantine/math/elliptic/ec_multi_scalar_mul_scheduler.nim#L260-L263
- For 2^10 = 1024 points, c = 9 (in Constantine, formula for c is gives result lower than logarithm to account for memory bandwidth limits): this is a batch size of 52
- For 2^26 = 67M points, c = 16, batch size of 640
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
4c² - 16*c - 128
makes this implementation much slower. I think it is because our parallelazing methods are different?
with k = 20
batch_size = 4c² - 16*c - 128
takes 618.459msbatch_size = 64
takes 887.797ms
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the numbers swapped? because 618.459ms is 69.7% of 887.797ms hence the compute is faster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it copy pasted wrongly
let bases_local: Vec<_> = bases.par_iter().map(Affine::from).collect(); | ||
|
||
// number of windows | ||
let number_of_windows = C::Scalar::NUM_BITS as usize / c + 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Parallelism for a 256 bits scalar will be limited to 256 / 16 = 16 cores here. We need another layer of parallelism which can easily be added by having schedulers be responsible for a range.
This can be added as a refinement, see https://github.com/mratsim/constantine/blob/58d8d2c/constantine/math/elliptic/ec_multi_scalar_mul_scheduler.nim#L243-L250
Scheduler*[NumNZBuckets, QueueLen: static int, EC, ECaff] = object
points: ptr UncheckedArray[ECaff]
buckets*: ptr Buckets[NumNZBuckets, EC, ECaff]
start, stopEx: int32 # Bucket range
numScheduled, numCollisions: int32
collisionsMap: BigInt[NumNZBuckets] # We use a BigInt as a bitmap, when all you have is an axe ...
queue: array[QueueLen, ScheduledPoint]
collisions: array[QueueLen, ScheduledPoint]
The start, stopEx: int32 # bucket range
fields
Then you can create as many schedulers as there are cores, and because the number of buckets is 2^(c-1) so often in the thousands to millions https://github.com/mratsim/constantine/blob/58d8d2c/constantine/math/elliptic/ec_multi_scalar_mul_scheduler.nim#L115-L133 even with conservative c
and can benefit from system with hundreds of cores.
Then we only need to change the if buck_idx != 0 {
to if scheduler.start < buck_idx < scheduler.stop {
This will make each threads read the full data but it's linear reads so speed is OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mratsim I tried to implement this strategy but results are not good and close to serial implementation. Probably I'm missing something. I didn't want to pollute this PR so you can give some feedbaack at there kilic/cyclone-msm-expreriment#1
// jacobian buckets for already scheduled points | ||
let mut j_bucks = vec![Bucket::<C>::None; 1 << (c - 1)]; | ||
|
||
// schedular for affine addition |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my benchmarks, affine is only worth it starting from c = 9 or ~1024 points.
for (base_idx, coeff) in coeffs.iter().enumerate() { | ||
let buck_idx = get_booth_index(w, c, coeff.as_ref()); | ||
|
||
if buck_idx != 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change here to if scheduler.start < buck_idx < scheduler.stop {
, and have as many schedulers as there are threads, and partition the start/stop.
I mean in certain environment we might not want the possibility for the msm to panic. |
Co-authored-by: Han <[email protected]>
Co-authored-by: Han <[email protected]>
postfix new one as `_independent_points`
Ready to merge? |
8af4f1e
Recently I've been thinking about optimizations for commitments to small values, and I just found a case where the old implementation is faster. My machine is AMD Ryzen 5 3600 6-Core Processor I ran the same benchmark but the scalars were only 1 bit values (this would be the case for committing selector columns for example).
This is the patch I applied: --- a/src/msm.rs
+++ b/src/msm.rs
@@ -481,7 +481,7 @@ mod test {
use ff::{Field, PrimeField};
use group::{Curve, Group};
use pasta_curves::arithmetic::CurveAffine;
- use rand_core::OsRng;
+ use rand_core::{OsRng, RngCore};
#[test]
fn test_booth_encoding() {
@@ -537,8 +537,10 @@ mod test {
C::Curve::batch_normalize(&points[..], &mut affine_points[..]);
let points = affine_points;
+ let bits = 1;
+ let max_val = 2u64.pow(bits);
let scalars = (0..1 << max_k)
- .map(|_| C::Scalar::random(OsRng))
+ .map(|_| C::Scalar::from(OsRng.next_u64() % max_val))
.collect::<Vec<_>>();
for k in min_k..=max_k { For reference here is the benchmark on my machine with the original test:
|
* impl msm with batch addition * bring back multiexp serial * parallelize coeffs to repr Co-authored-by: Han <[email protected]> * parallelize bases to affine Co-authored-by: Han <[email protected]> * add missing dependency * bring back old implementation postfix new one as `_independent_points` --------- Co-authored-by: Han <[email protected]>
* impl msm with batch addition * bring back multiexp serial * parallelize coeffs to repr Co-authored-by: Han <[email protected]> * parallelize bases to affine Co-authored-by: Han <[email protected]> * add missing dependency * bring back old implementation postfix new one as `_independent_points` --------- Co-authored-by: Han <[email protected]>
- Leverage cyclone msm privacy-scaling-explorations/halo2curves#130 - Leverage improved FFT implementations - Much improved parallelism for mv-lookup and permutation commitment calcs - ASM in h2curves Results: 30-80% reduction in proving time for benchmark circuits
Batch affine addition with scheduler approach as in CycloneMSM is implemented. This PR also should cover some part of the optimisation suggestions in here
Batch addition looks like below for example for 16 additions.
And in affine coordinates we use a shared inversion for all add operation so that batch affine addition becomes cheaper than jacobian addition. In scheduler technique we update multiple buckets at once. Natuurally same buckets in the
BATCH_SIZE
range shouldn't be selected twice in order not to miss the update. So if we hit the same bucket index we delay this particular update.Other notes about the implementation:
number_of_threads
chunks. This approach also appears in gnark-crypto and ashWhiteHat-PRbatch_size
window if a bucket is not already selected by slice of scalar (booth index) we add the point into the affine scheduler. Otherwise point goes to corresponding jacobian bucket.Coordinates
API to keep current msm API and make it also work for pasta_curves. it means:Affine
struct and for each base point a uselessis_on_curve
check runsis_on_curve
check runs in aggregation phase for each affine bucketIf one wants to play with the implementation in a place that is more experimental I'd recommend to see https://github.com/kilic/cyclone-msm-expreriment
With benchmarks on M1 it seems like we achieve 30-40% gain even with coordinates API. Also we might want to file an RFC for zkcrypto to cheaper access to coordinates of
CurveAffine
withoutis_on_curve
check in order to get ~20% more gain. Leaving the related issue and PR at zkcrypto side.