feat: optimize segment proof aggregation #410

Closed

Conversation

atanmarko
Member

@atanmarko atanmarko commented Jul 19, 2024

Resolves #399
Resolves #403

@atanmarko atanmarko self-assigned this Jul 19, 2024
@github-actions github-actions bot added the crate: zero_bin Anything related to the zero-bin subcrates. label Jul 19, 2024
@atanmarko atanmarko force-pushed the feat/optimize-segment-proof-aggregation branch 3 times, most recently from 98c4750 to f01714a on July 19, 2024 14:23
@atanmarko atanmarko added the performance Performance improvement related changes label Jul 19, 2024
@atanmarko atanmarko added this to the Performance Tuning milestone Jul 19, 2024
@atanmarko atanmarko force-pushed the feat/optimize-segment-proof-aggregation branch from f497f6e to daf0e59 on July 19, 2024 14:41
@atanmarko
Member Author

This PR limits the number of segment proofs that are proven in parallel, to bound memory consumption at the cost of a small penalty in async engine utilization. The chunk size can be tweaked if needed.
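
For illustration, here is a minimal sketch of the chunking idea, assuming hypothetical names (prove_segment, prove_in_chunks, CHUNK_SIZE) and plain integers standing in for segment data and proofs; it is not the actual zero-bin code:

    use futures::{stream, StreamExt};

    // Hypothetical chunk size; in the PR this is a tunable parameter.
    const CHUNK_SIZE: usize = 4;

    // Stand-in for the real SegmentProof operation.
    async fn prove_segment(segment: u64) -> u64 {
        segment
    }

    // Prove segments in fixed-size chunks: at most CHUNK_SIZE proving futures
    // are in flight at once, which bounds peak memory.
    async fn prove_in_chunks(segments: Vec<u64>) -> Vec<u64> {
        let mut proofs = Vec::with_capacity(segments.len());
        for chunk in segments.chunks(CHUNK_SIZE) {
            // The next chunk only starts once this one is fully proven.
            let chunk_proofs: Vec<u64> = stream::iter(chunk.iter().copied())
                .map(prove_segment)
                .buffered(CHUNK_SIZE)
                .collect()
                .await;
            proofs.extend(chunk_proofs);
        }
        proofs
    }

    fn main() {
        let proofs = futures::executor::block_on(prove_in_chunks((0..10).collect()));
        println!("{proofs:?}");
    }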

@atanmarko atanmarko force-pushed the feat/optimize-segment-proof-aggregation branch from c34a6da to 1a44121 on July 19, 2024 15:32
@atanmarko atanmarko marked this pull request as ready for review July 19, 2024 16:00
Collaborator

@Nashtare Nashtare left a comment

I'm afraid this doesn't quite solve the problem.
The current "issue" causing the large memory overhead with continuations is that we want to process a large batch of txns at once, to reduce redundant work in between (re-calculations of intermediate trie roots).
Note that I admit the unchanged names here are confusing: when we iterate over txs, we don't pass individual txns as inputs to the SegmentDataIterator but a Vec of txns, of length possibly greater than 1 given the above.

The most CPU-cycle-efficient approach to reducing the mentioned redundant computations is to have a single payload representing all txns of a block (by payload here, I mean the GenerationInputs fed to the prover).

As such, the below loop can be simplified at a high level as:

            let tx_proof_futs: FuturesUnordered<_> = {
                let data_iterator = SegmentDataIterator {
                    partial_next_data: None,
                    inputs: single_payload, // <-- all txns are contained here!
                    max_cpu_len_log: Some(max_cpu_len_log),
                };

                Directive::map(IndexedStream::from(data_iterator), &seg_ops)
                    .fold(&agg_ops)
                    .run(runtime)
                    .map(move |e| {
                        e.map(|p| (idx, proof_gen::proof_types::TxnAggregatableProof::from(p)))
                    })
            };

The problematic iterator here is data_iterator, which runs unboundedly, generating possibly hundreds of segments that stay in memory until they are consumed (because generating them is orders of magnitude faster than proving them).
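
For context, a minimal sketch of the missing backpressure, using a toy bounded channel rather than the zero-bin/paladin API (BOUND and the integer "segments" are placeholders): the generator blocks once a fixed number of segments are waiting, instead of running arbitrarily far ahead of the much slower provers.

    use std::sync::mpsc::sync_channel;
    use std::thread;

    // Hypothetical bound on the number of generated-but-unproven segments.
    const BOUND: usize = 4;

    fn main() {
        let (tx, rx) = sync_channel::<u64>(BOUND);

        let generator = thread::spawn(move || {
            for segment in 0..100u64 {
                // `send` blocks while BOUND segments are already queued, so the
                // generator can never run unboundedly ahead of the consumer.
                tx.send(segment).unwrap();
            }
        });

        for segment in rx {
            // Stand-in for the (much slower) SegmentProof operation.
            let _proof = segment;
        }
        generator.join().unwrap();
    }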

@atanmarko
Member Author

atanmarko commented Jul 21, 2024

At the moment, generate_next_segment is executed on GENERATION_INPUTS_AGG_CHUNK_SIZE_INNER GenerationInputs. Four such jobs (so 16 GenerationInputs segments with SegmentProof proving) would run in parallel and then be aggregated into a chunk transaction proof.
Eventually, all the resulting chunk transaction proofs would be aggregated into the final transaction proof. The chunk sizes are there to limit memory consumption (we keep in memory the input GenerationSegmentData for GENERATION_INPUTS_AGG_CHUNK_SIZE GenerationInputs proving jobs). If that is not what we want, what is the alternative?

The original implementation kept all the GenerationSegmentData and executed all the SegmentProof jobs first, before switching to the proof aggregation (e.g. if a block has 200 GenerationInputs, then one data_iterator and SegmentProof for all segments in the block run in parallel), because paladin does not switch to the next operation in a chained-operation setup before it finishes the previous one.
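
For illustration, a minimal sketch of the two-level aggregation described above, with integers standing in for proofs and a made-up chunk-size value; prove_segment and aggregate are hypothetical stand-ins, not the real operations:

    // Hypothetical chunk size, chosen for illustration only.
    const GENERATION_INPUTS_AGG_CHUNK_SIZE_INNER: usize = 4;

    // Stand-in for proving one segment of a GenerationInputs batch.
    fn prove_segment(input: u64) -> u64 {
        input
    }

    // Stand-in for the proof aggregation operation.
    fn aggregate(a: u64, b: u64) -> u64 {
        a ^ b
    }

    fn main() {
        let generation_inputs: Vec<u64> = (0..16).collect();

        // Level 1: prove each chunk and aggregate it into a chunk proof, so only
        // one chunk's worth of segment data has to be held in memory at a time.
        let chunk_proofs: Vec<u64> = generation_inputs
            .chunks(GENERATION_INPUTS_AGG_CHUNK_SIZE_INNER)
            .map(|chunk| {
                chunk
                    .iter()
                    .map(|&input| prove_segment(input))
                    .reduce(aggregate)
                    .expect("chunk is non-empty")
            })
            .collect();

        // Level 2: aggregate the chunk proofs into the final transaction proof.
        let final_proof = chunk_proofs
            .into_iter()
            .reduce(aggregate)
            .expect("at least one chunk");
        println!("final proof: {final_proof}");
    }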

@Nashtare
Collaborator

if a block has 200 GenerationInputs, then one data_iterator and SegmentProof for all segments in the block run in parallel

Put differently: say we now have a single GenerationInputs representing 200 txns. The issue is the internal SegmentDataIterator. There is only one of them, but it is responsible for generating all segments, which cannot be processed as fast as they are being generated.

We still have the outer loop iterating over txns because we want the flexibility here. This is configured by the -b batch size argument when calling the proving mode. But theoretically, the best case is when the batch covers a whole block at once, effectively removing the outer loop.

@atanmarko
Member Author

Put differently: say we now have a single GenerationInputs representing 200 txns. The issue is the internal SegmentDataIterator. There is only one of them, but it is responsible for generating all segments, which cannot be processed as fast as they are being generated.

We still have the outer loop iterating over txns because we want the flexibility here. This is configured by the -b batch size argument when calling the proving mode. But theoretically, the best case is when the batch covers a whole block at once, effectively removing the outer loop.

Ok, I had a wrong understanding of the problem. I'll look into how to limit data_iterator in a configurable way.

@atanmarko atanmarko marked this pull request as draft July 22, 2024 08:40
@atanmarko atanmarko changed the title feat: optimize segment proof aggregation WIP: feat: optimize segment proof aggregation Jul 22, 2024
@github-actions github-actions bot added the crate: evm_arithmetization Anything related to the evm_arithmetization crate. label Jul 23, 2024
@atanmarko atanmarko changed the title WIP: feat: optimize segment proof aggregation feat: optimize segment proof aggregation Jul 23, 2024
@atanmarko atanmarko marked this pull request as ready for review July 23, 2024 14:53
atanmarko and others added 2 commits July 23, 2024 17:22
* feat: reorganize cli params

* fix: conflicts

* fix: review

* fix: review 2
@atanmarko atanmarko marked this pull request as draft July 23, 2024 15:56
@atanmarko
Member Author

Returned it to draft: CI tests seem pretty slow (maybe there is some other reason), and I need to test this PR more with the real prover tomorrow.

@atanmarko atanmarko force-pushed the feat/optimize-segment-proof-aggregation branch from 066d761 to 134cbe9 on July 24, 2024 10:58
@atanmarko atanmarko force-pushed the feat/optimize-segment-proof-aggregation branch from 542d5f3 to 8060288 on July 24, 2024 13:10
@github-actions github-actions bot added the crate: proof_gen Anything related to the proof_gen crate. label Jul 25, 2024
@atanmarko atanmarko force-pushed the feat/optimize-segment-proof-aggregation branch from eaaf614 to 5b92df1 on July 25, 2024 17:12
@atanmarko atanmarko marked this pull request as ready for review July 26, 2024 01:24
@atanmarko atanmarko force-pushed the feat/optimize-segment-proof-aggregation branch from db4367f to e45be0b on July 26, 2024 01:32
@atanmarko atanmarko requested a review from Nashtare July 26, 2024 09:18
@Nashtare Nashtare force-pushed the feat/optimize-segment-proof-aggregation branch from 56911c7 to 8e59e1c on August 10, 2024 16:24
@atanmarko
Member Author

The PR is tested. Currently we have fixed parameters for cpu len and a batch size of 10, where premature segment chunk aggregation does not bring a big improvement in memory usage but takes a significant CPU performance hit. We are giving up on this approach.
An alternative that would be better, but would probably take much more effort (if it is possible at all), would be to change Paladin so that it can run chained operations (map/fold) without waiting for the first operation to finish on all the workers.
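
For reference, a minimal sketch of the pipelining idea mentioned above, using plain threads and channels rather than Paladin's actual map/fold API: the fold stage consumes mapped items as soon as they arrive, instead of waiting for every map job to finish.

    use std::sync::mpsc::channel;
    use std::thread;

    fn main() {
        let (tx, rx) = channel::<u64>();

        // Map stage: stand-in for the SegmentProof workers.
        let map_stage = thread::spawn(move || {
            for segment in 0..8u64 {
                tx.send(segment * 2).unwrap();
            }
        });

        // Fold stage: runs concurrently with the map stage, folding each item
        // as soon as it arrives instead of waiting for the whole map to finish.
        let folded = rx.iter().fold(0u64, |acc, item| acc + item);

        map_stage.join().unwrap();
        println!("folded result: {folded}");
    }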

@atanmarko atanmarko closed this Aug 12, 2024