[Chunk Data Pack Pruner] Add Engine for pruning chunk data pack #6946

Open
wants to merge 27 commits into master from leo/cdp-engine
Conversation

@zhangchiqing (Member) commented Jan 28, 2025

This PR adds the engine for chunk data pack pruning. By default, chunk data pack pruning is disabled unless the flag to enable it is specified.

@zhangchiqing zhangchiqing changed the base branch from master to leo/cdp-prune-block-iterator-creator January 28, 2025 17:28
@codecov-commenter commented Jan 28, 2025

Codecov Report

Attention: Patch coverage is 9.11681% with 319 lines in your changes missing coverage. Please review.

Project coverage is 40.92%. Comparing base (e833530) to head (fa1dfcf).
Report is 36 commits behind head on master.

Files with missing lines Patch % Lines
module/mock/block_iterator.go 0.00% 55 Missing ⚠️
module/mock/iterator_creator.go 0.00% 47 Missing ⚠️
...odule/block_iterator/latest/sealed_and_executed.go 0.00% 43 Missing ⚠️
module/mock/iterator_state.go 0.00% 39 Missing ⚠️
utils/unittest/mocks/protocol_state.go 0.00% 28 Missing ⚠️
module/mock/iterator_state_reader.go 0.00% 27 Missing ⚠️
cmd/execution_builder.go 0.00% 22 Missing ⚠️
module/mock/iterator_state_writer.go 0.00% 18 Missing ⚠️
module/block_iterator/executor/executor.go 64.51% 10 Missing and 1 partial ⚠️
module/metrics/execution.go 0.00% 9 Missing ⚠️
... and 6 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6946      +/-   ##
==========================================
- Coverage   41.15%   40.92%   -0.23%     
==========================================
  Files        2131     2100      -31     
  Lines      186855   183365    -3490     
==========================================
- Hits        76899    75044    -1855     
+ Misses     103516   102051    -1465     
+ Partials     6440     6270     -170     
Flag Coverage Δ
unittests 40.92% <9.11%> (-0.23%) ⬇️


@zhangchiqing zhangchiqing force-pushed the leo/cdp-prune-block-iterator-creator branch from e5cd5fc to 4ed349d Compare January 30, 2025 20:06
@zhangchiqing zhangchiqing force-pushed the leo/cdp-engine branch 2 times, most recently from 868b728 to baaafe2 Compare January 30, 2025 20:19
@zhangchiqing zhangchiqing force-pushed the leo/cdp-prune-block-iterator-creator branch from bf16f91 to a383308 Compare January 30, 2025 20:30
@zhangchiqing zhangchiqing force-pushed the leo/cdp-engine branch 2 times, most recently from 90d3ab1 to 0267794 Compare January 31, 2025 18:03
Base automatically changed from leo/cdp-prune-block-iterator-creator to master January 31, 2025 18:32
@zhangchiqing zhangchiqing force-pushed the leo/cdp-engine branch 2 times, most recently from 1dc0a52 to eb92dbd Compare January 31, 2025 21:12
@zhangchiqing zhangchiqing marked this pull request as ready for review January 31, 2025 21:13
@zhangchiqing zhangchiqing requested a review from a team as a code owner January 31, 2025 21:13
@zhangchiqing zhangchiqing force-pushed the leo/cdp-engine branch 2 times, most recently from 2ab97b2 to c1e1577 Compare February 7, 2025 19:40
@peterargue (Contributor) left a comment
Thanks for creating this feature!

Overall I think the approach is sound; however, I find it complex with the multiple layers of wrappers and interfaces. Reading it through, I've had a hard time convincing myself it's correct because there are a lot of moving pieces to keep in context. I worry that this will balloon as we add more data types to be pruned.

Is there a way to simplify this? to me it seems like there are 3 parts:

  1. The pruning logic that determines which blocks to prune, and when
  2. The height/view tracking logic that provides the next blockID to prune
  3. Individual data type logic that removes all of the data and indexes associated with a blockID.

Could this be broken into 3-4 modules, plus some data type specific logic that lives with the storage module? I think having the logic in fewer places, with fewer levels of abstraction would help with maintainability and review.
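To make the suggested split concrete, here is a minimal, runnable sketch of the three responsibilities as separate pieces. All names (`prunePolicy`, `progress`, `removerFunc`, `pruneAll`) are invented for illustration and do not match anything in this PR:

```go
package main

import "fmt"

// prunePolicy is part 1: it decides the highest height eligible for pruning.
// (Hypothetical name; the PR's real modules are structured differently.)
type prunePolicy struct{ latestSealed, keep uint64 }

func (p prunePolicy) latestPrunable() uint64 {
	if p.latestSealed < p.keep {
		return 0 // nothing prunable yet
	}
	return p.latestSealed - p.keep
}

// progress is part 2: it tracks the next height to prune across restarts.
type progress struct{ next uint64 }

// removerFunc is part 3: data-type-specific removal for one block.
type removerFunc func(height uint64)

// pruneAll wires the three parts together in one place.
func pruneAll(policy prunePolicy, prog *progress, remove removerFunc) {
	for prog.next <= policy.latestPrunable() {
		remove(prog.next)
		prog.next++ // a real implementation would persist this checkpoint
	}
}

func main() {
	var removed []uint64
	prog := &progress{next: 1}
	pruneAll(prunePolicy{latestSealed: 1005, keep: 1000}, prog, func(h uint64) {
		removed = append(removed, h)
	})
	fmt.Println(removed, prog.next) // heights 1..5 are pruned, next is 6
}
```

The point of the sketch is that each part can be reviewed in isolation, which is the maintainability argument made above.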

Comment on lines 25 to 27
log zerolog.Logger,
metrics module.ExecutionMetrics,
ctx context.Context,
Contributor
let's keep ctx as the first parameter, per convention

Suggested change
- log zerolog.Logger,
- metrics module.ExecutionMetrics,
- ctx context.Context,
+ ctx context.Context,
+ log zerolog.Logger,
+ metrics module.ExecutionMetrics,

var _ executor.IterationExecutor = (*ChunkDataPackPruner)(nil)

func NewChunKDataPackPruner(chunkDataPacks storage.ChunkDataPacks, results storage.ExecutionResults) *ChunkDataPackPruner {
Contributor

Suggested change
- func NewChunKDataPackPruner(chunkDataPacks storage.ChunkDataPacks, results storage.ExecutionResults) *ChunkDataPackPruner {
+ func NewChunkDataPackPruner(chunkDataPacks storage.ChunkDataPacks, results storage.ExecutionResults) *ChunkDataPackPruner {

@zhangchiqing (Member, Author)

I designed the implementation in a more abstracted manner, allowing key logic to be reused when developing other pruners. Some important aspects include:

  • Handling interruptions: Ensures the process can resume after restarts.
  • Ensuring full iteration: Every block from the root to the latest is guaranteed to be iterated at least once.
  • Managing a moving latest block: Since latest is dynamic, the iteration logic accounts for this.

Added Flexibility:

  • The executor (module/block_iterator/executor) and block iterator are decoupled, preventing the block iterator from being tightly coupled with storage.
  • Because the block iterator is storage-agnostic, it can be used for tasks beyond pruning, such as migrating data between databases (e.g., from Badger to Pebble).
  • The existing executor is not concurrency-safe, as the pruner doesn’t require it. However, a concurrent executor can be implemented separately while still leveraging the block iterator. This is possible because the block iterator does not persist progress automatically—instead, it leaves this responsibility to the caller (executor).

This structure can support building the following modules when reusing the block_iterator and creators:

  • Verification node’s approval pruner
  • Execution node’s execution data pruner
  • Collection node’s protocol data pruner

Regarding your question, @peterargue, I’ve added some comments—let me know if they provide the clarity you need.
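The point above about the iterator not persisting progress itself can be sketched as follows. This is a simplified, hypothetical model (names like `blockIterator`, `checkpointStore`, and `run` are invented; the real iterator deals with block IDs and storage, not bare heights): the iterator only hands out the next block, and the caller (the executor) commits the checkpoint, so a crash resumes from the last saved height.

```go
package main

import "fmt"

// blockIterator yields heights from next to latest; it never saves progress.
type blockIterator struct{ next, latest uint64 }

func (it *blockIterator) Next() (uint64, bool) {
	if it.next > it.latest {
		return 0, false
	}
	h := it.next
	it.next++
	return h, true
}

// checkpointStore stands in for the persistent progress storage.
type checkpointStore struct{ saved uint64 }

// run is the caller's loop: it executes each block, then commits progress.
func run(it *blockIterator, store *checkpointStore, execute func(uint64)) {
	for {
		h, ok := it.Next()
		if !ok {
			return
		}
		execute(h)
		store.saved = h + 1 // the executor, not the iterator, commits progress
	}
}

func main() {
	store := &checkpointStore{saved: 3} // pretend we restarted mid-iteration
	it := &blockIterator{next: store.saved, latest: 5}
	run(it, store, func(h uint64) { fmt.Print(h, " ") })
	fmt.Println("checkpoint:", store.saved) // 3 4 5 checkpoint: 6
}
```

Because progress lives with the caller, a concurrency-safe executor could batch or reorder executions and still decide for itself when a checkpoint is safe to commit, which is the flexibility claimed above.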

}

func (l *LatestPrunable) Latest() (*flow.Header, error) {
return l.LatestSealedAndExecuted.BelowLatest(l.threshold)
@zhangchiqing (Member, Author)

to me it seems like there are 3 parts:

  1. The pruning logic that determine which blocks to prune

@peterargue
This is achieved by specifying which block is considered the latest for the block iterator. The LatestSealedAndExecuted module returns the most recently sealed and executed block. However, if we want to retain the last 1000 sealed and executed blocks without pruning them, the BelowLatest method becomes useful. By setting the threshold to 1000 and using the result as the latest block for the block iterator, the iterator will only process blocks up to latest - 1000. Combined with the pruning logic, this ensures that the blocks eligible for pruning are those up to latest - 1000. This is how the pruning logic determines which blocks to prune.

It's important to note that LatestSealedAndExecuted is part of the block_iterator package, as its use extends beyond pruning. For example, a checker engine could use the Latest method to verify whether its own results align with the sealed results.

And if we want to retain 1000 blocks for chunk data packs but 5000 blocks for protocol state, we can define a different threshold for each BelowLatest call.
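A minimal sketch of the BelowLatest idea described above, assuming a simplified header type with only a height (the real method returns *flow.Header and the names here are illustrative only):

```go
package main

import "fmt"

// header is a stand-in for flow.Header with only the field we need.
type header struct{ Height uint64 }

type latestSealedAndExecuted struct{ latest header }

// belowLatest returns the header `threshold` blocks below the latest sealed
// and executed block, so an iterator fed this value never goes past
// latest - threshold.
func (l latestSealedAndExecuted) belowLatest(threshold uint64) header {
	if l.latest.Height < threshold {
		return header{Height: 0} // fewer than threshold blocks exist; keep all
	}
	return header{Height: l.latest.Height - threshold}
}

func main() {
	l := latestSealedAndExecuted{latest: header{Height: 10500}}
	// keep the last 1000 blocks: only blocks up to height 9500 are prunable
	fmt.Println(l.belowLatest(1000).Height) // 9500
	// a different dataset could use its own threshold, e.g. 5000
	fmt.Println(l.belowLatest(5000).Height) // 5500
}
```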

chunksDB := pebbleimpl.ToDB(chunkDataPacksDB)
// the creator can be reused to create new block iterator that can iterate from the last
// checkpoint to the new latest (sealed) block.
creator, getNextAndLatest, err := makeBlockIteratorCreator(state, badgerDB, headers, chunksDB, config)
@zhangchiqing (Member, Author)

This is the main function of the pruning logic. To prevent it from becoming very long, I broke the logic into two functions: makeBlockIteratorCreator and makeIterateAndPruneAll.


const NextHeightForUnprunedExecutionDataPackKey = "NextHeightForUnprunedExecutionDataPackKey"

func LoopPruneExecutionDataFromRootToLatestSealed(
@zhangchiqing (Member, Author)

to me it seems like there are 3 parts:

  1. Individual data type logic that removes all of the data and indexes associated with a blockID.

@peterargue

This method prunes execution data, which currently only covers chunk data packs. We also need to prune other data, such as execution results and the execution data for bitswap. I haven't decided whether to put them all here, since they live in different databases. I'm thinking of using one engine for pruning each dataset, so that we can have a separate pace and config for each dataset.

)

type ChunkDataPackPruner struct {
*pruners.ChunkDataPackPruner
@zhangchiqing (Member, Author)

The pruners.ChunkDataPackPruner implements the actual pruning functions, and this ChunkDataPackPruner wraps it to make it an executor, so that the pruner can be used as an executor for the block iterator to execute on each block.

I don't want the block iterator to call pruners.ChunkDataPackPruner.PruneByBlockID directly, because the iterator is not made only for pruning. That's why I abstracted it into an "executor", so that the pruner is one type of executor. As an executor, it calls the underlying pruner's PruneByBlockID to do the actual pruning.
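The adapter pattern described above can be sketched like this. The names and signatures are simplified stand-ins (block IDs are strings here rather than flow.Identifier, and the interface is invented for illustration): the iterator loop only knows the executor interface, and the pruner is just one implementation behind it.

```go
package main

import "fmt"

// iterationExecutor is what the iterator loop depends on -- nothing
// pruning-specific.
type iterationExecutor interface {
	ExecuteByBlockID(blockID string) error
}

// chunkDataPackPruner holds the actual pruning logic.
type chunkDataPackPruner struct{ pruned []string }

func (p *chunkDataPackPruner) PruneByBlockID(blockID string) error {
	p.pruned = append(p.pruned, blockID)
	return nil
}

// prunerExecutor adapts the pruner to the executor interface, so the
// iterator never calls PruneByBlockID directly.
type prunerExecutor struct{ pruner *chunkDataPackPruner }

func (e *prunerExecutor) ExecuteByBlockID(blockID string) error {
	return e.pruner.PruneByBlockID(blockID)
}

// iterate is a stand-in for the block iterator driving the executor.
func iterate(blocks []string, exec iterationExecutor) error {
	for _, b := range blocks {
		if err := exec.ExecuteByBlockID(b); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	p := &chunkDataPackPruner{}
	_ = iterate([]string{"a", "b"}, &prunerExecutor{pruner: p})
	fmt.Println(p.pruned) // [a b]
}
```

A different executor (say, a data migrator) could implement the same interface and reuse the iterator unchanged, which is the decoupling argued for above.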

func (c *Creator) IteratorState() module.IteratorStateReader {
return c.progress
}

// NewHeightBasedCreator creates a block iterator that iterates through blocks
// from root to the latest (either finalized or sealed) by height.
func NewHeightBasedCreator(
@zhangchiqing (Member, Author)

Is there a way to simplify this? to me it seems like there are 3 parts:
...
2. The height/view tracking logic that provides the next blockID to prune

@peterargue yes. That's why we have the NewHeightBasedCreator and NewViewBasedCreator functions that build on top of NewCreator: they both take a root block and latest block as *flow.Header, and internally each picks a different field to use as the "index" for the creator.
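A simplified sketch of that idea, with invented names and a stripped-down header (the real functions take *flow.Header and more parameters): the two constructors share one generic creator and differ only in which header field they pick as the index.

```go
package main

import "fmt"

// flowHeader is a stand-in for flow.Header with just the two index fields.
type flowHeader struct {
	Height uint64
	View   uint64
}

// creator iterates indexes from root to latest; it doesn't know whether
// the index is a height or a view.
type creator struct{ root, latest uint64 }

func newCreator(root, latest uint64) creator { return creator{root, latest} }

// newHeightBasedCreator indexes blocks by height.
func newHeightBasedCreator(root, latest flowHeader) creator {
	return newCreator(root.Height, latest.Height)
}

// newViewBasedCreator indexes blocks by view.
func newViewBasedCreator(root, latest flowHeader) creator {
	return newCreator(root.View, latest.View)
}

func main() {
	root := flowHeader{Height: 0, View: 0}
	latest := flowHeader{Height: 100, View: 130} // views can run ahead of heights
	fmt.Println(newHeightBasedCreator(root, latest).latest) // 100
	fmt.Println(newViewBasedCreator(root, latest).latest)   // 130
}
```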
