[Chunk Data Pack Pruner] Add Engine for pruning chunk data pack #6946

Open
wants to merge 27 commits into master from leo/cdp-engine
Conversation

@zhangchiqing (Member) commented Jan 28, 2025

This PR adds the engine for chunk data pack pruning. By default, chunk data pack pruning is disabled unless the flag to enable it is specified.

@zhangchiqing zhangchiqing changed the base branch from master to leo/cdp-prune-block-iterator-creator January 28, 2025 17:28
@codecov-commenter commented Jan 28, 2025

Codecov Report

Attention: Patch coverage is 9.11681% with 319 lines in your changes missing coverage. Please review.

Project coverage is 40.92%. Comparing base (e833530) to head (fa1dfcf).
Report is 36 commits behind head on master.

Files with missing lines Patch % Lines
module/mock/block_iterator.go 0.00% 55 Missing ⚠️
module/mock/iterator_creator.go 0.00% 47 Missing ⚠️
...odule/block_iterator/latest/sealed_and_executed.go 0.00% 43 Missing ⚠️
module/mock/iterator_state.go 0.00% 39 Missing ⚠️
utils/unittest/mocks/protocol_state.go 0.00% 28 Missing ⚠️
module/mock/iterator_state_reader.go 0.00% 27 Missing ⚠️
cmd/execution_builder.go 0.00% 22 Missing ⚠️
module/mock/iterator_state_writer.go 0.00% 18 Missing ⚠️
module/block_iterator/executor/executor.go 64.51% 10 Missing and 1 partial ⚠️
module/metrics/execution.go 0.00% 9 Missing ⚠️
... and 6 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6946      +/-   ##
==========================================
- Coverage   41.15%   40.92%   -0.23%     
==========================================
  Files        2131     2100      -31     
  Lines      186855   183365    -3490     
==========================================
- Hits        76899    75044    -1855     
+ Misses     103516   102051    -1465     
+ Partials     6440     6270     -170     
Flag Coverage Δ
unittests 40.92% <9.11%> (-0.23%) ⬇️


@zhangchiqing zhangchiqing force-pushed the leo/cdp-prune-block-iterator-creator branch from e5cd5fc to 4ed349d Compare January 30, 2025 20:06
@zhangchiqing zhangchiqing force-pushed the leo/cdp-engine branch 2 times, most recently from 868b728 to baaafe2 Compare January 30, 2025 20:19
@zhangchiqing zhangchiqing force-pushed the leo/cdp-prune-block-iterator-creator branch from bf16f91 to a383308 Compare January 30, 2025 20:30
@zhangchiqing zhangchiqing force-pushed the leo/cdp-engine branch 2 times, most recently from 90d3ab1 to 0267794 Compare January 31, 2025 18:03
Base automatically changed from leo/cdp-prune-block-iterator-creator to master January 31, 2025 18:32
@zhangchiqing zhangchiqing force-pushed the leo/cdp-engine branch 2 times, most recently from 1dc0a52 to eb92dbd Compare January 31, 2025 21:12
@zhangchiqing zhangchiqing marked this pull request as ready for review January 31, 2025 21:13
@zhangchiqing zhangchiqing requested a review from a team as a code owner January 31, 2025 21:13
@zhangchiqing zhangchiqing force-pushed the leo/cdp-engine branch 2 times, most recently from 2ab97b2 to c1e1577 Compare February 7, 2025 19:40
@peterargue (Contributor) left a comment
Thanks for creating this feature!

Overall I think the approach is sound; however, I find it complex with the multiple layers of wrappers and interfaces. Reading it through, I've had a hard time convincing myself it's correct because there are a lot of moving pieces to keep in context. I worry that this will balloon as we add more data types to be pruned.

Is there a way to simplify this? to me it seems like there are 3 parts:

  1. The pruning logic that determines which blocks to prune, and when
  2. The height/view tracking logic that provides the next blockID to prune
  3. Individual data type logic that removes all of the data and indexes associated with a blockID.

Could this be broken into 3-4 modules, plus some data type specific logic that lives with the storage module? I think having the logic in fewer places, with fewer levels of abstraction would help with maintainability and review.
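To make the suggested split concrete, here is a minimal, runnable sketch of the three responsibilities as separate pieces. All names (`prunePolicy`, `progress`, `removerFunc`, `pruneAll`) are invented for illustration and do not match anything in this PR:

```go
package main

import "fmt"

// prunePolicy is part 1: it decides the highest height eligible for pruning.
// (Hypothetical name; the PR's real modules are structured differently.)
type prunePolicy struct{ latestSealed, keep uint64 }

func (p prunePolicy) latestPrunable() uint64 {
	if p.latestSealed < p.keep {
		return 0 // nothing prunable yet
	}
	return p.latestSealed - p.keep
}

// progress is part 2: it tracks the next height to prune across restarts.
type progress struct{ next uint64 }

// removerFunc is part 3: data-type-specific removal for one block.
type removerFunc func(height uint64)

// pruneAll wires the three parts together in one place.
func pruneAll(policy prunePolicy, prog *progress, remove removerFunc) {
	for prog.next <= policy.latestPrunable() {
		remove(prog.next)
		prog.next++ // a real implementation would persist this checkpoint
	}
}

func main() {
	var removed []uint64
	prog := &progress{next: 1}
	pruneAll(prunePolicy{latestSealed: 1005, keep: 1000}, prog, func(h uint64) {
		removed = append(removed, h)
	})
	fmt.Println(removed, prog.next) // heights 1..5 are pruned, next is 6
}
```

The point of the sketch is that each part can be reviewed in isolation, which is the maintainability argument made above.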

Comment on lines 25 to 27
log zerolog.Logger,
metrics module.ExecutionMetrics,
ctx context.Context,
Contributor
let's keep ctx as the first parameter, per convention

Suggested change
- log zerolog.Logger,
- metrics module.ExecutionMetrics,
- ctx context.Context,
+ ctx context.Context,
+ log zerolog.Logger,
+ metrics module.ExecutionMetrics,

var _ executor.IterationExecutor = (*ChunkDataPackPruner)(nil)

func NewChunKDataPackPruner(chunkDataPacks storage.ChunkDataPacks, results storage.ExecutionResults) *ChunkDataPackPruner {
Contributor

Suggested change
- func NewChunKDataPackPruner(chunkDataPacks storage.ChunkDataPacks, results storage.ExecutionResults) *ChunkDataPackPruner {
+ func NewChunkDataPackPruner(chunkDataPacks storage.ChunkDataPacks, results storage.ExecutionResults) *ChunkDataPackPruner {

@zhangchiqing (Member, Author)

I designed the implementation in a more abstracted manner, allowing key logic to be reused when developing other pruners. Some important aspects include:

  • Handling interruptions: Ensures the process can resume after restarts.
  • Ensuring full iteration: Every block from the root to the latest is guaranteed to be iterated at least once.
  • Managing a moving latest block: Since latest is dynamic, the iteration logic accounts for this.

Added Flexibility:

  • The executor (module/block_iterator/executor) and block iterator are decoupled, preventing the block iterator from being tightly coupled with storage.
  • Because the block iterator is storage-agnostic, it can be used for tasks beyond pruning, such as migrating data between databases (e.g., from Badger to Pebble).
  • The existing executor is not concurrency-safe, as the pruner doesn’t require it. However, a concurrent executor can be implemented separately while still leveraging the block iterator. This is possible because the block iterator does not persist progress automatically—instead, it leaves this responsibility to the caller (executor).

This structure can support building the following modules when reusing the block_iterator and creators:

  • Verification node’s approval pruner
  • Execution node’s execution data pruner
  • Collection node’s protocol data pruner

Regarding your question, @peterargue, I’ve added some comments—let me know if they provide the clarity you need.
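The point above about the iterator not persisting progress itself can be sketched as follows. This is a simplified, hypothetical model (names like `blockIterator`, `checkpointStore`, and `run` are invented; the real iterator deals with block IDs and storage, not bare heights): the iterator only hands out the next block, and the caller (the executor) commits the checkpoint, so a crash resumes from the last saved height.

```go
package main

import "fmt"

// blockIterator yields heights from next to latest; it never saves progress.
type blockIterator struct{ next, latest uint64 }

func (it *blockIterator) Next() (uint64, bool) {
	if it.next > it.latest {
		return 0, false
	}
	h := it.next
	it.next++
	return h, true
}

// checkpointStore stands in for the persistent progress storage.
type checkpointStore struct{ saved uint64 }

// run is the caller's loop: it executes each block, then commits progress.
func run(it *blockIterator, store *checkpointStore, execute func(uint64)) {
	for {
		h, ok := it.Next()
		if !ok {
			return
		}
		execute(h)
		store.saved = h + 1 // the executor, not the iterator, commits progress
	}
}

func main() {
	store := &checkpointStore{saved: 3} // pretend we restarted mid-iteration
	it := &blockIterator{next: store.saved, latest: 5}
	run(it, store, func(h uint64) { fmt.Print(h, " ") })
	fmt.Println("checkpoint:", store.saved) // 3 4 5 checkpoint: 6
}
```

Because progress lives with the caller, a concurrency-safe executor could batch or reorder executions and still decide for itself when a checkpoint is safe to commit, which is the flexibility claimed above.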

}

func (l *LatestPrunable) Latest() (*flow.Header, error) {
return l.LatestSealedAndExecuted.BelowLatest(l.threshold)
@zhangchiqing (Member, Author)

to me it seems like there are 3 parts:

  1. The pruning logic that determine which blocks to prune

@peterargue
This is achieved by specifying which block is considered the latest for the block iterator. The LatestSealedAndExecuted module returns the most recently sealed and executed block. However, if we want to retain the last 1000 sealed and executed blocks without pruning them, the BelowLatest method becomes useful. By setting the threshold to 1000 and using the result as the latest block for the block iterator, the iterator will only process blocks up to latest - 1000. Combined with the pruning logic, this ensures that the blocks eligible for pruning are those up to latest - 1000. This is how the pruning logic determines which blocks to prune.

It's important to note that LatestSealedAndExecuted is part of the block_iterator package, as its use extends beyond pruning. For example, a checker engine could use the Latest method to verify whether its own results align with the sealed results.

And if we want to retain 1000 blocks for chunk data packs but 5000 blocks for protocol state, we can define a different threshold for each BelowLatest call.
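A minimal sketch of the BelowLatest idea described above, assuming a simplified header type with only a height (the real method returns *flow.Header and the names here are illustrative only):

```go
package main

import "fmt"

// header is a stand-in for flow.Header with only the field we need.
type header struct{ Height uint64 }

type latestSealedAndExecuted struct{ latest header }

// belowLatest returns the header `threshold` blocks below the latest sealed
// and executed block, so an iterator fed this value never goes past
// latest - threshold.
func (l latestSealedAndExecuted) belowLatest(threshold uint64) header {
	if l.latest.Height < threshold {
		return header{Height: 0} // fewer than threshold blocks exist; keep all
	}
	return header{Height: l.latest.Height - threshold}
}

func main() {
	l := latestSealedAndExecuted{latest: header{Height: 10500}}
	// keep the last 1000 blocks: only blocks up to height 9500 are prunable
	fmt.Println(l.belowLatest(1000).Height) // 9500
	// a different dataset could use its own threshold, e.g. 5000
	fmt.Println(l.belowLatest(5000).Height) // 5500
}
```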

chunksDB := pebbleimpl.ToDB(chunkDataPacksDB)
// the creator can be reused to create new block iterator that can iterate from the last
// checkpoint to the new latest (sealed) block.
creator, getNextAndLatest, err := makeBlockIteratorCreator(state, badgerDB, headers, chunksDB, config)
@zhangchiqing (Member, Author)

This is the main function of the pruning logic. To prevent it from becoming very long, I broke the logic into two functions: makeBlockIteratorCreator and makeIterateAndPruneAll.


const NextHeightForUnprunedExecutionDataPackKey = "NextHeightForUnprunedExecutionDataPackKey"

func LoopPruneExecutionDataFromRootToLatestSealed(
@zhangchiqing (Member, Author)

to me it seems like there are 3 parts:

  1. Individual data type logic that removes all of the data and indexes associated with a blockID.

@peterargue

This method prunes execution data, which currently only covers chunk data packs. We also need to prune other data, such as execution results and the execution data for bitswap. I haven't decided whether to put them all here, since they live in different databases. I'm thinking of using one engine for pruning each dataset, so that we can have a separate pace and config for each dataset.

)

type ChunkDataPackPruner struct {
*pruners.ChunkDataPackPruner
@zhangchiqing (Member, Author)

The pruners.ChunkDataPackPruner implements the actual pruning functions, and this ChunkDataPackPruner wraps it to make it an executor, so that the pruner can be used as an executor for the block iterator to execute on each block.

I don't want the block iterator to call pruners.ChunkDataPackPruner.PruneByBlockID directly, because the iterator is not made only for pruning. That's why I abstracted it into an "executor", so that the pruner is one type of executor. As an executor, it calls the underlying pruner's PruneByBlockID to do the actual pruning.
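The adapter pattern described above can be sketched like this. The names and signatures are simplified stand-ins (block IDs are strings here rather than flow.Identifier, and the interface is invented for illustration): the iterator loop only knows the executor interface, and the pruner is just one implementation behind it.

```go
package main

import "fmt"

// iterationExecutor is what the iterator loop depends on -- nothing
// pruning-specific.
type iterationExecutor interface {
	ExecuteByBlockID(blockID string) error
}

// chunkDataPackPruner holds the actual pruning logic.
type chunkDataPackPruner struct{ pruned []string }

func (p *chunkDataPackPruner) PruneByBlockID(blockID string) error {
	p.pruned = append(p.pruned, blockID)
	return nil
}

// prunerExecutor adapts the pruner to the executor interface, so the
// iterator never calls PruneByBlockID directly.
type prunerExecutor struct{ pruner *chunkDataPackPruner }

func (e *prunerExecutor) ExecuteByBlockID(blockID string) error {
	return e.pruner.PruneByBlockID(blockID)
}

// iterate is a stand-in for the block iterator driving the executor.
func iterate(blocks []string, exec iterationExecutor) error {
	for _, b := range blocks {
		if err := exec.ExecuteByBlockID(b); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	p := &chunkDataPackPruner{}
	_ = iterate([]string{"a", "b"}, &prunerExecutor{pruner: p})
	fmt.Println(p.pruned) // [a b]
}
```

A different executor (say, a data migrator) could implement the same interface and reuse the iterator unchanged, which is the decoupling argued for above.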

func (c *Creator) IteratorState() module.IteratorStateReader {
return c.progress
}

// NewHeightBasedCreator creates a block iterator that iterates through blocks
// from root to the latest (either finalized or sealed) by height.
func NewHeightBasedCreator(
@zhangchiqing (Member, Author)

Is there a way to simplify this? to me it seems like there are 3 parts:
...
2. The height/view tracking logic that provides the next blockID to prune

@peterargue yes. That's why we have the NewHeightBasedCreator and NewViewBasedCreator functions that build on top of NewCreator: they both take a root block and latest block as *flow.Header, and internally each picks a different field to use as the "index" for the creator.
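A simplified sketch of that idea, with invented names and a stripped-down header (the real functions take *flow.Header and more parameters): the two constructors share one generic creator and differ only in which header field they pick as the index.

```go
package main

import "fmt"

// flowHeader is a stand-in for flow.Header with just the two index fields.
type flowHeader struct {
	Height uint64
	View   uint64
}

// creator iterates indexes from root to latest; it doesn't know whether
// the index is a height or a view.
type creator struct{ root, latest uint64 }

func newCreator(root, latest uint64) creator { return creator{root, latest} }

// newHeightBasedCreator indexes blocks by height.
func newHeightBasedCreator(root, latest flowHeader) creator {
	return newCreator(root.Height, latest.Height)
}

// newViewBasedCreator indexes blocks by view.
func newViewBasedCreator(root, latest flowHeader) creator {
	return newCreator(root.View, latest.View)
}

func main() {
	root := flowHeader{Height: 0, View: 0}
	latest := flowHeader{Height: 100, View: 130} // views can run ahead of heights
	fmt.Println(newHeightBasedCreator(root, latest).latest) // 100
	fmt.Println(newViewBasedCreator(root, latest).latest)   // 130
}
```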
