[Chunk Data Pack Pruner] Add Engine for pruning chunk data pack #6946
base: master
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #6946 +/- ##
==========================================
- Coverage 41.15% 40.92% -0.23%
==========================================
Files 2131 2100 -31
Lines 186855 183365 -3490
==========================================
- Hits 76899 75044 -1855
+ Misses 103516 102051 -1465
+ Partials 6440 6270 -170
Thanks for creating this feature!
Overall I think the approach is sound, but I find it complex with the multiple layers of wrappers and interfaces. Reading it through, I've had a hard time convincing myself it's correct because there are a lot of moving pieces to keep in context. I worry that this will balloon as we add more data types to be pruned.
Is there a way to simplify this? To me it seems like there are 3 parts:
- The pruning logic that determines which blocks to prune and when
- The height/view tracking logic that provides the next blockID to prune
- Individual data type logic that removes all of the data and indexes associated with a blockID
Could this be broken into 3-4 modules, plus some data-type-specific logic that lives with the storage module? I think having the logic in fewer places, with fewer levels of abstraction, would help with maintainability and review.
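To make the proposed split concrete, here is a minimal Go sketch with hypothetical interface names (PruningPolicy, BlockTracker, and DataPruner are not part of this PR), one interface per part:

	package pruner

	import "github.com/onflow/flow-go/model/flow"

	// PruningPolicy decides which blocks may be pruned and when.
	type PruningPolicy interface {
		// LowestRetainedHeight returns the lowest height that must be kept;
		// everything below it is prunable.
		LowestRetainedHeight() (uint64, error)
	}

	// BlockTracker provides the next block ID to prune and remembers progress.
	type BlockTracker interface {
		// Next returns the next block ID to prune, or false when caught up.
		Next() (flow.Identifier, bool, error)
	}

	// DataPruner removes all data and indexes of one data type for a block;
	// one implementation per data type (chunk data packs, results, ...).
	type DataPruner interface {
		PruneByBlockID(blockID flow.Identifier) error
	}

This is only one possible decomposition under the assumptions above, not the structure the PR actually implements.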
engine/execution/pruner/core.go (Outdated)

	log zerolog.Logger,
	metrics module.ExecutionMetrics,
	ctx context.Context,
Let's keep ctx first, as a convention.

Suggested change:
-	log zerolog.Logger,
-	metrics module.ExecutionMetrics,
-	ctx context.Context,
+	ctx context.Context,
+	log zerolog.Logger,
+	metrics module.ExecutionMetrics,
engine/execution/pruner/executor.go (Outdated)

	var _ executor.IterationExecutor = (*ChunkDataPackPruner)(nil)

	func NewChunKDataPackPruner(chunkDataPacks storage.ChunkDataPacks, results storage.ExecutionResults) *ChunkDataPackPruner {
Suggested change:
-	func NewChunKDataPackPruner(chunkDataPacks storage.ChunkDataPacks, results storage.ExecutionResults) *ChunkDataPackPruner {
+	func NewChunkDataPackPruner(chunkDataPacks storage.ChunkDataPacks, results storage.ExecutionResults) *ChunkDataPackPruner {
I designed the implementation in a more abstracted manner, allowing key logic to be reused when developing other pruners. Some important aspects include:
Added flexibility: this structure can support building other modules by reusing the block_iterator and creators.
Regarding your question, @peterargue, I've added some comments below; let me know if they provide the clarity you need.
	}

	func (l *LatestPrunable) Latest() (*flow.Header, error) {
		return l.LatestSealedAndExecuted.BelowLatest(l.threshold)
to me it seems like there are 3 parts:
- The pruning logic that determines which blocks to prune

@peterargue
This is achieved by specifying which block is considered the latest for the block iterator. The LatestSealedAndExecuted module returns the most recently sealed and executed block. However, if we want to retain the last 1000 sealed and executed blocks without pruning them, the BelowLatest method becomes useful. By setting the threshold to 1000 and using its result as the latest for the block iterator, the iterator will only process blocks up to height latest - 1000. Combined with the pruning logic, this ensures that the blocks eligible for pruning are those up to latest - 1000. This is how the pruning logic determines which blocks to prune.
It's important to note that LatestSealedAndExecuted is part of the block_iterator package, as its use extends beyond pruning. For example, it can be used in a checker engine to verify, via the Latest method, whether its own results are consistent with the sealed results.
And if we want to retain 1000 blocks for chunk data packs and 5000 blocks for protocol state, we can define a different threshold for each BelowLatest call.
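As a rough illustration of this threshold idea, here is a minimal sketch with assumed field names and helper signatures (not the PR's actual code): BelowLatest shifts the iterator's upper bound down by the retention threshold.

	package latest

	import (
		"github.com/onflow/flow-go/model/flow"
		"github.com/onflow/flow-go/storage"
	)

	// LatestSealedAndExecuted is sketched here with assumed fields; the PR's
	// actual type lives in the block_iterator package.
	type LatestSealedAndExecuted struct {
		root    *flow.Header                 // root block, below which nothing exists
		headers storage.Headers              // header lookup by height
		latest  func() (*flow.Header, error) // returns the latest sealed and executed block
	}

	// BelowLatest returns the block `threshold` heights below the latest sealed
	// and executed block, so that the most recent `threshold` blocks are kept.
	func (l *LatestSealedAndExecuted) BelowLatest(threshold uint64) (*flow.Header, error) {
		latest, err := l.latest()
		if err != nil {
			return nil, err
		}
		if latest.Height <= l.root.Height+threshold {
			// not enough blocks above the root yet; returning the root means
			// the iterator effectively has nothing to prune
			return l.root, nil
		}
		return l.headers.ByHeight(latest.Height - threshold)
	}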
	chunksDB := pebbleimpl.ToDB(chunkDataPacksDB)
	// the creator can be reused to create new block iterator that can iterate from the last
	// checkpoint to the new latest (sealed) block.
	creator, getNextAndLatest, err := makeBlockIteratorCreator(state, badgerDB, headers, chunksDB, config)
This is the main function of the pruning logic. To keep it from getting very long, I broke the logic into two functions: makeBlockIteratorCreator and makeIterateAndPruneAll.
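For orientation, here is a hedged sketch, with hypothetical names (loopPrune, createIterator, pruneBlock), of how two such helpers could compose into the pruning loop: a creator produces a fresh iterator from the last checkpoint up to the current latest sealed block, each visited block is pruned, and the loop then waits before checking again.

	package pruner

	import (
		"context"
		"time"

		"github.com/onflow/flow-go/model/flow"
	)

	// loopPrune drains one iterator at a time: createIterator plays the role of
	// the block iterator creator, and pruneBlock plays the role of the
	// iterate-and-prune step applied to each visited block.
	func loopPrune(
		ctx context.Context,
		createIterator func() (func() (flow.Identifier, bool, error), error),
		pruneBlock func(flow.Identifier) error,
		interval time.Duration,
	) error {
		for {
			next, err := createIterator() // from the last checkpoint to the current latest
			if err != nil {
				return err
			}
			for {
				blockID, ok, err := next()
				if err != nil {
					return err
				}
				if !ok {
					break // iterator exhausted; wait for newly sealed blocks
				}
				if err := pruneBlock(blockID); err != nil {
					return err
				}
			}
			select {
			case <-ctx.Done():
				return ctx.Err()
			case <-time.After(interval):
			}
		}
	}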
	const NextHeightForUnprunedExecutionDataPackKey = "NextHeightForUnprunedExecutionDataPackKey"

	func LoopPruneExecutionDataFromRootToLatestSealed(
to me it seems like there are 3 parts:
- Individual data type logic that removes all of the data and indexes associated with a blockID.

This method is called PrunedExecutionData, which currently only prunes chunk data packs. We also need to prune other data, such as execution results and execution data for bitswap. I haven't decided whether to put them all here, since they live in different databases. I'm thinking of using one engine for pruning each data type, so that we can have a separate pace and config for pruning each dataset.
	)

	type ChunkDataPackPruner struct {
		*pruners.ChunkDataPackPruner
The pruners.ChunkDataPackPruner implements the actual pruning functions, and this ChunkDataPackPruner wraps it as an executor, so that the pruner can be used as an executor for the block iterator to execute on each block.
I don't want the block iterator to call pruners.ChunkDataPackPruner.PruneByBlockID directly, because the iterator is not made only for pruning. That's why I abstracted it into an "executor", so that the pruner is one type of executor. As an executor, it calls the underlying pruner's PruneByBlockID to do the actual pruning.
	func (c *Creator) IteratorState() module.IteratorStateReader {
		return c.progress
	}

	// NewHeightBasedCreator creates a block iterator that iterates through blocks
	// from root to the latest (either finalized or sealed) by height.
	func NewHeightBasedCreator(
Is there a way to simplify this? to me it seems like there are 3 parts:
...
2. The height/view tracking logic that provides the next blockID to prune

@peterargue yes, that's why we have the NewHeightBasedCreator and NewViewBasedCreator functions that build on top of NewCreator. They both take a root block and a latest block as *flow.Header, and internally pick a different field to be used as the "index" for the creator.
This PR adds the engine for chunk data pack pruning. By default, chunk data pack pruning is disabled unless the flag to enable it is specified.