Modify archiver to support fast-sync. #2722

Closed · wants to merge 5 commits
Conversation

shamil-gadelshin (Member):

This PR modifies the Subspace archiver in preparation for the "fast-sync" algorithm. Fast-sync downloads blocks starting from a random point, and the current archiver will crash because its state demands importing blocks in proper sequence. The solution is an option to initialize the archiver on the fly, similar to its initialization from existing state on restart. To achieve that, we add an option to reinitialize the archiver when it encounters an unexpected block number. The solution is split into two commits for review convenience: the first commit refactors the existing code to extract a separate archive_block function that can archive blocks coming from regular notifications as well as from the "problematic block notification" saved when block archiving failed on the previous "initialization loop" iteration.
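The reinitialization pattern the description outlines can be sketched as follows. This is a hypothetical, heavily simplified model: `archive_block` is the function name from the PR, but the `Archiver` struct, its fields, and plain `u64` block numbers are illustrative stand-ins, not the actual Subspace API.

```rust
// Hypothetical, simplified model of the pattern described above.
struct Archiver {
    last_archived_block: u64,
}

impl Archiver {
    fn new(last_archived_block: u64) -> Self {
        Self { last_archived_block }
    }

    /// Archives one block; returns Err with the block number when the
    /// notification is out of sequence, signalling the caller to
    /// reinitialize and replay the saved notification.
    fn archive_block(&mut self, block_number: u64) -> Result<(), u64> {
        if block_number != self.last_archived_block + 1 {
            // Gap detected (e.g. fast-sync imported from a random point).
            return Err(block_number);
        }
        self.last_archived_block = block_number;
        Ok(())
    }
}

fn main() {
    let mut archiver = Archiver::new(0);
    archiver.archive_block(1).unwrap();
    // Fast sync imports a block far ahead: gap detected, notification saved.
    let saved_notification = archiver.archive_block(100).unwrap_err();
    // Reinitialize "on the fly" as if restarting with existing state,
    // then replay the saved notification.
    archiver = Archiver::new(saved_notification - 1);
    archiver.archive_block(saved_notification).unwrap();
    assert_eq!(archiver.last_archived_block, 100);
}
```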

Code contributor checklist:

@nazar-pc (Member) left a comment:

I think I understand the approach taken, but I don't believe it is working quite the way it should. The archiver should likely be reinitialized normally (without any overrides), and a simple but revealing approach (in the sense that rather than ignoring reinitialization errors it would exit the node due to an implementation error) would be to ensure that after such reinitialization the last archived block is exactly before the block for which we have a saved block import notification. This would mean that we only expect blocks to be manually imported at the beginning of the archived segment and not completely arbitrarily.

```rust
let block_number_to_archive = match block_number.checked_sub(&confirmation_depth_k.into()) {
    Some(block_number_to_archive) => block_number_to_archive,
    None => {
        return Ok(true);
```
Member:

NIT: in this refactoring PR the boolean returned is always true, and there is no documentation of why it is there or what it means. Since this boolean is only meant to be useful later, I probably wouldn't introduce it in the first commit, to prevent confusion, and would only introduce it in the second, where the reviewer can see how it is used and why it is needed.

Comment on lines 855 to 858
```rust
let error = format!(
    "Failed to recover the archiver for block number: {}",
    block_import_notification.block_number
);
```
Member:

Not sure we need to log an error here if we exit with it anyway. The whole node will crash and I believe we will log the error at higher level anyway.

```rust
*best_archived_block_number = block_number_to_archive;
let maybe_block_hash = client.hash(block_number_to_archive)?;

let Some(block_hash) = maybe_block_hash else {
```
Member:

I'm not sure this is the correct or exhaustive condition. The fact that we don't have a block number to archive doesn't mean we can restart the archiver either.

What I think this should check is whether the gap between last archived block and current block to archive is not 1. If it isn't, archiver state will be inconsistent even if block to archive exists (which is theoretically possible). From there we need to retry archiver initialization until it succeeds (because again it is not guaranteed to in case block was imported in such a way that archiver can't be initialized).

The logic here is very fragile and has implicit assumptions that are not obvious, the main one being that the block import notification will be about the first block in the segment header, or else the archiver will either fail to initialize or initialize in the wrong state (not sure which one, and I'm too lazy to analyze all the code paths right now).

I actually don't think this will work for fast sync from what I recall because in fast sync the first block we import manually bypassing all the checks is the block at which archiver should be initialized and block import notification will be fired with the block that follows. So you have to let archiver pick last archived block and re-initialize itself properly instead of overriding last archived block like done in this PR.
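The gap invariant suggested above (the current block to archive must be exactly one past the last archived block, otherwise the archiver state is inconsistent) can be sketched as a simple predicate. This is an illustrative sketch of that suggestion only; the function name is made up and this is not the actual archiver code:

```rust
// Hypothetical predicate for the reviewer's suggested invariant: after
// reinitialization, archiving may only continue if the gap between the last
// archived block and the block to archive is exactly 1.
fn can_continue_archiving(last_archived_block: u64, block_to_archive: u64) -> bool {
    block_to_archive == last_archived_block + 1
}

fn main() {
    // Reinitialization landed exactly one block behind: consistent state.
    assert!(can_continue_archiving(10, 11));
    // Any other gap means inconsistent state even if the block to archive
    // exists, so initialization should be retried on a later block import.
    assert!(!can_continue_archiving(10, 13));
    assert!(!can_continue_archiving(10, 10));
}
```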

Member Author:

> What I think this should check is whether the gap between last archived block and current block to archive is not 1. If it isn't, archiver state will be inconsistent even if block to archive exists (which is theoretically possible).

I agree here - this would be an improvement. I'll change it when we agree on other points.

> […] From there we need to retry archiver initialization until it succeeds (because again it is not guaranteed to in case block was imported in such a way that archiver can't be initialized).

I don't think I understand here. Why do you think we need to try to reinitialize the archiver until it succeeds when it fails after the first reinitialization? The current loop allows failing exactly once for each block import attempt because subsequent initialization won't change anything and it's better to fail fast here.

> The logic here is very fragile and has implicit assumptions that are not obvious, the main one being that the block import notification will be about the first block in the segment header, or else the archiver will either fail to initialize or initialize in the wrong state.

This confuses me a lot because it's pretty much my own argument when we discussed this approach in contrast to an explicit reinitialization of the previous version.

> I actually don't think this will work for fast sync from what I recall because in fast sync the first block we import manually bypassing all the checks is the block at which archiver should be initialized and block import notification will be fired with the block that follows. So you have to let archiver pick last archived block and re-initialize itself properly instead of overriding last archived block like done in this PR.

I tested the PR by applying all the rest of the fast-sync solution and it works as expected. The overriding emerges when we need to deal with the confirmation_depth_k subtraction - I don't see a better way to work around this operation and am happy to implement it differently.

Overall after the last two refactoring PRs the current solution is very close to what we discussed previously (at least from my perspective) as an alternative to the event-based explicit initialization. The deviation from that (best block to archive override and saved block import notification) emerged with the practical implementation of the original sketch. Please, let me know what you think is missing.

Member:

> I don't think I understand here. Why do you think we need to try to reinitialize the archiver until it succeeds when it fails after the first reinitialization? The current loop allows failing exactly once for each block import attempt because subsequent initialization won't change anything and it's better to fail fast here.

I should have mentioned that each new attempt should be made after new imported blocks. Does it make more sense with this context?

> This confuses me a lot because it's pretty much my own argument when we discussed this approach in contrast to an explicit reinitialization of the previous version.

Ideally we would have neither explicit reinitialization nor the issues mentioned and I do believe it is possible.

> I tested the PR by applying all the rest of the fast-sync solution and it works as expected. The overriding emerges when we need to deal with the confirmation_depth_k subtraction - I don't see a better way to work around this operation and am happy to implement it differently.

I suspect it worked as expected until it didn't. It would have failed on the next segment header, which would happen at a different point/with a different state. This can be verified by modifying your fast sync test to import a block from the pre-last segment instead, so you can check whether the next segment is processed correctly. I bet it will not succeed.

> Overall after the last two refactoring PRs the current solution is very close to what we discussed previously (at least from my perspective) as an alternative to the event-based explicit initialization. The deviation from that (best block to archive override and saved block import notification) emerged with the practical implementation of the original sketch. Please, let me know what you think is missing.

I agree, just trying to analyze the code path and see if there are issues/improvements with what I see.

It would be great to have tests here to check such cases, but there are quite a few bounds in this function that makes it difficult.

crates/sc-consensus-subspace/src/archiver.rs (outdated thread, resolved)
@nazar-pc (Member) left a comment:

I'd really appreciate it if you could extract the archive_block function refactoring into a separate PR (with the suggested return type applied, though it will be without the Option<> wrapper in that PR), because with several incremental commits it is getting harder to distinguish between what has changed and what didn't.

```rust
    best_block_number.saturating_sub(confirmation_depth_k.into()),
)?;
// Trying to get the "best block to archive" in both cases: regular and fast sync.
let mut best_block_to_archive = best_block_number + 1u32.into();
```
Member:

Logically this is nonsense: it is not possible to archive a block higher than the best block number that exists. And you had to add an unnecessary +1 below just to compensate for this issue, which wouldn't exist otherwise.

Member Author:

best_block_to_archive is an unchanged variable name that is used only within the find_last_archived_block method, and it is likely a confusing one. The previous pseudocode we agreed on didn't work. Feel free to suggest a better solution.

Member:

Didn't work how and why?

Member Author:

I replaced the loop snippet that confuses both of us back to the previous version with override. I left a TODO if we want to return to this issue.

Member:

But why? Override is a bad and redundant API. How exactly did it not work without +1?

Comment on lines 545 to 549
```rust
&& client
    .hash(block_number_candidate.saturating_sub(1u32.into()))
    .ok()
    .flatten()
    .is_some()
```
Member:

This kind of tricky code should have comments. Even with deep knowledge of the code I do not understand why you need to also check the parent just from reading the code.

Member Author:

Added a reference to fast-sync.

Member:

> // We might add the block on the fast sync and we need to check for the parent block
> // availability as well to continue iteration.

I am still missing a link between fast sync and parent block. I do not understand why fast sync automatically implies the need to check for parent block. In fact I expect we may not have it once we restart node after single block insertion at the beginning of the segment. There will not be a parent block in that case, but it is fine.

Member Author:

I replaced the loop snippet that confuses both of us back to the previous version with override. I left a TODO if we want to return to this issue.

Member:

Why? This is exactly the kind of technical debt that will live in the code for years, because no one will understand why it was done this way and not in a more straightforward way. I'd rather understand why something that should have worked didn't.

crates/sc-consensus-subspace/src/archiver.rs (3 outdated threads, resolved)
```rust
    block_import_notification.block_number
);
error!(error);

return Err(sp_blockchain::Error::Consensus(sp_consensus::Error::Other(
```
Member:

My expectation was that archiver would just try again. For that to happen we don't need to save block import notification, we can simply wait for next notification before doing re-initialization. And try that in a loop until it succeeds.

Member Author:

Not sure I follow, if we don't save the import notification - we skip it and fail later. The current loop structure was agreed previously.

Member:

This structure was primarily needed for the override. Off the top of my head I don't see the reason why this is needed, and it does make the code much harder to follow even if it works correctly.

Member Author:

I don't understand this thread's question, please rephrase. Meanwhile, I'll try to explain how it works:
When the archiver detects a gap between blocks, it returns control to the calling function, which in turn saves the current block notification. After that, it reinitializes the archiver (via the initialization loop) and uses the saved block notification to archive the block again. After that, it resumes the normal processing of block notifications. If we don't save the problematic block notification, it would be lost and the archiving process would be broken.
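The flow described above (save the problematic notification, reinitialize, replay it, then resume normally) can be modeled with a small self-contained sketch. This is a hypothetical simplification with plain u64 block numbers and a stand-in for real reinitialization, not the actual driver loop:

```rust
// Illustrative model of the save-and-replay loop described above.
fn run_archiver(notifications: &[u64]) -> Vec<u64> {
    let mut last_archived = 0u64;
    let mut archived = Vec::new();
    let mut saved: Option<u64> = None;
    let mut pending = notifications.iter().copied();

    loop {
        // Replay a saved problematic notification first, otherwise take the
        // next regular one.
        let block = match saved.take().or_else(|| pending.next()) {
            Some(block) => block,
            None => break,
        };
        if block != last_archived + 1 {
            // Gap detected: save the notification and "reinitialize" so the
            // next iteration can archive it instead of losing it.
            saved = Some(block);
            last_archived = block - 1; // stand-in for real reinitialization
            continue;
        }
        last_archived = block;
        archived.push(block);
    }
    archived
}

fn main() {
    // A fast-sync-style gap between 2 and 100 is recovered via replay.
    assert_eq!(run_archiver(&[1, 2, 100, 101]), vec![1, 2, 100, 101]);
}
```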

Member:

> If we don't save the problematic block notification then it would be lost and the archiving process would be broken.

It will not be if we wait for the next block import as I suggested. Archiver will restart in fully deterministic way and will continue operation just fine.

The issue with this line I commented on is that it returns an error, meaning archiver will exit and node will crash with an error. While I think archiver should just restart in a loop until it succeeds.
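The alternative being proposed (drop the failed notification and retry reinitialization on each subsequent block import until it succeeds, instead of exiting with an error) could look roughly like this. Names and signatures are illustrative, not the real API:

```rust
// Hypothetical retry loop: attempt archiver initialization on every new
// block import until one attempt succeeds, rather than failing fast.
fn reinitialize_until_success(
    imports: &[u64],
    init: impl Fn(u64) -> Result<u64, ()>,
) -> Option<u64> {
    for &block in imports {
        if let Ok(last_archived) = init(block) {
            // Deterministic restart succeeded; resume normal operation.
            return Some(last_archived);
        }
        // Initialization failed for this import; wait for the next one.
    }
    None
}

fn main() {
    // Hypothetical initializer that only succeeds once block 3 is imported.
    let init = |block: u64| if block >= 3 { Ok(block - 1) } else { Err(()) };
    assert_eq!(reinitialize_until_success(&[1, 2, 3], init), Some(2));
}
```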

crates/sc-consensus-subspace/src/archiver.rs (2 outdated threads, resolved)
@shamil-gadelshin (Member Author) left a comment:

> I'd really appreciate if you could extract archive_block function refactoring into a separate PR

It would be the fourth PR extracted from the first part of the original PR. I believe it is overkill - please, let's proceed as it is.


@shamil-gadelshin shamil-gadelshin added the need to audit This change needs to be audited label May 6, 2024
@nazar-pc nazar-pc mentioned this pull request May 7, 2024
@shamil-gadelshin (Member Author):

Superseded by #2744 and #2748

@nazar-pc nazar-pc deleted the modify-archiver3 branch May 9, 2024 09:31
@vanhauser-thc (Collaborator):

this is a closed PR, so why the needs-audit flag?

@vanhauser-thc vanhauser-thc removed the need to audit This change needs to be audited label Jun 7, 2024
@nazar-pc (Member) commented Jun 7, 2024:

I think in this case you're supposed to look at successors: #2744 and #2748

@nazar-pc (Member) commented Jun 7, 2024:

I also just labeled a few PRs related to Snap sync for audit. That new sync implementation is what we want to make the default for farmers (though it is not yet the case).
