Modify archiver to support fast-sync. #2722

Closed · wants to merge 5 commits
Conversation

shamil-gadelshin (Member):

This PR modifies the Subspace archiver in preparation for the "fast-sync" algorithm. Fast-sync downloads blocks starting from a random point, and the current archiver will crash because its state demands importing blocks in proper sequence. The solution is an option to initialize the archiver on the fly, similar to its initialization from existing state on restart. To achieve that, we add an option to reinitialize the archiver when it encounters an unexpected block number. The solution is split into two commits for review convenience: the first commit refactors the existing code to extract a separate archive_block function that can archive blocks coming from regular notifications as well as from the "problematic block notification" saved when block archiving failed on the previous "initialization loop" iteration.
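The reinitialization pattern the description outlines can be sketched as follows. This is a hypothetical, heavily simplified model: `archive_block` is the function name from the PR, but the `Archiver` struct, its fields, and plain `u64` block numbers are illustrative stand-ins, not the actual Subspace API.

```rust
// Hypothetical, simplified model of the pattern described above.
struct Archiver {
    last_archived_block: u64,
}

impl Archiver {
    fn new(last_archived_block: u64) -> Self {
        Self { last_archived_block }
    }

    /// Archives one block; returns Err with the block number when the
    /// notification is out of sequence, signalling the caller to
    /// reinitialize and replay the saved notification.
    fn archive_block(&mut self, block_number: u64) -> Result<(), u64> {
        if block_number != self.last_archived_block + 1 {
            // Gap detected (e.g. fast-sync imported from a random point).
            return Err(block_number);
        }
        self.last_archived_block = block_number;
        Ok(())
    }
}

fn main() {
    let mut archiver = Archiver::new(0);
    archiver.archive_block(1).unwrap();
    // Fast sync imports a block far ahead: gap detected, notification saved.
    let saved_notification = archiver.archive_block(100).unwrap_err();
    // Reinitialize "on the fly" as if restarting with existing state,
    // then replay the saved notification.
    archiver = Archiver::new(saved_notification - 1);
    archiver.archive_block(saved_notification).unwrap();
    assert_eq!(archiver.last_archived_block, 100);
}
```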

Code contributor checklist:

@nazar-pc (Member) left a comment:

I think I understand the approach taken, but I don't believe it is working quite the way it should. The archiver should likely be reinitialized normally (without any overrides), and a simple but revealing approach (in the sense that rather than ignoring reinitialization errors it would exit the node due to an implementation error) would be to ensure that after such reinitialization the last archived block is exactly before the block for which we have a saved block import notification. This would mean that we only expect blocks to be manually imported at the beginning of the archived segment and not completely arbitrarily.

```rust
let block_number_to_archive = match block_number.checked_sub(&confirmation_depth_k.into()) {
    Some(block_number_to_archive) => block_number_to_archive,
    None => {
        return Ok(true);
```
Member:

NIT: in this refactoring PR the boolean returned is always true, and there is no documentation of why it is there or what it means. Since this boolean is only meant to be useful later, I probably wouldn't introduce it in the first commit, to prevent confusion, and would only introduce it in the second, where the reviewer can see how it is used and why it is needed.

Comment on lines 855 to 858
```rust
let error = format!(
    "Failed to recover the archiver for block number: {}",
    block_import_notification.block_number
);
```
Member:

Not sure we need to log an error here if we exit with it anyway. The whole node will crash and I believe we will log the error at higher level anyway.

```rust
*best_archived_block_number = block_number_to_archive;
let maybe_block_hash = client.hash(block_number_to_archive)?;

let Some(block_hash) = maybe_block_hash else {
```
Member:

I'm not sure this is the correct or exhaustive condition. The fact that we don't have a block number to archive doesn't mean we can restart the archiver either.

What I think this should check is whether the gap between last archived block and current block to archive is not 1. If it isn't, archiver state will be inconsistent even if block to archive exists (which is theoretically possible). From there we need to retry archiver initialization until it succeeds (because again it is not guaranteed to in case block was imported in such a way that archiver can't be initialized).

The logic here is very fragile and has implicit assumptions that are not obvious, the main one being that the block import notification will be about the first block in the segment header, or else the archiver will either fail to initialize or initialize in the wrong state (not sure which one, and I'm too lazy to analyze all the code paths right now).

I actually don't think this will work for fast sync from what I recall because in fast sync the first block we import manually bypassing all the checks is the block at which archiver should be initialized and block import notification will be fired with the block that follows. So you have to let archiver pick last archived block and re-initialize itself properly instead of overriding last archived block like done in this PR.
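The gap invariant suggested above (the current block to archive must be exactly one past the last archived block, otherwise the archiver state is inconsistent) can be sketched as a simple predicate. This is an illustrative sketch of that suggestion only; the function name is made up and this is not the actual archiver code:

```rust
// Hypothetical predicate for the reviewer's suggested invariant: after
// reinitialization, archiving may only continue if the gap between the last
// archived block and the block to archive is exactly 1.
fn can_continue_archiving(last_archived_block: u64, block_to_archive: u64) -> bool {
    block_to_archive == last_archived_block + 1
}

fn main() {
    // Reinitialization landed exactly one block behind: consistent state.
    assert!(can_continue_archiving(10, 11));
    // Any other gap means inconsistent state even if the block to archive
    // exists, so initialization should be retried on a later block import.
    assert!(!can_continue_archiving(10, 13));
    assert!(!can_continue_archiving(10, 10));
}
```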

Member Author:

> What I think this should check is whether the gap between last archived block and current block to archive is not 1. If it isn't, archiver state will be inconsistent even if block to archive exists (which is theoretically possible).

I agree here - this would be an improvement. I'll change it when we agree on other points.

> […] From there we need to retry archiver initialization until it succeeds (because again it is not guaranteed to in case block was imported in such a way that archiver can't be initialized).

I don't think I understand here. Why do you think we need to try to reinitialize the archiver until it succeeds when it fails after the first reinitialization? The current loop allows failing exactly once for each block import attempt because subsequent initialization won't change anything and it's better to fail fast here.

> The logic here is very fragile and has implicit assumptions that are not obvious, the main one being that the block import notification will be about the first block in the segment header, or else the archiver will either fail to initialize or initialize in the wrong state.

This confuses me a lot because it's pretty much my own argument when we discussed this approach in contrast to an explicit reinitialization of the previous version.

> I actually don't think this will work for fast sync from what I recall because in fast sync the first block we import manually bypassing all the checks is the block at which archiver should be initialized and block import notification will be fired with the block that follows. So you have to let archiver pick last archived block and re-initialize itself properly instead of overriding last archived block like done in this PR.

I tested the PR by applying all the rest of the fast-sync solution and it works as expected. The overriding emerges when we need to deal with the confirmation_depth_k subtraction - I don't see a better way to work around this operation and am happy to implement it differently.

Overall after the last two refactoring PRs the current solution is very close to what we discussed previously (at least from my perspective) as an alternative to the event-based explicit initialization. The deviation from that (best block to archive override and saved block import notification) emerged with the practical implementation of the original sketch. Please, let me know what you think is missing.

Member:

> I don't think I understand here. Why do you think we need to try to reinitialize the archiver until it succeeds when it fails after the first reinitialization? The current loop allows failing exactly once for each block import attempt because subsequent initialization won't change anything and it's better to fail fast here.

I should have mentioned that each new attempt should be made after new imported blocks. Does it make more sense with this context?

> This confuses me a lot because it's pretty much my own argument when we discussed this approach in contrast to an explicit reinitialization of the previous version.

Ideally we would have neither explicit reinitialization nor the issues mentioned and I do believe it is possible.

> I tested the PR by applying all the rest of the fast-sync solution and it works as expected. The overriding emerges when we need to deal with the confirmation_depth_k subtraction - I don't see a better way to work around this operation and am happy to implement it differently.

I suspect it worked as expected until it didn't. It would have failed on the next segment header, which would happen at a different point/with a different state. This can be verified by modifying your fast sync test to import a block from the pre-last segment instead, so you can check whether the next segment is processed correctly. I bet it will not succeed.

> Overall after the last two refactoring PRs the current solution is very close to what we discussed previously (at least from my perspective) as an alternative to the event-based explicit initialization. The deviation from that (best block to archive override and saved block import notification) emerged with the practical implementation of the original sketch. Please, let me know what you think is missing.

I agree, just trying to analyze the code path and see if there are issues/improvements with what I see.

It would be great to have tests here to check such cases, but there are quite a few bounds in this function that makes it difficult.

crates/sc-consensus-subspace/src/archiver.rs (outdated thread, resolved)
@nazar-pc (Member) left a comment:

I'd really appreciate it if you could extract the archive_block function refactoring into a separate PR (with the suggested return type applied, though it will be without the Option<> wrapper in that PR), because with several incremental commits it is getting harder to distinguish between what has changed and what didn't.

```rust
    best_block_number.saturating_sub(confirmation_depth_k.into()),
)?;
// Trying to get the "best block to archive" in both cases: regular and fast sync.
let mut best_block_to_archive = best_block_number + 1u32.into();
```
Member:

Logically this is nonsense: it is not possible to archive a block higher than the best block number that exists. And you had to add an unnecessary +1 below just to compensate for this issue, which wouldn't exist otherwise.

Member Author:

best_block_to_archive is an unchanged variable name that is used only within the find_last_archived_block method, and it is likely a confusing one. The previous pseudocode we agreed on didn't work. Feel free to suggest a better solution.

Member:

Didn't work how and why?

Member Author:

I replaced the loop snippet that confuses both of us back to the previous version with override. I left a TODO if we want to return to this issue.

Member:

But why? Override is a bad and redundant API. How exactly did it not work without +1?

Comment on lines 545 to 549
```rust
&& client
    .hash(block_number_candidate.saturating_sub(1u32.into()))
    .ok()
    .flatten()
    .is_some()
```
Member:

This kind of tricky code should have comments. Even with deep knowledge of the code I do not understand why you need to also check the parent just from reading the code.

Member Author:

Added a reference to fast-sync.

Member:

> // We might add the block on the fast sync and we need to check for the parent block
> // availability as well to continue iteration.

I am still missing a link between fast sync and parent block. I do not understand why fast sync automatically implies the need to check for parent block. In fact I expect we may not have it once we restart node after single block insertion at the beginning of the segment. There will not be a parent block in that case, but it is fine.

Member Author:

I replaced the loop snippet that confuses both of us back to the previous version with override. I left a TODO if we want to return to this issue.

Member:

Why? This is exactly the kind of technical debt that will live in the code for years, because no one will understand why it was done this way and not in a more straightforward way. I'd rather understand why something that should have worked didn't.

crates/sc-consensus-subspace/src/archiver.rs (3 outdated threads, resolved)
```rust
    block_import_notification.block_number
);
error!(error);

return Err(sp_blockchain::Error::Consensus(sp_consensus::Error::Other(
```
Member:

My expectation was that archiver would just try again. For that to happen we don't need to save block import notification, we can simply wait for next notification before doing re-initialization. And try that in a loop until it succeeds.

Member Author:

Not sure I follow, if we don't save the import notification - we skip it and fail later. The current loop structure was agreed previously.

Member:

This structure was primarily needed for the override. Off the top of my head I don't see the reason why this is needed, and it does make the code much harder to follow even if it works correctly.

Member Author:

I don't understand this thread's question, please rephrase. Meanwhile, I'll try to explain how it works:
When the archiver detects a gap between blocks, it returns control to the calling function, which in turn saves the current block notification. After that, it reinitializes the archiver (via the initialization loop) and uses the saved block notification to archive the block again. After that, it resumes the normal processing of block notifications. If we don't save the problematic block notification, it would be lost and the archiving process would be broken.
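The flow described above (save the problematic notification, reinitialize, replay it, then resume normally) can be modeled with a small self-contained sketch. This is a hypothetical simplification with plain u64 block numbers and a stand-in for real reinitialization, not the actual driver loop:

```rust
// Illustrative model of the save-and-replay loop described above.
fn run_archiver(notifications: &[u64]) -> Vec<u64> {
    let mut last_archived = 0u64;
    let mut archived = Vec::new();
    let mut saved: Option<u64> = None;
    let mut pending = notifications.iter().copied();

    loop {
        // Replay a saved problematic notification first, otherwise take the
        // next regular one.
        let block = match saved.take().or_else(|| pending.next()) {
            Some(block) => block,
            None => break,
        };
        if block != last_archived + 1 {
            // Gap detected: save the notification and "reinitialize" so the
            // next iteration can archive it instead of losing it.
            saved = Some(block);
            last_archived = block - 1; // stand-in for real reinitialization
            continue;
        }
        last_archived = block;
        archived.push(block);
    }
    archived
}

fn main() {
    // A fast-sync-style gap between 2 and 100 is recovered via replay.
    assert_eq!(run_archiver(&[1, 2, 100, 101]), vec![1, 2, 100, 101]);
}
```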

Member:

> If we don't save the problematic block notification then it would be lost and the archiving process would be broken.

It will not be if we wait for the next block import as I suggested. Archiver will restart in fully deterministic way and will continue operation just fine.

The issue with this line I commented on is that it returns an error, meaning archiver will exit and node will crash with an error. While I think archiver should just restart in a loop until it succeeds.
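The alternative being proposed (drop the failed notification and retry reinitialization on each subsequent block import until it succeeds, instead of exiting with an error) could look roughly like this. Names and signatures are illustrative, not the real API:

```rust
// Hypothetical retry loop: attempt archiver initialization on every new
// block import until one attempt succeeds, rather than failing fast.
fn reinitialize_until_success(
    imports: &[u64],
    init: impl Fn(u64) -> Result<u64, ()>,
) -> Option<u64> {
    for &block in imports {
        if let Ok(last_archived) = init(block) {
            // Deterministic restart succeeded; resume normal operation.
            return Some(last_archived);
        }
        // Initialization failed for this import; wait for the next one.
    }
    None
}

fn main() {
    // Hypothetical initializer that only succeeds once block 3 is imported.
    let init = |block: u64| if block >= 3 { Ok(block - 1) } else { Err(()) };
    assert_eq!(reinitialize_until_success(&[1, 2, 3], init), Some(2));
}
```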

crates/sc-consensus-subspace/src/archiver.rs (2 outdated threads, resolved)
@shamil-gadelshin (Member Author) left a comment:

> I'd really appreciate if you could extract archive_block function refactoring into a separate PR

It would be the fourth PR extracted from the first part of the original PR. I believe it is overkill - please, let's proceed as it is.


@shamil-gadelshin shamil-gadelshin added the need to audit This change needs to be audited label May 6, 2024
@nazar-pc nazar-pc mentioned this pull request May 7, 2024
@shamil-gadelshin (Member Author):

Superseded by #2744 and #2748

@nazar-pc nazar-pc deleted the modify-archiver3 branch May 9, 2024 09:31
@vanhauser-thc (Collaborator):

this is a closed PR, so why the needs-audit flag?

@vanhauser-thc vanhauser-thc removed the need to audit This change needs to be audited label Jun 7, 2024
@nazar-pc (Member) commented Jun 7, 2024:

I think in this case you're supposed to look at successors: #2744 and #2748

@nazar-pc (Member) commented Jun 7, 2024:

I also just labeled a few PRs related to Snap sync for audit. That new sync implementation is what we want to make the default for farmers (though it is not yet the case).
