
Modify archiver to support fast-sync. #2684

Closed · wants to merge 1 commit
Conversation

shamil-gadelshin (Member)

This PR modifies the Subspace archiver in preparation for the "fast-sync" algorithm. Fast-sync will download blocks starting from an arbitrary point, and the current archiver will crash because its state requires importing blocks in sequence. The solution is to add an option to initialize the archiver on the fly, similar to how it is initialized from existing state on restart. To achieve that, we add an additional event to SubspaceLink and try reading this event as the first stage of the block-import loop. The archiver.rs changes are mostly due to new spacing; the main changes are event reading, a refactored update_segment_headers_for_archived_block, and some new logs.

Code contributor checklist:

@nazar-pc (Member) left a comment


Two thoughts so far:

  1. I don't think this handles the case of, let's say, one block imported with fast sync in the middle of nowhere and the node shutting down right after that. The archiver will be unable to restart in such a situation.
  2. The notification is a bit ugly, and I'm wondering if the archiver should instead detect that there is a gap with missing blocks on its own and re-initialize automatically, without the need for an explicit notification. It might even be combined with (1) above to restart in a loop until it can initialize successfully, depending on the desired behavior.

@@ -683,7 +697,7 @@ fn finalize_block<Block, Backend, Client>(
error
})?;

-debug!("Finalizing blocks up to ({:?}, {})", number, hash);
+info!("Finalizing blocks up to ({:?}, {})", number, hash);
Member

Are you sure this should be an info-level message?

@@ -917,7 +970,7 @@ async fn send_archived_segment_notification(
archived_segment_notification_sender.notify(move || archived_segment_notification);

while acknowledgement_receiver.next().await.is_some() {
-debug!(
+info!(
Member

This will be very annoying for farmers with hundreds of farms connected to the node. Let's change it back to debug.

@@ -29,7 +29,7 @@ type SharedNotificationSenders<T> = Arc<Mutex<Vec<TracingUnboundedSender<T>>>>;

/// The sending half of the Subspace notification channel(s).
#[derive(Clone)]
-pub(crate) struct SubspaceNotificationSender<T: Clone + Send + Sync + fmt::Debug + 'static> {
+pub struct SubspaceNotificationSender<T: Clone + Send + Sync + fmt::Debug + 'static> {
Member

Why are this and the notify method public?

Member Author

Is there a way to fire an event otherwise? I needed to fire an event from other crates, so I made this method public.

Member

If you need to fire an event from other places, you should probably use a different API and not this one. This one was designed to be private.

@@ -840,6 +840,10 @@ impl Archiver {
object_mapping,
}
}

pub fn kzg(&self) -> Kzg {
Member

Why? This is an internal property and shouldn't be exposed.

Member Author

To quickly obtain Kzg to reinitialize the archiver. I'm open to a better way to get this data.

Member

If you just need Kzg instance you can acquire it the same exact way as it is done during original archiver initialization. It is quite ugly to just expose random internals because you need them elsewhere. There should be a clean (as much as possible) dependency injection tree, etc.

Member Author

Correct me if I'm wrong, but Kzg is a wrapper around KZG settings and a cache with more settings protected by a mutex, judging by the implementation. I treat this structure as configuration. If this view is not correct and it is unique private information, then it should be kept encapsulated.

Member

I'm not sure what you wanted to say with this comment. The Kzg instance is the same everywhere. Yes, it does have a cache for efficiency reasons, but it is not unique or special in any way. We just clone Kzg and pass it along. There is no need to expose it from the archiver itself.

last_archived_segment,
&last_archived_block_encoded,
last_archived_block.2,
).expect("Invalid initial archival state should stop the application to prevent further losses.");
Member

rustfmt is not able to format code if you have strings that are too long. Split it into more than one line and it'll be able to reformat the code successfully. It looks a bit off right now.

@shamil-gadelshin (Member Author) left a comment

  1. The "force imported" blocks don't go through the archiver and won't affect the archiver on restart. Did you mean this first block or another one?
  2. I'm not sure I follow here. The archiver can only be reinitialized at the beginning of a segment (not at any point), and its ability to make this decision on its own will be used very rarely (it will crash on the attempt on every other occasion). Also, it can only auto-initialize on the fly when it has all the necessary data (last archived segment header, last encoded imported block, object mappings). Do you mean passing the archiver constructor a provider for this data, current as of the current block? The event passes the necessary information valid at the exact point.


@nazar-pc (Member) left a comment

> The "force imported" blocks don't go through the archiver and won't affect the archiver on restart. Did you mean this first block or another one?

The archiver on node restart will read some blocks to reconstruct its state. If there are just random blocks inserted, with blocks missing to the left and right, the archiver will be unable to restart. Specifically, after step 4/5 in subspace/protocol-specs#26 and before step 6, the node is in an effectively broken state from which the archiver will not be able to recover with the code as written, unless I'm missing something.

> I'm not sure I follow here. The archiver can only be reinitialized at the beginning of a segment (not at any point), and its ability to make this decision on its own will be used very rarely (it will crash on the attempt on every other occasion). Also, it can only auto-initialize on the fly when it has all the necessary data (last archived segment header, last encoded imported block, object mappings). Do you mean passing the archiver constructor a provider for this data, current as of the current block? The event passes the necessary information valid at the exact point.

The archiver can be, and is, reinitialized at an arbitrary point right now; it automatically finds the correct block and segment header to start from and moves on from there without issues. I can imagine it doing something like detecting a missing block that it can't continue from, waiting for the next block to be imported, and trying again from there until it succeeds, instead of exiting with an error and bringing down the whole node with it. We basically have new assumptions that the old design didn't support, and we need to fix that.

In the process of doing this, I think we need to remove or at least redesign SubspaceLink.segment_headers to not fill it from the archiver, but rather query the necessary info from SegmentHeadersStore, which is present and contains data even before the archiver is initialized. This might be important in case the archiver wasn't able to start successfully and you are trying to import blocks that expect inherents in them. This will also likely remove the need to run synchronous code in create_subspace_archiver before returning a future from it, which in turn will likely allow adding a loop {} in there and letting it restart as necessary in full, simplifying the reinitialization logic.


@shamil-gadelshin (Member Author) left a comment

> The archiver on node restart will read some blocks to reconstruct its state. If there are just random blocks inserted, with blocks missing to the left and right, the archiver will be unable to restart. Specifically, after step 4/5 in subspace/protocol-specs#26 and before step 6, the node is in an effectively broken state from which the archiver will not be able to recover with the code as written, unless I'm missing something.

I don't think I understand this statement. After steps 4-5, the archiver will remain untouched because the block will be "force imported", bypassing all the checks and all the Subspace code. During step 6 it will start to import blocks in normal mode and will be able to recover as before. I think the "force import" concept is the source of the misunderstanding here.

> The archiver can be, and is, reinitialized at an arbitrary point right now; it automatically finds the correct block and segment header to start from and moves on from there without issues.

The archiver can be reinitialized at an arbitrary point when it has access to the full correct history of blocks, which is not the case for fast sync. In the case of fast sync, it can do that at specific points (the beginning of a segment), with additional data provided for this "event".

> I can imagine it doing something like detecting a missing block that it can't continue from, waiting for the next block to be imported, and trying again from there until it succeeds, instead of exiting with an error and bringing down the whole node with it.

Is there an option to skip blocks imported by the archiver? I thought the archiver import should be infallible.

> We basically have new assumptions that the old design didn't support, and we need to fix that.

This is exactly true, because we have a new feature that the old design didn't expect. Did you mean assumptions that must remain untouched because of the outer context?

> instead of exiting with an error and bringing down the whole node with it.

That should be the case when the archiver receives incorrect data. Am I missing a scenario in which we can proceed without a working archiver?

I don't fully grasp your suggestions; however, at this moment I can imagine that we'll need some kind of oracle to get the necessary data (last archived segment header, last encoded imported block, object mappings) in the case of the proposed auto-detection of incorrect archiver state.

> In the process of doing this, I think we need to remove or at least redesign SubspaceLink.segment_headers to not fill it from the archiver, but rather query the necessary info from SegmentHeadersStore, which is present and contains data even before the archiver is initialized,

Did you mean traversing the segment headers and returning a block number via last_archived_block from the segment headers?


@nazar-pc (Member) left a comment

> I don't think I understand this statement. After steps 4-5, the archiver will remain untouched because the block will be "force imported", bypassing all the checks and all the Subspace code. During step 6 it will start to import blocks in normal mode and will be able to recover as before. I think the "force import" concept is the source of the misunderstanding here.

Now imagine the node restarts right after step 6. Do you see the issue now?

> The archiver can be reinitialized at an arbitrary point when it has access to the full correct history of blocks, which is not the case for fast sync. In the case of fast sync, it can do that at specific points (the beginning of a segment), with additional data provided for this "event".

Which is exactly the issue I described above, and the one I think we should tackle.

> Is there an option to skip blocks imported by the archiver? I thought the archiver import should be infallible.

Adding blocks to an already instantiated archiver is infallible, of course. I was talking about archiving erroring out as described above, and the need to recover from such situations. One way is to simply retry archiver initialization.

> This is exactly true, because we have a new feature that the old design didn't expect. Did you mean assumptions that must remain untouched because of the outer context?

Not sure what you mean by "must remain untouched". I'm advocating for touching even more code, to handle more cases and make the implementation cleaner at the same time. The first step there is to remove SubspaceLink.segment_headers.

> Did you mean traversing the segment headers and returning a block number via last_archived_block from the segment headers?

Hopefully not traversing too much, but yes.


@shamil-gadelshin (Member Author)

> The first step there is to remove SubspaceLink.segment_headers.

This seems to be out of scope for this PR. I will create a new PR and then rework this PR on top of it.

@shamil-gadelshin (Member Author)

Superseded by #2722

@nazar-pc nazar-pc deleted the modify-archiver branch September 26, 2024 05:09