
Modify archiver to support fast-sync. #2684

Closed · wants to merge 1 commit
Conversation

shamil-gadelshin (Member)

This PR modifies the Subspace archiver in preparation for the "fast-sync" algorithm. Fast-sync will download blocks starting from an arbitrary point, and the current archiver will crash because its state requires importing blocks in sequence. The solution is to add an option to initialize the archiver on the fly, similar to how it is initialized from existing state on restart. To achieve that, we add an additional event to SubspaceLink and try reading this event as the first stage of the block-import loop. The archiver.rs changes are mostly due to new spacing; the main changes are event reading, a refactored update_segment_headers_for_archived_block, and some new logs.

Code contributor checklist:

@nazar-pc (Member) left a comment


Two thoughts so far:

  1. I don't think this handles the case of, let's say, one block imported with fast sync in the middle of nowhere and the node shutting down right after that. The archiver will be unable to restart in such a situation.
  2. The notification is a bit ugly, and I'm wondering if the archiver should instead detect that there is a gap with missing blocks on its own and re-initialize automatically, without the need for an explicit notification. It might even be combined with (1) above to restart in a loop until it can initialize successfully, depending on the desired behavior.

@@ -683,7 +697,7 @@ fn finalize_block<Block, Backend, Client>(
error
})?;

-debug!("Finalizing blocks up to ({:?}, {})", number, hash);
+info!("Finalizing blocks up to ({:?}, {})", number, hash);
Member

Are you sure this should be an info-level message?

@@ -917,7 +970,7 @@ async fn send_archived_segment_notification(
archived_segment_notification_sender.notify(move || archived_segment_notification);

while acknowledgement_receiver.next().await.is_some() {
-debug!(
+info!(
Member

This will be very annoying for farmers with hundreds of farms connected to the node. Let's change it back to debug.

@@ -29,7 +29,7 @@ type SharedNotificationSenders<T> = Arc<Mutex<Vec<TracingUnboundedSender<T>>>>;

/// The sending half of the Subspace notification channel(s).
#[derive(Clone)]
-pub(crate) struct SubspaceNotificationSender<T: Clone + Send + Sync + fmt::Debug + 'static> {
+pub struct SubspaceNotificationSender<T: Clone + Send + Sync + fmt::Debug + 'static> {
Member

Why are this and the notify method public?

Member Author

Is there a way to fire an event otherwise? I needed to fire an event from other crates, so I made this method public.

Member

If you need to fire an event from other places, you should probably use a different API and not this one. This one was designed to be private.

@@ -840,6 +840,10 @@ impl Archiver {
object_mapping,
}
}

pub fn kzg(&self) -> Kzg {
Member

Why? This is an internal property and shouldn't be exposed.

Member Author

To quickly obtain Kzg to reinitialize the archiver. I'm open to a better way to get this data.

Member

If you just need Kzg instance you can acquire it the same exact way as it is done during original archiver initialization. It is quite ugly to just expose random internals because you need them elsewhere. There should be a clean (as much as possible) dependency injection tree, etc.

Member Author

Correct me if I'm wrong, but Kzg is a wrapper around KZG settings and a cache with more settings protected by a mutex, judging by the implementation. I treat this structure as configuration. If this view is not correct and it is unique private information, then it should be kept encapsulated.

Member

I'm not sure what you wanted to say with this comment. The Kzg instance is the same everywhere. Yes, it does have a cache for efficiency reasons, but it is not unique or special in any way. We just clone Kzg and pass it along. There is no need to expose it from the archiver itself.

last_archived_segment,
&last_archived_block_encoded,
last_archived_block.2,
).expect("Invalid initial archival state should stop the application to prevent further losses.");
Member

rustfmt is not able to format code if you have strings that are too long. Split it into more than one line and it'll be able to reformat the code successfully. It looks a bit off right now.

@shamil-gadelshin (Member Author) left a comment

  1. The "force imported" blocks don't go through the archiver and won't affect the archiver on restart. Did you mean this first block or another one?
  2. I'm not sure I follow here. The archiver can only be reinitialized at the beginning of a segment (not at any point), and its ability to make this decision on its own will be used very rarely (it will crash on the attempt on every other occasion). Also, it can only auto-initialize on the fly when it has all the necessary data (last archived segment header, last encoded imported block, object mappings). Do you mean passing the archiver constructor a provider for this data, current as of the current block? The event passes the necessary information valid at the exact point.


@nazar-pc (Member) left a comment

> The "force imported" blocks don't go through the archiver and won't affect the archiver on restart. Did you mean this first block or another one?

The archiver on node restart will read some blocks to reconstruct its state. If there are just random blocks inserted, with blocks missing to the left and right, the archiver will be unable to restart. Specifically, after step 4/5 in subspace/protocol-specs#26 and before step 6, the node is in an effectively broken state from which the archiver will not be able to recover with the code as written, unless I'm missing something.

> I'm not sure I follow here. The archiver can only be reinitialized at the beginning of a segment (not at any point), and its ability to make this decision on its own will be used very rarely (it will crash on the attempt on every other occasion). Also, it can only auto-initialize on the fly when it has all the necessary data (last archived segment header, last encoded imported block, object mappings). Do you mean passing the archiver constructor a provider for this data, current as of the current block? The event passes the necessary information valid at the exact point.

The archiver can be, and is, reinitialized at an arbitrary point right now; it automatically finds the correct block and segment header to start from and moves on from there without issues. I can imagine it doing something like detecting a missing block that it can't continue from, waiting for the next block to be imported, and trying again from there until it succeeds, instead of exiting with an error and bringing down the whole node with it. We basically have new assumptions that the old design didn't support, and we need to fix that.

In the process of doing this, I think we need to remove or at least redesign SubspaceLink.segment_headers to not fill it from the archiver, but rather query the necessary info from SegmentHeadersStore, which is present and contains data even before the archiver is initialized. This might be important in case the archiver wasn't able to start successfully and you are trying to import blocks that expect inherents in them. This will also likely remove the need to run synchronous code in create_subspace_archiver before returning a future from it, which in turn will likely allow adding a loop {} in there and letting it restart as necessary in full, simplifying the reinitialization logic.


@shamil-gadelshin (Member Author) left a comment

> The archiver on node restart will read some blocks to reconstruct its state. If there are just random blocks inserted, with blocks missing to the left and right, the archiver will be unable to restart. Specifically, after step 4/5 in subspace/protocol-specs#26 and before step 6, the node is in an effectively broken state from which the archiver will not be able to recover with the code as written, unless I'm missing something.

I don't think I understand this statement. After steps 4-5, the archiver will remain untouched because the block will be "force imported", bypassing all the checks and all the Subspace code. During step 6 it will start to import blocks in normal mode and will be able to recover as before. I think the "force import" concept is the source of the misunderstanding here.

> The archiver can be, and is, reinitialized at an arbitrary point right now; it automatically finds the correct block and segment header to start from and moves on from there without issues.

The archiver can be reinitialized at an arbitrary point when it has access to the full correct history of blocks, which is not the case for fast sync. In the case of fast sync, it can do that at specific points (the beginning of a segment), with additional data provided for this "event".

> I can imagine it doing something like detecting a missing block that it can't continue from, waiting for the next block to be imported, and trying again from there until it succeeds, instead of exiting with an error and bringing down the whole node with it.

Is there an option to skip blocks imported by the archiver? I thought the archiver import should be infallible.

> We basically have new assumptions that the old design didn't support, and we need to fix that.

This is exactly true, because we have a new feature that the old design didn't expect. Did you mean assumptions that must remain untouched because of the outer context?

> instead of exiting with an error and bringing down the whole node with it.

That should be the case when the archiver receives incorrect data. Am I missing a scenario in which we can proceed without a working archiver?

I don't fully grasp your suggestions; however, at this moment I can imagine that we'll need some kind of oracle to get the necessary data (last archived segment header, last encoded imported block, object mappings) in the case of the proposed auto-detection of incorrect archiver state.

> In the process of doing this, I think we need to remove or at least redesign SubspaceLink.segment_headers to not fill it from the archiver, but rather query the necessary info from SegmentHeadersStore, which is present and contains data even before the archiver is initialized,

Did you mean traversing the segment headers and returning a block number via last_archived_block from the segment headers?


@nazar-pc (Member) left a comment

> I don't think I understand this statement. After steps 4-5, the archiver will remain untouched because the block will be "force imported", bypassing all the checks and all the Subspace code. During step 6 it will start to import blocks in normal mode and will be able to recover as before. I think the "force import" concept is the source of the misunderstanding here.

Now imagine the node restarts right after step 6. Do you see the issue now?

> The archiver can be reinitialized at an arbitrary point when it has access to the full correct history of blocks, which is not the case for fast sync. In the case of fast sync, it can do that at specific points (the beginning of a segment), with additional data provided for this "event".

Which is exactly the issue I described above, and the one I think we should tackle.

> Is there an option to skip blocks imported by the archiver? I thought the archiver import should be infallible.

Adding blocks to an already instantiated archiver is infallible, of course. I was talking about archiving erroring out as described above, and the need to recover from such situations. One way is to simply retry archiver initialization.

> This is exactly true, because we have a new feature that the old design didn't expect. Did you mean assumptions that must remain untouched because of the outer context?

Not sure what you mean by "must remain untouched". I'm advocating for touching even more code, to handle more cases and make the implementation cleaner at the same time. The first step there is to remove SubspaceLink.segment_headers.

> Did you mean traversing the segment headers and returning a block number via last_archived_block from the segment headers?

Hopefully not traversing too much, but yes.


@shamil-gadelshin (Member Author)

> The first step there is to remove SubspaceLink.segment_headers.

This seems to be out of scope for this PR. I will create a new PR and then rework this PR on top of it.

@shamil-gadelshin (Member Author)

Superseded by #2722

@nazar-pc nazar-pc deleted the modify-archiver branch September 26, 2024 05:09