
Support fast-sync algorithm. #16

Merged: 4 commits merged into subspace-v5 from subspace-fast-sync-new4 on May 9, 2024

Conversation

@shamil-gadelshin (Member)

This PR adds support for Subspace fast-sync.

The first two commits fix issues found in the current polkadot-sdk branch (subspace-v4).

The solution itself is split into several parts (with related commits):

  • new syncing engine
  • modification of the existing infrastructure for block import
  • adding a sync restart method for the original syncing engine

New syncing engine

The new syncing engine (FastSyncEngine) is a simplified version of the existing syncing engine that reuses the existing code for state sync. It exposes an additional SyncService, currently with a single API call that starts the sync process. Unlike the original, FastSyncEngine doesn't do peer management; peers must be provided from the outside (I exported the currently connected peers from the network for this reason). FastSyncEngine downloads the state for a given block from a given peer and imports it into the blockchain.
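
For illustration, the engine's surface looks roughly like this. This is only a hedged sketch: FastSyncEngine's real types and signatures differ, and PeerId, BlockHash and SyncService below are stand-ins rather than the actual Substrate/Subspace types.

#[derive(Clone, Copy, Debug)]
struct PeerId(u64);

#[derive(Clone, Copy, Debug)]
struct BlockHash([u8; 32]);

// Handle with a single call for now: start the one-shot sync process.
struct SyncService {
    target_block: BlockHash,
    // Peers are supplied from the outside; the engine does no peer management.
    peers: Vec<PeerId>,
}

impl SyncService {
    // Download the state for `target_block` from one of the provided peers
    // and import it into the blockchain (details elided in this sketch).
    fn start(&self) -> Result<(), String> {
        let peer = self.peers.first().ok_or("no connected peers provided")?;
        println!("requesting state for {:?} from {:?}", self.target_block, peer);
        Ok(())
    }
}

fn main() {
    let service = SyncService {
        target_block: BlockHash([0u8; 32]),
        peers: vec![PeerId(1), PeerId(2)],
    };
    service.start().expect("fast sync failed");
}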

Modification of the existing infrastructure for block import

Several new features (a rough sketch of all three APIs follows the list):

  • We needed to enable importing blocks into the blockchain while bypassing the usual checks (the import_raw_block API). It uses the native way to import a block (lock_import_and_run and apply_block), writes zero weight for the block, and requires the current sync mode to be set to Light.

  • A new clear_block_gap API was added to clear the block gap that is set during the block import operation. We need it to disable the sync of previous blocks that would otherwise happen.

  • A new open_peers API returns the currently connected peers to enable fast-sync requests. The name was derived from the method of the network crate that provides the actual data.
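
A rough, self-contained sketch of these three APIs, shown as a hypothetical trait with a dummy implementation. None of this is the actual Substrate client code; the names and the behavior notes in the comments just mirror the PR description.

struct Block;                // stand-in for the runtime block type
#[derive(Debug, Clone, Copy)]
struct PeerId(u64);          // stand-in for the libp2p peer id

trait FastSyncClientExt {
    // Import a block while bypassing the usual checks; per the description
    // this is built on lock_import_and_run / apply_block, writes zero weight
    // and requires the sync mode to be Light.
    fn import_raw_block(&self, block: Block) -> Result<(), String>;

    // Clear the block gap recorded during block import so the node does not
    // try to sync the skipped history afterwards.
    fn clear_block_gap(&self);

    // Currently connected peers, used as targets for fast-sync requests.
    fn open_peers(&self) -> Vec<PeerId>;
}

struct DummyClient;

impl FastSyncClientExt for DummyClient {
    fn import_raw_block(&self, _block: Block) -> Result<(), String> {
        println!("block imported without checks");
        Ok(())
    }
    fn clear_block_gap(&self) {
        println!("block gap cleared");
    }
    fn open_peers(&self) -> Vec<PeerId> {
        vec![PeerId(1), PeerId(2)]
    }
}

fn main() {
    let client = DummyClient;
    client.import_raw_block(Block).expect("raw block import failed");
    client.clear_block_gap();
    println!("peers for fast sync: {:?}", client.open_peers());
}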

Sync restart

After the initial fast sync is done, we need to switch to the Full sync mode to enable importing full blocks from peers. We also update the common block number for connected peers to reduce the number of incorrectly requested blocks. This feature is technically optional because chain sync will do both automatically, but it produces weird error messages along the way. Both changes are UX improvements.
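
A minimal sketch of this restart step. SyncMode, SyncRestartArgs and restart() mirror the names used in this PR, but the types below are illustrative stand-ins, not the real syncing-service API.

#[derive(Debug, PartialEq)]
enum SyncMode { Light, Full }

struct SyncRestartArgs {
    new_sync_mode: SyncMode,
    // Best block reached by fast sync, advertised as the new common number
    // so peers are not asked for blocks we already have.
    new_best_block: Option<u32>,
}

struct SyncingService {
    mode: SyncMode,
    common_number: u32,
}

impl SyncingService {
    fn restart(&mut self, args: SyncRestartArgs) {
        self.mode = args.new_sync_mode;
        if let Some(number) = args.new_best_block {
            self.common_number = number;
        }
    }
}

fn main() {
    // Fast sync ran in Light mode; switch to Full to import full blocks.
    let mut service = SyncingService { mode: SyncMode::Light, common_number: 0 };
    service.restart(SyncRestartArgs {
        new_sync_mode: SyncMode::Full,
        new_best_block: Some(1_000_000),
    });
    assert_eq!(service.mode, SyncMode::Full);
    println!("sync restarted at common number {}", service.common_number);
}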

@nazar-pc (Member)

Each version of the PR gets better, but it still mostly describes what is done and not why. Left some comments below.

Did you try to upstream any of this yet? This is the third PR with fast sync opened in this repo.

BTW, the first two commits are non-controversial and would be better sent as a separate PR that we can merge quickly.


> New syncing engine

Why do we need a new syncing engine? It is not like we need to "sync" anything, we just need to download state for a known block so we can import the block "directly".

> We needed to enable importing blocks into the blockchain while bypassing the usual checks (the import_raw_block API). It uses the native way to import a block (lock_import_and_run and apply_block), writes zero weight for the block, and requires the current sync mode to be set to Light.

lock_import_and_run is definitely public and available to us externally, I suspect .apply_block is as well. Why do we need import_raw_block in Substrate then instead of just in Subspace?

> A new clear_block_gap API was added to clear the block gap that is set during the block import operation. We need it to disable the sync of previous blocks that would otherwise happen.

Why does it happen though? If we have a pruned node and restart it then gap sync doesn't happen, so what is different when we just insert a block bypassing checks?

> A new open_peers API returns the currently connected peers to enable fast-sync requests. The name was derived from the method of the network crate that provides the actual data.

It's not clear how this enables fast-sync requests, or rather how other sync strategies ever worked without such an API.

@shamil-gadelshin (Member, author) commented Apr 30, 2024

> Did you try to upstream any of this yet? This is the third PR with fast sync opened in this repo.

As discussed offline, I can try to upstream it after this review, because the changes (requirements) for the upstream are not finalized yet.

> BTW, the first two commits are non-controversial and would be better sent as a separate PR that we can merge quickly.

Please confirm that you want it as a separate PR - I doubt that a subsequent rebase would succeed. Otherwise, I'd need to recreate this PR, which could be confusing.

> Why do we need a new syncing engine? It is not like we need to "sync" anything, we just need to download state for a known block so we can import the block "directly".

It is called a syncing engine by analogy with the original one. Even though we use only a subset of the functionality, our syncing engine provides a state machine driving StateSync - it issues state requests, handles responses, accumulates the state, and handles errors. It uses several private APIs; otherwise it could be moved to Subspace code.
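
For illustration, the driving loop is roughly the following. This is a hedged sketch only: StateSync, ImportResult and the other names below are illustrative stand-ins, not the actual Substrate StateSync API.

enum ImportResult {
    Continue,          // more state chunks are needed
    Done(Vec<u8>),     // full state accumulated, ready to import
    BadResponse,       // response failed validation
}

struct StateSync {
    accumulated: Vec<u8>,
    target_size: usize,
}

impl StateSync {
    // Build the next state request based on how much state is accumulated.
    fn next_request(&self) -> String {
        format!("state request starting at offset {}", self.accumulated.len())
    }

    // Validate and accumulate a response, reporting whether we are done.
    fn import(&mut self, response: Vec<u8>) -> ImportResult {
        if response.is_empty() {
            return ImportResult::BadResponse;
        }
        self.accumulated.extend(response);
        if self.accumulated.len() >= self.target_size {
            ImportResult::Done(std::mem::take(&mut self.accumulated))
        } else {
            ImportResult::Continue
        }
    }
}

fn main() {
    let mut sync = StateSync { accumulated: Vec::new(), target_size: 8 };
    for response in [vec![1u8, 2, 3, 4], vec![5, 6, 7, 8]] {
        println!("{}", sync.next_request());
        match sync.import(response) {
            ImportResult::Continue => continue,
            ImportResult::Done(state) => {
                println!("downloaded {} bytes of state, importing block", state.len());
                break;
            }
            // A real driver would retry the request with another peer here.
            ImportResult::BadResponse => println!("bad response, would retry"),
        }
    }
}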

> lock_import_and_run is definitely public and available to us externally, I suspect .apply_block is as well. Why do we need import_raw_block in Substrate then instead of just in Subspace?

apply_block is private. However, if we make it public, it seems we would be able to move the code to Subspace. Please let me know if you prefer this approach.

> Why does it happen though? If we have a pruned node and restart it then gap sync doesn't happen, so what is different when we just insert a block bypassing checks?

It seems we hit this condition: https://github.com/subspace/polkadot-sdk/blob/61be78c621ab2fa390cd3bfc79c8307431d0ea90/substrate/client/db/src/lib.rs#L1688

> It's not clear how this enables fast-sync requests, or rather how other sync strategies ever worked without such an API.

Requests are initiated using just a PeerId, and the networking layer picks the correct current connection. The original syncing engine maintains the currently connected peer set for multiple reasons (using networking-layer notifications). It makes sense for it to do so because the engine handles block announcements as well as recurrent syncing. Our use case, however, requires a single peer (several peers actually, to handle possible errors) and performs the sync once - this is why I provide the currently connected peer list on the fast-sync attempt. Initially, I wanted to add peer management similar to the original engine as a further improvement - I'm not sure we need it though.
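
For illustration, the one-shot flow over the provided peer list looks roughly like this. It is a hedged sketch with stand-in types (PeerId, request_state, fast_sync_once are illustrative), not the real networking API.

#[derive(Clone, Copy, Debug)]
struct PeerId(u64);

// Stand-in for a networking-layer request addressed by PeerId only;
// the real layer resolves the current connection itself.
fn request_state(peer: PeerId) -> Result<Vec<u8>, String> {
    if peer.0 % 2 == 0 { Ok(vec![0u8; 16]) } else { Err("request failed".into()) }
}

// Try the snapshot of connected peers in order until one serves the state;
// no ongoing peer-set management is needed for a one-shot sync.
fn fast_sync_once(connected_peers: &[PeerId]) -> Result<Vec<u8>, String> {
    for &peer in connected_peers {
        match request_state(peer) {
            Ok(state) => return Ok(state),
            Err(err) => eprintln!("peer {:?} failed: {err}, trying next", peer),
        }
    }
    Err("all peers failed".into())
}

fn main() {
    let peers = [PeerId(1), PeerId(2), PeerId(3)];
    let state = fast_sync_once(&peers).expect("fast sync failed");
    println!("got {} bytes of state", state.len());
}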

@nazar-pc (Member)

> Please confirm that you want it as a separate PR - I doubt that a subsequent rebase would succeed. Otherwise, I'd need to recreate this PR, which could be confusing.

Yes. And I'm not sure why a rebase would not succeed if the parent contains the exact same commits.

> It is called a syncing engine by analogy with the original one. Even though we use only a subset of the functionality, our syncing engine provides a state machine driving StateSync - it issues state requests, handles responses, accumulates the state, and handles errors. It uses several private APIs; otherwise it could be moved to Subspace code.

It doesn't sound appropriate to use a sync engine to download a single block's worth of state, though I might be wrong. I would have expected a long and productive discussion upstream long before this PR was opened.

> apply_block is private. However, if we make it public, it seems we would be able to move the code to Subspace. Please let me know if you prefer this approach.

Again, I would have expected such a PR upstream a long time ago, and probably already in the latest polkadot-sdk release/our fork by now.

> It seems we hit this condition:

Something to discuss upstream then.

> Requests are initiated using just a PeerId, and the networking layer picks the correct current connection. The original syncing engine maintains the currently connected peer set for multiple reasons (using networking-layer notifications). It makes sense for it to do so because the engine handles block announcements as well as recurrent syncing. Our use case, however, requires a single peer (several peers actually, to handle possible errors) and performs the sync once - this is why I provide the currently connected peer list on the fast-sync attempt. Initially, I wanted to add peer management similar to the original engine as a further improvement - I'm not sure we need it though.

Again, something to discuss with upstream folks. But the current implementation seems very invasive, and it is not clear if it is actually necessary/the best way forward.

@shamil-gadelshin (Member, author)

> But the current implementation seems very invasive...

It is invasive indeed. I will try to move the vast majority of the new code into our repository.

@shamil-gadelshin (Member, author)

Relates to #17

@shamil-gadelshin (Member, author)

The last update removes the "raw block import" and fast-sync code from polkadot-sdk. It also makes several entities public to support moving fast sync to Subspace code, and exports the apply_block method so that "raw block import" can be moved out as well.

@nazar-pc (Member) commented May 1, 2024

Could you squash the corresponding commits? Looks like a much less involved set of changes, thanks!

Resolved (outdated) review threads: substrate/client/service/src/client/client.rs, substrate/client/network/src/service.rs, substrate/client/network/src/service.rs

Self: ProvideRuntimeApi<Block>,
<Self as ProvideRuntimeApi<Block>>::Api: CoreApi<Block> + ApiExt<Block>,
{
self.apply_block(operation, import_block, storage_changes)

Member:

I'd probably move the implementation here instead (though I understand that you didn't, in order to decrease the diff).

Member Author:

Yes, the next code update should be easier this way in case apply_block changes.

}

fn clear_block_gap(&self) {
self.backend.blockchain().clear_block_gap();

Member:

Can't we call it directly somehow?

Member Author:

I didn't find an easier way; otherwise, I wouldn't have introduced a stub in the in_mem implementation. Happy to change it if there actually is a way.

@@ -91,6 +101,14 @@ impl<B: BlockT> SyncingService<B> {
rx.await
}

/// Restart the synchronization with new arguments.
pub async fn restart(&self, sync_restart_args: SyncRestartArgs<B>) {

Member:

If there is no fast sync anymore, do we still need to restart sync? If so then why?

Member Author:

We'll start the fast sync using the SyncMode::Light configuration to allow block insertion without the sequence check. Later we need to switch to SyncMode::Full to download full blocks during regular sync. As I mentioned in the description, Substrate sync can switch to full mode automatically, but it will generate weird messages. It is a UX improvement.

Member:

I guess I don't understand why SyncMode::Light is necessary in the first place. We don't want to sync, we just want to insert one block with its state into the database.

Member Author:

The exact error is InvalidBlockNumber within the NonCanonicalOverlay struct (the insert method). It seems to trigger when the LAST_CANONICAL db meta entry is set, which in turn seems to be set when we invoke try_commit_operation (canonicalize_block() when operation.commit_state is true). Setting LightState initially sets commit_state to false when we invoke set_genesis_state, given:

/// Returns true if the genesis state writing will be skipped while initializing the genesis
/// block.
pub fn no_genesis(&self) -> bool {
   matches!(self.network.sync_mode, SyncMode::LightState { .. } | SyncMode::Warp { .. })
}

This way we disable the overlay-related checks when we don't initially commit state.
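
For illustration, the chain of flags roughly works like this. It is a hedged sketch only (the enum and functions below are stand-ins, not the real Substrate types): with a LightState/Warp sync mode the genesis state is not committed, so the canonicalization path that would later raise InvalidBlockNumber in NonCanonicalOverlay is never reached.

#[derive(Clone, Copy)]
enum SyncMode { Full, LightState, Warp }

fn no_genesis(mode: SyncMode) -> bool {
    matches!(mode, SyncMode::LightState | SyncMode::Warp)
}

fn try_commit_operation(commit_state: bool) {
    if commit_state {
        // canonicalize_block() would run here and update LAST_CANONICAL.
        println!("canonicalizing block (LAST_CANONICAL updated)");
    } else {
        // Skipping the state commit is what avoids the overlay check when a
        // block is later inserted out of sequence.
        println!("state commit skipped");
    }
}

fn main() {
    let mode = SyncMode::LightState;
    let commit_state = !no_genesis(mode);
    try_commit_operation(commit_state);
}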

Member:

And how do we achieve the same without light state sync?

Member Author:

It's likely possible by hijacking the lock_import_and_run method and parameterizing the commit_state variable, but that would be very invasive within the current Substrate architecture. Switching the sync mode before passing control is much less "foreign" than the proposed change, if I understand you correctly.

Member:

Is that what upstream suggested?

Member Author:

No. I did this call-stack research based on your request on Thursday.

Member:

To be completely honest and transparent, I'm getting annoyed and frustrated that it takes forever to convince you to talk to upstream 😕

Member Author:

I need to clarify my priorities then. @dariolina told me that we'll likely have time for this.

@shamil-gadelshin merged commit 8082697 into subspace-v5 on May 9, 2024 (8 of 12 checks passed).
@nazar-pc deleted the subspace-fast-sync-new4 branch on May 9, 2024 at 09:22.