
Support fast-sync algorithm. #16

Merged: 4 commits merged into subspace-v5 from subspace-fast-sync-new4 on May 9, 2024

Conversation

@shamil-gadelshin (Member)

This PR adds support for Subspace fast-sync.

The first two commits fix issues found in the current polkadot-sdk branch (subspace-v4).

The solution itself is split into several parts (with related commits):

  • new syncing engine
  • modification of the existing infrastructure for block import
  • adding a sync restart method for the original syncing engine

New syncing engine

The new syncing engine (FastSyncEngine) is a simplified version of the existing syncing engine that reuses the existing code for state sync. It exposes an additional SyncService, currently with a single API call that starts the sync process. Unlike the original, FastSyncEngine doesn't do peer management; peers must be provided from the outside (I exported the currently connected peers from the network for this reason). FastSyncEngine downloads the state for a given block from a given peer and imports it into the blockchain.
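
For illustration, the engine's surface looks roughly like this. This is only a hedged sketch: FastSyncEngine's real types and signatures differ, and PeerId, BlockHash and SyncService below are stand-ins rather than the actual Substrate/Subspace types.

#[derive(Clone, Copy, Debug)]
struct PeerId(u64);

#[derive(Clone, Copy, Debug)]
struct BlockHash([u8; 32]);

// Handle with a single call for now: start the one-shot sync process.
struct SyncService {
    target_block: BlockHash,
    // Peers are supplied from the outside; the engine does no peer management.
    peers: Vec<PeerId>,
}

impl SyncService {
    // Download the state for `target_block` from one of the provided peers
    // and import it into the blockchain (details elided in this sketch).
    fn start(&self) -> Result<(), String> {
        let peer = self.peers.first().ok_or("no connected peers provided")?;
        println!("requesting state for {:?} from {:?}", self.target_block, peer);
        Ok(())
    }
}

fn main() {
    let service = SyncService {
        target_block: BlockHash([0u8; 32]),
        peers: vec![PeerId(1), PeerId(2)],
    };
    service.start().expect("fast sync failed");
}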

Modification of the existing infrastructure for block import

Several new features (a rough sketch of all three APIs follows the list):

  • We needed to enable importing blocks into the blockchain while bypassing the usual checks (the import_raw_block API). It uses the native way to import a block (lock_import_and_run and apply_block), writes zero weight for the block, and requires the current sync mode to be set to Light.

  • A new clear_block_gap API was added to clear the block gap that is set during the block import operation. We need it to disable the sync of previous blocks that would otherwise happen.

  • A new open_peers API returns the currently connected peers to enable fast-sync requests. The name was derived from the method of the network crate that provides the actual data.
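
A rough, self-contained sketch of these three APIs, shown as a hypothetical trait with a dummy implementation. None of this is the actual Substrate client code; the names and the behavior notes in the comments just mirror the PR description.

struct Block;                // stand-in for the runtime block type
#[derive(Debug, Clone, Copy)]
struct PeerId(u64);          // stand-in for the libp2p peer id

trait FastSyncClientExt {
    // Import a block while bypassing the usual checks; per the description
    // this is built on lock_import_and_run / apply_block, writes zero weight
    // and requires the sync mode to be Light.
    fn import_raw_block(&self, block: Block) -> Result<(), String>;

    // Clear the block gap recorded during block import so the node does not
    // try to sync the skipped history afterwards.
    fn clear_block_gap(&self);

    // Currently connected peers, used as targets for fast-sync requests.
    fn open_peers(&self) -> Vec<PeerId>;
}

struct DummyClient;

impl FastSyncClientExt for DummyClient {
    fn import_raw_block(&self, _block: Block) -> Result<(), String> {
        println!("block imported without checks");
        Ok(())
    }
    fn clear_block_gap(&self) {
        println!("block gap cleared");
    }
    fn open_peers(&self) -> Vec<PeerId> {
        vec![PeerId(1), PeerId(2)]
    }
}

fn main() {
    let client = DummyClient;
    client.import_raw_block(Block).expect("raw block import failed");
    client.clear_block_gap();
    println!("peers for fast sync: {:?}", client.open_peers());
}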

Sync restart

After the initial fast sync is done, we need to switch to the Full sync mode to enable importing full blocks from peers. We also update the common block number for connected peers to reduce the number of incorrectly requested blocks. This feature is technically optional because chain sync will do both automatically, but it produces weird error messages along the way. Both changes are UX improvements.
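
A minimal sketch of this restart step. SyncMode, SyncRestartArgs and restart() mirror the names used in this PR, but the types below are illustrative stand-ins, not the real syncing-service API.

#[derive(Debug, PartialEq)]
enum SyncMode { Light, Full }

struct SyncRestartArgs {
    new_sync_mode: SyncMode,
    // Best block reached by fast sync, advertised as the new common number
    // so peers are not asked for blocks we already have.
    new_best_block: Option<u32>,
}

struct SyncingService {
    mode: SyncMode,
    common_number: u32,
}

impl SyncingService {
    fn restart(&mut self, args: SyncRestartArgs) {
        self.mode = args.new_sync_mode;
        if let Some(number) = args.new_best_block {
            self.common_number = number;
        }
    }
}

fn main() {
    // Fast sync ran in Light mode; switch to Full to import full blocks.
    let mut service = SyncingService { mode: SyncMode::Light, common_number: 0 };
    service.restart(SyncRestartArgs {
        new_sync_mode: SyncMode::Full,
        new_best_block: Some(1_000_000),
    });
    assert_eq!(service.mode, SyncMode::Full);
    println!("sync restarted at common number {}", service.common_number);
}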

@nazar-pc (Member)

Each version of the PR gets better, but it still mostly describes what is done and not why. Left some comments below.

Did you try to upstream any of this yet? This is the third PR with fast sync opened in this repo.

BTW, the first two commits are non-controversial and would be better sent as a separate PR that we can merge quickly.


> New syncing engine

Why do we need a new syncing engine? It is not like we need to "sync" anything, we just need to download state for a known block so we can import the block "directly".

> We needed to enable importing blocks into the blockchain while bypassing the usual checks (the import_raw_block API). It uses the native way to import a block (lock_import_and_run and apply_block), writes zero weight for the block, and requires the current sync mode to be set to Light.

lock_import_and_run is definitely public and available to us externally, I suspect .apply_block is as well. Why do we need import_raw_block in Substrate then instead of just in Subspace?

> A new clear_block_gap API was added to clear the block gap that is set during the block import operation. We need it to disable the sync of previous blocks that would otherwise happen.

Why does it happen though? If we have a pruned node and restart it then gap sync doesn't happen, so what is different when we just insert a block bypassing checks?

> A new open_peers API returns the currently connected peers to enable fast-sync requests. The name was derived from the method of the network crate that provides the actual data.

It's not clear how this enables fast-sync requests, or rather how other sync strategies ever worked without such an API.

@shamil-gadelshin (Member, author) commented Apr 30, 2024

> Did you try to upstream any of this yet? This is the third PR with fast sync opened in this repo.

As discussed offline, I can try to upstream it after this review, because the changes (requirements) for the upstream are not finalized yet.

> BTW, the first two commits are non-controversial and would be better sent as a separate PR that we can merge quickly.

Please confirm that you want it as a separate PR - I doubt that a subsequent rebase would succeed. Otherwise, I'd need to recreate this PR, which could be confusing.

> Why do we need a new syncing engine? It is not like we need to "sync" anything, we just need to download state for a known block so we can import the block "directly".

It is called a syncing engine by analogy with the original one. Even though we use only a subset of the functionality, our syncing engine provides a state machine driving StateSync - it issues state requests, handles responses, accumulates the state, and handles errors. It uses several private APIs; otherwise it could be moved to Subspace code.
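
For illustration, the driving loop is roughly the following. This is a hedged sketch only: StateSync, ImportResult and the other names below are illustrative stand-ins, not the actual Substrate StateSync API.

enum ImportResult {
    Continue,          // more state chunks are needed
    Done(Vec<u8>),     // full state accumulated, ready to import
    BadResponse,       // response failed validation
}

struct StateSync {
    accumulated: Vec<u8>,
    target_size: usize,
}

impl StateSync {
    // Build the next state request based on how much state is accumulated.
    fn next_request(&self) -> String {
        format!("state request starting at offset {}", self.accumulated.len())
    }

    // Validate and accumulate a response, reporting whether we are done.
    fn import(&mut self, response: Vec<u8>) -> ImportResult {
        if response.is_empty() {
            return ImportResult::BadResponse;
        }
        self.accumulated.extend(response);
        if self.accumulated.len() >= self.target_size {
            ImportResult::Done(std::mem::take(&mut self.accumulated))
        } else {
            ImportResult::Continue
        }
    }
}

fn main() {
    let mut sync = StateSync { accumulated: Vec::new(), target_size: 8 };
    for response in [vec![1u8, 2, 3, 4], vec![5, 6, 7, 8]] {
        println!("{}", sync.next_request());
        match sync.import(response) {
            ImportResult::Continue => continue,
            ImportResult::Done(state) => {
                println!("downloaded {} bytes of state, importing block", state.len());
                break;
            }
            // A real driver would retry the request with another peer here.
            ImportResult::BadResponse => println!("bad response, would retry"),
        }
    }
}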

> lock_import_and_run is definitely public and available to us externally, I suspect .apply_block is as well. Why do we need import_raw_block in Substrate then instead of just in Subspace?

apply_block is private. However, if we make it public, it seems we would be able to move the code to Subspace. Please let me know if you prefer this approach.

> Why does it happen though? If we have a pruned node and restart it then gap sync doesn't happen, so what is different when we just insert a block bypassing checks?

It seems we hit this condition: https://github.com/subspace/polkadot-sdk/blob/61be78c621ab2fa390cd3bfc79c8307431d0ea90/substrate/client/db/src/lib.rs#L1688

> It's not clear how this enables fast-sync requests, or rather how other sync strategies ever worked without such an API.

Requests are initiated using just a PeerId, and the networking layer picks the correct current connection. The original syncing engine maintains the currently connected peer set for multiple reasons (using networking-layer notifications). It makes sense for it to do so because the engine handles block announcements as well as recurrent syncing. Our use case, however, requires a single peer (several peers actually, to handle possible errors) and performs the sync once - this is why I provide the currently connected peer list on the fast-sync attempt. Initially, I wanted to add peer management similar to the original engine as a further improvement - I'm not sure we need it though.
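
For illustration, the one-shot flow over the provided peer list looks roughly like this. It is a hedged sketch with stand-in types (PeerId, request_state, fast_sync_once are illustrative), not the real networking API.

#[derive(Clone, Copy, Debug)]
struct PeerId(u64);

// Stand-in for a networking-layer request addressed by PeerId only;
// the real layer resolves the current connection itself.
fn request_state(peer: PeerId) -> Result<Vec<u8>, String> {
    if peer.0 % 2 == 0 { Ok(vec![0u8; 16]) } else { Err("request failed".into()) }
}

// Try the snapshot of connected peers in order until one serves the state;
// no ongoing peer-set management is needed for a one-shot sync.
fn fast_sync_once(connected_peers: &[PeerId]) -> Result<Vec<u8>, String> {
    for &peer in connected_peers {
        match request_state(peer) {
            Ok(state) => return Ok(state),
            Err(err) => eprintln!("peer {:?} failed: {err}, trying next", peer),
        }
    }
    Err("all peers failed".into())
}

fn main() {
    let peers = [PeerId(1), PeerId(2), PeerId(3)];
    let state = fast_sync_once(&peers).expect("fast sync failed");
    println!("got {} bytes of state", state.len());
}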

@nazar-pc (Member)

> Please confirm that you want it as a separate PR - I doubt that a subsequent rebase would succeed. Otherwise, I'd need to recreate this PR, which could be confusing.

Yes. And I'm not sure why a rebase would not succeed if the parent contains the exact same commits.

> It is called a syncing engine by analogy with the original one. Even though we use only a subset of the functionality, our syncing engine provides a state machine driving StateSync - it issues state requests, handles responses, accumulates the state, and handles errors. It uses several private APIs; otherwise it could be moved to Subspace code.

It doesn't sound appropriate to use a sync engine to download a single block's worth of state, though I might be wrong. I would have expected a long and productive discussion upstream long before this PR was opened.

> apply_block is private. However, if we make it public, it seems we would be able to move the code to Subspace. Please let me know if you prefer this approach.

Again, I would have expected such a PR upstream a long time ago, and probably already in the latest polkadot-sdk release/our fork by now.

> It seems we hit this condition:

Something to discuss upstream then.

> Requests are initiated using just a PeerId, and the networking layer picks the correct current connection. The original syncing engine maintains the currently connected peer set for multiple reasons (using networking-layer notifications). It makes sense for it to do so because the engine handles block announcements as well as recurrent syncing. Our use case, however, requires a single peer (several peers actually, to handle possible errors) and performs the sync once - this is why I provide the currently connected peer list on the fast-sync attempt. Initially, I wanted to add peer management similar to the original engine as a further improvement - I'm not sure we need it though.

Again, something to discuss with upstream folks. But the current implementation seems very invasive, and it is not clear if it is actually necessary/the best way forward.

@shamil-gadelshin (Member, author)

> But the current implementation seems very invasive...

It is invasive indeed. I will try to move the vast majority of the new code into our repository.

@shamil-gadelshin (Member, author)

Relates to #17

@shamil-gadelshin (Member, author)

The last update removes the "raw block import" and fast-sync code from polkadot-sdk. It also makes several entities public to support moving fast sync to Subspace code, and exports the apply_block method so that "raw block import" can be moved out as well.

@nazar-pc (Member) commented May 1, 2024

Could you squash the corresponding commits? Looks like a much less involved set of changes, thanks!

Resolved (outdated) review threads: substrate/client/service/src/client/client.rs, substrate/client/network/src/service.rs, substrate/client/network/src/service.rs

Self: ProvideRuntimeApi<Block>,
<Self as ProvideRuntimeApi<Block>>::Api: CoreApi<Block> + ApiExt<Block>,
{
self.apply_block(operation, import_block, storage_changes)

Member:

I'd probably move the implementation here instead (though I understand that you didn't, in order to decrease the diff).

Member Author:

Yes, the next code update should be easier this way in case apply_block changes.

}

fn clear_block_gap(&self) {
self.backend.blockchain().clear_block_gap();

Member:

Can't we call it directly somehow?

Member Author:

I didn't find an easier way; otherwise, I wouldn't have introduced a stub in the in_mem implementation. Happy to change it if there actually is a way.

@@ -91,6 +101,14 @@ impl<B: BlockT> SyncingService<B> {
rx.await
}

/// Restart the synchronization with new arguments.
pub async fn restart(&self, sync_restart_args: SyncRestartArgs<B>) {

Member:

If there is no fast sync anymore, do we still need to restart sync? If so then why?

Member Author:

We'll start the fast sync using the SyncMode::Light configuration to allow block insertion without the sequence check. Later we need to switch to SyncMode::Full to download full blocks during regular sync. As I mentioned in the description, Substrate sync can switch to full mode automatically, but it will generate weird messages. It is a UX improvement.

Member:

I guess I don't understand why SyncMode::Light is necessary in the first place. We don't want to sync, we just want to insert one block with its state into the database.

Member Author:

The exact error is InvalidBlockNumber within the NonCanonicalOverlay struct (the insert method). It seems to trigger when the LAST_CANONICAL db meta entry is set, which in turn seems to be set when we invoke try_commit_operation (canonicalize_block() when operation.commit_state is true). Setting LightState initially sets commit_state to false when we invoke set_genesis_state, given:

/// Returns true if the genesis state writing will be skipped while initializing the genesis
/// block.
pub fn no_genesis(&self) -> bool {
   matches!(self.network.sync_mode, SyncMode::LightState { .. } | SyncMode::Warp { .. })
}

This way we disable the overlay-related checks when we don't initially commit state.
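
For illustration, the chain of flags roughly works like this. It is a hedged sketch only (the enum and functions below are stand-ins, not the real Substrate types): with a LightState/Warp sync mode the genesis state is not committed, so the canonicalization path that would later raise InvalidBlockNumber in NonCanonicalOverlay is never reached.

#[derive(Clone, Copy)]
enum SyncMode { Full, LightState, Warp }

fn no_genesis(mode: SyncMode) -> bool {
    matches!(mode, SyncMode::LightState | SyncMode::Warp)
}

fn try_commit_operation(commit_state: bool) {
    if commit_state {
        // canonicalize_block() would run here and update LAST_CANONICAL.
        println!("canonicalizing block (LAST_CANONICAL updated)");
    } else {
        // Skipping the state commit is what avoids the overlay check when a
        // block is later inserted out of sequence.
        println!("state commit skipped");
    }
}

fn main() {
    let mode = SyncMode::LightState;
    let commit_state = !no_genesis(mode);
    try_commit_operation(commit_state);
}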

Member:

And how do we achieve the same without light state sync?

Member Author:

It's likely possible by hijacking the lock_import_and_run method and parameterizing the commit_state variable, but that would be very invasive within the current Substrate architecture. Switching the sync mode before passing control is much less "foreign" than the proposed change, if I understand you correctly.

Member:

Is that what upstream suggested?

Member Author:

No. I did this call-stack research based on your request on Thursday.

Member:

To be completely honest and transparent, I'm getting annoyed and frustrated that it takes forever to convince you to talk to upstream 😕

Member Author:

I need to clarify my priorities then. @dariolina told me that we'll likely have time for this.

@shamil-gadelshin merged commit 8082697 into subspace-v5 on May 9, 2024 (8 of 12 checks passed).
@nazar-pc deleted the subspace-fast-sync-new4 branch on May 9, 2024 at 09:22.