Support fast-sync algorithm. #16
Conversation
Each version of the PR gets better, but it still mostly describes what is done and not why. Left some comments below. Did you try to upstream any of this yet? This is the third PR with fast sync opened in this repo. BTW, the first two commits are non-controversial and would better be sent as a separate PR that we can merge quickly.
Why do we need a new syncing engine? It is not like we need to "sync" anything, we just need to download state for a known block so we can import the block "directly".
Why does it happen though? If we have a pruned node and restart it then gap sync doesn't happen, so what is different when we just insert a block bypassing checks?
Not clear how it enables fast-sync requests, or rather how other sync strategies ever worked without such an API.
As discussed offline, I can try to upstream it after this review, because the changes (requirements) for the upstream version are not finalized yet.
Please confirm that you want to have it as a separate PR - I doubt that a subsequent rebase will succeed. Otherwise, I'd need to recreate this PR, which could be confusing.
It is called
It seems we hit this condition: https://github.com/subspace/polkadot-sdk/blob/61be78c621ab2fa390cd3bfc79c8307431d0ea90/substrate/client/db/src/lib.rs#L1688
Requests are initiated using just a PeerId, and the networking layer picks the correct current connection. The original syncing engine manages the currently connected peer set for multiple reasons (using the networking layer notifications). It makes sense to maintain that there because the engine handles block announcements as well as recurrent syncing. Our use case, however, requires a single peer (several peers, actually, to handle possible errors) and performs the sync once - this is why I provided the currently connected peer list on the fast-sync attempt. Initially, I wanted to add peer management similar to the original engine as a further improvement - I'm not sure we need it though.
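To illustrate the one-shot approach described above, here is a minimal sketch (hypothetical names, not the real Substrate API): the state request is tried against each of the externally supplied connected peers in turn, falling back to the next peer on error, with no ongoing peer-set management.

```rust
// Hypothetical sketch: one-shot state download over a supplied peer list.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct PeerId(u64);

#[derive(Debug, PartialEq)]
enum SyncError {
    NoPeersLeft,
}

fn download_state_once(
    peers: &[PeerId],
    mut request: impl FnMut(PeerId) -> Result<Vec<u8>, ()>,
) -> Result<Vec<u8>, SyncError> {
    for &peer in peers {
        if let Ok(state) = request(peer) {
            return Ok(state);
        }
        // Request failed: fall back to the next connected peer.
    }
    Err(SyncError::NoPeersLeft)
}

fn main() {
    let peers = [PeerId(1), PeerId(2)];
    // The first peer errors out; the second one serves the state.
    let result = download_state_once(&peers, |peer| {
        if peer == PeerId(1) { Err(()) } else { Ok(vec![42]) }
    });
    assert_eq!(result, Ok(vec![42]));
    assert_eq!(download_state_once(&[], |_| Ok(vec![])), Err(SyncError::NoPeersLeft));
}
```

The extra peers act only as error fallbacks, which is why the sketch needs no notification-driven peer management.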
Yes. And I'm not sure why a rebase would not succeed if the parent contains the same exact commits.
It doesn't sound appropriate to use the sync engine to download a single block's worth of state, though I might be wrong. I would have expected a long and productive discussion upstream long before this PR was opened.
Again, I would have expected such a PR upstream a long time ago, and it would probably already be in the latest polkadot-sdk release/our fork by now.
Something to discuss upstream then.
Again, something to discuss with upstream folks. But the current implementation seems very invasive, and it is not clear whether it is actually necessary/the best way forward.
It is invasive indeed. I will try to move the absolute majority of the new code into our repository.
Relates to #17
force-pushed from 4c5d2f5 to e85d5c1
The last update removes the "raw block import" and fast-sync code from polkadot-sdk. It also makes several entities public to support moving fast-sync to the Subspace code, as well as exports.
Could you squash the corresponding commits? Looks like a much less involved set of changes, thanks!
force-pushed from e85d5c1 to 24a8efa
Self: ProvideRuntimeApi<Block>,
<Self as ProvideRuntimeApi<Block>>::Api: CoreApi<Block> + ApiExt<Block>,
{
	self.apply_block(operation, import_block, storage_changes)
I'd probably move the implementation here instead (though I understand that you didn't in order to decrease the diff).
Yes, the next code update should be easier this way in case apply_block changes.
}

fn clear_block_gap(&self) {
	self.backend.blockchain().clear_block_gap();
Can't we call it directly somehow?
I didn't find an easier way. Otherwise, I wouldn't have introduced a stub in the in_mem implementation. Happy to change it if there is actually a way.
@@ -91,6 +101,14 @@ impl<B: BlockT> SyncingService<B> {
		rx.await
	}

	/// Restart the synchronization with new arguments.
	pub async fn restart(&self, sync_restart_args: SyncRestartArgs<B>) {
If there is no fast sync anymore, do we still need to restart sync? If so then why?
We'll start the fast-sync using the SyncMode::Light configuration to allow block insertion without the sequence check. Later we need to switch to SyncMode::Full to download full blocks during regular sync. As I mentioned in the description, Substrate sync can switch to the full mode automatically, but it will generate weird messages. It is a UX improvement.
I guess I don't understand why SyncMode::Light is necessary in the first place. We don't want to sync, we just want to insert one block with its state into the database.
The exact error is InvalidBlockNumber within the NonCanonicalOverlay struct (insert method). It seems to trigger when the LAST_CANONICAL db meta entry is set, which in turn seems to be set when we invoke try_commit_operation (canonicalize_block() when operation.commit_state is true). Setting LightState sets commit_state initially to false when we invoke set_genesis_state, given:

/// Returns true if the genesis state writing will be skipped while initializing the genesis
/// block.
pub fn no_genesis(&self) -> bool {
	matches!(self.network.sync_mode, SyncMode::LightState { .. } | SyncMode::Warp { .. })
}

This way we disable the overlay-related checks when we initially don't commit state.
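To make the chain of causation above concrete, here is a minimal model (these are NOT the real Substrate types, just a toy mirror of the quoted snippet): the sync mode drives `no_genesis()`, which determines whether `commit_state` starts out true, which is what later triggers the overlay checks.

```rust
// Toy model of the mechanism; not the real Substrate API.
#[derive(Clone, Copy)]
enum SyncMode {
    Full,
    LightState,
    Warp,
}

fn no_genesis(mode: SyncMode) -> bool {
    // Mirrors the `matches!` expression in the quoted snippet.
    matches!(mode, SyncMode::LightState | SyncMode::Warp)
}

fn initial_commit_state(mode: SyncMode) -> bool {
    // `set_genesis_state` skips writing the genesis state when
    // `no_genesis` is true, leaving `commit_state` false initially,
    // which keeps the NonCanonicalOverlay checks out of the way.
    !no_genesis(mode)
}

fn main() {
    assert!(initial_commit_state(SyncMode::Full));
    assert!(!initial_commit_state(SyncMode::LightState));
    assert!(!initial_commit_state(SyncMode::Warp));
}
```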
And how do we achieve the same without light state sync?
It's likely possible by hijacking the lock_import_and_run method and parameterizing the commit_state variable, but it would be very invasive within the current Substrate architecture. Switching the sync mode before passing control is much less "foreign" than the proposed change, if I understand you correctly.
Is that what upstream suggested?
No. I did this call-stack research based on your request on Thursday.
To be completely honest and transparent I'm getting annoyed and frustrated that it takes forever to convince you to talk to upstream 😕
I need to clarify my priorities then. @dariolina told me that we'll likely have time for this.
- export apply_block method.
- introduce clear_gap API
force-pushed from be247a2 to d16ad68
This PR adds support for Subspace fast-sync.

The first two commits fix issues found on the current polkadot-sdk branch (subspace-v4). The solution itself is split into several parts (with related commits):
New syncing engine

The new syncing engine (FastSyncEngine) is a simplified version of the existing syncing engine that reuses the existing code for state sync. It has an additional SyncService, but with a single API for now that starts the sync process. FastSyncEngine doesn't contain the peer management options of its original, so peers must be provided from the outside (I exported the currently connected peers from the network for this reason). FastSyncEngine downloads the state for the given block from the given peer and imports it into the blockchain.

Modification of the existing infrastructure for block import
Several new features:

We needed to enable importing blocks into the blockchain bypassing the checks (the import_raw_block API). It uses the native way to import a block (lock_import_and_run and apply_block), writing zero weight for the block and requiring the current sync mode to be set to Light.

A new API, clear_block_gap, was added to mitigate the block gap which is set on the block import operation. We needed that to disable the sync of previous blocks that happens otherwise.

A new API, open_peers, returns the currently connected peers to enable fast-sync requests. The naming was derived from the method of the network crate that provides the actual data.
Sync restart

After the initial fast sync is done, we need to switch to the Full sync mode to enable importing full blocks from peers. We also update the common block number for connected peers to reduce the number of incorrectly requested blocks. This feature is actually optional because chain sync will do both automatically, but it will produce weird error messages. Both changes are UX improvements.
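A rough sketch of what such a restart could look like, with entirely illustrative names (SyncRestartArgs exists in the PR, but its fields and this logic are assumptions): the mode is switched back to full sync and each peer's common block number is bumped to the fast-synced target so earlier blocks are not re-requested.

```rust
// Hypothetical sketch of the sync-restart step; names are illustrative.
#[derive(Clone, Copy, PartialEq, Debug)]
enum SyncMode {
    LightState,
    Full,
}

struct SyncRestartArgs {
    new_sync_mode: SyncMode,
    common_block_number: u64,
}

struct PeerState {
    common_number: u64,
}

struct ChainSync {
    mode: SyncMode,
    peers: Vec<PeerState>,
}

impl ChainSync {
    fn restart(&mut self, args: SyncRestartArgs) {
        self.mode = args.new_sync_mode;
        for peer in &mut self.peers {
            // Avoid re-requesting blocks below the fast-synced target.
            peer.common_number = peer.common_number.max(args.common_block_number);
        }
    }
}

fn main() {
    let mut sync = ChainSync {
        mode: SyncMode::LightState,
        peers: vec![PeerState { common_number: 0 }],
    };
    sync.restart(SyncRestartArgs {
        new_sync_mode: SyncMode::Full,
        common_block_number: 1000,
    });
    assert_eq!(sync.mode, SyncMode::Full);
    assert_eq!(sync.peers[0].common_number, 1000);
}
```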