Full domain chain snap sync (v2). #3196
Conversation
Can you merge the main branch into this branch and resolve conflicts? Otherwise we're going to have to review this PR again before it merges.
(Anyone who wants to compare with the original PR, or review without the merge, can do that up to the last current commit, which is 21ee3e0.)
Better rebase since no one reviewed it yet:
- add custom `build-network` method
- add last domain block execution receipt protocol (moved from consensus chain)

Force-pushed from 21ee3e0 to cc6572a.
Looks good to me. I also confirmed it's essentially the same as your previous PR; here's the compare view:
full-domain-snap-sync2...full-domain-snap-sync
Sorry to bring this up late in the process, but there’s no test code or test changes here. I understand that syncing code can be hard to test using unit tests. What’s the test plan for this feature?
Oh, and it looks like there's a minor error-handling conflict; can you resolve that using a merge?
client.clone(),
num_peer_hint,
);
spawn_handle.spawn(
Why aren’t any of the tasks in this function essential?
It's a copy of the `build_network()` code from Substrate. The logic seems to be that a network protocol task failure shouldn't bring the whole process down.
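For context, here is a minimal self-contained sketch of the failure semantics being discussed (illustrative only, not the actual Substrate `SpawnTaskHandle` code): a task spawned as non-essential can fail without taking the node down, whereas an essential task's exit would trigger shutdown.

```rust
use std::thread;
use std::time::Duration;

// A non-essential task: its panic is contained to its own thread,
// so the rest of the process keeps running.
fn spawn_non_essential(name: &'static str) {
    thread::spawn(move || {
        panic!("{name} failed");
    });
}

fn main() {
    spawn_non_essential("block-request-handler");
    thread::sleep(Duration::from_millis(100));
    // Still alive: with essential-task semantics the node would
    // instead begin shutting down once the task ended.
    println!("node still running despite task failure");
}
```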
Ok, so I think I’m missing some context here.
What’s the impact on Subspace if each of these tasks fails?
- last-domain-execution-receipt-request-handler
- block-request-handler
- syncing
Let's assume a failure on most or all of the nodes on the network, because that could easily happen and we wouldn't know about it. (The failures would only show up in the logs: the node won't go down, and most people don't read logs.)
This will require more work to structure it in a more streamlined way. There is still too much interleaving and coupling between components that should not know about each other's existence: from consensus knowing about domain sync, to the domain operator depending on the whole consensus chain service.
I see you have updated the MMR stuff; that should have been its own PR.
Then I see you want to reuse some sync implementation components from consensus in domains. The first step there would be to move those components into their own crate (long overdue, something like `sc-subspace-sync`, probably one PR) and then probably move some generic stuff (like `sc-subspace-sync-common`, either a second PR or together with the previous one).
I understand that simply making things public where they are and adding a dependency is quick and easy, but with this approach we will quickly get into recursive dependencies, and there should be some reasonable hierarchy to dependencies.
Once sync is a separate crate that doesn't depend on quite literally the whole Subspace node implementation (minus CLI), this PR can be rebased; then we can get rid of `domain_snap_sync_finished`, `SnapSyncOrchestrator`, `ConsensusChainSyncParams`, and I think it will be okay on the consensus side.
@@ -106,6 +107,11 @@ where
// (default state), it also accounts for DSN sync
(!self.force_authoring && self.inner.is_major_syncing())
    || self.pause_sync.load(Ordering::Acquire)
    || self
This oracle is about consensus chain sync. It shouldn't know or care about domain sync state.
Hmm, I was under the impression that you suggested the very same idea as Ning, which I implemented. What did I miss?
My understanding was that this was needed in domains, in which case you create a domain sync oracle with the necessary custom logic there, but the consensus sync oracle doesn't need to be modified.
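A minimal sketch of what that suggestion could look like, assuming a shared atomic flag for domain snap sync completion (the trait and all names here are illustrative stand-ins, not the actual `SubspaceSyncOracle` wiring):

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

// Illustrative stand-in for the sync oracle interface.
pub trait SyncOracle {
    fn is_major_syncing(&self) -> bool;
}

// Domain-side wrapper: the consensus oracle stays untouched, and the
// domain-specific logic lives entirely in the domains code.
pub struct DomainSyncOracle<SO> {
    consensus_oracle: SO,
    domain_snap_sync_finished: Arc<AtomicBool>,
}

impl<SO: SyncOracle> SyncOracle for DomainSyncOracle<SO> {
    fn is_major_syncing(&self) -> bool {
        // Report "major syncing" until domain snap sync has completed too.
        self.consensus_oracle.is_major_syncing()
            || !self.domain_snap_sync_finished.load(Ordering::Acquire)
    }
}
```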
use tracing::debug;

/// Synchronizes consensus and domain chain snap sync.
pub struct SnapSyncOrchestrator {
I thought we agreed that this orchestrator is not needed, yet it still exists. It should be removed as unnecessary.
We agreed that we need to provide only two shared variables: an atomic bool for the sync oracle and a receiver for the target block. The rest was removed. The current orchestrator is just a container for those variables, which are passed together most of the time. It is simply more convenient to pass those variables in the same struct, even though it is possible to remove it completely.
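In other words, the pared-down orchestrator amounts to roughly this kind of container (a hedged sketch; the field names and the target-block type are illustrative, not the actual Subspace definitions):

```rust
use std::sync::Arc;
use std::sync::atomic::AtomicBool;
use std::sync::mpsc::Receiver;

// Just a bundle of the two shared values agreed on in review.
pub struct SnapSyncOrchestrator {
    /// Atomic flag read by the domain sync oracle.
    pub domain_snap_sync_finished: Arc<AtomicBool>,
    /// Receiver for the target block number chosen by consensus snap sync.
    pub target_block_receiver: Receiver<u32>,
}
```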
There is also a channel in it. I would have expected to remove this orchestrator completely, have some config data structure in domains if necessary, and generally not have methods like `unblock_consensus_snap_sync` that need to be called at strategic moments.
domains/service/Cargo.toml
Outdated
@@ -63,8 +65,10 @@ sp-subspace-mmr = { version = "0.1.0", path = "../../crates/sp-subspace-mmr" }
sp-transaction-pool = { git = "https://github.com/subspace/polkadot-sdk", rev = "5871818e1d736f1843eb9078f886290695165c42" }
subspace-core-primitives = { version = "0.1.0", path = "../../crates/subspace-core-primitives" }
subspace-runtime-primitives = { version = "0.1.0", path = "../../crates/subspace-runtime-primitives" }
subspace-service = { version = "0.1.0", path = "../../crates/subspace-service" }
This is very problematic. You add a gigantic crate as a dependency of another gigantic crate, and the only purpose is to access `ConsensusChainSyncParams`, which most likely doesn't belong in `subspace_service` in the first place.
I can get rid of this dependency by moving the struct. However, I would then need to add an `sc-consensus-subspace` dependency to `domain-operator` (the struct contains block import streams). Is that better?
Not really; domains should not rely on Subspace consensus code either.
@@ -39,6 +44,7 @@ sp-trie = { git = "https://github.com/subspace/polkadot-sdk", rev = "5871818e1d7 }
sp-weights = { git = "https://github.com/subspace/polkadot-sdk", rev = "5871818e1d736f1843eb9078f886290695165c42" }
subspace-core-primitives = { version = "0.1.0", path = "../../../crates/subspace-core-primitives" }
subspace-runtime-primitives = { version = "0.1.0", path = "../../../crates/subspace-runtime-primitives" }
subspace-service = { version = "0.1.0", default-features = false, path = "../../../crates/subspace-service" }
Ouch, this is way more problematic than the other one. This is a gigantic dependency added to a lower-level crate, which will significantly worsen compile times and will serialize compilation order even more.
`domain-operator` hosts the domain snap sync feature; it requires both `mmr_sync` and the consensus snap sync features from `subspace_service`. There are at least two ways to mitigate the existing dependency: either create a separate crate for sync features (both `subspace_service` and `domain_operator` would depend on it), or introduce several traits and wrapping structures to pass encapsulated functions to domains (`domain_operator` contains the traits, which are used by `subspace_node` to pass wrapping structures to domains), as sketched below. Either way, it seems more complicated than keeping the existing dependency. Do you see a simpler way, or do you think the new complexity is justified from a dependency graph perspective?
I certainly expect a separate crate (or several) for sync; we'll have even more code once we upgrade our sync implementation to implement a custom `SyncStrategy` after the latest Substrate upgrade. But ideally the operator wouldn't depend on consensus sync either, but rather on some shared crate with common primitives/utilities.
impl<Block, Client, NR> LastDomainBlockInfoReceiver<Block, Client, NR>

/// Provides parameters for domain snap sync synchronization with the consensus chain snap sync.
pub struct ConsensusChainSyncParams<Block, CNR>
Everything included in this struct is already public and can be accessed in `subspace-node`. This struct should not exist in `subspace-service`; it doesn't belong here.
I answered above: the question remains about the dependency for the block import notification stream.
A stream is just a stream. Domains already use some streams that are derived from consensus logic, yet they don't depend (I hope) on the consensus crate for this.
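A hedged illustration of that point, assuming the `futures` crate (the function and types are invented for this example, not actual Subspace code): a domain-side consumer can take the notifications as a plain generic stream, with no dependency on the crate that produced them.

```rust
use futures::{Stream, StreamExt};

// Illustrative: this consumer only needs "a stream of block numbers",
// regardless of which crate's block import logic produced it.
pub async fn wait_for_target_block<S>(mut imports: S, target: u32) -> bool
where
    S: Stream<Item = u32> + Unpin,
{
    while let Some(number) = imports.next().await {
        if number >= target {
            return true;
        }
    }
    // Stream ended before the target block was imported.
    false
}
```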
I don't think the comparison is correct; for example, `SubspaceSyncOracle` is the same in both branches, but it shows as different.
We test such code manually, and that is the reason why I will merge main at the end: the test stand requires a non-trivial setup each time.
Maybe these tasks should be essential?
Failure of any of these tasks is not catastrophic for a working node after the domain sync. However, a total network failure would result in failed domain snap sync attempts (full sync would still be possible).
The last commit batch contains changes related to PR comments and an offline discussion with Nazar.
Ok, so I guess the overall question is: do we have tests that make sure all these tasks run properly, and don't quickly error out?
Thanks, this looks a bit clearer.
Thanks, looks good to me.
I’m happy to leave the essential task question to another ticket or PR, but we’ll need to resolve conflicts for this to merge.
shared/subspace-sync/Cargo.toml
Outdated
@@ -0,0 +1,24 @@
[package]
name = "subspace-sync"
name = "subspace-sync" | |
name = "sc-subspace-sync-common" |
let consensus_network_sync_oracle_wrapper = Arc::new(ConsensusChainSyncOracleWrapper::new(
    consensus_network_sync_oracle,
    consensus_chain_sync_params
        .as_ref()
        .map(|params| params.snap_sync_orchestrator.domain_snap_sync_finished()),
));
Still don't really understand why the consensus syncing oracle needs to check whether domain snap sync has finished...
Force-pushed from b9fbdc3 to 959a1a9.
The last commit batch contains:
This PR introduces snap sync for the domain chain (#3026).
It supersedes the previous PR, with the review commits squashed into the original code to enable a clean review (requested by @nazar-pc here).
The main changes from the original PR:
- `SubspaceSyncOracle` was modified to support domain snap sync
- `SnapSyncOrchestrator` was simplified significantly
- `SnapSyncOrchestrator` […] to block import notifications

Code contributor checklist: