
Full domain chain snap sync (v2). #3196

Merged 21 commits into main from full-domain-snap-sync2 on Nov 8, 2024
Conversation

shamil-gadelshin (Member)

This PR introduces snap sync for the domain chain (#3026).

It supersedes the previous PR, with the review commits squashed into the original code to enable a clean review (requested by @nazar-pc here).

The main changes from the original PR:

  • the last confirmed domain block execution receipt protocol was moved from the consensus network to the domain network
  • the last confirmed domain block execution receipt sync algorithm was changed to tolerate receipt changes caused by network delays
  • SubspaceSyncOracle was modified to support domain snap sync
  • SnapSyncOrchestrator was simplified significantly
  • consensus and domain chain synchronization was refactored from SnapSyncOrchestrator to block import notifications (see the sketch below)
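To illustrate the last bullet: a minimal sketch, assuming a Substrate-style client that implements `BlockchainEvents` (the function name and exact coordination logic are hypothetical, not the PR's actual code):

```rust
use futures::StreamExt;
use sc_client_api::BlockchainEvents;
use sp_runtime::traits::{Block as BlockT, Header as HeaderT, NumberFor};

/// Wait until the consensus chain imports the snap sync target block; once it
/// has, the domain side can proceed without an explicit orchestrator call.
async fn wait_for_consensus_target<Block, Client>(client: &Client, target: NumberFor<Block>)
where
    Block: BlockT,
    Client: BlockchainEvents<Block>,
{
    let mut imports = client.import_notification_stream();

    while let Some(notification) = imports.next().await {
        if *notification.header.number() >= target {
            // The consensus chain reached the target block, so the domain
            // snap sync can start from the matching state.
            break;
        }
    }
}
```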


@teor2345 (Contributor) left a comment

Can you merge the main branch into this branch and resolve conflicts?
Otherwise we’re going to have to review this PR again before it merges.

(Anyone who wants to compare with the original PR, or review without the merge, can do that up to the last current commit, which is 21ee3e0.)

@nazar-pc (Member)

Better to rebase, since no one has reviewed it yet.

teor2345 previously approved these changes Oct 31, 2024

@teor2345 (Contributor) left a comment

Looks good to me. I also confirmed it’s essentially the same as your previous PR, here’s the compare view:
full-domain-snap-sync2...full-domain-snap-sync

Sorry to bring this up late in the process, but there’s no test code or test changes here. I understand that syncing code can be hard to test using unit tests. What’s the test plan for this feature?

Oh and it looks like there’s a minor error handling conflict, can you resolve that using a merge?

client.clone(),
num_peer_hint,
);
spawn_handle.spawn(
Contributor:

Why aren’t any of the tasks in this function essential?

Member Author:

It's a copy of the build_network() code from Substrate. The logic seems to be that a failure in any single network protocol task shouldn't bring the whole process down.
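For context, a hedged sketch of the distinction in question, using sc_service::TaskManager from Substrate (the task names mirror the ones discussed in this thread; the future is a placeholder):

```rust
use std::future::Future;

use sc_service::TaskManager;

/// Sketch: spawn a network protocol task the way Substrate's build_network()
/// does (non-essential), versus the essential variant discussed in this thread.
fn spawn_sync_task(
    task_manager: &TaskManager,
    handler: impl Future<Output = ()> + Send + 'static,
) {
    // Non-essential: if the task exits, the node keeps running and the
    // failure is only visible in the logs.
    task_manager
        .spawn_handle()
        .spawn("block-request-handler", Some("networking"), handler);

    // Essential alternative: if the task exits, the whole node shuts down,
    // so a silent failure cannot go unnoticed.
    // task_manager
    //     .spawn_essential_handle()
    //     .spawn("syncing", Some("networking"), syncing_task);
}
```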

Contributor:

Ok, so I think I’m missing some context here.

What’s the impact on Subspace if each of these tasks fails?

  • last-domain-execution-receipt-request-handler
  • block-request-handler
  • syncing

Let’s assume a failure on most or all of the nodes on the network, because that could easily happen and we wouldn’t know about it. (The node won’t go down, and most people don’t read logs.)

@nazar-pc (Member) left a comment

This will require more work to structure it in a more streamlined way.
There is still too much interleaving and coupling between components that should not know about each other's existence: from consensus knowing about domain sync, to the domain operator depending on the whole consensus chain service.

I see you have updated the MMR stuff; that should have been its own PR.
Then I see you want to reuse some sync implementation components from consensus in domains. The first step there would be to move those components into their own crate (long overdue, something like sc-subspace-sync, probably one PR), and then probably move some generic stuff into something like sc-subspace-sync-common (either a second PR or together with the previous one).
I understand that simply making things public where they are and adding a dependency is quick and easy, but with this approach we will quickly end up with recursive dependencies; there should be a reasonable hierarchy to the dependencies.

Once sync is a separate crate that doesn't depend on quite literally the whole Subspace node implementation (minus the CLI), this PR can be rebased; then we can get rid of domain_snap_sync_finished, SnapSyncOrchestrator, and ConsensusChainSyncParams, and I think it will be okay on the consensus side.

@@ -106,6 +107,11 @@ where
// (default state), it also accounts for DSN sync
(!self.force_authoring && self.inner.is_major_syncing())
|| self.pause_sync.load(Ordering::Acquire)
|| self
Member:

This oracle is about consensus chain sync. It shouldn't know or care about domain sync state.

Member Author:

Hmm, I was under the impression that you suggested the very same idea as Ning, which I implemented. What did I miss?

Member:

My understanding was that this was needed in domains, in which case you create a domain sync oracle with the necessary custom logic there; the consensus sync oracle doesn't need to be modified.
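A minimal sketch of that suggestion, assuming the standard sp_consensus::SyncOracle trait (the wrapper name and flag are hypothetical):

```rust
use std::sync::{
    atomic::{AtomicBool, Ordering},
    Arc,
};

use sp_consensus::SyncOracle;

/// Domain-side oracle that wraps the consensus oracle and adds the custom
/// logic there, leaving the consensus SubspaceSyncOracle unmodified.
pub struct DomainSyncOracle<SO> {
    inner: SO,
    domain_snap_sync_finished: Arc<AtomicBool>,
}

impl<SO: SyncOracle> SyncOracle for DomainSyncOracle<SO> {
    fn is_major_syncing(&self) -> bool {
        // The domain reports "major syncing" until its own snap sync is done,
        // even after the consensus chain has caught up.
        self.inner.is_major_syncing()
            || !self.domain_snap_sync_finished.load(Ordering::Acquire)
    }

    fn is_offline(&self) -> bool {
        self.inner.is_offline()
    }
}
```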

crates/subspace-node/src/commands/run.rs (outdated thread, resolved)
use tracing::debug;

/// Synchronizes consensus and domain chain snap sync.
pub struct SnapSyncOrchestrator {
Member:

I thought we agreed that this orchestrator is not needed, yet it still exists. It should be removed as unnecessary.

Member Author:

We agreed that we need to provide only two shared variables: an atomic bool for the sync oracle and a receiver for the target block. The rest was removed. The current orchestrator is just a container for those variables, which are passed together most of the time. It is simply more convenient to pass them in the same struct, even though it is possible to remove it completely.
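For reference, a sketch of the container as described, assuming a u32 block number and a tokio oneshot channel (both assumptions; the real types may differ):

```rust
use std::sync::{atomic::AtomicBool, Arc};

use tokio::sync::oneshot;

/// The orchestrator reduced to a plain container for the two shared values.
pub struct SnapSyncOrchestrator {
    /// Set once domain snap sync finishes; read by the sync oracle.
    pub domain_snap_sync_finished: Arc<AtomicBool>,
    /// Delivers the consensus target block number for domain snap sync.
    pub target_block_receiver: oneshot::Receiver<u32>,
}
```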

Member:

There is also a channel in it. I would have expected to remove this orchestrator completely, have some config data structure in domains if necessary, and generally not have methods like unblock_consensus_snap_sync that need to be called at strategic moments.

@@ -63,8 +65,10 @@ sp-subspace-mmr = { version = "0.1.0", path = "../../crates/sp-subspace-mmr" }
sp-transaction-pool = { git = "https://github.com/subspace/polkadot-sdk", rev = "5871818e1d736f1843eb9078f886290695165c42" }
subspace-core-primitives = { version = "0.1.0", path = "../../crates/subspace-core-primitives" }
subspace-runtime-primitives = { version = "0.1.0", path = "../../crates/subspace-runtime-primitives" }
subspace-service = { version = "0.1.0", path = "../../crates/subspace-service" }
Member:

This is very problematic. You add a gigantic crate as a dependency to another gigantic crate, and the only purpose is to access ConsensusChainSyncParams, which most likely doesn't belong in subspace_service in the first place.

Member Author:

I can get rid of this dependency by moving the struct. However, I would then need to add an sc-consensus-subspace dependency to domain-operator (the struct depends on the block import streams). Is that better?

Member:

Not really, domains should not rely on Subspace consensus code either

crates/subspace-node/src/commands/run.rs (outdated thread, resolved)
@@ -39,6 +44,7 @@ sp-trie = { git = "https://github.com/subspace/polkadot-sdk", rev = "5871818e1d7
sp-weights = { git = "https://github.com/subspace/polkadot-sdk", rev = "5871818e1d736f1843eb9078f886290695165c42" }
subspace-core-primitives = { version = "0.1.0", path = "../../../crates/subspace-core-primitives" }
subspace-runtime-primitives = { version = "0.1.0", path = "../../../crates/subspace-runtime-primitives" }
subspace-service = { version = "0.1.0", default-features = false, path = "../../../crates/subspace-service" }
Member:

Ouch, this is way more problematic than the other one. This is a gigantic dependency added to a lower-level crate, which will significantly worsen compile times and will serialize compilation order even more.

Member Author:

domain-operator hosts the domain snap sync feature; it requires both the mmr_sync and consensus snap sync features from subspace_service. There are at least two ways to mitigate the existing dependency: either create a separate crate for the sync features (both subspace_service and domain_operator would depend on it), or introduce several traits and wrapper structures to pass encapsulated functions to domains (domain_operator would contain the traits, which subspace_node would use to pass the wrapper structures to domains). Either way, it seems more complicated than keeping the existing dependency. Do you see a simpler way, or do you think the new complexity is justified from a dependency-graph perspective?

Member:

I certainly expect a separate crate (or several) for sync; we'll have even more code once we upgrade our sync implementation to implement a custom SyncStrategy after the latest Substrate upgrade. But ideally the operator wouldn't depend on consensus sync either, but rather on some shared crate with common primitives/utilities.


impl<Block, Client, NR> LastDomainBlockInfoReceiver<Block, Client, NR>
/// Provides parameters for domain snap sync synchronization with the consensus chain snap sync.
pub struct ConsensusChainSyncParams<Block, CNR>
Member:

Everything included in this struct is already public and can be accessed in subspace-node; this struct should not exist in subspace-service. It doesn't belong here.

Member Author:

I answered above; the question remains about the dependency for the block import notification stream.

Member:

A stream is just a stream. Domains already use some streams that are derived from consensus logic, yet they don't depend (I hope) on the consensus crate for that.
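A sketch of that point: the domain side can be generic over any notification stream, so it never names the concrete stream type from sc-consensus-subspace (the struct shape here is hypothetical):

```rust
use futures::Stream;

/// Parameters the domain side needs from consensus sync, generic over the
/// stream type so the concrete producer stays on the consensus side.
pub struct ConsensusChainSyncParams<BlockImports>
where
    BlockImports: Stream + Send + 'static,
{
    pub block_import_stream: BlockImports,
}
```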

@shamil-gadelshin (Member Author)

@teor2345

> Looks good to me. I also confirmed it’s essentially the same as your previous PR, here’s the compare view:
> full-domain-snap-sync2...full-domain-snap-sync

I don't think the comparison is correct; for example, SubspaceSyncOracle is the same in both branches, but it shows as different.

> Sorry to bring this up late in the process, but there’s no test code or test changes here. I understand that syncing code can be hard to test using unit tests. What’s the test plan for this feature?
> Oh and it looks like there’s a minor error handling conflict, can you resolve that using a merge?

We test such code manually, which is why I will merge main at the end: the test stand requires a non-trivial setup each time.

@teor2345 (Contributor) left a comment

Maybe these tasks should be essential?


@shamil-gadelshin (Member Author)

> What’s the impact on Subspace if each of these tasks fails?
>
>   • last-domain-execution-receipt-request-handler
>   • block-request-handler
>   • syncing
>
> Let’s assume a failure on most or all of the nodes on the network, because that could easily happen and we wouldn’t know about it. (The node won’t go down, and most people don’t read logs.)

Failure of any of these tasks is not catastrophic for a working node after the domain sync. However, a total network failure would result in failed domain snap sync attempts (though full sync would still be possible).

@shamil-gadelshin (Member Author)

The last commit batch contains changes related to PR comments and an offline discussion with Nazar:

  • a new separate crate for SubspaceSyncEngine, shared by consensus and domain snap sync
  • removed the subspace-service dependencies from the domain service and operator crates
  • refactored the block notification streams to remove the dependency on sc-consensus-subspace
  • moved the snap sync structures SnapSyncOrchestrator and ConsensusChainSyncParams from subspace-service to domain-client-operator
  • removed MMR sync from domain snap sync because it was integrated into consensus snap sync

@teor2345 (Contributor) commented Nov 6, 2024

> What’s the impact on Subspace if each of these tasks fails?
>
>   • last-domain-execution-receipt-request-handler
>   • block-request-handler
>   • syncing
>
> Let’s assume a failure on most or all of the nodes on the network, because that could easily happen and we wouldn’t know about it. (The node won’t go down, and most people don’t read logs.)
>
> Failure of any of these tasks is not catastrophic for a working node after the domain sync. However, a total network failure would result in failed domain snap sync attempts (though full sync would still be possible).

Ok, so I guess the overall question is:
If any of these tasks fails across the Subspace network, do we want the nodes to exit (and hopefully restart)?
Or do we want them to keep going, with no one able to snap sync?

Do we have tests that make sure all these tasks run properly and don’t quickly error out?

@teor2345 (Contributor) left a comment

Thanks, this looks a bit clearer.

domains/client/domain-operator/src/snap_sync.rs (two outdated threads, resolved)
@teor2345 (Contributor) left a comment

Thanks, looks good to me.

I’m happy to leave the essential task question to another ticket or PR, but we’ll need to resolve conflicts for this to merge.

domains/client/domain-operator/src/lib.rs (outdated thread, resolved)
crates/subspace-service/Cargo.toml (outdated thread, resolved)
@@ -0,0 +1,24 @@
[package]
name = "subspace-sync"
@nazar-pc (Member) commented Nov 8, 2024

Suggested change:
- name = "subspace-sync"
+ name = "sc-subspace-sync-common"

domains/client/domain-operator/src/snap_sync.rs (outdated thread, resolved)
Comment on lines 480 to 484
let consensus_network_sync_oracle_wrapper = Arc::new(ConsensusChainSyncOracleWrapper::new(
consensus_network_sync_oracle,
consensus_chain_sync_params
.as_ref()
.map(|params| params.snap_sync_orchestrator.domain_snap_sync_finished()),
));
Member:

Still don't really understand why the consensus syncing oracle needs to check whether domain snap sync has finished...

@shamil-gadelshin (Member Author)

The last commit batch contains:

  • a merge commit with main
  • the MMR target block fix
  • the subspace-sync crate renaming

@shamil-gadelshin added this pull request to the merge queue on Nov 8, 2024
Merged via the queue into main with commit f26446d on Nov 8, 2024
10 checks passed
@shamil-gadelshin deleted the full-domain-snap-sync2 branch on November 8, 2024 at 19:02