Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fraud proof: Add Host function and update timestamp extrinsic derivation #2046

Merged
merged 4 commits into from
Oct 10, 2023

Conversation

vedhavyas
Copy link
Contributor

@vedhavyas vedhavyas commented Oct 3, 2023

This PR adds a new sp-domains-fraud-proof crate that adds a few host functions we need to verify the Fraud proofs. The host functions introduced in this PR are general enough and veriifcation logic is still part of the runtime and can be upgraded through a runtime upgrade.

It turns out we cannot remove the storage proofs from the Fraud proof since the Consensus node, which verifies the fraud proof, will be just a Full node at the best. So it is not gauranted that required state such as timestamp and block_randomness will exist. This is because while consensus node continues to produce blocks and prune older blocks, domain may be halted for some reason. There is a attack vector when if there is no honest operator in the network to submit the fraud proofs while the consensus state is available and they come back up after the state is pruned, verifier relaying on the state to fetch timestamp and block_randomness will fail and there by unable to verify the fraud proof. So we still need the storage proofs and pallet-domain will store the required consensus state root through host function and use the same during the verification.

Note: sp-domains still holds the verification logic of other fraud proof and that should ideally move to new sp-domains-fraud-proof crate. Since that is a pure refactor, I will handle in a separate refactoring PR. This makes the changes in this PR easy to understand. We also can remove sc-consensus-fraud-proof crate in the refactoring PR as that would be useless going forward.

Code contributor checklist:

@nazar-pc
Copy link
Member

nazar-pc commented Oct 4, 2023

There is a attack vector when if there is no honest operator in the network to submit the fraud proofs while the consensus state is available

Are you sure this is within our security model? Because I'm fairly certain we have a fundamental security assumption that there is at least one honest operator online at any time. And challenge period should be aligned with consensus chain in such a way that all challenges point to non-pruned state.

@vedhavyas
Copy link
Contributor Author

Our assumption is for one honest operator exists in the network but that does not mean they are offline for a period of time. I'm mostly talking about the case when they are offline and comeback to submit a fraud proof during which consensus chain would have moved much forward than that domain itself.
Well maybe that would not be the case at all but this is certainly possible. That is one reason why we still store a list consensus block hashes as part of the pallet-domains state.

And challenge period should be aligned with consensus chain in such a way that all challenges point to non-pruned state.

Well, there is no guarantee that every consensus block will have a domain bundle that leads a new domain block. There can definitely be some gaps. Though I remember there was some plan to not have such a gap but that is not currently done Iirc

@nazar-pc
Copy link
Member

nazar-pc commented Oct 4, 2023

Our assumption is for one honest operator exists in the network but that does not mean they are offline for a period of time. I'm mostly talking about the case when they are offline and comeback to submit a fraud proof during which consensus chain would have moved much forward than that domain itself.

That is really odd sounding to me. If they "exist", but "offline", they might as well be considered as non-existing 🤷‍♂️

Well, there is no guarantee that every consensus block will have a domain bundle that leads a new domain block. There can definitely be some gaps. Though I remember there was some plan to not have such a gap but that is not currently done Iirc

Is challenge period not defined in consensus blocks? If not I think it is a problem and should be changed to be measured in consensus blocks, which I think addresses your comment.

@NingLin-P
Copy link
Member

Is challenge period not defined in consensus blocks? If not I think it is a problem and should be changed to be measured in consensus blocks, which I think addressed your comment.

The challenge period of ER is measured in domain block, to my understanding there are several reasons:

  • When using consensus block, for fraudulent ER submitted in block #N, the fraud proof must be included before block #N + challenge_period, all the domain instances are racing to submit their fraud proof within this window while there are only limited number of fraud proof can be included in challenge_period number of consensus block, some proof may miss it and break the security assumption. When using domain block, the domain block only produces if there is a bundle included in the consensus block, and fraud proof has a higher priority to be included than bundle, so it is okay for fraud proof/bundle to miss (not be included in) some consensus blocks.
  • Different domains may have different workload, which means they may have different bundle/block intervals, using consensus block forces all domains to use the same challenge period. For extreme domains, it may take longer time than the fixed challenge period to execute their block and hence produce their ER, so malicious operators can craft a fraudulent ER (without executing the block) and get it pass the challenge period before other operators can submit fraud proof (which need their local ER to be produced first)
  • Using consensus block also means the challenge period needs to be moving forward for every domain when a consensus block is initialized, which may cause a huge amount of weight and exceed the max block weight limit since the number of domains is unbounded. With domain block, the challenge period moves forward when a new ER is submitted (on the submit_bundle call) so weight is bounded and accountable

As for the reason why we still need storage proof, I think it is because the non-archival node will prune the historical state, without storage proof only the consensus node with all the historical state of the challenge period can verify and include the fraud proof in their block (the challenge period can be days or weeks in the production environment), with storage proof any node have the block header can do that.

@dariolina
Copy link
Member

dariolina commented Oct 5, 2023

The expectation is that there is an online honest operator that watches his domain at all times. When that is true, they will notice right away when some incorrect ER is submitted and have plenty of time.

all the domain instances are racing to submit their fraud proof within this window while there are only limited number of fraud proof can be included in challenge_period number of consensus block

This depends on our assumptions. It sounds like you assume an unbounded number of domains that are always producing bundles. If the subsequent assumption is that consensus chain can handle inclusion of all of those bundles, then it can handle all the fraud proofs as well.

For extreme domains, it may take longer time than the fixed challenge period to execute their block and hence produce their ER

Do we expect something like that? Follow-up question: can we have a different "consensus block" challenge period for different domains based on their requirements? (like L2s on Ethereum do, so some domain is ok with 10 min and some like Arbitrum want a week) We would need to keep storage proof in that case I think

Using consensus block also means the challenge period needs to be moving forward for every domain when a consensus block is initialized

With an unbounded number of domains and high demand this statement will be true even if domain blocks are used for challenge period.

@NingLin-P
Copy link
Member

The expectation is that there is an online honest operator that watches his domain at all times. When that is true, they will notice right away when some incorrect ER is submitted and have plenty of time.

Yeah, it is true in theory and if nothing goes wrong a fraud proof should be submitted within a few seconds, but in practice the challenge period is sometime set to days or weeks since the hackers are incentivized to make things go wrong.

It sounds like you assume an unbounded number of domains that are always producing bundles. If the subsequent assumption is that consensus chain can handle inclusion of all of those bundles, then it can handle all the fraud proofs as well.

It is more about the inclusion of fraud proof, with consensus block we also need to take the network congestion into account for the challenge period, with domain block the challenge period is more about the time it take to notice the fraud and generate the fraud proof.

Do we expect something like that? Follow-up question: can we have a different "consensus block" challenge period for different domains based on their requirements?

It is just an edge case, but consider a domain that takes 10 consensus blocks to execute its domain block, its challenge period is reduced by 9 consensus blocks compared to those domains that take 1 consensus blocks to execute, while the challenge period is really about the time to notice and generate fraud proof. For the question, Yes, different "consensus block" challenge period is feasible.

With an unbounded number of domains and high demand this statement will be true even if domain blocks are used for challenge period.

Not really, when there is high demand the bundle/ER may not be included in the consensus block at all, so the challenge period doesn't move forward, it is kind of a lazy approach. With consensus block, the challenge period of all the domains is moving forward as long as a consensus block is produced.

crates/pallet-domains/src/lib.rs Outdated Show resolved Hide resolved
Comment on lines 1493 to 1495
let consensus_block_hash = bad_receipt.consensus_block_hash;
let verification_info = get_invalid_domain_extrinsic_root_info(
H256::from_slice(consensus_block_hash.as_ref()),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, that is not correct. We need to use the consensus_clock_hash from which this ER is derived from so that Fraud proof verifier can also derive same things. For example, timestamp, block_randomness etc..

If you talking about the discussion we had on domain_runtime_code runtime api returning upgraded runtime instead of correct runtime, that is out of scope for this PR and needs to be fixed in a different PR

Copy link
Member

@nazar-pc nazar-pc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't check verification logic in detail, but the approach makes sense to me.

What I'd probably like to see is instead of get_invalid_domain_extrinsic_root_info and many more to come, have just one with both input and output as enums, so we can extend it without changing host function API. I'm not sure how good idea it is, it seems awkward that we will be sending enum as an arguent one way and (potentially) get unexpected enum back, but it seems to be more flexible as we need to extend the supported variants. Thoughts?

crates/pallet-domains/src/tests.rs Outdated Show resolved Hide resolved
/// Domain fraud proof related runtime interface
#[runtime_interface]
pub trait FraudProofRuntimeInterface {
fn get_invalid_domain_extrinsic_root_info(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be description for such host functions. Also the name is odd, is it really about invalid domain extrinsic root or is it about any, it just happens to be used for invalid extrinsics by the runtime?

@vedhavyas
Copy link
Contributor Author

What I'd probably like to see is instead of get_invalid_domain_extrinsic_root_info and many more to come, have just one with both input and output as enums, so we can extend it without changing host function API. I'm not sure how good idea it is, it seems awkward that we will be sending enum as an arguent one way and (potentially) get unexpected enum back, but it seems to be more flexible as we need to extend the supported variants. Thoughts?

Actually, I like the idea of making this host function more generic that supports all fraud proof veriifcation and easier to add new types. Will update this

Comment on lines 16 to 17
consensus_block_hash: H256,
domain_id: DomainId,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure these are the only arguments that will be necessary for any kind of fraud proof. I'm thinking whether they should be moved under enum or deserve to be here.

/// Request type to fetch required verification information for fraud proof through Host function.
#[derive(Debug, Decode, Encode, TypeInfo, PartialEq, Eq, Clone)]
pub enum FraudProofVerificationInfoRequest {
InvalidDomainExtrinsicRoot,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add docs for all enum variants on newly introduced data structures. And the name is odd, are you sure this is a request for invalid domain extrinsic root? Or do we request valid domain extrinsic root for the purposes of verifying an invalid one.

And for these data structures in sp-* it would be preferable to specify SCALE codec indices explicitly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And the name is odd, are you sure this is a request for invalid domain extrinsic root? Or do we request valid domain extrinsic root for the purposes of verifying an invalid one.

I do not have a better name for this unfortunately.
We request the information used to verify the invalid domain extrinsic fraud proof through host function. The variant signifies the which fraud proof type we need to send the verification info back. The root derivation still happens on the runtime and this host function just uses the consensus state to return the verification information.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I think the variant and its comment should indicate what information you're requesting from the client, not what fraud proof verification will use it. You may have multiple fraud proofs requesting the same data.

Copy link
Contributor Author

@vedhavyas vedhavyas Oct 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I think the variant and its comment should indicate what information you're requesting from the client, not what fraud proof verification will use it. You may have multiple fraud proofs requesting the same data.

Updated the comment specific to the variant. I dont think any other fraud proof type would require this info apart from this Fraud proof type. Do you disagree ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is not the point, it was just a possibility. It is confusing if you don't call things what they are. From last update and comment it seems that maybe it should have already been two variants: for randomness and timestamp.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be two host functions while this could simply with one call that returns the both and are used at the same time. Sure I agree that it makes the host calls more generic but returning these info bundled together would be nice though I may not be thinking much further ahead and I dont have strong opinion on this. If you think this makes it easier to understand and follow the code path, I willing to change though even though I wouldn't like it much :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can easily imagine randomness to be useful for some other fraud proofs for example. And going forward for other fraud proofs there might be overlapping pieces of information necessary. The approach where enum variants are saying what they request seems to scale more gracefully than variants that are purpose built for one fraud proof, even though it might be a bit more convenient to handle.

@NingLin-P @ParthDesai do you have an opinion on this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Folks, I would really appreciate if we could move forward and make a decision on this so that we can actually start building the rest of the fraud proofs using the host function. I dont mind taking the route @nazar-pc suggested at this point but we really needs to get feedback and move on. @NingLin-P @ParthDesai

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also prefer @nazar-pc 's approach since there are overlapping data (i.e. domain runtime code, bundle) required by multiple fraud proof.

Besides we will need the host function to not only fetch data but also actually perform some of the verification steps (where domain runtime API is involved) do we also make them as a variant here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides we will need the host function to not only fetch data but also actually perform some of the verification steps (where domain runtime API is involved) do we also make them as a variant here?

I think so, everything that runtime can't do will go as request to host and back as response, we just need to ensure we are pricing it correctly.

/// Request type to fetch required verification information for fraud proof through Host function.
#[derive(Debug, Decode, Encode, TypeInfo, PartialEq, Eq, Clone)]
pub enum FraudProofVerificationInfoRequest {
InvalidDomainExtrinsicRoot,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And the name is odd, are you sure this is a request for invalid domain extrinsic root? Or do we request valid domain extrinsic root for the purposes of verifying an invalid one.

I do not have a better name for this unfortunately.
We request the information used to verify the invalid domain extrinsic fraud proof through host function. The variant signifies the which fraud proof type we need to send the verification info back. The root derivation still happens on the runtime and this host function just uses the consensus state to return the verification information.

&mut self,
consensus_block_hash: H256,
domain_id: DomainId,
fraud_proof_verification_req: FraudProofVerificationInfoRequest,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From #2046 (comment)

These are minimal required info I believe is common for all fraud proof variants. Anything specific should be within the variant

use std::sync::Arc;

/// Trait to query and verify Domains Fraud proof.
pub trait FraudProofHostFunctions: Send + Sync {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This host function expects the verifier node to contain all the necessary data for verifying the fraud proof, it won't be held if we still allow customizing the --state-pruning and --block-pruning options for the non-archival consensus node, I remember @nazar-pc had mentioned the archiving process also have the similar requirement, we should make that requirement more explicit or plz leave a TODO if it is not going handles in this PR.

NOTE: ---block-pruning is necessary since we need the block body to get the bundle.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For obvious reasons we will not allow it, it is not really supported on Gemini 3f already.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How was it approached? I still see the --state-pruning and --block-pruning options in the gemini-3f-2023-oct-06 snapshot. Do we use a constant value for these options and ignore what the user pass in or make all consensus node to be archival nodes?

Copy link
Member

@nazar-pc nazar-pc Oct 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, if you attempt to use it, things will break in a bad way. #1727 will address it by replacing Substrate CLI with our own CLI that will have more meaningful arguments (--farmer and --operator instead of --validator) and arguments that are not supported/not applicable will be removed.

Block pruning right now is already dependent on archiving, so it will not prune recent blocks even if argument is set to do so. State pruning should be left as archival though, but I believe it should be pruned automatically when block is pruned (if not, I didn't check, then we need to make it so).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, but we may need the pruning dependent on the challenge period at least for nodes that turn on domain, in case archiving has a shorter requirement than the challenge period.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely. Practically though it is not a concern yet, blocks are pruned so much later that it will not cause issues for now.

@vedhavyas
Copy link
Contributor Author

The PR should be updated to have individual host functions to fetch required info through host function.

As for the naming, if you disagree with any, please also provide potential alternatives as I do not have enough brain power to come up with any new :)

Copy link
Member

@nazar-pc nazar-pc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is an opportunity to generalize response error handling in the future, but otherwse approach makes sense to me.

@vedhavyas vedhavyas added this pull request to the merge queue Oct 10, 2023
Merged via the queue into main with commit f567ca7 Oct 10, 2023
10 checks passed
@vedhavyas vedhavyas deleted the fraud_proof/host_function branch October 10, 2023 18:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants