Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

modify is_stable to be indexer running and latest block timestamp not behind #887

Merged
merged 8 commits into from
Oct 30, 2024

Conversation

ppca
Copy link
Contributor

@ppca ppca commented Oct 10, 2024

indexer progressing = local indexer block height last update timestamp is within threshold
indexer caught up = my block height >= latest height from near rpc endpoint - 50

This will fix the case where some nodes are catching up but still not update to date, and they got involved in the signature generation

Copy link
Member

@ChaoticTempest ChaoticTempest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good for the most part, but we really should not be changing the field names for messages

chain-signatures/node/src/web/mod.rs Outdated Show resolved Hide resolved
chain-signatures/node/src/mesh/mod.rs Outdated Show resolved Hide resolved
chain-signatures/node/src/mesh/connection.rs Outdated Show resolved Hide resolved
chain-signatures/node/src/mesh/connection.rs Outdated Show resolved Hide resolved
Copy link
Collaborator

@volovyks volovyks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Phuong's comments are spot on. Overall, it looks good. However, we should be mindful that a single node could potentially disrupt the system by reporting a block height that is higher than the actual value.

@ppca
Copy link
Contributor Author

ppca commented Oct 11, 2024

@ChaoticTempest @volovyks I changed the logic:

  1. get max block height from near rpc instead of from participants, this means as long as the node is honest, it will be compared against the true latest block height;
  2. block_height_lag_threshold will be one field in indexer and is configurable thru environment variable;
  3. instead of changing the definition of is_participant_stable(), I added a function is_stable() in indexer, that checks a) block heights has been updated lately, b) I am not more than block_height_lag_threshold behind the max block height form near rpc;
  4. given 3), the is_stable field in StateView is directly updated to be is_stable = self.indexer.is_stable().await, which will encapsulate both progressing and height being caught up

@ppca
Copy link
Contributor Author

ppca commented Oct 11, 2024

I do want to add a timeout to all the near rpc calls we have in our code, is there an easy way to do it for near_fetch::Client? @ChaoticTempest

chain-signatures/node/src/indexer.rs Outdated Show resolved Hide resolved
@@ -53,6 +54,14 @@ pub struct Options {
/// The threshold in seconds to check if the indexer needs to be restarted due to it stalling.
#[clap(long, env("MPC_INDEXER_RUNNING_THRESHOLD"), default_value = "300")]
pub running_threshold: u64,

/// The threshold in block height lag to check if the indexer has caught up.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's discuss a strategy for configuration at our next meeting. Do we want to put everything in the contract?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, we should move all these indexer configurations into the contract to make it easier to configure

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I was thinking the same when adding timeout options today.

@ChaoticTempest
Copy link
Member

I do want to add a timeout to all the near rpc calls we have in our code, is there an easy way to do it for near_fetch::Client? @ChaoticTempest

You have to use either transact_async and do the check manually, or keep transact and use tokio::time::timeout

@ppca
Copy link
Contributor Author

ppca commented Oct 16, 2024

When doing this PR, I actually realized we only need to proposer to be stable. This is because 1) proposer is the one starting the signature protocol, other nodes just joining and they don't check if they have that request locally; 2) proposer is deterministic for each sign request, so there's no risk of unstable nodes later re-starting protocol for the same signature request, because it will only be that same proposer always.
So by limiting lag from rpc latest block to 200, we make sure proposers will see the sign request within 3.5 mins, and have another 1 min to respond().
If that lag is too big, that node will not make it in time anyways.
[100-200] might all be good.
@ChaoticTempest @volovyks

@ppca
Copy link
Contributor Author

ppca commented Oct 17, 2024

I also realized with our current implementation, when a node catches up on heights, it is likely to start protocols to generate signature for an old sign request, because the stable participants set change, and if you look at logic here:

let subset = stable.keys().choose_multiple(&mut rng, threshold);

The subset and proposer for the signature can be different from last time and thus this node who just caught up could end up being the proposer and respond() again.
Of course the range of sign requests will be limited to whatever lag threshold we allow in this PR.

@ChaoticTempest
Copy link
Member

wait, we don't need to keep track of the block height from rpc vs our current block height from indexer, we can just use the block timestamp to check how far behind we are in comparison with our current time. So we don't need to add this fetching of the block from RPC which can also be delayed by a couple seconds

When doing this PR, I actually realized we only need to proposer to be stable. This is because 1) proposer is the one starting the signature protocol, other nodes just joining and they don't check if they have that request locally; 2) proposer is deterministic for each sign request, so there's no risk of unstable nodes later re-starting protocol for the same signature request, because it will only be that same proposer always. So by limiting lag from rpc latest block to 200, we make sure proposers will see the sign request within 3.5 mins, and have another 1 min to respond(). If that lag is too big, that node will not make it in time anyways. [100-200] might all be good.

Even less strict -- it doesn't have to be the proposer, it can be any stable participant because it doesn't matter who responds with the signature.

I also realized with our current implementation, when a node catches up on heights, it is likely to start protocols to generate signature for an old sign request, because the stable participants set change, and if you look at logic here:

let subset = stable.keys().choose_multiple(&mut rng, threshold);

The subset and proposer for the signature can be different from last time and thus this node who just caught up could end up being the proposer and respond() again.
Of course the range of sign requests will be limited to whatever lag threshold we allow in this PR.

Yeah, we can just reject a signature request ourselves based on our threshold timing with the block timestamp

@ppca
Copy link
Contributor Author

ppca commented Oct 17, 2024

we can just use the block timestamp to check how far behind we are in comparison with our current time

That's cool! How can I do that? @ChaoticTempest

@ChaoticTempest
Copy link
Member

@ppca in handle_block, it should be block.header().timestamp_nanosec()

@ppca ppca changed the title modify is_stable to be indexer progressing and height caught up modify is_stable to be indexer running and latest block timestamp not behind Oct 30, 2024
@volovyks volovyks merged commit 922642c into develop Oct 30, 2024
3 checks passed
@volovyks volovyks deleted the xiangyi/modify_is_stable branch October 30, 2024 10:08
@ppca ppca linked an issue Oct 30, 2024 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

is_stable is true for nodes with lagging block heights
3 participants