Tenure extend when the previous tenure is bad #5452

Open · wants to merge 20 commits into base: develop

Conversation

obycode
Contributor

@obycode obycode commented Nov 12, 2024

Description

Updates the miner to attempt to continue its tenure if the winner of the next tenure commits to the wrong block or if it mines no blocks within some grace period.
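Roughly speaking, the first of those cases comes down to a check the relayer makes when it processes the new sortition. The sketch below is illustrative only: every name in it (NewSortition, should_attempt_tenure_extend, and so on) is a stand-in invented for this description, not an identifier from this PR.

// Illustrative sketch only; all names are hypothetical stand-ins.

/// Stand-in for the hash of the Stacks block a block-commit builds on.
#[derive(PartialEq, Eq)]
pub struct BlockHash(pub [u8; 32]);

pub struct NewSortition {
    /// Whether the sortition elected a winner at all.
    pub has_winner: bool,
    /// The parent Stacks block that the winning block-commit pointed at.
    pub committed_parent: BlockHash,
}

/// The previous miner keeps extending its tenure when the new sortition has no
/// winner, or when the winner committed to a parent other than our canonical
/// tip (accepting that tenure would orphan the blocks we just mined).
pub fn should_attempt_tenure_extend(sortition: &NewSortition, our_tip: &BlockHash) -> bool {
    !sortition.has_winner || sortition.committed_parent != *our_tip
}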

Applicable issues

Additional info (benefits, drawbacks, caveats)

Checklist

  • Test coverage for new or modified code paths
  • Changelog is updated
  • Required documentation changes (e.g., docs/rpc/openapi.yaml and rpc-endpoints.md for v2 endpoints, event-dispatcher.md for new events)
  • New clarity functions have corresponding PR in clarity-benchmarking repo
  • New integration test(s) added to bitcoin-tests.yml

@obycode
Contributor Author

obycode commented Nov 13, 2024

Ok, the case where the winning miner has committed to the wrong parent tenure is complete. Now I just need to figure out a clean way to attempt a tenure extend after some time has passed and the new miner has not mined a block.
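One possible shape for that, purely as a sketch and not necessarily what this PR will end up doing: record when the new miner's sortition was processed and, on each pass through the relayer's main loop, trigger a tenure extend once a configurable grace period has elapsed with no block from that miner. The names below are hypothetical.

// Hypothetical sketch of a grace-period check; names and structure are
// invented for illustration and are not identifiers from this repo.
use std::time::{Duration, Instant};

pub struct PendingWinner {
    /// When we processed the sortition that elected the new miner.
    pub elected_at: Instant,
    /// Whether the new miner has broadcast any block yet.
    pub produced_block: bool,
}

/// True once the new miner has been silent for longer than `grace`, at which
/// point the previous miner would attempt to issue a tenure extend.
pub fn winner_timed_out(pending: &PendingWinner, grace: Duration) -> bool {
    !pending.produced_block && pending.elected_at.elapsed() > grace
}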

@obycode
Contributor Author

obycode commented Nov 14, 2024

I discovered a problem that is unrelated to these changes but becomes a challenge because of them.

pub static STACKER_DB_CHANNEL: StackerDBChannel = StackerDBChannel::new();
/// This struct receives StackerDB event callbacks without registering
/// over the JSON/RPC interface. To ensure that any event observer
/// uses the same channel, we use a lazy_static global for the channel (this
/// implements a singleton using STACKER_DB_CHANNEL).
///
/// This is in place because a Nakamoto miner needs to receive
/// StackerDB events. It could either poll the database (seems like a
/// bad idea) or listen for events. Registering for RPC callbacks
/// seems bad. So instead, it uses a singleton sync channel.
pub struct StackerDBChannel {
    sender_info: Mutex<Option<InnerStackerDBChannel>>,
}

This global becomes a problem when we have tests that run multiple miners. If multiple miners attempt to propose a block at the same time, they interfere with each other, because every miner in the process shares the same global channel.

@kantai
Member

kantai commented Nov 14, 2024

This global becomes a problem when we have tests that run multiple miners. If multiple miners attempt to propose a block at the same time, they interfere with each other, because every miner in the process shares the same global channel.

The only clear option to me is to make this an Arc+Mutex instance variable in the EventObserver struct. This requires that the EventObserver only be constructed once and then cloned whenever copies are needed. I think that wouldn't be too hard (there's already a lot of passing around of the object, I think, rather than it being continuously reconstructed from the config).
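A minimal sketch of that shape, assuming the channel state moves into the observer instance behind an Arc; the field and type definitions below are simplified stand-ins, not the real EventObserver:

// Sketch only: the real EventObserver carries endpoint configuration and more.
// The point is that the channel becomes per-instance state shared via Arc on
// clone, rather than a process-wide static.
use std::sync::{mpsc, Arc, Mutex};

pub struct StackerDBEvent; // stand-in for the real event payload

pub struct InnerStackerDBChannel {
    pub sender: mpsc::Sender<StackerDBEvent>,
}

#[derive(Clone)]
pub struct EventObserver {
    // Every clone of this observer shares one channel, but two independently
    // constructed observers (e.g. two miners in one test process) do not.
    stackerdb_channel: Arc<Mutex<Option<InnerStackerDBChannel>>>,
    // ... endpoint URL, timeouts, etc. in the real struct
}

impl EventObserver {
    pub fn new() -> Self {
        Self {
            stackerdb_channel: Arc::new(Mutex::new(None)),
        }
    }
}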

The previous design using a global singleton causes trouble in testing,
when we have multiple miners running in different threads of the same
process.
@obycode obycode marked this pull request as ready for review November 14, 2024 19:28
@obycode obycode requested review from a team as code owners November 14, 2024 19:28
@obycode
Contributor Author

obycode commented Nov 14, 2024

This PR handles the tenure extend for the case where the next miner has committed to the wrong parent tenure. The other case mentioned in #5361, where a tenure extend is allowed after some time limit has passed with no blocks from the new miner, will be completed in a separate PR.

The test for that behavior is already in this PR, but it is not enabled yet.

@hstove
Contributor

hstove commented Nov 14, 2024

@obycode can we create a new issue for the timeout case, and then rename #5361 to be more about "if the new miner commits to the wrong block"? Or some derivation of that?

Contributor

@hstove hstove left a comment


Overall LGTM! Mainly requesting a change in the SignerDB migration and then a few extra assertions in your test.

stacks-signer/src/signerdb.rs (resolved)
let won_last_sortition =
    last_good_block_election_snapshot.miner_pk_hash == Some(mining_pkh);
info!(
    "Relayer: Current burn block had no sortition or a bad sortition. Checking for tenure continuation.";
Contributor

Is this true? Looking at line 962, in the case where sn.winning_stacks_block_hash == highest_tenure_bhh, wouldn't that mean the current burn block has a good sortition? Or does the fact that we're in continue_tenure mean that my assertion is wrong? Sorry, it's tricky to follow all the code paths here.

Contributor Author

Right, we're trying to find the last good sortition, given that the current one is already bad since we arrived in continue_tenure. We get the last snapshot with a sortition, but if we determine that is bad, then we go back to the sortition of the tip.
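In rough pseudocode, with placeholder names rather than the actual sortdb types and queries:

// Rough pseudocode of the fallback described above; all names are placeholders.

#[derive(Clone, PartialEq, Eq)]
pub struct BlockHash(pub [u8; 32]);

#[derive(Clone)]
pub struct Snapshot {
    /// Stacks block that this sortition's winning commit built on.
    pub winning_stacks_block: BlockHash,
}

/// Pick the "last good" election snapshot: the most recent sortition if its
/// winner built on the canonical tip, otherwise fall back to the (older)
/// sortition that elected the canonical tip itself.
pub fn last_good_election_snapshot(
    latest_sortition: Snapshot,
    tip_election_snapshot: Snapshot,
    canonical_tip: &BlockHash,
) -> Snapshot {
    if latest_sortition.winning_stacks_block == *canonical_tip {
        latest_sortition
    } else {
        tip_election_snapshot
    }
}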

"consensus_hash" => %burn_tip.consensus_hash,
);

SortitionDB::get_block_snapshot_consensus(
Contributor

At line 984 we run this exact query. With your change, is that always necessary? If it's only sometimes necessary, can we change 984 to something like:

let canonical_block_snapshot = if last_good_block_election_snapshot
    .canonical_stacks_tip_consensus_hash
    == canonical_stacks_tip_ch
{
    last_good_block_election_snapshot
} else {
    SortitionDB::get_block_snapshot_consensus(self.sortdb.conn(), &canonical_stacks_tip_ch)
        .map_err(|e| {
            error!("Relayer: failed to get block snapshot for canonical tip: {e:?}");
            NakamotoNodeError::SnapshotNotFoundForChainTip
        })?
        .ok_or_else(|| {
            error!("Relayer: failed to get block snapshot for canonical tip");
            NakamotoNodeError::SnapshotNotFoundForChainTip
        })?
};

Contributor Author

I do think it's still necessary. We can definitely reuse it here and avoid looking it up again.

Contributor Author

Hmm... maybe it's not. I'll look at this some more.

Contributor

IMO I would appreciate some ELI5(ish) comments on some of the sortdb lookups to explain why/what they're doing. There are already great comments in here; I think it's just tricky to follow when I have to constantly look up what a specific function does.

Contributor Author

Ok, I tried to clean this function up a bit. What do you think about this refactoring? 2178846

Contributor Author

I also improved the names and added more comments in another commit just now.

testnet/stacks-node/src/tests/signer/v0.rs (outdated, resolved)
testnet/stacks-node/src/tests/signer/v0.rs (resolved)
testnet/stacks-node/src/tests/signer/v0.rs (resolved)
@obycode
Contributor Author

obycode commented Nov 15, 2024

Next step will be to remove the tenure extend disable flag and instead update those forking tests to expect this new behavior. This will be lower priority though as I pivot to the design for #5434 before coming back to this.
