Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[EFM] Update protocol.DKG to use IndexMap #6338

Merged
merged 50 commits into from
Sep 13, 2024

Conversation

durkmurder
Copy link
Member

@durkmurder durkmurder commented Aug 14, 2024

#6321

Context

This PR makes modifications to the current code base by using EpochCommit to implement the protocol.DKG interface.
Most of the changes are centered around usage of IndexMap in places where we previously used epoch participants from the setup event.

Additionally made another attempt to fix the fuzzy test, added test data to check if the issue is reproducible.

@codecov-commenter
Copy link

codecov-commenter commented Aug 14, 2024

Codecov Report

Attention: Patch coverage is 34.18803% with 77 lines in your changes missing coverage. Please review.

Project coverage is 41.51%. Comparing base (ae48e1b) to head (0ca2400).

Files with missing lines Patch % Lines
consensus/hotstuff/mocks/dkg.go 0.00% 18 Missing ⚠️
state/protocol/mock/dkg.go 0.00% 18 Missing ⚠️
cmd/bootstrap/run/qc.go 7.14% 13 Missing ⚠️
utils/unittest/fixtures.go 0.00% 12 Missing ⚠️
state/protocol/inmem/dkg.go 56.25% 6 Missing and 1 partial ⚠️
consensus/hotstuff/committees/static.go 0.00% 5 Missing ⚠️
integration/testnet/network.go 80.00% 2 Missing ⚠️
state/protocol/badger/state.go 0.00% 1 Missing ⚠️
state/protocol/inmem/epoch.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@                   Coverage Diff                    @@
##           feature/efm-recovery    #6338      +/-   ##
========================================================
- Coverage                 41.52%   41.51%   -0.02%     
========================================================
  Files                      2012     2012              
  Lines                    143350   143296      -54     
========================================================
- Hits                      59530    59490      -40     
- Misses                    77625    77631       +6     
+ Partials                   6195     6175      -20     
Flag Coverage Δ
unittests 41.51% <34.18%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@durkmurder durkmurder marked this pull request as ready for review August 15, 2024 18:39
@durkmurder durkmurder requested review from jordanschalm, kc1116 and AlexHentschel and removed request for zhangchiqing August 15, 2024 18:39
Copy link
Member

@jordanschalm jordanschalm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote up the reasons I disagree with some of the new validity checks. We can discuss those here and make sure we're on the same page prior to merging.

Other than that, looks great.

if len(participants) != len(commit.DKGParticipantKeys) {
return NewInvalidServiceEventErrorf("participant list (len=%d) does not match dkg key list (len=%d)", len(participants), len(commit.DKGParticipantKeys))
// at max DKG participant count is equal to the number of participants,
// but it can be less if some participants took part in the DKG but are not part of the committee anymore(eg. ejected)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree with two of the checks here:

  • len(participants) <= len(commit.DKGParticipantKeys)
  • all nodes in setup.Participants.Filter(filter.IsValidDKGParticipant) have a DKG index

From the issue, there are 2 additional cases we want to be able to represent:

  • (1) there is a node that participated in the DKG, but isn't in the committee
  • (2) there is a node that is in the committee, but didn't participate in the DKG

Each of the problematic checks will cause us to not accept DKG models where property (2) is true.

Why would we want to allow EpochCommit events satisfying property (2)?

During epoch recovery, the EpochRecover event contains:

  • a copy of the most recent, successful DKG (final epoch prior to entering EFM)
  • an identity table consistent with the current staking state (we should exclude nodes which have unstaked, and include nodes which have staked)

Suppose the last successful DKG occurred in Epoch 10, and 100 consensus nodes participated. In Epoch 11, 5 consensus nodes staked to join the network. Also in Epoch 11, the network enters EFM.

When we recover, we will have:

  • a DKG with 100 participants (commit.DKGParticipantKeys)
  • an identity table (participants) with 105 consensus nodes (105 nodes eligible to participate in DKGs)

In my opinion, this is a valid situation that we should allow. But the resulting EpochRecover would fail the two checks here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

7298d52 (#6338)

I have removed those two checks and adjusted tests. Now we should support any combination of participants and DKG data.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I finally got to the bottom of this PR to contribute to this conversation: please see my comment below

state/protocol/dkg.go Outdated Show resolved Hide resolved
state/protocol/inmem/dkg.go Outdated Show resolved Hide resolved
state/protocol/inmem/dkg.go Outdated Show resolved Hide resolved
state/protocol/inmem/dkg.go Outdated Show resolved Hide resolved
state/protocol/inmem/dkg.go Outdated Show resolved Hide resolved
state/protocol/inmem/dkg.go Show resolved Hide resolved
state/protocol/inmem/dkg.go Outdated Show resolved Hide resolved
state/protocol/inmem/dkg.go Outdated Show resolved Hide resolved
state/protocol/inmem/dkg.go Outdated Show resolved Hide resolved
state/protocol/inmem/dkg.go Outdated Show resolved Hide resolved
consensus/hotstuff/committees/static.go Outdated Show resolved Hide resolved
model/flow/epoch.go Outdated Show resolved Hide resolved
cmd/bootstrap/run/qc.go Outdated Show resolved Hide resolved
cmd/bootstrap/run/qc.go Show resolved Hide resolved
cmd/bootstrap/run/qc.go Show resolved Hide resolved
cmd/bootstrap/run/qc.go Show resolved Hide resolved
Copy link
Member

@AlexHentschel AlexHentschel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice work: code is clear, consistent and well tested. Thank you.

From my perspective, there are two aspects:

state/protocol/inmem/dkg.go Outdated Show resolved Hide resolved
state/protocol/validity.go Outdated Show resolved Hide resolved
state/protocol/validity_test.go Outdated Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the function IsValidEpochCommit

I think it implements the absolutely necessary consistency checks, which is great. Though, I am concerned that there is no check enforcing sufficient overlap between the DKG committee 𝒟 and the consensus committee 𝒞. Only consensus committee members can vote and a sufficient subset of them must have Random beacon key shares - otherwise consensus cannot progress:

  • We define the random beacon committee $ℛ$ as the subset of the consensus nodes, which successfully completed the DKG. Hence, $ℛ \subseteq 𝒞$. (see documentation of flow.DKGIndexMap for further details)

  • Note that there might be nodes $f$ such that $f \in 𝒞$ and $f \in 𝒟$, which failed the DKG. In other words, despite $f \in 𝒞 \cap 𝒟$ it might still be the case that $f \notin ℛ$. Hence, $ℛ \subseteq 𝒞 \cap 𝒟$.

  • At the absolute minimum, we require signature.RandomBeaconThreshold nodes in $ℛ$:

    (i) Formally, signature.RandomBeaconThreshold $\leq |ℛ| \leq | 𝒞 \cap 𝒟~ |$

  • Note that (i) is the absolute lower bound (it assumed there is no node $𝒞 \cap 𝒟$ that failed the DKG) where even a single offline node could halt consensus❗ This is not practical for a production network.

  • Tarak has heuristically derived more practically-reliable bounds - for details, please see DKG contract success threshold (notion). Specifically, I would be inclined to reject the input EpochCommit as invalid unless it satisfies the heuristical bound

    (ii) 0.6·signature.RandomBeaconThreshold $\leq | 𝒞 \cap 𝒟~ |$ with

    • $𝒟$ is the ket set EpochCommit.DKGIndexMap
    • $𝒞$ is the EpochSetup.Participants.Filter(filter.IsConsensusCommitteeMember)

@jordanschalm what do you think about enforcing the bound (ii)? At least I think we should enforce the less strict bound (i) with is necessary for consensus progress. In comparison, bound (ii) is heuristically sufficient for consensus progress under non-extreme conditions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AlexHentschel @jordanschalm Either way we will implement some sort of check so I went with (ii) ec92c60 (#6338), let me know if you want it updated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I'm onboard with adding the suggested checks. But I'm confused by a couple things 🤔👇

Either way we will implement some sort of check so I went with (ii)

The comment says that we are implementing (ii), but the implementation is actually implementing (i).

We are using this code, which is equivalent to (i):

if signature.RandomBeaconThreshold(n) > numberOfRandomBeaconParticipants { ... }

(ii) 0.6·signature.RandomBeaconThreshold ≤|𝒞∩𝒟|

I'm reading this as 0.6 multiplied by signature.RandomBeaconThreshold, in other words 60% of the threshold value (making the threshold lower!). But it should be (0.6)(𝒟) -- 60% of the DKG committee size.

Also, the value we use on the smart contract is 0.7, or 70%, which covers the worst-case scenario. We may as well use the same value here.

 % flow scripts execute transactions/dkg/scripts/get_thresholds.cdc -n mainnet

Result: s.56c14e18f445f3a648d2455303c75a984cd93be3d4530a65fa17a6e4e1f38e07.Thresholds(native: 42, safe: 59, safePercentage: 0.70000000)

Suggestions

  • Use 70% of the DKG committee size as the safety threshold (define a RandomBeaconSafetyThreshold constant/function alongside signature.RandomBeaconThreshold that documents what this separate threshold means)
  • It would be ideal to include this new check when we generate the transaction arguments, so that any failures are caught much sooner and are easier to deal with. If it's non-trivial to add there, though, I'm fine to skip this suggestion, since the likelihood of this failure is low, and it will be caught eventually.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AlexHentschel @jordanschalm let me know if you have any concerns about godoc 790b016 (#6338), can address in separate PR if needs changes.

durkmurder and others added 7 commits September 10, 2024 17:35
[EFM] Extended documentation where generalized protocol logic meets bootstrapping implementation
Co-authored-by: Alexander Hentschel <[email protected]>
Co-authored-by: Alexander Hentschel <[email protected]>
Co-authored-by: Alexander Hentschel <[email protected]>
cmd/bootstrap/run/qc.go Outdated Show resolved Hide resolved
cmd/bootstrap/run/qc.go Outdated Show resolved Hide resolved
cmd/bootstrap/run/qc.go Outdated Show resolved Hide resolved
cmd/bootstrap/run/qc.go Outdated Show resolved Hide resolved
model/flow/dkg.go Outdated Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I'm onboard with adding the suggested checks. But I'm confused by a couple things 🤔👇

Either way we will implement some sort of check so I went with (ii)

The comment says that we are implementing (ii), but the implementation is actually implementing (i).

We are using this code, which is equivalent to (i):

if signature.RandomBeaconThreshold(n) > numberOfRandomBeaconParticipants { ... }

(ii) 0.6·signature.RandomBeaconThreshold ≤|𝒞∩𝒟|

I'm reading this as 0.6 multiplied by signature.RandomBeaconThreshold, in other words 60% of the threshold value (making the threshold lower!). But it should be (0.6)(𝒟) -- 60% of the DKG committee size.

Also, the value we use on the smart contract is 0.7, or 70%, which covers the worst-case scenario. We may as well use the same value here.

 % flow scripts execute transactions/dkg/scripts/get_thresholds.cdc -n mainnet

Result: s.56c14e18f445f3a648d2455303c75a984cd93be3d4530a65fa17a6e4e1f38e07.Thresholds(native: 42, safe: 59, safePercentage: 0.70000000)

Suggestions

  • Use 70% of the DKG committee size as the safety threshold (define a RandomBeaconSafetyThreshold constant/function alongside signature.RandomBeaconThreshold that documents what this separate threshold means)
  • It would be ideal to include this new check when we generate the transaction arguments, so that any failures are caught much sooner and are easier to deal with. If it's non-trivial to add there, though, I'm fine to skip this suggestion, since the likelihood of this failure is low, and it will be caught eventually.

@durkmurder durkmurder merged commit 80eea48 into feature/efm-recovery Sep 13, 2024
54 of 55 checks passed
@durkmurder durkmurder deleted the yurii/6321-update-DKG-implementors branch September 13, 2024 16:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants