SIMD-0046: Optimistic cluster restart automation #46

Merged
merged 128 commits into from
Jan 14, 2025

wen-coding
Contributor

No description provided.

@wen-coding wen-coding marked this pull request as draft April 10, 2023 16:06
@wen-coding wen-coding marked this pull request as ready for review April 10, 2023 16:06
@wen-coding wen-coding marked this pull request as draft April 10, 2023 16:07
proposals/0024-repair-and-restart.md (outdated, resolved)

So after a validator sees that 75% of the validators received 75% of the votes,
it waits 10 more minutes so that the messages it sent out have propagated, then
restarts from the Heaviest slot everyone agreed on.
Contributor

From our last call I was thinking once each validator has figured out the heaviest fork and repaired up to the highest oc slot, the validator would:

  1. Issue a "hard fork" at the highest oc slot, which also changes the gossip shred version
  2. Execute the existing "--wait-for-supermajority" logic (ie, purge all slots above the highest oc slot, wait for 80%)

Contributor Author

Changed. I think we should probably wait for 75% here, because we assume 5% could be non-conforming.

proposals/0024-repair-and-restart.md (outdated, resolved)

We calculate "enough" stake as follows. When 80% of validators join the
restart, and assuming 5% of the restarted validators can make mistakes in voting,
any block with more than 67% - 5% - (100-80)% = 42% could potentially be

I don't understand this calculation.
What if the other 100% - 42% = 58% pick some other block?
Why should the minority 42% block be optimistically confirmed?

Why should a block with less than 67% of the vote ever be optimistically confirmed?

Contributor Author

The goal here is to prevent false negatives (if a slot was oc'ed before the restart, you must pick it here), not to prevent false positives (it's okay if we pick a slot here that wasn't oc'ed), because when we select Heaviest later we should see the competing fork and count the votes accordingly.
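
The arithmetic behind the 42% figure can be written out as a small sketch. The function and parameter names below are hypothetical and do not come from the proposal; the 67% optimistic-confirmation threshold, the 5% allowance for mistaken votes, and the 80% restart participation are the values quoted in this thread.

```python
def min_possible_oc_stake(stake_in_restart_pct: int,
                          oc_threshold_pct: int = 67,
                          non_conforming_pct: int = 5) -> int:
    """Observed vote percentage above which a block might have been
    optimistically confirmed before the restart (hypothetical helper)."""
    # Stake that is not participating in the restart could all have voted
    # for the block without us seeing those votes.
    stake_not_in_restart_pct = 100 - stake_in_restart_pct
    # Some restarted validators may also report their votes incorrectly.
    return oc_threshold_pct - non_conforming_pct - stake_not_in_restart_pct


# With 80% of stake in the restart: 67 - 5 - (100 - 80) = 42
print(min_possible_oc_stake(80))
```

Under these assumptions the bound is deliberately loose: it may flag blocks that were never oc'ed (a false positive), but it should not miss one that was.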

Contributor

Probably worth adding the motivation and justification for these values to the document.

Contributor Author

Added.

Comment on lines 119 to 122
2.1 If vote_on_child + stake_on_validators_not_in_restart >= 62%, pick child.
For example, if 80% of validators are in restart and child has 42% of the votes,
then 42% + (100-80)% = 62%, so pick child. 62% is chosen instead of 67% because
5% could cast the wrong votes.
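
Rule 2.1 as quoted above can be sketched as a single comparison. This is only an illustration of the quoted text, not code from the proposal or from any validator implementation; `should_pick_child` and its parameter names are hypothetical.

```python
def should_pick_child(vote_on_child_pct: int,
                      stake_in_restart_pct: int,
                      threshold_pct: int = 62) -> bool:
    """Return True if the child block should be picked under rule 2.1
    (hypothetical helper; percentages are whole numbers of total stake)."""
    # Validators outside the restart are assumed, pessimistically, to have
    # all voted for the child.
    stake_on_validators_not_in_restart = 100 - stake_in_restart_pct
    return vote_on_child_pct + stake_on_validators_not_in_restart >= threshold_pct


# The example from the quoted text: 42 + (100 - 80) = 62, so pick child.
print(should_pick_child(42, 80))
# One point lower falls short of the 62% threshold.
print(should_pick_child(41, 80))
```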

Similarly here, I am not sure why it is safe to go below 67%.

Contributor Author

The goal here is to prevent false negatives at all costs; false positives are okay. Let's say X is the first block with only 62% rather than 67%. We know that if 75% of the validators decide to pick this fork, it will be instantly oc'ed, and we won't kick another oc'ed slot out. Does that make sense?

Contributor

Similarly, add the motivation and justification to the doc.

Contributor Author

Added.

@mvines mvines marked this pull request as ready for review May 2, 2023 17:05
proposals/0024-repair-and-restart.md (outdated, resolved)
@wen-coding wen-coding requested a review from carllin December 11, 2024 19:38
@wen-coding
Contributor Author

@t-nelson Want to give the oldest open SIMD another look?

@Benhawkins18 Benhawkins18 self-requested a review January 8, 2025 17:10
@Benhawkins18 Benhawkins18 merged commit 7dbb6c3 into solana-foundation:main Jan 14, 2025
2 checks passed
@wen-coding wen-coding deleted the smart-restart-proposal branch January 14, 2025 17:04
@0xOsprey

Congrats on getting this over the line and thanks to those who contributed. Solana Mainnet Beta™ will be safer for it 🫡

Labels: core (Standard SIMD with type Core), standard (SIMD in the Standard category)