Implement quiescence protocol #3588
Conversation
Codecov Report

Attention: Patch coverage is

Additional details and impacted files

@@            Coverage Diff            @@
##              main    #3588    +/-   ##
=========================================
  Coverage    88.56%   88.56%
=========================================
  Files          149      150       +1
  Lines       117091   117925     +834
  Branches    117091   117925     +834
=========================================
+ Hits        103699   104446     +747
- Misses       10876    10954      +78
- Partials      2516     2525       +9
=========================================

☔ View full report in Codecov by Sentry.
This no longer depends on #3573. I've included an alternative approach here that @TheBlueMatt and I discussed offline.
let peer_state_mutex = per_peer_state.get(counterparty_node_id).ok_or_else(|| {
	debug_assert!(false);
	MsgHandleErrInternal::send_err_msg_no_close(
		format!("Can't find a peer matching the passed counterparty node_id {}", counterparty_node_id),
probably just my reading preference, but we could use `format!("Can't find a peer matching the passed counterparty node_id {counterparty_node_id}")`,
LGTM, I left some comments along the way
ACK 047ddd3
Thanks! How does this interact with async signing? It seems like it should consider async signing pending cases in the decision to send stfu/move to quiescent, no?
Yeah, should just need to also check for the signer-pending states.
@TheBlueMatt addressed all your comments, should be ready for review again. Since things changed a good bit, I started fuzzing the new HEAD.
|| self.monitor_pending_revoke_and_ack || self.signer_pending_revoke_and_ack
|| self.monitor_pending_commitment_signed || self.signer_pending_commitment_update |
Ah, I realize these maybe don't make sense to consider in the disconnect case, but do make sense to consider in the quiescence case.
@@ -6960,6 +7096,14 @@ impl<SP: Deref> FundedChannel<SP> where
	}
	assert!(!matches!(self.context.channel_state, ChannelState::ShutdownComplete));

	// TODO: The spec is pretty vague regarding the handling of shutdown within quiescence.
	if self.context.channel_state.is_local_stfu_sent()
We might have sent the stfu but they sent the shutdown before they received it. I don't think it makes sense to check this.
We're still disconnecting so it should be fine? They can retry the shutdown after reconnecting since it'll be like the stfu never happened (unless it's also resent, of course). This is really just a placeholder until we decide what to do at the spec level anyway.
I feel like that might just get us into a loop instead - reconnect, we resend stfu, they resend shutdown, disconnect, ... IMO we should handle this instead: abort the stfu/thing we want to do and just shutdown.
Yeah, would be nice to avoid that. Maybe we leave this as follow-up work then? We don't have retrying logic yet, and I'd hate to have to implement all of this, just for it to be undone because the spec chose to go in a different direction.
if !self.context.is_live() {
	return Err(ChannelError::Ignore("Channel is not in a live state to propose quiescence".to_owned()));
}
if self.context.channel_state.is_quiescent() { |
Shouldn't we also fail if the counterparty has sent stfu (as that means they'll get the talking stick after we went quiescent, but the caller almost certainly wanted to go quiescent so that they can get the talking stick and do something)?
Absolutely. I was planning to leave it for follow-up work when integrating with splicing since we'd have a better idea how `propose_quiescence` should be designed when not used in a test-only scenario. We have to continue retrying (up to some timeout?) `propose_quiescence` until we're the initiator to get a locally requested splice through.
> Absolutely. I was planning to leave it for follow-up work when integrating with splicing since we'd have a better idea how `propose_quiescence` should be designed when not used in a test-only scenario.

Seemed like we have a pretty reasonable idea already, no?

> We have to continue retrying (up to some timeout?) `propose_quiescence` until we're the initiator to get a locally requested splice through.

If we want to retry quiescence until it works we need a lot more tracking and logic in a few places, including probably new states, no? I don't think the current code is conducive to that at all and was assuming we wouldn't (or would do it at a higher level)? For splice specifically we can add our stuff to the splice the counterparty initiates too so there would presumably be no need.
> Seemed like we have a pretty reasonable idea already, no?

Well that was just one idea. I haven't given it too much thought, and trying to do it all in one PR seems overkill.

> If we want to retry quiescence until it works we need a lot more tracking and logic in a few places, including probably new states, no? I don't think the current code is conducive to that at all and was assuming we wouldn't (or would do it at a higher level)?

Correct, this PR just serves as a base implementation of the protocol. The need to retry only becomes relevant when trying to carry out a user's request to do "something fundamental" (e.g., a splice). All the tracking and logic would operate at a higher level, unless you're proposing it shouldn't?

> For splice specifically we can add our stuff to the splice the counterparty initiates too so there would presumably be no need.

Sure, if that opportunity is available we should take advantage of it, but we can't always rely on it.
> Correct, this PR just serves as a base implementation of the protocol. The need to retry only becomes relevant when trying to carry out a user's request to do "something fundamental" (e.g., a splice). All the tracking and logic would operate at a higher level, unless you're proposing it shouldn't?

Sure, that's fine, but if we're gonna do retries at a higher level then we probably want to go ahead and write the API here to be fallible and let the higher level do retries :p
Let's just change it when we get there? This is just a test helper for now, really.
I'm gonna forget by then, but okay 🤷♂️
If we do forget, shipping retries with zero tests at all (which would immediately catch this) would be pretty irresponsible...
Made it through the first three commits. Looks good so far. Will continue reviewing, but figured I'd send out some minor comments in the meantime.
The existing `ChannelError::Warn` variant only sends the warning and does not disconnect. There are certain cases where we want to just send a warning, and other cases where we want to also disconnect, so we keep both variants around.
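For illustration, a minimal sketch of the variant split this describes; the `WarnAndDisconnect` name and the plain `String` payloads are assumptions, not necessarily the exact shape added in the PR:

```rust
// Hedged sketch only: variant names and payloads are assumptions.
enum ChannelError {
	// Log and otherwise ignore the error locally.
	Ignore(String),
	// Send a `warning` message to the peer, but keep the connection open.
	Warn(String),
	// Send a `warning` message to the peer, then disconnect them.
	WarnAndDisconnect(String),
	// Send an `error` message and force-close the channel.
	Close(String),
}
```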
Quiescence is a new protocol feature that allows for channels to undergo "fundamental" changes (i.e., protocol upgrade) while there are no pending updates on either side. Its first use case will be to carry out channel splices, to ensure new HTLC/fee updates are not made while a splice is being negotiated.

Each side of the channel is allowed to send a `stfu` message if any of their outbound updates are not pending for either side (i.e., irrevocably committed on both commitment transactions). Once both sides exchange `stfu`, the channel becomes quiescent. A message timeout is enforced during the quiescence handshake to ensure we can eventually re-establish the channel and propose new HTLC/fee updates again.

Several new state flags have been added to `ChannelState::ChannelReady` to track the progress of the quiescence handshake. Once the channel becomes quiescent, all flags related to the handshake are cleared, and the `QUIESCENT` flag is enabled. While quiescence is not a persistent protocol (it implicitly terminates upon peer disconnection), and updates cannot be made, we still need to track `MONITOR_UPDATE_IN_PROGRESS` as it may be required by the quiescence-dependent protocol, like in the case of splicing.
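As a rough illustration of that state tracking, here is a hedged sketch of how such `ChannelReady` flags might look; apart from `QUIESCENT` and `MONITOR_UPDATE_IN_PROGRESS`, which the description names, the flag names and bit values below are assumptions:

```rust
// Hedged sketch; not the PR's exact flag set or bit assignments.
#[derive(Clone, Copy, Default)]
struct ChannelReadyFlags(u32);

impl ChannelReadyFlags {
	// A `ChannelMonitorUpdate` is in flight; still tracked while quiescent.
	const MONITOR_UPDATE_IN_PROGRESS: u32 = 1 << 0;
	// We sent `stfu` and are waiting on the counterparty's (assumed name).
	const LOCAL_STFU_SENT: u32 = 1 << 1;
	// The counterparty sent `stfu` and is waiting on ours (assumed name).
	const REMOTE_STFU_SENT: u32 = 1 << 2;
	// Both sides have exchanged `stfu`; the channel is quiescent.
	const QUIESCENT: u32 = 1 << 3;

	fn is_local_stfu_sent(&self) -> bool { self.0 & Self::LOCAL_STFU_SENT != 0 }
	fn is_quiescent(&self) -> bool { self.0 & Self::QUIESCENT != 0 }

	// Once both `stfu` messages are exchanged, clear the handshake flags and
	// set `QUIESCENT`, leaving any monitor-update flag untouched.
	fn mark_quiescent(&mut self) {
		self.0 &= !(Self::LOCAL_STFU_SENT | Self::REMOTE_STFU_SENT);
		self.0 |= Self::QUIESCENT;
	}
}
```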
We previously would avoid freeing our holding cells upon a `revoke_and_ack` if a monitor update was in progress, which we checked explicitly. With quiescence, if we've already sent `stfu`, we're not allowed to make further commitment updates, so we must also avoid freeing our holding cells in such cases. Along the way, we also remove the special handling of in-progress monitor updates now that it behaves the same as the handling of being quiescent.
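A minimal sketch of the gating condition this describes; the function and parameter names are illustrative, not the actual fields on the channel:

```rust
// Hedged sketch: names are assumptions; shows only the shape of the check.
fn can_free_holding_cell(
	monitor_update_in_progress: bool,
	local_stfu_sent: bool,
	quiescent: bool,
) -> bool {
	// New commitment updates are not allowed while a monitor update is pending
	// or once we've sent `stfu`/become quiescent, so the holding cell must
	// stay untouched in those states.
	!(monitor_update_in_progress || local_stfu_sent || quiescent)
}
```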
With the introduction of `has_pending_channel_update`, we can now determine whether any messages are owed to irrevocably commit HTLC updates based on the current channel state. We prefer using the channel state, over manually tracking as previously done, to have a single source of truth. We also gain the ability to expect to receive multiple messages at once, which will become relevant with the quiescence protocol, where we may be waiting on a counterparty `revoke_and_ack` and `stfu`.
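For illustration, a hedged sketch of what a state-derived check like this could look like; the struct and field names below are assumptions chosen to mirror the description, not the actual channel state accessors:

```rust
// Hedged sketch; derives "messages still owed/expected" purely from channel
// state rather than from separately tracked booleans.
struct PendingUpdateView {
	// We owe a `revoke_and_ack` for the counterparty's last `commitment_signed`.
	owe_revoke_and_ack: bool,
	// We owe a `commitment_signed` for updates we have sent.
	owe_commitment_signed: bool,
	// We are still waiting on the counterparty's `revoke_and_ack`.
	expect_revoke_and_ack: bool,
}

impl PendingUpdateView {
	// True while some HTLC/fee update is not yet irrevocably committed on both
	// commitment transactions, i.e., while messages are still owed or expected.
	fn has_pending_channel_update(&self) -> bool {
		self.owe_revoke_and_ack || self.owe_commitment_signed || self.expect_revoke_and_ack
	}
}
```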
Since new updates are not allowed during quiescence (local updates enter the holding cell), we want to ensure quiescence eventually terminates if the handshake takes too long or our counterparty is uncooperative. Disconnecting implicitly terminates quiescence, so the holding cell can be freed upon re-establishing the channel (assuming quiescence is not requested again).
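A rough sketch of that timeout behaviour; the constant, counter, and method names are assumptions, and the real implementation would hook into the existing periodic timer rather than a standalone struct:

```rust
// Hedged sketch of disconnect-on-stalled-quiescence; names and values assumed.
const QUIESCENCE_TIMEOUT_TICKS: usize = 2;

struct QuiescenceTracker {
	ticks_waiting: usize,
}

impl QuiescenceTracker {
	// Called on each periodic timer tick; returns true if we should disconnect
	// the peer. Disconnecting implicitly terminates quiescence, so the holding
	// cell can be freed once the channel is re-established.
	fn on_timer_tick(&mut self, handshake_or_quiescence_active: bool) -> bool {
		if !handshake_or_quiescence_active {
			self.ticks_waiting = 0;
			return false;
		}
		self.ticks_waiting += 1;
		self.ticks_waiting >= QUIESCENCE_TIMEOUT_TICKS
	}
}
```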
Rebased due to conflicts and added some more comments to the test you requested @jkczyz.
LGTM