
Implement quiescence protocol #3588

Open · wpaulino wants to merge 7 commits into main

Conversation

@wpaulino (Contributor) commented Feb 4, 2025

Quiescence is a new protocol feature that allows for channels to undergo "fundamental" changes (i.e., protocol upgrade) while there are no pending updates on either side. Its first use case will be to carry out channel splices, to ensure new HTLC/fee updates are not made while a splice is being negotiated.

Each side of the channel is allowed to send a stfu message once none of its outbound updates are pending on either side (i.e., all updates are irrevocably committed on both commitment transactions). Once both sides have exchanged stfu, the channel becomes quiescent. A message timeout is enforced during the quiescence handshake so that we can eventually re-establish the channel and propose new HTLC/fee updates again.

Several new state flags have been added to ChannelState::ChannelReady to track the progress of the quiescence handshake. Once the channel becomes quiescent, all flags related to the handshake are cleared and the QUIESCENT flag is enabled. While quiescence is not a persistent protocol (it implicitly terminates upon peer disconnection) and no updates can be made during it, we still need to track MONITOR_UPDATE_IN_PROGRESS, as it may be required by a quiescence-dependent protocol such as splicing.
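
To make the handshake bookkeeping concrete, here is a minimal standalone sketch in Rust. It is illustrative only: the struct name, flag names, and bit positions below are assumptions and do not claim to match LDK's actual per-state flag definitions.

// Hedged sketch: a standalone model of the quiescence handshake flags described
// above. Names and bit positions are illustrative assumptions.
struct QuiescenceFlags(u32);

impl QuiescenceFlags {
    const LOCAL_STFU_SENT: u32 = 1 << 0; // we sent `stfu`, awaiting theirs
    const REMOTE_STFU_SENT: u32 = 1 << 1; // they sent `stfu`, we owe ours
    const QUIESCENT: u32 = 1 << 2; // both `stfu` messages exchanged

    fn is_set(&self, flag: u32) -> bool {
        self.0 & flag != 0
    }

    // Once both sides have sent `stfu`, clear the handshake flags and mark the
    // channel quiescent, mirroring the description above.
    fn maybe_enter_quiescence(&mut self) {
        if self.is_set(Self::LOCAL_STFU_SENT) && self.is_set(Self::REMOTE_STFU_SENT) {
            self.0 &= !(Self::LOCAL_STFU_SENT | Self::REMOTE_STFU_SENT);
            self.0 |= Self::QUIESCENT;
        }
    }
}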

codecov bot commented Feb 4, 2025

Codecov Report

Attention: Patch coverage is 87.95580% with 109 lines in your changes missing coverage. Please review.

Project coverage is 88.56%. Comparing base (6cf270d) to head (570ddae).

Files with missing lines               Patch %   Missing lines
lightning/src/ln/channelmanager.rs     70.22%    45 missing, 8 partials
lightning/src/ln/channel.rs            83.66%    47 missing, 3 partials
lightning/src/ln/quiescence_tests.rs   98.54%    6 missing
Additional details and impacted files
@@           Coverage Diff            @@
##             main    #3588    +/-   ##
========================================
  Coverage   88.56%   88.56%            
========================================
  Files         149      150     +1     
  Lines      117091   117925   +834     
  Branches   117091   117925   +834     
========================================
+ Hits       103699   104446   +747     
- Misses      10876    10954    +78     
- Partials     2516     2525     +9     


@wpaulino (Contributor, Author) commented Feb 7, 2025

This no longer depends on #3573. I've included an alternative approach here that @TheBlueMatt and I discussed offline.

@wpaulino added the "weekly goal" label (Someone wants to land this this week) on Feb 7, 2025
let peer_state_mutex = per_peer_state.get(counterparty_node_id).ok_or_else(|| {
	debug_assert!(false);
	MsgHandleErrInternal::send_err_msg_no_close(
		format!("Can't find a peer matching the passed counterparty node_id {}", counterparty_node_id),
Contributor:
Probably just my reading preference, but we could use format!("Can't find a peer matching the passed counterparty node_id {counterparty_node_id}").

@vincenzopalazzo (Contributor) left a review comment:
LGTM, I left some comments along the way

ACK 047ddd3

@TheBlueMatt (Collaborator) left a review comment:
Thanks! How does this interact with async signing? It seems like it should consider async signing pending cases in the decision to send stfu/move to quiescent, no?

@wpaulino (Contributor, Author) commented Feb 11, 2025

How does this interact with async signing? It seems like it should consider async signing pending cases in the decision to send stfu/move to quiescent, no?

Yeah, should just need to also check for signer_pending_revoke_and_ack/commitment_update in has_pending_channel_update. I'll add some test coverage as well.
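
For illustration, a standalone Rust sketch of the kind of check being discussed follows; the struct and field names are assumptions chosen so the example compiles on its own, and the real has_pending_channel_update on LDK's channel context may differ.

// Hedged sketch: model of "do we still owe or await any update-related message
// (including async-signing cases)?". Field names are illustrative assumptions.
struct ChannelUpdateStatus {
    monitor_pending_revoke_and_ack: bool,
    signer_pending_revoke_and_ack: bool,
    monitor_pending_commitment_signed: bool,
    signer_pending_commitment_update: bool,
    pending_update_count: usize, // updates not yet irrevocably committed
}

impl ChannelUpdateStatus {
    fn has_pending_channel_update(&self) -> bool {
        self.monitor_pending_revoke_and_ack
            || self.signer_pending_revoke_and_ack
            || self.monitor_pending_commitment_signed
            || self.signer_pending_commitment_update
            || self.pending_update_count > 0
    }
}

fn main() {
    let status = ChannelUpdateStatus {
        monitor_pending_revoke_and_ack: false,
        signer_pending_revoke_and_ack: true, // async signer still outstanding
        monitor_pending_commitment_signed: false,
        signer_pending_commitment_update: false,
        pending_update_count: 0,
    };
    // With an async-signing operation outstanding, we must not send `stfu` yet.
    assert!(status.has_pending_channel_update());
}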

@wpaulino (Contributor, Author) commented:
@TheBlueMatt addressed all your comments, should be ready for review again. Since things changed a good bit, I started fuzzing the new HEAD.

Comment on lines +2909 to +2977
|| self.monitor_pending_revoke_and_ack || self.signer_pending_revoke_and_ack
|| self.monitor_pending_commitment_signed || self.signer_pending_commitment_update
Collaborator:
Ah, I realize these maybe don't make sense to consider in the disconnect case, but they do make sense to consider in the quiescence case.

@@ -6960,6 +7096,14 @@ impl<SP: Deref> FundedChannel<SP> where
}
assert!(!matches!(self.context.channel_state, ChannelState::ShutdownComplete));

// TODO: The spec is pretty vague regarding the handling of shutdown within quiescence.
if self.context.channel_state.is_local_stfu_sent()
Collaborator:
We might have sent the stfu but they sent the shutdown before they received it. I don't think it makes sense to check this.

Contributor (author):
We're still disconnecting so it should be fine? They can retry the shutdown after reconnecting since it'll be like the stfu never happened (unless it's also resent of course). This is really just a placeholder until we decide what to do at the spec level anyway.

Collaborator:
I feel like that might just get us into a loop instead: reconnect, we resend stfu, they resend shutdown, disconnect, and so on. IMO we should handle this instead: abort the stfu/thing we want to do and just shutdown.

Contributor (author):
Yeah, would be nice to avoid that. Maybe we leave this as follow-up work then? We don't have retrying logic yet, and I'd hate to have to implement all of this, just for it to be undone because the spec chose to go in a different direction.

if !self.context.is_live() {
	return Err(ChannelError::Ignore("Channel is not in a live state to propose quiescence".to_owned()));
}
if self.context.channel_state.is_quiescent() {
Collaborator:
Shouldn't we also fail if the counterparty has sent stfu (as that means they'll get the talking stick after we went quiescent, but the caller almost certainly wanted to go quiescent so that they can get the talking stick and do something)?

Contributor (author):
Absolutely. I was planning to leave it for follow-up work when integrating with splicing, since we'd have a better idea of how propose_quiescence should be designed when not used in a test-only scenario. We'd have to continue retrying propose_quiescence (up to some timeout?) until we're the initiator to get a locally requested splice through.

@TheBlueMatt (Collaborator) commented Feb 13, 2025:
Absolutely. I was planning to leave it for follow-up work when integrating with splicing since we'd have a better idea how propose_quiescence should be designed when not used in a test-only scenario.

Seemed like we have a pretty reasonable idea already, no?

We have to continue retrying (up to some timeout?) propose_quiescence until we're the initiator to get a locally requested splice through.

If we want to retry quiescence until it works we need a lot more tracking and logic in a few places, including probably new states, no? I don't think the current code is conducive to that at all and was assuming we wouldn't (or would do it at a higher level)? For splice specifically we can add our stuff to the splice the counterparty initiates too so there would presumably be no need.

Contributor (author):
Seemed like we have a pretty reasonable idea already, no?

Well that was just one idea. I haven't given it too much thought, and trying to do it all in one PR seems overkill.

If we want to retry quiescence until it works we need a lot more tracking and logic in a few places, including probably new states, no? I don't think the current code is conducive to that at all and was assuming we wouldn't (or would do it at a higher level)?

Correct, this PR just serves as a base implementation of the protocol. The need to retry only becomes relevant when trying to carry out a user's request to do "something fundamental" (e.g., a splice). All the tracking and logic would operate at a higher level, unless you're proposing it shouldn't?

For splice specifically we can add our stuff to the splice the counterparty initiates too so there would presumably be no need.

Sure, if that opportunity is available we should take advantage of it, but we can't always rely on it.

Collaborator:
Correct, this PR just serves as a base implementation of the protocol. The need to retry only becomes relevant when trying to carry out a user's request to do "something fundamental" (e.g., a splice). All the tracking and logic would operate at a higher level, unless you're proposing it shouldn't?

Sure, that's fine, but if we're gonna do retries at a higher level then we probably want to go ahead and write the API here to be fallible and let the higher level do retries :p
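
As an aside, a fallible shape like the one being suggested might look roughly like the following sketch; the error variants, struct, and field names are illustrative assumptions, not LDK's actual API.

// Hedged sketch of a fallible `propose_quiescence`-style API. All names below
// are assumptions for illustration only.
struct ChannelQuiescenceState {
    is_live: bool,
    is_quiescent: bool,
    remote_stfu_sent: bool,
}

enum QuiescenceError {
    ChannelNotLive,
    AlreadyQuiescent,
    CounterpartyStfuSent, // they would hold the initiator role ("talking stick")
}

fn propose_quiescence(state: &ChannelQuiescenceState) -> Result<(), QuiescenceError> {
    if !state.is_live {
        return Err(QuiescenceError::ChannelNotLive);
    }
    if state.is_quiescent {
        return Err(QuiescenceError::AlreadyQuiescent);
    }
    if state.remote_stfu_sent {
        // Surface the failure and let a higher layer decide whether to retry.
        return Err(QuiescenceError::CounterpartyStfuSent);
    }
    Ok(())
}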

Contributor (author):
Let's just change it when we get there? This is just a test helper for now, really.

Collaborator:
I'm gonna forget by then, but okay 🤷‍♂️

Contributor (author):
If we do forget, shipping retries with zero tests at all (which would immediately catch this) would be pretty irresponsible...

@jkczyz (Contributor) left a review comment:
Made it through the first three commits. Looks good so far. Will continue reviewing but figured I'd send out some minor comments in the meantime.

Commit messages:

The existing `ChannelError::Warn` variant only sends the warning and
does not disconnect. There are certain cases where we want to just send
a warning, and other cases where we want to also disconnect, so we
keep both variants around.
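
As an illustration of the two behaviors this commit message distinguishes, a hedged sketch follows; `Warn` is an existing LDK variant, while the `WarnAndDisconnect` name and payload here are assumptions for illustration.

// Hedged sketch of the two warning behaviors described above.
enum ChannelError {
    // Send a `warning` message to the peer but keep the connection open.
    Warn(String),
    // Send a `warning` message and then disconnect the peer.
    WarnAndDisconnect(String),
}

fn should_disconnect(err: &ChannelError) -> bool {
    match err {
        ChannelError::Warn(_) => false,
        ChannelError::WarnAndDisconnect(_) => true,
    }
}
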
Quiescence is a new protocol feature that allows for channels to undergo
"fundamental" changes (i.e., protocol upgrade) while there are no
pending updates on either side. Its first use case will be to carry out
channel splices, to ensure new HTLC/fee updates are not made while a
splice is being negotiated.

Each side of the channel is allowed to send a `stfu` message once none
of its outbound updates are pending on either side (i.e., all updates
are irrevocably committed on both commitment transactions). Once both
sides have exchanged `stfu`, the channel becomes quiescent. A message
timeout is enforced during the quiescence handshake so that we can
eventually re-establish the channel and propose new HTLC/fee updates
again.

Several new state flags have been added to `ChannelState::ChannelReady`
to track the progress of the quiescence handshake. Once the channel
becomes quiescent, all flags related to the handshake are cleared, and
the `QUIESCENT` flag is enabled. While quiescence is not a persistent
protocol (it implicitly terminates upon peer disconnection) and no
updates can be made during it, we still need to track
`MONITOR_UPDATE_IN_PROGRESS`, as it may be required by a
quiescence-dependent protocol such as splicing.

We previously would avoid freeing our holding cells upon a
`revoke_and_ack` if a monitor update was in progress, which we checked
explicitly. With quiescence, if we've already sent `stfu`, we're not
allowed to make further commitment updates, so we must also avoid
freeing our holding cells in such cases.

Along the way, we also remove the special handling of in-progress
monitor updates now that it behaves the same as the handling of being
quiescent.
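
A minimal sketch of the gating condition described above, under assumed field names so it stands alone (not the actual LDK holding-cell code):

// Hedged sketch: do not free the holding cell while a monitor update is in
// flight or after we've sent `stfu`, since no new commitment updates are
// allowed in either case. Field names are illustrative assumptions.
struct HoldingCellGate {
    monitor_update_in_progress: bool,
    local_stfu_sent: bool,
    is_quiescent: bool,
}

impl HoldingCellGate {
    fn can_free_holding_cell(&self) -> bool {
        !self.monitor_update_in_progress && !self.local_stfu_sent && !self.is_quiescent
    }
}
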
With the introduction of `has_pending_channel_update`, we can now
determine whether any messages are owed to irrevocably commit HTLC
updates based on the current channel state. We prefer using the channel
state over the manual tracking previously done, so that there is a
single source of truth. We also gain the ability to expect to receive multiple
messages at once, which will become relevant with the quiescence
protocol, where we may be waiting on a counterparty `revoke_and_ack` and
`stfu`.

Since new updates are not allowed during quiescence (local updates enter
the holding cell), we want to ensure quiescence eventually terminates if
the handshake takes too long or our counterparty is uncooperative.
Disconnecting implicitly terminates quiescence, so the holding cell can
be freed upon re-establishing the channel (assuming quiescence is not
requested again).
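
To make the timeout behavior concrete, here is a hedged sketch driven by a periodic timer tick; the struct, field names, and tick threshold are illustrative assumptions rather than LDK's actual values.

// Hedged sketch: count timer ticks while a quiescence handshake (or quiescent
// state awaiting its dependent protocol) is outstanding, and disconnect once a
// threshold is hit, implicitly terminating quiescence so the holding cell can
// be freed on re-establish.
const QUIESCENCE_TIMEOUT_TICKS: u8 = 2;

struct QuiescenceTimeout {
    stfu_sent_or_received: bool,
    ticks_waiting: u8,
}

impl QuiescenceTimeout {
    // Called on each timer tick; returns true if we should disconnect the peer.
    fn on_timer_tick(&mut self) -> bool {
        if self.stfu_sent_or_received {
            self.ticks_waiting += 1;
            self.ticks_waiting >= QUIESCENCE_TIMEOUT_TICKS
        } else {
            self.ticks_waiting = 0;
            false
        }
    }
}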
@wpaulino (Contributor, Author) commented:
Rebased due to conflicts and added some more comments to the test you requested @jkczyz.

@jkczyz (Contributor) left a review comment:
LGTM
