
simln-lib: add htlc interceptor for simulated nodes #261

Open
elnosh wants to merge 2 commits into main

Conversation

@elnosh (Collaborator) commented May 6, 2025

for #255

Mostly took changes from these 2 commits to add an Interceptor trait:

Some of the things I changed from those commits:

  • intercept_htlc now returns a Result to communicate the result of the intercepted htlc instead of sending it through a channel.
  • spawns a task in a JoinSet for each interceptor's intercept_htlc so that if any of them holds the htlc for a long time it does not block the other interceptors. It then waits for the tasks in the JoinSet to complete. If any of them returns a result to fail the htlc, it drops the JoinSet, since there is no need to wait for the remaining tasks because the htlc will fail anyway (see the sketch below).
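A minimal sketch of that flow, with illustrative stand-in types (the real code uses InterceptRequest, CustomRecords, and the sim_node error types; the async_trait attribute, the String error, and the helper name run_interceptors here are assumptions for the example):

use std::collections::HashMap;
use std::sync::Arc;
use tokio::task::JoinSet;

// Simplified stand-ins for the real simln-lib types.
type CustomRecords = HashMap<u64, Vec<u8>>;

#[derive(Clone)]
struct InterceptRequest; // htlc fields elided

#[async_trait::async_trait]
trait Interceptor: Send + Sync {
    async fn intercept_htlc(&self, req: InterceptRequest) -> Result<CustomRecords, String>;
}

async fn run_interceptors(
    interceptors: &[Arc<dyn Interceptor>],
    request: InterceptRequest,
) -> Result<Vec<CustomRecords>, String> {
    // One task per interceptor so a slow interceptor can't block the others.
    let mut intercepts = JoinSet::new();
    for interceptor in interceptors {
        let interceptor = Arc::clone(interceptor);
        let request = request.clone();
        intercepts.spawn(async move { interceptor.intercept_htlc(request).await });
    }

    // Collect results as the tasks finish.
    let mut records = Vec::new();
    while let Some(res) = intercepts.join_next().await {
        match res {
            Ok(Ok(custom_records)) => records.push(custom_records),
            Ok(Err(e)) => {
                // One interceptor failed the htlc: no need to wait for the rest,
                // dropping the JoinSet aborts the remaining tasks.
                drop(intercepts);
                return Err(e);
            },
            Err(join_err) => return Err(join_err.to_string()),
        }
    }

    Ok(records)
}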

@carlaKC (Contributor) commented May 6, 2025

Driveby comment: could we split up the logic adding a unique index and the interceptor into separate commits?
For the sake of breaking up into smaller logical chunks / reviewability.

@carlaKC (Contributor) left a comment

Main comments are around how we handle shutdown in a reasonable way. Specifically:

  • An interceptor has had a critical failure: how does it tell the simulator to shut down?
  • The simulator needs to terminate the interceptors: how can we cleanly do this so that interception code isn't left in a bad state?

#[error("DuplicateCustomRecord: key {0}")]
DuplicateCustomRecord(u64),
#[error("InterceptorError: {0}")]
InterceptorError(Box<dyn Error + Send + Sync + 'static>),
carlaKC (Contributor):

Let's go ahead and define this as an error type in the top level library?

I think that simln could use an overhaul in the way we do errors (I did not understand rust errors when we started this project 🙈 ), and defining a reasonable general error type seems like a good start.
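Not prescriptive, but a sketch of what a general top-level error type could look like, using thiserror as the variants above already do (the name and variant set here are illustrative, not the library's actual definition):

use thiserror::Error;

/// Illustrative top-level error for simln-lib; variants are examples only.
#[derive(Debug, Error)]
pub enum SimlnError {
    #[error("node error: {0}")]
    NodeError(String),
    #[error("interceptor error: {0}")]
    InterceptorError(Box<dyn std::error::Error + Send + Sync + 'static>),
    #[error("simulated network error: {0}")]
    SimulatedNetworkError(String),
}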

Comment on lines +699 to +700
/// The short channel id for the incoming channel that this htlc was delivered on.
pub incoming_htlc: HtlcRef,
carlaKC (Contributor):

nit: note that this is a unique identifier for the htlc

/// Custom records provided by the incoming htlc.
pub incoming_custom_records: CustomRecords,

/// The short channel id for the outgoing channel that this htlc should be forwarded over.
carlaKC (Contributor):

nit: expand doc to note that None indicates that the intercepting node is the receiver.

);

let interceptor_clone = Arc::clone(interceptor);
intercepts.spawn(async move { interceptor_clone.intercept_htlc(request).await });
carlaKC (Contributor):

This might be a good spot for futures::future::join_all?

Rather than needing to spawn a new task here, we can just call (but not await) the intercept_htlc method and then await them in whatever order they arrive after that?

Author's choice - I don't see us running with so many interceptors that creating these tasks starts to give us a performance hit.
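For reference, a hedged sketch of the join_all shape being suggested here, reusing the illustrative Interceptor trait from the earlier sketch; as discussed below, join_all only resolves once every future has completed:

use futures::future::join_all;
use std::sync::Arc;

// Build the intercept_htlc futures without spawning, then await them together.
async fn run_interceptors_join_all(
    interceptors: &[Arc<dyn Interceptor>],
    request: InterceptRequest,
) -> Vec<Result<CustomRecords, String>> {
    let futs = interceptors
        .iter()
        .map(|i| i.intercept_htlc(request.clone()));
    // Resolves only after *all* interceptors have returned.
    join_all(futs).await
}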

elnosh (Collaborator, Author):

looked into join_all but IIUC it will wait until all futures resolve. I think we'd prefer to get the responses as the tasks finish rather than waiting for all of them to finish? That way we can fail the other interceptors if one failed the htlc.

carlaKC (Contributor):

I think we'd prefer to get the responses as the tasks finish rather than waiting for all to finish? that way we can fail the other interceptors if one failed the htlc.

Yeah indeed, we want to hear back from the first completing one so that we can cancel the rest. Could use FuturesUnordered but I think that's an over-optimization (it's for lots of futures), so happy to leave as-is 👍

}
},
Ok(Err(e)) => {
drop(intercepts);
carlaKC (Contributor):

I think that we may need a more gentle shutdown than dropping the task (which afaik will force abort the intercept_htlc call).

I can see it being difficult to write interceptor code that works with another system when your code may be run halfway and then killed - you could end up with funny states. My instinct is to provide a triggered pair and instruct interceptors to listen on it for shutdown signals? Open to other approaches as well.

I do think that the ability to shut down all the other interceptors once the first error is reached is a really, really nice feature.

elnosh (Collaborator, Author):

sounds good 👍 provide that trigger in the InterceptRequest and send on it if we need to fail other interceptors?
just to note, although intercept_htlc may be force-aborted, the interceptor should still get notified about htlc resolution through notify_resolution, but I think we'd still like to shut down gently through intercept_htlc?

carlaKC (Contributor), May 9, 2025:

provide that trigger in the InterceptRequest and send on it if we need to fail other interceptors?

I think that we should handle the triggering?

  • Give each interceptor a listener
  • On first interceptor error, trigger shutdown

Then interceptors are responsible for listening on that shutdown listener and getting their state in order before they exit. We still wait for each to finish, but that should be relatively quick because we've signaled that it's shutdown time.
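A minimal sketch of that listener-based shutdown, assuming the triggered crate mentioned earlier in the thread; the shutdown_listener field, the do_interception_work helper, and the function names are illustrative, not the PR's actual code:

use triggered::{trigger, Listener};

// Hypothetical request shape: the simulator hands every interceptor a
// listener alongside the htlc details.
struct InterceptRequest {
    // ... htlc fields elided ...
    shutdown_listener: Listener,
}

// Interceptor side: race the interception work against the shutdown signal
// so state can be put in order before returning.
async fn intercept_htlc_sketch(req: InterceptRequest) -> Result<(), String> {
    tokio::select! {
        res = do_interception_work() => res,
        _ = req.shutdown_listener => {
            // Another interceptor failed the htlc (or the simulation is
            // shutting down): clean up and return promptly.
            Ok(())
        }
    }
}

async fn do_interception_work() -> Result<(), String> {
    // Placeholder for the interceptor's real logic.
    Ok(())
}

// Simulator side: one trigger/listener pair, cloned into each request,
// pulled on the first interceptor error.
fn simulator_side_sketch() {
    let (shutdown_trigger, shutdown_listener) = trigger();
    let _request = InterceptRequest { shutdown_listener };
    // ... on the first Err returned by any interceptor:
    shutdown_trigger.trigger();
}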

Interceptor should still get notified about htlc resolution through notify_resolution

As-is I don't think we'd notify if the HTLC is failed by the interceptor? It'll never be "fully" forwarded by the node, so we don't notify it being resolved.

Edit: although, come to think of it, we probably do want to notify the failure even if one of the interceptors has returned a fail outcome - the others may have returned success and aren't aware that it never actually ended up going through.

elnosh (Collaborator, Author):

  • Give each interceptor a listener
  • On first interceptor error, trigger shutdown

Then interceptors are responsible for listening on that shutdown listener and getting their state in order before they exit. We still wait for each to finish, but that should be relatively quick because we've signaled that it's shutdown time.

got it, will do.

As-is I don't think we'd notify if the HTLC is failed by the interceptor? It'll never be "fully" forwarded by the node, so we don't notify it being resolved.

I thought we would. Because if an interceptor returns an error, we then return it in add_htlcs and then call remove_htlcs here: https://github.com/elnosh/sim-ln/blob/2de405fdcb37c8ffe4bd8d2cc0077ef7099cc3ed/simln-lib/src/sim_node.rs#L1274, which internally calls notify_resolution for each interceptor.

carlaKC (Contributor):

I thought we would.

Ok cool, forgot when that is/isn't called but sgtm!

pub trait Interceptor: Send + Sync {
    /// Implemented by HTLC interceptors that provide input on the resolution of HTLCs forwarded in the simulation.
    async fn intercept_htlc(&self, req: InterceptRequest)
        -> Result<CustomRecords, ForwardingError>;
carlaKC (Contributor):

If we only have one layer of Result here, how does the intercept_htlc call tell the simulator that it's hit a critical error and wants to shut down (rather than a forwarding error for this payment specifically)?

elnosh (Collaborator, Author):

I wasn't sure if that's something we'd want. I didn't like the idea of nested Results. It could instead be communicated through specific variants in ForwardingError? The simulation could check with is_critical whether the ForwardingError returned warrants a shutdown.

carlaKC (Contributor):

It could instead be communicated through specific variants in ForwardingError

That seems like a bit of a layering violation to me. An InterceptorError for interceptors that might want to fail forwards with their own forwarding failure reason that isn't defined in simln (e.g. you don't have enough reputation) makes sense to me, but we're stretching its definition a bit to cover things like unexpected internal state errors.

Nested results are definitely ugly, but I think that it more clearly represents the two types of failures that we could have here? One for "fail this HTLC", one for "something has gone wrong with my interception". If we define a type for the inner result the function signatures won't be too ugly at least.
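For illustration, one way the two-layer shape with a named inner type could look; the aliases and the simplified ForwardingError here are placeholders, not the PR's final definitions:

use std::collections::HashMap;
use std::error::Error;

type CustomRecords = HashMap<u64, Vec<u8>>;

// Simplified stand-in for simln's forwarding failure reasons.
#[derive(Debug)]
pub enum ForwardingError {
    InterceptorError(String),
    // ... other failure reasons ...
}

/// Inner result: what should happen to this htlc (forward with these custom
/// records, or fail it with a forwarding error).
pub type InterceptorOutcome = Result<CustomRecords, ForwardingError>;

/// Outer result: did the interceptor itself hit a critical error? An Err here
/// tells the simulator to shut down rather than just fail the payment.
pub type InterceptorResult =
    Result<InterceptorOutcome, Box<dyn Error + Send + Sync + 'static>>;

#[async_trait::async_trait]
pub trait Interceptor: Send + Sync {
    async fn intercept_htlc(&self, req: InterceptRequest) -> InterceptorResult;
}

pub struct InterceptRequest; // fields elided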

elnosh (Collaborator, Author):

fair! will change 👍

@elnosh (Collaborator, Author) commented May 9, 2025

  • An interceptor has had a critical failure: how does it tell the simulator to shut down?

I wasn't sure if letting the interceptor shut down the simulation was something we'd want, but it could be done through specific variants in the ForwardingError returned? The simulation could check with is_critical whether the ForwardingError returned warrants a shutdown. Interceptors would need to be aware of this to trigger a shutdown by returning those specific variants.

  • The simulator needs to terminate the interceptors: how can we cleanly do this so that interception code isn't left in a bad state?

should we include something like a channel or shutdown signal in InterceptRequest?

@carlaKC (Contributor) commented May 12, 2025

WDYT about adapting this latency interceptor and surfacing it on the CI?

Nice to have a user of the interceptor that we can run + test in master rather than merging in some unused code and having to test it externally.

@elnosh (Collaborator, Author) commented May 12, 2025

sgtm, as I was doing exactly that and testing externally heh.

Comment on lines +679 to +684
    async fn notify_resolution(
        &self,
        _res: InterceptResolution,
    ) -> Result<(), Box<dyn Error + Send + Sync + 'static>> {
        Ok(())
    }
elnosh (Collaborator, Author):

rethinking this, I'm wondering what type of error this should be, or whether this method should return a Result at all? Leaning towards not returning a Result, since the simulation is just notifying the interceptor of a resolution.

carlaKC (Contributor):

Perhaps the interceptor wants to signal shutdown on the notify_resolution because it got information it wasn't expecting? Or a database became unavailable, etc.
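As a hedged example of that kind of failure, with a hypothetical Store standing in for whatever external system an interceptor depends on (names and types here are illustrative only):

use std::error::Error;

pub struct InterceptResolution; // fields elided

// Hypothetical external dependency; writes may fail if it becomes unavailable.
pub struct Store;

impl Store {
    pub fn record_resolution(&self, _res: &InterceptResolution) -> Result<(), std::io::Error> {
        Ok(())
    }
}

pub struct LoggingInterceptor {
    store: Store,
}

impl LoggingInterceptor {
    // Returning Err lets the simulator know the interceptor can no longer do
    // its job, so the simulation can be shut down cleanly.
    pub async fn notify_resolution(
        &self,
        res: InterceptResolution,
    ) -> Result<(), Box<dyn Error + Send + Sync + 'static>> {
        self.store.record_resolution(&res)?;
        Ok(())
    }
}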
