Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reverseproxy: Connection termination cleanup #5663

Merged
merged 2 commits into from
Aug 1, 2023

Conversation

mmm444
Copy link
Contributor

@mmm444 mmm444 commented Jul 31, 2023

This PR cleans up the situation with proxied connection termination after configuration reload under several circumstances.

Before PR

To my best knowledge the situation before this PR is captured in the following table that summarizes when the connections are terminated on configuration reload:

flush_interval Connection Type stream_close_delay Terminated
-1 Streaming 0 On Reload (1)
> 0 On Reload (1) *
Other On Reload (1) *
!= -1 Streaming 0 On Reload (2)
> 0 After Timeout
Other Naturally

The termination mechanisms require a bit of clarification:

  • Naturally means the the connection terminates when either upstream or downstream connection is closed. Note that all connections can terminate naturally before any other termination mechanism takes place.
  • After Timeout has effect on streaming connections (websockets). When the proxy is terminating (because of configuration reload) it sets up a timer that, when it fires, terminates all the remaining streaming connections.
  • On Reload (2) is similar to After Timeout with the difference that the termination happens immediately without setting up any timer.
  • On Reload (1) happens also when the proxy terminates but has different internal mechanism. It is related to the way the connections are naturally terminated. When flush_interval != -1 the proxy creates the upstream connection with a context bound to the downstream request so when the downstream connection terminates the upstream one gets immediately closed as well. There were some subtle issues when flush_interval == -1 (see caddyserver-reverseproxy + ffmpeg streaming causes invalid files/connection state #4922) that were resolved in reverseproxy: Ignore context cancel in stream mode #4952 by binding the context of upstream connections to the context of the proxy. The proxy context gets canceled when the configuration is reloaded thus effectively terminating all the connections.
  • The behaviors marked with * are buggy.

Effect

This PR changes the way how the connections are terminated when flush_interval == -1. Instead of binding the cancellation of upstream request to the context of the proxy, the cancellation is simply suppressed. It may seem that this may lead to connection leaks but it does not, because the connections will terminate whenever one of its parts (upstream or downstream) terminates and other tries to do some IO. So maybe there is a leak is in the sense that the termination happens later.

So the table from above looks like this after the PR, changed items are emphasized:

flush_interval Connection Type stream_close_delay Terminated
-1 Streaming 0 On Reload (2)
> 0 After Timeout
Other Naturally
!= -1 Streaming 0 On Reload (2)
> 0 After Timeout
Other Naturally

Notes

This PR follows the discussion in #5567. See also #5637 for more discussion on a related topic.

Hopefully this PR will clarify the whole situation.

@mmm444
Copy link
Contributor Author

mmm444 commented Jul 31, 2023

I did only some basic testing. I would like to do more when and if there are no objections against this PR.

@mholt mholt added this to the v2.7.0 milestone Jul 31, 2023
@francislavoie francislavoie added under review 🧐 Review is pending before merging bug 🐞 Something isn't working labels Jul 31, 2023
@francislavoie
Copy link
Member

Ooh, yeah this makes a lot of sense.

Thanks for the detailed comment, very helpful!

Copy link
Member

@mholt mholt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is a beautiful patch and PR description.

So basically we pass in a context that clears the deadline, cancel chan, and error, so that the parent context has no influence... but we can still access the embedded context's values.

Really appreciate the table up there and the clear description.

Would you like an invite to have write permission to the repo and an invite to our developer Slack? We'd be happy to accept more of your contributions in the future, and this way you could work in a branch instead of a fork. And also converse more real-time in Slack if that's helpful.

Thank you!

modules/caddyhttp/reverseproxy/reverseproxy.go Outdated Show resolved Hide resolved
Don't terminate connections using the proxy context when
flush_interval == -1. It is not necesseray and it causes issues
with websocket close timeouts.
@francislavoie francislavoie enabled auto-merge (squash) August 1, 2023 13:51
@francislavoie francislavoie merged commit da23501 into caddyserver:master Aug 1, 2023
19 checks passed
@mmm444
Copy link
Contributor Author

mmm444 commented Aug 1, 2023

So basically we pass in a context that clears the deadline, cancel chan, and error, so that the parent context has no influence... but we can still access the embedded context's values.

Exactly. That's what the new context.WithoutCancel does. 😉

Would you like an invite to have write permission to the repo and an invite to our developer Slack? We'd be happy to accept more of your contributions in the future, and this way you could work in a branch instead of a fork. And also converse more real-time in Slack if that's helpful.

Thank you for your trust, but I would keep it open for now. I'm quite busy with work and I cannot engage in caddy too much although I really like it. But feel free to mention me here on GitHub if you think that I could be of any help (maybe some other bugs/features in the proxy) and I will try to do my best.

BTW I did some tests regarding the closing of the connections with the non-cancellable context and it seems to be working fine.

@mholt
Copy link
Member

mholt commented Aug 1, 2023

Alrighty, sounds good. Well you're welcome here whenever you have time 😀

Thank you for the tests and confirming it works! Happy to get this out in 2.7 maybe even today or tomorrow.

@mholt mholt removed the under review 🧐 Review is pending before merging label Aug 1, 2023
@wazerstar
Copy link

So basically we pass in a context that clears the deadline, cancel chan, and error, so that the parent context has no influence... but we can still access the embedded context's values.

Exactly. That's what the new context.WithoutCancel does. 😉

Would you like an invite to have write permission to the repo and an invite to our developer Slack? We'd be happy to accept more of your contributions in the future, and this way you could work in a branch instead of a fork. And also converse more real-time in Slack if that's helpful.

Thank you for your trust, but I would keep it open for now. I'm quite busy with work and I cannot engage in caddy too much although I really like it. But feel free to mention me here on GitHub if you think that I could be of any help (maybe some other bugs/features in the proxy) and I will try to do my best.

BTW I did some tests regarding the closing of the connections with the non-cancellable context and it seems to be working fine.

Hey I want to test this

What's the default values for stream_timeout & stream_close_delay, and should I add flush_interval -1 still?

If I want to test this out for live streaming with DVR in it where hdhr is used - https://info.hdhomerun.com/info/dvr_api:live_tv

This basically shares the same down/up internally before handed over to caddy , either way I'm not sure if its working with or without, since when I do add flush_internal -1, it would give "error":"writing: client disconnected", removing it would result in "error":"reading: context canceled"

@francislavoie
Copy link
Member

Hey I want to test this

Sure, just build Caddy from master.

What's the default values for stream_timeout & stream_close_delay

None, i.e. zero. The timeout and delay are off by default (for now).

and should I add flush_interval -1 still?

Probably not. We have logic to enable immediate flushing if the request looks like it needs it. Only add that option if you actually need it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug 🐞 Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants