🐛 Incoming request ended abruptly: context canceled - http type #1360
Comments
We have this recurring issue at Crisp (Cloudflare Enterprise customer) that significantly impacts us, especially in the past three weeks. We've raised it numerous times here, as well as with Cloudflare Support, for years. From my understanding, the problem lies in cloudflared's inability to auto-detect and switch to stable proxy points when a nearby location becomes unstable, for instance during a CF maintenance in a PoP. Example: Possible issue: cloudflared appears to use locations defined in allregions/discovery.go. We think the issue could be:
Anyway, there is a lack of stability checks: Cloudflare does not appear to validate the stability of the connection to the selected proxy point. For instance:
Hi, I'll help answer from what I know. Cloudflare operates multiple data centers (known as PoPs) within one city, so a city (e.g. Amsterdam) marked as "re-routed" or "partially re-routed" doesn't mean that all PoPs in Amsterdam are down, unless Cloudflare operates only one data center/PoP in that city (not the case with major cities such as AMS/LON/FRA). You brought up SRV and DNS caching; that is not the issue here, as Cloudflare Tunnel IPs are all anycast to the nearest regions. However, using Cloudflare Tunnel doesn't guarantee network performance to end users. Traffic between the end user's Cloudflare PoP and the tunnel PoP is carried on a best-effort basis. Let's say I'm hitting Cloudflare's Japan PoP: traffic to your tunnel PoP (AMS) will go over the public backbone, and any cable cuts might have an impact on performance. The solution for better performance globally is Argo. I don't know exactly what issues you are having, so maybe try turning on Argo for a bit to see if there is any difference.
Hi @morpig, thank you for your response. I am answering for Baptiste since I'm also from Crisp and handled the downtime last night. What our cfd instances indicated when we restarted them was that all 8 HA connections were established with ams00 targets, where 00 is a number that seemed to range from 01 to 16. This clearly indicates that your control plane served us AMS endpoints, even though AMS was in maintenance at the same time. We did not observe packet loss in the region during the outage, therefore we linked our tunnel outage to the CF AMS maintenance. Our DC is at DigitalOcean AMS, so I expect we cross common exchange points away from the public backbone. What I would have expected is that the CF control plane would serve a backup PoP that cfd could use instead, such as FRA, which is nearby. We've seen our AMS cfd connect to both CF AMS and CF FRA PoPs in the past, as per cfd logs in our syslog. However, yesterday the CF tunnel control plane only served us the AMS CF PoP for all 8 HA connections. While I understand the principle of anycast, could some network-level routing issue have occurred due to last night's maintenance? It's really hard to tell, but I suspect serving different PoPs would have helped with resolving the outage.
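To make the "all 8 HA connections landed on AMS" observation easy to alert on, here is a minimal Go sketch that groups connection location codes by city. It assumes you have already extracted the location codes (such as `ams01`) from your own cloudflared logs; the helper names `cityOf` and `tallyByCity` and the sample codes are hypothetical and not part of cloudflared.

```go
package main

import (
	"fmt"
	"strings"
)

// cityOf strips the trailing PoP number from a location code and
// lowercases the rest, so "ams01" and "AMS16" both map to "ams".
func cityOf(location string) string {
	return strings.ToLower(strings.TrimRight(location, "0123456789"))
}

// tallyByCity counts how many connections landed in each city.
func tallyByCity(locations []string) map[string]int {
	counts := make(map[string]int)
	for _, loc := range locations {
		counts[cityOf(loc)]++
	}
	return counts
}

func main() {
	// Hypothetical location codes for eight HA connections.
	locations := []string{"ams01", "ams04", "ams07", "ams09", "ams11", "ams12", "ams15", "ams16"}

	counts := tallyByCity(locations)
	if len(counts) == 1 {
		fmt.Println("warning: all connections land in a single city:", counts)
	} else {
		fmt.Println("connections by city:", counts)
	}
}
```

A single-city result is not proof of a problem on its own; it only flags that there is no geographic spread if that city degrades.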
I brought up this exact problem some time ago (and got a reply from the CF team): #747 (comment)
Interesting that you've noticed the same issue as us. Given CF's response to you in 2022, this should have degraded smoothly yesterday, which clearly wasn't the case. Note that QUIC was completely down during the outage, and we had to force HTTP2 to get traffic to flow, with excruciatingly high latency (5-10s per request).
What did support have to say about this? If you want, you can set the edge addresses manually using --edge ip:7844 and do your own health checks.
So far we have received this from them: "I've escalated the issue and we'll get back to you as soon as we have more information." We don't expect any answer from them; they say it will be escalated every time but do nothing on their end.
That is certainly odd. You said you are an Enterprise client? Do you have tunnel logs from this time?
The SRV records should not change regardless of the state of the PoPs, only CF's routing of those IPs, so DNS caching should not impact anything. There is also health checking of the tunnel connections; however, the routes and their health are out of scope for cloudflared. If you need dedicated connectivity or more control, you should consider CPI/PNI.
AMS is an MCP; the appended numbers are the PoP number in that city.
The control plane does not really serve you a specific PoP for tunnel endpoints as you might expect; BGP decides where your tunnel endpoints end up.
I'd say this is more likely to be network/ISP related than a CF incident. cloudflared logs would help.
Describe the bug
A clear and concise description of what the bug is.
To Reproduce
Steps to reproduce the behavior:
If it's an issue with Cloudflare Tunnel:
4. Tunnel ID : 37c23069-2c21-413b-917e-3c618d4e05ef
5. Configuration is here: https://github.com/Five-Borough-Fedi-Project/masto.nyc-docean/blob/main/kubernetes/dependencies/cloudflared.yaml and here: https://github.com/Five-Borough-Fedi-Project/masto.nyc-docean/blob/main/kubernetes/services/configmap-cloudflared.yaml
Expected behavior
A clear and concise description of what you expected to happen:
I'm tracking down some issues in our proxy, and am addressing strange errors in the chain. This keeps coming up. It's not so much that I expect them not to happen, but I expect the error message to explain more about what is going on:
Incoming request ended abruptly: context canceled
doesn't really tell me anything. What does this mean? How do I address it?
Environment and versions
Logs and errors