Linkerd proxy fails to connect to other proxy #12720

Open
andrewdinunzio opened this issue Jun 13, 2024 · 2 comments

andrewdinunzio commented Jun 13, 2024

What is the issue?

After upgrading to edge-24.5.5, we have seen an increase in 503/504 timeouts, though seemingly only from a single (client) pod. It shows up in logs like this:

[ 73137.020135s]  INFO ThreadId(01) outbound:proxy{addr=10.0.3.18:80}:service{ns=imodeljs name=svc-7ed19bdd30302ada2f8bab1286f9096fe1ee48b7 port=80}: linkerd_proxy_balance_queue::worker: Unavailable; entering failfast timeout=3.0
[ 73137.020196s]  INFO ThreadId(01) outbound:proxy{addr=10.0.3.18:80}:rescue{client.addr=10.244.101.75:43486}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=logical service 10.0.3.18:80: route default.http: backend Service.imodeljs.svc-7ed19bdd30302ada2f8bab1286f9096fe1ee48b7:80: Service.imodeljs.svc-7ed19bdd30302ada2f8bab1286f9096fe1ee48b7:80: service in fail-fast error.sources=[route default.http: backend Service.imodeljs.svc-7ed19bdd30302ada2f8bab1286f9096fe1ee48b7:80: Service.imodeljs.svc-7ed19bdd30302ada2f8bab1286f9096fe1ee48b7:80: service in fail-fast, backend Service.imodeljs.svc-7ed19bdd30302ada2f8bab1286f9096fe1ee48b7:80: Service.imodeljs.svc-7ed19bdd30302ada2f8bab1286f9096fe1ee48b7:80: service in fail-fast, Service.imodeljs.svc-7ed19bdd30302ada2f8bab1286f9096fe1ee48b7:80: service in fail-fast, service in fail-fast]
[ 73137.024904s]  WARN ThreadId(01) outbound:proxy{addr=10.0.181.10:4317}:service{ns=monitoring name=otel-collector port=4317}:endpoint{addr=10.244.24.11:4317}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73137.123570s]  WARN ThreadId(01) outbound:proxy{addr=10.0.181.10:4317}:service{ns=monitoring name=otel-collector port=4317}:endpoint{addr=10.244.12.6:4317}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73137.224027s]  WARN ThreadId(01) outbound:proxy{addr=10.0.85.215:80}:service{ns=imodeljs name=svc-a2140638230d7090d9013f1fb4b91369a0674f75 port=80}:endpoint{addr=10.244.18.123:3000}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73137.313264s]  WARN ThreadId(01) outbound:proxy{addr=10.0.66.11:80}:service{ns=imodeljs name=svc-aafe19edc5d63559d25430238ffb792322920e7c port=80}:endpoint{addr=10.244.18.178:3000}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73137.336371s]  WARN ThreadId(01) outbound:proxy{addr=10.0.3.18:80}:service{ns=imodeljs name=svc-7ed19bdd30302ada2f8bab1286f9096fe1ee48b7 port=80}:endpoint{addr=10.244.30.115:3000}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73137.592014s]  INFO ThreadId(01) outbound:proxy{addr=10.0.80.209:80}:service{ns=imodeljs name=svc-3d97016a986f0e00f5120c270cc160ff6f23bf57 port=80}: linkerd_proxy_balance_queue::worker: Unavailable; entering failfast timeout=3.0
[ 73137.592108s]  INFO ThreadId(01) outbound:proxy{addr=10.0.80.209:80}:rescue{client.addr=10.244.101.75:42520}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=logical service 10.0.80.209:80: route default.http: backend Service.imodeljs.svc-3d97016a986f0e00f5120c270cc160ff6f23bf57:80: Service.imodeljs.svc-3d97016a986f0e00f5120c270cc160ff6f23bf57:80: service in fail-fast error.sources=[route default.http: backend Service.imodeljs.svc-3d97016a986f0e00f5120c270cc160ff6f23bf57:80: Service.imodeljs.svc-3d97016a986f0e00f5120c270cc160ff6f23bf57:80: service in fail-fast, backend Service.imodeljs.svc-3d97016a986f0e00f5120c270cc160ff6f23bf57:80: Service.imodeljs.svc-3d97016a986f0e00f5120c270cc160ff6f23bf57:80: service in fail-fast, Service.imodeljs.svc-3d97016a986f0e00f5120c270cc160ff6f23bf57:80: service in fail-fast, service in fail-fast]
[ 73137.845888s]  INFO ThreadId(01) outbound:proxy{addr=10.0.80.209:80}:rescue{client.addr=10.244.101.75:42530}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=logical service 10.0.80.209:80: route default.http: service unavailable error.sources=[route default.http: service unavailable, service unavailable]
[ 73137.911194s]  WARN ThreadId(01) outbound:proxy{addr=10.0.80.209:80}:service{ns=imodeljs name=svc-3d97016a986f0e00f5120c270cc160ff6f23bf57 port=80}:endpoint{addr=10.244.24.160:3000}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73138.167582s]  WARN ThreadId(01) outbound:proxy{addr=10.0.181.10:4317}:service{ns=monitoring name=otel-collector port=4317}:endpoint{addr=10.244.2.4:4317}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73138.189801s]  WARN ThreadId(01) outbound:proxy{addr=10.0.181.10:4317}:service{ns=monitoring name=otel-collector port=4317}:endpoint{addr=10.244.3.4:4317}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73138.363328s]  WARN ThreadId(01) outbound:proxy{addr=10.0.181.10:4317}:service{ns=monitoring name=otel-collector port=4317}:endpoint{addr=10.244.18.8:4317}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73138.422580s]  WARN ThreadId(01) outbound:proxy{addr=10.0.66.11:80}:service{ns=imodeljs name=svc-aafe19edc5d63559d25430238ffb792322920e7c port=80}:endpoint{addr=10.244.18.178:3000}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73138.437773s]  WARN ThreadId(01) outbound:proxy{addr=10.0.181.10:4317}:service{ns=monitoring name=otel-collector port=4317}:endpoint{addr=10.244.16.3:4317}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73138.512983s]  WARN ThreadId(01) outbound:proxy{addr=10.0.181.10:4317}:service{ns=monitoring name=otel-collector port=4317}:endpoint{addr=10.244.10.10:4317}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73138.527106s]  WARN ThreadId(01) outbound:proxy{addr=10.0.181.10:4317}:service{ns=monitoring name=otel-collector port=4317}:endpoint{addr=10.244.24.11:4317}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73138.626325s]  WARN ThreadId(01) outbound:proxy{addr=10.0.181.10:4317}:service{ns=monitoring name=otel-collector port=4317}:endpoint{addr=10.244.12.6:4317}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73138.726084s]  WARN ThreadId(01) outbound:proxy{addr=10.0.85.215:80}:service{ns=imodeljs name=svc-a2140638230d7090d9013f1fb4b91369a0674f75 port=80}:endpoint{addr=10.244.18.123:3000}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73138.765340s]  WARN ThreadId(01) outbound:proxy{addr=10.0.3.18:80}:service{ns=imodeljs name=svc-7ed19bdd30302ada2f8bab1286f9096fe1ee48b7 port=80}:endpoint{addr=10.244.30.115:3000}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73139.311818s]  INFO ThreadId(01) outbound:proxy{addr=10.0.66.11:80}:service{ns=imodeljs name=svc-aafe19edc5d63559d25430238ffb792322920e7c port=80}: linkerd_proxy_balance_queue::worker: Unavailable; entering failfast timeout=3.0
[ 73139.311892s]  INFO ThreadId(01) outbound:proxy{addr=10.0.66.11:80}:rescue{client.addr=10.244.101.75:57314}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=logical service 10.0.66.11:80: route default.http: backend Service.imodeljs.svc-aafe19edc5d63559d25430238ffb792322920e7c:80: Service.imodeljs.svc-aafe19edc5d63559d25430238ffb792322920e7c:80: service in fail-fast error.sources=[route default.http: backend Service.imodeljs.svc-aafe19edc5d63559d25430238ffb792322920e7c:80: Service.imodeljs.svc-aafe19edc5d63559d25430238ffb792322920e7c:80: service in fail-fast, backend Service.imodeljs.svc-aafe19edc5d63559d25430238ffb792322920e7c:80: Service.imodeljs.svc-aafe19edc5d63559d25430238ffb792322920e7c:80: service in fail-fast, Service.imodeljs.svc-aafe19edc5d63559d25430238ffb792322920e7c:80: service in fail-fast, service in fail-fast]
[ 73139.323187s]  WARN ThreadId(01) outbound:proxy{addr=10.0.80.209:80}:service{ns=imodeljs name=svc-3d97016a986f0e00f5120c270cc160ff6f23bf57 port=80}:endpoint{addr=10.244.24.160:3000}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73139.570916s]  INFO ThreadId(01) outbound:proxy{addr=10.0.66.11:80}:rescue{client.addr=10.244.101.75:57326}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=logical service 10.0.66.11:80: route default.http: service unavailable error.sources=[route default.http: service unavailable, service unavailable]
[ 73139.642228s]  WARN ThreadId(01) outbound:proxy{addr=10.0.66.11:80}:service{ns=imodeljs name=svc-aafe19edc5d63559d25430238ffb792322920e7c port=80}:endpoint{addr=10.244.18.178:3000}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73139.669498s]  WARN ThreadId(01) outbound:proxy{addr=10.0.181.10:4317}:service{ns=monitoring name=otel-collector port=4317}:endpoint{addr=10.244.2.4:4317}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73139.691646s]  WARN ThreadId(01) outbound:proxy{addr=10.0.181.10:4317}:service{ns=monitoring name=otel-collector port=4317}:endpoint{addr=10.244.3.4:4317}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73139.865747s]  WARN ThreadId(01) outbound:proxy{addr=10.0.181.10:4317}:service{ns=monitoring name=otel-collector port=4317}:endpoint{addr=10.244.18.8:4317}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73139.940952s]  WARN ThreadId(01) outbound:proxy{addr=10.0.181.10:4317}:service{ns=monitoring name=otel-collector port=4317}:endpoint{addr=10.244.16.3:4317}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73140.015251s]  WARN ThreadId(01) outbound:proxy{addr=10.0.181.10:4317}:service{ns=monitoring name=otel-collector port=4317}:endpoint{addr=10.244.10.10:4317}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73140.028382s]  WARN ThreadId(01) outbound:proxy{addr=10.0.181.10:4317}:service{ns=monitoring name=otel-collector port=4317}:endpoint{addr=10.244.24.11:4317}: linkerd_reconnect: Failed to connect error=connect timed out after 1s

... etc

The proxy fails to connect, then goes into fail-fast mode (the balancer reports itself Unavailable, and after the 3s failfast timeout requests are failed immediately, surfacing as 503s). This is happening on only a single (client) pod out of 5 replicas. It is the youngest replica at <1d old; the others are ~5 days old. The other replicas have no trouble connecting to this service, even while this one is failing. Occasionally, requests from the problematic pod do succeed.

It's also not just this one service (00dfd97) that the proxy has trouble connecting to; the same is happening with ~20 others.

This also happened yesterday: I deleted the bad pod, and its replacement had the same issue. After deleting the pod a second time, things seemingly worked until the problem returned a few hours later.
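
One way to check whether this is stale service discovery rather than a network problem (a sketch; the FQDN below is constructed from the service names in the logs) is to ask the destination controller what it currently resolves for an affected service and compare that against the actual EndpointSlices:

$ linkerd diagnostics endpoints svc-7ed19bdd30302ada2f8bab1286f9096fe1ee48b7.imodeljs.svc.cluster.local:80
$ kubectl get endpointslices -n imodeljs -l kubernetes.io/service-name=svc-7ed19bdd30302ada2f8bab1286f9096fe1ee48b7

If the proxy is dialing IPs that are no longer in the slice, the destination controller is serving stale state; if the IPs match, the connect timeouts point at the network path or the server pods themselves.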

How can it be reproduced?

I don't know how to reproduce it. It seems to just happen after some time.

Logs, error output, etc

controller metrics: https://gist.github.com/andrewdinunzio/022c6db28b347cc333402af3092ac18a
(problematic client) proxy metrics: https://gist.github.com/andrewdinunzio/9597687d904d3ffe8cd30489c1568fb6
(server) proxy metrics: https://gist.github.com/andrewdinunzio/29ed81d33f915596a4ba72bf9a97db5a

Linkerd-init logs (client):

time="2024-06-12T19:18:41Z" level=info msg="/sbin/iptables-legacy-save -t nat"
time="2024-06-12T19:18:41Z" level=info msg="# Generated by iptables-save v1.8.10 on Wed Jun 12 19:18:41 2024\n*nat\n:PREROUTING ACCEPT [0:0]\n:INPUT ACCEPT [0:0]\n:OUTPUT ACCEPT [0:0]\n:POSTROUTING ACCEPT [0:0]\nCOMMIT\n# Completed on Wed Jun 12 19:18:41 2024\n"
time="2024-06-12T19:18:41Z" level=info msg="/sbin/iptables-legacy -t nat -N PROXY_INIT_REDIRECT"
time="2024-06-12T19:18:41Z" level=info msg="/sbin/iptables-legacy -t nat -A PROXY_INIT_REDIRECT -p tcp --match multiport --dports 4190,4191,4567,4568 -j RETURN -m comment --comment proxy-init/ignore-port-4190,4191,4567,4568"
time="2024-06-12T19:18:41Z" level=info msg="/sbin/iptables-legacy -t nat -A PROXY_INIT_REDIRECT -p tcp -j REDIRECT --to-port 4143 -m comment --comment proxy-init/redirect-all-incoming-to-proxy-port"
time="2024-06-12T19:18:41Z" level=info msg="/sbin/iptables-legacy -t nat -A PREROUTING -j PROXY_INIT_REDIRECT -m comment --comment proxy-init/install-proxy-init-prerouting"
time="2024-06-12T19:18:41Z" level=info msg="/sbin/iptables-legacy -t nat -N PROXY_INIT_OUTPUT"
time="2024-06-12T19:18:41Z" level=info msg="/sbin/iptables-legacy -t nat -A PROXY_INIT_OUTPUT -m owner --uid-owner 2102 -j RETURN -m comment --comment proxy-init/ignore-proxy-user-id"
time="2024-06-12T19:18:41Z" level=info msg="/sbin/iptables-legacy -t nat -A PROXY_INIT_OUTPUT -o lo -j RETURN -m comment --comment proxy-init/ignore-loopback"
time="2024-06-12T19:18:41Z" level=info msg="/sbin/iptables-legacy -t nat -A PROXY_INIT_OUTPUT -p tcp --match multiport --dports 4567,4568 -j RETURN -m comment --comment proxy-init/ignore-port-4567,4568"
time="2024-06-12T19:18:42Z" level=info msg="/sbin/iptables-legacy -t nat -A PROXY_INIT_OUTPUT -p tcp -j REDIRECT --to-port 4140 -m comment --comment proxy-init/redirect-all-outgoing-to-proxy-port"
time="2024-06-12T19:18:42Z" level=info msg="/sbin/iptables-legacy -t nat -A OUTPUT -j PROXY_INIT_OUTPUT -m comment --comment proxy-init/install-proxy-init-output"
time="2024-06-12T19:18:42Z" level=info msg="/sbin/iptables-legacy-save -t nat"
time="2024-06-12T19:18:42Z" level=info msg="# Generated by iptables-save v1.8.10 on Wed Jun 12 19:18:42 2024\n*nat\n:PREROUTING ACCEPT [0:0]\n:INPUT ACCEPT [0:0]\n:OUTPUT ACCEPT [0:0]\n:POSTROUTING ACCEPT [0:0]\n:PROXY_INIT_OUTPUT - [0:0]\n:PROXY_INIT_REDIRECT - [0:0]\n-A PREROUTING -m comment --comment \"proxy-init/install-proxy-init-prerouting\" -j PROXY_INIT_REDIRECT\n-A OUTPUT -m comment --comment \"proxy-init/install-proxy-init-output\" -j PROXY_INIT_OUTPUT\n-A PROXY_INIT_OUTPUT -m owner --uid-owner 2102 -m comment --comment \"proxy-init/ignore-proxy-user-id\" -j RETURN\n-A PROXY_INIT_OUTPUT -o lo -m comment --comment \"proxy-init/ignore-loopback\" -j RETURN\n-A PROXY_INIT_OUTPUT -p tcp -m multiport --dports 4567,4568 -m comment --comment \"proxy-init/ignore-port-4567,4568\" -j RETURN\n-A PROXY_INIT_OUTPUT -p tcp -m comment --comment \"proxy-init/redirect-all-outgoing-to-proxy-port\" -j REDIRECT --to-ports 4140\n-A PROXY_INIT_REDIRECT -p tcp -m multiport --dports 4190,4191,4567,4568 -m comment --comment \"proxy-init/ignore-port-4190,4191,4567,4568\" -j RETURN\n-A PROXY_INIT_REDIRECT -p tcp -m comment --comment \"proxy-init/redirect-all-incoming-to-proxy-port\" -j REDIRECT --to-ports 4143\nCOMMIT\n# Completed on Wed Jun 12 19:18:42 2024\n"

Policy controller logs:

2024-06-13T12:47:47.209114Z  INFO linkerd_policy_controller: Lease already exists, no need to create it
2024-06-13T12:47:47.214136Z  INFO grpc{port=8090}: linkerd_policy_controller: policy gRPC server listening addr=[::]:8090
2024-06-13T12:52:11.556884Z  INFO external_workloads: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T12:52:13.547002Z  INFO networkauthentications: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T12:52:16.435004Z  INFO meshtlsauthentications: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T12:56:37.915245Z  INFO networkauthentications: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T12:57:01.804959Z  INFO serverauthorizations: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T12:57:32.253743Z  INFO authorizationpolicies: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T12:57:32.254971Z  INFO servers: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T12:57:32.256153Z  INFO httproutes.policy.linkerd.io: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T12:57:32.264109Z  INFO httproutes.gateway.networking.k8s.io: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T12:57:32.346189Z  INFO authorizationpolicies: kubert::errors: stream failed error=error returned by apiserver during watch: too old resource version: 919557429 (919562865): Expired
2024-06-13T13:01:26.390023Z  INFO serverauthorizations: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T13:01:31.524588Z  INFO serverauthorizations: kubert::errors: stream failed error=error returned by apiserver during watch: too old resource version: 919557084 (919566533): Expired
2024-06-13T13:01:56.563801Z  INFO external_workloads: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T13:02:14.948188Z  INFO authorizationpolicies: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T13:06:04.198676Z  INFO networkauthentications: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T13:06:10.988277Z  INFO serverauthorizations: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T13:06:28.279723Z  INFO meshtlsauthentications: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T13:06:33.366486Z  INFO meshtlsauthentications: kubert::errors: stream failed error=error returned by apiserver during watch: too old resource version: 919565644 (919570834): Expired
2024-06-13T13:06:44.016472Z  INFO authorizationpolicies: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T13:07:17.284613Z  INFO httproutes.policy.linkerd.io: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T13:11:14.714751Z  INFO meshtlsauthentications: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T13:11:15.358710Z  INFO external_workloads: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T13:11:50.449105Z  INFO servers: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T13:12:07.296544Z  INFO httproutes.gateway.networking.k8s.io: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T13:15:54.230786Z  INFO networkauthentications: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T13:15:56.024557Z  INFO serverauthorizations: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T13:20:38.522047Z  INFO meshtlsauthentications: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T13:21:05.386145Z  INFO external_workloads: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T13:21:24.026874Z  INFO authorizationpolicies: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T13:25:44.241994Z  INFO networkauthentications: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T13:25:46.058145Z  INFO serverauthorizations: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T13:26:10.098664Z  INFO servers: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T13:26:23.893965Z  INFO httproutes.gateway.networking.k8s.io: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T13:30:06.030611Z  INFO meshtlsauthentications: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T13:30:23.780417Z  INFO external_workloads: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T13:31:14.043831Z  INFO authorizationpolicies: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T13:35:35.889872Z  INFO servers: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T13:39:56.041890Z  INFO meshtlsauthentications: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T13:40:01.022925Z  INFO networkauthentications: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T13:40:13.790827Z  INFO external_workloads: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T13:40:26.070287Z  INFO serverauthorizations: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T13:41:03.920485Z  INFO httproutes.gateway.networking.k8s.io: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T13:44:54.211229Z  INFO external_workloads: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T13:45:00.677632Z  INFO servers: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T13:45:54.060052Z  INFO authorizationpolicies: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T13:49:22.560686Z  INFO networkauthentications: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T13:49:46.059577Z  INFO meshtlsauthentications: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T13:50:16.076464Z  INFO serverauthorizations: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T13:50:53.928532Z  INFO httproutes.gateway.networking.k8s.io: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T13:54:44.219365Z  INFO external_workloads: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T13:55:44.068770Z  INFO authorizationpolicies: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T13:59:36.068293Z  INFO meshtlsauthentications: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T13:59:40.713378Z  INFO servers: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T14:00:06.087488Z  INFO serverauthorizations: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T14:00:13.802182Z  INFO httproutes.gateway.networking.k8s.io: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T14:04:02.576301Z  INFO networkauthentications: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T14:04:07.659296Z  INFO networkauthentications: kubert::errors: stream failed error=error returned by apiserver during watch: too old resource version: 919616945 (919622505): Expired
2024-06-13T14:05:34.074444Z  INFO authorizationpolicies: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T14:09:02.007269Z  INFO external_workloads: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T14:09:02.078340Z  INFO servers: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T14:09:26.080119Z  INFO meshtlsauthentications: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T14:09:56.096428Z  INFO serverauthorizations: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T14:10:03.824590Z  INFO httproutes.gateway.networking.k8s.io: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T14:13:57.675280Z  INFO networkauthentications: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T14:14:54.453441Z  INFO authorizationpolicies: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T14:19:16.090038Z  INFO meshtlsauthentications: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T14:19:23.054248Z  INFO serverauthorizations: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T14:23:42.018107Z  INFO external_workloads: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T14:24:43.852303Z  INFO httproutes.gateway.networking.k8s.io: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T14:28:50.400630Z  INFO serverauthorizations: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T14:33:32.028759Z  INFO external_workloads: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T14:34:33.862793Z  INFO httproutes.gateway.networking.k8s.io: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T14:38:40.408621Z  INFO serverauthorizations: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T14:43:05.365968Z  INFO external_workloads: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T14:43:55.043981Z  INFO httproutes.gateway.networking.k8s.io: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T14:48:30.418314Z  INFO serverauthorizations: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T14:48:55.050759Z  INFO httproutes.gateway.networking.k8s.io: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T14:58:45.062891Z  INFO httproutes.gateway.networking.k8s.io: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T15:12:42.877055Z  INFO authorizationpolicies: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T15:13:02.300229Z  INFO httproutes.gateway.networking.k8s.io: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T15:27:19.302560Z  INFO httproutes.gateway.networking.k8s.io: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)

(This looks like a lot of errors, but we see similar logs in other regions that are not exhibiting this problem.)

output of linkerd check -o short

$ linkerd check -o short
control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 24.5.5 but the latest edge version is 24.6.1
    see https://linkerd.io/2/checks/#l5d-version-control for hints
‼ control plane and cli versions match
    control plane running edge-24.5.5 but cli running edge-24.6.1
    see https://linkerd.io/2/checks/#l5d-version-control for hints
 
linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
        * linkerd-destination-57559bb8bb-28rdj (edge-24.5.5)
        * linkerd-identity-6f78fbcb7f-vfkkw (edge-24.5.5)
        * linkerd-proxy-injector-6c9847c9d7-zcrlx (edge-24.5.5)
    see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
    linkerd-destination-57559bb8bb-28rdj running edge-24.5.5 but cli running edge-24.6.1
    see https://linkerd.io/2/checks/#l5d-cp-proxy-cli-version for hints
 
Status check results are √

Environment

AKS

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

None


andrewdinunzio (Author) commented Jun 14, 2024

We enabled DEBUG logs on the client proxy:
https://gist.github.com/andrewdinunzio/e60cd0c6d4cbc9b5c0c1c28a33ad3b6c
(some info redacted)

I captured from 1000 lines before the first mention of svc-d93b402b4545fe597edf8ba9e49be61ae01973c4 (the problematic server in this case) to 1000 lines after its last mention.

These seem suspect:

[  2483.392885s] DEBUG ThreadId(01) outbound:accept{client.addr=10.244.27.147:45346 server.addr=10.0.89.172:80}:proxy{addr=10.0.89.172:80}:http: linkerd_stack::loadshed: Service has become unavailable
[  2483.392892s] DEBUG ThreadId(01) outbound:accept{client.addr=10.244.27.147:45346 server.addr=10.0.89.172:80}:proxy{addr=10.0.89.172:80}:http: linkerd_stack::loadshed: Service shedding load

But it's only one of the client replicas having this issue; the others are sending requests just fine.
Is the proxy just assuming load is the cause whenever the service is "unavailable"? Or is it the other way around, marking the service unavailable when it's under load? It's strange, because I definitely don't expect the server to receive many simultaneous requests; I'd expect 5-10 max.

Also, these logs seem like they would be good to have in INFO; it's difficult for us to toggle DEBUG logs on because they log user tokens, etc.
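
As a side note, the proxy's log level can also be changed at runtime through its admin server, which avoids redeploying the pod (a sketch based on the Linkerd docs; the pod name is a placeholder):

$ kubectl port-forward -n imodeljs <problematic-pod> 4191:4191 &
$ curl -X PUT --data 'warn,linkerd=debug' localhost:4191/proxy-log-level

This doesn't address the concern about sensitive data appearing in DEBUG output, but it does make toggling much quicker.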


alpeb (Member) commented Jun 20, 2024

I'm afraid I can't give you much guidance here without a reproducible scenario. We have some docs about failfast and about 503 and 504 errors that might help. What's special about the client that is causing these issues? Is it generating so many requests that the resources allocated to the proxy aren't enough? You could also look at the proxy metrics for the problematic pod and see how they compare to the healthy ones.
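
One concrete way to make that comparison (a sketch; pod names are placeholders): dump the metrics from the problematic pod and a healthy one, then diff counters such as request_total and tcp_open_total:

$ linkerd diagnostics proxy-metrics -n <client-ns> po/<problematic-pod> > bad.prom
$ linkerd diagnostics proxy-metrics -n <client-ns> po/<healthy-pod> > good.prom
$ diff <(grep -E '^(request_total|tcp_open_total)' bad.prom | sort) <(grep -E '^(request_total|tcp_open_total)' good.prom | sort)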
