What is the issue?
After upgrading to edge-24.5.5, we have seen an increase in 503/504 timeouts, though seemingly only from a single (client) pod. It shows up in the proxy logs like this:
[ 73137.020135s] INFO ThreadId(01) outbound:proxy{addr=10.0.3.18:80}:service{ns=imodeljs name=svc-7ed19bdd30302ada2f8bab1286f9096fe1ee48b7 port=80}: linkerd_proxy_balance_queue::worker: Unavailable; entering failfast timeout=3.0
[ 73137.020196s] INFO ThreadId(01) outbound:proxy{addr=10.0.3.18:80}:rescue{client.addr=10.244.101.75:43486}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=logical service 10.0.3.18:80: route default.http: backend Service.imodeljs.svc-7ed19bdd30302ada2f8bab1286f9096fe1ee48b7:80: Service.imodeljs.svc-7ed19bdd30302ada2f8bab1286f9096fe1ee48b7:80: service in fail-fast error.sources=[route default.http: backend Service.imodeljs.svc-7ed19bdd30302ada2f8bab1286f9096fe1ee48b7:80: Service.imodeljs.svc-7ed19bdd30302ada2f8bab1286f9096fe1ee48b7:80: service in fail-fast, backend Service.imodeljs.svc-7ed19bdd30302ada2f8bab1286f9096fe1ee48b7:80: Service.imodeljs.svc-7ed19bdd30302ada2f8bab1286f9096fe1ee48b7:80: service in fail-fast, Service.imodeljs.svc-7ed19bdd30302ada2f8bab1286f9096fe1ee48b7:80: service in fail-fast, service in fail-fast]
[ 73137.024904s] WARN ThreadId(01) outbound:proxy{addr=10.0.181.10:4317}:service{ns=monitoring name=otel-collector port=4317}:endpoint{addr=10.244.24.11:4317}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73137.123570s] WARN ThreadId(01) outbound:proxy{addr=10.0.181.10:4317}:service{ns=monitoring name=otel-collector port=4317}:endpoint{addr=10.244.12.6:4317}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73137.224027s] WARN ThreadId(01) outbound:proxy{addr=10.0.85.215:80}:service{ns=imodeljs name=svc-a2140638230d7090d9013f1fb4b91369a0674f75 port=80}:endpoint{addr=10.244.18.123:3000}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73137.313264s] WARN ThreadId(01) outbound:proxy{addr=10.0.66.11:80}:service{ns=imodeljs name=svc-aafe19edc5d63559d25430238ffb792322920e7c port=80}:endpoint{addr=10.244.18.178:3000}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73137.336371s] WARN ThreadId(01) outbound:proxy{addr=10.0.3.18:80}:service{ns=imodeljs name=svc-7ed19bdd30302ada2f8bab1286f9096fe1ee48b7 port=80}:endpoint{addr=10.244.30.115:3000}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73137.592014s] INFO ThreadId(01) outbound:proxy{addr=10.0.80.209:80}:service{ns=imodeljs name=svc-3d97016a986f0e00f5120c270cc160ff6f23bf57 port=80}: linkerd_proxy_balance_queue::worker: Unavailable; entering failfast timeout=3.0
[ 73137.592108s] INFO ThreadId(01) outbound:proxy{addr=10.0.80.209:80}:rescue{client.addr=10.244.101.75:42520}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=logical service 10.0.80.209:80: route default.http: backend Service.imodeljs.svc-3d97016a986f0e00f5120c270cc160ff6f23bf57:80: Service.imodeljs.svc-3d97016a986f0e00f5120c270cc160ff6f23bf57:80: service in fail-fast error.sources=[route default.http: backend Service.imodeljs.svc-3d97016a986f0e00f5120c270cc160ff6f23bf57:80: Service.imodeljs.svc-3d97016a986f0e00f5120c270cc160ff6f23bf57:80: service in fail-fast, backend Service.imodeljs.svc-3d97016a986f0e00f5120c270cc160ff6f23bf57:80: Service.imodeljs.svc-3d97016a986f0e00f5120c270cc160ff6f23bf57:80: service in fail-fast, Service.imodeljs.svc-3d97016a986f0e00f5120c270cc160ff6f23bf57:80: service in fail-fast, service in fail-fast]
[ 73137.845888s] INFO ThreadId(01) outbound:proxy{addr=10.0.80.209:80}:rescue{client.addr=10.244.101.75:42530}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=logical service 10.0.80.209:80: route default.http: service unavailable error.sources=[route default.http: service unavailable, service unavailable]
[ 73137.911194s] WARN ThreadId(01) outbound:proxy{addr=10.0.80.209:80}:service{ns=imodeljs name=svc-3d97016a986f0e00f5120c270cc160ff6f23bf57 port=80}:endpoint{addr=10.244.24.160:3000}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73138.167582s] WARN ThreadId(01) outbound:proxy{addr=10.0.181.10:4317}:service{ns=monitoring name=otel-collector port=4317}:endpoint{addr=10.244.2.4:4317}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73138.189801s] WARN ThreadId(01) outbound:proxy{addr=10.0.181.10:4317}:service{ns=monitoring name=otel-collector port=4317}:endpoint{addr=10.244.3.4:4317}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73138.363328s] WARN ThreadId(01) outbound:proxy{addr=10.0.181.10:4317}:service{ns=monitoring name=otel-collector port=4317}:endpoint{addr=10.244.18.8:4317}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73138.422580s] WARN ThreadId(01) outbound:proxy{addr=10.0.66.11:80}:service{ns=imodeljs name=svc-aafe19edc5d63559d25430238ffb792322920e7c port=80}:endpoint{addr=10.244.18.178:3000}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73138.437773s] WARN ThreadId(01) outbound:proxy{addr=10.0.181.10:4317}:service{ns=monitoring name=otel-collector port=4317}:endpoint{addr=10.244.16.3:4317}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73138.512983s] WARN ThreadId(01) outbound:proxy{addr=10.0.181.10:4317}:service{ns=monitoring name=otel-collector port=4317}:endpoint{addr=10.244.10.10:4317}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73138.527106s] WARN ThreadId(01) outbound:proxy{addr=10.0.181.10:4317}:service{ns=monitoring name=otel-collector port=4317}:endpoint{addr=10.244.24.11:4317}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73138.626325s] WARN ThreadId(01) outbound:proxy{addr=10.0.181.10:4317}:service{ns=monitoring name=otel-collector port=4317}:endpoint{addr=10.244.12.6:4317}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73138.726084s] WARN ThreadId(01) outbound:proxy{addr=10.0.85.215:80}:service{ns=imodeljs name=svc-a2140638230d7090d9013f1fb4b91369a0674f75 port=80}:endpoint{addr=10.244.18.123:3000}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73138.765340s] WARN ThreadId(01) outbound:proxy{addr=10.0.3.18:80}:service{ns=imodeljs name=svc-7ed19bdd30302ada2f8bab1286f9096fe1ee48b7 port=80}:endpoint{addr=10.244.30.115:3000}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73139.311818s] INFO ThreadId(01) outbound:proxy{addr=10.0.66.11:80}:service{ns=imodeljs name=svc-aafe19edc5d63559d25430238ffb792322920e7c port=80}: linkerd_proxy_balance_queue::worker: Unavailable; entering failfast timeout=3.0
[ 73139.311892s] INFO ThreadId(01) outbound:proxy{addr=10.0.66.11:80}:rescue{client.addr=10.244.101.75:57314}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=logical service 10.0.66.11:80: route default.http: backend Service.imodeljs.svc-aafe19edc5d63559d25430238ffb792322920e7c:80: Service.imodeljs.svc-aafe19edc5d63559d25430238ffb792322920e7c:80: service in fail-fast error.sources=[route default.http: backend Service.imodeljs.svc-aafe19edc5d63559d25430238ffb792322920e7c:80: Service.imodeljs.svc-aafe19edc5d63559d25430238ffb792322920e7c:80: service in fail-fast, backend Service.imodeljs.svc-aafe19edc5d63559d25430238ffb792322920e7c:80: Service.imodeljs.svc-aafe19edc5d63559d25430238ffb792322920e7c:80: service in fail-fast, Service.imodeljs.svc-aafe19edc5d63559d25430238ffb792322920e7c:80: service in fail-fast, service in fail-fast]
[ 73139.323187s] WARN ThreadId(01) outbound:proxy{addr=10.0.80.209:80}:service{ns=imodeljs name=svc-3d97016a986f0e00f5120c270cc160ff6f23bf57 port=80}:endpoint{addr=10.244.24.160:3000}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73139.570916s] INFO ThreadId(01) outbound:proxy{addr=10.0.66.11:80}:rescue{client.addr=10.244.101.75:57326}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=logical service 10.0.66.11:80: route default.http: service unavailable error.sources=[route default.http: service unavailable, service unavailable]
[ 73139.642228s] WARN ThreadId(01) outbound:proxy{addr=10.0.66.11:80}:service{ns=imodeljs name=svc-aafe19edc5d63559d25430238ffb792322920e7c port=80}:endpoint{addr=10.244.18.178:3000}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73139.669498s] WARN ThreadId(01) outbound:proxy{addr=10.0.181.10:4317}:service{ns=monitoring name=otel-collector port=4317}:endpoint{addr=10.244.2.4:4317}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73139.691646s] WARN ThreadId(01) outbound:proxy{addr=10.0.181.10:4317}:service{ns=monitoring name=otel-collector port=4317}:endpoint{addr=10.244.3.4:4317}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73139.865747s] WARN ThreadId(01) outbound:proxy{addr=10.0.181.10:4317}:service{ns=monitoring name=otel-collector port=4317}:endpoint{addr=10.244.18.8:4317}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73139.940952s] WARN ThreadId(01) outbound:proxy{addr=10.0.181.10:4317}:service{ns=monitoring name=otel-collector port=4317}:endpoint{addr=10.244.16.3:4317}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73140.015251s] WARN ThreadId(01) outbound:proxy{addr=10.0.181.10:4317}:service{ns=monitoring name=otel-collector port=4317}:endpoint{addr=10.244.10.10:4317}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 73140.028382s] WARN ThreadId(01) outbound:proxy{addr=10.0.181.10:4317}:service{ns=monitoring name=otel-collector port=4317}:endpoint{addr=10.244.24.11:4317}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
... etc
The proxy fails to connect, then goes into fail-fast mode. This is happening on only a single (client) pod out of 5 replicas. It is the youngest replica at <1d old; the others are ~5 days old. The other replicas have no issue connecting to this service (even while this one is having problems), and occasionally a request from the problematic pod does make it through successfully.
It's also not just this one service (00dfd97) that the pod has trouble connecting to; the same is happening with ~20 other services.
This happened yesterday as well: I removed the bad pod, and the new pod that replaced it had the same issue. After removing the pod a second time, it seemingly started working, until the problem returned a few hours later.
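A cross-check that can help narrow this down is to ask the control plane what it currently resolves for one of the affected services, and to probe one of the timing-out endpoints from outside the mesh. This is only a sketch (the service and endpoint names are taken from the logs above, and nicolaka/netshoot is just an example debug image); it verifies discovery data and raw reachability, it isn't a fix:
$ linkerd diagnostics endpoints svc-7ed19bdd30302ada2f8bab1286f9096fe1ee48b7.imodeljs.svc.cluster.local:80
# 10.244.30.115:3000 is one of the endpoints failing with "connect timed out" above
$ kubectl -n imodeljs run net-debug --rm -it --restart=Never --image=nicolaka/netshoot -- nc -zvw2 10.244.30.115 3000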
How can it be reproduced?
I don't know how to reproduce it. It seems to just happen after some time.
Logs, error output, etc
controller metrics: https://gist.github.com/andrewdinunzio/022c6db28b347cc333402af3092ac18a
(problematic client) proxy metrics: https://gist.github.com/andrewdinunzio/9597687d904d3ffe8cd30489c1568fb6
(server) proxy metrics: https://gist.github.com/andrewdinunzio/29ed81d33f915596a4ba72bf9a97db5a
Linkerd-init logs (client):
Policy controller logs:
2024-06-13T12:47:47.209114Z INFO linkerd_policy_controller: Lease already exists, no need to create it
2024-06-13T12:47:47.214136Z INFO grpc{port=8090}: linkerd_policy_controller: policy gRPC server listening addr=[::]:8090
2024-06-13T12:52:11.556884Z INFO external_workloads: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T12:52:13.547002Z INFO networkauthentications: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T12:52:16.435004Z INFO meshtlsauthentications: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T12:56:37.915245Z INFO networkauthentications: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T12:57:01.804959Z INFO serverauthorizations: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T12:57:32.253743Z INFO authorizationpolicies: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T12:57:32.254971Z INFO servers: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T12:57:32.256153Z INFO httproutes.policy.linkerd.io: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T12:57:32.264109Z INFO httproutes.gateway.networking.k8s.io: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T12:57:32.346189Z INFO authorizationpolicies: kubert::errors: stream failed error=error returned by apiserver during watch: too old resource version: 919557429 (919562865): Expired
2024-06-13T13:01:26.390023Z INFO serverauthorizations: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T13:01:31.524588Z INFO serverauthorizations: kubert::errors: stream failed error=error returned by apiserver during watch: too old resource version: 919557084 (919566533): Expired
2024-06-13T13:01:56.563801Z INFO external_workloads: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T13:02:14.948188Z INFO authorizationpolicies: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T13:06:04.198676Z INFO networkauthentications: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T13:06:10.988277Z INFO serverauthorizations: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T13:06:28.279723Z INFO meshtlsauthentications: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T13:06:33.366486Z INFO meshtlsauthentications: kubert::errors: stream failed error=error returned by apiserver during watch: too old resource version: 919565644 (919570834): Expired
2024-06-13T13:06:44.016472Z INFO authorizationpolicies: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T13:07:17.284613Z INFO httproutes.policy.linkerd.io: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T13:11:14.714751Z INFO meshtlsauthentications: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T13:11:15.358710Z INFO external_workloads: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T13:11:50.449105Z INFO servers: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T13:12:07.296544Z INFO httproutes.gateway.networking.k8s.io: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T13:15:54.230786Z INFO networkauthentications: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T13:15:56.024557Z INFO serverauthorizations: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T13:20:38.522047Z INFO meshtlsauthentications: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T13:21:05.386145Z INFO external_workloads: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T13:21:24.026874Z INFO authorizationpolicies: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T13:25:44.241994Z INFO networkauthentications: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T13:25:46.058145Z INFO serverauthorizations: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T13:26:10.098664Z INFO servers: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T13:26:23.893965Z INFO httproutes.gateway.networking.k8s.io: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T13:30:06.030611Z INFO meshtlsauthentications: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T13:30:23.780417Z INFO external_workloads: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T13:31:14.043831Z INFO authorizationpolicies: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T13:35:35.889872Z INFO servers: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T13:39:56.041890Z INFO meshtlsauthentications: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T13:40:01.022925Z INFO networkauthentications: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T13:40:13.790827Z INFO external_workloads: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T13:40:26.070287Z INFO serverauthorizations: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T13:41:03.920485Z INFO httproutes.gateway.networking.k8s.io: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T13:44:54.211229Z INFO external_workloads: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T13:45:00.677632Z INFO servers: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T13:45:54.060052Z INFO authorizationpolicies: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T13:49:22.560686Z INFO networkauthentications: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T13:49:46.059577Z INFO meshtlsauthentications: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T13:50:16.076464Z INFO serverauthorizations: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T13:50:53.928532Z INFO httproutes.gateway.networking.k8s.io: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T13:54:44.219365Z INFO external_workloads: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T13:55:44.068770Z INFO authorizationpolicies: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T13:59:36.068293Z INFO meshtlsauthentications: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T13:59:40.713378Z INFO servers: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T14:00:06.087488Z INFO serverauthorizations: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T14:00:13.802182Z INFO httproutes.gateway.networking.k8s.io: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T14:04:02.576301Z INFO networkauthentications: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T14:04:07.659296Z INFO networkauthentications: kubert::errors: stream failed error=error returned by apiserver during watch: too old resource version: 919616945 (919622505): Expired
2024-06-13T14:05:34.074444Z INFO authorizationpolicies: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T14:09:02.007269Z INFO external_workloads: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T14:09:02.078340Z INFO servers: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T14:09:26.080119Z INFO meshtlsauthentications: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T14:09:56.096428Z INFO serverauthorizations: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T14:10:03.824590Z INFO httproutes.gateway.networking.k8s.io: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T14:13:57.675280Z INFO networkauthentications: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T14:14:54.453441Z INFO authorizationpolicies: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T14:19:16.090038Z INFO meshtlsauthentications: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T14:19:23.054248Z INFO serverauthorizations: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T14:23:42.018107Z INFO external_workloads: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T14:24:43.852303Z INFO httproutes.gateway.networking.k8s.io: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T14:28:50.400630Z INFO serverauthorizations: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T14:33:32.028759Z INFO external_workloads: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T14:34:33.862793Z INFO httproutes.gateway.networking.k8s.io: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T14:38:40.408621Z INFO serverauthorizations: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T14:43:05.365968Z INFO external_workloads: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T14:43:55.043981Z INFO httproutes.gateway.networking.k8s.io: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T14:48:30.418314Z INFO serverauthorizations: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T14:48:55.050759Z INFO httproutes.gateway.networking.k8s.io: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T14:58:45.062891Z INFO httproutes.gateway.networking.k8s.io: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2024-06-13T15:12:42.877055Z INFO authorizationpolicies: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T15:13:02.300229Z INFO httproutes.gateway.networking.k8s.io: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-13T15:27:19.302560Z INFO httproutes.gateway.networking.k8s.io: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
(looks like a lot of errors but we see similar logs in other regions that are not exhibiting this problem)
output of linkerd check -o short
$ linkerd check -o short
control-plane-version
---------------------
‼ control plane is up-to-date
is running version 24.5.5 but the latest edge version is 24.6.1
see https://linkerd.io/2/checks/#l5d-version-control for hints
‼ control plane and cli versions match
control plane running edge-24.5.5 but cli running edge-24.6.1
see https://linkerd.io/2/checks/#l5d-version-control for hints
linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
some proxies are not running the current version:
* linkerd-destination-57559bb8bb-28rdj (edge-24.5.5)
* linkerd-identity-6f78fbcb7f-vfkkw (edge-24.5.5)
* linkerd-proxy-injector-6c9847c9d7-zcrlx (edge-24.5.5)
see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
linkerd-destination-57559bb8bb-28rdj running edge-24.5.5 but cli running edge-24.6.1
see https://linkerd.io/2/checks/#l5d-cp-proxy-cli-version for hints
Status check results are √
Environment
AKS
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
None
I captured everything from 1000 lines before the first mention of svc-d93b402b4545fe597edf8ba9e49be61ae01973c4 (the problematic server in this case) to 1000 lines after the last mention of it.
These seem suspect:
[ 2483.392885s] DEBUG ThreadId(01) outbound:accept{client.addr=10.244.27.147:45346 server.addr=10.0.89.172:80}:proxy{addr=10.0.89.172:80}:http: linkerd_stack::loadshed: Service has become unavailable
[ 2483.392892s] DEBUG ThreadId(01) outbound:accept{client.addr=10.244.27.147:45346 server.addr=10.0.89.172:80}:proxy{addr=10.0.89.172:80}:http: linkerd_stack::loadshed: Service shedding load
But it's only one of the client replicas having this issue. Others are sending requests just fine.
Is it just assuming load is the cause whenever the service is "unavailable"? Or is it the other way around, marking the service unavailable when it's under load? It's strange, because I definitely don't expect the server to have many simultaneous requests; I'd expect 5-10 at most.
Also, these logs seem like they would be good to have at INFO level; it's difficult for us to turn on DEBUG logging because debug logs include user tokens, etc.
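For what it's worth, a pattern that might help here (sketch only; the pod name is a placeholder, and the module-level filter directives are guesses based on the log targets above) is to raise the level at runtime through the proxy's admin server on port 4191, scoped to specific modules so the token-logging paths stay at their default level:
$ kubectl -n <client-namespace> port-forward <client-pod> 4191:4191 &
# keep the default level, but raise the balancer/load-shed modules to DEBUG
$ curl -X PUT --data 'warn,linkerd=info,linkerd_stack::loadshed=debug,linkerd_proxy_balance_queue=debug' localhost:4191/proxy-log-level
# revert to the default when finished
$ curl -X PUT --data 'warn,linkerd=info' localhost:4191/proxy-log-level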
I'm afraid I can't give you much guidance here without a reproducible scenario. We have some docs about failfast and about 503/504 errors that might help. What's special about the client that is causing these issues? Is it generating so many requests that the resources allocated to the proxy aren't enough? You could also look at the proxy metrics for the problematic pod and compare them against the healthy ones.
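A rough way to do that comparison (pod names are placeholders, and the grep pattern is only a guess at metric families worth eyeballing; adjust to whatever your proxies actually expose):
$ linkerd diagnostics proxy-metrics -n <app-namespace> pod/<problematic-pod> > bad.prom
$ linkerd diagnostics proxy-metrics -n <app-namespace> pod/<healthy-pod> > good.prom
# look for queue/connect-failure related series that differ between the two
$ grep -E 'failfast|loadshed|balancer|tcp_connect' bad.prom good.prom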