Proxy trying to connect to no-longer available endpoints #12781

Open
peterhuberit opened this issue Jun 26, 2024 · 4 comments

@peterhuberit

What is the issue?

Occasionally, some of our pod requests fail with a 504 error, which indicates that the request is trying to reach an IP address that is unavailable (or no longer available).
This issue does not occur when the Linkerd mesh is not in use. Restarting the affected pods resolves the issue.
The problem looks similar to #6842, but since we are using a newer version of Linkerd, it could be something else.

How can it be reproduced?

We have not been able to reproduce it yet. It happens from time to time, but we don't know what causes it.

Logs, error output, etc

The request failed with a 504 error while trying to reach the pods of the config-service service, because the destination IP (in this case 100.66.27.231) is no longer available.
linkerd tap output on the source pod:

req id=103:3 proxy=in  src=100.66.26.82:50966 dst=100.66.29.176:8080 tls=true :method=GET :authority=172.20.183.211 :path=/v1/configurations
req id=103:4 proxy=out src=100.66.29.176:43828 dst=100.66.27.231:8080 tls=not_provided_by_service_discovery :method=GET :authority=100.66.27.231:8080 :path=/config
rsp id=103:4 proxy=out src=100.66.29.176:43828 dst=100.66.27.231:8080 tls=not_provided_by_service_discovery :status=504 latency=1001350µs
end id=103:4 proxy=out src=100.66.29.176:43828 dst=100.66.27.231:8080 tls=not_provided_by_service_discovery duration=15µs response-length=0B
rsp id=103:3 proxy=in  src=100.66.26.82:50966 dst=100.66.29.176:8080 tls=true :status=500 latency=1014335µs
end id=103:3 proxy=in  src=100.66.26.82:50966 dst=100.66.29.176:8080 tls=true duration=3105µs response-length=191B
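
For anyone triaging similar captures, here is a minimal sketch of how the failing outbound destination could be pulled out of tap output like the above. It assumes the text format shown in this report (an event word followed by key=value tokens) and IPv4 ip:port destinations. The function names are hypothetical helpers, not part of Linkerd.

```python
import re

# Matches key=value tokens such as "dst=100.66.27.231:8080" or ":status=504".
_TOKEN = re.compile(r"(\S+?)=(\S+)")


def parse_tap_line(line):
    """Split one tap line into its event word (req/rsp/end) and key=value fields.

    Returns None for blank or malformed lines.
    """
    parts = line.split(None, 1)
    if len(parts) < 2:
        return None
    event, rest = parts
    fields = dict(_TOKEN.findall(rest))
    fields["event"] = event
    return fields


def failed_outbound_destinations(lines, status="504"):
    """Return destination IPs (port stripped) of outbound responses with the given status.

    Assumes IPv4 "ip:port" destinations, as in the capture above.
    """
    found = []
    for line in lines:
        fields = parse_tap_line(line)
        if not fields:
            continue
        if (
            fields["event"] == "rsp"
            and fields.get("proxy") == "out"
            and fields.get(":status") == status
        ):
            ip = fields.get("dst", "").rsplit(":", 1)[0]
            if ip and ip not in found:
                found.append(ip)
    return found
```

Feeding the six lines above into failed_outbound_destinations should single out the outbound 504 and ignore the inbound 500 that it caused.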

The IP 100.66.27.231 does not exist anywhere in the cluster, not just in config-service or its namespace. We checked all pod, service, and node IPs, and the IP was not in use at the time of the error.

Kubernetes endpoints for config-service:

kubectl get endpoints config-service -o json | jq ".subsets[0].addresses[] | .ip"
"100.66.24.216"
"100.66.27.242"
"100.66.28.41"

config-service linkerd endpoints:

NAMESPACE   IP              PORT   POD                               SERVICE
uat01       100.66.27.242   8080   config-service-64f874ff57-wv9lb   config-service.uat01
uat01       100.66.28.41    8080   config-service-64f874ff57-gfbl2   config-service.uat01
uat01       100.66.24.216   8080   config-service-64f874ff57-p444r   config-service.uat01
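
The comparison described above (destinations the proxy dialed versus the endpoints Kubernetes currently reports) can be sketched roughly as follows. It assumes the jq output format shown in this report, one quoted IP per line. Both function names are hypothetical.

```python
def parse_jq_ips(output):
    """Parse jq string output (one quoted IP per line) into a set of bare IPs."""
    return {
        line.strip().strip('"')
        for line in output.splitlines()
        if line.strip()
    }


def stale_destinations(observed, endpoints):
    """Return observed destination IPs that are absent from the current endpoint set."""
    return sorted(set(observed) - set(endpoints))
```

With the endpoint list from this report, 100.66.27.231 would be flagged as stale while 100.66.27.242 would not.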

Output of linkerd check -o short:

linkerd-identity
----------------
‼ issuer cert is valid for at least 60 days
    issuer certificate will expire on 2024-06-28T04:01:32Z
    see https://linkerd.io/2/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints

linkerd-webhooks-and-apisvc-tls
-------------------------------
‼ proxy-injector cert is valid for at least 60 days
    certificate will expire on 2024-06-27T03:17:53Z
    see https://linkerd.io/2/checks/#l5d-proxy-injector-webhook-cert-not-expiring-soon for hints
‼ sp-validator cert is valid for at least 60 days
    certificate will expire on 2024-06-27T03:17:10Z
    see https://linkerd.io/2/checks/#l5d-sp-validator-webhook-cert-not-expiring-soon for hints
‼ policy-validator cert is valid for at least 60 days
    certificate will expire on 2024-06-27T03:17:53Z
    see https://linkerd.io/2/checks/#l5d-policy-validator-webhook-cert-not-expiring-soon for hints

linkerd-version
---------------
‼ can determine the latest version
    Get "https://versioncheck.linkerd.io/version.json?version=edge-24.3.4&uuid=0b1baa44-cadd-4e23-a446-35219f6b800c&source=cli": stream error: stream ID 1; NO_ERROR; received from peer
    see https://linkerd.io/2/checks/#l5d-version-latest for hints
‼ cli is up-to-date
    unable to determine version channel
    see https://linkerd.io/2/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    unable to determine version channel
    see https://linkerd.io/2/checks/#l5d-version-control for hints
‼ control plane and cli versions match
    control plane running edge-24.3.2 but cli running edge-24.3.4
    see https://linkerd.io/2/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    unable to determine version channel
    see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
    linkerd-destination-6ffdcb5dc7-xpsgj running edge-24.3.2 but cli running edge-24.3.4
    see https://linkerd.io/2/checks/#l5d-cp-proxy-cli-version for hints

linkerd-viz
-----------
‼ tap API server cert is valid for at least 60 days
    certificate will expire on 2024-06-27T03:28:02Z
    see https://linkerd.io/2/checks/#l5d-tap-cert-not-expiring-soon for hints
‼ viz extension proxies are up-to-date
    Get "https://versioncheck.linkerd.io/version.json?version=edge-24.3.4&uuid=unknown&source=cli": stream error: stream ID 1; NO_ERROR; received from peer
    see https://linkerd.io/2/checks/#l5d-viz-proxy-cp-version for hints
‼ viz extension proxies and cli versions match
    metrics-api-76499b55cc-5p47g running edge-24.3.2 but cli running edge-24.3.4
    see https://linkerd.io/2/checks/#l5d-viz-proxy-cli-version for hints

Status check results are √

Environment

Kubernetes Version: v1.29.4-eks-036c24b
Cluster Environment: AWS
Linkerd version: edge-24.3.2

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

None

@andrewdinunzio

I think we are seeing this issue as well in 2024.5.5.

@peterhuberit
Author

Thank you for your answer, @andrewdinunzio. We're going to test with edge-2024.6.3, but we need a couple of days for it.

@peterhuberit
Author

We deployed edge-24.6.3, but the problem still persists.

@adleong
Member

adleong commented Aug 2, 2024

Hi @peterhuberit!

Are you able to provide full Linkerd proxy logs when you see this issue? The proxy logs when IP addresses are added to and removed from its load balancers, so this should help give us some clues as to what's going on here. Please try to provide the full log since the start of the process if possible so that we can get the full context.
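
One possible way to capture those logs is sketched below. It assumes the injected sidecar container is named linkerd-proxy (the usual name for injected pods) and that kubectl is on the PATH and pointed at the affected cluster. The helper names and the pod name in the usage note are hypothetical.

```python
import subprocess


def proxy_log_command(pod, namespace, container="linkerd-proxy"):
    """Build the kubectl invocation that dumps the sidecar proxy's log for one pod."""
    return ["kubectl", "logs", pod, "-n", namespace, "-c", container, "--timestamps"]


def save_proxy_logs(pod, namespace, path):
    """Run kubectl and write the proxy log to a file; raises if kubectl exits non-zero."""
    with open(path, "w") as out:
        subprocess.run(proxy_log_command(pod, namespace), stdout=out, check=True)
```

For example, save_proxy_logs("my-source-pod", "uat01", "proxy.log") would write the log of the current container instance to proxy.log. How far back that reaches depends on the node's log rotation, so it is worth capturing before restarting the affected pod.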


3 participants