LinkerD-proxy does not upgrade HTTP request to HTTPS, randomly #13013

Open

florian-besser opened this issue Sep 3, 2024 · 4 comments

florian-besser commented Sep 3, 2024

What is the issue?

I have a meshed Prometheus that scrapes all instances of linkerd-proxy in our K8s cluster. This works for all instances except one pod, from which it cannot get a proper response.
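
For context, the scrape job is essentially the stock linkerd-viz style configuration that targets every pod's linkerd-admin port; a rough sketch of what it looks like (the relabeling details here are assumed, not copied from our actual config):

- job_name: 'linkerd-proxy'
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    # Keep only containers named linkerd-proxy exposing the linkerd-admin port (4191)
    - source_labels:
        - __meta_kubernetes_pod_container_name
        - __meta_kubernetes_pod_container_port_name
      action: keep
      regex: linkerd-proxy;linkerd-admin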

How can it be reproduced?

I exec into the Prometheus pod and run wget "http://10.0.3.202:4191/metrics", which yields:

Connecting to 10.0.3.202:4191 (10.0.3.202:4191)
wget: server returned error: HTTP/1.1 403 Forbidden

IP 10.0.3.202 belongs to the pod reporting-depl-c79d4b7c4-w2vbm. Both the target pod and Prometheus have the LinkerD proxy injected.

Logs, error output, etc

I check the target's logs with kubectl logs -n gsg reporting-depl-c79d4b7c4-w2vbm -c linkerd-proxy:

[156002.685829s]  INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}: linkerd_app_inbound::policy::http: Request denied server.group=policy.linkerd.io server.kind=server server.name=gsg-reporting-linkerd route.group= route.kind=default route.name=default client.tls=None(NoClientHello) client.ip=10.0.3.159
[156002.685871s]  INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:rescue{client.addr=10.0.3.159:60112}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=unauthorized request on route

This is weird; I would have expected the LinkerD proxy of Prometheus (the source) to use HTTPS / TLS, but it seemingly decided against that. The target's logs suggest that the target LinkerD proxy rejected a non-TLS connection.
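
One way to cross-check what the proxies themselves think about this edge is linkerd viz edges (a sketch, assuming the viz extension is installed; it should show whether a mutual-TLS client/server identity is established between the Prometheus pod and the target):

# Sketch: list edges and their TLS identities for both namespaces (requires linkerd-viz)
linkerd viz edges po -n gsg
linkerd viz edges po -n alm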

Debugging further on the source (Prometheus) side, I add config.linkerd.io/proxy-log-level: trace as an annotation, which yields:

[ 50137.490436s] DEBUG ThreadId(01) outbound:accept{client.addr=10.0.3.159:45014 server.addr=10.0.3.202:4191}:proxy{addr=10.0.3.202:4191}:http:forward{addr=10.0.3.202:4191}:http1: linkerd_proxy_transport::connect: Connecting server.addr=10.0.3.202:4191
[ 50137.490433s] DEBUG ThreadId(01) outbound:accept{client.addr=10.0.3.159:45014 server.addr=10.0.3.202:4191}:proxy{addr=10.0.3.202:4191}:http:forward{addr=10.0.3.202:4191}:http1: linkerd_tls::client: Peer does not support TLS reason=not_provided_by_service_discovery
[ 50137.490427s] TRACE ThreadId(01) outbound:accept{client.addr=10.0.3.159:45014 server.addr=10.0.3.202:4191}:proxy{addr=10.0.3.202:4191}:http:forward{addr=10.0.3.202:4191}:http1: linkerd_app_outbound::tcp::tagged_transport: Not attempting opaque transport reason=not_provided_by_service_discovery

[ 50137.490181s] DEBUG ThreadId(01) outbound:accept{client.addr=10.0.3.159:45014 server.addr=10.0.3.202:4191}:proxy{addr=10.0.3.202:4191}:http:forward{addr=10.0.3.202:4191}:http1: linkerd_proxy_http::client: headers={"host": "10.0.3.202:4191", "user-agent": "Prometheus/2.51.1", "accept": "application/openmetrics-text;version=1.0.0;q=0.5,application/openmetrics-text;version=0.0.1;q=0.4,text/plain;version=0.0.4;q=0.3,*/*;q=0.2", "accept-encoding": "gzip", "x-prometheus-scrape-timeout-seconds": "10"}

[ 50137.520218s] TRACE ThreadId(01) outbound:accept{client.addr=10.0.3.159:45014 server.addr=10.0.3.202:4191}:proxy{addr=10.0.3.202:4191}:http:encode_headers: hyper::proto::h1::role: Server::encode status=403, body=None, req_method=Some(GET)
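
For reference, the trace logging above was enabled through the proxy-log-level annotation on the Prometheus pod template, roughly like this (a sketch; surrounding resource fields omitted):

# Sketch: pod template fragment enabling proxy trace logging
spec:
  template:
    metadata:
      annotations:
        config.linkerd.io/proxy-log-level: trace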

output of linkerd check -o short

control-plane-version

can retrieve the control plane version

control plane is up-to-date

unsupported version channel: stable-2.14.10

see https://linkerd.io/2.14/checks/#l5d-version-control for hints

kubernetes-api

can query the Kubernetes API

kubernetes-version

is running the minimum Kubernetes API version

linkerd-config

control plane Namespace exists

control plane ClusterRoles exist

control plane ClusterRoleBindings exist

control plane ServiceAccounts exist

control plane CustomResourceDefinitions exist

control plane MutatingWebhookConfigurations exist

control plane ValidatingWebhookConfigurations exist

proxy-init container runs as root user if docker container runtime is used

linkerd-existence

'linkerd-config' config map exists

heartbeat ServiceAccount exist

control plane replica sets are ready

no unschedulable pods

control plane pods are ready

cluster networks contains all pods

cluster networks contains all services

linkerd-version

can determine the latest version


Environment

K8s 1.28 on AWS EKS
Host: AL2023, running on t4g.large machines

Possible solution

No response

Additional context

The issue randomly goes away after a few hours, only to return a few minutes later. Once it happens, it is reproducible. I would be happy to contribute a fix once someone can enlighten me as to what's actually going wrong.

I originally raised this issue in the LinkerD Slack: https://linkerd.slack.com/archives/C89RTCWJF/p1724914178271069

Would you like to work on fixing this bug?

yes

@florian-besser
Author

I received additional questions via Slack, so copying them here:
linkerd identity -n gsg reporting-depl-c79d4b7c4-w2vbm

POD reporting-depl-c79d4b7c4-w2vbm (1 of 1)

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 506 (0x1fa)
    Signature Algorithm: ECDSA-SHA256
        Issuer: CN=identity.linkerd.cluster.local
        Validity
            Not Before: Sep 1 09:02:31 2024 UTC
            Not After : Sep 2 09:03:11 2024 UTC
        Subject: CN=reporting.gsg.serviceaccount.identity.linkerd.cluster.local
        Subject Public Key Info:
            Public Key Algorithm: ECDSA
                Public-Key: (256 bit)
                X:
                    66:5d:0f:12:92:29:dd:38:8b:3b:8f:9c:e7:85:b1:
                    dc:45:a8:dc:13:b4:73:1b:c4:0f:6b:18:a2:8e:b4:
                    c8:6f
                Y:
                    9d:b3:76:6d:14:0f:e2:12:25:24:d6:4f:f1:bb:92:
                    68:34:da:d2:2e:b6:96:ef:07:78:7e:d5:30:df:e8:
                    e0:7f
                Curve: P-256
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment
            X509v3 Extended Key Usage:
                TLS Web Server Authentication, TLS Web Client Authentication
            X509v3 Authority Key Identifier:
                keyid:D8:35:36:51:4C:7B:B8:A2:D9:A2:38:CA:3E:F3:93:83:CC:DE:F6:CF
            X509v3 Subject Alternative Name:
                DNS:reporting.gsg.serviceaccount.identity.linkerd.cluster.local

    Signature Algorithm: ECDSA-SHA256
         30:44:02:20:2f:45:bf:69:db:12:c6:3a:65:9d:bb:78:51:69:
         32:39:0a:00:1e:14:49:7a:b2:99:6b:a6:93:75:aa:34:75:74:
         02:20:4a:dc:db:58:85:c1:83:17:6b:22:f1:86:a9:7d:ce:e8:
         58:f6:39:de:d0:94:85:53:72:0e:8e:e6:90:07:be:ee

How is traffic allowed?

kubectl get Server -n gsg -o yaml gsg-reporting-linkerd

apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
  creationTimestamp: "2024-08-14T07:35:26Z"
  generation: 1
  name: gsg-reporting-linkerd
  namespace: gsg
  resourceVersion: "318520442"
  uid: 8b509112-93ff-4b55-9b8f-93a9f19201eb
spec:
  podSelector:
    matchLabels:
      run: reporting-pod
  port: 4191
  proxyProtocol: HTTP/2

As well as:

apiVersion: policy.linkerd.io/v1alpha1
kind: AuthorizationPolicy
metadata:
  creationTimestamp: "2024-08-14T07:39:18Z"
  generation: 1
  name: allow-alm-prometheus-server-to-gsg-reporting-linkerd
  namespace: gsg
  resourceVersion: "318524492"
  uid: 34db3400-9f8c-4225-b5b0-d3aed4caf22d
spec:
  requiredAuthenticationRefs:
  - kind: ServiceAccount
    name: prometheus-server
    namespace: alm
  targetRef:
    group: policy.linkerd.io
    kind: Server
    name: gsg-reporting-linkerd
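
Worth noting (as I understand the policy API): a ServiceAccount listed in requiredAuthenticationRefs is only satisfied by a client that presents that identity over mTLS, so any plaintext request is rejected by this policy regardless of its source. The same requirement could also be expressed via a MeshTLSAuthentication, sketched below (the resource name prometheus-server-mtls is made up):

apiVersion: policy.linkerd.io/v1alpha1
kind: MeshTLSAuthentication
metadata:
  name: prometheus-server-mtls
  namespace: gsg
spec:
  identityRefs:
    - kind: ServiceAccount
      name: prometheus-server
      namespace: alm

with the AuthorizationPolicy's requiredAuthenticationRefs then pointing at that MeshTLSAuthentication instead of the ServiceAccount directly.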

To ensure this is correct, I run kubectl describe pod -n gsg reporting-depl-c79d4b7c4-w2vbm:

Name:             reporting-depl-c79d4b7c4-w2vbm
Namespace:        gsg
Priority:         0
Service Account:  reporting
Node:             ip-10-0-3-68.ap-southeast-1.compute.internal/10.0.3.68
Start Time:       Tue, 27 Aug 2024 19:25:12 +0800
Labels:           linkerd.io/control-plane-ns=linkerd
                  linkerd.io/proxy-deployment=reporting-depl
                  linkerd.io/workload-ns=gsg
                  pod-template-hash=c79d4b7c4
                  run=reporting-pod

As well as kubectl describe pod -n alm prometheus-server-5b64477998-x8kxn:

Name:             prometheus-server-5b64477998-x8kxn
Namespace:        alm
Priority:         0
Service Account:  prometheus-server

The weird thing is that we use the same mechanism successfully for dozens of other pods. We have several separate K8s clusters for our environments (test, UAT, prod), and this works fine in all environments except UAT. It's the same code (we use Terraform, so I can be reasonably sure of this), which is why this is such a weird thing to be seeing.

@alpeb alpeb added support and removed bug labels Sep 4, 2024
@alpeb
Member

alpeb commented Sep 4, 2024

Have you got results back from changing the proxyProtocol to HTTP/1?
Also, you can run linkerd authz -n gsg po/reporting-depl-c79d4b7c4-w2vbm to verify the policy that is getting applied.
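
(For clarity, that would mean only the proxyProtocol line of the Server above changes; a sketch:)

apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
  name: gsg-reporting-linkerd
  namespace: gsg
spec:
  podSelector:
    matchLabels:
      run: reporting-pod
  port: 4191
  proxyProtocol: HTTP/1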

@florian-besser
Author

The issue has not reappeared; I'm still confused as to why the incorrect protocol worked in all but one case, but hey 🤷
Output of linkerd authz -n gsg po/reporting-depl-c79d4b7c4-q8rw5

ROUTE   SERVER                 AUTHORIZATION_POLICY                                  SERVER_AUTHORIZATION                       
*       gsg-reporting-linkerd  allow-alm-prometheus-server-to-gsg-reporting-linkerd                                  

I would reopen this if the issue returns, but for now everything seems to be working fine.

@florian-besser
Author

Unfortunately the issue has recurred, with a different pod, but the behavior is the same.

The linkerd check output is still as before.

From inside Prometheus:

/prometheus $ wget "http://10.0.3.86:4191/metrics"
Connecting to 10.0.3.86:4191 (10.0.3.86:4191)
wget: server returned error: HTTP/1.1 403 Forbidden

Finding the pod with kubectl get pods -n ops-rapid-testing -o wide:

onboarding-depl-66554bdf9f-8mh78   2/2     Running     0              20m     10.0.3.86    ip-10-0-3-146.ap-southeast-1.compute.internal   <none>           <none>

kubectl logs -n ops-rapid-testing onboarding-depl-66554bdf9f-8mh78 -c linkerd-proxy

[  1289.245238s]  INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}: linkerd_app_inbound::policy::http: Request denied server.group=policy.linkerd.io server.kind=server server.name=ops-rapid-testing-onboarding-linkerd route.group= route.kind=default route.name=default client.tls=None(NoClientHello) client.ip=10.0.3.252
[  1289.245771s]  INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:rescue{client.addr=10.0.3.252:52222}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=unauthorized request on route
[  1289.401536s]  INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}: linkerd_app_inbound::policy::http: Request denied server.group=policy.linkerd.io server.kind=server server.name=ops-rapid-testing-onboarding-linkerd route.group= route.kind=default route.name=default client.tls=None(NoClientHello) client.ip=10.0.3.252
[  1289.401609s]  INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:rescue{client.addr=10.0.3.252:52222}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=unauthorized request on route

Again, no TLS.

linkerd identity -n ops-rapid-testing onboarding-depl-66554bdf9f-8mh78

POD onboarding-depl-66554bdf9f-8mh78 (1 of 1)

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 170 (0xaa)
    Signature Algorithm: ECDSA-SHA256
        Issuer: CN=identity.linkerd.cluster.local
        Validity
            Not Before: Sep 12 02:34:14 2024 UTC
            Not After : Sep 13 02:34:54 2024 UTC
        Subject: CN=onboarding.ops-rapid-testing.serviceaccount.identity.linkerd.cluster.local
        Subject Public Key Info:
            Public Key Algorithm: ECDSA
                Public-Key: (256 bit)
                X:
                    50:10:45:23:23:84:bb:1b:04:fa:e9:d4:8f:b6:63:
                    4c:e4:f5:e1:8f:aa:56:ce:eb:58:5a:53:8d:14:97:
                    15:d9
                Y:
                    23:75:15:d2:28:86:0e:af:65:e7:d0:69:2a:01:36:
                    d2:b8:2d:72:3a:d7:f0:79:87:da:2f:59:0e:b9:78:
                    42:d6
                Curve: P-256
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment
            X509v3 Extended Key Usage:
                TLS Web Server Authentication, TLS Web Client Authentication
            X509v3 Authority Key Identifier:
                keyid:D8:35:36:51:4C:7B:B8:A2:D9:A2:38:CA:3E:F3:93:83:CC:DE:F6:CF
            X509v3 Subject Alternative Name:
                DNS:onboarding.ops-rapid-testing.serviceaccount.identity.linkerd.cluster.local

    Signature Algorithm: ECDSA-SHA256
         30:45:02:20:6d:63:2c:72:c2:48:f6:ef:e6:89:ba:2c:70:24:
         c7:f3:32:29:f3:0d:4e:6e:9f:11:70:fc:be:a5:20:98:f6:e5:
         02:21:00:ad:d9:fd:17:fc:9f:25:df:a3:b4:30:d5:fb:62:66:
         05:89:82:50:94:06:e0:d8:3a:13:81:22:3d:e3:a7:cc:93

kubectl get Server -n ops-rapid-testing -o yaml ops-rapid-testing-onboarding-linkerd

apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
  creationTimestamp: "2024-09-10T07:40:35Z"
  generation: 1
  name: ops-rapid-testing-onboarding-linkerd
  namespace: ops-rapid-testing
  resourceVersion: "332710654"
  uid: b599266f-85b1-4842-a557-179ae2c5a3ed
spec:
  podSelector:
    matchLabels:
      run: onboarding-pod
  port: 4191
  proxyProtocol: HTTP/1

And kubectl get AuthorizationPolicy -n ops-rapid-testing allow-alm-prometheus-server-to-ops-rapid-testing-onboarding-linkerd -o yaml

apiVersion: policy.linkerd.io/v1alpha1
kind: AuthorizationPolicy
metadata:
  creationTimestamp: "2024-09-10T07:41:26Z"
  generation: 1
  name: allow-alm-prometheus-server-to-ops-rapid-testing-onboarding-linkerd
  namespace: ops-rapid-testing
  resourceVersion: "332711182"
  uid: 9b143a83-2fc0-42a6-b654-b00821db3ba8
spec:
  requiredAuthenticationRefs:
  - kind: ServiceAccount
    name: prometheus-server
    namespace: alm
  targetRef:
    group: policy.linkerd.io
    kind: Server
    name: ops-rapid-testing-onboarding-linkerd

The Prometheus pod uses the prometheus-server ServiceAccount, the same as previously.

kubectl describe pod -n ops-rapid-testing onboarding-depl-66554bdf9f-8mh78

Name:             onboarding-depl-66554bdf9f-8mh78
Namespace:        ops-rapid-testing
Priority:         0
Service Account:  onboarding
Node:             ip-10-0-3-146.ap-southeast-1.compute.internal/10.0.3.146
Start Time:       Thu, 12 Sep 2024 10:34:32 +0800
Labels:           linkerd.io/control-plane-ns=linkerd
                  linkerd.io/proxy-deployment=onboarding-depl
                  linkerd.io/workload-ns=ops-rapid-testing
                  pod-template-hash=66554bdf9f
                  run=onboarding-pod

So far I'm not seeing anything that would prevent TLS, and this is again appearing randomly for a single pod while several other pods are working just fine.

What other logs could I turn on to debug this further? I've tried config.linkerd.io/proxy-log-level: trace (see above), but that just gave me linkerd_tls::client: Peer does not support TLS reason=not_provided_by_service_discovery, which I was unable to debug further.
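
A couple of things that could be tried next (sketches only; the runtime log-level endpoint and the controller/container names are taken from the Linkerd docs as I understand them, so treat them as assumptions):

# Sketch: see what the destination controller logged around the failing lookups for this pod IP
kubectl logs -n linkerd deploy/linkerd-destination -c destination | grep 10.0.3.86

# Sketch: bump the client proxy's log level at runtime via its admin endpoint,
# instead of redeploying with the annotation
kubectl port-forward -n alm prometheus-server-5b64477998-x8kxn 4191:4191 &
curl -X PUT --data 'warn,linkerd=debug' http://localhost:4191/proxy-log-level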
