Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allowing linkerd-proxy to accept inbound traffic despite Pod not being ready #13247

Open
tjorri opened this issue Oct 30, 2024 · 9 comments · May be fixed by #13557
Open

Allowing linkerd-proxy to accept inbound traffic despite Pod not being ready #13247

tjorri opened this issue Oct 30, 2024 · 9 comments · May be fixed by #13557
Labels

Comments

@tjorri
Copy link

tjorri commented Oct 30, 2024

What problem are you trying to solve?

We have a clustered service, which at boot time requires all the cluster nodes to establish connections with each other prior to the service (or any of the Pods) being able to pass readinessProbes (specifically we run StatefulSets with a headless Service set to publishNotReadyAddresses: true, which allows nodes to discover other nodes and then communicate on the clustering port on each Pod directly). We run a separate, regular Service for then providing the interface for other services to communicate to this service, which refuses traffic until the clustered service is ready through the readinessProbes.

When we enable linkerd-proxy sidecars, we observe that direct Pod-to-Pod communication is rejected by linkerd-proxy. If we deactivate the readinessProbes for the Pods (or set config.linkerd.io/skip-inbound-ports: [clustering port]), the cluster setup works, so our belief is that the unreadiness of the Pod causes the proxy to reject traffic.

In our situation we are attempting to use Linkerd mainly for convenient encryption of traffic as well as the mTLS-based authentication and authorization.

How should the problem be solved?

It would be ideal if we could potentially flag certain ports to be such that linkerd-proxy should accept traffic on those ports even if a readiness check fails.

I'm not fully familiar with the internals of Linkerd, so an initial naive, end-user view would be to think of it similar to the config.linkerd.io/skip-inbound-ports annotation/options, but I understand those are mainly used for the linkerd-proxy-init and iptables setup, so perhaps this would require some further coordination with other resources (potentially e.g. allowing configuration of a Server resource to state that the target should receive traffic irrespective of readiness checks).

Any alternatives you've considered?

Right now our less-than-elegant workaround is to omit the clustering ports from linkerd-proxy through the use of config.linkerd.io/skip-inbound-port/--inbound-ports-to-ignore configs. As the clustering ports are important from a security standpoint, we then overlay a NetworkPolicy that forbids access to those ports. This is a bit ugly but works in a single-cluster context, but would fail with multi-clustered setups, at which multi-cluster capable NetworkPolicies or similar structures would be needed, whereas having this type of functionality in Linkerd would allow keeping the toolset smaller and more elegant.

How would users interact with this feature?

No response

Would you like to work on this feature?

None

@tjorri
Copy link
Author

tjorri commented Oct 30, 2024

Actually ended up looking into this a bit more and I'm guessing that the endpoint watchers (e.g. https://github.com/linkerd/linkerd2/blob/main/controller/api/destination/watcher/endpoints_watcher.go#L917, https://github.com/linkerd/linkerd2/blob/main/controller/api/destination/watcher/endpoints_watcher.go#L1025) make decisions to omit unready endpoints.

https://linkerd.io/2-edge/features/protocol-detection/#marking-ports-as-opaque already supports flagging opaque ports via an annotation in the Service itself, so one approach could be to support Service-level annotation for communicating whether to allow unready Endpoints to still be included as destinations.

I'm not immediately sure whether this would bring any adverse effects, or whether it would be considered acceptable to expand the scope in this way.

@kflynn
Copy link
Member

kflynn commented Oct 31, 2024

@tjorri, as I recall from Slack, your use case is pretty unique -- can I trouble you to summarize that here, as well? I feel like just "have a way to ignore unready endpoints" is likely to be a bit overly blunt, so the details of your use case might be helpful too.

@tjorri
Copy link
Author

tjorri commented Oct 31, 2024

Sure, not sure about the uniqueness per se, but basically generally the issue I think is present in any type of clustered system which requires nodes of the cluster to build up state together with any other nodes directly before being ready to accept client connections (the readiness, as we understand it, is more related to this, rather than being ready to accept any connections).

Specifically in our case we currently have a statically sharded actor framework where effectively we have an arbitrary number of different shard types, which get spun out to StatefulSets and then Pods. Each Pod knows the desired topology, including the names of shard types, desired replica counts, and then thanks to the stable network IDs, they're able to figure out the DNS names for all other Pods across all the shard types. This in our case relies on the headless Service and publishing the addresses of unready nodes.

While we're building up the entire state of the system, we can't let clients connect yet so we have to report unready so that the primary Services continue to not give access to them. If we allowed readiness probes to succeed earlier, we would then have to build an additional level of handling readiness along the lines of "LB, we report ready but we really aren't ready yet, so just give us traffic and we'll then tell clients that nope, not actually ready.", which in our mind defeats the purpose of the readiness check for allowing client connections in the first place.

Conversely, we might have a Pod in an unready state for whatever reason, potentially e.g. due to having too much load, but we would still need the Pod to have communication channels to other Pods to allow moving state over, for example. So losing connectivity at that point for the intra-cluster level would be problematic.

I'm sorry if this is still a bit abstract, but I'd be happy to answer any more specific questions if I can. :)

Also, spent a short time digging into Linkerd more today and realized that what I think I'm after is the Pod readiness checks in the workload watcher and was thinking of just testing this out with some lightweight, Pod-level annotations for skipping the readiness requirement just for fun. It might indeed be a bit blunt and I'm not sure whether potentially relaxing the readiness requirement might impact other parts of the system (service discovery/load balancing?), so I'll take your word if you feel this is a dead end.

@olix0r
Copy link
Member

olix0r commented Nov 15, 2024

Proxies do not block inbound traffic on container readiness. This would be a chicken-and-egg problem: the proxy needs to proxy readiness probes!

The details you investigated about the destination controller's endpoint watcher aren't related to this--that controls how outbound proxies resolve Services; but I don't think that's the problem you're describing.

When we enable linkerd-proxy sidecars, we observe that direct Pod-to-Pod communication is rejected by linkerd-proxy.

Can you elaborate on this? Do the proxies log anything in this case?

If this is a non-HTTP protocol, it may be that you must mark this port as opaque.

@tjorri
Copy link
Author

tjorri commented Nov 15, 2024

Ok, so if the outbound proxies rely on the endpoint watcher details, would it then be the outbound proxies not sending traffic forward if the Service is not ready? And because of the headless Services being used for discovering the other Pods, would that be the issue here?

When I was looking at this a few weeks ago, I built a custom image which relaxed the readiness requirements for the watchers and got things working with it, which I guess would be in line with what you described.

Had to put this item on the back-burner for a while, but I'll try to get back to this and build some form of repro next week to explore this more and get back with more details.

@olix0r
Copy link
Member

olix0r commented Nov 15, 2024

would it then be the outbound proxies not sending traffic forward if the Service is not ready? And because of the headless Services being used for discovering the other Pods, would that be the issue here?

When an application resolves a headless service, it gets back pod IPs, and so Linkerd's outbound resolution should be in terms of the target pod and not the Service IP.

This goes through the workload watcher which does appear to ignore pods before they are marked ready. If I'm reading it correctly, my sense is that this filtering logic may be inappropriate. We should allow direct connections as soon requested.

Next time you get to look at this, can you inspect the output of

:; linkerd diagnostics profile $POD_IP:80

This will show the control plane's endpoint response for these connections.

If that contains a proper endpoint section, then the watcher logic probably isn't at play, and the next step will be to set the config.linkerd.io/proxy-log-level=linkerd=debug,info annotation to get a clearer picture of what behavior the client observing.

@tjorri
Copy link
Author

tjorri commented Nov 18, 2024

Alright, played around with this a bit more and here are some notes (spoiler: it's the workload watcher which is causing headaches).

  • Running Linkerd edge-24.10.2 with the linkerd-control-plane and linkerd-crds Helm chart versions 2024.10.2 with a fairly vanilla setup.
  • Deployed a simple nginx test setup[1].
    • Add some default index.html files to identify each Pod: kubectl exec -n test nginx-0 -c nginx -- sh -c "echo nginx-0 > /wwwroot/index.html" && kubectl exec -n test nginx-1 -c nginx -- sh -c "echo nginx-1 > /wwwroot/index.html".
  • In the initial state, both nginx Pods are left unready.
    • Access from nginx-0 to nginx-1 can be tested with curl -v http://nginx-1.nginx-headless/ on nginx-0. Response is HTTP 403.
    • linkerd diagnostics profile [nginx Pod IP]:80 is very empty[2].
  • The nginx Pods can be brought to ready state by touching /wwwroot/not-ready file in either of the containers, after which the readiness checks pass.
    • After this the curl test passes from one nginx Pod to the other.
    • linkerd diagnostics profile looks a lot better[3].
  • I had previously been testing this with tjorri@6d9dcd4, which allows annotating the Pods while running. Annotating an unready Pod similarly allowed then the Pod-to-Pod communication to happen and linkerd diagnostics profile looked also good.

So I guess it's fairly safe to say that the filtering logic for the workload watcher is probably too tight and allowing direct connections from the outset would help things.

[1] Simple nginx test setup.

apiVersion: v1
kind: Namespace
metadata:
  name: test
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-headless
  namespace: test
  labels:
    app: nginx
spec:
  clusterIP: None
  publishNotReadyAddresses: true
  selector:
    app: nginx
  ports:
    - port: 80
      targetPort: 80
      name: http
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: nginx
  namespace: test
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-config
  namespace: test
data:
  nginx.conf: |
    events {}
    http {
        server {
            listen 80;
            server_name localhost;

            root /wwwroot;
            index index.html;
        }
    }
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nginx
  namespace: test
  labels:
    app: nginx
spec:
  serviceName: nginx-headless
  podManagementPolicy: Parallel
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
      annotations:
        linkerd.io/inject: enabled
        config.linkerd.io/default-inbound-policy: deny
    spec:
      serviceAccountName: nginx
      volumes:
        - name: nginx-config
          configMap:
            name: nginx-config
        - name: nginx-wwwroot
          emptyDir: {}
      containers:
        - name: nginx
          image: nginx:latest
          ports:
            - containerPort: 80
              name: http
              protocol: TCP
          volumeMounts:
            - name: nginx-config
              mountPath: /etc/nginx/nginx.conf
              subPath: nginx.conf
            - name: nginx-wwwroot
              mountPath: /wwwroot
          readinessProbe:
            httpGet:
              path: /not-ready
              port: http
            initialDelaySeconds: 1
            periodSeconds: 5
---
apiVersion: policy.linkerd.io/v1beta3
kind: Server
metadata:
  name: nginx
  namespace: test
spec:
  podSelector:
    matchLabels:
      app: nginx
  port: http
---
apiVersion: policy.linkerd.io/v1alpha1
kind: MeshTLSAuthentication
metadata:
  name: nginx
  namespace: test
spec:
  identities:
    - nginx.test.serviceaccount.identity.linkerd.cluster.local
---
apiVersion: policy.linkerd.io/v1alpha1
kind: AuthorizationPolicy
metadata:
  name: nginx
  namespace: test
spec:
  targetRef:
    group: policy.linkerd.io
    kind: Server
    name: nginx
  requiredAuthenticationRefs:
    - group: policy.linkerd.io
      kind: MeshTLSAuthentication
      name: nginx

[2] Example linkerd diagnostics profile when nginx Pod is not ready:

$ linkerd diagnostics profile 10.0.38.41:80
{
  "retry_budget": {
    "retry_ratio": 0.2,
    "min_retries_per_second": 10,
    "ttl": {
      "seconds": 10
    }
  },
  "endpoint": {
    "addr": {
      "ip": {
        "Ip": {
          "Ipv4": 167781929
        }
      },
      "port": 80
    },
    "weight": 10000
  }
}

[3] Example linkerd diagnostics profile when nginx Pod is ready:

$ linkerd diagnostics profile 10.0.38.41:80
{
  "retry_budget": {
    "retry_ratio": 0.2,
    "min_retries_per_second": 10,
    "ttl": {
      "seconds": 10
    }
  },
  "endpoint": {
    "addr": {
      "ip": {
        "Ip": {
          "Ipv4": 167781929
        }
      },
      "port": 80
    },
    "weight": 10000,
    "metric_labels": {
      "control_plane_ns": "linkerd",
      "namespace": "test",
      "pod": "nginx-0",
      "serviceaccount": "nginx",
      "statefulset": "nginx",
      "zone": ""
    },
    "tls_identity": {
      "Strategy": {
        "DnsLikeIdentity": {
          "name": "nginx.test.serviceaccount.identity.linkerd.cluster.local"
        }
      },
      "server_name": {
        "name": "nginx.test.serviceaccount.identity.linkerd.cluster.local"
      }
    },
    "protocol_hint": {
      "Protocol": null
    },
    "http2": {
      "keep_alive": {
        "interval": {
          "seconds": 10
        },
        "timeout": {
          "seconds": 3
        },
        "while_idle": true
      }
    }
  }
}

@tjorri
Copy link
Author

tjorri commented Dec 31, 2024

@olix0r, just following up on this: any opinions on just relaxing the filtering logic on the workload watcher? Happy to raise a PR for that if it's something that you guys would feel would be appropriate as well.

@olix0r olix0r added bug and removed enhancement labels Jan 9, 2025
@olix0r
Copy link
Member

olix0r commented Jan 9, 2025

@tjorri Yeah, I agree that the destination controller should return endpoint metadata in GetProfiles when the pod isn't yet ready. We'd be happy to review a PR!

tjorri added a commit to tjorri/linkerd2 that referenced this issue Jan 13, 2025
Requiring Pods to pass readiness checks before allowing Pod to Pod communication disrupts communication in e.g. clustered systems which require Pods to communicate with each other prior to establishing ready state and allowing inbound traffic.

Relaxed the requirement and modified the workload watcher to only require that a Pod exists and is in Running phase.

Reproduced the issue with a test setup described in linkerd#13247.

Signed-off-by: Tuomo Jorri <[email protected]>
tjorri added a commit to tjorri/linkerd2 that referenced this issue Jan 13, 2025
Requiring Pods to pass readiness checks before allowing Pod to Pod communication disrupts communication in e.g. clustered systems which require Pods to communicate with each other prior to establishing ready state and allowing inbound traffic.

Relaxed the requirement and modified the workload watcher to only require that a Pod exists and is in Running phase.

Reproduced the issue with a test setup described in linkerd#13247.

Fixes linkerd#13247.

Signed-off-by: Tuomo Jorri <[email protected]>
tjorri added a commit to tjorri/linkerd2 that referenced this issue Jan 13, 2025
Requiring Pods to pass readiness checks before allowing Pod to Pod communication disrupts communication in e.g. clustered systems which require Pods to communicate with each other prior to establishing ready state and allowing inbound traffic.

Relaxed the requirement and modified the workload watcher to only require that a Pod exists and is in Running phase.

Reproduced the issue with a test setup described in linkerd#13247.

Fixes linkerd#13247.

Signed-off-by: Tuomo <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants