[Tracing] received corrupt message of type InvalidContentType - When Collector is not in mesh, or has ports marked inbound skip #13427

Open
jseiser opened this issue Dec 4, 2024 · 9 comments

@jseiser
Contributor

jseiser commented Dec 4, 2024

What is the issue?

Linkerd breaks traces when running against the OTLP collector, confusing the collector into thinking the traces come from the collector itself, not from the originating pod.

Example: grafana/alloy#1336 (comment)

As a workaround, we wanted to just remove the collector from the mesh, but that breaks linkerd-proxy's ability to send traces. We then attempted to leave the collector in the mesh but tell it to skip the relevant inbound ports; that also breaks linkerd-proxy's ability to send traces.

How can it be reproduced?

  1. AWS EKS Cluster
  2. Linkerd
  3. Grafana Alloy
  4. Configure Linkerd-jaeger to send traces to Grafana Alloy

Logs, error output, etc

This happens with Alloy removed from the mesh, or with Alloy set to skip the ports inbound.

{"timestamp":"2024-12-04T16:30:47.984933Z","level":"WARN","fields":{"message":"Failed to connect","error":"endpoint 10.1.7.129:4317: received corrupt message of type InvalidContentType"},"target":"linkerd_reconnect","threadId":"ThreadId(1)"}
{"timestamp":"2024-12-04T16:30:48.202478Z","level":"WARN","fields":{"message":"Failed to connect","error":"endpoint 10.1.16.177:4317: received corrupt message of type InvalidContentType"},"target":"linkerd_reconnect","threadId":"ThreadId(1)"}
{"timestamp":"2024-12-04T16:30:48.229660Z","level":"WARN","fields":{"message":"Failed to connect","error":"endpoint 10.1.16.177:4318: received corrupt message of type InvalidContentType"},"target":"linkerd_reconnect","threadId":"ThreadId(1)"}

output of linkerd check -o short

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ control plane pods are ready
√ cluster networks contains all pods
√ cluster networks contains all services

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ proxy-init container runs as root user if docker container runtime is used

linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor

linkerd-webhooks-and-apisvc-tls
-------------------------------
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days
√ policy-validator webhook has valid cert
√ policy-validator cert is valid for at least 60 days

linkerd-version
---------------
√ can determine the latest version
‼ cli is up-to-date
    is running version 24.11.4 but the latest edge version is 24.11.8
    see https://linkerd.io/2/checks/#l5d-version-cli for hints

control-plane-version
---------------------
√ can retrieve the control plane version
‼ control plane is up-to-date
    is running version 24.11.4 but the latest edge version is 24.11.8
    see https://linkerd.io/2/checks/#l5d-version-control for hints
√ control plane and cli versions match

linkerd-control-plane-proxy
---------------------------
√ control plane proxies are healthy
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
	* linkerd-destination-c9b6b96c-lwwv4 (edge-24.11.4)
	* linkerd-destination-c9b6b96c-m44vg (edge-24.11.4)
	* linkerd-destination-c9b6b96c-xm56z (edge-24.11.4)
	* linkerd-identity-7d9c687659-mftdm (edge-24.11.4)
	* linkerd-identity-7d9c687659-qcq5l (edge-24.11.4)
	* linkerd-identity-7d9c687659-sj5zp (edge-24.11.4)
	* linkerd-proxy-injector-7b5f5c7d66-c2q6z (edge-24.11.4)
	* linkerd-proxy-injector-7b5f5c7d66-cltg2 (edge-24.11.4)
	* linkerd-proxy-injector-7b5f5c7d66-nn9t7 (edge-24.11.4)
    see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints
√ control plane proxies and cli versions match

linkerd-ha-checks
-----------------
√ multiple replicas of control plane pods

linkerd-extension-checks
------------------------
√ namespace configuration for extensions

linkerd-jaeger
--------------
√ linkerd-jaeger extension Namespace exists
√ jaeger extension pods are injected
√ jaeger injector pods are running
√ jaeger extension proxies are healthy
‼ jaeger extension proxies are up-to-date
    some proxies are not running the current version:
	* jaeger-injector-bdf688f96-ddcxf (edge-24.11.4)
	* jaeger-injector-bdf688f96-xf8hn (edge-24.11.4)
    see https://linkerd.io/2/checks/#l5d-jaeger-proxy-cp-version for hints
√ jaeger extension proxies and cli versions match

linkerd-viz
-----------
√ linkerd-viz Namespace exists
√ can initialize the client
√ linkerd-viz ClusterRoles exist
√ linkerd-viz ClusterRoleBindings exist
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
√ tap API service is running
√ linkerd-viz pods are injected
√ viz extension pods are running
√ viz extension proxies are healthy
‼ viz extension proxies are up-to-date
    some proxies are not running the current version:
	* metrics-api-cfbfbfcbc-jr4fn (edge-24.11.4)
	* metrics-api-cfbfbfcbc-qbzgz (edge-24.11.4)
	* prometheus-5464dc854b-wl6w6 (edge-24.11.4)
	* tap-858b7b86d4-gl9mn (edge-24.11.4)
	* tap-858b7b86d4-vb8wz (edge-24.11.4)
	* tap-injector-d49bf4cfb-djfcv (edge-24.11.4)
	* tap-injector-d49bf4cfb-fv2zp (edge-24.11.4)
	* web-5dd7bf96db-dj55j (edge-24.11.4)
	* web-5dd7bf96db-qcg8m (edge-24.11.4)
    see https://linkerd.io/2/checks/#l5d-viz-proxy-cp-version for hints
√ viz extension proxies and cli versions match
√ prometheus is installed and configured correctly
√ viz extension self-check

Status check results are √

Environment

Kubernetes version - 1.30
Cluster Environment - EKS
Host OS - Bottlerocket

Possible solution

No response

Additional context

I'm not sure why port 4318 ever shows up in the logs. It's configured for 4317:

LINKERD2_PROXY_TRACE_COLLECTOR_SVC_ADDR:                   alloy-cluster.grafana-alloy.svc.cluster.local:4317
LINKERD2_PROXY_TRACE_PROTOCOL:                             opentelemetry
LINKERD2_PROXY_TRACE_SERVICE_NAME:                         linkerd-proxy
LINKERD2_PROXY_TRACE_COLLECTOR_SVC_NAME:                   alloy.grafana-alloy.serviceaccount.identity.linkerd.cluster.local

The pod it's failing to connect to is an Alloy pod.

❯ kubectl get pods -A -o wide | rg 10.1.16.177                                                
grafana-alloy               alloy-68fbb65465-wc6v5                                      3/3     Running     0             20h    10.1.16.177   i-03b207599fb2636db.us-gov-west-1.compute.internal   <none>           <none>

The linkerd-proxy on the Alloy pod logs this:

{"timestamp":"2024-12-04T17:01:18.536899Z","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","threadId":"ThreadId(1)"}                                         
{"timestamp":"2024-12-04T17:01:18.553171Z","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","threadId":"ThreadId(1)"}                                         
{"timestamp":"2024-12-04T17:01:19.037516Z","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","threadId":"ThreadId(1)"}                                         
{"timestamp":"2024-12-04T17:01:19.054832Z","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","threadId":"ThreadId(1)"} 

There are no errors on the Alloy pod itself.

Everything that is not linkerd-proxy is still able to send traces without a problem, including pods that are fully in the mesh themselves.

Would you like to work on fixing this bug?

None

@jseiser jseiser added the bug label Dec 4, 2024
@jseiser
Contributor Author

jseiser commented Dec 11, 2024

Is there any additional information which would make this easier to troubleshoot?

@wc-s
Contributor

wc-s commented Jan 9, 2025

We ran into the same error message. On a hunch, I thought it might be because the collector was not meshed. After meshing our collector, this error went away.
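
One plausible reading of the error, offered as an assumption rather than something confirmed in this thread: linkerd-proxy is configured with a collector identity, so it opens the collector connection with mTLS, and `InvalidContentType` is what a TLS client reports when the first byte of the reply is not a valid TLS record type. That is what you would expect if a plaintext listener answers, for example because the collector is unmeshed or its inbound ports are skipped. A minimal Python sketch of that first-byte check, with illustrative byte strings:

```python
# TLS record content types as assigned in the TLS specifications.
TLS_CONTENT_TYPES = {
    20: "change_cipher_spec",
    21: "alert",
    22: "handshake",
    23: "application_data",
    24: "heartbeat",
}


def classify_first_byte(data: bytes) -> str:
    """Name the TLS content type of the first byte, or flag it as invalid."""
    if not data:
        return "empty"
    return TLS_CONTENT_TYPES.get(data[0], "InvalidContentType")


# Start of a TLS server's reply: a handshake record (0x16), TLS 1.2 record version.
tls_reply = bytes([0x16, 0x03, 0x03, 0x00, 0x7A])

# Header of an empty HTTP/2 SETTINGS frame, as a plaintext gRPC listener
# (such as one on 4317) would typically send: 3-byte length, type 0x04,
# flags, 4-byte stream id.
h2_settings_frame = bytes([0x00, 0x00, 0x00, 0x04, 0x00, 0x00, 0x00, 0x00, 0x00])

# A plaintext HTTP/1.1 listener (such as one on 4318) answers with ASCII text.
http1_reply = b"HTTP/1.1 400 Bad Request\r\n"

print(classify_first_byte(tls_reply))          # handshake
print(classify_first_byte(h2_settings_frame))  # InvalidContentType
print(classify_first_byte(http1_reply))        # InvalidContentType
```

If this reading is right, the error means "the other side is not speaking TLS", which would be consistent with it disappearing once the collector is meshed.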

@jseiser
Contributor Author

jseiser commented Jan 9, 2025 via email

@wc-s
Contributor

wc-s commented Jan 9, 2025

Huh, what do you mean it's shown as being emitted by the collector?

For what it's worth, we now have it working correctly in our cluster, properly correlated with all our other spans.

@kflynn
Member

kflynn commented Jan 9, 2025

@wc-s I would love to hear how exactly you've set things up to have everything working – I'm not an Alloy expert and would like to learn more. 🙂

@jseiser
Contributor Author

jseiser commented Jan 9, 2025 via email

@jseiser
Contributor Author

jseiser commented Jan 9, 2025 via email

@wc-s
Contributor

wc-s commented Jan 9, 2025

OK, I gotta apologize, I didn't read the original post carefully enough and didn't realize you're using Grafana Alloy. We're using the upstream opentelemetry-collector directly. What's more, we don't use the k8sattributes processor and instead populate the k8s metadata some other way, so our experience is largely irrelevant to you 😅

However, examining your Alloy config, have you tried applying linkerd people's recommended collector config?

It's here: https://github.com/linkerd/linkerd2/blob/main/jaeger/charts/linkerd-jaeger/values.yaml#L120

Grafana Alloy I think is just a thin wrapper around opentelemetry-collector, so the config format is pretty much the same.

I suspect that the linkerd people also realized that here we cannot rely on Pod IP to associate, so there they used the host.name attribute (which is populated by linkerd-proxy and not retrieved by collector). This attribute is basically the pod name.

Firstly they rename the host.name attribute to k8s.pod.name here: https://github.com/linkerd/linkerd2/blob/main/jaeger/charts/linkerd-jaeger/values.yaml#L112

Then they tell collector to use that attribute to find the right pod: https://github.com/linkerd/linkerd2/blob/main/jaeger/charts/linkerd-jaeger/values.yaml#L124

And they run the resources processor before the k8sattributes processor: https://github.com/linkerd/linkerd2/blob/main/jaeger/charts/linkerd-jaeger/values.yaml#L184

Hopefully that makes the Collector find the correct Pod.
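
To make the idea concrete, here is a rough Python simulation of the two association strategies. It is only a sketch of the logic described above, not the collector's actual code: the `checkout` pod, its namespace, and its IP are made up, and it assumes the peer address the collector sees resolves to the collector's own pod (which matches the symptom in the original post).

```python
from typing import Dict, Optional

Attributes = Dict[str, str]

# Hypothetical pod inventory. The Alloy entry reuses values from this issue;
# the checkout pod is invented for illustration.
PODS: Dict[str, Attributes] = {
    "alloy-68fbb65465-wc6v5": {"namespace": "grafana-alloy", "ip": "10.1.16.177"},
    "checkout-7d9f-abcde": {"namespace": "shop", "ip": "10.1.9.20"},
}


def rename_attribute(resource: Attributes, old: str, new: str) -> Attributes:
    """Return a copy of the resource attributes with `old` renamed to `new`."""
    renamed = dict(resource)
    if old in renamed:
        renamed[new] = renamed.pop(old)
    return renamed


def find_pod_by_ip(ip: str) -> Optional[str]:
    """Associate by connection IP: pick the pod that owns the peer address."""
    for name, meta in PODS.items():
        if meta["ip"] == ip:
            return name
    return None


def find_pod_by_name(resource: Attributes) -> Optional[str]:
    """Associate by the k8s.pod.name resource attribute instead of the peer IP."""
    name = resource.get("k8s.pod.name")
    return name if name in PODS else None


# Span resource as linkerd-proxy on the checkout pod might send it.
span_resource = {"host.name": "checkout-7d9f-abcde", "service.name": "linkerd-proxy"}

# Assumption: the collector sees its own pod IP as the peer.
peer_ip_seen_by_collector = "10.1.16.177"

# IP-based association picks the collector pod, which is the reported symptom.
print(find_pod_by_ip(peer_ip_seen_by_collector))

# Rename first, then associate by attribute: the originating pod is found.
resource = rename_attribute(span_resource, "host.name", "k8s.pod.name")
print(find_pod_by_name(resource))
```

The ordering matters for the same reason as in the linked values.yaml: the rename has to happen before the lookup, otherwise there is no `k8s.pod.name` to match on.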

If the above doesn't work though, for what it's worth, you can probably get about half of the data you want without even using the k8sattributes processor.

linkerd-proxy populates the following attributes:

linkerd.io/proxy-deployment
linkerd.io/workload-ns
host.name

Which gets you the workload name, the workload namespace, and the pod name. The first two, though, are overridden by your current k8sattributes config:

        extract {
          annotation {
            from      = "pod"
            key_regex = "(.*)/(.*)"
            tag_name  = "$1.$2"
          }
          label {
            from      = "pod"
            key_regex = "(.*)/(.*)"
            tag_name  = "$1.$2"
          }

By removing that, you'd be able to see those two. But yeah, this still doesn't get you the node.name, pod.uid, pod.start_time, and the other labels and annotations.
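
If it helps when debugging, here is a small Python sketch of which tag names that extract block would produce for a given label or annotation key. It assumes the processor matches `key_regex` against the whole key and substitutes `$1` and `$2` with the capture groups; I haven't checked the exact matching rules, so treat the output as an approximation.

```python
import re
from typing import Optional

# Same pattern as key_regex in the extract block above.
KEY_REGEX = re.compile(r"(.*)/(.*)")


def tag_name_for(key: str) -> Optional[str]:
    """Apply tag_name = "$1.$2" to a key, or return None if the key has no slash."""
    match = KEY_REGEX.fullmatch(key)
    if match is None:
        return None
    return f"{match.group(1)}.{match.group(2)}"


for key in ["linkerd.io/proxy-deployment", "linkerd.io/workload-ns", "app"]:
    print(key, "->", tag_name_for(key))
```

Whether those names collide with what linkerd-proxy sets depends on the exact attribute keys on the span, which this thread doesn't show, so it's worth comparing against a real exported span.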

@jseiser
Contributor Author

jseiser commented Jan 13, 2025

@wc-s

> Grafana Alloy I think is just a thin wrapper around opentelemetry-collector, so the config format is pretty much the same.

It is; it's just that the config syntax is different, as you can see.

I know we had a lot of this in place already, but I'll go back and make sure the above is in there.

Thanks,
