
Agent fails with credential errors, cannot use IAM Roles for Service Accounts (IRSA) or EKS Pod Identities with EKS Addon #119

Open
AaronFriel opened this issue Mar 11, 2024 · 6 comments


@AaronFriel

Users may wish to run the CloudWatch Agent with pod-level IAM roles, using either IRSA (IAM Roles for Service Accounts) or EKS Pod Identities. The PR linked below recently added support for this: when the environment variable RUN_WITH_IRSA=true is set on the agent pod, the agent uses the default AWS credential provider chain.

However, the EKS Addon for AWS CloudWatch Observability creates a managed AmazonCloudWatchAgent resource, so adding environment variables to it by hand is unsafe: there is no guarantee the edit won't be overridden.

Background

CloudWatch Agent PR:

Expected behavior

Running the EKS Addon for AWS CloudWatch Observability with pod-based IAM should work by default.

Actual behavior

The agent fails with credential errors, and there is no supported knob for users to enable IRSA so that the agent works.

Proposal

Either of these solutions would address this:

  • The Addon and Operator should accept additional configuration that merges individual env vars, or arbitrary config, into the AmazonCloudWatchAgent custom resource (sketched below).
  • The Agent's configuration file, cwagentconfig.json, which is managed by the add-on, should accept a configuration key that enables the RUN_WITH_IRSA mode.
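
For illustration, here is roughly what the first option could look like from the Pulumi side, assuming the addon exposed such a setting through its configurationValues. The agent.env key is hypothetical and does not exist in the addon's schema today; clusterName and role come from the surrounding stack:

  // Hypothetical sketch: assumes the amazon-cloudwatch-observability addon
  // schema gained an `agent.env` key (no such key exists today).
  const addon = new aws.eks.Addon('cloudwatch-observability', {
    clusterName,
    addonName: 'amazon-cloudwatch-observability',
    serviceAccountRoleArn: role.arn,
    configurationValues: JSON.stringify({
      agent: {
        // The operator would merge these into the AmazonCloudWatchAgent
        // custom resource's spec.env.
        env: [{ name: 'RUN_WITH_IRSA', value: 'true' }],
      },
    }),
  });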
@jefchien
Contributor

Hi @AaronFriel,

We have a few questions that would help us look into your issue.

@AaronFriel
Author

Hey @jefchien thanks for getting back to me.

Yeah, this is the TypeScript (Pulumi IaC) code we use to deploy the CloudWatch Observability addon on EKS.

  // `aws` here is '@pulumi/aws'; `IamServiceAccountRole` is a local helper that
  // configures IRSA with an `AssumeRoleWithWebIdentity` trust policy - confirmed
  // working with other addons including:
  //   * AWS EFS CSI Driver
  //   * Karpenter Controller
  const role = IamServiceAccountRole(`${clusterPetName}-cloudwatch-observability`, {
    namespaceName: 'amazon-cloudwatch',
    serviceAccountName: 'cloudwatch-agent',
  });

  new aws.iam.RolePolicyAttachment(`${clusterPetName}-cloudwatch-observability-agent`, {
    policyArn: 'arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy',
    role,
  });
  new aws.iam.RolePolicyAttachment(`${clusterPetName}-cloudwatch-observability-xray`, {
    policyArn: 'arn:aws:iam::aws:policy/AWSXrayWriteOnlyAccess',
    role,
  });

  const addonVersion = aws.eks.getAddonVersionOutput({
    addonName: 'amazon-cloudwatch-observability',
    kubernetesVersion: clusterVersion,
    mostRecent: true,
  });

  const addon = new aws.eks.Addon(
    `${clusterPetName}-cloudwatch-observability`,
    {
      clusterName,
      addonName: addonVersion.addonName,
      addonVersion: addonVersion.version,
      serviceAccountRoleArn: role.arn,
      preserve: false,
    },
    { dependsOn },
  );

Yes, here are the agent logs. I've formatted these for readability:

2024-04-09T01:27:56Z E! {
  "caller": "[email protected]/cwlog_client.go:135",
  "msg": "cwlog_client: Error occurs in PutLogEvents",
  "kind": "exporter",
  "data_type": "metrics",
  "name": "awsemf/containerinsights",
  "error": "SharedCredsLoad: failed to load shared credentials file
  caused by: FailedRead: unable to open file
  caused by: open /root/.aws/credentials: no such file or directory",
  "stacktrace": "github.com/open-telemetry/opentelemetry-collector-contrib/internal/aws/cwlogs.(*Client).PutLogEvents
  \tgithub.com/open-telemetry/opentelemetry-collector-contrib/internal/aws/[email protected]/cwlog_client.go:135
  github.com/open-telemetry/opentelemetry-collector-contrib/internal/aws/cwlogs.(*logPusher).pushEventBatch
  \tgithub.com/open-telemetry/opentelemetry-collector-contrib/internal/aws/[email protected]/pusher.go:264
  github.com/open-telemetry/opentelemetry-collector-contrib/internal/aws/cwlogs.(*logPusher).AddLogEntry
  \tgithub.com/open-telemetry/opentelemetry-collector-contrib/internal/aws/[email protected]/pusher.go:238
  github.com/open-telemetry/opentelemetry-collector-contrib/exporter/awsemfexporter.(*emfExporter).pushMetricsData
  \tgithub.com/open-telemetry/opentelemetry-collector-contrib/exporter/[email protected]/emf_exporter.go:153
  go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsRequest).Export
  \tgo.opentelemetry.io/collector/[email protected]/exporterhelper/metrics.go:58
  go.opentelemetry.io/collector/exporter/exporterhelper.(*timeoutSender).send
  \tgo.opentelemetry.io/collector/[email protected]/exporterhelper/timeout_sender.go:38
  go.opentelemetry.io/collector/exporter/exporterhelper.(*baseRequestSender).send
  \tgo.opentelemetry.io/collector/[email protected]/exporterhelper/common.go:33
  go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send
  \tgo.opentelemetry.io/collector/[email protected]/exporterhelper/metrics.go:173
  go.opentelemetry.io/collector/exporter/exporterhelper.(*baseRequestSender).send
  \tgo.opentelemetry.io/collector/[email protected]/exporterhelper/common.go:33
  go.opentelemetry.io/collector/exporter/exporterhelper.(*baseExporter).send
  \tgo.opentelemetry.io/collector/[email protected]/exporterhelper/common.go:189
  go.opentelemetry.io/collector/exporter/exporterhelper.NewMetricsExporter.func1
  \tgo.opentelemetry.io/collector/[email protected]/exporterhelper/metrics.go:98
  go.opentelemetry.io/collector/consumer.ConsumeMetricsFunc.ConsumeMetrics
  \tgo.opentelemetry.io/collector/[email protected]/metrics.go:25
  github.com/open-telemetry/opentelemetry-collector-contrib/pkg/resourcetotelemetry.(*wrapperMetricsExporter).ConsumeMetrics
  \tgithub.com/open-telemetry/opentelemetry-collector-contrib/pkg/[email protected]/resource_to_telemetry.go:32
  go.opentelemetry.io/collector/processor/batchprocessor.(*batchMetrics).export
  \tgo.opentelemetry.io/collector/processor/[email protected]/batch_processor.go:442
  go.opentelemetry.io/collector/processor/batchprocessor.(*shard).sendItems
  \tgo.opentelemetry.io/collector/processor/[email protected]/batch_processor.go:256
  go.opentelemetry.io/collector/processor/batchprocessor.(*shard).start
  \tgo.opentelemetry.io/collector/processor/[email protected]/batch_processor.go:218"
}
2024-04-09T01:27:56Z W! {
  "caller": "[email protected]/batch_processor.go:258",
  "msg": "Sender failed",
  "kind": "processor",
  "name": "batch/containerinsights",
  "pipeline": "metrics/containerinsights",
  "error": "SharedCredsLoad: failed to load shared credentials file
  caused by: FailedRead: unable to open file
  caused by: open /root/.aws/credentials: no such file or directory"
}

The only way to set RUN_WITH_IRSA=true is to edit the amazoncloudwatchagents.cloudwatch.aws.amazon.com resource, because the operator reconciles the agent pods against it. The edit looks like this:

 apiVersion: v1
 items:
 - apiVersion: cloudwatch.aws.amazon.com/v1alpha1
   kind: AmazonCloudWatchAgent
   metadata:
     annotations:
       pulumi.com/patchForce: "true"
     creationTimestamp: "2024-04-01T08:21:38Z"
     generation: 5
     labels:
       app.kubernetes.io/managed-by: amazon-cloudwatch-agent-operator
     name: cloudwatch-agent
     namespace: amazon-cloudwatch
     resourceVersion: "3839446"
     uid: 542fecd4-0368-4ab1-8d8b-e7e5ad47c538
   spec:
     config: '{"agent":{"region":"us-west-2"},"logs":{"metrics_collected":{"app_signals":{"hosted_in":"opal-quokka-6860d02"},"kubernetes":{"cluster_name":"opal-quokka-6860d02","enhanced_container_insights":true}}},"traces":{"traces_collected":{"app_signals":{}}}}'
     env:
+    - name: RUN_WITH_IRSA
+      value: "true"
     - name: K8S_NODE_NAME
       valueFrom:
         fieldRef:
           fieldPath: spec.nodeName

This edit enables the CloudWatch Agent to succeed. However, there is no way to guarantee that this change is persistent, because the cloudwatch-agent resource is owned by the addon and could be overwritten during an upgrade.
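
For what it's worth, the edit can be applied from Pulumi as a server-side-apply patch. This is a minimal sketch, assuming @pulumi/kubernetes v4 and its apiextensions.CustomResourcePatch resource; whether spec.env merges with or replaces the operator's entries depends on the CRD's list semantics, and the overwrite caveat above still applies:

  import * as k8s from '@pulumi/kubernetes';

  // Server-side-apply patch that adds RUN_WITH_IRSA to the operator-managed
  // AmazonCloudWatchAgent resource. `pulumi.com/patchForce` takes ownership of
  // conflicting fields; the addon can still overwrite this during an upgrade.
  new k8s.apiextensions.CustomResourcePatch('cloudwatch-agent-irsa', {
    apiVersion: 'cloudwatch.aws.amazon.com/v1alpha1',
    kind: 'AmazonCloudWatchAgent',
    metadata: {
      name: 'cloudwatch-agent',
      namespace: 'amazon-cloudwatch',
      annotations: { 'pulumi.com/patchForce': 'true' },
    },
    spec: {
      env: [{ name: 'RUN_WITH_IRSA', value: 'true' }],
    },
  });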

@jefchien
Contributor

jefchien commented Apr 9, 2024

Do you have IMDS disabled or a hop limit set to 1? This seems like a similar issue to aws/amazon-cloudwatch-agent#1101 where the agent thinks it is onPrem because it cannot reach IMDS, which results in it trying to read the /root/.aws/credentials file.

@AaronFriel
Author

AaronFriel commented Apr 10, 2024

Yes, the hop limit is set to 1, because that is considered best practice, but also because if the agent can reach node IMDS, it is not using pod identity at all; it's using the node's identity.
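
For context, the node groups use the usual IMDS hardening. A rough sketch of the relevant launch template settings (assuming an aws.ec2.LaunchTemplate for the node group; names are illustrative):

  // IMDSv2 only with a hop limit of 1: pods, which sit one extra network hop
  // away, cannot reach IMDS and must authenticate via IRSA or Pod Identity
  // instead of the node role.
  const nodeLaunchTemplate = new aws.ec2.LaunchTemplate(`${clusterPetName}-nodes`, {
    metadataOptions: {
      httpEndpoint: 'enabled',
      httpTokens: 'required',
      httpPutResponseHopLimit: 1,
    },
  });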

If the CloudWatch Agent does not work with IMDS hop limit set to 1, what is this section doing?

https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/install-CloudWatch-Observability-EKS-addon.html#install-CloudWatch-Observability-EKS-addon-serviceaccountrole

I think the answer is "nothing"?

@charlierm

Any updates on this? It's blocking us from using the addon.

@jefchien
Contributor

We're working on providing an alternative to IMDS. You can track that in aws/amazon-cloudwatch-agent#1101.
