cloudwatch-agent doesn't appear to be working with IMDSv2 w/ auth enabled #249

Closed
JohnPolansky opened this issue Jul 26, 2021 · 4 comments

JohnPolansky commented Jul 26, 2021

Hey All,

We've been using the amazon-cloudwatch-agent for a while now and so far we have been loving it. We set it up using the documentation found here: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-setup-logs-FluentBit.html

However, during a security audit we recently discovered that all our AWS (EKS) EC2 instances are running IMDS with open/unauthenticated access. As this is a security vulnerability, we wanted to remove the concern. To that end, we updated our Terraform EC2 instance templates to switch the IMDS interface to authenticated with a hop limit of 1:

  metadata_options {
    http_endpoint               = "enabled"
    http_tokens                 = "required"
    http_put_response_hop_limit = 1
  }

This worked as expected: the IMDS endpoint is still enabled but now requires a session token, which we assumed would work through the service account provisioned for the cluster. We currently have 2 node groups set up: NodeGroupA was configured for IMDS with required tokens and a hop limit of 1, and NodeGroupB was left on the default open access. However, we didn't notice until several days after the change that all the cloudwatch-agents on NodeGroupA had started failing continuously:

NAME                                 READY   STATUS    RESTARTS   AGE
cloudwatch-agent-5hbk8               1/1     Running   0          38d
cloudwatch-agent-lp688               1/1     Running   0          38d
cloudwatch-agent-tzzpm               1/1     Running   5938       38d
cwagent-prometheus-57d597f5c-65cvw   1/1     Running   0          52d
fluent-bit-245ns                     1/1     Running   0          64d
fluent-bit-k67d8                     1/1     Running   0          64d
fluent-bit-s84tg                     1/1     Running   0          52d

This has resulted in a total loss of CloudWatch logs for any pods in NodeGroupA, and all Container Insights performance data is also being lost for the same node group. We are hoping someone can help us with this issue, as it would seem best practice not to leave IMDS with open access. We've temporarily enabled aws-for-fluent-bit on the cluster to at least ensure our CloudWatch logs keep coming in, but we'd like to stick with your tool if we can.

Here are the logs for a standard failure on NodeGroupA:

+ cloudwatch-agent-tzzpm › cloudwatch-agent
cloudwatch-agent-tzzpm cloudwatch-agent 2021/07/26 20:20:34 I! 2021/07/26 20:20:31 E! ec2metadata is not available
cloudwatch-agent-tzzpm cloudwatch-agent 2021/07/26 20:20:31 I! attempt to access ECS task metadata to determine whether I'm running in ECS.
cloudwatch-agent-tzzpm cloudwatch-agent 2021/07/26 20:20:32 W! retry [0/3], unable to get http response from http://169.254.170.2/v2/metadata, error: unable to get response from http://169.254.170.2/v2/metadata, error: Get "http://169.254.170.2/v2/metadata": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
cloudwatch-agent-tzzpm cloudwatch-agent 2021/07/26 20:20:33 W! retry [1/3], unable to get http response from http://169.254.170.2/v2/metadata, error: unable to get response from http://169.254.170.2/v2/metadata, error: Get "http://169.254.170.2/v2/metadata": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
cloudwatch-agent-tzzpm cloudwatch-agent 2021/07/26 20:20:34 W! retry [2/3], unable to get http response from http://169.254.170.2/v2/metadata, error: unable to get response from http://169.254.170.2/v2/metadata, error: Get "http://169.254.170.2/v2/metadata": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
cloudwatch-agent-tzzpm cloudwatch-agent 2021/07/26 20:20:34 I! access ECS task metadata fail with response unable to get response from http://169.254.170.2/v2/metadata, error: Get "http://169.254.170.2/v2/metadata": context deadline exceeded (Client.Timeout exceeded while awaiting headers), assuming I'm not running in ECS.
cloudwatch-agent-tzzpm cloudwatch-agent I! Detected the instance is OnPrem
cloudwatch-agent-tzzpm cloudwatch-agent 2021/07/26 20:20:34 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/bin/default_linux_config.json ...
cloudwatch-agent-tzzpm cloudwatch-agent /opt/aws/amazon-cloudwatch-agent/bin/default_linux_config.json does not exist or cannot read. Skipping it.
cloudwatch-agent-tzzpm cloudwatch-agent 2021/07/26 20:20:34 Reading json config file path: /etc/cwagentconfig/..2021_06_18_17_23_34.900788054/cwagentconfig.json ...
cloudwatch-agent-tzzpm cloudwatch-agent 2021/07/26 20:20:34 Find symbolic link /etc/cwagentconfig/..data
cloudwatch-agent-tzzpm cloudwatch-agent 2021/07/26 20:20:34 Find symbolic link /etc/cwagentconfig/cwagentconfig.json
cloudwatch-agent-tzzpm cloudwatch-agent 2021/07/26 20:20:34 Reading json config file path: /etc/cwagentconfig/cwagentconfig.json ...
cloudwatch-agent-tzzpm cloudwatch-agent Valid Json input schema.
cloudwatch-agent-tzzpm cloudwatch-agent Got Home directory: /root
cloudwatch-agent-tzzpm cloudwatch-agent No csm configuration found.
cloudwatch-agent-tzzpm cloudwatch-agent No metric configuration found.
cloudwatch-agent-tzzpm cloudwatch-agent Configuration validation first phase succeeded
cloudwatch-agent-tzzpm cloudwatch-agent
cloudwatch-agent-tzzpm cloudwatch-agent 2021/07/26 20:20:34 I! Config has been translated into TOML /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml
cloudwatch-agent-tzzpm cloudwatch-agent 2021-07-26T20:20:34Z I! Starting AmazonCloudWatchAgent 1.247347.6
cloudwatch-agent-tzzpm cloudwatch-agent 2021-07-26T20:20:34Z I! Loaded inputs: k8sapiserver cadvisor
cloudwatch-agent-tzzpm cloudwatch-agent 2021-07-26T20:20:34Z I! Loaded aggregators:
cloudwatch-agent-tzzpm cloudwatch-agent 2021-07-26T20:20:34Z I! Loaded processors: ec2tagger k8sdecorator
cloudwatch-agent-tzzpm cloudwatch-agent 2021-07-26T20:20:34Z I! Loaded outputs: cloudwatchlogs
cloudwatch-agent-tzzpm cloudwatch-agent 2021-07-26T20:20:34Z I! Tags enabled:
cloudwatch-agent-tzzpm cloudwatch-agent 2021-07-26T20:20:34Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"ip-10-0-2-81.us-east-2.compute.internal", Flush Interval:1s
cloudwatch-agent-tzzpm cloudwatch-agent 2021-07-26T20:20:34Z I! [logagent] starting
cloudwatch-agent-tzzpm cloudwatch-agent 2021-07-26T20:20:34Z I! [logagent] found plugin cloudwatchlogs is a log backend
cloudwatch-agent-tzzpm cloudwatch-agent 2021-07-26T20:24:35Z E! [processors.ec2tagger] ec2tagger: Unable to retrieve InstanceId. This plugin must only be used on an EC2 instance
cloudwatch-agent-tzzpm cloudwatch-agent 2021-07-26T20:24:35Z E! [telegraf] Error running agent: could not initialize processor ec2tagger: ec2tagger: Unable to retrieve InstanceId. This plugin must only be used on an EC2 instance

We noticed the lines about failing to connect to IMDS, followed by "Detected the instance is OnPrem". After some searching we found this issue: aws-samples/amazon-cloudwatch-container-insights#56, so we edited the daemonset config to add RUN_IN_AWS=True. The logs did change to indicate it picked up the option, but we are still getting failures:

+ cloudwatch-agent-9w9zh › cloudwatch-agent
cloudwatch-agent-r48hd cloudwatch-agent 2021-07-26T20:52:17Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started initialization.
cloudwatch-agent-r48hd cloudwatch-agent 2021-07-26T20:52:17Z I! k8sapiserver Switch New Leader: ip-10-0-2-62.us-east-2.compute.internal
cloudwatch-agent-9w9zh cloudwatch-agent 2021/07/26 20:52:21 I! I! Detected from ENV instance is EC2
cloudwatch-agent-9w9zh cloudwatch-agent 2021/07/26 20:52:14 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/bin/default_linux_config.json ...
cloudwatch-agent-9w9zh cloudwatch-agent /opt/aws/amazon-cloudwatch-agent/bin/default_linux_config.json does not exist or cannot read. Skipping it.
cloudwatch-agent-9w9zh cloudwatch-agent 2021/07/26 20:52:14 Reading json config file path: /etc/cwagentconfig/..2021_07_26_20_48_04.766796875/cwagentconfig.json ...
cloudwatch-agent-9w9zh cloudwatch-agent 2021/07/26 20:52:14 Find symbolic link /etc/cwagentconfig/..data
cloudwatch-agent-9w9zh cloudwatch-agent 2021/07/26 20:52:14 Find symbolic link /etc/cwagentconfig/cwagentconfig.json
cloudwatch-agent-9w9zh cloudwatch-agent 2021/07/26 20:52:14 Reading json config file path: /etc/cwagentconfig/cwagentconfig.json ...
cloudwatch-agent-9w9zh cloudwatch-agent Valid Json input schema.
cloudwatch-agent-9w9zh cloudwatch-agent 2021/07/26 20:52:18 E! ec2metadata is not available
cloudwatch-agent-9w9zh cloudwatch-agent 2021/07/26 20:52:18 I! attempt to access ECS task metadata to determine whether I'm running in ECS.
cloudwatch-agent-9w9zh cloudwatch-agent 2021/07/26 20:52:19 W! retry [0/3], unable to get http response from http://169.254.170.2/v2/metadata, error: unable to get response from http://169.254.170.2/v2/metadata, error: Get "http://169.254.170.2/v2/metadata": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
cloudwatch-agent-9w9zh cloudwatch-agent 2021/07/26 20:52:20 W! retry [1/3], unable to get http response from http://169.254.170.2/v2/metadata, error: unable to get response from http://169.254.170.2/v2/metadata, error: Get "http://169.254.170.2/v2/metadata": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
cloudwatch-agent-9w9zh cloudwatch-agent 2021/07/26 20:52:21 W! retry [2/3], unable to get http response from http://169.254.170.2/v2/metadata, error: unable to get response from http://169.254.170.2/v2/metadata, error: Get "http://169.254.170.2/v2/metadata": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
cloudwatch-agent-9w9zh cloudwatch-agent 2021/07/26 20:52:21 I! access ECS task metadata fail with response unable to get response from http://169.254.170.2/v2/metadata, error: Get "http://169.254.170.2/v2/metadata": context deadline exceeded (Client.Timeout exceeded while awaiting headers), assuming I'm not running in ECS.
cloudwatch-agent-9w9zh cloudwatch-agent No csm configuration found.
cloudwatch-agent-9w9zh cloudwatch-agent No metric configuration found.
cloudwatch-agent-9w9zh cloudwatch-agent Configuration validation first phase succeeded
cloudwatch-agent-9w9zh cloudwatch-agent
cloudwatch-agent-9w9zh cloudwatch-agent 2021/07/26 20:52:21 I! Config has been translated into TOML /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml
cloudwatch-agent-9w9zh cloudwatch-agent 2021-07-26T20:52:21Z I! Starting AmazonCloudWatchAgent 1.247347.6
cloudwatch-agent-9w9zh cloudwatch-agent 2021-07-26T20:52:21Z I! Loaded inputs: cadvisor k8sapiserver
cloudwatch-agent-9w9zh cloudwatch-agent 2021-07-26T20:52:21Z I! Loaded aggregators:
cloudwatch-agent-9w9zh cloudwatch-agent 2021-07-26T20:52:21Z I! Loaded processors: ec2tagger k8sdecorator
cloudwatch-agent-9w9zh cloudwatch-agent 2021-07-26T20:52:21Z I! Loaded outputs: cloudwatchlogs
cloudwatch-agent-9w9zh cloudwatch-agent 2021-07-26T20:52:21Z I! Tags enabled:
cloudwatch-agent-9w9zh cloudwatch-agent 2021-07-26T20:52:21Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"ip-10-0-2-81.us-east-2.compute.internal", Flush Interval:1s
cloudwatch-agent-9w9zh cloudwatch-agent 2021-07-26T20:52:21Z I! [logagent] starting
cloudwatch-agent-9w9zh cloudwatch-agent 2021-07-26T20:52:21Z I! [logagent] found plugin cloudwatchlogs is a log backend
cloudwatch-agent-qp4sq cloudwatch-agent 2021-07-26T20:52:26Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started initialization.
cloudwatch-agent-qp4sq cloudwatch-agent 2021-07-26T20:52:26Z I! k8sapiserver Switch New Leader: ip-10-0-2-62.us-east-2.compute.internal
cloudwatch-agent-qp4sq cloudwatch-agent 2021-07-26T20:52:26Z I! k8sapiserver OnStartedLeading: ip-10-0-2-62.us-east-2.compute.internal
cloudwatch-agent-r48hd cloudwatch-agent 2021-07-26T20:56:18Z I! [processors.ec2tagger] ec2tagger: Initial retrieval of tags succeded
cloudwatch-agent-r48hd cloudwatch-agent 2021-07-26T20:56:18Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started, finished initial retrieval of tags and Volumes
cloudwatch-agent-9w9zh cloudwatch-agent 2021-07-26T20:56:21Z E! [processors.ec2tagger] ec2tagger: Unable to retrieve InstanceId. This plugin must only be used on an EC2 instance
cloudwatch-agent-9w9zh cloudwatch-agent 2021-07-26T20:56:21Z E! [telegraf] Error running agent: could not initialize processor ec2tagger: ec2tagger: Unable to retrieve InstanceId. This plugin must only be used on an EC2 instance
- cloudwatch-agent-9w9zh › cloudwatch-agent

We appreciate any insights. This work is being done in AWS Region us-east-2, and we are seeing it in two different accounts where we enabled IMDS w/ auth.

@nielsvbrecht

Running into the same issue. Have not really been able to find any other guidance or leads on how to solve this.

@JohnPolansky
Author

Apologies guys, I actually opened this ticket with AWS Support - EKS team and got some GREAT help.. I've copied their commands here for others in the hope it helps.

Issue: Cloudwatch -> Insights wasn't showing performance metrics for pods in the node-group with IMDSv2

Fix: Adjust the hop limit to 2. With EKS the container counts as a hop, so a limit of 1 will not work; you must have 2.
"In a container environment, if the hop limit is 1, the IMDSv2 response does not return because going to the container is considered an additional network hop. To avoid the process of falling back to IMDSv1 and the resultant delay, in a container environment we recommend that you set the hop limit to 2". [1]
[1] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html#imds-considerations
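
For anyone who wants to check this from inside a pod, here is a minimal Python sketch of the two-step IMDSv2 flow (PUT for a session token, then GET with that token). The function names are hypothetical and only the standard library is used; with a hop limit of 1 the token request from a container would be expected to time out rather than return, per the quote above.

```python
import urllib.request

IMDS_BASE = "http://169.254.169.254"  # link-local IMDS address


def build_token_request(ttl_seconds: int = 21600) -> urllib.request.Request:
    """Build the IMDSv2 PUT request that asks for a session token."""
    return urllib.request.Request(
        f"{IMDS_BASE}/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def build_metadata_request(token: str, path: str = "instance-id") -> urllib.request.Request:
    """Build a metadata GET request that presents the session token."""
    return urllib.request.Request(
        f"{IMDS_BASE}/latest/meta-data/{path}",
        headers={"X-aws-ec2-metadata-token": token},
    )


def fetch_instance_id(timeout: float = 2.0) -> str:
    """Run both steps; raises (usually a timeout) if IMDS is unreachable."""
    with urllib.request.urlopen(build_token_request(), timeout=timeout) as resp:
        token = resp.read().decode()
    with urllib.request.urlopen(build_metadata_request(token), timeout=timeout) as resp:
        return resp.read().decode()
```

Calling fetch_instance_id() from a pod on the affected node group is one way to confirm whether the hop limit change took effect.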

Once I updated the hop limit to 2, my performance metrics immediately started showing up in Insights, BUT the CloudWatch logs were still not showing.

Issue: CloudWatch -> Log groups showed no node-group or application logs from node groups with IMDSv2
Fix: Yes, it works, but I'm sorry, it's a bit messy.

I have dived deeper into this issue and there are some answers that I want to share.

Let me say that to set up Container Insights on Amazon EKS we need to execute 3 main steps [1]:
1. Verify requirements - Permissions
2. Set up CloudWatch agent - send metrics log
   a. /performance
3. Set up Fluent Bit - send other logs
   a. /application
   b. /dataplane
   c. /host
About item #1, one of the requirements is to grant IAM permissions to enable CloudWatch agent and Fluent Bit to work with Amazon CloudWatch. There are 2 ways to grant these permissions [2]:
• relying on worker node instance role, so pods get the authentication token from IMDS (I understand this is the solution implemented in your environment)
• creating a specific IAM role associated with both cloudwatch-agent and fluent-bit service accounts.

As you explained, after setting the http-put-response-hop-limit=2, you started to see new logs under /performance, which indicates that CloudWatch agent is working OK. Also the logs reflect the same.
On the other hand, you still don't get the rest of the logs pushed to CloudWatch, which means that Fluent Bit is not working as expected. The Fluent Bit logs should reflect this issue; it would be worth verifying.

Fluent Bit has a bug when it relies on IMDSv2 to get authentication tokens [3], so relying on the worker node instance role won't be useful for now.

AWS provides another way to obtain authentication tokens, as mentioned before, using IAM role and service accounts. For this configuration, you will need to do 3 main steps: 
1. Create an IAM role that can be assumed by your cluster OIDC 
2. Attach CloudWatchAgentServerPolicy to this role
3. Associate this role with both cloudwatch-agent and fluent-bit service accounts.
The steps above are intended just for a high level reference. For detailed steps and how to implement this configuration, please refer to [2].
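
As a rough illustration of these three steps, here is a Python sketch that builds the trust policy such a role would carry and the annotation that ties a service account to it. Everything passed in (account ID, OIDC provider string, role name, namespace, service account names) is a placeholder or an assumption rather than a value from this thread, and the ARNs assume the standard "aws" partition; [2] remains the reference for the real procedure.

```python
import json

# AWS managed policy named in step 2.
CW_AGENT_POLICY_ARN = "arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"


def irsa_trust_policy(account_id, oidc_provider, namespace, service_accounts):
    """Trust policy letting the listed service accounts assume the role via OIDC."""
    provider_arn = f"arn:aws:iam::{account_id}:oidc-provider/{oidc_provider}"
    subjects = [f"system:serviceaccount:{namespace}:{sa}" for sa in service_accounts]
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Federated": provider_arn},
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {"StringEquals": {
                f"{oidc_provider}:aud": "sts.amazonaws.com",
                # A list of values matches if any one of them is equal.
                f"{oidc_provider}:sub": subjects,
            }},
        }],
    }


def service_account_annotation(account_id, role_name):
    """Annotation to place on each service account (step 3)."""
    return {"eks.amazonaws.com/role-arn": f"arn:aws:iam::{account_id}:role/{role_name}"}


def policy_json(policy):
    """Serialize the policy for use with the IAM console, CLI, or Terraform."""
    return json.dumps(policy, indent=2)
```

The output of policy_json() would be used when creating the role, CW_AGENT_POLICY_ARN attached to it, and the annotation added to both service accounts.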

Still about Fluent-Bit, once you have it up and running, you may notice that even though it sends logs to Cloudwatch, the pod shows errors about not being able to retrieve ec2 metadata from IMDS. This is because Fluent-Bit tries to enrich the logs with EC2 metadata before sending to Cloudwatch, so we need to update the IMDS version in Fluent-Bit configuration [4]. 
1. Edit config map fluent-bit-config
2. Find parameter imds_version under sections [FILTER] (it appears multiple times) 
3. Update value to v2
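
If editing the config map by hand feels error-prone, the same change can be sketched as a small text rewrite. This Python helper is hypothetical (not part of Fluent Bit or kubectl) and assumes the classic key/value config format, where each [FILTER] section has a line such as "imds_version v1"; it would be run on the exported config text before re-applying it.

```python
import re

# Matches one "imds_version <value>" line, keeping indentation and spacing.
_IMDS_LINE = re.compile(r"^([ \t]*imds_version[ \t]+)\S+[ \t]*$", re.MULTILINE)


def set_imds_version(config_text: str, version: str = "v2") -> str:
    """Return config_text with every imds_version value replaced by `version`."""
    return _IMDS_LINE.sub(lambda m: m.group(1) + version, config_text)
```

Because it replaces every occurrence, it covers the case noted above where the parameter appears under multiple [FILTER] sections.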


In summary, these are the changes we have spoken about to have Container Insights running in your EKS environment with IMDSv2 required:
1. Set http-put-response-hop-limit=2
2. Set up IAM roles for Service Accounts for both cloudwatch-agent and fluent-bit
3. Update IMDS version to v2 on fluent-bit-config config map.
Additionally, you can detach CloudWatchAgentServerPolicy from worker node instance role, further improving security.
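
For summary item 1 on an instance that is already running, the hop limit can be inspected and changed with the AWS CLI. Here is a hedged Python sketch that only assembles the commands; the helper names and the instance ID are hypothetical, and run() assumes the AWS CLI is installed with credentials allowed to call these EC2 APIs.

```python
import subprocess
from typing import List


def describe_metadata_options_cmd(instance_id: str) -> List[str]:
    """AWS CLI argv that prints the instance's current IMDS settings."""
    return [
        "aws", "ec2", "describe-instances",
        "--instance-ids", instance_id,
        "--query", "Reservations[].Instances[].MetadataOptions",
        "--output", "json",
    ]


def set_hop_limit_cmd(instance_id: str, hop_limit: int = 2) -> List[str]:
    """AWS CLI argv that keeps IMDSv2 required and sets the PUT response hop limit."""
    if not 1 <= hop_limit <= 64:
        raise ValueError("hop limit must be between 1 and 64")
    return [
        "aws", "ec2", "modify-instance-metadata-options",
        "--instance-id", instance_id,
        "--http-tokens", "required",
        "--http-endpoint", "enabled",
        "--http-put-response-hop-limit", str(hop_limit),
    ]


def run(cmd: List[str]) -> str:
    """Execute a command and return its stdout."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
```

This only affects the one instance; nodes launched later take their settings from the launch template, so the Terraform value would need to change as well.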

[1] https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/deploy-container-insights-EKS.html
[2] https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-prerequisites.html
[3] https://github.com/fluent/fluent-bit/issues/2840
[4] https://docs.fluentbit.io/manual/pipeline/filters/aws-metadata

In the end I created the role, attached the policy, created the service account via a kubectl manifest, and then updated the aws-cw-fluent-bit configmap to set imds_version=v2. Presto: as soon as it was all applied, my logs showed up within minutes, solving the issue. I really hope this helps others.

@JohnPolansky changed the title from "cloudwatch-agent doesn't appear to be working with IMDB w/ auth enabled" to "cloudwatch-agent doesn't appear to be working with IMDSv2 w/ auth enabled" on Aug 2, 2021
@JohnPolansky
Author

Closing this ticket as the steps provided solved my issue.. Thanks!

sky333999 pushed a commit that referenced this issue Jul 21, 2023
Merging on behalf of Seth
@VedanshToTheNew

How can we check the hop limit and then change it for an IMDSv2 EC2 server?
