This repository has been archived by the owner on Oct 11, 2021. It is now read-only.

codedeploy-agent.service is made executable and causes systemd to error #185

Open
ilyanoskov opened this issue Jun 23, 2020 · 11 comments

@ilyanoskov

Hello, the scheduler has failed and then it keeps trying to get recreated without success:

[   22.583701] cloud-init[2662]: Cloud-init v. 19.3-2.amzn2 running 'modules:final' at Tue, 23 Jun 2020 20:16:58 +0000. Up 22.49 seconds.
[   22.598211] cloud-init[2662]: + /opt/aws/bin/cfn-init -v --region eu-central-1 --stack airflow-stack-TurbineCluster-KYVRTM1L34GC-SchedulerStack-4B6CDKN0L3OL --resource LaunchConfiguration
[  111.979026] systemd-sysv-generator[3400]: Configuration file /usr/lib/systemd/system/codedeploy-agent.service is marked executable. Please remove executable permission bits. Proceeding anyway.
[  112.056604] systemd-sysv-generator[3420]: Configuration file /usr/lib/systemd/system/codedeploy-agent.service is marked executable. Please remove executable permission bits. Proceeding anyway.
[  113.121328] cloud-init[2662]: + /opt/aws/bin/cfn-signal -e 0
[  113.383730] cloud-init[2662]: Error: No WaitConditionHandle URL or logical resource id specified
[  113.389476] cloud-init[2662]: Usage: cfn-signal [options] [WaitConditionHandle URL]
[  113.393242] cloud-init[2662]: Options:
[  113.395372] cloud-init[2662]: -h, --help            show this help message and exit
[  113.398876] cloud-init[2662]: -s SUCCESS, --success=SUCCESS
[  113.401478] cloud-init[2662]: If true, signal success to CloudFormation; if false,
[  113.412297] cloud-init[2662]: signal failure. Default: true
[  113.414889] cloud-init[2662]: -i ID, --id=ID        A unique ID to send with the signal
[  113.418303] cloud-init[2662]: -e EXIT_CODE, --exit-code=EXIT_CODE
[  113.421459] cloud-init[2662]: Derive success or failure from specified exit code
[  113.432048] cloud-init[2662]: AWS Credentials:
[  113.434343] cloud-init[2662]: Options for specifying AWS Account Credentials.
[  113.437537] cloud-init[2662]: -f CREDENTIAL_FILE, --credential-file=CREDENTIAL_FILE
[  113.440815] cloud-init[2662]: A credential file, readable only by the owner, with
[  113.452722] cloud-init[2662]: keys 'AWSAccessKeyId' and 'AWSSecretKey'
[  113.455761] cloud-init[2662]: --role=IAM_ROLE     An IAM Role
[  113.458485] cloud-init[2662]: --access-key=ACCESS_KEY
[  113.460655] cloud-init[2662]: An AWS Access Key
[  113.462848] cloud-init[2662]: --secret-key=SECRET_KEY
[  113.465099] cloud-init[2662]: An AWS Secret Key
[  113.467541] cloud-init[2662]: Proxy:
[  113.469520] cloud-init[2662]: Options for specifying proxies. Format:
[  113.472558] cloud-init[2662]: [scheme://][user:password@]host:port
[  113.475352] cloud-init[2662]: --http-proxy=HTTP_PROXY
[  113.477669] cloud-init[2662]: A (non-SSL) HTTP proxy
[  113.480200] cloud-init[2662]: --https-proxy=HTTPS_PROXY
ci-info: no authorized ssh keys fingerprints found for user ec2-user.
[  113.485129] cloud-init[2662]: An HTTPS proxy
[  113.487392] cloud-init[2662]: WaitConditionHandle Signal Options:
[  113.490793] cloud-init[2662]: -r REASON, --reason=REASON
[  113.493343] cloud-init[2662]: The reason for success/failure
[  113.496181] cloud-init[2662]: -d DATA, --data=DATA
[  113.498354] cloud-init[2662]: Data to include with the WaitCondition signal
[  113.501489] cloud-init[2662]: Resource Signal Options:
[  113.503868] cloud-init[2662]: --stack=STACK_NAME  A CloudFormation stack
[  113.507285] cloud-init[2662]: --resource=LOGICAL_RESOURCE_ID
[  113.510332] cloud-init[2662]: A CloudFormation logical resource ID
[  113.513647] cloud-init[2662]: --url=ENDPOINT      The CloudFormation service URL. The endpoint URL must
[  113.517877] cloud-init[2662]: match the region option. Use of this parameter is
[  113.521165] cloud-init[2662]: discouraged.
[  113.525759] cloud-init[2662]: --region=REGION     The CloudFormation region. Default: us-east-1.
[  113.530549] cloud-init[2662]: Jun 23 20:18:29 cloud-init[2662]: util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/part-001 [1]
[  113.537307] cloud-init[2662]: Jun 23 20:18:29 cloud-init[2662]: cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
[  113.544112] cloud-init[2662]: Jun 23 20:18:29 cloud-init[2662]: util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/site-packages/cloudinit/config/cc_scripts_user.pyc'>) failed

This repeats over and over. What do you think could be the source of this issue?

@villasv
Owner

villasv commented Jun 23, 2020

Looks like /usr/lib/systemd/system/codedeploy-agent.service was marked executable, which systemd complains about. Not sure why that would happen; maybe a bug in the setup process, which does mark a few files as executable but shouldn't target this one.

You might be able to fix it by opening a shell on the instance and removing the executable bit from this file's permissions, though I'm worried that a few other files might have the same problem.

@villasv added the bug label Jun 23, 2020
@villasv changed the title from "Scheduler unable to start up in the new availability zone, gets stuck in a recreation loop" to "codedeploy-agent.service is made executable and causes systemd to error" Jun 23, 2020
@ilyanoskov
Author

ilyanoskov commented Jun 26, 2020

Okay! This bug is quite serious: my Airflow scheduler failed during the day and a new one could not be provisioned because of this error. I had to manually redeploy the entire cluster to fix it (the new EC2 scheduler instances were starting up and shutting down immediately).

@ilyanoskov
Author

ilyanoskov commented Jul 20, 2020

@villasv this issue persists, just had a scheduler fail with the same error.

I found the source of this issue:
aws/aws-codedeploy-agent#205

do you think a simple chmod in setup scripts would fix this issue? what permissions would you suggest for this service?

@villasv
Owner

villasv commented Jul 21, 2020

Oh geez, thanks for finding that. I assumed it was a bug in one of this repo's scripts and could never find it.

> what permissions would you suggest for this service?

Maybe just unset the execution part? chmod -x so we don't touch unintended permissions.
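
A minimal sketch of that suggestion, assuming the unit file lives at the path shown in the boot log. The helper name `strip_exec_bits` and the `UNIT_FILE` override are made up for illustration and are not part of this repo's setup scripts. It uses `a-x` rather than a bare `-x` so the result does not depend on the umask, and it leaves the read and write bits untouched.

```shell
#!/bin/sh
# Hypothetical helper (not from the repo): clear only the execute bits on a unit file.
strip_exec_bits() {
    unit_file="$1"
    if [ ! -f "$unit_file" ]; then
        echo "skip: $unit_file not found" >&2
        return 0
    fi
    # a-x removes execute for user, group and other; read/write bits are unchanged.
    chmod a-x "$unit_file"
}

# Path taken from the systemd warning in the log; override UNIT_FILE to try it elsewhere.
strip_exec_bits "${UNIT_FILE:-/usr/lib/systemd/system/codedeploy-agent.service}"
```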

@villasv
Owner

villasv commented Jul 21, 2020

On the other hand, looking through that issue, it seems to be a benign warning that doesn't cause an error. It's probably something else. Now I'm looking at the Error: No WaitConditionHandle URL or logical resource id specified message.

The command right above it is /opt/aws/bin/cfn-signal -e 0, which is really weird. For some reason the stack and resource arguments are not being passed.

EDIT: Ah, I see the error now. Once again there's a missing \ line continuation on the cfn-signal command (pretty sure this happened before, so it must be a regression). But this means that this error is irrelevant: it shouldn't be stopping the scheduler from working, because this command is not essential.
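
To illustrate why a missing `\` produces exactly the bare `cfn-signal -e 0` call seen in the log, here is a small sketch. `count_args` is a stand-in for `cfn-signal`, and the stack and resource names are placeholders, so this is not the repo's actual user-data script.

```shell
#!/bin/sh
# Stand-in for cfn-signal (hypothetical): prints how many arguments it received.
count_args() { echo "$#"; }

# With a trailing backslash on every line, the shell sees ONE command.
with_continuation=$(count_args -e 0 \
    --stack example-stack \
    --resource ExampleResource \
    --region eu-central-1)

# Without the backslash after "-e 0", the command ends at the newline and runs
# with only two arguments. The next line is parsed as a separate command named
# "--stack", which fails ("|| true" keeps the sketch going).
without_continuation=$(
    count_args -e 0
    --stack example-stack --resource ExampleResource 2>/dev/null || true
)

echo "with continuation:    $with_continuation arguments"
echo "without continuation: $without_continuation arguments"
```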

So if your scheduler is failing, it must be due to something else :-/

@ilyanoskov
Author

okay @villasv thanks a lot for checking this, I will investigate again once this failure happens.

Reproducing this issue is simple: just stop the scheduler instance in the EC2 console. This forces it to shut down and a new one to be provisioned. The new one then fails to start, causing an infinite loop of provisioning and deprovisioning. I was unable to resolve this issue, so for now I have resorted to deleting and redeploying the entire cluster whenever such a failure happens.

@TayyabShahzada1

TayyabShahzada1 commented Jul 29, 2020

Hi, a colleague and I had the same issue, where both the scheduler and the webserver would keep failing and restarting if they were terminated for any reason.
We found the cause to be that a deployment triggered by an autoscaling event in the CodeDeploy service did not respect the overwrite flag.
Because the scheduler and webserver start scripts did not check whether a code deployment was pending, they were starting and creating the cfg file in the airflow folder before CodeDeploy had the chance to deploy the code.

This led to the CodeDeploy deployment failing, the instance being shut down, and a new one being started.
Using SSM to connect to a starting EC2 instance, I was able to observe this file being created before it should be; perhaps you can confirm it in the same manner.

If so, the fix we implemented was to run rm /airflow/* -rf as part of a BeforeInstall hook in the CodeDeploy deployment.

@villasv
Owner

villasv commented Jul 29, 2020

Ohhh, thank you for the investigation, this is quite promising.

Indeed. Curiously, this is already solved for the workers: they wait for CodeDeploy (either to deploy or to say there's nothing to deploy)! The original motivation was that I didn't want them to start consuming messages from the Celery broker before they had the DAG files in place.

Now it seems this was relevant for the scheduler and webserver after all. Since there's some code reuse, it shouldn't be hard to apply the same wait there. Glad to see that you were able to work around it using the existing scripts! Nicely done.
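
The ordering described here, where a service only starts once CodeDeploy has finished (or has nothing to deploy), could be sketched roughly as below. This is not the repo's worker script: the marker file, the `wait_for_marker` name, and the default timeout are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical sketch: block until a marker file exists, or give up after a timeout.
# The marker would be written by whatever step confirms the deployment is done.
wait_for_marker() {
    marker="$1"
    timeout_secs="${2:-300}"
    waited=0
    while [ ! -e "$marker" ]; do
        if [ "$waited" -ge "$timeout_secs" ]; then
            echo "timed out waiting for $marker" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

A start script could then gate the service on it, for example `wait_for_marker /tmp/deploy.done 600 && exec airflow scheduler`, where `/tmp/deploy.done` is again a made-up path.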

@ilyanoskov
Author

@Tayyab-Dativa could you please share your scripts? I was not able to figure out how to fix this issue, still relying on manual redeploys :( I would really really appreciate your help!

@TayyabShahzada1

So at your project level, in the scripts directory (the one that contains cdapp_start.sh and cdapp_stop.sh), add a new bash file (we called it before_install.sh) containing 'rm /airflow/* -rf'. Then, in appspec.yml, add
```yaml
BeforeInstall:
  - location: scripts/before_install.sh
    runas: root
```
as one of the hooks. This way CodeDeploy knows to call this script before installing.
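
For reference, the before_install.sh described above could look roughly like this. It is a sketch rather than the commenter's exact file: the `clean_airflow_home` function, the `AIRFLOW_DIR` override, and the guard clauses are additions for illustration; only the `/airflow` path and the `rm` come from the comment. Note that the `*` glob skips dotfiles, same as the original one-liner.

```shell
#!/bin/sh
# before_install.sh (sketch): empty the Airflow folder so CodeDeploy does not
# collide with files the start scripts created before the deployment ran.
clean_airflow_home() {
    target="$1"
    # Nothing to do if the folder does not exist yet.
    [ -d "$target" ] || return 0
    # ${target:?} aborts on an empty value, so this can never expand to "rm -rf /*".
    rm -rf -- "${target:?}"/*
}

# /airflow is the path from the comment; AIRFLOW_DIR is a hypothetical override.
clean_airflow_home "${AIRFLOW_DIR:-/airflow}"
```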

@ilyanoskov
Author

@Tayyab-Dativa excellent, thanks a lot! This is huge, now I can have confidence in my cluster :D
