codedeploy-agent.service is made executable and causes systemd to error #185

ilyanoskov · 2020-06-23T20:40:04Z

Hello, the scheduler has failed and then it keeps trying to get recreated without success:

[   22.583701] cloud-init[2662]: Cloud-init v. 19.3-2.amzn2 running 'modules:final' at Tue, 23 Jun 2020 20:16:58 +0000. Up 22.49 seconds.
[   22.598211] cloud-init[2662]: + /opt/aws/bin/cfn-init -v --region eu-central-1 --stack airflow-stack-TurbineCluster-KYVRTM1L34GC-SchedulerStack-4B6CDKN0L3OL --resource LaunchConfiguration
[  111.979026] systemd-sysv-generator[3400]: Configuration file /usr/lib/systemd/system/codedeploy-agent.service is marked executable. Please remove executable permission bits. Proceeding anyway.
[  112.056604] systemd-sysv-generator[3420]: Configuration file /usr/lib/systemd/system/codedeploy-agent.service is marked executable. Please remove executable permission bits. Proceeding anyway.
[  113.121328] cloud-init[2662]: + /opt/aws/bin/cfn-signal -e 0
[  113.383730] cloud-init[2662]: Error: No WaitConditionHandle URL or logical resource id specified
[  113.389476] cloud-init[2662]: Usage: cfn-signal [options] [WaitConditionHandle URL]
[  113.393242] cloud-init[2662]: Options:
[  113.395372] cloud-init[2662]: -h, --help            show this help message and exit
[  113.398876] cloud-init[2662]: -s SUCCESS, --success=SUCCESS
[  113.401478] cloud-init[2662]: If true, signal success to CloudFormation; if false,
[  113.412297] cloud-init[2662]: signal failure. Default: true
[  113.414889] cloud-init[2662]: -i ID, --id=ID        A unique ID to send with the signal
[  113.418303] cloud-init[2662]: -e EXIT_CODE, --exit-code=EXIT_CODE
[  113.421459] cloud-init[2662]: Derive success or failure from specified exit code
[  113.432048] cloud-init[2662]: AWS Credentials:
[  113.434343] cloud-init[2662]: Options for specifying AWS Account Credentials.
[  113.437537] cloud-init[2662]: -f CREDENTIAL_FILE, --credential-file=CREDENTIAL_FILE
[  113.440815] cloud-init[2662]: A credential file, readable only by the owner, with
[  113.452722] cloud-init[2662]: keys 'AWSAccessKeyId' and 'AWSSecretKey'
[  113.455761] cloud-init[2662]: --role=IAM_ROLE     An IAM Role
[  113.458485] cloud-init[2662]: --access-key=ACCESS_KEY
[  113.460655] cloud-init[2662]: An AWS Access Key
[  113.462848] cloud-init[2662]: --secret-key=SECRET_KEY
[  113.465099] cloud-init[2662]: An AWS Secret Key
[  113.467541] cloud-init[2662]: Proxy:
[  113.469520] cloud-init[2662]: Options for specifying proxies. Format:
[  113.472558] cloud-init[2662]: [scheme://][user:password@]host:port
[  113.475352] cloud-init[2662]: --http-proxy=HTTP_PROXY
[  113.477669] cloud-init[2662]: A (non-SSL) HTTP proxy
[  113.480200] cloud-init[2662]: --https-proxy=HTTPS_PROXY
ci-info: no authorized ssh keys fingerprints found for user ec2-user.
[  113.485129] cloud-init[2662]: An HTTPS proxy
[  113.487392] cloud-init[2662]: WaitConditionHandle Signal Options:
[  113.490793] cloud-init[2662]: -r REASON, --reason=REASON
[  113.493343] cloud-init[2662]: The reason for success/failure
[  113.496181] cloud-init[2662]: -d DATA, --data=DATA
[  113.498354] cloud-init[2662]: Data to include with the WaitCondition signal
[  113.501489] cloud-init[2662]: Resource Signal Options:
[  113.503868] cloud-init[2662]: --stack=STACK_NAME  A CloudFormation stack
[  113.507285] cloud-init[2662]: --resource=LOGICAL_RESOURCE_ID
[  113.510332] cloud-init[2662]: A CloudFormation logical resource ID
[  113.513647] cloud-init[2662]: --url=ENDPOINT      The CloudFormation service URL. The endpoint URL must
[  113.517877] cloud-init[2662]: match the region option. Use of this parameter is
[  113.521165] cloud-init[2662]: discouraged.
[  113.525759] cloud-init[2662]: --region=REGION     The CloudFormation region. Default: us-east-1.
[  113.530549] cloud-init[2662]: Jun 23 20:18:29 cloud-init[2662]: util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/part-001 [1]
[  113.537307] cloud-init[2662]: Jun 23 20:18:29 cloud-init[2662]: cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
[  113.544112] cloud-init[2662]: Jun 23 20:18:29 cloud-init[2662]: util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/site-packages/cloudinit/config/cc_scripts_user.pyc'>) failed

This repeats over and over. What do you think could be the source of this issue?


villasv · 2020-06-23T21:31:04Z

Looks like /usr/lib/systemd/system/codedeploy-agent.service was made executable, which systemd objects to. Not sure why that would happen; maybe a bug in the setup process, which does mark a few files as executable but shouldn't target this one.

You might fix it by opening a shell and removing the executable flag from this file's permissions, though I'm worried that a few other files might have the same problem.

ilyanoskov · 2020-06-26T13:39:30Z

Okay! This bug is quite serious: my Airflow scheduler failed during the day and the new one could not be provisioned because of this error. I had to manually redeploy the entire cluster to fix this (the new EC2 schedulers were starting up and shutting down immediately).

ilyanoskov · 2020-07-20T15:06:02Z

@villasv this issue persists, just had a scheduler fail with the same error.

I found the source of this issue:
aws/aws-codedeploy-agent#205

Do you think a simple chmod in the setup scripts would fix this issue? What permissions would you suggest for this service?

villasv · 2020-07-21T12:55:59Z

Oh geez, thanks for finding that. I assumed it was a bug in one of this repo's scripts and could never find it.

> what permissions would you suggest for this service?

Maybe just unset the execution part? chmod -x so we don't touch unintended permissions.
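
A minimal sketch of that idea, assuming the unit file path from the log above. The function name strip_exec_bits is made up for illustration, and a-x is used instead of a bare -x so the result does not depend on the umask; read and write bits are left alone.

```shell
# Hypothetical helper (not part of this repo's scripts): clear only the
# execute bits of a file, leaving read/write permissions untouched.
strip_exec_bits() {
  local unit_file="$1"
  # Nothing to do if the file is not there (e.g. the agent is not installed).
  [ -f "$unit_file" ] || return 0
  chmod a-x "$unit_file"
}
```

On an instance it could be called as root with `strip_exec_bits /usr/lib/systemd/system/codedeploy-agent.service`, followed by `systemctl daemon-reload` so that systemd re-reads the unit.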

villasv · 2020-07-21T12:58:25Z

On the other hand, looking through that issue, it seems to be a benign warning that doesn't cause an error, so the failure is probably something else. Now I'm looking at the "Error: No WaitConditionHandle URL or logical resource id specified" message.

The command above is /opt/aws/bin/cfn-signal -e 0 which is really weird. For some reason the condition handle is not being substituted.

EDIT: Ah, I see the error now. Once again there is a missing \ line continuation on the cfn-signal command (pretty sure this happened before, so it must be a regression). But this means the error is irrelevant: it shouldn't stop the scheduler from working, because this command is not essential.

So if your scheduler is failing, it must be due to something else :-/
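
To illustrate the line-continuation point: without a trailing \ the shell ends the command after -e 0, so cfn-signal runs with no stack or resource and prints its usage text, as in the log above. The sketch below shows a correctly continued call. The wrapper name signal_resource and the CFN_SIGNAL override are assumptions for illustration; the flags are the ones listed in the usage output above, and the actual values come from the template, which is not shown in this thread.

```shell
# Hypothetical wrapper: every line except the last ends with a backslash,
# so all flags reach the same cfn-signal invocation.
signal_resource() {
  local exit_code="$1" stack="$2" resource="$3" region="$4"
  # CFN_SIGNAL is an illustrative override; the default is the path from the log.
  "${CFN_SIGNAL:-/opt/aws/bin/cfn-signal}" -e "$exit_code" \
    --stack "$stack" \
    --resource "$resource" \
    --region "$region"
}
```

Dropping any one of those backslashes would make the shell treat the next line as a separate command, which matches the symptom in the log.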

ilyanoskov · 2020-07-21T14:39:56Z

okay @villasv thanks a lot for checking this, I will investigate again once this failure happens.

A simple way to reproduce this issue: just stop the scheduler in the EC2 console. This forces it to shut down and a new one to be provisioned. The new one will then fail to start, causing an infinite loop of provisioning and deprovisioning. I was unable to resolve this issue, so for now I have resorted to deleting and redeploying the entire cluster whenever such a failure happens.

TayyabShahzada1 · 2020-07-29T19:22:41Z

Hi, a colleague and I had the same issue, where both the scheduler and webserver would keep failing and restarting if they were terminated for any reason.
We found the cause to be that an autoscaling-triggered event in the CodeDeploy service did not respect the overwrite flag.
Because the scheduler and webserver start scripts do not check whether a code deployment is pending, they were starting and creating the cfg file in the airflow folder before CodeDeploy had the chance to deploy the code.

This led to CodeDeploy failing, shutting down the instance, and starting a new one.
Using SSM to connect to a starting EC2 instance, I was able to observe this file being created before it should be; perhaps you can confirm it in the same manner.

If so, the fix we implemented was to perform a rm /airflow/* -rf as part of a BeforeInstall hook in CodeDeploy.

villasv · 2020-07-29T19:38:54Z

Ohhh, thank you for the investigation, this is quite promising.

Indeed. Curiously, this is already solved for the workers: they will wait for CodeDeploy (either to deploy or to say there's nothing to deploy)! The original motivation was that I didn't want them to start consuming messages from the Celery broker before they had the DAG files in place.

Now it seems this was relevant for the scheduler and webserver after all. Since there's some code reuse, it shouldn't be hard to fix. Glad to see that you were able to work around it using the existing scripts! Nicely done.

ilyanoskov · 2020-07-29T19:43:23Z

@Tayyab-Dativa could you please share your scripts? I was not able to figure out how to fix this issue, still relying on manual redeploys :( I would really really appreciate your help!

TayyabShahzada1 · 2020-07-30T10:42:55Z

So at your project level, in the scripts directory (the one that contains cdapp_start.sh and cdapp_stop.sh), add a new bash file (we called it before_install.sh) containing the line 'rm /airflow/* -rf'. Then, in appspec.yml, you need to add
```
BeforeInstall:
  - location: scripts/before_install.sh
    runas: root
```
as one of the hooks, so that CodeDeploy knows to call this script before installing.
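
A hedged sketch of what such a before_install.sh could contain. It keeps the same effect as rm /airflow/* -rf (non-hidden entries are removed, the directory itself stays) but adds two guards that are not in the original one-liner: it refuses an empty path or /, and it does nothing if the directory is missing. The function name clean_airflow_dir is hypothetical.

```shell
#!/bin/bash
# Hypothetical body for scripts/before_install.sh.
clean_airflow_dir() {
  local dir="$1"
  # Guard (an addition to the original one-liner): never expand to "/*".
  case "$dir" in
    ""|"/") echo "refusing to clean '$dir'" >&2; return 1 ;;
  esac
  # A fresh instance may not have the directory yet.
  [ -d "$dir" ] || return 0
  # Same effect as `rm /airflow/* -rf`: dotfiles and the directory itself stay.
  rm -rf -- "$dir"/*
}
```

The hook script would then end with a call such as `clean_airflow_dir /airflow`.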

ilyanoskov · 2020-07-30T13:55:25Z

@Tayyab-Dativa excellent, thanks a lot! This is huge, now I can have confidence in my cluster :D

villasv added the bug label Jun 23, 2020

villasv changed the title ~~Scheduler unable to start up in the new availability zone, gets stuck in a recreation loop~~ codedeploy-agent.service is made executable and causes systemd to error Jun 23, 2020
