This repository has been archived by the owner on Oct 11, 2021. It is now read-only.

Error - Connection Refused on clean install via CloudFormation. #142

Closed
cgddrd opened this issue Nov 20, 2019 · 5 comments

cgddrd commented Nov 20, 2019

Every time I run the CF stack, the web server seems really ropey. Sometimes I can connect to the Web UI, but other times I simply get a 'Connection refused' error. I'm certain it's not my VPC/subnet/SG configuration (and besides, it's using the base config provided by the stack anyway).

This is really frustrating as I'd love to use this stack. Any help greatly appreciated. Thanks.

Owner

villasv commented Nov 21, 2019

By “sometimes”, do you mean different attempts on the same stack (the webserver only responds some of the time) or different deployments (some stacks come up with a completely unresponsive webserver)?

I’ve never experienced this, but the fact that it’s intermittent indicates that it might be a resource load problem. Maybe you could try choosing a bigger instance type for the webserver, like a t3.medium?

Contributor

amizzo87 commented Feb 19, 2020

I'm also experiencing this issue. Clean (successful) build from CloudFormation, but I can't access the web server. After SSH'ing into the machine, it seems the Airflow service is crashing, but I can't find where the log files are to debug why.

Owner

villasv commented Feb 20, 2020

Maybe related to this: #149

I've indeed verified that the race condition occurs "sometimes" (once in a few new deployments), and when it does the services won't start on the scheduler or the webserver.

If you're interested in debugging the initial deployment process done by CloudFormation, you can peek at the log files in `/var/log/cfn*`.
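As a rough sketch of that log check, the helper below prints the tail of every file matching `cfn*` in a directory. The function name `show_cfn_logs` and its two optional arguments are made up for illustration; only the `/var/log/cfn*` location comes from the comment above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the last lines of each CloudFormation
# bootstrap log. Arguments (both optional): log directory, line count.
show_cfn_logs() {
  local log_dir="${1:-/var/log}" lines="${2:-50}" found=0
  for f in "$log_dir"/cfn*; do
    [ -f "$f" ] || continue          # skip an unmatched glob or directories
    found=1
    printf '==> %s <==\n' "$f"
    tail -n "$lines" "$f"
  done
  if [ "$found" -ne 1 ]; then
    echo "no cfn logs found in $log_dir" >&2
    return 1
  fi
}
```

On the instance, calling `show_cfn_logs` with no arguments reads from `/var/log`. The files may only be readable by root, in which case the function needs to run from a root shell.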

Author

cgddrd commented Feb 20, 2020

Hi @villasv - I've been doing a lot of work recently with your turbine stack - which I must say is really excellent!

I believe I've found the cause of the symptoms you describe in #149, and will soon be raising a PR with suggested fixes. There are two key factors I think are at play here:

  1. The launch config for the three tiers (webserver, scheduler and workers) should use `cfn-signal` so that CloudFormation knows when the User Data script has finished running (this is the practice AWS recommends).

  2. In my own testing (of which there's been a lot recently!), I found that `cfn-init` would sometimes fail at random with the message `Unknown error retrieving SharedCloudInitMetadata.`, leaving Airflow incorrectly installed. I never got to the bottom of why this happened (even after looking through all the logs, `cfn-init.log` and `user-data.log`), but explicitly including `--configsets default` in the call to `cfn-init` appears to have stopped it. After at least 15 stack re-creations I've not been able to reproduce the issue, whereas it used to happen every couple of attempts.
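The two changes above could look roughly like this in a User Data script. This is only a sketch, not the contents of the eventual PR: the function name `bootstrap`, the `CFN_BIN` variable and all argument values are placeholders, and `/opt/aws/bin` is assumed as the helper-script location (the usual one on Amazon Linux).

```shell
#!/usr/bin/env bash
# Assumed install location of the CloudFormation helper scripts.
CFN_BIN="${CFN_BIN:-/opt/aws/bin}"

# Hypothetical wrapper. Arguments: stack name, resource holding the
# cfn-init metadata, resource that waits for the signal, region.
bootstrap() {
  local stack="$1" init_resource="$2" signal_resource="$3" region="$4"
  local status=0

  # Name the config set explicitly rather than relying on the implicit default.
  "$CFN_BIN/cfn-init" -v \
    --stack "$stack" --resource "$init_resource" --region "$region" \
    --configsets default || status=$?

  # Pass the cfn-init exit code on, so a failed install is reported to
  # CloudFormation instead of leaving a half-configured instance running.
  "$CFN_BIN/cfn-signal" -e "$status" \
    --stack "$stack" --resource "$signal_resource" --region "$region"

  return "$status"
}
```

A call might look like `bootstrap my-stack LaunchConfiguration AutoScalingGroup eu-west-1`, where every value is invented. The signalled resource needs a `CreationPolicy` for the signal to have any effect, and with an Auto Scaling group that resource is the group itself rather than the launch configuration, which is why the two resource names are separate arguments.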

I think this issue is related: the Airflow webserver service has not been correctly installed/configured, so the service fails to start, leading to the 'Connection Refused' error.

I'm going to close this issue as I don't believe it is a separate problem; it is instead linked to what I've just described (which I'll raise as its own PR).

Thanks.

cgddrd closed this as completed Feb 20, 2020

Author

cgddrd commented Feb 20, 2020

P.S. I've also added support for CloudWatch log monitoring of key installation log files (cfn-init.log, user-data.log), somewhat related to #124, which I'll raise in a second PR.
