Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for containers running on AWS ECS #65

Merged
merged 4 commits into from
Jul 18, 2023
Merged

Conversation

karls
Copy link
Collaborator

@karls karls commented Jul 12, 2023

No description provided.

@@ -71,6 +75,18 @@ def for_render(cls, env: Mapping):
api_base_url = f"https://adapter.judoscale.com/api/{service_id}"
return cls(runtime_container, api_base_url, env)

@classmethod
def for_ecs(cls, env: Mapping):
container_metadata = requests.get(env["ECS_CONTAINER_METADATA_URI_V4"]).json()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this endpoint have any authentication?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't find any documentation about this endpoint being auth-protected. This endpoint is a private IP supplied automatically by ECS.

There are some differences in what data is returned between running ECS-on-Fargate running ECS-on-EC2. I suspect that most folks migrating from Heroku will be on Fargate, though. But I have no data to back this up, just a gut feel.


# All ECS containers have to be manually assigned as web instances,
# because ECS does not distinguish between web and worker instances.
runtime_container = RuntimeContainer(service_name, instance, "web")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the only place we use this is here: https://github.com/judoscale/judoscale-python/blob/aws-ecs-support/judoscale/core/metrics_collectors.py#L33-L34

Do you agree?

And I'm really confused about its purpose there. We would only call should_collect if we're trying to add a web metric, would would only happen on a web instance.

Does any of this ring a bell for you? I feel like I might need to dig a bit deeper on this logic.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% sure I agree. What is confusing about its purpose? The Ruby adapter seems to have an identical implementation.

For context, the actual app instance (Flask, Django, etc) is created on dyno/container types other than web, too. E.g in the Flask RQ sample app, the app instance initialises the request middleware (judoscale.init_app(app)) as well as the RQ extension (judoscale_rq(app.redis)) — https://github.com/judoscale/judoscale-python/blob/aws-ecs-support/sample-apps/flask_rq_sample/app/app.py#L26-L27.

The run-worker.py script creates an app instance in order to run the worker, and would be running both the web and queue metrics collectors.

A "web" dyno/container would report metrics for both web request latency and queue latency. But a "worker" dyno/container would only report queue latency.

But ECS as a platform does not distinguish between "web" and "worker" services — it does not assign any special names to services and tasks. I hardcoded "web" so that both request and queue latency is reported on all instances. Perhaps it'd be clearer if the last argument to the RuntimeContainer constructor was optional and tweak .is_web_instance to handle this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Ruby adapter seems to have an identical implementation.

I agree, and I'm equally confused in the Ruby adapter. I'm not questioning anything in this PR—it's more of a "why are we doing things this way in general" kind of question. And I know I'm the one who put that logic in there to begin with, so I don't necessarily expect you to have the answer.

A "web" dyno/container would report metrics for both web request latency and queue latency. But a "worker" dyno/container would only report queue latency.

This is the piece I'm confused about. We have "web" containers skip the reporting for web request latency, but if it's not a web container, there won't be any web requests to report anyway. So why have that logic at all?

In other words, would anything behave differently if we removed the is_web_instance method (and its counterpart in the Ruby adapter)? And if we removed that, we'd have no need to differentiate "web" vs "worker" containers. We could remove the service_type property from RuntimeContainer and simplify a bit of our code.

That would be for a separate PR, and I apologize for the digression here. It just felt relevant since we're having to make arbitrary decisions about the service type here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have "web" containers skip the reporting for web request latency, but if it's not a web container, there won't be any web requests to report anyway. So why have that logic at all?

Did you mean "worker" instead of "web" here?

I think the logic exists to minimise sending redundant metrics and decrease unnecessary noise. This Python issue describes the logic and reasoning in more detail https://github.com/judoscale/judoscale-python-issues/issues/1.

I think we could simplify this code for sure. The "cost" is making redundant requests to the Judoscale adapter API. If that's acceptable then I think we could remove .is_web_instance and have metrics reported on all instance types, provided that the adapter is enabled (there is an endpoint to POST metrics to).

Copy link
Contributor

@adamlogic adamlogic Jul 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "cost" is making redundant requests to the Judoscale adapter API.

That was the intention, but I don't think it achieves that goal in practice. A worker container won't receive web requests, so it won't have web metrics to report regardless of the is_web_instance logic. (Sorry I've been struggling to put this into coherent words.)

I'll quit beating this horse in this particular PR and open up a new issue for it. (#66)

Copy link
Contributor

@adamlogic adamlogic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is good to merge, but I want to have a full working end-to-end PoC before we cut a release.

Copy link
Contributor

@adamlogic adamlogic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I dove into the Ruby adapter today, I realized that we don't need to fetch the metadata. The container ID (instance) is part of the URL, and we don't actually care about the service_name.

For the Ruby adapter, I actually removed the service_name from our RuntimeContainer class. It's now just a string wrapper with a single method for redundant instances. Note that I also removed the web/worker service_type distinction in a previous PR.

The PR for adding ECS support to the Ruby adapter ended up being super tiny after those changes.

Does this make sense?

@karls karls changed the base branch from main to simplify-runtime-container July 17, 2023 12:37
@@ -55,6 +57,13 @@ def for_render(cls, env: Mapping):
api_base_url = f"https://adapter.judoscale.com/api/{service_id}"
return cls(runtime_container, api_base_url, env)

@classmethod
def for_ecs(cls, env: Mapping):
instance = env["ECS_CONTAINER_METADATA_URI_V4"].split("/")[-1]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the Ruby adapter, I used ECS_CONTAINER_METADATA_URI instead of ECS_CONTAINER_METADATA_URI_V4, thinking that the former would be more reliable. The docs say the latter env var is inject "beginning with Fargate platform version 1.4.0", so I'm concerned that it'll be missing for some customers.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good shout! 👍🏻

Copy link
Contributor

@adamlogic adamlogic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

Base automatically changed from simplify-runtime-container to main July 18, 2023 10:59
@karls karls merged commit 3a6a356 into main Jul 18, 2023
4 checks passed
@karls karls deleted the aws-ecs-support branch July 18, 2023 11:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants