Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add kuberenetes resource limits & security context configs #1612

Open
wants to merge 35 commits into
base: master
Choose a base branch
from

Conversation

rolf-moz
Copy link

@rolf-moz rolf-moz commented Oct 24, 2023

This PR adds more support for secure computing environments for metaflow in kubernetes and with argo.

• Refactored so that common code is used to create the kubernetes client container object. This fixes and existing TODO in the code.

• Added support for kubernetes security_context field in the metaflow config file or decorators.
Use the 'METAFLOW_KUBERNETES_SECURITY_CONTEXT' field in the config file or 'security_context' (see #1262)

The input is a JSONified version of the YML in the kuberentes specification for security context.

Example from the config file

...
       "METAFLOW_KUBERNETES_SECURITY_CONTEXT": {
        "allowPrivilegeEscalation": false,
    "runAsNonRoot": true,
    "runAsUser": 1000,
    "seccompProfile": {
      "type": "RuntimeDefault"
    }
  }
...

Note that modifications in your docker images may be needed to to support this (such as creating a non-root user etc)

• Added support for kubernetes resource memory and cpu limits field in the metaflow config file or decorators.
This follows the format for memory and cpu requirements for the resource requests field. Examples usage:

... "METAFLOW_KUBERNETES_RESOURCE_LIMITS_MEMORY": 4096, "METAFLOW_KUBERNETES_RESOURCE_LIMITS_CPU": 1, ...

@kubernetes(
    memory=2048,
    cpu=1
    resource_limits_memory=4096,
    resource_limits_cpu=1
)

@roofurmston
Copy link

Just an FYI, but there is a (semi-)related PR for adding support for limits. There is a related issue on it too, which is linked in the PR.

Not sure if the approach in the issue is still the desired approach from the Metaflow team though. Currently waiting for a review on the PR.

@rolf-moz
Copy link
Author

Just an FYI, but there is a (semi-)related PR for adding support for limits. There is a related issue on it too, which is linked in the PR.

Not sure if the approach in the issue is still the desired approach from the Metaflow team though. Currently waiting for a review on the PR.

Thanks. Maybe 'limit_memory' and 'limit_cpu' in the decorator is the cleanest way to specify it

@roofurmston
Copy link

Just an FYI, but there is a (semi-)related PR for adding support for limits. There is a related issue on it too, which is linked in the PR.
Not sure if the approach in the issue is still the desired approach from the Metaflow team though. Currently waiting for a review on the PR.

Thanks. Maybe 'limit_memory' and 'limit_cpu' in the decorator is the cleanest way to specify it

Perhaps, the implementation in the other PR was based on a discussion with the Metaflow team in Slack. (Think there is a link to the Slack thread in the original issue.)

You are not the first person to raise the alternative suggestion. Someone else raised it in the issue too. I've referenced it in the other PR and was waiting to hear back from the Metaflow folk.

@roofurmston
Copy link

roofurmston commented Oct 25, 2023

Just an FYI, but there is a (semi-)related PR for adding support for limits. There is a related issue on it too, which is linked in the PR.
Not sure if the approach in the issue is still the desired approach from the Metaflow team though. Currently waiting for a review on the PR.

Thanks. Maybe 'limit_memory' and 'limit_cpu' in the decorator is the cleanest way to specify it

Perhaps, the implementation in the other PR was based on a discussion with the Metaflow team in Slack. (Think there is a link to the Slack thread in the original issue.)

You are not the first person to raise the alternative suggestion. Someone else raised it in the issue too. I've referenced it in the other PR and was waiting to hear back from the Metaflow folk.

If it is decided to go with the approach in this PR, then I guess Argo Workflow functionality will need to be updated too. Also, I guess ephemeral-storage?

@rolf-moz
Copy link
Author

@roofurmston I've just refactored the code to support argo as well. There was a 'TODO' in the original code to use shared code for the Argo and Kubernetes container creation. I did that by creating a utility function used by both.

@savingoyal savingoyal self-requested a review October 30, 2023 15:49
@rolf-moz
Copy link
Author

@roofurmston checking in on this. Since there is a refactor in here it will be hard to keep this PR open & up-to date for an extended period.

@roofurmston
Copy link

@roofurmston checking in on this. Since there is a refactor in here it will be hard to keep this PR open & up-to date for an extended period.

I'm not a maintainer of Metaflow, I'm afraid. :)

I left the initial comment as I had a related PR. Mine is still waiting to be reviewed.

Maybe it has slipped the radar of the Metaflow folk and it is worth raising with them?

@shrinandj
Copy link
Contributor

For all K8s options, there is a possibility to specify them also on (a) the command line and (b) as args to the @kubernetes decorator.

Will the security context be able to be specified the same way? The user-experience won't be great since it's a json object.

The benefit of having the @kubernetes decorator is that there can be different values for different steps. Especially if there are different docker images used for different steps, a global security context may not work. E.g. a Docker image the user authored could have a non-root user, but a Docker image taken from an OSS project may only have a root user.

Not sure if this is a good idea but one option could be to specify the security_context in the @kubernetes decorator to point to a file. This file could have the security context json.

Similarly, will it be able to specify resource_limits in the decorator?

@rolf-moz
Copy link
Author

rolf-moz commented Dec 6, 2023

Yes @shrinandj all 3 fields that have been added to the config also support the kubernetes decorator.

For example we support

  @kubernetes(security_context=secure_environment_config,
               memory=100,
               resource_limits_memory=200,
               resource_limits_cpu=4)
   @step
   def hello(self):
         pass

Where secure_environment_config is some complex setting like the below (which is what we use):

secure_environment_config = {
        "allowPrivilegeEscalation": False,
    "capabilities": {
      "drop": [
        "ALL"
      ]
    },
    "runAsNonRoot": True,
    "runAsUser": 1000,
    "seccompProfile": {
      "type": "RuntimeDefault"
    }
}

There are so many recursive params in the security settings it seemed to me to make sense to use json (config file) and dict (in python code) so that everything can be specified in detail.

@DonIvanCorleone
Copy link

Is this PR going to happen? We wanted to apply some Kubernetes quota on one of our namespaces for limiting excessive overuse but realized that Kubernetes quotas require limits, which is currently unsupported. Would be interested in that feature.

@roofurmston
Copy link

The feedback I have had from the Metaflow team is that you need to set limits through a policy agent. This is the approach we will be investigating in due course.

@DonIvanCorleone
Copy link

Hey @roofurmston thanks for your response.
Well I am certainly not an k8s expert, but afaik those policies check for example that the pods do have limits and quotes in valid ranges. I was not / I am not aware that you can apply limits that way as well.... have to look around and familiarize myself with that concept.

Cheers

@roofurmston
Copy link

I was pointed in the direction of https://kyverno.io/. Other policy agents also allow mutations, such as open policy agent gatekeeper, but I couldn't find a way to set the resources through that policy agent.

@DonIvanCorleone
Copy link

Many thanks for the pointer. Will definitely try that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants