
Add Ray Autoscaler to the Flyte-Ray plugin. #4187

Closed
2 tasks done
Tracked by #4064
Yicheng-Lu-llll opened this issue Oct 9, 2023 · 12 comments
@Yicheng-Lu-llll (Member)

Motivation: Why do you think this is important?

Currently, the Flyte-Ray plugin uses RayJob. However, there are cases where a RayJob may need an autoscaler.

For instance, after a RayJob workload completes, a user might want to retain the logs, past tasks, and actor execution history for a period. As of now, Ray lacks a mechanism to persist this data, so the Ray cluster must keep running even after the workload completes. With an autoscaler, the Ray cluster can keep only the head pod while scaling all worker pods down.

Goal: What should the final outcome look like, ideally?

Provide a configuration option to enable the Ray Autoscaler.

Describe alternatives you've considered

None

Propose: Link/Inline OR Additional context

None

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@Yicheng-Lu-llll Yicheng-Lu-llll added enhancement New feature or request untriaged This issues has not yet been looked at by the Maintainers labels Oct 9, 2023
@pingsutw pingsutw added Epic: Ray Ray/KubeRay Support in Flyte Epic: Flyte Agent feature-request hacktoberfest and removed untriaged This issues has not yet been looked at by the Maintainers labels Oct 9, 2023
@asingh9530

Hi @samhita-alla,

I need some help and have a question.
In the current setting for WorkerGroupSpec, we already have config for min_replica (default 0) and max_replica, as can be seen here. For autoscaling, per the Ray docs we only need a small additional config class with the following settings:

  • max_workers [default_value=2, min_value=0]
  • min_workers [default_value=0, min_value=0]

The user-facing input should accept the max_workers and min_workers config.

I believe this is the workflow to enable autoscaling. If that's the case, I am happy to create a PR, but if any other details are required, please let me know. 🙂
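As a sketch, the worker-group bounds described above might look like the following. This is a hypothetical stand-in using a plain dataclass with invented field names, not the actual flytekitplugins-ray WorkerNodeConfig:

```python
from dataclasses import dataclass

# Hypothetical sketch of the worker-group bounds discussed above.
# Field names are illustrative, not the real flytekit plugin API.
@dataclass
class WorkerGroupSpec:
    group_name: str
    replicas: int = 1
    min_replicas: int = 0   # min_workers [default_value=0, min_value=0]
    max_replicas: int = 2   # max_workers [default_value=2, min_value=0]

    def __post_init__(self) -> None:
        # Enforce the invariants implied by the settings above.
        if not (0 <= self.min_replicas <= self.max_replicas):
            raise ValueError("require 0 <= min_replicas <= max_replicas")
        if not (self.min_replicas <= self.replicas <= self.max_replicas):
            raise ValueError("replicas must lie within [min_replicas, max_replicas]")

spec = WorkerGroupSpec(group_name="ray-workers", replicas=1)
print(spec.min_replicas, spec.max_replicas)  # 0 2
```

With the defaults above, the autoscaler would be free to scale the group anywhere between 0 and 2 workers.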

@osevin commented Oct 24, 2023

I think we also need to be able to set

enableInTreeAutoscaling: true

in the cluster spec for autoscaling to work (https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/configuring-autoscaling.html#enabling-autoscaling)

@samhita-alla (Contributor)

@Yicheng-Lu-llll, could you clarify if we need to incorporate the changes suggested by @asingh9530 and @osevin?

@kumare3 (Contributor) commented Oct 25, 2023

I think we should support the evolving spec completely. We should make it JSON.

@asingh9530

Hi @osevin, agreed, we need to add this.

@kumare3, just confirming: are you suggesting accepting the input as JSON and validating it using a dataclass library like Pydantic?
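To illustrate the idea being asked about, here is a minimal sketch of accepting the autoscaler options as raw JSON and validating them against a typed schema. Stdlib dataclasses stand in for Pydantic here, and all names are invented for illustration (the actual plugin is in Go):

```python
import json
from dataclasses import dataclass, fields

# Hypothetical schema for the autoscaler options; not the real plugin code.
@dataclass
class AutoscalerOptions:
    upscalingMode: str = "Default"
    idleTimeoutSeconds: int = 60

def parse_autoscaler_options(raw: str) -> AutoscalerOptions:
    """Parse a JSON string and reject unknown or invalid fields."""
    data = json.loads(raw)
    allowed = {f.name for f in fields(AutoscalerOptions)}
    unknown = set(data) - allowed
    if unknown:
        raise ValueError(f"unknown autoscaler fields: {sorted(unknown)}")
    opts = AutoscalerOptions(**data)
    if opts.upscalingMode not in {"Conservative", "Default", "Aggressive"}:
        raise ValueError(f"invalid upscalingMode: {opts.upscalingMode!r}")
    return opts

print(parse_autoscaler_options('{"idleTimeoutSeconds": 120}'))
```

The appeal of raw JSON is that new KubeRay fields can pass through without an IDL change; the validation layer then decides how strict to be about unknown keys.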

@kumare3 (Contributor) commented Oct 27, 2023

There is no Pydantic, as this is in Golang. What I would love is the ability to keep the spec evolvable as things change, without sacrificing simplicity and correctness. We can brainstorm solutions.

@asingh9530

@kumare3 this issue was tagged under flytekit, not flyte, here. That's why I suggested incorporating it directly under RayFunctionTask.

@samhita-alla (Contributor)

@pingsutw, can you confirm if the changes @asingh9530 mentioned are the correct ones?

@asingh9530

Hi @pingsutw, @samhita-alla, do you have any update on this?

@samhita-alla (Contributor) commented Nov 2, 2023

Hey, @pingsutw is currently not available. @eapolinario, can you chime in here, please?

@Yicheng-Lu-llll (Member, Author) commented Nov 2, 2023

I think the final generated YAML should look like this:

  enableInTreeAutoscaling: true
  # `autoscalerOptions` is an OPTIONAL field specifying configuration overrides for the Ray Autoscaler.
  # The example configuration shown below represents the DEFAULT values.
  # (You may delete autoscalerOptions if the defaults are suitable.)
  autoscalerOptions:
    # `upscalingMode` is "Conservative", "Default", or "Aggressive."
    # Conservative: Upscaling is rate-limited; the number of pending worker pods is at most the size of the Ray cluster.
    # Default: Upscaling is not rate-limited.
    # Aggressive: An alias for Default; upscaling is not rate-limited.
    upscalingMode: Default
    # `idleTimeoutSeconds` is the number of seconds to wait before scaling down a worker pod which is not using Ray resources.
    idleTimeoutSeconds: 60
    # `image` optionally overrides the Autoscaler's container image. The Autoscaler uses the same image as the Ray container by default.
    ## image: "my-repo/my-custom-autoscaler-image:tag"
    # `imagePullPolicy` optionally overrides the Autoscaler container's default image pull policy (IfNotPresent).
    imagePullPolicy: IfNotPresent
    # Optionally specify the Autoscaler container's securityContext.
    securityContext: {}
    env: []
    envFrom: []
    # resources specifies optional resource request and limit overrides for the Autoscaler container.
    # The default Autoscaler resource limits and requests should be sufficient for production use-cases.
    # However, for large Ray clusters, we recommend monitoring container resource usage to determine if overriding the defaults is required.
    resources:
      limits:
        cpu: "500m"
        memory: "512Mi"
      requests:
        cpu: "500m"
        memory: "512Mi"
  # Ray head pod template

To do so, we should:

  1. Add these configs to flytekit, like here and here.
  2. Add these configs to flyteidl, like here.
  3. Add these configs to the flyte-ray plugin, like here.
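As a rough illustration of step 3, the plugin would merge the new autoscaler settings into the RayCluster spec of the RayJob custom resource. The real plugin is written in Go; this Python sketch uses KubeRay CRD field names, but the function itself is invented for illustration:

```python
from typing import Optional

# Hypothetical sketch of the plugin-side merge; not the real flyte-ray
# plugin code (which is Go). Field names follow the KubeRay CRD.
def apply_autoscaler_config(ray_cluster_spec: dict,
                            enable_autoscaling: bool,
                            autoscaler_options: Optional[dict] = None) -> dict:
    if not enable_autoscaling:
        return ray_cluster_spec
    # Turn on the in-tree autoscaler sidecar on the head pod.
    ray_cluster_spec["enableInTreeAutoscaling"] = True
    if autoscaler_options:
        # Optional overrides, e.g. {"upscalingMode": "Default",
        # "idleTimeoutSeconds": 60}.
        ray_cluster_spec["autoscalerOptions"] = autoscaler_options
    return ray_cluster_spec

spec = apply_autoscaler_config({}, True, {"idleTimeoutSeconds": 60})
```

Keeping `autoscalerOptions` as an opaque mapping on the Flyte side would also serve @kumare3's point about letting the spec evolve with KubeRay without repeated IDL changes.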

@kumare3 (Contributor) commented Mar 22, 2024

We will close this issue, as the change was already merged.

@kumare3 closed this as completed Mar 22, 2024