AI/ML Blueprint Consolidation #729

Open
omrishiv opened this issue Jan 15, 2025 · 1 comment
Labels
enhancement New feature or request

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

What is the outcome that you are trying to reach?

Currently, there are a few blueprints to deploy the different AI/ML examples:

  • MLFlow
  • trainium-inferentia
  • JARK

It would be great to consolidate them into one blueprint and expose the different addons as components that can be enabled or disabled.

Describe the solution you would like

The proposed solution is a single blueprint that deploys a consistent base infrastructure with optional infrastructure components:

  • fsx/efs
  • nvidia
  • neuron
  • ?

Optional infrastructure addons:

  • grafana
  • prometheus
  • opensearch
  • loki
  • ?

And the optional AI/ML components:

  • Jupyter
  • Ray
  • MLFlow
  • ?

It would then allow deploying optional examples:

  • training
  • triton inference
  • ray inference
  • ?

Describe alternatives you have considered

Additional context

omrishiv commented Jan 21, 2025

Proposal

We propose creating a single infrastructure blueprint that consolidates the main components into one Terraform configuration and exposes every component as a variable that is disabled by default. Advanced users can edit this Terraform directly and toggle whichever components they want to deploy.

To maintain the spirit of Data on EKS, we will keep separate blueprint folders, each with a variables.tf file that is unique to that blueprint and enables all of the variables the blueprint needs to function. As its first step, install.sh will copy the infrastructure Terraform into the blueprint folder, then continue as it does today. This maintains backwards compatibility with the current blueprints, preserves the use-case-based structure of the examples, and makes the blueprints much easier to maintain. It also allows advanced use of the infrastructure developed by the Data on EKS collaborators.
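As a rough illustration of the copy step, here is a minimal sketch of what the start of install.sh could look like. The function name `copy_infra`, the directory arguments, and the rule that a blueprint's own variables.tf takes precedence over the shared one are all assumptions made for this example; none of them are settled by the proposal.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: copy the shared infrastructure Terraform files into a
# blueprint folder. The layout and the function name are assumptions.
#   $1 = directory holding the shared infrastructure *.tf files
#   $2 = blueprint directory that already holds its own variables.tf
copy_infra() {
  local infra_dir="$1" blueprint_dir="$2"
  local src name
  for src in "$infra_dir"/*.tf; do
    # If the glob matched nothing, there is nothing to copy.
    [ -e "$src" ] || continue
    name="$(basename "$src")"
    # Assumption: the blueprint's variables.tf carries the enabled toggles,
    # so it must not be overwritten by the shared defaults.
    if [ "$name" = "variables.tf" ] && [ -e "$blueprint_dir/$name" ]; then
      continue
    fi
    cp "$src" "$blueprint_dir/$name"
  done
}
```

A blueprint's install.sh would call something like `copy_infra ../infra .` (paths hypothetical) before running its existing steps, so the rest of the script stays unchanged.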

Steps

  • Use the JARK-stack infrastructure as the main infrastructure
  • Add the neuron Karpenter resources to the JARK-stack.
  • Add the neuron-device-plugin and neuron-monitor daemonsets to the JARK-stack and make sure they schedule only on neuron nodes
  • Add the MLFlow blueprint as an addon to the JARK-stack.

Risks

The biggest risk is backwards compatibility. We want to ensure minimal or no changes for existing blueprints and their users.

Benefits

  • Preserves the use-case-based philosophy of Data on EKS
  • Makes the blueprints much easier to maintain and modernize
  • Gives advanced users configurability
