AI/ML Blueprint Consolidation #729

omrishiv · 2025-01-15T17:54:34Z

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or other comments that do not add relevant new information or questions; they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

What is the outcome that you are trying to reach?

Currently, there are a few blueprints to deploy the different AI/ML examples:

MLFlow
trainium-inferentia
JARK

It would be great to consolidate them into one blueprint and expose the different addons as components that can be enabled or disabled.

Describe the solution you would like

The proposed solution would be one blueprint that would deploy a consistent infrastructure with optional infrastructure components:

fsx/efs
nvidia
neuron
?

Optional infrastructure addons:

grafana
prometheus
opensearch
loki
?

And the optional AI/ML components:

Jupyter
Ray
MLFlow
?

And then allow deploying the optional examples:

training
triton inference
ray inference
?

Describe alternatives you have considered

Additional context

omrishiv · 2025-01-21T21:21:49Z

Proposal

We propose creating a single infrastructure blueprint that consolidates the main components into one Terraform configuration and exposes each component as a variable that is disabled by default. Advanced users can edit this Terraform directly and toggle the components they want to deploy.
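As a minimal sketch of what "components as variables disabled by default" could look like, the fragment below declares a few boolean toggles and gates a placeholder resource on one of them. The variable names (`enable_jupyterhub`, `enable_ray`, `enable_mlflow`, `enable_neuron`) are hypothetical and not taken from the existing blueprints, and `terraform_data` (built into Terraform 1.4 and later) only stands in for whatever module or Helm release a real addon would use.

```hcl
# Hypothetical toggles for the consolidated infrastructure blueprint.
# Every component is off unless a blueprint or user turns it on.
variable "enable_jupyterhub" {
  description = "Deploy JupyterHub"
  type        = bool
  default     = false
}

variable "enable_ray" {
  description = "Deploy the Ray operator"
  type        = bool
  default     = false
}

variable "enable_mlflow" {
  description = "Deploy MLflow"
  type        = bool
  default     = false
}

variable "enable_neuron" {
  description = "Deploy the neuron Karpenter resources and daemonsets"
  type        = bool
  default     = false
}

locals {
  # Component name mapped to its toggle, so enabled components can be listed.
  components = {
    jupyterhub = var.enable_jupyterhub
    ray        = var.enable_ray
    mlflow     = var.enable_mlflow
    neuron     = var.enable_neuron
  }

  enabled_components = [for name, enabled in local.components : name if enabled]
}

# Placeholder for a real addon module or resource (assumption).
# With count = 0 nothing is created, so a disabled component costs nothing.
resource "terraform_data" "mlflow" {
  count = var.enable_mlflow ? 1 : 0
  input = "mlflow"
}

output "enabled_components" {
  description = "Names of the components that are currently enabled"
  value       = local.enabled_components
}
```

With this shape, an advanced user could enable a single component at plan time, for example with `terraform plan -var="enable_mlflow=true"`, without editing the files.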

To maintain the spirit of Data on EKS, we will keep separate blueprint folders, each with a variables.tf file that is unique to that blueprint and has every variable the blueprint needs enabled. As its first step, the install.sh file will copy the infrastructure Terraform into the blueprint folder, then continue as it does today. This maintains backwards compatibility with the current blueprints and preserves the use-case based functionality of the examples, while making the blueprints much easier to maintain. It also allows for advanced usage of the infrastructure developed by the Data on EKS collaborators.
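To illustrate the per-blueprint file, here is a sketch of a variables.tf that might live in a hypothetical MLflow blueprint folder. It assumes the copy step in install.sh leaves this file in place rather than overwriting it, and that the copied infrastructure does not declare the same variables in another file (Terraform rejects duplicate variable declarations within one directory). The variable names and the `mlflow-on-eks` default are illustrative assumptions.

```hcl
# variables.tf for a hypothetical MLflow blueprint folder.
# Only the components this use case needs default to true.
variable "name" {
  description = "Name prefix for the cluster and its resources"
  type        = string
  default     = "mlflow-on-eks" # hypothetical value
}

variable "enable_mlflow" {
  description = "Deploy MLflow (required by this blueprint)"
  type        = bool
  default     = true
}

variable "enable_jupyterhub" {
  description = "Deploy JupyterHub (not needed by this blueprint)"
  type        = bool
  default     = false
}

variable "enable_neuron" {
  description = "Deploy the neuron Karpenter resources and daemonsets (not needed by this blueprint)"
  type        = bool
  default     = false
}
```

Keeping the differences between blueprints in this one file is what would let the shared infrastructure change in a single place while each blueprint still deploys the same components as before.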

Steps

Use the JARK-stack infrastructure as the main infrastructure.
Add the neuron Karpenter resources to the JARK-stack.
Add the neuron-device-plugin and neuron-monitor daemonsets to the JARK-stack and make sure they only schedule on neuron nodes.
Add the MLFlow blueprint as an addon to the JARK-stack.

Risks

The biggest risk is backwards compatibility. We want to ensure minimal/no changes for the existing blueprints/users.

Benefits

Maintained, use-case based philosophy of Data on EKS
Much higher maintainability/modernization of blueprints
Configurability for advanced users

askulkarni2 assigned omrishiv Jan 29, 2025

askulkarni2 added the enhancement label Jan 29, 2025

vara-bonthu mentioned this issue Feb 1, 2025

Deprecation Notice: Upcoming Removal of Specific Blueprints #623

Closed
