Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment
What is the outcome that you are trying to reach?
Currently, there are a few blueprints to deploy the different AI/ML examples:
MLFlow
trainium-inferentia
JARK
It would be great to consolidate them into one blueprint and expose the different addons as components that can be enabled/disabled
Describe the solution you would like
The proposed solution would be one blueprint that would deploy a consistent infrastructure with optional infrastructure components:
fsx/efs
nvidia
neuron
?
Optional infrastructure addons:
grafana
prometheus
opensearch
loki
?
And the optional AI/ML components:
Jupyter
Ray
MLFlow
?
And then allow deploying the optional examples
training
triton inference
ray inference
?
Describe alternatives you have considered
Additional context
We propose creating a single infrastructure blueprint that consolidates the main components into one Terraform configuration and exposes every component as a variable that is disabled by default. Advanced users can edit this configuration directly and toggle whichever components they want to deploy.
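As a rough illustration of the "disabled by default" idea, the sketch below renders a tfvars file in which every component is off unless it is named explicitly. The component list and the `enable_<name>` variable naming are assumptions made for this example; the consolidated blueprint may use different names.

```shell
# Hypothetical component names; the real variable names may differ.
COMPONENTS="neuron mlflow jupyter ray prometheus grafana"

# Print tfvars content: every component is false unless named in the arguments.
render_tfvars() {
  local c want value
  for c in $COMPONENTS; do
    value=false
    for want in "$@"; do
      if [ "$c" = "$want" ]; then
        value=true
      fi
    done
    printf 'enable_%s = %s\n' "$c" "$value"
  done
}
```

For example, `render_tfvars mlflow ray > custom.auto.tfvars` would enable only those two components, assuming the infrastructure Terraform declares matching boolean variables. Terraform loads `*.auto.tfvars` files automatically.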
To maintain the spirit of Data on EKS, we will keep separate blueprint folders, each with its own variables.tf that enables every variable the blueprint needs to function. As its first step, install.sh will copy the infrastructure Terraform into the blueprint folder, then continue as it does today. This preserves backwards compatibility with the current blueprints, keeps the examples use-case based, and makes the blueprints much easier to maintain. It also lets advanced users build directly on the infrastructure developed by the Data on EKS collaborators.
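The copy step that install.sh would perform first could look roughly like the sketch below. The function name and both directory arguments are hypothetical, and the choice to skip the shared variables.tf when the blueprint already ships its own is an assumption about how the two files would coexist, not something the proposal specifies.

```shell
# Copy the shared infrastructure Terraform files into a blueprint folder.
# Both directory arguments are hypothetical; the real repo layout may differ.
copy_infra() {
  local infra_dir="$1" blueprint_dir="$2"
  local src name
  for src in "$infra_dir"/*.tf; do
    # Skip the literal pattern when the glob matched nothing.
    [ -e "$src" ] || continue
    name="$(basename "$src")"
    # Assumption: the blueprint's own variables.tf takes precedence,
    # so it is never overwritten by the shared one.
    if [ "$name" = "variables.tf" ] && [ -e "$blueprint_dir/variables.tf" ]; then
      continue
    fi
    cp "$src" "$blueprint_dir/$name"
  done
}
```

An install.sh in a blueprint folder might then start with something like `copy_infra ../infra .` before running its existing Terraform steps.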
Steps
Use the JARK-stack infrastructure as the main infrastructure
Add the neuron Karpenter resources to the JARK-stack.
Add the neuron-device-plugin and neuron-monitor daemonsets to the JARK-stack and make sure they schedule only on Neuron nodes.
Add the MLFlow blueprint as an addon to the JARK-stack.
Risks
The biggest risk is backwards compatibility. We want to ensure minimal/no changes for the existing blueprints/users.
Benefits
Maintains the use-case based philosophy of Data on EKS
Makes the blueprints much easier to maintain and modernize