Hello! I'm the mini mummi! 🦛
The Mummi Operator is intended to run MiniMummi.
- go version v1.22.0+
- docker version 17.03+.
- kubectl version v1.11.3+.
- Access to a Kubernetes v1.11.3+ cluster.
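If you want to double check your local tooling against these versions, this just prints what you have installed:
# Print local versions to compare against the requirements above
go version
docker --version
kubectl version --client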
You can create a cluster locally (if your computer is chonky and can handle it) or use AWS. Here is how to create one locally:
kind create cluster --config ./examples/kind-config.yaml
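Once kind finishes, you can sanity check that the cluster exists and kubectl is talking to it (the context name comes from the cluster name in your kind config, prefixed with kind-):
# Confirm the kind cluster is up and reachable
kind get clusters
kubectl cluster-info
kubectl get nodes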
And for AWS (recommended for most cases):
eksctl create cluster --config-file examples/eks-config-6.yaml
aws eks update-kubeconfig --region us-east-2 --name mini-mummi
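Once the kubeconfig is updated, it's worth confirming the nodes have registered before deploying anything (node groups can take a few minutes to become Ready):
# Confirm you are on the right context and the nodes are Ready
kubectl config current-context
kubectl get nodes -o wide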
# hpc6a cpu instances
# cpu createsim container pull: 70-80 seconds
# runtime: ~21-24 minutes
# cpu cganalysis pull: 93 seconds
# cganalysis TODO:
#  - debug issue with rabbitmq disconnecting
#  - add a max cganalysis check; it should send a request back to the operator to terminate everything but the registry
eksctl create cluster --config-file examples/eks-config-hpc6a.yaml
aws eks update-kubeconfig --region us-east-2 --name mini-mummi
# Or with GPUs
eksctl create cluster --config-file examples/eks-config-gpu-6.yaml
aws eks update-kubeconfig --region us-east-1 --name mini-mummi-gpu
Kind Only
If you are using kind, you will want to load your images. If you are using AWS (on our account with the registry) you'll be able to pull them directly to the cluster. We load the images into kind to make our lives easier (otherwise we would need to include pull secrets). You might need to log in and pull them first:
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 633731392008.dkr.ecr.us-east-1.amazonaws.com
docker pull 633731392008.dkr.ecr.us-east-1.amazonaws.com/mini-mummi:rabbitmq
docker pull 633731392008.dkr.ecr.us-east-1.amazonaws.com/mini-mummi:mlserver
docker pull 633731392008.dkr.ecr.us-east-1.amazonaws.com/mini-mummi:wfmanager
docker pull 633731392008.dkr.ecr.us-east-1.amazonaws.com/mini-mummi:createsims
docker pull 633731392008.dkr.ecr.us-east-1.amazonaws.com/mini-mummi:cganalysis
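A quick check that all five images made it to your local machine:
# You should see the five mini-mummi tags listed above
docker images | grep mini-mummi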
And then load them from your local machine. When we run this on EKS, we will likely have easy access to our private registry.
kind load docker-image 633731392008.dkr.ecr.us-east-1.amazonaws.com/mini-mummi:rabbitmq
kind load docker-image 633731392008.dkr.ecr.us-east-1.amazonaws.com/mini-mummi:mlserver
kind load docker-image 633731392008.dkr.ecr.us-east-1.amazonaws.com/mini-mummi:wfmanager
kind load docker-image 633731392008.dkr.ecr.us-east-1.amazonaws.com/mini-mummi:createsims
kind load docker-image 633731392008.dkr.ecr.us-east-1.amazonaws.com/mini-mummi:cganalysis
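If you'd rather not type each command, the same thing as a small loop over the tags above (purely a convenience, nothing new here):
# Load all of the mini-mummi images into kind in one pass
registry=633731392008.dkr.ecr.us-east-1.amazonaws.com/mini-mummi
for tag in rabbitmq mlserver wfmanager createsims cganalysis; do
    kind load docker-image "${registry}:${tag}"
done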
The operator is built and deployed via its manifest in dist. For development:
make test-deploy-recreate
For non-development:
kubectl apply -f examples/dist/mummi-operator.yaml
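Either way, check that the operator controller pod is running before creating anything. Grepping across namespaces avoids assuming what the manifest names the operator namespace:
# Look for the operator controller pod wherever the manifest installed it
kubectl get pods --all-namespaces | grep -i mummi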
Then create the MiniMummi instance:
# Without GPU
kubectl apply -f examples/test-aws/mummi.yaml
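Then a quick sanity check that the components come up. The deployment name below is an assumption based on the mummi-sample instance name used elsewhere in this document:
# Watch the instance pods (rabbitmq, mlserver, wfmanager) appear
kubectl get pods
# Peek at the workflow manager logs (deployment name assumed from the sample)
kubectl logs deploy/mummi-sample-wfmanager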
For the GPU setup, first test that you see the GPU devices:
kubectl get nodes -o json | grep nvidia.com/gpu
# More specific
kubectl get nodes -o json | jq -r .items[].status.capacity | grep nvidia
If you don't see them, you are unfortunate enough to need to install the NVIDIA GPU operator:
kubectl create ns gpu-operator
kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator --version=v24.9.1 --set driver.enabled=false
# Check labels and GPUs (you should see nvidia.com/gpu)
kubectl get pods -n gpu-operator
kubectl get nodes -o json | jq '.items[].metadata.labels'
kubectl apply -f examples/test-aws/gpu-mummi.yaml
Here is how to list the charts installed in the namespace and uninstall:
# Show the name generated in the gpu-operator namespace
helm list -n gpu-operator
# Uninstall the chart
helm uninstall -n gpu-operator gpu-operator-1736103287
Note that if the workflow manager isn't connecting, it's likely a race condition (the RPC server isn't listening yet):
[retry=1/100] No RPC server is listening on queue mummi_queue_mummiusr, retrying in 5 secondes ...
You should be able to delete the pod and it will be recreated.
kubectl delete pod mummi-sample-wfmanager-64d87ddb87-6w8lb
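Since the pod name suffix will differ in your cluster, you can also find the pod or restart it through its deployment (the deployment name here is inferred from the sample pod name above):
# Find the workflow manager pod without copying the exact hash
kubectl get pods | grep wfmanager
# Or restart via the deployment and let Kubernetes recreate the pod
kubectl rollout restart deployment/mummi-sample-wfmanager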
If you want to delete the deployment, note that jobs are intentionally not tied to the Mummi Operator, so you can delete them separately:
kubectl delete -f examples/test-aws/mummi.yaml
# Or for GPU
kubectl delete -f examples/test-aws/gpu-mummi.yaml
kubectl delete jobs --all
That is done so that if the workflow manager or mlserver (or another component) needs to be nuked, we don't lose running jobs.
When you are done, delete the cluster:
eksctl delete cluster --config-file examples/eks-config-6.yaml --wait
# Or for GPU
eksctl delete cluster --config-file examples/eks-config-gpu-6.yaml --wait
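To be sure nothing is left running (and billing), you can list the remaining clusters in each region afterward:
# Should show no clusters once deletion completes
eksctl get cluster --region us-east-2
eksctl get cluster --region us-east-1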
These are some design decisions I've made (of course open to discussion):
- State is derived from Kubernetes, and not relying on some filesystem state
- We assume jobs don't need to be paused / resumed / reclaimed like on HPC
- Internal: all of the controller logic, etc. should be internal
- I'm trying to add Kubernetes functionality in a way that doesn't disturb (change) core Mummi, e.g., entrypoints and environment variables.
- If/when the operator is deleted, jobs (createsim and cganalysis) are not. I think this might make sense if the orchestration needs updating without destroying the jobs.
- But the job state can be re-discovered by a newly deployed operator
- Variables and functions that derive customization for Mummi should all derive from the spec (e.g., so the many templates can be populated just using it)
- Instead of putting all assets for a deployment in one config map or secret, I am separating them out. This allows more pointed updates (if needed) and more transparency for the developer user (see the quick check after this list).
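As a quick way to see that separation on a running instance (the grep is just to avoid assuming exact resource names):
# List the per-component config maps and secrets for the instance
kubectl get configmaps,secrets | grep -i mummi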
Note that this design was further refactored into the [state machine operator](https://github.com/converged-computing/state-machine-operator). This model uses a state machine, and there is no mummi logic (or code) required for the workflow manager. Each mummi job step is just a modular container for the state machine to use.
You can shell into the rabbitmq pod to test the connection:
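For example (the pod name is a placeholder; grab the real one from kubectl get pods):
# Open a shell in the rabbitmq pod for your instance
kubectl exec -it <rabbitmq-pod-name> -- bash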
root@rabbitmq:/# openssl s_client -connect rabbitmq.mummi-sample.default.svc.cluster.local:5671 -servername rabbitmq.mummi-sample.default.svc.cluster.local
Connecting to 10.244.0.7
CONNECTED(00000003)
write:errno=104
---
no peer certificate available
---
No client certificate CA names sent
---
SSL handshake has read 0 bytes and written 355 bytes
Verification: OK
---
New, (NONE), Cipher is (NONE)
This TLS version forbids renegotiation.
Compression: NONE
Expansion: NONE
No ALPN negotiated
Early data was not sent
Verify return code: 0 (ok)
---
HPCIC DevTools is distributed under the terms of the MIT license. All new contributions must be made under this license.
See LICENSE, COPYRIGHT, and NOTICE for details.
SPDX-License-Identifier: (MIT)
LLNL-CODE-842614