Sorry to say, but we found a number of frictions with MIG. One is that it broke DeepSpeed-MII because of the way DeepSpeed-MII selected the active GPU. I patched that, and you can also simply not use DeepSpeed-MII; however, I think similar issues are likely to come up.
Another issue is performance: assigning multiple MIG devices to one pod is not the same as removing MIG and assigning the whole GPU. I'm not sure exactly what the implications are, as we didn't test it. Meanwhile, removing MIG or changing the partitions required a node restart when we last checked.
Something to note is that if the model is smaller than the VRAM size, a number of the inference engines will still make full use of the VRAM as a cache. So it is not a complete waste to over-allocate VRAM to one model, if that is what is motivating MIG use.
Apologies if I have missed this topic, and it is already covered in the documentation/CRDs.
When using an NVIDIA accelerator in aideployments.premlabs.io, I was wondering if it is possible to provide multiple GPUs.
In a cluster with multiple GPUs:
Only a single GPU was requested by the deployment:
In this case, I have used a deployment from https://github.com/premAI-io/prem-operator/blob/main/examples/big-agi.yaml.
I would be interested in providing more GPUs to a single deployment.
Also, related to MIGs, in a cluster where GPUs are not labeled as nvidia.com/gpu:
Will aideployments.premlabs.io be able to select a MIG?
I skimmed through the codebase briefly and I observed that an nvidia.com/gpu label is hard-coded.
prem-operator/controllers/constants/labels.go
Line 4 in 0322a6b
(Apologies again, I am not able to test it myself at the moment on the MIG cluster).
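To make the request concrete, here is a minimal Go sketch of what a configurable accelerator request could look like, replacing a single hard-coded resource name. Everything in it is hypothetical: the `AcceleratorRequest` type, its fields, and the `ResourceLimits` helper are not part of prem-operator, and the MIG resource name `nvidia.com/mig-1g.5gb` is only an example of what a cluster might expose, depending on how its NVIDIA device plugin is configured.

```go
package main

import "fmt"

// DefaultGPUResource mirrors the resource name the issue reports as hard-coded.
const DefaultGPUResource = "nvidia.com/gpu"

// AcceleratorRequest is a hypothetical spec fragment, not the operator's actual CRD.
type AcceleratorRequest struct {
	ResourceName string // e.g. "nvidia.com/mig-1g.5gb"; empty falls back to the default
	Count        int64  // number of devices; values below 1 are treated as 1
}

// ResourceLimits builds the extended-resource limits a container would carry.
// It rejects names outside the nvidia.com/ domain (an assumption of this sketch).
func ResourceLimits(req AcceleratorRequest) (map[string]int64, error) {
	name := req.ResourceName
	if name == "" {
		name = DefaultGPUResource
	}
	const prefix = "nvidia.com/"
	if len(name) <= len(prefix) || name[:len(prefix)] != prefix {
		return nil, fmt.Errorf("unsupported accelerator resource %q", name)
	}
	count := req.Count
	if count < 1 {
		count = 1
	}
	return map[string]int64{name: count}, nil
}

func main() {
	requests := []AcceleratorRequest{
		{Count: 2}, // two whole GPUs under the default resource name
		{ResourceName: "nvidia.com/mig-1g.5gb", Count: 1}, // one MIG slice (example name)
	}
	for _, req := range requests {
		limits, err := ResourceLimits(req)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(limits)
	}
}
```

With something along these lines, a deployment could ask for more than one GPU, or for a MIG profile, without the operator needing to know every possible resource name in advance.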