[Request] Support for Nvidia vGPU drivers #461
Comments
I took that statement to mean that these drivers are provided by Google for use with Compute Engine, as opposed to the other types of VMs that Google Cloud offers. That reads as a compatibility warning more than a usage restriction, but I'm not a lawyer.
I think the vGPU guest drivers are not gated on NVIDIA's site (I'm not sure if vendors have specific flavors of the driver), and this appears to be a general version of the driver we could use: https://www.nvidia.com/en-us/drivers/details/156511/ We'd also need to install and run the
I can ask internally what distribution vendors are expected or encouraged to do. GPU-Operator does support managing vGPU components, but I haven't looked at the details (what the chart does, what the operator does). I should try to re-engage with my prototype to make more of the operator work on Talos out of the box.
@jfroy were you able to hear anything on your side about GRID/vGPU drivers or operator support for secure distros like Talos? If you have time to share what is still holding the operator back from working, I'd love to understand the limitations better so we can help if needed.
The operator chart supports installing vGPU Manager and vGPU Device Manager and configuring them via the
There hasn't been much progress since my last message (I've just been busy with other stuff). I still hope to get more answers for you. I've also re-engaged with my open PRs to get the basic operator to work better on Talos (namely NVIDIA/nvidia-container-toolkit#700). NVIDIA/gpu-operator#1285 is also now in progress, which obsoletes a prior PR I had sent. I believe that using native CDI support is the way to go on Talos.
So, we need a separate kernel driver branch, right? Is there a redistributable version of this that is not behind a login? GCP distributes .run files for GRID drivers, but apparently they are to be used by GCP customers on GCP only. AWS has a similar published GRID driver bucket. Does GPU operator support the guest side (nvidia-gridd or an alternative for licensing)? vGPU Manager seems to be the hypervisor part.
The chart and CRD do have keys for licensing configuration (including NLS), which gets injected into driver containers. I don't think the operator does or even can handle VM licensing (e.g. kubevirt). I am less sure what happens for pod workloads; I will look into it.
Would it be possible to add support for the vGPU Guest Drivers? This would enable running Talos in a VM with an Nvidia vGPU that has been passed through by the hypervisor.
Nvidia themselves don't publish the vGPU drivers for download without a license, but at least the guest drivers can be downloaded freely from the Google Cloud Platform https://cloud.google.com/compute/docs/gpus/grid-drivers-table
I'm not sure if Google would be OK with adding their CDN to a CI/CD pipeline for Talos images, though.
Please note that Nvidia vGPU drivers are not the same as Nvidia Enterprise GPU drivers. They are not interchangeable and serve separate purposes. The currently included Nvidia drivers do not work for this purpose.