Installing the NVIDIA Driver and CUDA Toolkit failed #31
Comments
The log is similar to the one in your previous issue. Did you solve the previous problem? |
Hi @Tan-YiFan, I suspected the problem was related to the kernel version, since the guest kernel was 6.2.0, but it was not. I have now changed the guest OS kernel to 5.19-snp-guest, and the driver installation still fails. |
Could @moconnor725 or @rnertney help solve this issue? Until the NVIDIA experts respond, you could try the steps in this comment |
|
Thank you for the reply. Even with CC mode off, the installation still fails. |
Could you boot a traditional VM instead of an SEV-SNP VM? Set docc to false in launch_vm.sh. H100 CC set to off.
|
on the traditional VM( |
Could you install the driver on the host? Before installing, set ccmode to off. |
By the way, when I look at the NVIDIA installer log, it says
That is, the module compiles fine, but loading it fails. So on the guest VM, to see whether Secure Boot is enabled or disabled,
on the host,
When launching the guest VM, can I disable Secure Boot? |
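For reference, a minimal sketch of checking the Secure Boot state from inside the guest. It assumes a UEFI guest whose kernel exposes efivarfs at the standard path; the function name `secure_boot_state` is only illustrative. A kernel with Secure Boot enforced can refuse to load unsigned modules, which would match a driver that compiles but does not load. `mokutil --sb-state` should report the same information if mokutil is installed.

```python
from pathlib import Path

# Standard efivarfs name of the UEFI SecureBoot variable (EFI global variable GUID).
SECURE_BOOT_VAR = Path(
    "/sys/firmware/efi/efivars/SecureBoot-8be4df61-93ca-11d2-aa0d-00e098032b8c"
)


def secure_boot_state(var_path: Path = SECURE_BOOT_VAR) -> str:
    """Return "enabled", "disabled", or "unknown" for the running system."""
    try:
        raw = var_path.read_bytes()
    except OSError:
        # No efivarfs entry: legacy BIOS boot, or the variable is not exposed.
        return "unknown"
    # efivarfs prepends 4 attribute bytes; the fifth byte holds the value.
    if len(raw) < 5:
        return "unknown"
    return "enabled" if raw[4] == 1 else "disabled"


if __name__ == "__main__":
    print(f"Secure Boot: {secure_boot_state()}")
```

Run it inside the guest; "unknown" usually means the VM was not booted through UEFI or efivarfs is not mounted.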
|
Actually, the installation succeeded, but If you cannot run |
Actually, the installation succeeded, but modprobe nvidia.ko failed. After disabling CC mode on the host (kernel: 5.19.0-rc6-snp-host-c4daeffce56e), I failed to install the NVIDIA driver ( the nvidia-installer.log says
|
The error is The source should be at |
The host kernel is 5.19.0-rc6-snp-host-c4daeffce56e, which was built by following the NVIDIA deployment guide, and there is no source for that kernel in
Do you mean that I should return to the stock kernel before installing the NVIDIA driver on the host (in my case, 5.15)? |
Returning to 5.15 should work.
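Before re-running the installer, it can help to confirm that a build tree exists for the running kernel. The sketch below assumes the conventional location `/lib/modules/<release>/build`, which is where out-of-tree module builds usually look; that may or may not be the exact path the installer complained about above. The helper names are hypothetical.

```python
import platform
from pathlib import Path
from typing import Optional


def kernel_build_dir(release: Optional[str] = None,
                     modules_root: Path = Path("/lib/modules")) -> Path:
    """Return the conventional build directory for a kernel release."""
    # platform.release() gives the same string as `uname -r`.
    release = release or platform.release()
    return modules_root / release / "build"


def has_build_tree(release: Optional[str] = None,
                   modules_root: Path = Path("/lib/modules")) -> bool:
    """True if the build directory exists and contains a top-level Makefile."""
    return (kernel_build_dir(release, modules_root) / "Makefile").is_file()


if __name__ == "__main__":
    print(kernel_build_dir(), "->", "present" if has_build_tree() else "missing")
```

On a stock Ubuntu kernel, this directory is normally provided by the matching linux-headers package; a custom-built kernel only has it if its headers or source tree were installed too.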
|
I've reinstalled Ubuntu 22.04.3 LTS Server on the host, and in this clean state the checklist outputs are as follows.
Then, I installed
|
OK. This experiment shows that the H100 itself works fine: it works on the host but not in a VM (even without SEV-SNP). My suggestion is to add debug information to the driver:
This procedure compiles the kernel module as For example:
My guess is that the problem comes from the CPU IOMMU. Note that NVIDIA provides a host-side patch for AMD servers concerning IOMMU configuration. The patch works on AMD Zen 3 but might fail on your AMD Zen 4. |
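As a complement to adding debug output to the driver, here is a rough sketch for pulling the relevant lines out of the kernel log after a failed module load. It assumes the usual log prefixes: `NVRM` for the NVIDIA kernel module and `AMD-Vi` for the AMD IOMMU driver. The marker list and function names are illustrative, not part of any NVIDIA tooling.

```python
import subprocess
from typing import Iterable, List

# Substrings that usually mark NVIDIA driver and AMD IOMMU kernel messages.
MARKERS = ("NVRM", "AMD-Vi", "IOMMU")


def filter_kernel_log(text: str, markers: Iterable[str] = MARKERS) -> List[str]:
    """Keep only the log lines that mention one of the markers."""
    markers = tuple(markers)
    return [line for line in text.splitlines()
            if any(marker in line for marker in markers)]


def read_kernel_log() -> str:
    """Return the output of dmesg (may need root on some systems)."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=False)
    return result.stdout


if __name__ == "__main__":
    for line in filter_kernel_log(read_kernel_log()):
        print(line)
```

Running it on the host and in the guest right after the failed modprobe makes it easier to compare what each side reports.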
You mean changing the code
to
and then re-running
By the way, the reporter of issue #28 also has AMD Zen 4, but succeeded. |
Yes |
I will update the VBIOS first and then retry. |
After I updated the VBIOS from
it seems to work.
Thanks. |
Hi, where is the documentation for upgrading the VBIOS? Many thanks! |
Hi,
I recently started configuring AMD SEV-SNP with an H100 GPU, following the official deployment guide, but I ran into an error.
My machine's specs:
SYSTEM: GIGABYTE
CPU: Dual AMD EPYC 9224 16-Core Processor
GPU: H100 10de:2331
Host OS: Ubuntu 22.04 with 5.19.0-rc6-snp-host-c4daeffce56e kernel
Guest OS: Ubuntu 22.04.2 with 5.19.0-rc6-snp-guest-c4daeffce56e kernel
Up to p. 25, I followed the deployment guide successfully.
But when I tried to install the NVIDIA Driver and CUDA Toolkit on page 26, it failed.
The
/var/log/nvidia-installer
says
The full log can be found here.
How can I fix it?