Installing the NVIDIA Driver and CUDA Toolkit failed #31

Closed
seungsoo-lee opened this issue Jan 2, 2024 · 23 comments

@seungsoo-lee

seungsoo-lee commented Jan 2, 2024

Hi,

I recently started configuring AMD SEV-SNP with an H100 GPU, following the official deployment guide, but I ran into an error.

My machine's specs:

SYSTEM: GIGABYTE
CPU: Dual AMD EPYC 9224 16-Core Processor
GPU: H100 10de:2331
Host OS: Ubuntu 22.04 with 5.19.0-rc6-snp-host-c4daeffce56e kernel
Guest OS: Ubuntu 22.04.2 with 5.19.0-rc6-snp-guest-c4daeffce56e kernel

Up to p. 25 of the deployment guide, everything succeeded.
But when I tried to install the NVIDIA Driver and CUDA Toolkit on page 26, it failed.

/var/log/nvidia-installer.log says:

     LD [M]  /tmp/selfgz1210/NVIDIA-Linux-x86_64-535.86.10/kernel-open/nvidia-uvm.ko
   make[1]: Leaving directory '/usr/src/linux-headers-5.19.0-rc6-snp-guest-c4daeffce56e'
-> done.
-> Kernel module compilation complete.
-> Unable to determine if Secure Boot is enabled: No such file or directory
ERROR: Unable to load the kernel module 'nvidia.ko'.  This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release.

Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/var/log/nvidia-installer.log' for more information.
-> Kernel module load error: No such device
-> Kernel messages:
[    9.059353] audit: type=1400 audit(1704159783.488:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine" pid=705 comm="apparmor_parser"
[    9.059357] audit: type=1400 audit(1704159783.488:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=705 comm="apparmor_parser"
[    9.060369] audit: type=1400 audit(1704159783.492:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-client.action" pid=702 comm="apparmor_parser"
[    9.060373] audit: type=1400 audit(1704159783.492:11): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-helper" pid=702 comm="apparmor_parser"
[   12.607062] loop3: detected capacity change from 0 to 8
[   12.607267] Dev loop3: unable to read RDB block 8
[   12.608258]  loop3: unable to read partition table
[   12.608263] loop3: partition table beyond EOD, truncated
[   13.245756] fbcon: Taking over console
[   13.299647] Console: switching to colour frame buffer device 128x48
[  132.090302] nvidia: loading out-of-tree module taints kernel.
[  132.092232] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[  132.124068] nvidia-nvlink: Nvlink Core is being initialized, major device number 236
[  132.124076] NVRM: The NVIDIA GPU 0000:01:00.0 (PCI ID: 10de:2331)
               NVRM: installed in this system is not supported by open
               NVRM: nvidia.ko because it does not include the required GPU
               NVRM: System Processor (GSP).
               NVRM: Please see the 'Open Linux Kernel Modules' and 'GSP
               NVRM: Firmware' sections in the driver README, available on
               NVRM: the Linux graphics driver download page at
               NVRM: www.nvidia.com.
[  137.470645] nvidia: probe of 0000:01:00.0 failed with error -1
[  137.470765] NVRM: The NVIDIA probe routine failed for 1 device(s).
[  137.470768] NVRM: None of the NVIDIA devices were initialized.
[  137.471172] nvidia-nvlink: Unregistered Nvlink Core, major device number 236

The full log can be found here.

How can I fix it?

@Tan-YiFan

The log is similar to your previous issue. Did you solve the previous problem?

@seungsoo-lee

seungsoo-lee commented Jan 2, 2024

Hi @Tan-YiFan

I guessed it was related to the kernel version, since the guest kernel version was 6.2.0, but it was not.

Now I have changed the guest OS kernel to 5.19-snp-guest, but installing the driver still fails.

@Tan-YiFan

Could @moconnor725 @rnertney help solve this issue?

Before the NVIDIA experts get back to you, you could try this comment.

@seungsoo-lee

seungsoo-lee commented Jan 2, 2024

@Tan-YiFan

btw, when installing the NVIDIA Driver and CUDA Toolkit,
should I install Kernel Objects > nvidia-fs? (By default it is not selected, as shown below.)

[screenshot: installer component list with nvidia-fs unchecked]

@Tan-YiFan

nvidia-fs is not required.

@seungsoo-lee

@Tan-YiFan

Thank you for the reply.

But even with CC mode off, the installation still fails.

@Tan-YiFan

Tan-YiFan commented Jan 2, 2024 via email

@seungsoo-lee

@Tan-YiFan

On a traditional VM (docc=false), installing the driver also failed.
The error message is the same.

@Tan-YiFan

Could you install the driver on the host? Before installing, set CC mode to off.
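
Roughly like this (a sketch; I am assuming the --set-cc-mode and --reset-after-cc-mode-switch options of gpu_cc_tool.py from the deployment guide, please check gpu_cc_tool.py --help):

$ cd nvtrust/host_tools/python
# check the current CC settings of the H100
$ sudo python3 gpu_cc_tool.py --gpu-name=H100 --query-cc-settings
# turn CC mode off and reset the GPU so the change takes effect
$ sudo python3 gpu_cc_tool.py --gpu-name=H100 --set-cc-mode=off --reset-after-cc-mode-switch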

@seungsoo-lee

@Tan-YiFan

btw,

when I look at the nvidia-installer log, it says

-> Kernel module compilation complete.
-> Unable to determine if Secure Boot is enabled: No such file or directory
ERROR: Unable to load the kernel module 'nvidia.ko'.

That is, the module compilation is okay, but loading fails.

So, to check whether Secure Boot is enabled on the guest VM:

(guest) $ mokutil --sb-state
This system doesn't support Secure Boot

On the host:

(host) $ mokutil --sb-state
SecureBoot disabled
Platform is in Setup Mode

When launching the guest VM, can I set Secure Boot to disabled?

@Tan-YiFan

The "Unable to determine if Secure Boot is enabled" message can be ignored.

@Tan-YiFan

Actually, the installation succeeded, but modprobe nvidia.ko failed.

If you cannot run nvidia-smi on the host (after ccmode set to off), it is not likely to work successfully in a virtual machine.

@seungsoo-lee

seungsoo-lee commented Jan 2, 2024

@Tan-YiFan

Actually, the installation succeeded, but modprobe nvidia.ko failed.
--> I meant that was on the guest VM, not the host.

After disabling CC mode on the host (kernel: 5.19.0-rc6-snp-host-c4daeffce56e), I failed to install the NVIDIA driver (sudo sh cuda_12.2.1_535.86.10_linux.run -m=kernel-open) on the host.

/var/log/nvidia-installer.log says:

nvidia-installer log file '/var/log/nvidia-installer.log'
creation time: Tue Jan  2 13:04:04 2024
installer version: 535.86.10

PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin

nvidia-installer command line:
    ./nvidia-installer
    --ui=none
    --no-questions
    --accept-license
    --disable-nouveau
    --no-cc-version-check
    --install-libglvnd
    --kernel-module-build-directory=kernel-open

Using built-in stream user interface
-> Detected 96 CPUs online; setting concurrency level to 32.
-> Installing NVIDIA driver version 535.86.10.
-> An alternate method of installing the NVIDIA driver was detected. (This is usually a package provided by your distributor.) A driver installed via that method may integrate better with your system than a driver installed by nvidia-installer.

Please review the message provided by the maintainer of this alternate installation method and decide how to proceed:

The NVIDIA driver provided by Ubuntu can be installed by launching the "Software & Updates" application, and by selecting the NVIDIA driver from the "Additional Drivers" tab.


(Answer: Continue installation)
-> For some distributions, Nouveau can be disabled by adding a file in the modprobe configuration directory.  Would you like nvidia-installer to attempt to create this modprobe file for you? (Answer: Yes)
-> One or more modprobe configuration files to disable Nouveau have been written.  For some distributions, this may be sufficient to disable Nouveau; other distributions may require modification of the initial ramdisk.  Please reboot your system and attempt NVIDIA driver installation again.  Note if you later wish to re-enable Nouveau, you will need to delete these files: /usr/lib/modprobe.d/nvidia-installer-disable-nouveau.conf, /etc/modprobe.d/nvidia-installer-disable-nouveau.conf
-> Performing CC sanity check with CC="/usr/bin/cc".
-> Performing CC check.
ERROR: Unable to find the kernel source tree for the currently running kernel.  Please make sure you have installed the kernel source files for your kernel and that they are properly configured; on Red Hat Linux systems, for example, be sure you have the 'kernel-source' or 'kernel-devel' RPM installed.  If you know the correct kernel source files are installed, you may specify the kernel source path with the '--kernel-source-path' command line option.
ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

@Tan-YiFan

The error is ERROR: Unable to find the kernel source tree for the currently running kernel. Please make sure you have installed the kernel source files for your kernel and that they are properly configured. If you know the correct kernel source files are installed, you may specify the kernel source path with the '--kernel-source-path' command line option.

The source should be at /usr/src.
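
If the headers for the running kernel are installed, pointing the installer at them should help; roughly (a sketch; the exact path depends on how the kernel tree was installed, and I am assuming the cuda runfile forwards --kernel-source-path to nvidia-installer, as the error message suggests):

# check which kernel trees/headers are present
$ ls /usr/src
$ uname -r
# point the installer at the matching source tree
$ sudo sh cuda_12.2.1_535.86.10_linux.run -m=kernel-open --kernel-source-path=/usr/src/linux-headers-$(uname -r)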

@seungsoo-lee

seungsoo-lee commented Jan 3, 2024

@Tan-YiFan

The kernel of the host machine is 5.19.0-rc6-snp-host-c4daeffce56e, which was built following the NVIDIA deployment guide, and there is no source for that kernel in /usr/src. In my /usr/src/ there are only linux-headers-5.15.0-91, linux-headers-5.15.0-91-generic, and python3.10.

Do you mean that I should return to the stock kernel (in my case, 5.15) before installing the NVIDIA driver on the host?

@Tan-YiFan

Tan-YiFan commented Jan 3, 2024 via email

@seungsoo-lee

@Tan-YiFan

I've reinstalled Ubuntu 22.04.3 LTS server on the host, and in that clean state the checklist outputs are as follows.

cclab@ubuntu-h100:~$ lspci | grep NVIDIA
44:00.0 3D controller: NVIDIA Corporation Device 2331 (rev a1)
cclab@ubuntu-h100:~$ sudo python3 /shared/nvtrust/host_tools/python/gpu_cc_tool.py --gpu-name=H100 --query-cc-settings
NVIDIA GPU Tools version 535.86.06
Topo:
  PCI 0000:40:01.1 0x1022:0x14ab
   PCI 0000:41:00.0 0x1000:0xc030
    PCI 0000:42:01.0 0x1000:0xc030
     GPU 0000:44:00.0 H100-PCIE 0x2331 BAR0 0x50042000000
2024-01-03,03:19:51.398 INFO     Selected GPU 0000:44:00.0 H100-PCIE 0x2331 BAR0 0x50042000000
2024-01-03,03:19:51.470 INFO     GPU 0000:44:00.0 H100-PCIE 0x2331 BAR0 0x50042000000 CC settings:
2024-01-03,03:19:51.470 INFO       enable = 0
2024-01-03,03:19:51.470 INFO       enable-devtools = 0
2024-01-03,03:19:51.470 INFO       enable-allow-inband-control = 1
2024-01-03,03:19:51.470 INFO       enable-devtools-allow-inband-control = 1
cclab@ubuntu-h100:~$ uname -r
5.15.0-91-generic

Then, I installed cuda_12.2.1_535.86.10_linux.run on the host (with sudo sh cuda_12.2.1_535.86.10_linux.run -m=kernel-open).
The output is as follows.

===========
= Summary =
===========

Driver:   Installed
Toolkit:  Installed in /usr/local/cuda-12.2/

Please make sure that
 -   PATH includes /usr/local/cuda-12.2/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-12.2/lib64, or, add /usr/local/cuda-12.2/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-12.2/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall
Logfile is /var/log/cuda-installer.log

cclab@ubuntu-h100:~$ nvidia-smi
Wed Jan  3 03:24:07 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10              Driver Version: 535.86.10    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 PCIe               Off | 00000000:44:00.0 Off |                    0 |
| N/A   37C    P0              78W / 350W |      4MiB / 81559MiB |     13%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
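
The environment setup the installer summary asks for would be roughly:

$ export PATH=/usr/local/cuda-12.2/bin:$PATH
$ export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64:$LD_LIBRARY_PATH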

@Tan-YiFan

OK. This experiment shows that the H100 itself works fine.

The H100 works on the host but does not work in the VM (even without SEV-SNP).

My suggestion is to add debug information to the driver:

  • Change the host kernel to 5.19-sev-snp
  • Attach the GPU to the VM (as the deployment guide describes)
  • In the VM, run git clone https://github.com/NVIDIA/open-gpu-kernel-modules.git -b 535.86.10.
  • Run cd open-gpu-kernel-modules; make -j $(nproc)
  • Run insmod kernel-open/nvidia.ko

This procedure compiles the kernel module the same way that sh cuda_12.2.1_535.86.10_linux.run -m=kernel-open does. The difference is that you can add debug prints to the source.
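
Put together, the steps inside the VM look roughly like this (a sketch; the dmesg check at the end is my addition, to see the NVRM messages and any debug prints):

$ git clone https://github.com/NVIDIA/open-gpu-kernel-modules.git -b 535.86.10
$ cd open-gpu-kernel-modules
$ make -j $(nproc)
$ sudo insmod kernel-open/nvidia.ko
$ sudo dmesg | grep NVRM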

For example:

  • The log NVRM: installed in this system is not supported by open nvidia.ko because it does not include the required GPU System Processor (GSP) is emitted because gpumgrIsDeviceRmFirmwareCapable returns false.
  • gpumgrIsDeviceRmFirmwareCapable checks whether the chip is firmware-capable via _gpumgrIsRmFirmwareCapableChip and _gpumgrIsVgxRmFirmwareCapableChip. The architecture should be GH100, which is higher than TU100 and GA100. You can print the value of DRF_VAL(xxx) in the code.

I suspect the problem comes from the CPU IOMMU. Note that NVIDIA provides a patch for the host on AMD servers concerning IOMMU configuration. The patch works for AMD Zen 3 but might fail on your AMD Zen 4.
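
To see how the IOMMU is configured on your host, something like this might help (standard commands, not from the deployment guide):

# check whether the IOMMU options from the guide are on the kernel command line
$ cat /proc/cmdline
# look at the AMD-Vi / IOMMU initialization messages
$ sudo dmesg | grep -i -e AMD-Vi -e iommu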

@seungsoo-lee

@Tan-YiFan

you mean changing the code

static NvBool _gpumgrIsRmFirmwareCapableChip(NvU32 pmcBoot42)
{
    return (DRF_VAL(_PMC, _BOOT_42, _ARCHITECTURE, pmcBoot42) >= NV_PMC_BOOT_42_ARCHITECTURE_TU100);
}

static NvBool _gpumgrIsVgxRmFirmwareCapableChip(NvU32 pmcBoot42)
{
    return (DRF_VAL(_PMC, _BOOT_42, _ARCHITECTURE, pmcBoot42) >= NV_PMC_BOOT_42_ARCHITECTURE_GA100) &&
           (DRF_VAL(_PMC, _BOOT_42, _CHIP_ID, pmcBoot42) > NV_PMC_BOOT_42_CHIP_ID_GA100);
}

to

static NvBool _gpumgrIsRmFirmwareCapableChip(NvU32 pmcBoot42)
{
    NV_PRINTF(LEVEL_INFO, "versionA %d\n", DRF_VAL(_PMC, _BOOT_42, _ARCHITECTURE, pmcBoot42));
    return (DRF_VAL(_PMC, _BOOT_42, _ARCHITECTURE, pmcBoot42) >= NV_PMC_BOOT_42_ARCHITECTURE_TU100);
}

static NvBool _gpumgrIsVgxRmFirmwareCapableChip(NvU32 pmcBoot42)
{
    NV_PRINTF(LEVEL_INFO, "versionB %d\n", DRF_VAL(_PMC, _BOOT_42, _ARCHITECTURE, pmcBoot42));
    return (DRF_VAL(_PMC, _BOOT_42, _ARCHITECTURE, pmcBoot42) >= NV_PMC_BOOT_42_ARCHITECTURE_GA100) &&
           (DRF_VAL(_PMC, _BOOT_42, _CHIP_ID, pmcBoot42) > NV_PMC_BOOT_42_CHIP_ID_GA100);
}

and then re-run make -j $(nproc) && insmod kernel-open/nvidia.ko?
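
So the loop would be roughly (a sketch, assuming the info-level prints actually land in the kernel log):

$ cd open-gpu-kernel-modules
$ make -j $(nproc)
$ sudo insmod kernel-open/nvidia.ko
# look for the versionA / versionB prints and the NVRM probe messages
$ sudo dmesg | tail -n 50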

btw,

this issue (#28) also involves AMD Zen 4, but he succeeded.

@Tan-YiFan

Yes

@seungsoo-lee

@Tan-YiFan

I will update the VBIOS first and then retry.

@seungsoo-lee

After I updated the VBIOS from 96.00.30.00.01 to 96.00.5E.00.03, it seems to work:

  • The NVIDIA driver and CUDA installation succeed.
  • After rebooting the guest VM, nvidia-smi works fine too.

Thanks.

@jianlinjiang

After I updated the VBIOS from 96.00.30.00.01 to 96.00.5E.00.03, it seems to work:

  • The NVIDIA driver and CUDA installation succeed.
  • After rebooting the guest VM, nvidia-smi works fine too.

Thanks.

Hi, where is the doc for upgrading the VBIOS? Many thanks!
