
OVMF PCI Host bridge error on snp-latest #56

Closed
clauverjat opened this issue May 29, 2024 · 15 comments

clauverjat commented May 29, 2024

Hello,

I am trying to run an AMD SEV-SNP Confidential VM with an NVIDIA H100 in CC mode. Specifically, I am trying to make it work with the snp-latest branch from the AMDESE/AMDSEV repo (i.e. the branch that uses Linux kernel version 6.9.0-rc7). I am hitting an error when trying to boot the guest with PCI passthrough. The error occurs before reaching the Linux kernel, since it happens during guest firmware initialization (OVMF).

Steps to reproduce

git clone https://github.com/AMDESE/AMDSEV.git
cd AMDSEV/
git checkout snp-latest

Then I patched common.sh, adding the following lines (in order to configure the kernel appropriately):

run_cmd ./scripts/config --enable CONFIG_CRYPTO_ECC
run_cmd ./scripts/config --enable CONFIG_CRYPTO_ECDH
run_cmd ./scripts/config --enable CONFIG_CRYPTO_ECDSA
run_cmd ./scripts/config --enable CONFIG_CGROUP_MISC
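
For reference, a quick way to confirm that these options actually made it into the generated kernel config after the build (the path below is an assumption; adjust it to wherever build.sh places the kernel tree):

# Sanity check after ./build.sh (the config path is an assumption):
grep -E 'CONFIG_CRYPTO_(ECC|ECDH|ECDSA)=|CONFIG_CGROUP_MISC=' linux/guest/.config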

Installing the build dependencies:

sudo apt install -y python3-venv ninja-build libglib2.0-dev python-is-python3 nasm iasl flex bison libelf-dev debhelper libslirp-dev

# From Nvidia deployment guide:
sudo apt install -y ninja-build iasl nasm flex bison openssl dkms autoconf \
zlib1g-dev python3-pip libncurses-dev libssl-dev libelf-dev libudev-dev libpci-dev \
libiberty-dev libtool libpango1.0-dev libjpeg8-dev \
libpixman-1-dev libcairo2-dev libgif-dev libglib2.0-dev git-lfs jq

Then building

./build.sh --package

I am not applying the two Linux patches from the nvtrust repo (I figure they may no longer be needed since I am building a newer kernel):

$ patch -p1 -l < ../../iommu_pagefault.patch
$ patch -p1 -l < ../../iommu_pagesize.patch

Preparing the host

sudo cp kvm.conf /etc/modprobe.d/
cd snp-release-<date>/
sudo ./install.sh

Rebooting and selecting the right kernel

ubuntu@host:~$ uname -r
6.9.0-rc7-snp-host-05b10142ac6a

Following the "Preparing the Guest" steps (omitted here)

Configuring the GPU
Obtain some information

$ lspci -n -d 10de:
82:00.0 0302: 10de:2331 (rev a1)

Setting the GPU to devtools CC mode and binding the device to vfio-pci:

sudo python3 ./nvidia_gpu_tools.py --gpu=0 --query-cc-mode
sudo python3 ./nvidia_gpu_tools.py --gpu=0 --set-cc-mode=devtools --reset-after-cc-mode-switch
sudo sh -c "echo 10de 2331 > /sys/bus/pci/drivers/vfio-pci/new_id"
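
A minimal check (reusing the 82:00.0 BDF from the lspci output above) that the device is actually bound to vfio-pci before launching QEMU:

lspci -nnk -s 82:00.0
# Expect "Kernel driver in use: vfio-pci". If another driver still owns the device,
# unbind it first, e.g.:
#   echo 0000:82:00.0 | sudo tee /sys/bus/pci/devices/0000:82:00.0/driver/unbind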

Launch guest VM
Set the appropriate values for the variables:

export AMD_SEV_DIR=/path/to/AMDSEV/snp-release-<date>
export VDD_IMAGE=path/to/disk.qcow2

#Hardware Settings

export NVIDIA_GPU=82:00.0
export MEM=64 #in GBs
export FWDPORT=9899

Launching the guest (with SEV-SNP and GPU passthrough):

cp $AMD_SEV_DIR/usr/local/share/qemu/OVMF_VARS.fd $AMD_SEV_DIR/usr/local/share/qemu/myguest.fd
sudo $AMD_SEV_DIR/usr/local/bin/qemu-system-x86_64 \
  -machine confidential-guest-support=sev0,vmport=off \
  -object sev-snp-guest,id=sev0,cbitpos=51,reduced-phys-bits=1 \
  -enable-kvm -nographic -no-reboot \
  -cpu EPYC-v4 -machine q35 -smp 12,maxcpus=31 -m ${MEM}G,slots=2,maxmem=512G \
  -bios $AMD_SEV_DIR/usr/local/share/qemu/OVMF_CODE.fd \
  -drive if=pflash,format=raw,unit=0,file=$AMD_SEV_DIR/usr/local/share/qemu/myguest.fd \
  -drive file=$VDD_IMAGE,if=none,id=disk0,format=qcow2 \
  -device virtio-scsi-pci,id=scsi0,disable-legacy=on,iommu_platform=true \
  -device scsi-hd,drive=disk0 \
  -device virtio-net-pci,disable-legacy=on,iommu_platform=true,netdev=vmnic,romfile= \
  -netdev user,id=vmnic,hostfwd=tcp::$FWDPORT-:22 \
  -device pcie-root-port,id=pci.1,bus=pcie.0 \
  -device vfio-pci,host=$NVIDIA_GPU,bus=pci.1 \
  -fw_cfg name=opt/ovmf/X-PciMmio64Mb,string=262144

The boot fails with an OVMF error:

[...]
ProtectUefiImage failed to create image properties record
Select Item: 0x0
FW CFG Signature: 0x554D4551
Select Item: 0x1
FW CFG Revision: 0x3
QemuFwCfg interface (DMA) is supported.
Select Item: 0x19
Select Item: 0x19
PciHostBridgeUtilityInitRootBridge: populated root bus 0, with room for 255 subordinate bus(es)
RootBridge: PciRoot(0x0)
  Support/Attr: 70069 / 70069
    DmaAbove4G: No
NoExtConfSpace: No
     AllocAttr: 3 (CombineMemPMem Mem64Decode)
           Bus: 0 - FF Translation=0
            Io: 6000 - FFFF Translation=0
           Mem: 80000000 - DFFFFFFF Translation=0
    MemAbove4G: C000000000 - FFFFFFFFFF Translation=0
          PMem: FFFFFFFFFFFFFFFF - 0 Translation=0
   PMemAbove4G: FFFFFFFFFFFFFFFF - 0 Translation=0
PciHostBridgeDxe: IntersectMemoryDescriptor: desc [FD00000000, 10000000000) type 1 cap 8000000000026000 conflicts with aperture [C000000000, 10000000000) cap 1

ASSERT_EFI_ERROR (Status = Invalid Parameter)
ASSERT [PciHostBridgeDxe] /home/ubuntu/corentin/AMDSEV/ovmf/MdeModulePkg/Bus/Pci/PciHostBridgeDxe/PciHostBridge.c(550): !(((INTN)(RETURN_STATUS)(Status)) < 0)

Note that the error disappears if one boots the guest without trying to pass the GPU device. There is also no error if I keep everything else but remove the option "-fw_cfg name=opt/ovmf/X-PciMmio64Mb,string=262144" (not really useful, though, since I then cannot use the device).

Any idea how to solve this issue?

Thanks


benschlueter commented May 31, 2024

AFAIK it is caused by too large an MMIO address space. The H100 needs at least 128 GiB, but 256 GiB is requested and causes the issue, so I just adjusted the value to something smaller than 256 GiB but larger than 128 GiB and it worked.

-fw_cfg name=opt/ovmf/X-PciMmio64Mb,string=151072 \
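
For context, the opt/ovmf/X-PciMmio64Mb value is in MiB: 262144 asks OVMF for a 256 GiB 64-bit MMIO aperture, while 151072 asks for roughly 147 GiB. A rough sketch for checking what the GPU actually needs is to look at its 64-bit BAR sizes on the host (again using the 82:00.0 BDF from earlier in the thread):

sudo lspci -vv -s 82:00.0 | grep -i region
# The large 64-bit prefetchable BAR is the lower bound for the aperture; per the
# comment above, the H100 needs at least 128 GiB (131072 MiB), so a value above
# 131072 but below 262144 (e.g. string=151072) avoids the conflict on this setup.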


clauverjat commented May 31, 2024

Thanks @benschlueter, replacing the fw_cfg value with yours solved the OVMF issue. I am not there yet though: with OVMF out of the way, I can boot my OS, but now I've got errors from the NVIDIA driver.

[   14.886113] nvidia: loading out-of-tree module taints kernel.
[   14.886129] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[   14.886244] sev-guest sev-guest: Initialized SEV guest driver (using vmpck_id 0)
[   15.093909] nvidia-nvlink: Nvlink Core is being initialized, major device number 239

[   15.094619] ppdev: user-space parallel port driver
[   15.096579] NVRM: The NVIDIA GPU 0000:01:00.0 (PCI ID: 10de:2331)
               NVRM: installed in this system is not supported by the
               NVRM: NVIDIA 555.42.02 driver release.
               NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'
               NVRM: in this release's README, available on the operating system
               NVRM: specific graphics driver download page at www.nvidia.com.
[   15.097749] nvidia 0000:01:00.0: probe with driver nvidia failed with error -1
[   15.097799] NVRM: The NVIDIA probe routine failed for 1 device(s).
[   15.097811] NVRM: None of the NVIDIA devices were initialized.
[   15.098306] nvidia-nvlink: Unregistered Nvlink Core, major device number 239

And here is the full dmesg output: https://gist.github.com/clauverjat/2e9f46dff7665d07d4bc5f651ca3d59b

Note that I identified a potential cause for the issue. My GPU VBIOS version is 96.00.30.00.01 (obtained with nvidia-smi), but a minimum version of 96.00.5E.00.00 is required according to the NVIDIA deployment guide. I am not sure, though, whether an old VBIOS should result in such an error, since in the deployment guide "Validating State and Versions" comes after the driver installation.
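
For reference, the VBIOS version can be read with standard nvidia-smi queries:

nvidia-smi --query-gpu=vbios_version --format=csv,noheader
# or, from the full report:
nvidia-smi -q | grep -i vbios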

Looking at other issues in this repo, it seems that I might now be hitting the same problem as in #31, with a "GPU not supported by the driver" error message.

@Tan-YiFan

Please upgrade the VBIOS by contacting the vendor.

@clauverjat

Hi @Tan-YiFan

This is already ongoing. We are renting a bare-metal server, so we had to ask our cloud provider, which then contacted their vendor. I hope it will happen quickly, but realistically I know it might take a while before they apply the VBIOS upgrade. In the meantime, can you confirm that an old VBIOS can result in that kind of error? I want to be sure that I go as far as I can without the VBIOS upgrade.

@rnertney

Also, please note that the NVIDIA Trusted Computing Solutions (such as CC) are currently only supported on our Enterprise drivers, which currently encompass:

550.54.14
550.54.15
550.90.07

I will get this list added to our documentation and ensure it is up to date.


clauverjat commented Aug 20, 2024

Hello @rnertney,

Our system got a VBIOS update and we are now running with VBIOS version 96.00.9F.00.01, which should be compatible. After the update, I can run the nvtrust guest tools. I ran the guest_tools/attestation_sdk/tests/SmallCombinedTest.py script and got an output with "GPU Attested Successfully", which seems promising.
However, we are unable to use the GPU for compute (GPU in ERR! state).

We've tested with two different CUDA/driver combinations:

  • Driver Version: 560.28.03 | CUDA Version: 12.6
  • Driver Version: 550.54.14 | CUDA Version: 12.4

This brings me to my first question: is Confidential Compute (CC) compatible with the latest 560 drivers? Your last message suggests that CC might only work with the 550 drivers, but this seems a bit strange to me.

If the 560 driver is indeed supposed to work, here’s the dmesg output while the system is running driver version 560. It includes NVRM nvAssertFailed errors that are likely related to the issue: dmesg output.

@Tan-YiFan

Please check:

  1. You have enabled CC mode on the host.
  2. Run nvidia-persistenced --uvm-persistence-mode before any GPU-related command, including nvidia-smi.
  3. The guest kernel has loaded the AES-NI kernel module.

I also suggest trying the 535.104.05 driver.


clauverjat commented Aug 20, 2024

Hi,

I've checked your items:

  1. You have enabled CC mode on the host.

I am running the following before starting the guest VM:

$ sudo modprobe vfio_pci
# Set CC mode to on and resets the GPU 
$ sudo python3 ./nvidia_gpu_tools.py --gpu=0 --set-cc-mode=on --reset-after-cc-mode-switch
NVIDIA GPU Tools version v2024.02.14o
Command line arguments: ['./nvidia_gpu_tools.py', '--gpu=0', '--set-cc-mode=on', '--reset-after-cc-mode-switch']
2024-08-20,13:47:50.119 WARNING  GPU 0000:82:00.0 ? 0x2331 BAR0 0x0 was in D3, forced power control to on (prev auto). New state D0
GPUs:
  0 GPU 0000:82:00.0 H100-PCIE 0x2331 BAR0 0x110042000000
Other:
Topo:
  PCI 0000:80:03.1 0x1022:0x14a5
   GPU 0000:82:00.0 H100-PCIE 0x2331 BAR0 0x110042000000
2024-08-20,13:47:50.120 INFO     Selected GPU 0000:82:00.0 H100-PCIE 0x2331 BAR0 0x110042000000
2024-08-20,13:47:50.120 WARNING  GPU 0000:82:00.0 H100-PCIE 0x2331 BAR0 0x110042000000 has CC mode on, some functionality may not work
2024-08-20,13:47:50.233 INFO     GPU 0000:82:00.0 H100-PCIE 0x2331 BAR0 0x110042000000 CC mode set to on. It will be active after GPU reset.
2024-08-20,13:47:52.623 INFO     GPU 0000:82:00.0 H100-PCIE 0x2331 BAR0 0x110042000000 was reset to apply the new CC mode.
2024-08-20,13:47:52.624 WARNING  GPU 0000:82:00.0 H100-PCIE 0x2331 BAR0 0x110042000000 restoring power control to auto
$ sudo sh -c "echo 10de 2331 > /sys/bus/pci/drivers/vfio-pci/new_id"
  2. Run nvidia-persistenced --uvm-persistence-mode before any GPU-related command, including nvidia-smi.

I have configured the nvidia-persistenced systemd service to enable UVM persistence mode.
I've also verified that it works as expected by checking the service status first thing upon connecting.

ubuntu@ubuntu:~$ systemctl status nvidia-persistenced
● nvidia-persistenced.service - NVIDIA Persistence Daemon
     Loaded: loaded (/usr/lib/systemd/system/nvidia-persistenced.service; static)
     Active: active (running) since Tue 2024-08-20 14:13:47 UTC; 12min ago
    Process: 797 ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --uvm-persistence-mode --verbose (code=exited, status=0/SUCCESS)
   Main PID: 802 (nvidia-persiste)
      Tasks: 1 (limit: 75931)
     Memory: 36.7M (peak: 37.3M)
        CPU: 6.133s
     CGroup: /system.slice/nvidia-persistenced.service
             └─802 /usr/bin/nvidia-persistenced --user nvidia-persistenced --uvm-persistence-mode --verbose

Aug 20 14:13:41 ubuntu systemd[1]: Starting nvidia-persistenced.service - NVIDIA Persistence Daemon...
Aug 20 14:13:41 ubuntu nvidia-persistenced[802]: Verbose syslog connection opened
Aug 20 14:13:41 ubuntu nvidia-persistenced[802]: Now running with user ID 109 and group ID 112
Aug 20 14:13:41 ubuntu nvidia-persistenced[802]: Started (802)
Aug 20 14:13:41 ubuntu nvidia-persistenced[802]: device 0000:01:00.0 - registered
Aug 20 14:13:47 ubuntu nvidia-persistenced[802]: device 0000:01:00.0 - Enabled UVM Persistence mode.
Aug 20 14:13:47 ubuntu nvidia-persistenced[802]: device 0000:01:00.0 - persistence mode enabled.
Aug 20 14:13:47 ubuntu nvidia-persistenced[802]: device 0000:01:00.0 - NUMA memory onlined.
Aug 20 14:13:47 ubuntu nvidia-persistenced[802]: Local RPC services initialized
Aug 20 14:13:47 ubuntu systemd[1]: Started nvidia-persistenced.service - NVIDIA Persistence Daemon.
  3. The guest kernel has loaded the AES-NI kernel module.

It seems okay to me.

lsmod | grep "aesni"
aesni_intel           356352  62
crypto_simd            16384  1 aesni_intel

Here are more details about the config with which I got the issue.
Configuration:

  • Guest OS : Ubuntu 24.04
  • Kernel: 6.9.0-snp-guest
  • Installed driver: 550.90.07
  • Installed CUDA toolkit : 12.6 (nvidia-smi prints CUDA Version: 12.4 though)

Installation instructions:

sudo apt-get update
sudo apt-get install -y build-essential
# Install Nvidia Driver 550 and CUDA
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
# Install the toolkit
sudo apt-get -y install cuda-toolkit-12-6
# Install the Driver
sudo apt install nvidia-driver-550-server-open

# Configure nvidia-persistenced service : 
# Edit /usr/lib/systemd/system/nvidia-persistenced.service
# Change:
ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --no-persistence-mode --verbose
# to this:
ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --uvm-persistence-mode --verbose

# Reboot
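
As an aside, a systemd drop-in override is a sketch of an alternative to editing the shipped unit file in place, so a driver package update does not silently revert the change:

sudo systemctl edit nvidia-persistenced
# In the override that opens, clear and redefine ExecStart:
#   [Service]
#   ExecStart=
#   ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --uvm-persistence-mode --verbose
sudo systemctl restart nvidia-persistenced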

Testing:
The nvidia-smi output is normal:

(base) ubuntu@ubuntu:~$ nvidia-smi
Tue Aug 20 14:41:59 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 PCIe               On  |   00000000:01:00.0 Off |                    0 |
| N/A   47C    P0             52W /  350W |      24MiB /  81559MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

But I cannot run a simple PyTorch program:

(base) ubuntu@ubuntu:~$ python test_torch.py
/home/ubuntu/miniconda3/lib/python3.12/site-packages/torch/_subclasses/functional_tensor.py:258: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /opt/conda/conda-bld/pytorch_1720538439675/work/torch/csrc/utils/tensor_numpy.cpp:84.)
  cpu = _conversion_method_template(device=torch.device("cpu"))
cpu
Traceback (most recent call last):
  File "/home/ubuntu/test_torch.py", line 14, in <module>
    x = x.to(torch.device('cuda'))
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniconda3/lib/python3.12/site-packages/torch/cuda/__init__.py", line 314, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized

dmesg:

[...]
[   17.246677] nvidia-nvlink: Nvlink Core is being initialized, major device number 239
[   17.246688] NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  550.90.07  Release Build  (dvs-builder@U16-I2-C05-15-3)  Fri May 31 09:44:37 UTC 2024
[   17.277191] nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64  550.90.07  Release Build  (dvs-builder@U16-I2-C05-15-3)  Fri May 31 09:34:25 UTC 2024
[   17.555316] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[   17.555320] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 1
[   17.594237] nvidia-uvm: Loaded the UVM driver, major device number 237.
[   18.165843] EXT4-fs (sda16): mounted filesystem 1c15ffaa-ebf1-48e7-b5cf-289147811b5e r/w with ordered data mode. Quota mode: none.
[   18.312743] audit: type=1400 audit(1724163220.960:2): apparmor="STATUS" operation="profile_load" profile="unconfined" name="brave" pid=665 comm="apparmor_parser"
[   18.312943] audit: type=1400 audit(1724163220.960:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="buildah" pid=666 comm="apparmor_parser"
[   18.312956] audit: type=1400 audit(1724163220.960:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="Discord" pid=662 comm="apparmor_parser"
[   18.313116] audit: type=1400 audit(1724163220.960:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="busybox" pid=667 comm="apparmor_parser"
[   18.313287] audit: type=1400 audit(1724163220.960:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name=4D6F6E676F444220436F6D70617373 pid=663 comm="apparmor_parser"
[   18.313462] audit: type=1400 audit(1724163220.960:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="1password" pid=661 comm="apparmor_parser"
[   18.313639] audit: type=1400 audit(1724163220.961:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="chrome" pid=671 comm="apparmor_parser"
[   18.313761] audit: type=1400 audit(1724163220.961:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="QtWebEngineProcess" pid=664 comm="apparmor_parser"
[   18.313936] audit: type=1400 audit(1724163220.961:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="cam" pid=668 comm="apparmor_parser"
[   18.314083] audit: type=1400 audit(1724163220.961:11): apparmor="STATUS" operation="profile_load" profile="unconfined" name="ch-checkns" pid=669 comm="apparmor_parser"
[   18.570455] cfg80211: Loading compiled-in X.509 certificates for regulatory database
[   18.570868] Loaded X.509 cert 'sforshee: 00b28ddf47aef9cea7'
[   18.570989] Loaded X.509 cert 'wens: 61c038651aabdcf94bd0ac7ff06c7248db18c600'
[   18.572120] platform regulatory.0: Direct firmware load for regulatory.db failed with error -2
[   18.572128] cfg80211: failed to load regulatory.db
[   18.965686] ACPI Warning: \_SB.PCI0.S20.S00._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20230628/nsarguments-61)
[   19.179762] NET: Registered PF_QIPCRTR protocol family
[   19.262233] loop0: detected capacity change from 0 to 8
[   21.901383] NVRM: nvAssertFailed: Assertion failed: 0 @ g_kern_gmmu_nvoc.h:1967
[   21.901396] NVRM: nvAssertFailed: Assertion failed: 0 @ g_kern_gmmu_nvoc.h:1967
[   23.634589] workqueue: drm_fb_helper_damage_work hogged CPU for >10000us 19 times, consider switching to WQ_UNBOUND
[   29.237329] systemd-journald[422]: File /var/log/journal/addc554fecb541989e4bf464179acafb/user-1000.journal corrupted or uncleanly shut down, renaming and replacing.

Additional information

As you suggested, I am starting to test with the 535.104.05 driver and will get back to you.


Tan-YiFan commented Aug 21, 2024

Did you execute nvidia-smi conf-compute -srs 1 (set ready state to 1)?
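
For reference, inside the guest after the driver has loaded (a sketch; the -grs query flag is an assumption, check nvidia-smi conf-compute --help on your driver version):

sudo nvidia-smi conf-compute -srs 1   # set the Confidential Compute ready state
nvidia-smi conf-compute -grs          # assumed flag: query the current ready state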


clauverjat commented Aug 21, 2024

Hi @Tan-YiFan,

Did you execute nvidia-smi conf-compute -srs 1 (set ready state to 1)?

No, I didn't. Should I? (I am going to try that next.)

In the meantime, here are the results of my test with driver 535.104.05.

Test with driver 535.104.05

Summary: nvidia-persistenced failed to start

Config of the guest:

  • OS: Ubuntu 22.04 (jammy)
  • Kernel: 6.9.0-snp-guest
  • Installed driver: 535.104.05 (as suggested)
  • Installed CUDA toolkit : 12.2

Guest configuration (after the proper SNP kernel was installed):

wget https://us.download.nvidia.com/tesla/535.104.05/nvidia-driver-local-repo-ubuntu2204-535.104.05_1.0-1_amd64.deb
sudo dpkg -i nvidia-driver-local-repo-ubuntu2204-535.104.05_1.0-1_amd64.deb

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-12-2

# Configuring nvidia-persistenced service
# ...

# Reboot

Result

ubuntu@ubuntu:~$ sudo systemctl status nvidia-persistenced
× nvidia-persistenced.service - NVIDIA Persistence Daemon
     Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; static)
     Active: failed (Result: exit-code) since Wed 2024-08-21 13:50:17 UTC; 5min ago
    Process: 664 ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --uvm-persistence-mode --verbose (code=exited, status=1/FAILURE)
    Process: 728 ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced (code=exited, status=0/SUCCESS)
        CPU: 178ms

Aug 21 13:50:17 ubuntu systemd[1]: Starting NVIDIA Persistence Daemon...
Aug 21 13:50:17 ubuntu nvidia-persistenced[666]: Verbose syslog connection opened
Aug 21 13:50:17 ubuntu nvidia-persistenced[666]: Now running with user ID 113 and group ID 121
Aug 21 13:50:17 ubuntu nvidia-persistenced[666]: Started (666)
Aug 21 13:50:17 ubuntu nvidia-persistenced[666]: device 0000:01:00.0 - registered
Aug 21 13:50:17 ubuntu nvidia-persistenced[666]: device 0000:01:00.0 - failed to open.
Aug 21 13:50:17 ubuntu nvidia-persistenced[664]: nvidia-persistenced failed to initialize. Check syslog for more details.
Aug 21 13:50:17 ubuntu systemd[1]: nvidia-persistenced.service: Control process exited, code=exited, status=1/FAILURE
Aug 21 13:50:17 ubuntu systemd[1]: nvidia-persistenced.service: Failed with result 'exit-code'.
Aug 21 13:50:17 ubuntu systemd[1]: Failed to start NVIDIA Persistence Daemon.

ubuntu@ubuntu:~$ nvidia-smi 
No devices were found

ubuntu@ubuntu:~$ sudo dmesg
# beginning of the dmesg output is omitted to save space
[   18.024330] nvidia-nvlink: Nvlink Core is being initialized, major device number 239
[   18.024344] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  560.28.03  Thu Jul 18 19:32:18 UTC 2024
[   18.158345] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  560.28.03  Thu Jul 18 20:27:27 UTC 2024
[   18.191985] audit: type=1400 audit(1724248216.707:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="ubuntu_pro_apt_news" pid=593 comm="apparmor_parser"
[   18.205549] audit: type=1400 audit(1724248216.719:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine" pid=597 comm="apparmor_parser"
[   18.205589] audit: type=1400 audit(1724248216.719:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=597 comm="apparmor_parser"
[   18.213258] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[   18.226107] audit: type=1400 audit(1724248216.743:11): apparmor="STATUS" operation="profile_load" profile="unconfined" name="tcpdump" pid=596 comm="apparmor_parser"
[   18.295478] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x38:880)
[   18.297093] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   18.318091] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NvKmsKapiDevice
[   18.320482] [drm:nv_drm_register_drm_device [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to register device
[   18.340114] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[   18.371352] nvidia-uvm: Loaded the UVM driver, major device number 237.
[   18.740885] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x38:880)
[   18.743372] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   18.819872] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x38:880)
[   18.821706] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   19.292517] loop3: detected capacity change from 0 to 8
[   50.323402] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x38:880)
[   50.325597] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   83.061967] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x38:880)
[   83.066463] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   85.968745] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x38:880)
[   85.973705] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   86.102415] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x38:880)
[   86.107021] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   86.171687] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x38:880)
[   86.173727] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   86.271069] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x38:880)
[   86.274944] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   86.381771] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x38:880)
[   86.386366] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   86.475908] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x38:880)
[   86.480248] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  379.333352] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x38:880)
[  379.338595] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  382.846493] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x38:880)
[  382.850442] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0

Additional information

535.104.05-nvidia-bug-report.log.gz

@Tan-YiFan

You should execute nvidia-smi conf-compute -srs 1 before you run python code.

The log shows the driver version is 560.28.03. Please uninstall the existing driver before installing a new one. After a reboot, you should see the latest-installed version.

CUDA 12.3 (driver 54x.xx) and 12.5 (driver 56x.xx) are still merging new NVIDIA CC features and are thus not recommended. CUDA 12.2.2 (driver 535.104.05) and 12.4 (driver 550.xx) are stable with regard to the CC feature, and 12.4 has better CC performance.

@clauverjat

You should execute nvidia-smi conf-compute -srs 1 before you run python code.

Oh, that did the trick! Setting the Confidential Compute GPU to Ready State was exactly what I needed to do.

I retried on Ubuntu Jammy with the NVIDIA driver version 550.90.07 and CUDA 12.6, and after running the nvidia-smi conf-compute -srs 1 command, my script executed successfully.

Surprisingly, I had run almost identical instructions on a system with VBIOS 96.00.74.00.1C (so without your command), and I could execute the script. This makes me wonder if I might have inadvertently triggered the Ready State in some other way, or if perhaps older VBIOS versions don’t require this step to enable the GPU?

In any case, it’s great to see everything working with driver 550.90.07.

That said, I’m still seeing the following kernel errors in dmesg:

[   19.441219] ACPI Warning: \_SB.PCI0.S20.S00._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20230628/nsarguments-61)
[   19.703498] loop3: detected capacity change from 0 to 8
[   22.302161] NVRM: nvAssertFailed: Assertion failed: 0 @ g_kern_gmmu_nvoc.h:1967
[   22.302173] NVRM: nvAssertFailed: Assertion failed: 0 @ g_kern_gmmu_nvoc.h:1967

I don't know what to make of them. Are they expected? Should I try to get rid of them?

Thanks for your help!

@clauverjat

The log shows the driver version is 560.28.03. Please uninstall the existing driver before installing a new one. After a reboot, you should see the latest-installed version.

Hmm, I shared logs from several attempts with different VMs and configurations; which log are you talking about? If you are talking about the dmesg output from this message:

If the 560 driver is indeed supposed to work, here’s the dmesg output while the system is running driver version 560. It includes NVRM nvAssertFailed errors that are likely related to the issue: dmesg output.

Then this is normal: driver version 560 is expected there. But in any case, the announced driver should match the logs.

Please uninstall the existing driver before installing a new one. After a reboot, you should see the latest-installed version.

The new attempt I did today was on a fresh VM, to be sure of the version in use.

Anyway, thanks for providing information about the recommended drivers for using CC mode. I will stick to driver 550.xx then.

@Tan-YiFan

[ 19.441219] ACPI Warning: _SB.PCI0.S20.S00._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20230628/nsarguments-61)
[ 19.703498] loop3: detected capacity change from 0 to 8
[ 22.302161] NVRM: nvAssertFailed: Assertion failed: 0 @ g_kern_gmmu_nvoc.h:1967
[ 22.302173] NVRM: nvAssertFailed: Assertion failed: 0 @ g_kern_gmmu_nvoc.h:1967

These lines are expected.

Setting CC mode to devtools (instead of on) automatically sets the ready state to on. Executing the attestation script would also set the ready state to on (you can search for "ready" in guest_tools/ of this repo).

The log from the 535.104.05 attempt seems to show 560.28.03 installed. But that doesn't matter.

You can close the issue if everything goes well.

@clauverjat

These lines are expected.

That's reassuring.

Setting CC mode to devtools (instead of on) automatically sets the ready state to on. Executing the attestation script would also set the ready state to on (you can search for "ready" in guest_tools/ of this repo).

Makes sense. That could explain my previous experience.

The log from the 535.104.05 attempt seems to show 560.28.03 installed. But that doesn't matter.

Indeed, you're right, I missed that. Though since it works now, it doesn't matter.

Thanks for your help @Tan-YiFan
