OVMF PCI Host bridge error on snp-latest #56
Comments
AFAIK it is caused by a too-large MMIO address space. The H100 needs at least 128 GiB, but 256 GiB are requested, which causes the issue, so I adjusted the value to something smaller than 256 GiB but larger than 128 GiB and it worked.
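For illustration, a minimal sketch of the kind of adjustment meant here, assuming the launch command passes the 64-bit PCI MMIO window size in MiB through OVMF's X-PciMmio64Mb knob; 196608 MiB (192 GiB) is only an example of a value between 128 and 256 GiB, not the one actually used:
# Hypothetical replacement for the fw_cfg option on the QEMU command line:
# 196608 MiB (192 GiB) is above the H100's ~128 GiB requirement but below 256 GiB.
-fw_cfg name=opt/ovmf/X-PciMmio64Mb,string=196608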
Thanks @benschlueter, replacing the fw_cfg value with yours solved the OVMF issue. I am not there yet, though: with OVMF out of the way I can boot my OS, but now I get errors from the NVIDIA driver.
And here is the full dmesg output: https://gist.github.com/clauverjat/2e9f46dff7665d07d4bc5f651ca3d59b Note that I identified a potential cause for the issue. My GPU VBIOS version is 96.00.30.00.01 (obtained with nvidia-smi), but a minimum version of 96.00.5E.00.00 is required according to the NVIDIA deployment guide. I am not sure, though, whether an old VBIOS should result in such an error, since in the deployment guide "Validating State and Versions" comes after the driver installation. Looking at other issues in this repo, it seems that I might now be hitting the same problem as in #31, with a "GPU not supported by the driver" error message.
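For reference, the VBIOS version mentioned above can be read like this; the grep filter is just an illustration, the field also appears in the full nvidia-smi -q report:
# Query the VBIOS version reported by the driver
nvidia-smi -q | grep -i vbios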
Please upgrade the VBIOS by contacting the vendor.
Hi @Tan-YiFan, this is already in progress. We are renting a bare-metal server, so we had to ask our cloud provider, which then contacted their vendor. I hope it happens quickly, but realistically it might take a while before they apply the VBIOS upgrade. In the meantime, can you confirm that an old VBIOS can result in that kind of error? I want to be sure that I go as far as I can without the VBIOS upgrade.
Also, please note that the NVIDIA Trusted Computing Solutions (such as CC) are currently only supported on our Enterprise drivers; at the moment this encompasses 550.54.14. I will get this list added to our documentation and ensure it is kept up to date.
Hello @rnertney, our system got a VBIOS update and we are now running VBIOS version 96.00.9F.00.01, which should be compatible. After the update I can run the nvtrust guest tools. We've tested with two different CUDA/driver combinations:
This brings me to my first question: is Confidential Compute (CC) compatible with the latest 560 drivers? Your last message suggests that CC might only work with the 550 drivers, but this seems a bit strange to me. If the 560 driver is indeed supposed to work, here is the dmesg output while the system is running driver version 560. It includes NVRM nvAssertFailed errors that are likely related to the issue: dmesg output.
Please check:
I also suggest trying the 535.104.05 module.
Hi, I've checked your items:
I am running the following before starting the guest VM.
I have configured the nvidia-persistenced systemd service to enable UVM persistence mode (see the sketch after this list).
It seems okay to me.
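For concreteness, a sketch of how such a configuration might look, assuming a systemd drop-in overriding the stock unit; the ExecStart line mirrors the one visible in the service status further down, but the drop-in itself is an assumption about the setup:
# Hypothetical drop-in enabling UVM persistence mode for nvidia-persistenced
sudo mkdir -p /etc/systemd/system/nvidia-persistenced.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/nvidia-persistenced.service.d/override.conf
[Service]
ExecStart=
ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --uvm-persistence-mode --verbose
EOF
sudo systemctl daemon-reload
sudo systemctl restart nvidia-persistenced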
Here are more details about the config with which I got the issue.
Installation instructions:
Testing:
But I cannot run a simple PyTorch script:
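The script is not reproduced above; any minimal CUDA check along these lines exercises the same path (the exact calls here are illustrative):
# Hypothetical minimal PyTorch/CUDA check, run from the shell
python3 -c "import torch; print(torch.cuda.is_available()); print(torch.rand(3, device='cuda'))"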
dmesg:
Additional information
As you suggested, I am starting to test with the 535.104.05 driver and will get back to you.
Did you execute the command to set the GPU to Ready State?
Hi @Tan-YiFan,
No, I didn't. Should I? (I am going to try it next.) In the meantime, here are the results of my test with driver 535.104.05.
Test with driver 535.104.05
Summary: nvidia-persistenced failed.
Config of the guest:
Guest configuration (after the proper snp kernel where installed):wget https://us.download.nvidia.com/tesla/535.104.05/nvidia-driver-local-repo-ubuntu2204-535.104.05_1.0-1_amd64.deb
sudo dpkg -i nvidia-driver-local-repo-ubuntu2204-535.104.05_1.0-1_amd64.deb
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-12-2
# Configuring nvidia-persistenced service
# ...
# Reboot
Result
ubuntu@ubuntu:~$ sudo systemctl status nvidia-persistenced
× nvidia-persistenced.service - NVIDIA Persistence Daemon
Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; static)
Active: failed (Result: exit-code) since Wed 2024-08-21 13:50:17 UTC; 5min ago
Process: 664 ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --uvm-persistence-mode --verbose (code=exited, status=1/FAILURE)
Process: 728 ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced (code=exited, status=0/SUCCESS)
CPU: 178ms
Aug 21 13:50:17 ubuntu systemd[1]: Starting NVIDIA Persistence Daemon...
Aug 21 13:50:17 ubuntu nvidia-persistenced[666]: Verbose syslog connection opened
Aug 21 13:50:17 ubuntu nvidia-persistenced[666]: Now running with user ID 113 and group ID 121
Aug 21 13:50:17 ubuntu nvidia-persistenced[666]: Started (666)
Aug 21 13:50:17 ubuntu nvidia-persistenced[666]: device 0000:01:00.0 - registered
Aug 21 13:50:17 ubuntu nvidia-persistenced[666]: device 0000:01:00.0 - failed to open.
Aug 21 13:50:17 ubuntu nvidia-persistenced[664]: nvidia-persistenced failed to initialize. Check syslog for more details.
Aug 21 13:50:17 ubuntu systemd[1]: nvidia-persistenced.service: Control process exited, code=exited, status=1/FAILURE
Aug 21 13:50:17 ubuntu systemd[1]: nvidia-persistenced.service: Failed with result 'exit-code'.
Aug 21 13:50:17 ubuntu systemd[1]: Failed to start NVIDIA Persistence Daemon.
ubuntu@ubuntu:~$ nvidia-smi
No devices were found
ubuntu@ubuntu:~$ sudo dmesg
# beginning of the dmesg output is omitted to save space
[ 18.024330] nvidia-nvlink: Nvlink Core is being initialized, major device number 239
[ 18.024344] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 560.28.03 Thu Jul 18 19:32:18 UTC 2024
[ 18.158345] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 560.28.03 Thu Jul 18 20:27:27 UTC 2024
[ 18.191985] audit: type=1400 audit(1724248216.707:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="ubuntu_pro_apt_news" pid=593 comm="apparmor_parser"
[ 18.205549] audit: type=1400 audit(1724248216.719:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine" pid=597 comm="apparmor_parser"
[ 18.205589] audit: type=1400 audit(1724248216.719:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=597 comm="apparmor_parser"
[ 18.213258] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[ 18.226107] audit: type=1400 audit(1724248216.743:11): apparmor="STATUS" operation="profile_load" profile="unconfined" name="tcpdump" pid=596 comm="apparmor_parser"
[ 18.295478] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x38:880)
[ 18.297093] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 18.318091] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NvKmsKapiDevice
[ 18.320482] [drm:nv_drm_register_drm_device [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to register device
[ 18.340114] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[ 18.371352] nvidia-uvm: Loaded the UVM driver, major device number 237.
[ 18.740885] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x38:880)
[ 18.743372] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 18.819872] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x38:880)
[ 18.821706] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 19.292517] loop3: detected capacity change from 0 to 8
[ 50.323402] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x38:880)
[ 50.325597] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 83.061967] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x38:880)
[ 83.066463] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 85.968745] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x38:880)
[ 85.973705] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 86.102415] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x38:880)
[ 86.107021] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 86.171687] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x38:880)
[ 86.173727] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 86.271069] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x38:880)
[ 86.274944] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 86.381771] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x38:880)
[ 86.386366] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 86.475908] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x38:880)
[ 86.480248] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 379.333352] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x38:880)
[ 379.338595] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 382.846493] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x38:880)
[ 382.850442] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
Additional information
You should execute that command. The log shows the driver version is 560.28.03; please uninstall the existing driver before installing a new one. After a reboot, you would see the latest-installed version. CUDA 12.3 (driver 54x.xx) and 12.5 (driver 56x.xx) are merging new features of NVIDIA CC and are thus not recommended for install. CUDA 12.2.2 (driver 535.104.05) and 12.4 (driver 550.xx) are stable regarding the CC feature, while 12.4 has better performance in CC.
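For illustration, a hedged sketch of both checks; the conf-compute flags follow nvidia-smi's Confidential Computing options and may vary between driver versions:
# Confirm which kernel module is actually loaded (it should match the driver you installed)
cat /proc/driver/nvidia/version
# Query the Confidential Compute ready state, and set it to on if needed
nvidia-smi conf-compute -grs
sudo nvidia-smi conf-compute -srs 1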
Oh, that did the trick! Setting the Confidential Compute GPU to Ready State was exactly what I needed to do. I retried on Ubuntu Jammy with NVIDIA driver version 550.90.07 and CUDA 12.6, and after running the command everything worked. Surprisingly, I had run almost identical instructions on a system with VBIOS 96.00.74.00.1C (so without your command), and I could execute the script. This makes me wonder whether I might have inadvertently triggered the Ready State in some other way, or whether older VBIOS versions don't require this step to enable the GPU. In any case, it's great to see everything working with driver 550.90.07. That said, I'm still seeing the following kernel errors in dmesg:
I don't know what to make of them. Are they expected? Should I try to get rid of them? Thanks for your help!
Hmm, I shared logs from different attempts with different VMs and configurations; which log are you talking about? If you are talking about the dmesg output from this message:
then that is normal: driver version 560 is expected there. But in all cases the driver announced should match the logs.
The new attempt I did today was on a fresh VM to make sure of the version in use. Anyway, thanks for providing information about the recommended drivers for using CC mode; I will stick to driver 550.xx then.
These lines are expected. Setting the CC mode to devtools (instead of on) automatically sets the ready state to on. Executing the attestation script would also set the ready state to on (you can search for "ready" in the attestation code).
The log of the 535.104.05 attempt seems to have 560.28.03 installed, but that doesn't matter. You can close the issue if everything goes well.
That's reassuring.
Makes sense. That could explain my previous experience.
Indeed, you're right; I missed that. Though since it works now, it doesn't matter. Thanks for your help @Tan-YiFan
Hello,
I am trying to run an AMD SEV-SNP confidential VM with an NVIDIA H100 in CC mode. Specifically, I am trying to make it work with the snp-latest branch from the AMDESE/AMDSEV repo (i.e. the branch that uses Linux kernel version 6.9.0-rc7). I am hitting an error when trying to boot the guest with PCI passthrough. The error occurs before reaching the Linux kernel, since it happens during guest firmware (OVMF) initialization.
Steps to reproduce
Then I patched common.sh, adding the following lines (in order to configure the kernel appropriately):
Installing the build dependencies:
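The package list from the original step is not reproduced above; a typical set for building the kernel, OVMF and QEMU on Ubuntu might look like this (an assumption, adjust as needed):
# Hypothetical build dependencies (illustrative list, not the exact one used)
sudo apt update
sudo apt install -y build-essential git bison flex bc libssl-dev libelf-dev \
    ninja-build nasm iasl uuid-dev python3 python-is-python3 libglib2.0-dev libpixman-1-dev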
Then building (see the sketch below).
I am not applying the two Linux patches from the nvtrust repo (I figure they might no longer be needed since I am building a newer kernel).
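A hedged sketch of the build step, assuming the AMDSEV repo's usual flow; the --package flag and the resulting package layout come from that repo and may differ on the snp-latest branch:
# Build kernel, OVMF and QEMU from the snp-latest branch and package the results
git clone -b snp-latest https://github.com/AMDESE/AMDSEV.git
cd AMDSEV
./build.sh --package
# Then install the resulting host kernel packages (the exact layout depends on the repo version)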
Preparing the host
Rebooting and selecting the right kernel
Following the "Preparing the Guest" steps (omitted here)
Configuring the GPU
Obtain some information
Setting the GPU in dev CC mode and activating PCI VFIO for the device.
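A hedged sketch of this step, assuming nvtrust's host-side gpu_cc_tool.py and a manual vfio-pci bind; the BDF 41:00.0, the script path and the exact flag spellings are placeholders:
# Hypothetical: switch the H100 CC mode (devtools shown; use 'on' for production mode)
cd nvtrust/host_tools/python
sudo python3 ./gpu_cc_tool.py --gpu-name=H100 --set-cc-mode=devtools --reset-after-cc-mode-switch
# Hand the device to vfio-pci for passthrough
sudo modprobe vfio-pci
sudo sh -c 'echo vfio-pci > /sys/bus/pci/devices/0000:41:00.0/driver_override'
sudo sh -c 'echo 0000:41:00.0 > /sys/bus/pci/devices/0000:41:00.0/driver/unbind'
sudo sh -c 'echo 0000:41:00.0 > /sys/bus/pci/drivers/vfio-pci/bind'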
Launch guest VM
Set the appropriate values in the vars:
Launching guest (with sev-snp and GPU passthrough)
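A hedged sketch of the kind of QEMU invocation involved (the launch script actually used is not reproduced above); paths, memory size, cbitpos and the GPU BDF are placeholders, and the available machine/object options depend on the QEMU built from snp-latest:
# Illustrative direct QEMU command for an SNP guest with H100 passthrough
qemu-system-x86_64 \
    -enable-kvm -cpu EPYC-v4 -smp 8 -m 64G \
    -machine q35,confidential-guest-support=sev0 \
    -object sev-snp-guest,id=sev0,cbitpos=51,reduced-phys-bits=1 \
    -bios ./snp-release/usr/local/share/qemu/OVMF.fd \
    -device vfio-pci,host=41:00.0 \
    -fw_cfg name=opt/ovmf/X-PciMmio64Mb,string=262144 \
    -drive file=./guest.qcow2,if=virtio,format=qcow2 \
    -nographic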
The boot fails with an OVMF error:
Note that the error disappears if one boots the guest without trying to pass through the GPU device. There is also no error if I keep everything but remove the option "-fw_cfg name=opt/ovmf/X-PciMmio64Mb,string=262144" (but that is not really useful, since I cannot use the device then).
Any idea on how to solve this issue?
Thanks