The runtime measurements are not matching #28
Comments
I started having the same problem with the recent commits. It seems like commit 4383b82 still works.
|
@thisiskarthikj Seems your commit broke the attestations. |
Thank you for your advice @YurkoWasHere but I tried the old commit and it wouldn't work due to RIM cert revocation. |
The certs are "revoked" because this tech is still in preview and not meant for production. Use the flag shown below, i.e.: |
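For reference, the flag passed in later comments (and visible in the verifier invocations further down this thread) is --allow_hold_cert; I'm assuming that is what was meant here:

python3 -m verifier.cc_admin --user_mode --allow_hold_cert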
Thanks! The problem was solved except that I had to run |
You can try adding it; I'm not sure what the difference in attestation is. |
@YurkoWasHere I see. Thank you very much for your help :) |
Hi @hiroki-chen, I am also trying to do confidential computing with an H100. Could you share your BIOS settings for SEV-SNP? |
@seungsoo-lee I followed the instructions in the deployment guide from Nvidia. The options are listed below.
|
I don't have a v4 AMD, but another project ran into issues with the AMD v4s not working with their stack; this may be relevant. But I don't think incorrect BIOS settings will block you from building and installing the kernel. In my experience, SEV just won't work :) I would also contact the board manufacturer's support to confirm the BIOS settings for SEV-SNP. Sometimes they need a BIOS upgrade. |
In the guide, IOMMU is enabled. Yours is set to auto and it still works? Do you mean that although the BIOS provides SEV options, SEV won't work? The BIOS that I'm using provides some SEV-SNP options as follows. Advanced --> |
@seungsoo-lee We are using an ASUS workstation: https://servers.asus.com/products/servers/gpu-servers/ESC8000A-E12 Interestingly, when we enabled IOMMU, the SNP initialization would fail with "TOO LATE TO ENABLE SNP FOR IOMMU". |
If you did not already, try adding
(see /boot/grub/grub.cfg) |
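In case it helps, the usual Ubuntu way to add a kernel parameter is via /etc/default/grub; the parameter itself was lost from the comment above, so the value below is only a placeholder:

# Sketch: append the suggested parameter to the kernel command line.
sudo vim /etc/default/grub
#   GRUB_CMDLINE_LINUX_DEFAULT="... <parameter-suggested-above>"
sudo update-grub
sudo reboot
# Afterwards, confirm the running kernel picked it up:
cat /proc/cmdline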
@YurkoWasHere Yes, I tried this before but my system would enter emergency mode immediately after reboot. It was very weird though. |
So strange. The full args I'm using are |
@hiroki-chen @YurkoWasHere According to kernel-parameters.txt, if iommu is not enabled, can you still pass an H100 GPU through to the VM? |
@Tan-YiFan Yes. For some reason I can pass an H100 through to QEMU if I set IOMMU to auto (perhaps it is BIOS-specific, I guess). |
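For anyone following along, a minimal way to hand the GPU to QEMU is to bind it to vfio-pci first. The PCI ID and BDF below are taken from the lspci output later in this thread; the NVIDIA deployment guide ships its own helper scripts for this, so treat it only as a sketch:

# Bind the H100 (10de:2331) at BDF 41:00.0 to vfio-pci so QEMU can own it.
sudo modprobe vfio-pci
echo "10de 2331" | sudo tee /sys/bus/pci/drivers/vfio-pci/new_id
echo "0000:41:00.0" | sudo tee /sys/bus/pci/devices/0000:41:00.0/driver/unbind
echo "0000:41:00.0" | sudo tee /sys/bus/pci/drivers/vfio-pci/bind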
The output of |
Hi @hiroki-chen, I also tried to follow the Confidential Computing Deployment Guide provided by NVIDIA. My machine spec seems similar to yours. But you say that your guest OS is as follows; how do I find this information from the document? |
Thanks for the reply. For building the guest kernel, you may clone this repo at the corresponding branch.

$ sudo vim /etc/default/grub
GRUB_DEFAULT="1>?"

$ cat /boot/grub/grub.cfg | grep menuentry
menuentry 'Ubuntu' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-simple-f34e1b77-9ec2-4ced-8893-bf54d352ba3a' {
submenu 'Advanced options for Ubuntu' $menuentry_id_option 'gnulinux-advanced-f34e1b77-9ec2-4ced-8893-bf54d352ba3a' {
menuentry 'Ubuntu, with Linux 6.6.0-rc1-snp-host-5a170ce1a082' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.6.0-rc1-snp-host-5a170ce1a082-advanced-f34e1b77-9ec2-4ced-8893-bf54d352ba3a' {
menuentry 'Ubuntu, with Linux 6.6.0-rc1-snp-host-5a170ce1a082 (recovery mode)' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.6.0-rc1-snp-host-5a170ce1a082-recovery-f34e1b77-9ec2-4ced-8893-bf54d352ba3a' {
menuentry 'Ubuntu, with Linux 6.2.0-39-generic' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.2.0-39-generic-advanced-f34e1b77-9ec2-4ced-8893-bf54d352ba3a' {
menuentry 'Ubuntu, with Linux 6.2.0-39-generic (recovery mode)' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.2.0-39-generic-recovery-f34e1b77-9ec2-4ced-8893-bf54d352ba3a' {
menuentry 'Ubuntu, with Linux 5.19.0-rc6-snp-host-c4daeffce56e' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.19.0-rc6-snp-host-c4daeffce56e-advanced-f34e1b77-9ec2-4ced-8893-bf54d352ba3a' {
menuentry 'Ubuntu, with Linux 5.19.0-rc6-snp-host-c4daeffce56e (recovery mode)' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.19.0-rc6-snp-host-c4daeffce56e-recovery-f34e1b77-9ec2-4ced-8893-bf54d352ba3a' {
menuentry 'Ubuntu, with Linux 5.15.0-91-generic' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.15.0-91-generic-advanced-f34e1b77-9ec2-4ced-8893-bf54d352ba3a' {
menuentry 'Ubuntu, with Linux 5.15.0-91-generic (recovery mode)' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.15.0-91-generic-recovery-f34e1b77-9ec2-4ced-8893-bf54d352ba3a' {

Replace the question mark with the index of the desired kernel entry (starting from 0). Then run update-grub and reboot. Hope this helps. |
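As a concrete example, and assuming the menu layout above (top-level entry 1 is the 'Advanced options for Ubuntu' submenu and its first sub-entry is the 6.6.0-rc1 snp-host kernel), booting that kernel would look like this sketch:

# Select top-level entry 1, sub-entry 0 (both indexes start at 0).
sudo sed -i 's/^GRUB_DEFAULT=.*/GRUB_DEFAULT="1>0"/' /etc/default/grub
sudo update-grub
sudo reboot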
Thanks for the reply! Following your advice, I have updated to the 5.19 guest kernel on the guest VM. After that, when I tried to do the 'Enabling LKCA on the Guest VM' part of the document (p. 26), I ran into a problem. How about your case? Is it okay? |
By default, this command will select the latest kernel; in your case, it is 6.2.0. You can select the kernel version manually via
sudo update-initramfs -u -k `uname -r`
or simply
sudo update-initramfs -u -k all |
Now I have changed the kernel to 5.19-snp-guest per your advice, but failed to install the NVIDIA driver and CUDA again.
Do you have any idea? |
@seungsoo-lee Which CUDA version are you currently using? Only v535 is compatible with the H100. If you are already using the correct version, then consider removing all CUDA drivers, kernel modules, and other packages, and re-installing the driver. I once encountered this issue but managed to fix it by re-installing the driver. |
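In case it helps, a typical apt-based cleanup looks roughly like the sketch below; the exact package names depend on how the driver and CUDA were originally installed:

# Remove NVIDIA driver and CUDA packages installed through apt
# (the patterns are intentionally broad; review the list before confirming).
sudo apt-get remove --purge '^nvidia-.*' '^libnvidia-.*' '^cuda.*'
sudo apt-get autoremove
# If the driver came from a .run installer instead, use its uninstaller:
# sudo /usr/bin/nvidia-uninstall
sudo reboot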
I'm a bit confused. First, second, if so, which host kernel version should be the target? |
@seungsoo-lee No. Installing the driver on the host is not required. The motivation for installing the driver on the host is just to check whether the H100 works fine. |
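A quick host-side sanity check, using the same lspci filter that appears elsewhere in this thread:

# List NVIDIA devices on the host (10de is NVIDIA's PCI vendor ID).
lspci -d 10de:
# If a driver is installed on the host, this should also list the H100:
nvidia-smi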
|
You said, 'If you are already using the correct version, then consider removing all CUDA drivers, kernel modules, and other packages and re-install the driver again.' Please let me know what commands you used. |
I have tried to install the NVIDIA driver on the guest VM all day. Finally, I got this
My procedure is as follows. Installing the host kernel --> it is okay.
Then installing the NVIDIA driver succeeds. HOWEVER, after rebooting the guest VM,
|
I have not tested with the latest commit but I don't think it's been fixed. So try using this commit instead of the latest: |
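If it helps, pinning the verifier to that commit could look like the sketch below; the repository path matches the shell prompt visible later in this thread, and the install step is an assumption about how the SDK was originally set up:

cd /shared/nvtrust            # path taken from a prompt later in this thread
git checkout 4383b82          # known-good commit mentioned above
# Re-install the local GPU verifier / attestation SDK from this checkout
# (assumed step; adjust to match your original installation method).
cd guest_tools/attestation_sdk
pip3 install .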
You mean that when turning CC mode on, the [gpu_cc_tool.py] script should be used? Plus, I wonder whether we should also turn dev-tools mode on when setting CC mode on. |
I found that when using this command line with the latest commit to enable the NVIDIA GPU inside the SEV enclave (I think this is similar to what you're doing), I started to get an error like yours:
Rolling back to the commit I mentioned corrected this problem. also |
To enable CC I use this line:
However, on my system I initially had issues with it. The suggestion was to use
My workaround was to edit |
and replace this line
with this line
then run |
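For reference, the CC-mode commands that appear verbatim later in this thread (the BDF is specific to that machine) are:

# Host side: put the GPU at 41:00 into CC mode and reset it afterwards.
sudo python3 gpu_cc_tool.py --gpu-bdf=41:00 --set-cc-mode on --reset-after-cc-mode-switch
# Query the resulting CC settings.
sudo python3 gpu_cc_tool.py --gpu-name=H100 --query-cc-settings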
in my case, with |
Did you guys run
successfully in your k8s cluster? |
Hi @seungsoo-lee, sadly I didn't try with K8S cluster. |
@seungsoo-lee If you do not care about the bugs introduced by the latest commit, you can then just use it as long as the stable commit shows the correct attestation result. |
A very interesting thing I recently found: when I tried to attest one of the H100 GPUs on my host inside the VM, SDK v1.2.0 worked fine but v1.1.0 would fail, whereas when I installed only one GPU, SDK v1.2.0 would report the error but v1.1.0 worked fine.

$ lspci -d 10de:
41:00.0 3D controller: NVIDIA Corporation Device 2331 (rev a1)
61:00.0 3D controller: NVIDIA Corporation Device 2331 (rev a1)
# 1.2.0: OK 1.1.0: Fail
$ lspci -d 10de:
61:00.0 3D controller: NVIDIA Corporation Device 2331 (rev a1)
# 1.2.0: Fail 1.1.0: OK |
If you do not try with k8s, how do you run confidential computing workloads/examples? |
what command did you try? |
python3 -m verifier.cc_admin --user_mode --allow_hold_cert |
I just tried to run PyTorch examples inside the VM. If ML tasks run successfully, then the confidential computing functionality is enabled. |
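A minimal check of this kind, assuming PyTorch is already installed inside the CVM, might be:

# Prints False if CUDA is unavailable, otherwise the detected GPU name.
python3 -c "import torch; print(torch.cuda.is_available() and torch.cuda.get_device_name(0))"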
We use PyTorch as well, directly, not in a container. Containerized workloads will be the next step. On the guest you can check that the GPU is in the correct state by running the query shown in the sketch below.
You can also force the GPU into a ready state without running the attestation, by using the set-ready-state command (also in the sketch below). |
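The set-ready-state command is quoted verbatim further down in this thread; the query form is my assumption about nvidia-smi's conf-compute subcommand and may differ across driver versions:

# Query the current GPU ready state (assumed option; check nvidia-smi conf-compute -h).
nvidia-smi conf-compute -grs
# Force the GPU into the ready state without running attestation
# (this exact command appears later in this thread):
sudo nvidia-smi conf-compute -srs 1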
Sorry, late to the party here. The summary is that for the following combinations of driver and vBIOS versions, we get an index 9 mismatch, correct?
Is this happening with the latest commit, whereas you don't see this issue with an older commit? Once you confirm, we will try to re-create the setup, try it, and get back to you. Hang in there! |
Technically it is (for the single-GPU case), but one thing that appears weird to me was that I was able to attest the H100 GPU using the latest commit:

h100@h100-cvm:~$ python3 -m verifier.cc_admin --allow_hold_cert --user_mode
Number of GPUs available : 1
-----------------------------------
Fetching GPU 0 information from GPU driver.
Using the Nonce generated by Local GPU Verifier
VERIFYING GPU : 0
Driver version fetched : 535.104.05
VBIOS version fetched : 96.00.74.00.1a
Validating GPU certificate chains.
GPU attestation report certificate chain validation successful.
The certificate chain revocation status verification successful.
Authenticating attestation report
The nonce in the SPDM GET MEASUREMENT request message is matching with the generated nonce.
Driver version fetched from the attestation report : 535.104.05
VBIOS version fetched from the attestation report : 96.00.74.00.1a
Attestation report signature verification successful.
Attestation report verification successful.
Authenticating the RIMs.
Authenticating Driver RIM
Fetching the driver RIM from the RIM service.
RIM Schema validation passed.
driver RIM certificate chain verification successful.
The certificate chain revocation status verification successful.
driver RIM signature verification successful.
Driver RIM verification successful
Authenticating VBIOS RIM.
Fetching the VBIOS RIM from the RIM service.
RIM Schema validation passed.
vbios RIM certificate chain verification successful.
The certificate chain revocation status verification successful.
vbios RIM signature verification successful.
VBIOS RIM verification successful
Comparing measurements (runtime vs golden)
The runtime measurements are matching with the golden measurements.
GPU is in expected state.
GPU 0 verified successfully.
GPU Attested Successfully

This happened after I installed another GPU on the machine:

$ lspci -d 10de:
41:00.0 3D controller: NVIDIA Corporation Device 2331 (rev a1)
61:00.0 3D controller: NVIDIA Corporation Device 2331 (rev a1)

And I let the first GPU (41:00.0) bind to VFIO and enabled CC for it:

sudo python3 gpu_cc_tool.py --gpu-bdf=41:00 --set-cc-mode on --reset-after-cc-mode-switch

1.2.0 could work for this scenario but not 1.1.0:

h100@h100-cvm:/shared/nvtrust/guest_tools/attestation_sdk$ python3 -m verifier.cc_admin --allow_hold_cert --user_mode
Number of GPUs available : 1
-----------------------------------
Fetching GPU 0 information from GPU driver.
Using the Nonce generated by Local GPU Verifier
using the new pinned root cert
VERIFYING GPU : 0
Driver version fetched : 535.104.05
VBIOS version fetched : 96.00.74.00.1a
Validating GPU certificate chains.
GPU attestation report certificate chain validation successful.
The certificate chain revocation status verification successful.
Authenticating attestation report
The nonce in the SPDM GET MEASUREMENT request message is matching with the generated nonce.
Driver version fetched from the attestation report : 535.104.05
VBIOS version fetched from the attestation report : 96.00.74.00.1a
Attestation report signature verification successful.
Attestation report verification successful.
Authenticating the RIMs.
Authenticating Driver RIM
RIM Schema validation passed.
driver RIM certificate chain verification successful.
WARNING: THE CERTIFICATE NVIDIA Reference Value L3 GH100 001 IS REVOKED WITH THE STATUS AS 'CERTIFICATE_HOLD'.
The certificate chain revocation status verification was not successful but continuing.
driver RIM signature verification successful.
Driver RIM verification successful
Authenticating VBIOS RIM.
RIM Schema validation passed.
vbios RIM certificate chain verification successful.
WARNING: THE CERTIFICATE NVIDIA Reference Value L3 GH100 001 IS REVOKED WITH THE STATUS AS 'CERTIFICATE_HOLD'.
The certificate chain revocation status verification was not successful but continuing.
vbios RIM signature verification successful.
VBIOS RIM verification successful
Comparing measurements (runtime vs golden)
The runtime measurements are not matching with the
golden measurements at the following indexes(starting from 0) :
[
9,
36
]
The verification of GPU 0 resulted in failure.
GPU Attestation failed

Summary

Multi-GPU
SDK 1.2.0: Works for multiple GPUs (although I used only one GPU inside the CVM)

Single-GPU
SDK 1.2.0: Measurement mismatch for index 9

The GPU worked in CC mode. |
I'm a little confused. So, you mean that
|
Steps 1-3 are meant for preparatory purposes: to confirm that the GPU works in CC mode.
Not necessarily (the script might be buggy, or you are running in user mode). You can set the ready state manually with:

sudo nvidia-smi conf-compute -srs 1
Still have the same issue with the latest commit :(
|
@YurkoWasHere I believe they haven't updated yet |
Hi, Is this problem resolved?
|
Unfortunately I no longer have the H100 paired with an AMD system; I moved on to Intel, which required the newest version. Last I checked (about a month ago) it still was not working. If you are still having issues, try using the older commit. |
Actually, I'm not using AMD; my environment is an Intel+TDX CVM, and the GPU is an H800.
|
@yunbo-xufeng I'm not sure if the old commit works for H800 but H100 is supported :/ Did you try the 4383b82 commit? If you tried with that commit and remote attestation failed then I think you'll probably have to wait for NVIDIA's team to fix this issue. |
@hiroki-chen @yunbo-xufeng A measurements mismatch could be an issue with the RIM file itself. We will take a look and get back to you. |
@yunbo-xufeng Can you get me the version of nvidia_gpu_tools.py that you are using?

python3 nvidia_gpu_tools.py --help | grep version |
Hi,
Thanks for supporting confidential computing on H100 GPUs! This work is wonderful.
I recently started configuring AMD SEV-SNP with an H100 GPU and tried to run some small demos on my machine. Everything went smoothly except that the attestation validation went awry.
My machine's specs:
CPU: Dual AMD EPYC 9124 16-Core Processor
GPU: H100 10de:2331 (vbios: 96.00.74.00.1A cuda: 12.2 nvidia driver: 535.86.10)
Host OS: Ubuntu 22.04 with 5.19.0-rc6-snp-host-c4daeffce56e kernel
Guest OS: Ubuntu 22.04.2 with 5.19.0-rc6-snp-guest-c4daeffce56e kernel
I tried to run
/attestation_sdk/tests/LocalGPUTest.py
but encountered the following error. The error is with
x-nv-gpu-measurements-match
The output of the CC mode on the host machine looks like below.
$ sudo python3 gpu_cc_tool.py --gpu-name=H100 --query-cc-settings
NVIDIA GPU Tools version 535.86.06
Topo:
  PCI 0000:60:01.1 0x1022:0x14ab
   GPU 0000:61:00.0 H100-PCIE 0x2331 BAR0 0x8c042000000
2023-12-07,22:13:09.865 INFO Selected GPU 0000:61:00.0 H100-PCIE 0x2331 BAR0 0x8c042000000
2023-12-07,22:13:09.865 WARNING GPU 0000:61:00.0 H100-PCIE 0x2331 BAR0 0x8c042000000 has CC mode on, some functionality may not work
2023-12-07,22:13:09.936 INFO GPU 0000:61:00.0 H100-PCIE 0x2331 BAR0 0x8c042000000 CC settings:
2023-12-07,22:13:09.937 INFO   enable = 1
2023-12-07,22:13:09.937 INFO   enable-devtools = 0
2023-12-07,22:13:09.937 INFO   enable-allow-inband-control = 1
2023-12-07,22:13:09.937 INFO   enable-devtools-allow-inband-control = 1
I also tried to set the cc-mode to devtools but it didn't help. Do you have any ideas on the error? Any help is more than appreciated!