
The runtime measurements are not matching #28

Open · hiroki-chen opened this issue Dec 8, 2023 · 74 comments

@hiroki-chen

Hi,

Thanks for supporting confidential computing on H100 GPUs! This work is wonderful.

I recently started configuring AMD SEV-SNP with an H100 GPU and tried to run some small demos on my machine. Everything went smoothly except that the attestation validation went awry.

My machine's specs:

CPU: Dual AMD EPYC 9124 16-Core Processor
GPU: H100 10de:2331 (vbios: 96.00.74.00.1A cuda: 12.2 nvidia driver: 535.86.10)
Host OS: Ubuntu 22.04 with 5.19.0-rc6-snp-host-c4daeffce56e kernel
Guest OS: Ubuntu 22.04.2 with 5.19.0-rc6-snp-guest-c4daeffce56e kernel

I tried to run /attestation_sdk/tests/LocalGPUTest.py but encountered the following error:

h100@h100-cvm:/shared/nvtrust/guest_tools/attestation_sdk/tests$ python3 ./LocalGPUTest.py 
[LocalGPUTest] node name : thisNode1
[['LOCAL_GPU_CLAIMS', <Devices.GPU: 2>, <Environment.LOCAL: 2>, '', '', '']]
[LocalGPUTest] call attest() - expecting True
Number of GPUs available : 1
-----------------------------------
Fetching GPU 0 information from GPU driver.
Using the Nonce specified by user
VERIFYING GPU : 0
        Driver version fetched : 535.86.10
        VBIOS version fetched : 96.00.74.00.1a
        Validating GPU certificate chains.
                GPU attestation report certificate chain validation successful.
                        The certificate chain revocation status verification successful.
        Authenticating attestation report
                The nonce in the SPDM GET MEASUREMENT request message is matching with the generated nonce.
                Driver version fetched from the attestation report : 535.86.10
                VBIOS version fetched from the attestation report : 96.00.74.00.1a
                Attestation report signature verification successful.
                Attestation report verification successful.
        Authenticating the RIMs.
                Authenticating Driver RIM
                        Fetching the driver RIM from the RIM service.
                        RIM Schema validation passed.
                        driver RIM certificate chain verification successful.
                        The certificate chain revocation status verification successful.
                        driver RIM signature verification successful.
                        Driver RIM verification successful
                Authenticating VBIOS RIM.
                        Fetching the VBIOS RIM from the RIM service.
                        RIM Schema validation passed.
                        vbios RIM certificate chain verification successful.
                        The certificate chain revocation status verification successful.
                        vbios RIM signature verification successful.
                        VBIOS RIM verification successful
        Comparing measurements (runtime vs golden)
                        The runtime measurements are not matching with the
                        golden measurements at the following indexes(starting from 0) :
                        [
                        9
                        ]
The verification of GPU 0 resulted in failure.
        GPU Attestation failed
False
[LocalGPUTest] token : [["JWT", "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJOVi1BdHRlc3RhdGlvbi1TREsiLCJpYXQiOjE3MDIwMDUxNTcsImV4cCI6bnVsbH0._J81r7wl6FiVF3uxZL5mKeKuOWPxBsb6-zgdpZ5TJdA"], {"LOCAL_GPU_CLAIMS": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ4LW52LWdwdS1hdmFpbGFibGUiOnRydWUsIngtbnYtZ3B1LWF0dGVzdGF0aW9uLXJlcG9ydC1hdmFpbGFibGUiOnRydWUsIngtbnYtZ3B1LWluZm8tZmV0Y2hlZCI6dHJ1ZSwieC1udi1ncHUtYXJjaC1jaGVjayI6dHJ1ZSwieC1udi1ncHUtcm9vdC1jZXJ0LWF2YWlsYWJsZSI6dHJ1ZSwieC1udi1ncHUtY2VydC1jaGFpbi12ZXJpZmllZCI6dHJ1ZSwieC1udi1ncHUtb2NzcC1jZXJ0LWNoYWluLXZlcmlmaWVkIjp0cnVlLCJ4LW52LWdwdS1vY3NwLXNpZ25hdHVyZS12ZXJpZmllZCI6dHJ1ZSwieC1udi1ncHUtY2VydC1vY3NwLW5vbmNlLW1hdGNoIjp0cnVlLCJ4LW52LWdwdS1jZXJ0LWNoZWNrLWNvbXBsZXRlIjp0cnVlLCJ4LW52LWdwdS1tZWFzdXJlbWVudC1hdmFpbGFibGUiOnRydWUsIngtbnYtZ3B1LWF0dGVzdGF0aW9uLXJlcG9ydC1wYXJzZWQiOnRydWUsIngtbnYtZ3B1LW5vbmNlLW1hdGNoIjp0cnVlLCJ4LW52LWdwdS1hdHRlc3RhdGlvbi1yZXBvcnQtZHJpdmVyLXZlcnNpb24tbWF0Y2giOnRydWUsIngtbnYtZ3B1LWF0dGVzdGF0aW9uLXJlcG9ydC12Ymlvcy12ZXJzaW9uLW1hdGNoIjp0cnVlLCJ4LW52LWdwdS1hdHRlc3RhdGlvbi1yZXBvcnQtdmVyaWZpZWQiOnRydWUsIngtbnYtZ3B1LWRyaXZlci1yaW0tc2NoZW1hLWZldGNoZWQiOnRydWUsIngtbnYtZ3B1LWRyaXZlci1yaW0tc2NoZW1hLXZhbGlkYXRlZCI6dHJ1ZSwieC1udi1ncHUtZHJpdmVyLXJpbS1jZXJ0LWV4dHJhY3RlZCI6dHJ1ZSwieC1udi1ncHUtZHJpdmVyLXJpbS1zaWduYXR1cmUtdmVyaWZpZWQiOnRydWUsIngtbnYtZ3B1LWRyaXZlci1yaW0tZHJpdmVyLW1lYXN1cmVtZW50cy1hdmFpbGFibGUiOnRydWUsIngtbnYtZ3B1LWRyaXZlci12Ymlvcy1yaW0tZmV0Y2hlZCI6dHJ1ZSwieC1udi1ncHUtdmJpb3MtcmltLXNjaGVtYS12YWxpZGF0ZWQiOnRydWUsIngtbnYtZ3B1LXZiaW9zLXJpbS1jZXJ0LWV4dHJhY3RlZCI6dHJ1ZSwieC1udi1ncHUtdmJpb3MtcmltLXNpZ25hdHVyZS12ZXJpZmllZCI6dHJ1ZSwieC1udi1ncHUtdmJpb3MtcmltLWRyaXZlci1tZWFzdXJlbWVudHMtYXZhaWxhYmxlIjp0cnVlLCJ4LW52LWdwdS12Ymlvcy1pbmRleC1uby1jb25mbGljdCI6dHJ1ZSwieC1udi1ncHUtbWVhc3VyZW1lbnRzLW1hdGNoIjpmYWxzZSwieC1udi1ncHUtdXVpZCI6IkdQVS1kNDNlYWM4Zi02MzExLTk1ZTgtYjI4ZS04ZGE1ZmQ1ZTE4MGIifQ.g0ktblAfvDGsfaMuFMxn8MJb3KZPK-7fyoZWrBIuSuY"}]
[LocalGPUTest] call validate_token() - expecting True
        [ERROR] Invalid token. Authorized claims does not match the appraisal policy:  x-nv-gpu-measurements-match
False

The error is x-nv-gpu-measurements-match, with:

        Comparing measurements (runtime vs golden)
                        The runtime measurements are not matching with the
                        golden measurements at the following indexes(starting from 0) :
                        [
                        9
                        ]

The output of the CC mode query on the host machine is shown below.

$ sudo python3 gpu_cc_tool.py --gpu-name=H100 --query-cc-settings

NVIDIA GPU Tools version 535.86.06
Topo:
  PCI 0000:60:01.1 0x1022:0x14ab
   GPU 0000:61:00.0 H100-PCIE 0x2331 BAR0 0x8c042000000
2023-12-07,22:13:09.865 INFO     Selected GPU 0000:61:00.0 H100-PCIE 0x2331 BAR0 0x8c042000000
2023-12-07,22:13:09.865 WARNING  GPU 0000:61:00.0 H100-PCIE 0x2331 BAR0 0x8c042000000 has CC mode on, some functionality may not work
2023-12-07,22:13:09.936 INFO     GPU 0000:61:00.0 H100-PCIE 0x2331 BAR0 0x8c042000000 CC settings:
2023-12-07,22:13:09.937 INFO       enable = 1
2023-12-07,22:13:09.937 INFO       enable-devtools = 0
2023-12-07,22:13:09.937 INFO       enable-allow-inband-control = 1
2023-12-07,22:13:09.937 INFO       enable-devtools-allow-inband-control = 1

I also tried to set the cc-mode to devtools but it didn't help.

Do you have any ideas on the error? Any help is more than appreciated!

@YurkoWasHere

I started having the same problem with the recent commits.

Commit 4383b82 seems to still work.

Number of GPUs available : 1
-----------------------------------
Fetching GPU 0 information from GPU driver.
Using the Nonce generated by Local GPU Verifier
VERIFYING GPU : 0
	Driver version fetched : 535.104.05
	VBIOS version fetched : 96.00.74.00.1c
	Validating GPU certificate chains.
		GPU attestation report certificate chain validation successful.
			The certificate chain revocation status verification successful.
	Authenticating attestation report
		The nonce in the SPDM GET MEASUREMENT request message is matching with the generated nonce.
		Driver version fetched from the attestation report : 535.104.05
		VBIOS version fetched from the attestation report : 96.00.74.00.1c
		Attestation report signature verification successful.
		Attestation report verification successful.
	Authenticating the RIMs.
		Authenticating Driver RIM
			Fetching the driver RIM from the RIM service.
			RIM Schema validation passed.
			driver RIM certificate chain verification successful.
			The certificate chain revocation status verification successful.
			driver RIM signature verification successful.
			Driver RIM verification successful
		Authenticating VBIOS RIM.
			Fetching the VBIOS RIM from the RIM service.
			RIM Schema validation passed.
			vbios RIM certificate chain verification successful.
			The certificate chain revocation status verification successful.
			vbios RIM signature verification successful.
			VBIOS RIM verification successful
	Comparing measurements (runtime vs golden)
			The runtime measurements are not matching with the
                        golden measurements at the following indexes(starting from 0) :
			[
			9
			]
	GPU Ready state is already NOT READY
The verification of GPU 0 resulted in failure.
	GPU Attestation failed

@YurkoWasHere

@thisiskarthikj It seems your commit broke the attestations.

9ad90fd

@hiroki-chen
Author

Thank you for your advice @YurkoWasHere but I tried the old commit and it wouldn't work due to RIM cert revocation.

@YurkoWasHere

YurkoWasHere commented Dec 8, 2023

@hiroki-chen

The certs are "revoked" because this tech is still in preview and not meant for production.

Use the --allow_hold_cert parameter to bypass this specific revocation type check.

IE:
python3 -m verifier.cc_admin --allow_hold_cert

@hiroki-chen
Author

@hiroki-chen

The certs are "revoked" because this tech is still in preview and not meant for production.

Use the --allow_hold_cert parameter to bypass this specific revocation type check.

IE: python3 -m verifier.cc_admin --allow_hold_cert

Thanks! The problem was solved except that I had to run python3 as sudo.

@YurkoWasHere

@hiroki-chen

You can try adding --user_mode for a non-sudo version of the command.

I'm not sure what the difference in attestation is.

@hiroki-chen
Author

@YurkoWasHere I see. Thank you very much for your help :)

@seungsoo-lee

Hi @hiroki-chen ,

I am also trying to do confidential computing with H100.
Similar to yours, my machine has dual AMD EPYC 9224 CPUs and an H100 (running on a GIGABYTE system).

Could you share your BIOS settings for SEV-SNP?
I got stuck at the kernel installation phase...

@hiroki-chen
Author

hiroki-chen commented Dec 13, 2023

Hi @hiroki-chen ,

I am also trying to do confidential computing with H100. Similar to yours, my machine is on dual AMD EPYC 9224 and H100 (running on GIGABYTE systems)

could you give me your BIOS setting about SEV-SNP? because I got stuck on installing kernel phase....

@seungsoo-lee I followed the instructions in the deployment guide from Nvidia.

The options are listed below.

Advanced -->
      AMD CBS ->
          CPU Common ->
              SEV ASID Count -> 509 ASIDs
              SEV-ES ASID space Limit Control -> Manual
              SEV-ES ASID space limit -> 100
              SNP Memory Coverage -> Enabled
              SMEE -> Enabled
NBIO common ->
      SEV-SNP Support -> Enabled
      IOMMU -> auto

@YurkoWasHere

I don't have a v4 AMD, but another project ran into issues with the AMD v4s not working with their stack.

This may be relevant
https://github.com/AMDESE/AMDSEV/tree/snp-latest?tab=readme-ov-file#upgrading-from-519-based-snp-hypervisorhost-kernels

But I don't think incorrect BIOS settings would block you from building and installing the kernel. In my experience, SEV just won't work :)

I would also contact the board manufacturer's support to confirm the BIOS settings for SEV-SNP. Sometimes a BIOS upgrade is needed.

@seungsoo-lee

@hiroki-chen

In the guide, IOMMU is set to Enabled. Yours is Auto and it still works?
Also, can you let me know which system/BIOS vendor you are using (e.g., Supermicro or GIGABYTE)?

@YurkoWasHere

Do you mean that even though the BIOS provides SEV options, SEV won't work?

The BIOS I'm using provides some SEV-SNP options, as follows.

Advanced -->
      AMD CBS ->
          CPU Common ->
              (not provided) SEV ASID Count -> 509 ASIDs
              (not provided) SEV-ES ASID space Limit Control -> Manual
              SEV-ES ASID space limit -> 100
              SNP Memory Coverage -> Enabled
              SMEE -> Enabled
NBIO common ->
      SEV-SNP Support -> Enabled
      IOMMU -> auto

@hiroki-chen
Author

@seungsoo-lee We are using ASUS workstation: https://servers.asus.com/products/servers/gpu-servers/ESC8000A-E12

Interestingly, when we enabled IOMMU, the SNP initialization would fail with "TOO LATE TO ENABLE SNP FOR IOMMU".

@YurkoWasHere

@hiroki-chen

If you haven't already, try adding the amd_iommu=on kernel argument to your GRUB linux line:

linux /vmlinuz-5.19.0-rc6-snp-host-c4daeffce56e root= [.....] amd_iommu=on

(see /boot/grub/grub.cfg)

@hiroki-chen
Author

@hiroki-chen

if you did not already try adding amd_iommu=on kernel argument to you grub linux line?

linux /vmlinuz-5.19.0-rc6-snp-host-c4daeffce56e root= [.....] amd_iommu=on

(see /boot/grub/grub.cfg)

@YurkoWasHere Yes, I tried this before but my system would enter emergency mode immediately after reboot. It was very weird though.

@YurkoWasHere

So strange.

The full args I'm using are:
mem_encrypt=on kvm_amd.sev=1 kvm_amd.sev-snp=1 amd_kvm.sev-es=1 amd_iommu=on vfio-pci.disable_idle_d3=1

@Tan-YiFan

@hiroki-chen @YurkoWasHere According to kernel-parameters.txt, amd_iommu does not take "on" as input. Maybe we should add iommu=pt?

If IOMMU is not enabled, can you pass an H100 GPU through to the VM?
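
A minimal sketch of what I mean, assuming the stock Ubuntu GRUB layout (the existing arguments on your machine will differ):

$ sudo vim /etc/default/grub
GRUB_CMDLINE_LINUX="... iommu=pt"   # append iommu=pt to whatever is already there
$ sudo update-grub
$ sudo reboot
$ cat /proc/cmdline                 # confirm the parameter took effect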

@hiroki-chen
Author

@Tan-YiFan Yes, I can pass an H100 through to QEMU for some reason if I set IOMMU to auto (perhaps it is BIOS-specific, I guess).

The output of /proc/cmdline is:
BOOT_IMAGE=/boot/vmlinuz-5.19.0-rc6-snp-host-c4daeffce56e root=UUID=[...] ro vfio-pci.disable_idle_d3=1

@seungsoo-lee

seungsoo-lee commented Dec 28, 2023

hi @hiroki-chen ,

I also tried to follow the Confidential Computing Deployment Guide provided by NVIDIA.
Now I am stuck on installing the NVIDIA driver on the guest VM.
It says ERROR: Unable to load the kernel module 'nvidia.ko'.

My machine spec seems similar to yours.
Actually, based on the document, when we installed the guest VM, its kernel version is 6.2.0-39-generic.

But you say that your guest OS is as follows:
Guest OS: Ubuntu 22.04.2 with 5.19.0-rc6-snp-guest-c4daeffce56e kernel

How do I find this information in the document?
How do I build the guest kernel?

@hiroki-chen
Author

@seungsoo-lee

Thanks for the reply. For building the guest kernel, you may clone this repo at branch sev-snp-devel and then follow the "Preparing to Build the Kernel" section to build the kernels. You will find the guest kernel under snp-release-[built date]/linux/guest. Then launch the CVM, scp the deb packages to it, and install the kernels. Be sure to modify GRUB so that the 5.19 kernel is selected.

$ sudo vim /etc/default/grub
GRUB_DEFAULT="1>?" 
$ cat /boot/grub/grub.cfg | grep menuentry
menuentry 'Ubuntu' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-simple-f34e1b77-9ec2-4ced-8893-bf54d352ba3a' {
submenu 'Advanced options for Ubuntu' $menuentry_id_option 'gnulinux-advanced-f34e1b77-9ec2-4ced-8893-bf54d352ba3a' {
        menuentry 'Ubuntu, with Linux 6.6.0-rc1-snp-host-5a170ce1a082' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.6.0-rc1-snp-host-5a170ce1a082-advanced-f34e1b77-9ec2-4ced-8893-bf54d352ba3a' {
        menuentry 'Ubuntu, with Linux 6.6.0-rc1-snp-host-5a170ce1a082 (recovery mode)' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.6.0-rc1-snp-host-5a170ce1a082-recovery-f34e1b77-9ec2-4ced-8893-bf54d352ba3a' {
        menuentry 'Ubuntu, with Linux 6.2.0-39-generic' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.2.0-39-generic-advanced-f34e1b77-9ec2-4ced-8893-bf54d352ba3a' {
        menuentry 'Ubuntu, with Linux 6.2.0-39-generic (recovery mode)' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.2.0-39-generic-recovery-f34e1b77-9ec2-4ced-8893-bf54d352ba3a' {
        menuentry 'Ubuntu, with Linux 5.19.0-rc6-snp-host-c4daeffce56e' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.19.0-rc6-snp-host-c4daeffce56e-advanced-f34e1b77-9ec2-4ced-8893-bf54d352ba3a' {
        menuentry 'Ubuntu, with Linux 5.19.0-rc6-snp-host-c4daeffce56e (recovery mode)' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.19.0-rc6-snp-host-c4daeffce56e-recovery-f34e1b77-9ec2-4ced-8893-bf54d352ba3a' {
        menuentry 'Ubuntu, with Linux 5.15.0-91-generic' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.15.0-91-generic-advanced-f34e1b77-9ec2-4ced-8893-bf54d352ba3a' {
        menuentry 'Ubuntu, with Linux 5.15.0-91-generic (recovery mode)' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.15.0-91-generic-recovery-f34e1b77-9ec2-4ced-8893-bf54d352ba3a' {

Replace the question mark with the index of the desired kernel entry (indexes start from 0). Then run update-grub and reboot.
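
For example, with the sample menu above, the 5.19 kernel is the fifth entry (index 4) inside the 'Advanced options for Ubuntu' submenu (itself index 1), so, assuming the same ordering on your machine, it would be:

$ sudo vim /etc/default/grub
GRUB_DEFAULT="1>4"
$ sudo update-grub
$ sudo reboot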

Hope this helps.

@seungsoo-lee

seungsoo-lee commented Dec 29, 2023

@hiroki-chen

thanks for the reply!

As you advised, I have installed the 5.19 guest kernel on the guest VM.

after that,

When I tried to do the 'Enabling LKCA on the Guest VM' part of the document (p. 26),
sudo update-initramfs -u says: update-initramfs: Generating /boot/initrd.img-6.2.0-39-generic, not the 5.19-snp-guest one.

How about your case? Is it okay?

@hiroki-chen
Author

@seungsoo-lee

By default, this command will select the latest kernel. In your case, it is 6.2.0. You can select the kernel version manually via

sudo update-initramfs -u -k `uname -r`

or simply

sudo update-initramfs -u -k all
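
Afterwards you can confirm which initramfs images were (re)generated, e.g.:

ls -l /boot/initrd.img-*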

@seungsoo-lee

@hiroki-chen

Now,

I have changed the kernel to 5.19-snp-guest as you advised
and also ran sudo update-initramfs -u -k all.

But installing the NVIDIA driver and CUDA failed again.
It says:

   make[1]: Leaving directory '/usr/src/linux-headers-5.19.0-rc6-snp-guest-c4daeffce56e'
-> done.
-> Kernel module compilation complete.
-> Unable to determine if Secure Boot is enabled: No such file or directory
ERROR: Unable to load the kernel module 'nvidia.ko'.  This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release.

Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/var/log/nvidia-installer.log' for more information.
-> Kernel module load error: No such device
-> Kernel messages:
[    9.059353] audit: type=1400 audit(1704159783.488:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine" pid=705 comm="apparmor_parser"
[    9.059357] audit: type=1400 audit(1704159783.488:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=705 comm="apparmor_parser"
[    9.060369] audit: type=1400 audit(1704159783.492:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-client.action" pid=702 comm="apparmor_parser"
[    9.060373] audit: type=1400 audit(1704159783.492:11): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-helper" pid=702 comm="apparmor_parser"
[   12.607062] loop3: detected capacity change from 0 to 8
[   12.607267] Dev loop3: unable to read RDB block 8
[   12.608258]  loop3: unable to read partition table
[   12.608263] loop3: partition table beyond EOD, truncated
[   13.245756] fbcon: Taking over console
[   13.299647] Console: switching to colour frame buffer device 128x48
[  132.090302] nvidia: loading out-of-tree module taints kernel.
[  132.092232] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[  132.124068] nvidia-nvlink: Nvlink Core is being initialized, major device number 236
[  132.124076] NVRM: The NVIDIA GPU 0000:01:00.0 (PCI ID: 10de:2331)
               NVRM: installed in this system is not supported by open
               NVRM: nvidia.ko because it does not include the required GPU
               NVRM: System Processor (GSP).
               NVRM: Please see the 'Open Linux Kernel Modules' and 'GSP
               NVRM: Firmware' sections in the driver README, available on
               NVRM: the Linux graphics driver download page at
               NVRM: www.nvidia.com.
[  137.470645] nvidia: probe of 0000:01:00.0 failed with error -1
[  137.470765] NVRM: The NVIDIA probe routine failed for 1 device(s).
[  137.470768] NVRM: None of the NVIDIA devices were initialized.
[  137.471172] nvidia-nvlink: Unregistered Nvlink Core, major device number 236

do you have any idea?

@hiroki-chen
Author

@seungsoo-lee Which CUDA version are you currently using? Only v535 is compatible with the H100. If you are already using the correct version, then consider removing all CUDA drivers, kernel modules, and other packages and re-install the driver again. I encountered this issue once but managed to fix it by re-installing the driver.
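
For reference, a rough sketch of what that cleanup can look like, assuming the driver and toolkit were installed via the .run installer (which ships nvidia-uninstall and cuda-uninstaller); adjust the paths to your install:

sudo /usr/local/cuda-12.2/bin/cuda-uninstaller   # remove the CUDA toolkit, if installed this way
sudo nvidia-uninstall                            # remove the .run-installed driver
sudo apt purge 'nvidia-*' 'libnvidia-*'          # only if distro driver packages are also present
sudo reboot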

@seungsoo-lee

@hiroki-chen

I'm a bit confused.

First, should cuda_12.2.1_535.86.10_linux.run be installed on the host before installing it on the guest VM?

Second, if so, which host kernel version should be the target?
By default, the host (Ubuntu 22.04.3 LTS server) kernel version is 5.15, and we also have the 5.19-snp-host kernel.

@Tan-YiFan

@hiroki-chen

some confused..

first, cuda_12.2.1_535.86.10_linux.run should be installed on the host before installing it on the guest VM?

second, if so, which host kernel version should be target? by default, the host (ubuntu 22.04.3 LTS server) kernel version is 5.15. and we also have 5.19-snp-host kernel.

@seungsoo-lee No, installing the driver on the host is not required. The point of installing the driver on the host is just to check whether the H100 works fine.

@seungsoo-lee

@hiroki-chen
some confused..
first, cuda_12.2.1_535.86.10_linux.run should be installed on the host before installing it on the guest VM?
second, if so, which host kernel version should be target? by default, the host (ubuntu 22.04.3 LTS server) kernel version is 5.15. and we also have 5.19-snp-host kernel.

@seungsoo-lee No. Installing driver on the host is not required. The motivation of installing driver on the host is to check whether the H100 works fine.

here

@seungsoo-lee

@hiroki-chen

You said, 'If you are already using the correct version, then consider removing all CUDA drivers, kernel modules, and other packages and re-install the driver again.'

Please let me know what commands you used.

@seungsoo-lee

seungsoo-lee commented Jan 3, 2024

I have been trying to install the NVIDIA driver on the guest VM all day
(removing/reinstalling the host and guest and repeating...).

Finally, I got this

cclab@guest:~$ sudo sh cuda_12.2.1_535.86.10_linux.run -m=kernel-open
===========
= Summary =
===========

Driver:   Installed
Toolkit:  Installed in /usr/local/cuda-12.2/

Please make sure that
 -   PATH includes /usr/local/cuda-12.2/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-12.2/lib64, or, add /usr/local/cuda-12.2/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-12.2/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall
Logfile is /var/log/cuda-installer.log
cclab@guest:~$ sudo nvidia-persistenced
cclab@guest:~$ ps -aux | grep nvidia-persistenced
root       10413 19.8  0.0   5320  1840 ?        Ss   11:52   0:04 nvidia-persistenced
cclab      10440  0.0  0.0   6612  2404 pts/0    S+   11:53   0:00 grep --color=auto nvidia-persistenced
cclab@guest:~$ nvidia-smi conf-compute -f
CC status: ON
cclab@guest:~$ nvidia-smi -q | grep VBIOS
    VBIOS Version                         : 96.00.30.00.01

My procedure is as follows.

Installing the host kernel --> OK.
Then, after preparing and launching the guest VM (Ubuntu 22.04.2, as described in the deployment document):

  • install the 5.19-snp-guest kernel in the guest VM
  • right after that, remove the 6.2.0 kernel from the guest VM

Then installing the NVIDIA driver succeeds.

HOWEVER,

after rebooting the guest VM,

cclab@guest:~$ sudo nvidia-persistenced
nvidia-persistenced failed to initialize. Check syslog for more details.

cclab@guest:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

@YurkoWasHere

I have not tested with the latest commit, but I don't think it's been fixed.

So try using this commit instead of the latest:

4383b82

@seungsoo-lee

Hi @YurkoWasHere

You mean that when turning CC mode on, the gpu_cc_tool.py script should be used?

Plus, I wonder whether we should also turn dev-tools mode on when setting CC mode on.

@YurkoWasHere

I found that when using this command line with the latest commit to enable the NVIDIA GPU inside the SEV enclave

(I think this is similar to what you're doing with LocalGPUTest.py)
python3 -m verifier.cc_admin --allow_hold_cert

I started to get an error like yours:

			The runtime measurements are not matching with the
                        golden measurements at the following indexes(starting from 0) :
			[
			9
			]

Rolling back to the commit I mentioned corrected this problem.

Also, dev-tools should not be on.

@YurkoWasHere

To enable CC I use this line:
python3 gpu_cc_tool.py --gpu-name H100-PCIE --reset-after-cc-mode-switch --set-cc-mode=on

However, on my system I initially had issues with it. The suggestion was to use sysfs instead of devmem, but the command-line argument seemed to be broken.

My workaround was to edit

/shared/nvtrust/host_tools/python/gpu_cc_tool.py

and replace this line

mmio_access_type = "devmem"

with this line

mmio_access_type = "sysfs"

Then run gpu_cc_tool.py as above.
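
The same edit as a sed one-liner, assuming the default is hard-coded on exactly that line of the file:

sudo sed -i 's/mmio_access_type = "devmem"/mmio_access_type = "sysfs"/' /shared/nvtrust/host_tools/python/gpu_cc_tool.py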

@seungsoo-lee

@YurkoWasHere

In my case, with dev-tools off, LocalGPUTest.py outputs True (no errors). Then is it okay to use the latest commit?

@seungsoo-lee

@YurkoWasHere @hiroki-chen

Did you guys run

nvidia-kata-manager-jvlrq
nvidia-sandbox-device-plugin-daemonset-pq87l
nvidia-sandbox-validator-ppfck
nvidia-vfio-manager-rplgt

successfully in your k8s cluster?

@hiroki-chen
Author

@YurkoWasHere @hiroki-chen

Did you guys run

nvidia-kata-manager-jvlrq
nvidia-sandbox-device-plugin-daemonset-pq87l
nvidia-sandbox-validator-ppfck
nvidia-vfio-manager-rplgt

successfuly in your k8s cluster ?

Hi @seungsoo-lee, sadly I didn't try with a K8s cluster.

@hiroki-chen
Author

@YurkoWasHere

in my case, with dev-tools off, LocalGPUTest.py outputs true (no errors). then, is it okay to use the latest commit?

@seungsoo-lee If you do not care about the bugs introduced by the latest commit, you can just use it, as long as the stable commit shows the correct attestation result.

@hiroki-chen
Author

@YurkoWasHere @seungsoo-lee

A very interesting thing I recently found was that when I tried to attest one of the two H100 GPUs on my host from inside the VM, SDK v1.2.0 worked fine but v1.1.0 would fail, whereas when I installed only one GPU, SDK v1.2.0 would report the error but v1.1.0 worked fine.

$ lspci -d 10de: 
41:00.0 3D controller: NVIDIA Corporation Device 2331 (rev a1)
61:00.0 3D controller: NVIDIA Corporation Device 2331 (rev a1)

# 1.2.0: OK 1.1.0: Fail

$ lspci -d 10de: 
61:00.0 3D controller: NVIDIA Corporation Device 2331 (rev a1)

# 1.2.0: Fail 1.1.0: OK

@seungsoo-lee

@hiroki-chen

If you did not try with k8s, how do you run confidential computing workloads/examples?

@seungsoo-lee

A very interesting thing I recently found was that when I tried to attest one of the H100 GPUs on my host inside the VM, the SDK V1.2.0 worked fine but v1.1.0 would fail whereas I installed only one GPU, SDK v1.2.0 would report the error but v1.1.0 worked fine.

what command did you try?

@hiroki-chen
Author

A very interesting thing I recently found was that when I tried to attest one of the H100 GPUs on my host inside the VM, the SDK V1.2.0 worked fine but v1.1.0 would fail whereas I installed only one GPU, SDK v1.2.0 would report the error but v1.1.0 worked fine.

what command did you try?

python3 -m verifier.cc_admin --user_mode --allow_hold_cert

@hiroki-chen
Author

@hiroki-chen

If you do not try with k8s, how to do confidential computing workloads/examples?

I just tried to run PyTorch examples inside the VM. If it performs ML tasks successfully, then the confidential computing functionality is working.
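
For example, a quick sanity check along those lines, assuming a CUDA-enabled PyTorch build is installed in the guest:

python3 -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))"

If it prints True and the H100's name, the GPU is usable from inside the CVM.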

@YurkoWasHere

We use PyTorch as well, directly, not in a container. Containerized workloads will be the next step.

On the guest, you can check that the GPU is in the correct state by running:
nvidia-smi conf-compute -grs

# nvidia-smi conf-compute -grs

Confidential Compute GPUs Ready state: ready

You can also force the GPU into a ready state without running the attestation by using -srs, but without a valid attestation you can't be confident about the environment.
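
That is, something along the lines of:

# nvidia-smi conf-compute -srs 1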

@thisiskarthikj
Collaborator

@YurkoWasHere @hiroki-chen

Sorry, late to the party here. The summary is that for the following combinations of driver and vBIOS versions, we get an index "9" mismatch, correct?

    Driver version : 535.86.10
    VBIOS version : 96.00.74.00.1a

    Driver version : 535.104.05
    VBIOS version : 96.00.74.00.1c

Is this happening with the latest commit, whereas you don't see this issue with an older commit? Once you confirm, we will try to re-create the setup and get back to you. Hang in there!

@hiroki-chen
Author

hiroki-chen commented Jan 7, 2024

@YurkoWasHere @hiroki-chen

Sorry, late to the party here. The summary is that for the following combinations of driver and vBIOS versions, we get index "9" mismatch, correct ?

    Driver version : 535.86.10
    VBIOS version : 96.00.74.00.1a

    Driver version : 535.104.05
VBIOS version : 96.00.74.00.1c

Is this happening with the latest commit whereas you don't see this issue with an older commit ? Once you confirm, we will try to re-create the setup and try it and will get back to you. Hang in there !

@thisiskarthikj

Technically it is (for the single-GPU case), but one thing that appeared weird to me was that I was able to attest the H100 GPU using the latest commit:

h100@h100-cvm:~$ python3 -m verifier.cc_admin --allow_hold_cert --user_mode
Number of GPUs available : 1
-----------------------------------
Fetching GPU 0 information from GPU driver.
Using the Nonce generated by Local GPU Verifier
VERIFYING GPU : 0
        Driver version fetched : 535.104.05
        VBIOS version fetched : 96.00.74.00.1a
        Validating GPU certificate chains.
                GPU attestation report certificate chain validation successful.
                        The certificate chain revocation status verification successful.
        Authenticating attestation report
                The nonce in the SPDM GET MEASUREMENT request message is matching with the generated nonce.
                Driver version fetched from the attestation report : 535.104.05
                VBIOS version fetched from the attestation report : 96.00.74.00.1a
                Attestation report signature verification successful.
                Attestation report verification successful.
        Authenticating the RIMs.
                Authenticating Driver RIM
                        Fetching the driver RIM from the RIM service.
                        RIM Schema validation passed.
                        driver RIM certificate chain verification successful.
                        The certificate chain revocation status verification successful.
                        driver RIM signature verification successful.
                        Driver RIM verification successful
                Authenticating VBIOS RIM.
                        Fetching the VBIOS RIM from the RIM service.
                        RIM Schema validation passed.
                        vbios RIM certificate chain verification successful.
                        The certificate chain revocation status verification successful.
                        vbios RIM signature verification successful.
                        VBIOS RIM verification successful
        Comparing measurements (runtime vs golden)
                        The runtime measurements are matching with the golden measurements.                            
                GPU is in expected state.
        GPU 0 verified successfully.
        GPU Attested Successfully

This happened after I installed another GPU on the machine:

$ lspci -d 10de:
41:00.0 3D controller: NVIDIA Corporation Device 2331 (rev a1)
61:00.0 3D controller: NVIDIA Corporation Device 2331 (rev a1)

And I let the first GPU (41:00.0) bind to VFIO and enabled CC for it.

sudo python3 gpu_cc_tool.py --gpu-bdf=41:00 --set-cc-mode on --reset-after-cc-mode-switch

1.2.0 could work for this scenario but not 1.1.0:

h100@h100-cvm:/shared/nvtrust/guest_tools/attestation_sdk$ python3 -m verifier.cc_admin --allow_hold_cert --user_mode
Number of GPUs available : 1
-----------------------------------
Fetching GPU 0 information from GPU driver.
Using the Nonce generated by Local GPU Verifier
using the new pinned root cert
VERIFYING GPU : 0
        Driver version fetched : 535.104.05
        VBIOS version fetched : 96.00.74.00.1a
        Validating GPU certificate chains.
                GPU attestation report certificate chain validation successful.
                The certificate chain revocation status verification successful.
        Authenticating attestation report
                The nonce in the SPDM GET MEASUREMENT request message is matching with the generated nonce.
                Driver version fetched from the attestation report : 535.104.05
                VBIOS version fetched from the attestation report : 96.00.74.00.1a
                Attestation report signature verification successful.
                Attestation report verification successful.
        Authenticating the RIMs.
                Authenticating Driver RIM
                        RIM Schema validation passed.
                        driver RIM certificate chain verification successful.
                        WARNING: THE CERTIFICATE NVIDIA Reference Value L3 GH100 001 IS REVOKED WITH THE STATUS AS 'CERTIFICATE_HOLD'.
                The certificate chain revocation status verification was not successful but continuing.
                        driver RIM signature verification successful.
                        Driver RIM verification successful
                Authenticating VBIOS RIM.
                        RIM Schema validation passed.
                        vbios RIM certificate chain verification successful.
                        WARNING: THE CERTIFICATE NVIDIA Reference Value L3 GH100 001 IS REVOKED WITH THE STATUS AS 'CERTIFICATE_HOLD'.
                The certificate chain revocation status verification was not successful but continuing.
                        vbios RIM signature verification successful.
                        VBIOS RIM verification successful
        Comparing measurements (runtime vs golden)
                        The runtime measurements are not matching with the
                        golden measurements at the following indexes(starting from 0) :
                        [
                        9, 
                        36
                        ]
The verification of GPU 0 resulted in failure.
        GPU Attestation failed

Summary

Multi-GPU

SDK 1.2.0: Works for multiple GPUs (although I used only one GPU inside the CVM)
SDK 1.1.0: Measurement mismatch for index 9 and 36.

Single-GPU

SDK 1.2.0: Measurement mismatch for index 9
SDK 1.1.0: Works fine.

The GPU worked in CC mode.

@seungsoo-lee

@hiroki-chen @YurkoWasHere

I'm a little confused. So, you mean that:

  • first, on the guest VM, the attestation test should be executed and the outputs should be True.

  • second, on the guest VM, after installing PyTorch, run one of the ML sample codes (is it okay to run very simple code?)

  • third, after the code runs, Confidential Compute GPUs Ready state becomes Ready.

  • fourth, k8s workloads can be successfully deployed.

@hiroki-chen
Author

hiroki-chen commented Jan 8, 2024

@hiroki-chen @YurkoWasHere

I'm little confused. So, you mean that

  • first, on the guest VM, the attestation test should be executed and the outputs should be True.
  • second, on the guest VM, after installing pytorch, and run one of the ML sample codes (is it okay to run very simple code?)
  • third, after the code runs, Confidential Compute GPUs Ready state becomes Ready.
  • fourth, k8s workloads can be successfully deployed.

Steps 1-3 are meant for preparatory purposes: to confirm that the GPU works in CC mode.

after the code runs, Confidential Compute GPUs Ready state becomes Ready.

Not necessarily (the script might be buggy, or you are running in user mode while setting the Ready state requires admin privileges). You can enable it manually via

sudo nvidia-smi conf-compute -srs 1

@YurkoWasHere

Still have the same issue with latest commit :(

root@(none):/init.d# python3 -m verifier.cc_admin --allow_hold_cert
Number of GPUs available : 1
-----------------------------------
Fetching GPU 0 information from GPU driver.
Using the Nonce generated by Local GPU Verifier
VERIFYING GPU : 0
	Driver version fetched : 535.104.05
	VBIOS version fetched : 96.00.74.00.1c
	Validating GPU certificate chains.
		GPU attestation report certificate chain validation successful.
			The certificate chain revocation status verification successful.
	Authenticating attestation report
		The nonce in the SPDM GET MEASUREMENT request message is matching with the generated nonce.
		Driver version fetched from the attestation report : 535.104.05
		VBIOS version fetched from the attestation report : 96.00.74.00.1c
		Attestation report signature verification successful.
		Attestation report verification successful.
	Authenticating the RIMs.
		Authenticating Driver RIM
			Fetching the driver RIM from the RIM service.
			RIM Schema validation passed.
			driver RIM certificate chain verification successful.
			The certificate chain revocation status verification successful.
			driver RIM signature verification successful.
			Driver RIM verification successful
		Authenticating VBIOS RIM.
			Fetching the VBIOS RIM from the RIM service.
			RIM Schema validation passed.
			vbios RIM certificate chain verification successful.
			The certificate chain revocation status verification successful.
			vbios RIM signature verification successful.
			VBIOS RIM verification successful
	Comparing measurements (runtime vs golden)
			The runtime measurements are not matching with the
                        golden measurements at the following indexes(starting from 0) :
			[
			9
			]
	GPU Ready state is already NOT READY
The verification of GPU 0 resulted in failure.
	GPU Attestation failed
root@(none):/init.d# 

@hiroki-chen
Author

Still have the same issue with latest commit :(

root@(none):/init.d# python3 -m verifier.cc_admin --allow_hold_cert
Number of GPUs available : 1
-----------------------------------
Fetching GPU 0 information from GPU driver.
Using the Nonce generated by Local GPU Verifier
VERIFYING GPU : 0
	Driver version fetched : 535.104.05
	VBIOS version fetched : 96.00.74.00.1c
	Validating GPU certificate chains.
		GPU attestation report certificate chain validation successful.
			The certificate chain revocation status verification successful.
	Authenticating attestation report
		The nonce in the SPDM GET MEASUREMENT request message is matching with the generated nonce.
		Driver version fetched from the attestation report : 535.104.05
		VBIOS version fetched from the attestation report : 96.00.74.00.1c
		Attestation report signature verification successful.
		Attestation report verification successful.
	Authenticating the RIMs.
		Authenticating Driver RIM
			Fetching the driver RIM from the RIM service.
			RIM Schema validation passed.
			driver RIM certificate chain verification successful.
			The certificate chain revocation status verification successful.
			driver RIM signature verification successful.
			Driver RIM verification successful
		Authenticating VBIOS RIM.
			Fetching the VBIOS RIM from the RIM service.
			RIM Schema validation passed.
			vbios RIM certificate chain verification successful.
			The certificate chain revocation status verification successful.
			vbios RIM signature verification successful.
			VBIOS RIM verification successful
	Comparing measurements (runtime vs golden)
			The runtime measurements are not matching with the
                        golden measurements at the following indexes(starting from 0) :
			[
			9
			]
	GPU Ready state is already NOT READY
The verification of GPU 0 resulted in failure.
	GPU Attestation failed
root@(none):/init.d# 

@YurkoWasHere I believe they haven't updated it yet.

@yunbo-xufeng

yunbo-xufeng commented Jun 5, 2024

Hi,

Has this problem been resolved?
I hit exactly the same issue with the latest commit:

Number of GPUs available : 1
-----------------------------------
Fetching GPU 0 information from GPU driver.
Using the Nonce generated by Local GPU Verifier
VERIFYING GPU : 0
	Driver version fetched : 535.104.05
	VBIOS version fetched : 96.00.74.00.1f
	Validating GPU certificate chains.
		GPU attestation report certificate chain validation successful.
			The certificate chain revocation status verification successful.
	Authenticating attestation report
		The nonce in the SPDM GET MEASUREMENT request message is matching with the generated nonce.
		Driver version fetched from the attestation report : 535.104.05
		VBIOS version fetched from the attestation report : 96.00.74.00.1f
		Attestation report signature verification successful.
		Attestation report verification successful.
	Authenticating the RIMs.
		Authenticating Driver RIM
			Fetching the driver RIM from the RIM service.
			RIM Schema validation passed.
			driver RIM certificate chain verification successful.
			The certificate chain revocation status verification successful.
			driver RIM signature verification successful.
			Driver RIM verification successful
		Authenticating VBIOS RIM.
			Fetching the VBIOS RIM from the RIM service.
			RIM Schema validation passed.
			vbios RIM certificate chain verification successful.
			The certificate chain revocation status verification successful.
			vbios RIM signature verification successful.
			VBIOS RIM verification successful
	Comparing measurements (runtime vs golden)
			The runtime measurements are not matching with the
                        golden measurements at the following indexes(starting from 0) :
			[
			9
			]
	GPU Ready state is already NOT READY
The verification of GPU 0 resulted in failure.
	GPU Attestation failed

@YurkoWasHere

Unfortunately I no longer have the H100 paired with an AMD system, but have moved on to Intel, which required the newest version. Last I checked (about a month ago) it still was not working.

If you are still having issues, try using the older commit:
git checkout 4383b82

@yunbo-xufeng

Unfortunately i no longer have the H100 paired with an AMD, but moved on to Intel which required the newest version. Last i checked (about a month ago) it still was not working

If you still are having issues try using the older commit git checkout 4383b82

Actually, I'm not using AMD; my environment is an Intel TDX CVM and the GPU is an H800.
I also ran the remote test, and it looks like the GPU measurement also does not match:

[RemoteGPUTest] node name : thisNode1
[['REMOTE_GPU_CLAIMS', <Devices.GPU: 2>, <Environment.REMOTE: 5>, 'https://nras.attestation.nvidia.com/v1/attest/gpu', '', '']]
[RemoteGPUTest] call attest() - expecting True
generate_evidence
Fetching GPU 0 information from GPU driver.
Calling NRAS to attest GPU evidence...
**** Attestation Successful ****
Entity Attestation Token is eyJraWQiOiJudi1lYXQta2lkLXByb2QtMjAyNDA2MDQyMzU4NDY4NTgtZjk4MjYwYzYtZmVlOC00ZTU3LWJlMDEtMTliNWE1YTkwNTc0IiwiYWxnIjoiRVMzODQifQ.eyJzdWIiOiJOVklESUEtR1BVLUFUVEVTVEFUSU9OIiwic2VjYm9vdCI6dHJ1ZSwieC1udmlkaWEtZ3B1LW1hbnVmYWN0dXJlciI6Ik5WSURJQSBDb3Jwb3JhdGlvbiIsIngtbnZpZGlhLWF0dGVzdGF0aW9uLXR5cGUiOiJHUFUiLCJpc3MiOiJodHRwczovL25yYXMuYXR0ZXN0YXRpb24ubnZpZGlhLmNvbSIsImVhdF9ub25jZSI6IjkzMUQ4REQwQUREMjAzQUMzRDhCNEZCREU3NUUxMTUyNzhFRUZDRENFQUM1Qjg3NjcxQTc0OEYzMjM2NERGQ0IiLCJ4LW52aWRpYS1hdHRlc3RhdGlvbi1kZXRhaWxlZC1yZXN1bHQiOnsieC1udmlkaWEtZ3B1LWRyaXZlci1yaW0tc2NoZW1hLXZhbGlkYXRlZCI6dHJ1ZSwieC1udmlkaWEtZ3B1LXZiaW9zLXJpbS1jZXJ0LXZhbGlkYXRlZCI6dHJ1ZSwieC1udmlkaWEtbWlzbWF0Y2gtbWVhc3VyZW1lbnQtcmVjb3JkcyI6W3siaW5kZXgiOjksImdvbGRlblNpemUiOjQ4LCJnb2xkZW5WYWx1ZSI6IjA1OWIzMmU3MTJhMTUzZjQ5MGRiZmI3OTc2YTllMjc1ZDc4OWUyOGJkNDgwM2MzNTdkZWYyYjYxMjMzMjdjNDMwNTI2YmZhZWNjMjAwZjQ5NmQ0ZTE0OWZjNWVhZGUwMyIsInJ1bnRpbWVTaXplIjo0OCwicnVudGltZVZhbHVlIjoiN2YzZTkzODI3ODU1MTNjMTkzMmRmY2M5ZTg3ZjZlZjZiZjVmZWZlODgxNDRjNmVhNDg1MzllNjVmOTM3MDEzZGQ3MzQ5MTQ0ZTVmNDM5ZGNlYTQwMWRhYzI2ZTVjMDk4In1dLCJ4LW52aWRpYS1ncHUtYXR0ZXN0YXRpb24tcmVwb3J0LWNlcnQtY2hhaW4tdmFsaWRhdGVkIjp0cnVlLCJ4LW52aWRpYS1ncHUtZHJpdmVyLXJpbS1zY2hlbWEtZmV0Y2hlZCI6dHJ1ZSwieC1udmlkaWEtZ3B1LWF0dGVzdGF0aW9uLXJlcG9ydC1wYXJzZWQiOnRydWUsIngtbnZpZGlhLWdwdS1ub25jZS1tYXRjaCI6dHJ1ZSwieC1udmlkaWEtZ3B1LXZiaW9zLXJpbS1zaWduYXR1cmUtdmVyaWZpZWQiOnRydWUsIngtbnZpZGlhLWdwdS1kcml2ZXItcmltLXNpZ25hdHVyZS12ZXJpZmllZCI6dHJ1ZSwieC1udmlkaWEtZ3B1LWFyY2gtY2hlY2siOnRydWUsIngtbnZpZGlhLWF0dGVzdGF0aW9uLXdhcm5pbmciOm51bGwsIngtbnZpZGlhLWdwdS1tZWFzdXJlbWVudHMtbWF0Y2giOmZhbHNlLCJ4LW52aWRpYS1taXNtYXRjaC1pbmRleGVzIjpbOV0sIngtbnZpZGlhLWdwdS1hdHRlc3RhdGlvbi1yZXBvcnQtc2lnbmF0dXJlLXZlcmlmaWVkIjp0cnVlLCJ4LW52aWRpYS1ncHUtdmJpb3MtcmltLXNjaGVtYS12YWxpZGF0ZWQiOnRydWUsIngtbnZpZGlhLWdwdS1kcml2ZXItcmltLWNlcnQtdmFsaWRhdGVkIjp0cnVlLCJ4LW52aWRpYS1ncHUtdmJpb3MtcmltLXNjaGVtYS1mZXRjaGVkIjp0cnVlLCJ4LW52aWRpYS1ncHUtdmJpb3MtcmltLW1lYXN1cmVtZW50cy1hdmFpbGFibGUiOnRydWUsIngtbnZpZGlhLWdwdS1kcml2ZXItcmltLWRyaXZlci1tZWFzdXJlbWVudHMtYXZhaWxhYmxlIjp0cnVlfSwieC1udmlkaWEtdmVyIjoiMS4wIiwibmJmIjoxNzE3NTczOTg5LCJ4LW52aWRpYS1ncHUtZHJpdmVyLXZlcnNpb24iOiI1MzUuMTA0LjA1IiwiZGJnc3RhdCI6ImRpc2FibGVkIiwiaHdtb2RlbCI6IkdIMTAwIEEwMSBHU1AgQlJPTSIsIm9lbWlkIjoiNTcwMyIsIm1lYXNyZXMiOiJjb21wYXJpc29uLWZhaWwiLCJleHAiOjE3MTc1Nzc1ODksImlhdCI6MTcxNzU3Mzk4OSwieC1udmlkaWEtZWF0LXZlciI6IkVBVC0yMSIsInVlaWQiOiI0NzkxNTk0NTUzMTI5NDk4MDg5NzcwMzQwMTMxMzA5NTEzNzI5MDc0Mjc3OTc4NzYiLCJ4LW52aWRpYS1ncHUtdmJpb3MtdmVyc2lvbiI6Ijk2LjAwLjc0LjAwLjFGIiwianRpIjoiODQ1ZWU4ZDQtN2IzMS00M2E3LWI1NjQtNmI4ZGNjMmVkY2JmIn0.5ecQ6aopvHsTuCXN9tqfmZKVTAB4VzW5auoNgwVlSeGNbqXoSm8PmEsRmQLO6btjeyTOV-iNixJnDqjbuNjuR8_qRw5uWwLAZUd-cJAwLYjmOPPKObJbDF1H8TalDOC2
True
[RemoteGPUTest] token : [["JWT", "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJOVi1BdHRlc3RhdGlvbi1TREsiLCJpYXQiOjE3MTc1NzM5ODUsImV4cCI6bnVsbH0.2UbZe2h_TLQTad2QjDIKIXuaSiTpDM7oj3nlhRNlqqY"], {"REMOTE_GPU_CLAIMS": "eyJraWQiOiJudi1lYXQta2lkLXByb2QtMjAyNDA2MDQyMzU4NDY4NTgtZjk4MjYwYzYtZmVlOC00ZTU3LWJlMDEtMTliNWE1YTkwNTc0IiwiYWxnIjoiRVMzODQifQ.eyJzdWIiOiJOVklESUEtR1BVLUFUVEVTVEFUSU9OIiwic2VjYm9vdCI6dHJ1ZSwieC1udmlkaWEtZ3B1LW1hbnVmYWN0dXJlciI6Ik5WSURJQSBDb3Jwb3JhdGlvbiIsIngtbnZpZGlhLWF0dGVzdGF0aW9uLXR5cGUiOiJHUFUiLCJpc3MiOiJodHRwczovL25yYXMuYXR0ZXN0YXRpb24ubnZpZGlhLmNvbSIsImVhdF9ub25jZSI6IjkzMUQ4REQwQUREMjAzQUMzRDhCNEZCREU3NUUxMTUyNzhFRUZDRENFQUM1Qjg3NjcxQTc0OEYzMjM2NERGQ0IiLCJ4LW52aWRpYS1hdHRlc3RhdGlvbi1kZXRhaWxlZC1yZXN1bHQiOnsieC1udmlkaWEtZ3B1LWRyaXZlci1yaW0tc2NoZW1hLXZhbGlkYXRlZCI6dHJ1ZSwieC1udmlkaWEtZ3B1LXZiaW9zLXJpbS1jZXJ0LXZhbGlkYXRlZCI6dHJ1ZSwieC1udmlkaWEtbWlzbWF0Y2gtbWVhc3VyZW1lbnQtcmVjb3JkcyI6W3siaW5kZXgiOjksImdvbGRlblNpemUiOjQ4LCJnb2xkZW5WYWx1ZSI6IjA1OWIzMmU3MTJhMTUzZjQ5MGRiZmI3OTc2YTllMjc1ZDc4OWUyOGJkNDgwM2MzNTdkZWYyYjYxMjMzMjdjNDMwNTI2YmZhZWNjMjAwZjQ5NmQ0ZTE0OWZjNWVhZGUwMyIsInJ1bnRpbWVTaXplIjo0OCwicnVudGltZVZhbHVlIjoiN2YzZTkzODI3ODU1MTNjMTkzMmRmY2M5ZTg3ZjZlZjZiZjVmZWZlODgxNDRjNmVhNDg1MzllNjVmOTM3MDEzZGQ3MzQ5MTQ0ZTVmNDM5ZGNlYTQwMWRhYzI2ZTVjMDk4In1dLCJ4LW52aWRpYS1ncHUtYXR0ZXN0YXRpb24tcmVwb3J0LWNlcnQtY2hhaW4tdmFsaWRhdGVkIjp0cnVlLCJ4LW52aWRpYS1ncHUtZHJpdmVyLXJpbS1zY2hlbWEtZmV0Y2hlZCI6dHJ1ZSwieC1udmlkaWEtZ3B1LWF0dGVzdGF0aW9uLXJlcG9ydC1wYXJzZWQiOnRydWUsIngtbnZpZGlhLWdwdS1ub25jZS1tYXRjaCI6dHJ1ZSwieC1udmlkaWEtZ3B1LXZiaW9zLXJpbS1zaWduYXR1cmUtdmVyaWZpZWQiOnRydWUsIngtbnZpZGlhLWdwdS1kcml2ZXItcmltLXNpZ25hdHVyZS12ZXJpZmllZCI6dHJ1ZSwieC1udmlkaWEtZ3B1LWFyY2gtY2hlY2siOnRydWUsIngtbnZpZGlhLWF0dGVzdGF0aW9uLXdhcm5pbmciOm51bGwsIngtbnZpZGlhLWdwdS1tZWFzdXJlbWVudHMtbWF0Y2giOmZhbHNlLCJ4LW52aWRpYS1taXNtYXRjaC1pbmRleGVzIjpbOV0sIngtbnZpZGlhLWdwdS1hdHRlc3RhdGlvbi1yZXBvcnQtc2lnbmF0dXJlLXZlcmlmaWVkIjp0cnVlLCJ4LW52aWRpYS1ncHUtdmJpb3MtcmltLXNjaGVtYS12YWxpZGF0ZWQiOnRydWUsIngtbnZpZGlhLWdwdS1kcml2ZXItcmltLWNlcnQtdmFsaWRhdGVkIjp0cnVlLCJ4LW52aWRpYS1ncHUtdmJpb3MtcmltLXNjaGVtYS1mZXRjaGVkIjp0cnVlLCJ4LW52aWRpYS1ncHUtdmJpb3MtcmltLW1lYXN1cmVtZW50cy1hdmFpbGFibGUiOnRydWUsIngtbnZpZGlhLWdwdS1kcml2ZXItcmltLWRyaXZlci1tZWFzdXJlbWVudHMtYXZhaWxhYmxlIjp0cnVlfSwieC1udmlkaWEtdmVyIjoiMS4wIiwibmJmIjoxNzE3NTczOTg5LCJ4LW52aWRpYS1ncHUtZHJpdmVyLXZlcnNpb24iOiI1MzUuMTA0LjA1IiwiZGJnc3RhdCI6ImRpc2FibGVkIiwiaHdtb2RlbCI6IkdIMTAwIEEwMSBHU1AgQlJPTSIsIm9lbWlkIjoiNTcwMyIsIm1lYXNyZXMiOiJjb21wYXJpc29uLWZhaWwiLCJleHAiOjE3MTc1Nzc1ODksImlhdCI6MTcxNzU3Mzk4OSwieC1udmlkaWEtZWF0LXZlciI6IkVBVC0yMSIsInVlaWQiOiI0NzkxNTk0NTUzMTI5NDk4MDg5NzcwMzQwMTMxMzA5NTEzNzI5MDc0Mjc3OTc4NzYiLCJ4LW52aWRpYS1ncHUtdmJpb3MtdmVyc2lvbiI6Ijk2LjAwLjc0LjAwLjFGIiwianRpIjoiODQ1ZWU4ZDQtN2IzMS00M2E3LWI1NjQtNmI4ZGNjMmVkY2JmIn0.5ecQ6aopvHsTuCXN9tqfmZKVTAB4VzW5auoNgwVlSeGNbqXoSm8PmEsRmQLO6btjeyTOV-iNixJnDqjbuNjuR8_qRw5uWwLAZUd-cJAwLYjmOPPKObJbDF1H8TalDOC2"}]
[RemoteGPUTest] call validate_token() - expecting True
***** Validating Signature using JWKS endpont https://nras.attestation.nvidia.com/.well-known/jwks.json ******
Decoded Token  {
  "sub": "NVIDIA-GPU-ATTESTATION",
  "secboot": true,
  "x-nvidia-gpu-manufacturer": "NVIDIA Corporation",
  "x-nvidia-attestation-type": "GPU",
  "iss": "https://nras.attestation.nvidia.com",
  "eat_nonce": "931D8DD0ADD203AC3D8B4FBDE75E115278EEFCDCEAC5B87671A748F32364DFCB",
  "x-nvidia-attestation-detailed-result": {
    "x-nvidia-gpu-driver-rim-schema-validated": true,
    "x-nvidia-gpu-vbios-rim-cert-validated": true,
    "x-nvidia-mismatch-measurement-records": [
      {
        "index": 9,
        "goldenSize": 48,
        "goldenValue": "059b32e712a153f490dbfb7976a9e275d789e28bd4803c357def2b6123327c430526bfaecc200f496d4e149fc5eade03",
        "runtimeSize": 48,
        "runtimeValue": "7f3e9382785513c1932dfcc9e87f6ef6bf5fefe88144c6ea48539e65f937013dd7349144e5f439dcea401dac26e5c098"
      }
    ],
    "x-nvidia-gpu-attestation-report-cert-chain-validated": true,
    "x-nvidia-gpu-driver-rim-schema-fetched": true,
    "x-nvidia-gpu-attestation-report-parsed": true,
    "x-nvidia-gpu-nonce-match": true,
    "x-nvidia-gpu-vbios-rim-signature-verified": true,
    "x-nvidia-gpu-driver-rim-signature-verified": true,
    "x-nvidia-gpu-arch-check": true,
    "x-nvidia-attestation-warning": null,
    "x-nvidia-gpu-measurements-match": false,
    "x-nvidia-mismatch-indexes": [
      9
    ],
    "x-nvidia-gpu-attestation-report-signature-verified": true,
    "x-nvidia-gpu-vbios-rim-schema-validated": true,
    "x-nvidia-gpu-driver-rim-cert-validated": true,
    "x-nvidia-gpu-vbios-rim-schema-fetched": true,
    "x-nvidia-gpu-vbios-rim-measurements-available": true,
    "x-nvidia-gpu-driver-rim-driver-measurements-available": true
  },
  "x-nvidia-ver": "1.0",
  "nbf": 1717573989,
  "x-nvidia-gpu-driver-version": "535.104.05",
  "dbgstat": "disabled",
  "hwmodel": "GH100 A01 GSP BROM",
  "oemid": "5703",
  "measres": "comparison-fail",
  "exp": 1717577589,
  "iat": 1717573989,
  "x-nvidia-eat-ver": "EAT-21",
  "ueid": "479159455312949808977034013130951372907427797876",
  "x-nvidia-gpu-vbios-version": "96.00.74.00.1F",
  "jti": "845ee8d4-7b31-43a7-b564-6b8dcc2edcbf"
}
***** JWT token signature is valid. *****
	[ERROR] Invalid token. Authorized claims does not match the appraisal policy:  x-nvidia-gpu-measurements-match
False

@hiroki-chen
Author

@yunbo-xufeng I'm not sure if the old commit works for the H800, but the H100 is supported :/ Did you try the 4383b82 commit? If you tried that commit and remote attestation still failed, then I think you'll probably have to wait for NVIDIA's team to fix this issue.

@thisiskarthikj
Collaborator

@hiroki-chen @yunbo-xufeng A measurement mismatch could be an issue with the RIM file itself. We will take a look and get back to you.

@thisiskarthikj
Collaborator

@yunbo-xufeng Can you get me the version of nvidia_gpu_tools.py that you are using?

python3 nvidia_gpu_tools.py --help | grep version
