Intel LA57 Support #6

Open
Champ-Goblem opened this issue Apr 8, 2024 · 2 comments

@Champ-Goblem

We are trying to test PVM on GCP using their newer machine types (N4/C3). However, when we load the PVM kernel module, we see the following error in the kernel logs:

[  152.852592] kvm_pvm: Supporting for LA57 host is not fully implemented yet.

We worked around this error by building the kernel with 5-level paging support disabled (CONFIG_X86_5LEVEL=n). However, it would be good to know whether LA57 support will be added at some point, and ideally to use this issue to track that work.

Thanks
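
For readers who want to check whether a host advertises LA57 before loading the module, one option is to look for the `la57` token in the `flags` line of /proc/cpuinfo. The sketch below only shows whole-token matching on such a line. The helper name `cpu_flags_contain` is hypothetical and is not part of PVM or the kernel, and whether the flag tracks the active paging mode (rather than CPU capability) can depend on the kernel version, so treat the result as a hint.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true if `flag` appears as a whole whitespace-separated token in
 * `flags_line` (the text of a "flags" line from /proc/cpuinfo).
 * Hypothetical helper for illustration; not part of PVM or the kernel.
 */
static bool cpu_flags_contain(const char *flags_line, const char *flag)
{
    size_t len = strlen(flag);
    const char *p = flags_line;

    if (len == 0)
        return false;

    while ((p = strstr(p, flag)) != NULL) {
        /* The match must start and end on a token boundary. */
        bool starts_token = (p == flags_line) || p[-1] == ' ' || p[-1] == '\t';
        bool ends_token = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';

        if (starts_token && ends_token)
            return true;
        p += len;
    }
    return false;
}
```

A caller would read the `flags` line from /proc/cpuinfo and pass it in together with the string "la57".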

@laijs

laijs commented Apr 9, 2024

@Champ-Goblem Thanks for the report. We are adding support for it and will report progress here as often as we can.

@bysui bysui added the enhancement New feature or request label Apr 9, 2024
bysui added a commit that referenced this issue Apr 25, 2024
For a 4-level paging mode PVM guest, only the top 128TB is canonical.
When KASLR is enabled for 5-level page tables, this range overlaps with
the KASLR entropy range. Therefore, set the end address to -128TB to
reserve a range for the PVM guest. Regarding the KASAN area, the size is
sufficient.

Signed-off-by: Hou Wenlong <[email protected]>
Link: #6
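
The "-128TB" figure in the commit message above can be checked with a little arithmetic: with 48-bit virtual addresses (4-level paging), the upper canonical half is exactly the top 128TB of the 64-bit address space. The sketch below is illustrative only. The helper names `upper_half_start` and `minus_tb` are assumptions made for this example and do not come from the PVM patches.

```c
#include <stdint.h>

#define TB_SHIFT 40 /* 1 TB = 2^40 bytes */

/*
 * Lowest address of the upper canonical half for a given virtual address
 * width (48 for 4-level paging, 57 for 5-level paging).
 * Hypothetical helper for illustration.
 */
static uint64_t upper_half_start(unsigned int vaddr_bits)
{
    /* Bits 63 down to (vaddr_bits - 1) are ones, all lower bits are zero. */
    return ~UINT64_C(0) << (vaddr_bits - 1);
}

/* Address lying `tb` terabytes below the top of the address space ("-tb TB"). */
static uint64_t minus_tb(uint64_t tb)
{
    return UINT64_C(0) - (tb << TB_SHIFT);
}
```

Here `minus_tb(128)` and `upper_half_start(48)` both evaluate to 0xffff800000000000, while `upper_half_start(57)` is 0xff00000000000000, which is why a 4-level guest can only use the topmost 128TB of a 5-level host's kernel address space.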
bysui added a commit that referenced this issue Apr 25, 2024
…ld SP role

The 'role.host_mmu_la57_top_p4d' bit is only allowed for L4 SP and
should not be inherited when calculating the child SP role. Otherwise,
wrong spte will be set in drop_parent_pte() and it will result in a
broken SPT.

Signed-off-by: Hou Wenlong <[email protected]>
Link: #6
bysui added a commit that referenced this issue Apr 25, 2024
When 5-level paging mode is enabled on the host, the guest can be either
in 4-level paging mode or 5-level paging mode. For 4-level paging mode,
only the topmost 128TB is canonical. Therefore, the hypervisor needs to
reserve two ranges: one in the vmalloc area for the 5-level paging mode
guest, and another in the topmost 128TB for the 4-level paging mode
guest. If the allocation of the range for the 5-level paging mode guest
fails, then 5-level paging mode is disabled for the guest.

Signed-off-by: Hou Wenlong <[email protected]>
Link: #6
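
To make the "only the topmost 128TB is canonical" point concrete, here is a minimal sketch of a canonical-address check parameterized by virtual address width. The function name `is_canonical` is an assumption for this example. The kernel has its own helpers for this, which are not reproduced here.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * An address is canonical for a given width when bits 63 down to
 * (vaddr_bits - 1) are either all zeros or all ones.
 * Hypothetical helper for illustration; vaddr_bits is 48 or 57 here.
 */
static bool is_canonical(uint64_t addr, unsigned int vaddr_bits)
{
    uint64_t upper = addr >> (vaddr_bits - 1);
    uint64_t all_ones = ~UINT64_C(0) >> (vaddr_bits - 1);

    return upper == 0 || upper == all_ones;
}
```

With this check, 0xff00000000000000 (the bottom of the 5-level kernel half) is canonical for 57 bits but not for 48 bits, whereas 0xffff800000000000 is canonical for both. That difference is why the range reserved for a 4-level guest has to sit in the topmost 128TB.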
bysui added a commit that referenced this issue Apr 25, 2024
…l paging mode

According to the PVM specification, a flag in the PVM_HC_LOAD_PGTBL
hypercall is allowed to directly change the paging mode. Therefore, add
the missing flags when the guest is in 5-level paging mode. This
preparation is done to support 5-level paging mode guests.

Signed-off-by: Hou Wenlong <[email protected]>
Link: #6
bysui added a commit that referenced this issue Apr 25, 2024
Similar to the 4-level paging mode guest, the 5-level paging mode guest
should lie within the allowed range provided by the hypervisor.

Signed-off-by: Hou Wenlong <[email protected]>
Link: #6
bysui added a commit that referenced this issue Apr 25, 2024
The 5-level paging mode is enabled during compressed kernel booting,
which uses the CPUID instruction to detect 5-level paging support. For
a PVM guest, pvm_cpuid() should be used instead of the CPUID
instruction, so detect PVM hypervisor support early in
configure_5level_paging(). Additionally, relocation of the PVM guest
during booting should be avoided: only the first 4G is identity-mapped,
so if physical address randomization is enabled, a #PF exception will
occur when the chosen output address falls outside that range.
Therefore, for simplicity, physical address randomization should be
avoided. Virtual address randomization should occur after entering the
kernel entry.

Signed-off-by: Hou Wenlong <[email protected]>
Link: #6
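
As a rough illustration of the two checks this commit message describes, the sketch below tests the LA57 feature bit (CPUID leaf 7, subleaf 0, ECX bit 16) in a value the caller has already obtained, and checks whether a candidate output range stays inside the first 4G. Both helper names are hypothetical. How the ECX value is fetched (the CPUID instruction or pvm_cpuid()) is deliberately left out, since the signature of pvm_cpuid() is not shown in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

#define CPUID_7_0_ECX_LA57 (UINT32_C(1) << 16) /* CPUID.(EAX=7,ECX=0):ECX[16] */
#define FOUR_GIB           (UINT64_C(1) << 32)

/* True if the given leaf-7 ECX value advertises 5-level paging (LA57). */
static bool ecx_reports_la57(uint32_t leaf7_ecx)
{
    return (leaf7_ecx & CPUID_7_0_ECX_LA57) != 0;
}

/*
 * True if [output, output + size) lies entirely below 4G, i.e. inside the
 * identity-mapped range described above. Written to avoid overflow.
 */
static bool fits_below_4g(uint64_t output, uint64_t size)
{
    return output < FOUR_GIB && size <= FOUR_GIB - output;
}
```

An output address for which `fits_below_4g()` fails is the kind that would trigger the #PF described above, which is why avoiding physical address randomization is the simpler choice.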
@bysui bysui self-assigned this Apr 25, 2024
@bysui
Collaborator

bysui commented Apr 25, 2024

Hi @Champ-Goblem, we have added basic support for 5-level paging mode hosts and 5-level paging mode guests. I have tested booting both 4-level and 5-level paging mode guests on a 5-level paging mode host, but the support is not fully tested, as we don't have CI/CD yet. Please note that migrating a 4-level paging mode guest from a 5-level paging mode host to a 4-level paging mode host does not currently work.

If any problems occur, please feel free to report them. Thanks!

pojntfx added a commit to loopholelabs/linux-pvm-ci that referenced this issue Apr 26, 2024
pojntfx pushed a commit to loopholelabs/linux-pvm that referenced this issue Oct 18, 2024
…te_call_indirect

kprobe_emulate_call_indirect currently uses int3_emulate_call to emulate
indirect calls. However, int3_emulate_call always assumes the size of
the call to be 5 bytes when calculating the return address. This is
incorrect for register-based indirect calls in x86, which can be either
2 or 3 bytes depending on whether REX prefix is used. At kprobe runtime,
the incorrect return address causes control flow to land onto the wrong
place after return -- possibly not a valid instruction boundary. This
can lead to a panic like the following:

[    7.308204][    C1] BUG: unable to handle page fault for address: 000000000002b4d8
[    7.308883][    C1] #PF: supervisor read access in kernel mode
[    7.309168][    C1] #PF: error_code(0x0000) - not-present page
[    7.309461][    C1] PGD 0 P4D 0
[    7.309652][    C1] Oops: 0000 [#1] SMP
[    7.309929][    C1] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.7.0-rc5-trace-for-next virt-pvm#6
[    7.310397][    C1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-20220807_005459-localhost 04/01/2014
[    7.311068][    C1] RIP: 0010:__common_interrupt+0x52/0xc0
[    7.311349][    C1] Code: 01 00 4d 85 f6 74 39 49 81 fe 00 f0 ff ff 77 30 4c 89 f7 4d 8b 5e 68 41 ba 91 76 d8 42 45 03 53 fc 74 02 0f 0b cc ff d3 65 48 <8b> 05 30 c7 ff 7e 65 4c 89 3d 28 c7 ff 7e 5b 41 5c 41 5e 41 5f c3
[    7.312512][    C1] RSP: 0018:ffffc900000e0fd0 EFLAGS: 00010046
[    7.312899][    C1] RAX: 0000000000000001 RBX: 0000000000000023 RCX: 0000000000000001
[    7.313334][    C1] RDX: 00000000000003cd RSI: 0000000000000001 RDI: ffff888100d302a4
[    7.313702][    C1] RBP: 0000000000000001 R08: 0ef439818636191f R09: b1621ff338a3b482
[    7.314146][    C1] R10: ffffffff81e5127b R11: ffffffff81059810 R12: 0000000000000023
[    7.314509][    C1] R13: 0000000000000000 R14: ffff888100d30200 R15: 0000000000000000
[    7.314951][    C1] FS:  0000000000000000(0000) GS:ffff88813bc80000(0000) knlGS:0000000000000000
[    7.315396][    C1] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    7.315691][    C1] CR2: 000000000002b4d8 CR3: 0000000003028003 CR4: 0000000000370ef0
[    7.316153][    C1] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    7.316508][    C1] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    7.316948][    C1] Call Trace:
[    7.317123][    C1]  <IRQ>
[    7.317279][    C1]  ? __die_body+0x64/0xb0
[    7.317482][    C1]  ? page_fault_oops+0x248/0x370
[    7.317712][    C1]  ? __wake_up+0x96/0xb0
[    7.317964][    C1]  ? exc_page_fault+0x62/0x130
[    7.318211][    C1]  ? asm_exc_page_fault+0x22/0x30
[    7.318444][    C1]  ? __cfi_native_send_call_func_single_ipi+0x10/0x10
[    7.318860][    C1]  ? default_idle+0xb/0x10
[    7.319063][    C1]  ? __common_interrupt+0x52/0xc0
[    7.319330][    C1]  common_interrupt+0x78/0x90
[    7.319546][    C1]  </IRQ>
[    7.319679][    C1]  <TASK>
[    7.319854][    C1]  asm_common_interrupt+0x22/0x40
[    7.320082][    C1] RIP: 0010:default_idle+0xb/0x10
[    7.320309][    C1] Code: 4c 01 c7 4c 29 c2 e9 72 ff ff ff cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 b8 0c 67 40 a5 66 90 0f 00 2d 09 b9 3b 00 fb f4 <fa> c3 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 b8 0c 67 40 a5 e9
[    7.321449][    C1] RSP: 0018:ffffc9000009bee8 EFLAGS: 00000256
[    7.321808][    C1] RAX: ffff88813bca8b68 RBX: 0000000000000001 RCX: 000000000001ef0c
[    7.322227][    C1] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 000000000001ef0c
[    7.322656][    C1] RBP: ffffc9000009bef8 R08: 8000000000000000 R09: 00000000000008c2
[    7.323083][    C1] R10: 0000000000000000 R11: ffffffff81058e70 R12: 0000000000000000
[    7.323530][    C1] R13: ffff8881002b30c0 R14: 0000000000000000 R15: 0000000000000000
[    7.323948][    C1]  ? __cfi_lapic_next_deadline+0x10/0x10
[    7.324239][    C1]  default_idle_call+0x31/0x50
[    7.324464][    C1]  do_idle+0xd3/0x240
[    7.324690][    C1]  cpu_startup_entry+0x25/0x30
[    7.324983][    C1]  start_secondary+0xb4/0xc0
[    7.325217][    C1]  secondary_startup_64_no_verify+0x179/0x17b
[    7.325498][    C1]  </TASK>
[    7.325641][    C1] Modules linked in:
[    7.325906][    C1] CR2: 000000000002b4d8
[    7.326104][    C1] ---[ end trace 0000000000000000 ]---
[    7.326354][    C1] RIP: 0010:__common_interrupt+0x52/0xc0
[    7.326614][    C1] Code: 01 00 4d 85 f6 74 39 49 81 fe 00 f0 ff ff 77 30 4c 89 f7 4d 8b 5e 68 41 ba 91 76 d8 42 45 03 53 fc 74 02 0f 0b cc ff d3 65 48 <8b> 05 30 c7 ff 7e 65 4c 89 3d 28 c7 ff 7e 5b 41 5c 41 5e 41 5f c3
[    7.327570][    C1] RSP: 0018:ffffc900000e0fd0 EFLAGS: 00010046
[    7.327910][    C1] RAX: 0000000000000001 RBX: 0000000000000023 RCX: 0000000000000001
[    7.328273][    C1] RDX: 00000000000003cd RSI: 0000000000000001 RDI: ffff888100d302a4
[    7.328632][    C1] RBP: 0000000000000001 R08: 0ef439818636191f R09: b1621ff338a3b482
[    7.329223][    C1] R10: ffffffff81e5127b R11: ffffffff81059810 R12: 0000000000000023
[    7.329780][    C1] R13: 0000000000000000 R14: ffff888100d30200 R15: 0000000000000000
[    7.330193][    C1] FS:  0000000000000000(0000) GS:ffff88813bc80000(0000) knlGS:0000000000000000
[    7.330632][    C1] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    7.331050][    C1] CR2: 000000000002b4d8 CR3: 0000000003028003 CR4: 0000000000370ef0
[    7.331454][    C1] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    7.331854][    C1] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    7.332236][    C1] Kernel panic - not syncing: Fatal exception in interrupt
[    7.332730][    C1] Kernel Offset: disabled
[    7.333044][    C1] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---

The relevant assembly code is (from objdump, faulting address
highlighted):

ffffffff8102ed9d:       41 ff d3                  call   *%r11
ffffffff8102eda0:       65 48 <8b> 05 30 c7 ff    mov    %gs:0x7effc730(%rip),%rax

The emulation incorrectly sets the return address to be ffffffff8102ed9d
+ 0x5 = ffffffff8102eda2, which is the 8b byte in the middle of the next
mov. This in turn causes incorrect subsequent instruction decoding and
eventually triggers the page fault above.

Instead of invoking int3_emulate_call, perform push and jmp emulation
directly in kprobe_emulate_call_indirect. At this point we can obtain
the instruction size from p->ainsn.size so that we can calculate the
correct return address.
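
The return-address arithmetic described above can be reproduced in a few lines. This is a sketch of the calculation only, not the kernel's emulation code. The helper names and the `CALL_REL32_SIZE` macro are assumptions for this example. The addresses come from the objdump excerpt in this message.

```c
#include <stdbool.h>
#include <stdint.h>

#define CALL_REL32_SIZE 5 /* size of a direct "call rel32" instruction */

/*
 * Size of a register-based indirect call (opcode ff /2 with a register
 * operand): 2 bytes, or 3 bytes when a REX prefix is present.
 * Hypothetical helper for illustration.
 */
static unsigned int indirect_reg_call_size(bool has_rex)
{
    return has_rex ? 3 : 2;
}

/* The return address is the address of the instruction after the call. */
static uint64_t return_address(uint64_t call_ip, unsigned int insn_size)
{
    return call_ip + insn_size;
}
```

For `41 ff d3` (`call *%r11`, which carries a REX prefix) at ffffffff8102ed9d, the correct return address is ffffffff8102eda0, whereas assuming a 5-byte call gives ffffffff8102eda2, the `8b` byte inside the following mov.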

Link: https://lore.kernel.org/all/[email protected]/

Fixes: 6256e66 ("x86/kprobes: Use int3 instead of debug trap for single-step")
Cc: [email protected]
Signed-off-by: Jinghao Jia <[email protected]>
Signed-off-by: Masami Hiramatsu (Google) <[email protected]>