PVM host kernel panic after restore from snapshot in Cloud Hypervisor on EC2 #7
Comments
Hi @pojntfx, thank you for your CI/CD testing for PVM. We really appreciate it. Actually, there is currently a major problem with live migration between different hosts. As can be seen from the error log of Cloud Hypervisor, the restoration of the guest MSRs fails, because the PVM virtual address range is allocated dynamically and therefore usually differs between the two hosts. A workaround is to pin the range to a fixed position at the top of the vmalloc area:

```diff
diff --git a/arch/x86/kvm/pvm/host_mmu.c b/arch/x86/kvm/pvm/host_mmu.c
index 35e97f4f7055..047e7679fe2d 100644
--- a/arch/x86/kvm/pvm/host_mmu.c
+++ b/arch/x86/kvm/pvm/host_mmu.c
@@ -35,8 +35,11 @@ static int __init guest_address_space_init(void)
return -1;
}
- pvm_va_range_l4 = get_vm_area_align(DEFAULT_RANGE_L4_SIZE, PT_L4_SIZE,
- VM_ALLOC|VM_NO_GUARD);
+ //pvm_va_range_l4 = get_vm_area_align(DEFAULT_RANGE_L4_SIZE, PT_L4_SIZE,
+ // VM_ALLOC|VM_NO_GUARD);
+ pvm_va_range_l4 = __get_vm_area_caller(DEFAULT_RANGE_L4_SIZE, VM_ALLOC|VM_NO_GUARD,
+ VMALLOC_END - DEFAULT_RANGE_L4_SIZE, VMALLOC_END,
+ __builtin_return_address(0));
if (!pvm_va_range_l4)
return -1;
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 6e4b95f24bd8..bf89f9184b62 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2622,6 +2622,7 @@ struct vm_struct *__get_vm_area_caller(unsigned long size, unsigned long flags,
return __get_vm_area_node(size, 1, PAGE_SHIFT, flags, start, end,
NUMA_NO_NODE, GFP_KERNEL, caller);
}
+EXPORT_SYMBOL_GPL(__get_vm_area_caller);
```

I've tried to use the provided host config file, but I cannot reproduce the issue. Without the workaround, the restoration fails and a new VM boots instead. However, with the workaround in place, the restoration is successful on the second host. From the host kernel crash log, it appears that the first problem is due to a NULL pointer access of `pvcs_gpc.khva`. Additionally, based on the log, it seems that the issue may be related to XSAVE state restoration, so I added the following debug logging:

```diff
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 29413cb2f090..72b2a0964df8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5517,12 +5517,20 @@ static void kvm_vcpu_ioctl_x86_get_xsave2(struct kvm_vcpu *vcpu,
*/
u64 supported_xcr0 = vcpu->arch.guest_supported_xcr0 |
XFEATURE_MASK_FPSSE;
+ union fpregs_state *ustate = (void *)state;
if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
return;
fpu_copy_guest_fpstate_to_uabi(&vcpu->arch.guest_fpu, state, size,
supported_xcr0, vcpu->arch.pkru);
+
+ pr_info("during getting:\n guest xcr0: %llx, host xcr0: %llx, supported_xcr0: %llx\n",
+ vcpu->arch.xcr0, host_xcr0, supported_xcr0);
+ pr_info("guest pkru: %x, host pkru: %x\n",
+ vcpu->arch.pkru, vcpu->arch.host_pkru);
+ pr_info("xfeatures: %llx, xcomp_bv: %llx\n",
+ ustate->xsave.header.xfeatures, ustate->xsave.header.xcomp_bv);
}
static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
@@ -5535,9 +5543,17 @@ static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
struct kvm_xsave *guest_xsave)
{
+ union fpregs_state *ustate = (void *)guest_xsave->region;
+
if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
return 0;
+ pr_info("during settting:\n guest xcr0: %llx, host xcr0: %llx, supported_xcr0: %llx\n",
+ vcpu->arch.xcr0, host_xcr0, kvm_caps.supported_xcr0);
+ pr_info("guest pkru: %x, host pkru: %x\n",
+ vcpu->arch.pkru, vcpu->arch.host_pkru);
+ pr_info("xfeatures: %llx, xcomp_bv: %llx\n",
+ ustate->xsave.header.xfeatures, ustate->xsave.header.xcomp_bv);
return fpu_copy_uabi_to_guest_fpstate(&vcpu->arch.guest_fpu,
guest_xsave->region,
kvm_caps.supported_xcr0,
diff --git a/arch/x86/mm/extable.c b/arch/x86/mm/extable.c
index 271dcb2deabc..21403b6e12a6 100644
--- a/arch/x86/mm/extable.c
+++ b/arch/x86/mm/extable.c
@@ -6,6 +6,7 @@
#include <xen/xen.h>
#include <asm/fpu/api.h>
+#include <asm/fpu/xcr.h>
#include <asm/sev.h>
#include <asm/traps.h>
#include <asm/kdebug.h>
@@ -121,8 +122,18 @@ static bool ex_handler_sgx(const struct exception_table_entry *fixup,
static bool ex_handler_fprestore(const struct exception_table_entry *fixup,
struct pt_regs *regs)
{
+ static bool once;
regs->ip = ex_fixup_addr(fixup);
+ if (boot_cpu_has(X86_FEATURE_XSAVE) && !once) {
+ struct xregs_state *state = (void *)regs->di;
+
+ once = true;
+ pr_info("xcr0 is %llx\n", xgetbv(XCR_XFEATURE_ENABLED_MASK));
+ pr_info("xfeatures: %llx, xcomp_bv: %llx\n",
+ state->header.xfeatures, state->header.xcomp_bv);
+ }
+
WARN_ONCE(1, "Bad FPU state detected at %pB, reinitializing FPU registers.",
(void *)instruction_pointer(regs));
```
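The reason these values matter: XRSTORS faults if a compacted XSAVE image names state components that are not enabled in XCR0 on the restoring CPU, which `ex_handler_fprestore()` then reports as "Bad FPU state". A minimal sketch of that consistency condition, ignoring supervisor (IA32_XSS) states for brevity; this helper is illustrative only and not part of the posted patch:

```c
#include <asm/fpu/types.h>	/* struct xregs_state, XCOMP_BV_COMPACTED_FORMAT */
#include <asm/fpu/xcr.h>	/* xgetbv(), XCR_XFEATURE_ENABLED_MASK */

/*
 * Illustrative sketch: a compacted XSAVE image is only restorable if every
 * feature named in xcomp_bv (apart from the compaction bit 63) is enabled
 * in XCR0 on this CPU. If, e.g., AVX-512 state was saved on the source
 * host but the destination host never enabled it, XRSTORS raises #GP and
 * the fixup path above reinitializes the FPU.
 */
static bool xsave_image_restorable_here(const struct xregs_state *state)
{
	u64 xcr0 = xgetbv(XCR_XFEATURE_ENABLED_MASK);
	u64 allowed = xcr0 | XCOMP_BV_COMPACTED_FORMAT;

	return !(state->header.xcomp_bv & ~allowed);
}
```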
Hi! Thanks a lot for your suggestions; sadly, it doesn't look like I can get the `kvm-pvm` module to load with this workaround applied:

```
$ free -h
total used free shared buff/cache available
Mem: 3.6Gi 396Mi 2.9Gi 620Ki 576Mi 3.2Gi
Swap: 3.6Gi 0B 3.6Gi
$ sudo modprobe kvm-pvm
modprobe: ERROR: could not insert 'kvm_pvm': Cannot allocate memory
```

Kernel log:
```
[ 510.203704] vmap allocation for size 17592186044416 failed: use vmalloc=<size> to increase size
```

(17592186044416 bytes is 16 TiB, the size of the PVM L4 range that the module tries to allocate.)

I've added the patches you've posted above to the CI's AWS configs (see https://github.com/loopholelabs/linux-pvm-ci/blob/master/patches/add-xsave-debug-logs.patch and https://github.com/loopholelabs/linux-pvm-ci/blob/master/patches/use-fixed-pvm-range.patch); to reproduce, run:

```
uname -r # Get installed kernels - make sure that there is at least one more kernel than the one you're removing!
sudo rpm -e kernel-6.7.0_rc6_pvm_host_fedora_aws-1.x86_64
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot
```

```
sudo dnf clean all
sudo dnf upgrade -y --refresh
sudo dnf install -y kernel-6.7.0_rc6_pvm_host_fedora_aws-1.x86_64
sudo grubby --set-default /boot/vmlinuz-6.7.0-rc6-pvm-host-fedora-aws
sudo grubby --args="pti=off nokaslr lapic=notscdeadline" --update-kernel /boot/vmlinuz-6.7.0-rc6-pvm-host-fedora-aws
sudo tee /etc/modprobe.d/kvm-intel-amd-blacklist.conf <<EOF
blacklist kvm-intel
blacklist kvm-amd
EOF
echo "kvm-pvm" | sudo tee /etc/modules-load.d/kvm-pvm.conf
sudo reboot
```

I've also tried it with a larger (512M) vmalloc area:

```
$ cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt2)/vmlinuz-6.7.0-rc6-pvm-host-fedora-aws root=UUID=2f1b4fb2-54ec-4124-a3b4-1776614e5e4c ro rootflags=subvol=root no_timer_check net.ifnames=0 console=tty1 console=ttyS0,115200n8 pti=off vmalloc=512M
```

Let me know if there is anything I can do to help with debugging!
Sorry, I made a mistake; I was mixed up when I wrote the reply. The workaround works in my old kernel version, but not on the latest code. The correct workaround can look like this, based on the latest 'pvm' branch with 5-level paging mode support. Additionally, you have to disable KASLR for the host kernel by adding `nokaslr` to the kernel command line:

```diff
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 1526747bedf2..0a0a13784403 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -713,6 +713,27 @@ static void __init x86_report_nx(void)
}
}
+#ifdef CONFIG_X86_64
+static void __init x86_reserve_vmalloc_range(void)
+{
+ static struct vm_struct pvm;
+ unsigned long size = 32UL << 39;
+
+ if (pgtable_l5_enabled())
+ size = 32UL << 48;
+
+ pvm.addr = (void *)(VMALLOC_END + 1 - size);
+ pvm.size = size;
+ pvm.flags = VM_ALLOC | VM_NO_GUARD;
+
+ vm_area_add_early(&pvm);
+}
+#else
+static void __init x86_reserve_vmalloc_range(void)
+{
+}
+#endif
+
/*
* Determine if we were loaded by an EFI loader. If so, then we have also been
* passed the efi memmap, systab, etc., so we should use these data structures
@@ -955,6 +976,7 @@ void __init setup_arch(char **cmdline_p)
* defined and before each memory section base is used.
*/
kernel_randomize_memory();
+ x86_reserve_vmalloc_range();
#ifdef CONFIG_X86_32
/* max_low_pfn get updated here */
diff --git a/arch/x86/kvm/pvm/host_mmu.c b/arch/x86/kvm/pvm/host_mmu.c
index a60a7c78ca5a..3bda09f1de69 100644
--- a/arch/x86/kvm/pvm/host_mmu.c
+++ b/arch/x86/kvm/pvm/host_mmu.c
@@ -51,9 +51,8 @@ static int __init guest_address_space_init(void)
pml4_index_start = L4_PT_INDEX(PVM_GUEST_MAPPING_START);
pml4_index_end = L4_PT_INDEX(RAW_CPU_ENTRY_AREA_BASE);
- pvm_va_range = get_vm_area_align(DEFAULT_RANGE_L5_SIZE, PT_L5_SIZE,
- VM_ALLOC|VM_NO_GUARD);
- if (!pvm_va_range) {
+ pvm_va_range = find_vm_area((void *)(VMALLOC_END + 1 - DEFAULT_RANGE_L5_SIZE));
+ if (!pvm_va_range || pvm_va_range->size != DEFAULT_RANGE_L5_SIZE) {
pml5_index_start = 0x1ff;
pml5_index_end = 0x1ff;
} else {
@@ -62,9 +61,8 @@ static int __init guest_address_space_init(void)
(u64)pvm_va_range->size);
}
} else {
- pvm_va_range = get_vm_area_align(DEFAULT_RANGE_L4_SIZE, PT_L4_SIZE,
- VM_ALLOC|VM_NO_GUARD);
- if (!pvm_va_range)
+ pvm_va_range = find_vm_area((void *)(VMALLOC_END + 1 - DEFAULT_RANGE_L4_SIZE));
+ if (!pvm_va_range || pvm_va_range->size != DEFAULT_RANGE_L4_SIZE)
return -1;
pml4_index_start = L4_PT_INDEX((u64)pvm_va_range->addr);
@@ -133,8 +131,6 @@ int __init host_mmu_init(void)
void host_mmu_destroy(void)
{
- if (pvm_va_range)
- free_vm_area(pvm_va_range);
if (host_mmu_root_pgd)
free_page((unsigned long)(void *)host_mmu_root_pgd);
if (host_mmu_la57_top_p4d)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 6e4b95f24bd8..3fead6a4f5c9 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2680,6 +2680,7 @@ struct vm_struct *find_vm_area(const void *addr)
return va->vm;
}
+EXPORT_SYMBOL_GPL(find_vm_area);
```

After further investigation, I was able to reproduce the NULL pointer access in `pvm_vcpu_run()`. The main problem still lies in the failed restoration of `MSR_PVM_VCPU_STRUCT`; the following change requests a GPC refresh when activation fails during a host-initiated MSR write:

```diff
diff --git a/arch/x86/kvm/pvm/pvm.c b/arch/x86/kvm/pvm/pvm.c
index 466f989cbcc3..2c83bb3251b6 100644
--- a/arch/x86/kvm/pvm/pvm.c
+++ b/arch/x86/kvm/pvm/pvm.c
@@ -1193,10 +1193,13 @@ static int pvm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
* user memory region before the VM entry.
*/
pvm->msr_vcpu_struct = data;
- if (!data)
+ if (!data) {
kvm_gpc_deactivate(&pvm->pvcs_gpc);
- else if (kvm_gpc_activate(&pvm->pvcs_gpc, data, PAGE_SIZE))
+ } else if (kvm_gpc_activate(&pvm->pvcs_gpc, data, PAGE_SIZE)) {
+ if (msr_info->host_initiated)
+ kvm_make_request(KVM_REQ_GPC_REFRESH, vcpu);
return 1;
+ }
break;
case MSR_PVM_SUPERVISOR_RSP:
pvm->msr_supervisor_rsp = msr_info->data;
```
Happy to report that this workaround fixed it! We just managed to snapshot/restore across two different EC2 instances with your two patches applied (we had to make some minor adjustments to work around some syntax issues: https://github.com/loopholelabs/linux-pvm-ci/blob/master/patches/use-fixed-pvm-range.patch and https://github.com/loopholelabs/linux-pvm-ci/blob/master/patches/fix-xsave-restore.patch). No FPU bugs or kernel crashes happened on the guest or the host :)

Demo video: pvm-ec2-migration.mp4
Note: EDIT: Not an issue with PVM - it's actually an issue with Cloud Hypervisor not being able to mask certain CPU features that aren't available on both hosts. It works fine with our fork of Firecracker - see #7 (comment) for more information and disregard this comment.
Here is the error log from Cloud Hypervisor:

```
$ cd ~/Projects/pvm-experimentation/ && rm -f /tmp/cloud-hypervisor.sock && cloud-hypervisor --api-socket /tmp/cloud-hypervisor.sock --restore source_url=file:///home/pojntfx/Downloads/drafter-snapshots
cloud-hypervisor: 11.644681ms: <vmm> ERROR:arch/src/x86_64/mod.rs:558 -- Detected incompatible CPUID entry: leaf=0x7 (subleaf=0x0), register='EBX', compatilbe_check='BitwiseSubset', source VM feature='0xd18f072b', destination VM feature'0xc2f7b'.
Error restoring VM: VmRestore(CpuManager(VcpuCreate(Could not set the vCPU state SetXsaveState(Invalid argument (os error 22)))))
```

CPU info for the two tested hosts:

```
# EC2
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
CPU family: 6
Model: 85
Thread(s) per core: 2
Core(s) per socket: 1
Socket(s): 1
Stepping: 7
BogoMIPS: 5999.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
Virtualization features:
Hypervisor vendor: KVM
Virtualization type: full
Caches (sum of all):
L1d: 32 KiB (1 instance)
L1i: 32 KiB (1 instance)
L2: 1 MiB (1 instance)
L3: 35.8 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0,1
Vulnerabilities:
Gather data sampling: Unknown: Dependent on hypervisor status
Itlb multihit: KVM: Mitigation: VMX unsupported
L1tf: Mitigation; PTE Inversion
Mds: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Meltdown: Vulnerable
Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Retbleed: Vulnerable
Spec rstack overflow: Not affected
Spec store bypass: Vulnerable
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Srbds: Not affected
Tsx async abort: Not affected
$ cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt2)/vmlinuz-6.7.0-rc6-pvm-host-fedora-aws root=UUID=2f1b4fb2-54ec-4124-a3b4-1776614e5e4c ro rootflags=subvol=root no_timer_check net.ifnames=0 console=tty1 console=ttyS0,115200n8 pti=off nokaslr
# GCP
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 1
On-line CPU(s) list: 0
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU @ 2.20GHz
CPU family: 6
Model: 79
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 1
Stepping: 0
BogoMIPS: 4400.29
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid
tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp fsgsbase tsc_adjust
bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat md_clear arch_capabilities
Virtualization features:
Hypervisor vendor: KVM
Virtualization type: full
Caches (sum of all):
L1d: 32 KiB (1 instance)
L1i: 32 KiB (1 instance)
L2: 256 KiB (1 instance)
L3: 55 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Mitigation; PTE Inversion
Mds: Mitigation; Clear CPU buffers; SMT Host state unknown
Meltdown: Vulnerable
Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Retbleed: Mitigation; IBRS
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; IBRS, IBPB conditional, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Srbds: Not affected
Tsx async abort: Mitigation; Clear CPU buffers; SMT Host state unknown
$ cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt2)/boot/vmlinuz-6.7.0-rc6-pvm-host-rocky-gcp root=UUID=fe4bce20-90c9-4d54-8d00-70e98ca7a7ac ro net.ifnames=0 biosdevname=0 scsi_mod.use_blk_mq=Y crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M console=ttyS0,115200 pti=off nokaslr
```

I also disabled the CPU compatibility check in Cloud Hypervisor because that failed:

```diff
diff --git a/arch/src/x86_64/mod.rs b/arch/src/x86_64/mod.rs
index 896a74d2..392e78a5 100644
--- a/arch/src/x86_64/mod.rs
+++ b/arch/src/x86_64/mod.rs
@@ -568,10 +568,9 @@ impl CpuidFeatureEntry {
if compatible {
info!("No CPU incompatibility detected.");
- Ok(())
- } else {
- Err(Error::CpuidCheckCompatibility)
}
+
+ Ok(())
}
}
```
Turns out #7 (comment) wasn't caused by anything PVM-related - restores simply fail due to different CPU features between the hosts (see the lscpu output above: the EC2 host exposes e.g. AVX-512, which the GCP host lacks) and Cloud Hypervisor not being able to mask them.

We've also been able to revert this part of the patches you've posted (no difference in resume behavior): loopholelabs/linux-pvm-ci@d28ceb1 - is there a chance that this might be an accidental addition? It looks like a memory leak to us. We've updated the CI repo (https://github.com/loopholelabs/linux-pvm-ci) to reflect this change already.

One error we still run into however (with & without reverting this change) is that after a few migrations between hosts (with this specific setup, ~5-7 migrations), VMs start resuming significantly more slowly (sometimes it takes up to a minute). Rebooting the hosts & doing a migration afterwards makes them resume immediately again, even if it's the same snapshot - is there a chance there might be some sort of memory leak etc. that's making PVM not find the necessary memory regions after multiple VMs have been resumed on one host? We're not resuming multiple VMs at the same time, just one after the other (stopping the last one before starting a new migration), yet this still happens.

Let me know if there is anything I can do to help debug this!
No, it was deleted deliberately. As I mentioned earlier, the reserved range now comes from a static `vm_struct` that is registered early during boot, so it must not be freed when the module is unloaded; it is not a memory leak. We will make this behavior a kernel boot parameter. If you want to use PVM, you should pass the new parameter to reserve the fixed area in the vmalloc area.
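A rough sketch of how such a boot parameter could be wired up, modeled on the `x86_reserve_vmalloc_range()` patch above; the parameter name `pvm_reserve=` and its exact plumbing are assumptions, not the final implementation:

```c
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/vmalloc.h>

/* Hypothetical "pvm_reserve=<size>" parameter, e.g. pvm_reserve=16T. */
static unsigned long pvm_reserve_size __initdata;

static int __init parse_pvm_reserve(char *arg)
{
	pvm_reserve_size = memparse(arg, &arg);
	return 0;
}
early_param("pvm_reserve", parse_pvm_reserve);

/* Called from setup_arch(), like x86_reserve_vmalloc_range() above. */
static void __init pvm_reserve_vmalloc_range(void)
{
	static struct vm_struct pvm;

	if (!pvm_reserve_size)
		return;

	/*
	 * Reserve a fixed, KASLR-independent position at the top of the
	 * vmalloc area, so the PVM module can rediscover the same range
	 * on every host with find_vm_area().
	 */
	pvm.addr  = (void *)(VMALLOC_END + 1 - pvm_reserve_size);
	pvm.size  = pvm_reserve_size;
	pvm.flags = VM_ALLOC | VM_NO_GUARD;
	vm_area_add_early(&pvm);
}
```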
Sorry, I'm not clear about the problem. Are you saying that there are only two hosts involved, or more? Assuming there are two hosts, we'll call them host1 and host2. Is the migration always from host1 to host2, or is it bidirectional (host1 <-> host2)? Additionally, does rebooting one or both hosts solve the problem? Have you tried unloading and reloading the PVM module? Did you encounter any warnings or errors during the unloading, and does the problem persist after reloading the module? Perhaps we can open a new issue to track it more clearly.
The commit eb49d06 ("KVM: x86/PVM: Store the valid value for MSR_PVM_VCPU_STRUCT unconditionally") aimed to address the failure to restore a snapshot caused by the MSR_PVM_VCPU_STRUCT restoration failure, by storing the value before kvm_gpc_activate(). However, this fix only worked accidentally, as the GPC is refreshed by timer IRQ handling rather than by adding the memslot. If no timer IRQ is injected before the first VM entry, the host panics due to a NULL pointer access of 'pvcs_gpc.khva'. Therefore, following the PVM specification, a GPC refresh request is now made if the GPC fails to activate during an MSR write by the host. For the guest, setting an invalid MSR value will trigger a triple fault. Additionally, a WARN_ON_ONCE() is added in pvm_vcpu_run() to capture unexpected bugs where 'pvcs_gpc.khva' is NULL while the MSR value is not. Fixes: eb49d06 ("KVM: x86/PVM: Store the valid value for MSR_PVM_VCPU_STRUCT unconditionally") Signed-off-by: Hou Wenlong <[email protected]> Link: #7
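A sketch of the described check, assuming the field names from the patches earlier in this thread (the exact placement inside `pvm_vcpu_run()` is not shown in this thread):

```c
/*
 * Catch the case the commit message describes: MSR_PVM_VCPU_STRUCT was
 * set to a non-zero value, but the GPC mapping was never successfully
 * refreshed, so pvcs_gpc.khva is still NULL at VM entry.
 */
WARN_ON_ONCE(pvm->msr_vcpu_struct && !pvm->pvcs_gpc.khva);
```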
Sorry for the delay in my response, I was OOO for significant parts of last week due to medical reasons.
Thanks a lot! Yes, this makes a lot of sense; a kernel parameter seems like a good way to implement this behavior. From a user perspective, when using migrations, would this mean that the memory region that PVM (and its VMs) can use would need to be reserved ahead of time, when loading the KVM module? Would the size of this region also be configurable through this kernel parameter, and would there be limits as to how much memory we could reserve?
This happens when migrating from host 1 to host 2, as well as from host 2 to host 1 - it's bidirectional. Let's say we migrate from host 1 to host 2: the first migration works flawlessly. When we stop this migrated VM on host 2 and migrate the VM from host 1 to host 2 again, the restore takes much longer. If we do this ~5 times, the VM doesn't resume at all/hangs. In this case, if we reboot host 2 and then migrate the VM from host 1 to host 2 again, it resumes immediately again - and then after ~5 times it stops working again. Unloading and then reloading the module has the same effect; migrations start working again after this.

So far we haven't been able to reproduce this on migrations between the same instance types (like two EC2 instances of the same type), but it happens when we migrate between two different instance types (like EC2 → GCP), even if it's the same Intel CPU generation. We can't reproduce this with Cloud Hypervisor, since we can't resume the VM on another host at all (because it lacks the concept of CPU templates), but we can do so reliably with Firecracker. Any idea what might be causing this? Let me know if you need additional docs to reproduce or if I can help with additional debugging info.
I'm sorry to hear that. I hope you are feeling better now.
Yes, if someone wants to use migration, they must ensure that the kernel parameter is set and the range is reserved during boot. The PVM module will try to find the reserved range first; if it doesn't find one, it will attempt dynamic allocation and may trigger a warning indicating that migration may not be available. Regarding the size of the region, my colleague suggested that we may use the same format as the `crashkernel=` parameter.
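Put together, the module-side lookup could look roughly like the following sketch, combining the `find_vm_area()` patch above with the described fallback (the warning text and exact control flow are assumptions, not the final implementation):

```c
/* Inside guest_address_space_init(), 4-level paging case (sketch):
 * prefer the range reserved at boot; fall back to dynamic allocation
 * with a warning that cross-host migration may then not work. */
pvm_va_range = find_vm_area((void *)(VMALLOC_END + 1 - DEFAULT_RANGE_L4_SIZE));
if (!pvm_va_range || pvm_va_range->size != DEFAULT_RANGE_L4_SIZE) {
	pr_warn("kvm-pvm: no reserved vmalloc range found, falling back to "
		"dynamic allocation; live migration may not be available\n");
	pvm_va_range = get_vm_area_align(DEFAULT_RANGE_L4_SIZE, PT_L4_SIZE,
					 VM_ALLOC | VM_NO_GUARD);
	if (!pvm_va_range)
		return -1;
}
```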
Sorry, I don't have EC2/GCP instances, but I will try to see if I can reproduce the problem on my physical machine. However, I have a few more questions. When you say 'stop the migrated VM', do you mean pausing the VM or shutting it down? Are the vCPUs still running while the restored VM is hanging? You can use perf tools to trace the KVM events on the host while the restore hangs.
Just a quick update from our side - sorry for the delay. We're still gathering more info on this, but right now we're blocked by work on other parts of the live migration data plane before we can re-test. Regarding this:
Do I understand correctly that for migration to work, the kernel would need to be booted with the same parameters on both the source and destination hosts, so that the same range is reserved on each? As for the testing with perf, we'll follow up as soon as we've been able to re-test.
Yes, you are correct. However, the parameters referred to here are the host kernel boot parameters, not the PVM kernel module parameters. My colleague suggested that it's the administrator's responsibility to provide suitable parameters on both hosts.
Description
We're testing PVM on AWS EC2 and we're running into some issues with restoring snapshots on EC2 (`c5.large` in particular, but this also occurs on all other non-bare-metal instance types, both Intel and AMD). Snapshot restores work well on GCP non-bare-metal hosts with Cloud Hypervisor (and on bare-metal hosts with the Intel KVM module unloaded and the PVM module loaded), but they lead to host kernel panics when used on AWS EC2. There are also some error messages (see further down) related to it not being able to set MSRs for the guest (a similar error occurs for Firecracker's snapshot restores, too!) in the load step before the kernel crashes.

Note that this only happens when the snapshot is resumed on a different host than the one it was created on, i.e. when a snapshot is created on host A, moved from host A to host B, and resumed on host B. Both instance types are the same (same CPU etc.). It does not occur if the snapshot is simply resumed on the same host it was created on.
Reproduction
Snapshot/Restore
To reproduce (`35.89.175.21` is the first EC2 instance, `34.214.167.180` is the second EC2 host, and `rsync` is used to sync the snapshot, rootfs etc. between the VMs):
The same issue also occurs with Cloud Hypervisor's live migration:
Full Kernel Panic
Cloud Hypervisor Error Log
Firecracker Error Log