Kernel crash while doing VM snapshot with memory

sbellon

Hello all,

With a fresh install of Proxmox VE 7.0-13, running just three containers and one virtual machine, I got a kernel crash when trying to create a VM snapshot (including memory!). The VM has two NICs passed through via PCI passthrough, and the call trace below points at vfio_iommu_type1_ioctl:

Code:
Nov 13 08:21:39 pve pvedaemon[2484322]: <root@pam> snapshot VM 100: Before_Update
Nov 13 08:22:38 pve pve-firewall[1349]: firewall update time (5.467 seconds)
Nov 13 08:22:39 pve pvestatd[1350]: status update time (6.901 seconds)
Nov 13 08:23:28 pve systemd[1]: Started Session 1848 of user root.
Nov 13 08:23:28 pve systemd[1]: session-1848.scope: Succeeded.
Nov 13 08:24:33 pve pvedaemon[2993147]: VM 100 qmp command failed - VM 100 qmp command 'query-proxmox-support' failed - unable to connect to VM 100 qmp socket - timeout after 31 retries
Nov 13 08:24:38 pve pvestatd[1350]: VM 100 qmp command failed - VM 100 qmp command 'query-proxmox-support' failed - unable to connect to VM 100 qmp socket - timeout after 31 retries
Nov 13 08:24:38 pve pvestatd[1350]: status update time (6.159 seconds)
Nov 13 08:24:43 pve kernel: [504902.804509] BUG: unable to handle page fault for address: ffffcc10c8909d48
Nov 13 08:24:43 pve kernel: [504902.804516] #PF: supervisor read access in kernel mode
Nov 13 08:24:43 pve kernel: [504902.804518] #PF: error_code(0x0000) - not-present page
Nov 13 08:24:43 pve kernel: [504902.804520] PGD 0 P4D 0
Nov 13 08:24:43 pve kernel: [504902.804522] Oops: 0000 [#1] SMP NOPTI
Nov 13 08:24:43 pve kernel: [504902.804525] CPU: 6 PID: 1504 Comm: kvm Tainted: P S         O      5.11.22-5-pve #1
Nov 13 08:24:43 pve kernel: [504902.804528] Hardware name: Default string Default string/Default string, BIOS 5.13 12/03/2020
Nov 13 08:24:43 pve kernel: [504902.804530] RIP: 0010:kfree+0x6a/0x400
Nov 13 08:24:43 pve kernel: [504902.804536] Code: 80 49 01 dc 0f 82 97 03 00 00 48 c7 c0 00 00 00 80 48 2b 05 68 b8 3a 01 49 01 c4 49 c1 ec 0c 49 c1 e4 06 4c 03 25 46 b8 3a 01 <49> 8b 44 24 08 48 8d 50 ff a8 01 4c 0f 45 e2 49 8b 54 24 08 48 8d
Nov 13 08:24:43 pve kernel: [504902.804539] RSP: 0018:ffffa2d146763d38 EFLAGS: 00010282
Nov 13 08:24:43 pve kernel: [504902.804542] RAX: 00006a9040000000 RBX: ffffa2d164275000 RCX: 0000000081000091
Nov 13 08:24:43 pve kernel: [504902.804544] RDX: 0000000000000001 RSI: ffffffffc03af649 RDI: ffffa2d164275000
Nov 13 08:24:43 pve kernel: [504902.804546] RBP: ffffa2d146763d90 R08: ffff957057326de0 R09: 0000000000000000
Nov 13 08:24:43 pve kernel: [504902.804548] R10: 0000000000000001 R11: 0000000000000000 R12: ffffcc10c8909d40
Nov 13 08:24:43 pve kernel: [504902.804550] R13: ffff9570607193e8 R14: 00007ffcde153b60 R15: ffff95704a1e2400
Nov 13 08:24:43 pve kernel: [504902.804552] FS:  00007f00d74e81c0(0000) GS:ffff95739dd80000(0000) knlGS:0000000000000000
Nov 13 08:24:43 pve kernel: [504902.804554] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 13 08:24:43 pve kernel: [504902.804556] CR2: ffffcc10c8909d48 CR3: 000000010304c006 CR4: 00000000003726e0
Nov 13 08:24:43 pve kernel: [504902.804558] Call Trace:
Nov 13 08:24:43 pve kernel: [504902.804562]  ? vfio_iommu_type1_ioctl+0x1099/0x1340 [vfio_iommu_type1]
Nov 13 08:24:43 pve kernel: [504902.804566]  vfio_iommu_type1_ioctl+0x1099/0x1340 [vfio_iommu_type1]
Nov 13 08:24:43 pve kernel: [504902.804570]  ? kvm_vm_ioctl+0x3f8/0xc40 [kvm]
Nov 13 08:24:43 pve kernel: [504902.804610]  vfio_fops_unl_ioctl+0x6b/0x280 [vfio]
Nov 13 08:24:43 pve kernel: [504902.804614]  __x64_sys_ioctl+0x91/0xc0
Nov 13 08:24:43 pve kernel: [504902.804621]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Nov 13 08:24:43 pve kernel: [504902.804624] RIP: 0033:0x7f00e189fcc7
Nov 13 08:24:43 pve kernel: [504902.804626] Code: 00 00 00 48 8b 05 c9 91 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 99 91 0c 00 f7 d8 64 89 01 48
Nov 13 08:24:43 pve kernel: [504902.804633] RAX: ffffffffffffffda RBX: 000055f72a96ce70 RCX: 00007f00e189fcc7
Nov 13 08:24:43 pve kernel: [504902.804636] RBP: 0000000000000009 R08: 0000000000000204 R09: 0000000000000000
Nov 13 08:24:43 pve kernel: [504902.804640] R13: 0000000000000001 R14: 000055f72934f0f0 R15: 000055f72934f560
Nov 13 08:24:43 pve kernel: [504902.804683]  processor_thermal_rapl snd_timer rc_core intel_rapl_common snd fb_sys_fops syscopyarea int340x_thermal_zone sysfillrect soundcore rapl ee1004 sysimgblt intel_soc_dts_iosf intel_cstate serio_raw pcspkr efi_pstore intel_wmi_thunderbolt mei_me wmi_bmof mei intel_pch_thermal mac_hid acpi_tad acpi_pad vhost_net vhost vhost_iotlb tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi vfio_pci vfio_virqfd irqbypass vfio_iommu_type1 vfio drm sunrpc ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs blake2b_generic xor raid6_pq libcrc32c crc32_pclmul psmouse intel_lpss_pci xhci_pci intel_lpss idma64 igb sdhci_pci xhci_pci_renesas i2c_i801 cqhci ahci i2c_algo_bit i2c_smbus dca libahci virt_dma xhci_hcd sdhci wmi video pinctrl_cannonlake [last unloaded: it87]
Nov 13 08:24:43 pve kernel: [504902.804743] CR2: ffffcc10c8909d48
Nov 13 08:24:43 pve kernel: [504902.804745] ---[ end trace 4c8cd0dc28881464 ]---
Nov 13 08:24:43 pve kernel: [504902.920104] RIP: 0010:kfree+0x6a/0x400
Nov 13 08:24:43 pve kernel: [504902.920122] Code: 80 49 01 dc 0f 82 97 03 00 00 48 c7 c0 00 00 00 80 48 2b 05 68 b8 3a 01 49 01 c4 49 c1 ec 0c 49 c1 e4 06 4c 03 25 46 b8 3a 01 <49> 8b 44 24 08 48 8d 50 ff a8 01 4c 0f 45 e2 49 8b 54 24 08 48 8d
Nov 13 08:24:43 pve kernel: [504902.920129] RSP: 0018:ffffa2d146763d38 EFLAGS: 00010282
Nov 13 08:24:43 pve kernel: [504902.920133] RAX: 00006a9040000000 RBX: ffffa2d164275000 RCX: 0000000081000091
Nov 13 08:24:43 pve kernel: [504902.920137] RDX: 0000000000000001 RSI: ffffffffc03af649 RDI: ffffa2d164275000
Nov 13 08:24:43 pve kernel: [504902.920141] RBP: ffffa2d146763d90 R08: ffff957057326de0 R09: 0000000000000000
Nov 13 08:24:43 pve kernel: [504902.920144] R10: 0000000000000001 R11: 0000000000000000 R12: ffffcc10c8909d40
Nov 13 08:24:43 pve kernel: [504902.920148] R13: ffff9570607193e8 R14: 00007ffcde153b60 R15: ffff95704a1e2400
Nov 13 08:24:43 pve kernel: [504902.920152] FS:  00007f00d74e81c0(0000) GS:ffff95739dd80000(0000) knlGS:0000000000000000
Nov 13 08:24:43 pve kernel: [504902.920156] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 13 08:24:43 pve kernel: [504902.920159] CR2: ffffcc10c8909d48 CR3: 000000010304c006 CR4: 00000000003726e0
Nov 13 08:24:45 pve pvestatd[1350]: VM 100 qmp command failed - VM 100 not running
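
In case it helps anyone triage a log like the one above, here is a rough Python sketch that pulls out the three things that usually matter for a report: the BUG line, the faulting kernel RIP, and the reliable call-trace frames (the ones not prefixed with "?"). The function name `summarize_oops` and the regular expressions are my own and only assume the syslog layout shown above; other kernels or log formats may need adjustments.

```python
import re

# Captures the kernel message after the "kernel: [timestamp]" prefix.
KERNEL_MSG = re.compile(r"kernel: \[\s*\d+\.\d+\]\s?(.*)$")
# Matches frames like " vfio_fops_unl_ioctl+0x6b/0x280 [vfio]";
# group 1 is set when the frame carries the unreliable "?" marker.
FRAME = re.compile(
    r"^\s*(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)
KERNEL_RIP = "RIP: 0010:"  # kernel code segment, as in the log above


def summarize_oops(lines):
    """Return the BUG line, kernel RIP and reliable frames of the first oops."""
    summary = {"bug": None, "rip": None, "frames": []}
    in_trace = False
    for line in lines:
        found = KERNEL_MSG.search(line)
        if not found:
            continue  # not a kernel message (pvedaemon, systemd, ...)
        msg = found.group(1)
        if msg.startswith("BUG:") and summary["bug"] is None:
            summary["bug"] = msg
        elif msg.startswith(KERNEL_RIP) and summary["rip"] is None:
            summary["rip"] = msg[len(KERNEL_RIP):]
        elif msg.startswith("Call Trace:"):
            in_trace = True
        elif in_trace:
            frame = FRAME.match(msg)
            if frame is None:
                in_trace = False  # first non-frame line ends the trace
            elif frame.group(1) is None:
                # (function, module); module is None for built-in code
                summary["frames"].append((frame.group(2), frame.group(3)))
    return summary
```

Calling it with the lines of the syslog (for example `summarize_oops(open(path))`, where `path` is wherever the log was saved) returns a small dict that can be pasted into a bug report.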

The complete configuration of VM 100 looks like this:

Code:
pve:~# qm config 100
agent: 1
boot: order=scsi0;ide2
cores: 4
cpu: host
hostpci0: 0000:01:00
hostpci1: 0000:02:00
ide2: none,media=cdrom
machine: q35
memory: 8192
name: opnsense
numa: 0
onboot: 1
ostype: l26
scsi0: local-zfs:vm-100-disk-0,discard=on,size=32G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=5175efd9-1650-49c9-a48b-efc231d066f8
sockets: 1
startup: order=1
vmgenid: f88b4a5a-62b5-42c8-8a2f-58bae48974b8
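
Since the trace involves vfio, the relevant part of this configuration is the two hostpci entries. As a small sketch (not an official Proxmox tool), the following Python snippet lists the passthrough devices from `qm config` output. The helper name `passthrough_devices` is made up for illustration, and it assumes the PCI address is the first comma-separated field of each `hostpciN:` line, as in the output above.

```python
import re

# Matches lines such as "hostpci0: 0000:01:00" in `qm config` output.
HOSTPCI = re.compile(r"^hostpci(\d+):\s*(\S+)")


def passthrough_devices(config_text):
    """Map each hostpci index to its PCI address (assumed first field)."""
    devices = {}
    for line in config_text.splitlines():
        found = HOSTPCI.match(line.strip())
        if found:
            # Anything after the first comma is treated as an option.
            devices[int(found.group(1))] = found.group(2).split(",")[0]
    return devices
```

The input could come from running `qm config 100` on the host and passing its text output to the function; an empty result means no passthrough devices are configured.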

/proc/cpuinfo (for the first core) looks as follows:

Code:
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 142
model name      : Intel(R) Core(TM) i5-8265U CPU @ 1.60GHz
stepping        : 11
microcode       : 0xea
cpu MHz         : 1800.000
cache size      : 6144 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities
vmx flags       : vnmi preemption_timer invvpid ept_x_only ept_ad ept_1gb flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest ple pml ept_mode_based_exec
bugs            : spectre_v1 spectre_v2 spec_store_bypass mds swapgs itlb_multihit srbds
bogomips        : 3600.00
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

Should I be configuring something differently? Or is this a genuine bug? If so, is it correct to report it here (because it's a PVE kernel) or should I report it somewhere else upstream?

Greetings,
Stefan
 
