Reproducible Proxmox Crash on Kernel 6.8.4-2-pve: KVM force termination renders Web UI, SSH, and all running VMs Unresponsive

Apr 27, 2024
Hello everyone :)

I recently upgraded to Proxmox VE 8.2.2 running kernel 6.8.4-2-pve (with an active subscription). Since the upgrade I have experienced 2 seemingly random host crashes/freezes where the Proxmox host became completely unresponsive. I was unable to determine the cause at the time, especially since they happened at night without any load.

However, I have now found a reproducible way to trigger the crash. The host consistently crashes when attempting to install Ubuntu 22.04 desktop in a VM. I have tested this twice in a row with the same result. The ISO image is the same one I used before the upgrade to Proxmox 8.2.2, and it worked fine for many installs without any crashes.

Steps to reproduce:
1. Create a new VM with typical settings (4 GB RAM, 2 CPU cores, VirtIO SCSI disk, etc.)
2. Mount the Ubuntu 22.04.3 desktop ISO and start the VM
3. Proceed with the Ubuntu installation

Sometime during the "Installing system" phase, the Proxmox host becomes unresponsive with something between an oops and a panic. Looking at the logs, I see:

Code:
Apr 27 13:42:52 prxmx kernel: BUG: kernel NULL pointer dereference, address: 0000000000000008
Apr 27 13:42:52 prxmx kernel: #PF: supervisor write access in kernel mode
Apr 27 13:42:52 prxmx kernel: #PF: error_code(0x0002) - not-present page
Apr 27 13:42:52 prxmx kernel: PGD 0 P4D 0
Apr 27 13:42:52 prxmx kernel: Oops: 0002 [#1] PREEMPT SMP NOPTI
Apr 27 13:42:52 prxmx kernel: CPU: 0 PID: 1549 Comm: kvm Not tainted 6.8.4-2-pve #1

...
Apr 27 13:42:52 prxmx kernel: WARNING: CPU: 0 PID: 1549 at kernel/exit.c:820 do_exit+0x8dd/0xae0


[Attached image: 1714220506576.png]

The full kernel oops log is attached. It looks like the crash occurs in `blk_flush_complete_seq` (at offset +0x291). The call trace after the oops shows that the kernel then forcibly terminated the offending process (kvm, PID 1549).
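
For anyone who wants to pull the same lines out of their own journal after a reboot, something like this may help. It is only a minimal sketch: it assumes the journal is persistent so the previous boot is still available, and `extract_oops` is a hypothetical helper name, not a Proxmox tool.

```shell
# Hypothetical helper: keep only the kernel lines that mark a BUG, oops,
# page fault, warning, or the RIP line naming the faulting function.
extract_oops() {
    grep -E 'kernel: (BUG:|Oops:|WARNING:|RIP:|#PF:)'
}

# Usage (assumes a persistent journal):
#   -k     = kernel messages only
#   -b -1  = previous boot, i.e. the one that crashed
#   journalctl -k -b -1 --no-pager | extract_oops
```

The filter reads from stdin, so it can also be run against a saved log file, for example `extract_oops < kernel-log.txt`.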

The Proxmox host was stable before upgrading to PVE 8.2.2 with the new kernel (coming from 8.1.4, if I recall correctly). The web UI and SSH are unresponsive, but attaching a monitor and keyboard and logging in locally still works, which speaks against a full kernel panic. In htop/top you can see a kvm process (with a different PID) using 100% CPU.
[Attached image: 1714219623803.png]


I am not the only one experiencing random freezes/crashes on Proxmox VE 8.2.2, as there are other reports of similar behavior.

Any known workarounds besides using an older kernel?

Let me know if any other details would be helpful for debugging. Thanks!
I hope the gathered information helps. As for me, I need my server back in working order, so I will try reverting to the old kernel.

System Specifications:
  • CPU: Intel Core i9-9900K
  • Motherboard: Gigabyte B360 HD3PLM
  • Kernel: Linux 6.8.4-2-pve (compiled on 2024-04-10T17:36Z)
  • Hosting: Hetzner Dedicated Server
  • Licensed Proxmox with 12 VMs (approximately 3 running)
Code:
lspci
00:00.0 Host bridge: Intel Corporation 8th/9th Gen Core 8-core Desktop Processor Host Bridge/DRAM Registers [Coffee Lake S] (rev 0d)
00:01.0 PCI bridge: Intel Corporation 6th-10th Gen Core Processor PCIe Controller (x16) (rev 0d)
00:02.0 VGA compatible controller: Intel Corporation CoffeeLake-S GT2 [UHD Graphics 630] (rev 02)
00:12.0 Signal processing controller: Intel Corporation Cannon Lake PCH Thermal Controller (rev 10)
00:14.0 USB controller: Intel Corporation Cannon Lake PCH USB 3.1 xHCI Host Controller (rev 10)
00:14.2 RAM memory: Intel Corporation Cannon Lake PCH Shared SRAM (rev 10)
00:16.0 Communication controller: Intel Corporation Cannon Lake PCH HECI Controller (rev 10)
00:17.0 SATA controller: Intel Corporation Cannon Lake PCH SATA AHCI Controller (rev 10)
00:1b.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #21 (rev f0)
00:1d.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #9 (rev f0)
00:1f.0 ISA bridge: Intel Corporation Device a308 (rev 10)
00:1f.4 SMBus: Intel Corporation Cannon Lake PCH SMBus Controller (rev 10)
00:1f.5 Serial bus controller: Intel Corporation Cannon Lake PCH SPI Controller (rev 10)
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (7) I219-LM (rev 10)
01:00.0 Non-Volatile memory controller: Micron Technology Inc 3400 NVMe SSD [Hendrix]
02:00.0 Non-Volatile memory controller: Micron Technology Inc 3400 NVMe SSD [Hendrix]
 

Attachments

  • kernel-log.txt

Same issue here since upgrading to the newest version (8.2.2), including its kernels.

It only occurs when I start my VM that has an i915 GPU passed through.

I got a hint not to set the CPU type to "host", but that did not solve the issue.

Fallback solution: use an older kernel.
Automatically selected kernels:
6.5.13-5-pve - Works
6.8.1-1-pve - 100% CPU, VM fails
6.8.4-2-pve - 100% CPU, VM fails

Pinned kernel:
6.5.13-5-pve
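
The results above can be turned into a quick check of whatever kernel a host is currently running. This is only a sketch: the version lists contain nothing more than what was reported in this thread, and `kernel_status` is a hypothetical helper name.

```shell
# Hypothetical helper: classify a kernel release string using only the
# results reported in this thread (not an official compatibility list).
kernel_status() {
    case "$1" in
        6.5.13-5-pve)            echo "reported working" ;;
        6.8.1-1-pve|6.8.4-2-pve) echo "reported failing" ;;
        *)                       echo "not tested in this thread" ;;
    esac
}

# Usage: kernel_status "$(uname -r)"
```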
 

Attachments

  • dmesg.txt

"Same issue here" on an AMD EPYC 7313P, Supermicro H12SSL-i system.

The crash happened when a GitLab runner started building inside a Debian VM (small system load -> heavy system load).

## journalctl log

Bash:
Apr 28 22:13:54 node01 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000008
Apr 28 22:13:54 node01 kernel: #PF: supervisor write access in kernel mode
Apr 28 22:13:54 node01 kernel: #PF: error_code(0x0002) - not-present page
Apr 28 22:13:54 node01 kernel: PGD 0 P4D 0
Apr 28 22:13:54 node01 kernel: Oops: 0002 [#1] PREEMPT SMP NOPTI
Apr 28 22:13:54 node01 kernel: CPU: 4 PID: 13075 Comm: kvm Tainted: P           O       6.8.4-2-pve #1
[...]
Apr 28 22:13:54 node01 kernel: ------------[ cut here ]------------
Apr 28 22:13:54 node01 kernel: WARNING: CPU: 4 PID: 13075 at kernel/exit.c:820 do_exit+0x8dd/0xae0
Apr 28 22:13:54 node01 kernel: Modules linked in: tcp_diag inet_diag vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter scsi_transport_iscsi nf_tables 8021q garp mrp softdog bonding tls sunrpc binfmt_misc nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd kvm irqbypass crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd ast ipmi_ssif rapl wmi_bmof pcspkr ccp i2c_algo_bit acpi_ipmi k10temp ptdma ipmi_si ipmi_devintf ipmi_msghandler joydev input_leds mac_hid zfs(PO) spl(O) vhost_net vhost vhost_iotlb tap efi_pstore dmi_sysfs ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 mlx4_ib ib_uverbs hid_generic rndis_host usbmouse cdc_ether ib_core mlx4_en usbnet usbhid mii hid raid1 crc32_pclmul mpt3sas xhci_pci xhci_pci_renesas nvme raid_class tg3 mlx4_core
Apr 28 22:13:54 node01 kernel:  scsi_transport_sas xhci_hcd nvme_core i2c_piix4 nvme_auth wmi
Apr 28 22:13:54 node01 kernel: CPU: 4 PID: 13075 Comm: kvm Tainted: P      D    O       6.8.4-2-pve #1
[...]

## Temporary workaround (no long-term results yet)

Bash:
apt install proxmox-kernel-6.5.13-5-pve
pve-efiboot-tool kernel pin 6.5.13-5-pve
reboot

Thanks to @zolbarna for providing the last working kernel version!
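
After the reboot it is probably worth confirming that the pinned kernel is the one actually running. A minimal sketch, where `check_kernel` is a hypothetical helper name and the expected version is the one from the workaround above:

```shell
# Hypothetical helper: compare the expected (pinned) kernel with the running one.
# $1 = expected release, $2 = actual release (defaults to the output of uname -r).
check_kernel() {
    expected=$1
    actual=${2:-$(uname -r)}
    if [ "$actual" = "$expected" ]; then
        echo "OK: running $actual"
    else
        echo "MISMATCH: running $actual, expected $expected"
        return 1
    fi
}

# Usage: check_kernel 6.5.13-5-pve
```

The non-zero return code on a mismatch makes it usable in a cron job or monitoring check, in case a later update changes the booted kernel.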
 
Similar issue here. Since upgrading to Proxmox 8.2.2 with Ceph 17.2.7-pve3 and kernel 6.8.2, we have had 4 kernel panics over 2 days: twice on systems with no load and twice on systems under load.
We don't have a crash log, but the hosts are Supermicro X11DPL-i boards with dual Xeon Silver 4116 CPUs.
We have reverted to the 6.5 kernel and will monitor stability.

Interestingly, we also have a second cluster on different hardware that does not run local Ceph: older Dell PowerEdge R710 systems (motherboard G1 PV9DG) with dual Xeon E5520 CPUs. That cluster has been stable on the new kernel.
 