[SOLVED] Reproducible Proxmox Crash on Kernel 6.8.4-2-pve: KVM force termination renders Web UI, SSH, and all running VMs Unresponsive

Apr 27, 2024
Hello everyone :)

I recently upgraded to Proxmox VE 8.2.2 running kernel 6.8.4-2-pve (with an active subscription). Since the upgrade I have experienced two seemingly random host crashes/freezes where the host became completely unresponsive, but I was unable to determine the cause at the time, especially since they happened at night without any load.

However, I have now found a reproducible way to trigger the crash. The host consistently crashes when attempting to install Ubuntu 22.04 Desktop in a VM; I have tested this twice in a row with the same result. The ISO image is the same one I used before the upgrade to 8.2.2, and it had worked fine for many installs without any crashes.

Steps to reproduce:
1. Create a new VM with typical settings (4GB RAM, 2 CPU cores, SCSI virtio disk, etc.)
2. Mount the Ubuntu 22.04.3 Desktop ISO and start the VM
3. Proceed with the Ubuntu installation
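
For reference, a VM roughly matching these settings can also be created from the CLI; this is just a sketch with an example VM ID, storage names, and ISO path, not my exact configuration:

Code:
# hypothetical example: VM ID 110, disk on local-lvm, ISO uploaded to the "local" storage
qm create 110 --name ubuntu-2204-test --memory 4096 --cores 2 \
  --net0 virtio,bridge=vmbr0 \
  --scsihw virtio-scsi-pci --scsi0 local-lvm:32 \
  --ide2 local:iso/ubuntu-22.04.3-desktop-amd64.iso,media=cdrom \
  --boot 'order=scsi0;ide2' --ostype l26
qm start 110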

Sometime during the "Installing system" phase, the Proxmox host becomes unresponsive with something between an oops and a panic. Looking at the logs, I see:

Code:
Apr 27 13:42:52 prxmx kernel: BUG: kernel NULL pointer dereference, address: 0000000000000008
Apr 27 13:42:52 prxmx kernel: #PF: supervisor write access in kernel mode
Apr 27 13:42:52 prxmx kernel: #PF: error_code(0x0002) - not-present page
Apr 27 13:42:52 prxmx kernel: PGD 0 P4D 0
Apr 27 13:42:52 prxmx kernel: Oops: 0002 [#1] PREEMPT SMP NOPTI
Apr 27 13:42:52 prxmx kernel: CPU: 0 PID: 1549 Comm: kvm Not tainted 6.8.4-2-pve #1

...
Apr 27 13:42:52 prxmx kernel: WARNING: CPU: 0 PID: 1549 at kernel/exit.c:820 do_exit+0x8dd/0xae0



The full kernel oops log is attached. It looks like the crash occurs in `blk_flush_complete_seq+0x291`. The call trace following the oops shows that the kernel then proceeded to forcibly terminate the offending process (kvm, PID 1549).

The Proxmox host was stable before upgrading to PVE 8.2.2 with the new kernel (coming from 8.1.4, if I recall correctly). The WebUI and SSH are unresponsive, but attaching a monitor and keyboard and logging in locally still works, which speaks against a full kernel panic. In htop/top you can see kvm (with a different PID) using 100% CPU.


I am not the only one experiencing random freezes/crashes on Proxmox VE 8.2.2, as there are other reports of similar behavior.

Any known workarounds besides using an older kernel?

Let me know if any other details would be helpful for debugging. Thanks!
I hope the gathered information helps you; as for me, I need my server back in working order, so I will try to revert to the old kernel.
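
In case it helps anyone gather the same data: assuming persistent journaling is enabled, the kernel messages from the crashed boot can be pulled after a reset with something like:

Code:
# kernel messages from the previous boot
journalctl -k -b -1 --no-pager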

System Specifications:
  • CPU: Intel Core i9-9900K
  • Motherboard: Gigabyte B360 HD3PLM
  • Kernel: Linux 6.8.4-2-pve (compiled on 2024-04-10T17:36Z)
  • Hosting: Hetzner Dedicated Server
  • Licensed Proxmox with 12 VMs (approximately 3 running)
Code:
lspci
00:00.0 Host bridge: Intel Corporation 8th/9th Gen Core 8-core Desktop Processor Host Bridge/DRAM Registers [Coffee Lake S] (rev 0d)
00:01.0 PCI bridge: Intel Corporation 6th-10th Gen Core Processor PCIe Controller (x16) (rev 0d)
00:02.0 VGA compatible controller: Intel Corporation CoffeeLake-S GT2 [UHD Graphics 630] (rev 02)
00:12.0 Signal processing controller: Intel Corporation Cannon Lake PCH Thermal Controller (rev 10)
00:14.0 USB controller: Intel Corporation Cannon Lake PCH USB 3.1 xHCI Host Controller (rev 10)
00:14.2 RAM memory: Intel Corporation Cannon Lake PCH Shared SRAM (rev 10)
00:16.0 Communication controller: Intel Corporation Cannon Lake PCH HECI Controller (rev 10)
00:17.0 SATA controller: Intel Corporation Cannon Lake PCH SATA AHCI Controller (rev 10)
00:1b.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #21 (rev f0)
00:1d.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #9 (rev f0)
00:1f.0 ISA bridge: Intel Corporation Device a308 (rev 10)
00:1f.4 SMBus: Intel Corporation Cannon Lake PCH SMBus Controller (rev 10)
00:1f.5 Serial bus controller: Intel Corporation Cannon Lake PCH SPI Controller (rev 10)
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (7) I219-LM (rev 10)
01:00.0 Non-Volatile memory controller: Micron Technology Inc 3400 NVMe SSD [Hendrix]
02:00.0 Non-Volatile memory controller: Micron Technology Inc 3400 NVMe SSD [Hendrix]
 

Same issue since upgrading to the newest version (8.2.2), including the new kernels.

It's occurring only when I start my VM that has an i915 GPU (passthrough).

Got a hint: don't set your CPU type to "host" --> did not solve the issue.

Fallback solution: older kernel:
Automatically selected kernels:
6.5.13-5-pve - Works
6.8.1-1-pve - fails (100% CPU)
6.8.4-2-pve - fails (100% CPU)

Pinned kernel:
6.5.13-5-pve
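
For reference, a sketch of how the pin is typically done on current PVE (assuming the 6.5.13-5 kernel package is still installed):

Code:
proxmox-boot-tool kernel list               # shows installed and pinned kernels
proxmox-boot-tool kernel pin 6.5.13-5-pve
proxmox-boot-tool refresh                   # make sure the boot entries are updated
reboot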
 

"Same issue here" on AMD EPYC 7313P, Supermicro H12SSL-i System.

The crash happened when a GitLab runner started a build inside a Debian VM (system load going from light to heavy).

## journalctl log

Bash:
Apr 28 22:13:54 node01 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000008
Apr 28 22:13:54 node01 kernel: #PF: supervisor write access in kernel mode
Apr 28 22:13:54 node01 kernel: #PF: error_code(0x0002) - not-present page
Apr 28 22:13:54 node01 kernel: PGD 0 P4D 0
Apr 28 22:13:54 node01 kernel: Oops: 0002 [#1] PREEMPT SMP NOPTI
Apr 28 22:13:54 node01 kernel: CPU: 4 PID: 13075 Comm: kvm Tainted: P           O       6.8.4-2-pve #1
[...]
Apr 28 22:13:54 node01 kernel: ------------[ cut here ]------------
Apr 28 22:13:54 node01 kernel: WARNING: CPU: 4 PID: 13075 at kernel/exit.c:820 do_exit+0x8dd/0xae0
Apr 28 22:13:54 node01 kernel: Modules linked in: tcp_diag inet_diag vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter scsi_transport_iscsi nf_tables 8021q garp mrp softdog bonding tls sunrpc binfmt_misc nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd kvm irqbypass crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd ast ipmi_ssif rapl wmi_bmof pcspkr ccp i2c_algo_bit acpi_ipmi k10temp ptdma ipmi_si ipmi_devintf ipmi_msghandler joydev input_leds mac_hid zfs(PO) spl(O) vhost_net vhost vhost_iotlb tap efi_pstore dmi_sysfs ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 mlx4_ib ib_uverbs hid_generic rndis_host usbmouse cdc_ether ib_core mlx4_en usbnet usbhid mii hid raid1 crc32_pclmul mpt3sas xhci_pci xhci_pci_renesas nvme raid_class tg3 mlx4_core
Apr 28 22:13:54 node01 kernel:  scsi_transport_sas xhci_hcd nvme_core i2c_piix4 nvme_auth wmi
Apr 28 22:13:54 node01 kernel: CPU: 4 PID: 13075 Comm: kvm Tainted: P      D    O       6.8.4-2-pve #1
[...]

## Temporary workaround (no long-term results yet)

Bash:
apt install proxmox-kernel-6.5.13-5-pve
pve-efiboot-tool kernel pin 6.5.13-5-pve
reboot

thanks to @zolbarna for providing the last working kernel version!
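
After the reboot it is worth confirming that the pin took effect, and noting how to undo it once a fixed kernel is released (a sketch, not part of the original workaround):

Bash:
uname -r                        # should now report 6.5.13-5-pve
proxmox-boot-tool kernel unpin  # later: go back to the newest installed kernel
reboot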
 
Similar issue: since Proxmox 8.2.2, Ceph 17.2.7-pve3 and kernel 6.8.2, we have had 4 kernel panics over 2 days, twice on systems with no load and twice on systems under load.
We don't have a crash log, but we are running a Supermicro X11DPL-i with dual Xeon Silver 4116 CPUs.
We have reverted to the 6.5 kernel and will monitor stability.

Interestingly, we also have a second cluster with different hardware that does not run local Ceph. It uses older Dell PowerEdge R710 systems (motherboard G1 PV9DG) with dual Xeon E5520 CPUs. This one was stable on the new kernel.
 
We have 3 x AMD EPYC 7763 systems with Gigabyte MZ32-AR0 boards and 6 x NVMe SSDs, which we updated to the latest kernel version. All of them are randomly freezing on the latest PVE kernel 6.8.4-3. There are no errors or issues in IPMI, and nothing in the system logs indicates a problem.

We have reverted these nodes back to kernel 6.5.13-5
 
I have the same issue with PVE Kernel 6.8.4-3:

Code:
2024-05-18T22:41:23.909993+03:00 pve2 kernel: [ 9186.693130] BUG: kernel NULL pointer dereference, address: 0000000000000008
2024-05-18T22:41:23.910010+03:00 pve2 kernel: [ 9186.693137] #PF: supervisor write access in kernel mode
2024-05-18T22:41:23.910011+03:00 pve2 kernel: [ 9186.693139] #PF: error_code(0x0002) - not-present page
2024-05-18T22:41:23.910012+03:00 pve2 kernel: [ 9186.693141] PGD 0 P4D 0
2024-05-18T22:41:23.910013+03:00 pve2 kernel: [ 9186.693144] Oops: 0002 [#1] PREEMPT SMP NOPTI
2024-05-18T22:41:23.910013+03:00 pve2 kernel: [ 9186.693147] CPU: 6 PID: 1574 Comm: kvm Tainted: P           O       6.8.4-3-pve #1
2024-05-18T22:41:23.910014+03:00 pve2 kernel: [ 9186.693150] Hardware name: Dell Inc. OptiPlex 3080/0M3F6C, BIOS 2.23.1 12/25/2023
2024-05-18T22:41:23.910015+03:00 pve2 kernel: [ 9186.693152] RIP: 0010:blk_flush_complete_seq+0x291/0x2d0
2024-05-18T22:41:23.910015+03:00 pve2 kernel: [ 9186.693157] Code: 0f b6 f6 49 8d 56 01 49 c1 e6 04 4d 01 ee 48 c1 e2 04 49 8b 4e 10 4c 01 ea 48 39 ca 74 2b 48 8b 4b 50 48 8b 7b 48 48 8d 73 48 <48> 89 4f 08 48 89 39 49 8b 4e 18 49 89 76 18 48 89 53 48 48 89 4b
2024-05-18T22:41:23.910017+03:00 pve2 kernel: [ 9186.693161] RSP: 0018:ffffa9dac34c3a60 EFLAGS: 00010046
2024-05-18T22:41:23.910018+03:00 pve2 kernel: [ 9186.693164] RAX: 0000000000000000 RBX: ffff977b94ce8e00 RCX: ffff977b94ce8e48
2024-05-18T22:41:23.910018+03:00 pve2 kernel: [ 9186.693166] RDX: ffff977b9407a910 RSI: ffff977b94ce8e48 RDI: 0000000000000000
2024-05-18T22:41:23.910019+03:00 pve2 kernel: [ 9186.693168] RBP: ffffa9dac34c3aa0 R08: 0000000000000000 R09: 0000000000000000
2024-05-18T22:41:23.910020+03:00 pve2 kernel: [ 9186.693170] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000029801
2024-05-18T22:41:23.910020+03:00 pve2 kernel: [ 9186.693172] R13: ffff977b9407a900 R14: ffff977b9407a900 R15: ffff977b9463ab20
2024-05-18T22:41:23.910020+03:00 pve2 kernel: [ 9186.693174] FS:  000070869b5f8340(0000) GS:ffff978a80300000(0000) knlGS:0000000000000000
2024-05-18T22:41:23.910021+03:00 pve2 kernel: [ 9186.693177] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2024-05-18T22:41:23.910021+03:00 pve2 kernel: [ 9186.693179] CR2: 0000000000000008 CR3: 000000010c8e6004 CR4: 00000000003726f0
 
In our test cluster (3 nodes), comprising 2x AMD EPYC 7313P and 1x Intel E5-1650v3, no further crashes have occurred in over 6 days (the last reboot was due to unrelated configuration changes). All nodes are running Linux 6.8.4-3-pve.

All nodes in this cluster also run ceph version 17.2.7 and host 4 OSDs each (NVMe).

Has anyone else observed something similar?
 
In our test cluster (3 nodes), comprising 2x AMD EPYC 7313P and 1x Intel E5-1650v3, no further crashes have occurred in over 6 days (the last reboot was due to unrelated configuration changes). All nodes are running Linux 6.8.4-3-pve.

All nodes in this cluster also run ceph version 17.2.7 and host 4 OSDs each (NVMe).

Has anyone else observed something similar?

The symptoms have changed (for me), though I’m not sure if the underlying issue has.

I installed Ubuntu as described in the first post, this time on kernel 6.8.4-3-pve. The only difference now is that the WebUI is still accessible, though with extremely long load times, and I can use xterm.js to access the console on the host. However, while writing this post, the WebUI also became unresponsive.
[Screenshot: dead WebUI]
Everything else remains the same:
• vncproxy is either dead or not responding (while the WebUI was still up; as mentioned, it has since died too).
• CPU cores are locked at 100%.
• Running VMs are unresponsive via RDP/Wireguard.
• SSH to the host is dead.

It seems the only change in symptoms is that the host remains operational for a few more minutes before becoming completely unresponsive.
(I haven’t checked physical console access like the last time.)
 
Any updates or fixes?
Despite efforts, the issue persists. A suggestion to disable IOMMU, which had been automatically enabled, only temporarily alleviated the unresponsiveness.
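
For context, disabling IOMMU is usually done on the kernel command line; a rough sketch for an Intel host booting via GRUB (ZFS-root/systemd-boot setups edit /etc/kernel/cmdline and run proxmox-boot-tool refresh instead):

Code:
# in /etc/default/grub, add intel_iommu=off (amd_iommu=off on AMD hosts)
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=off"
# then apply and reboot
update-grub
reboot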

Regrettably, the Proxmox support team appears to be unresponsive to this concern. As a subscriber without the support option, I understand their lack of direct assistance. However, it's concerning that this issue seems to affect numerous users without being addressed.
 
Hi,
a build of kernel 6.8.8 has been available on the no-subscription repository for two weeks now. It should resolve some of the issues people have reported with 6.8.4. If you have a subscription, you can temporarily enable the repository, run apt update, install the new kernel with apt install proxmox-kernel-6.8, then disable the repository again and run apt update once more.
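
A minimal sketch of those steps (the repository line below is the standard pve-no-subscription entry for PVE 8 on Debian Bookworm; the file name is just an example):

Code:
echo "deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription" > /etc/apt/sources.list.d/pve-no-subscription.list
apt update
apt install proxmox-kernel-6.8
# afterwards, remove the temporary repository entry again
rm /etc/apt/sources.list.d/pve-no-subscription.list
apt update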
 
Despite efforts, the issue persists. A suggestion to disable IOMMU, which had been automatically enabled, only temporarily alleviated the unresponsiveness.

Regrettably, the Proxmox support team appears to be unresponsive to this concern. As a subscriber without the support option, I understand their lack of direct assistance. However, it's concerning that this issue seems to affect numerous users without being addressed.
Hi, to add to what @fiona already wrote: In your first message in this thread, you reported a NULL pointer dereference with RIP pointing to blk_flush_complete_seq:
Code:
Apr 27 13:42:52 prxmx kernel: BUG: kernel NULL pointer dereference, address: 0000000000000008
[...]
Apr 27 13:42:52 prxmx kernel: RIP: 0010:blk_flush_complete_seq+0x291/0x2d0
This issue in particular should be fixed in kernel 6.8.8-1 (and higher). See [1] for more details. Please test the newer kernel and report back, especially if you still see crashes.

[1] https://forum.proxmox.com/threads/145760/page-8#post-674842
 
I'm still holding off the 6.8 kernel update because of all the problems I am seeing here in the forums and have pinned the 6.5.13-5-pve kernel for now, as it is working fine.
Maybe Proxmox staff can issue an 'all clear' message when it's safe for us no-subscription paupers to run the new kernel?
Embarrassing, isn't it?
 
I'm still holding off the 6.8 kernel update because of all the problems I am seeing here in the forums and have pinned the 6.5.13-5-pve kernel for now, as it is working fine.
Maybe Proxmox staff can issue an 'all clear' message when it's safe for us no-subscription paupers to run the new kernel?
Embarrassing, isn't it?
There is never a kernel that works for every possible hardware configuration. There are always kernel regressions; it's simply too big a piece of software, written in a very old language. Monitoring the forum (in particular the kernel or release announcement threads) for issues reported by people with similar hardware is good, but the best approach is to test whether the new kernel works for your setup when you have a maintenance window. Nobody else will be able to guarantee that 100% up front. We do try to minimize issues, but we can only test on the hardware we have and backport fixes that can be identified.
 
