[SOLVED] Reproducible Proxmox crash on kernel 6.8.4-2-pve: KVM force termination renders Web UI, SSH, and all running VMs unresponsive

Apr 27, 2024
Hello everyone :)

I recently upgraded to Proxmox VE 8.2.2 running kernel 6.8.4-2-pve (with an active subscription). Since the upgrade I have experienced two seemingly random host crashes/freezes where the Proxmox host became completely unresponsive, but I was unable to determine the cause at the time, especially since they happened at night without any load.

However, I have now found a reproducible way to trigger the crash. The host consistently crashes when attempting to install Ubuntu 22.04 Desktop in a VM. I have tested this twice in a row with the same result. The ISO image is the same one I used before the upgrade to 8.2.2 and had worked fine for many installs without any crashes.

Steps to reproduce:
1. Create a new VM with typical settings (4GB RAM, 2 CPU cores, SCSI virtio disk, etc.)
2. Mount the Ubuntu 22.04.3 Desktop ISO and start the VM
3. Proceed with the Ubuntu installation
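
For reference, a VM with roughly these settings could also be created from the CLI. This is only a sketch: the VMID (100), the storage names (local-lvm, local), and the exact ISO filename are assumptions, not taken from my actual setup.

Bash:
# Sketch only: VMID, storage names, and ISO filename are assumptions
qm create 100 --name ubuntu-test --memory 4096 --cores 2 \
  --scsihw virtio-scsi-pci --scsi0 local-lvm:32 \
  --net0 virtio,bridge=vmbr0 --ostype l26 \
  --cdrom local:iso/ubuntu-22.04.3-desktop-amd64.iso
qm start 100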

Sometime during the "Installing system" phase, the Proxmox host becomes unresponsive with something between an oops and a panic. Looking at the logs, I see:

Code:
Apr 27 13:42:52 prxmx kernel: BUG: kernel NULL pointer dereference, address: 0000000000000008
Apr 27 13:42:52 prxmx kernel: #PF: supervisor write access in kernel mode
Apr 27 13:42:52 prxmx kernel: #PF: error_code(0x0002) - not-present page
Apr 27 13:42:52 prxmx kernel: PGD 0 P4D 0
Apr 27 13:42:52 prxmx kernel: Oops: 0002 [#1] PREEMPT SMP NOPTI
Apr 27 13:42:52 prxmx kernel: CPU: 0 PID: 1549 Comm: kvm Not tainted 6.8.4-2-pve #1

...
Apr 27 13:42:52 prxmx kernel: WARNING: CPU: 0 PID: 1549 at kernel/exit.c:820 do_exit+0x8dd/0xae0



The full kernel oops log is attached. The crash appears to occur in `blk_flush_complete_seq+0x291`. The call trace following the oops shows that the kernel then forcibly terminated the offending process (kvm, PID 1549).

The Proxmox host was stable before the upgrade to PVE 8.2.2 with the new kernel (coming from 8.1.4, if I recall correctly). The WebUI and SSH are unresponsive, but attaching a monitor and keyboard and logging in locally still works, which speaks against a full kernel panic. In htop/top you can see kvm (with a different PID) at 100% CPU load.
[Screenshot: htop on the local console showing a kvm process at 100% CPU]
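
In case it helps anyone else gathering data from the local console after such a hang, this is roughly what I ran there; a sketch, assuming persistent journaling is enabled so the crashed boot's log survives a hard reset:

Bash:
# kernel messages from the current boot (the oops should be near the end)
dmesg -T | tail -n 50
# after a hard reset: kernel log of the previous (crashed) boot
# (requires persistent journaling, e.g. Storage=persistent in journald.conf)
journalctl -k -b -1 | tail -n 50
# identify the spinning kvm process
ps -eo pid,pcpu,args --sort=-pcpu | head -n 5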


I am not the only one experiencing random freezes/crashes on Proxmox VE 8.2.2, as there are other reports of similar behavior.

Any known workarounds besides using an older kernel?

Let me know if any other details would be helpful for debugging. Thanks!
I hope the gathered information helps; as for me, I need my server back in working order, so I will try reverting to the old kernel.

System Specifications:
  • CPU: Intel Core i9-9900K
  • Motherboard: Gigabyte B360 HD3PLM
  • Kernel: Linux 6.8.4-2-pve (compiled on 2024-04-10T17:36Z)
  • Hosting: Hetzner Dedicated Server
  • Licensed Proxmox with 12 VMs (approximately 3 running)
Code:
lspci
00:00.0 Host bridge: Intel Corporation 8th/9th Gen Core 8-core Desktop Processor Host Bridge/DRAM Registers [Coffee Lake S] (rev 0d)
00:01.0 PCI bridge: Intel Corporation 6th-10th Gen Core Processor PCIe Controller (x16) (rev 0d)
00:02.0 VGA compatible controller: Intel Corporation CoffeeLake-S GT2 [UHD Graphics 630] (rev 02)
00:12.0 Signal processing controller: Intel Corporation Cannon Lake PCH Thermal Controller (rev 10)
00:14.0 USB controller: Intel Corporation Cannon Lake PCH USB 3.1 xHCI Host Controller (rev 10)
00:14.2 RAM memory: Intel Corporation Cannon Lake PCH Shared SRAM (rev 10)
00:16.0 Communication controller: Intel Corporation Cannon Lake PCH HECI Controller (rev 10)
00:17.0 SATA controller: Intel Corporation Cannon Lake PCH SATA AHCI Controller (rev 10)
00:1b.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #21 (rev f0)
00:1d.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #9 (rev f0)
00:1f.0 ISA bridge: Intel Corporation Device a308 (rev 10)
00:1f.4 SMBus: Intel Corporation Cannon Lake PCH SMBus Controller (rev 10)
00:1f.5 Serial bus controller: Intel Corporation Cannon Lake PCH SPI Controller (rev 10)
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (7) I219-LM (rev 10)
01:00.0 Non-Volatile memory controller: Micron Technology Inc 3400 NVMe SSD [Hendrix]
02:00.0 Non-Volatile memory controller: Micron Technology Inc 3400 NVMe SSD [Hendrix]
 

Attachments

  • kernel-log.txt
13 KB
Same issue since upgrading to the newest version (8.2.2), including its kernels.

It occurs only when I start my VM that has an i915 GPU passed through.

Got a hint: don't set your CPU type to "host" --> did not solve the issue

Fallback solution: older kernel.
Automatically selected kernels:
6.5.13-5-pve - works
6.8.1-1-pve - fails (VM stuck at 100% CPU)
6.8.4-2-pve - fails (VM stuck at 100% CPU)

Pinned kernel:
6.5.13-5-pve
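
For anyone who has not pinned a kernel before: the listing above looks like the output of proxmox-boot-tool, and pinning can be done along these lines (a sketch of the commands I used):

Bash:
# show available kernels and the currently pinned one
proxmox-boot-tool kernel list
# pin the known-good kernel so it is booted by default
proxmox-boot-tool kernel pin 6.5.13-5-pve
reboot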
 

Attachments

  • dmesg.txt
7 KB
"Same issue here" on AMD EPYC 7313P, Supermicro H12SSL-i System.

The crash happened when a Gitlab runner started building inside a Debian VM (small system load -> heavy system load)

## journalctl log

Bash:
Apr 28 22:13:54 node01 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000008
Apr 28 22:13:54 node01 kernel: #PF: supervisor write access in kernel mode
Apr 28 22:13:54 node01 kernel: #PF: error_code(0x0002) - not-present page
Apr 28 22:13:54 node01 kernel: PGD 0 P4D 0
Apr 28 22:13:54 node01 kernel: Oops: 0002 [#1] PREEMPT SMP NOPTI
Apr 28 22:13:54 node01 kernel: CPU: 4 PID: 13075 Comm: kvm Tainted: P           O       6.8.4-2-pve #1
[...]
Apr 28 22:13:54 node01 kernel: ------------[ cut here ]------------
Apr 28 22:13:54 node01 kernel: WARNING: CPU: 4 PID: 13075 at kernel/exit.c:820 do_exit+0x8dd/0xae0
Apr 28 22:13:54 node01 kernel: Modules linked in: tcp_diag inet_diag vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter scsi_transport_iscsi nf_tables 8021q garp mrp softdog bonding tls sunrpc binfmt_misc nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd kvm irqbypass crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd ast ipmi_ssif rapl wmi_bmof pcspkr ccp i2c_algo_bit acpi_ipmi k10temp ptdma ipmi_si ipmi_devintf ipmi_msghandler joydev input_leds mac_hid zfs(PO) spl(O) vhost_net vhost vhost_iotlb tap efi_pstore dmi_sysfs ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 mlx4_ib ib_uverbs hid_generic rndis_host usbmouse cdc_ether ib_core mlx4_en usbnet usbhid mii hid raid1 crc32_pclmul mpt3sas xhci_pci xhci_pci_renesas nvme raid_class tg3 mlx4_core
Apr 28 22:13:54 node01 kernel:  scsi_transport_sas xhci_hcd nvme_core i2c_piix4 nvme_auth wmi
Apr 28 22:13:54 node01 kernel: CPU: 4 PID: 13075 Comm: kvm Tainted: P      D    O       6.8.4-2-pve #1
[...]

## Temporary workaround (no long-term results yet)

Bash:
apt install proxmox-kernel-6.5.13-5-pve
pve-efiboot-tool kernel pin 6.5.13-5-pve
reboot
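
(For what it's worth: pve-efiboot-tool is the legacy name; on current installs the same command is available as proxmox-boot-tool, i.e. proxmox-boot-tool kernel pin 6.5.13-5-pve.)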

thanks to @zolbarna for providing the last working kernel version!
 
Similar issue: since Proxmox 8.2.2, Ceph 17.2.7-pve3, and kernel 6.8.2 we have had 4 kernel panics over 2 days, twice on systems with no load and twice on systems under load.
We don't have a crash log, but we are running a Supermicro X11DPL-i with dual Xeon Silver 4116 CPUs.
We have reverted to the 6.5 kernel and will monitor stability.

Interestingly, we also have a second cluster on different hardware, not running local Ceph. It uses an older Dell PowerEdge R710 (system motherboard G1 PV9DG) with dual Xeon E5520 CPUs. That one was stable on the new kernel.
 
We have 3 x AMD EPYC 7763 systems with Gigabyte MZ32-AR0 boards and 6 x NVMe SSDs, which we updated to the latest kernel version. All of them are randomly freezing on the latest PVE kernel 6.8.4-3. There are no errors or issues in IPMI, and nothing in the system logs that indicates a problem.

We have reverted these nodes back to kernel 6.5.13-5
 
I have the same issue with PVE Kernel 6.8.4-3:

Code:
2024-05-18T22:41:23.909993+03:00 pve2 kernel: [ 9186.693130] BUG: kernel NULL pointer dereference, address: 0000000000000008
2024-05-18T22:41:23.910010+03:00 pve2 kernel: [ 9186.693137] #PF: supervisor write access in kernel mode
2024-05-18T22:41:23.910011+03:00 pve2 kernel: [ 9186.693139] #PF: error_code(0x0002) - not-present page
2024-05-18T22:41:23.910012+03:00 pve2 kernel: [ 9186.693141] PGD 0 P4D 0
2024-05-18T22:41:23.910013+03:00 pve2 kernel: [ 9186.693144] Oops: 0002 [#1] PREEMPT SMP NOPTI
2024-05-18T22:41:23.910013+03:00 pve2 kernel: [ 9186.693147] CPU: 6 PID: 1574 Comm: kvm Tainted: P           O       6.8.4-3-pve #1
2024-05-18T22:41:23.910014+03:00 pve2 kernel: [ 9186.693150] Hardware name: Dell Inc. OptiPlex 3080/0M3F6C, BIOS 2.23.1 12/25/2023
2024-05-18T22:41:23.910015+03:00 pve2 kernel: [ 9186.693152] RIP: 0010:blk_flush_complete_seq+0x291/0x2d0
2024-05-18T22:41:23.910015+03:00 pve2 kernel: [ 9186.693157] Code: 0f b6 f6 49 8d 56 01 49 c1 e6 04 4d 01 ee 48 c1 e2 04 49 8b 4e 10 4c 01 ea 48 39 ca 74 2b 48 8b 4b 50 48 8b 7b 48 48 8d 73 48 <48> 89 4f 08 48 89 39 49 8b 4e 18 49 89 76 18 48 89 53 48 48 89 4b
2024-05-18T22:41:23.910017+03:00 pve2 kernel: [ 9186.693161] RSP: 0018:ffffa9dac34c3a60 EFLAGS: 00010046
2024-05-18T22:41:23.910018+03:00 pve2 kernel: [ 9186.693164] RAX: 0000000000000000 RBX: ffff977b94ce8e00 RCX: ffff977b94ce8e48
2024-05-18T22:41:23.910018+03:00 pve2 kernel: [ 9186.693166] RDX: ffff977b9407a910 RSI: ffff977b94ce8e48 RDI: 0000000000000000
2024-05-18T22:41:23.910019+03:00 pve2 kernel: [ 9186.693168] RBP: ffffa9dac34c3aa0 R08: 0000000000000000 R09: 0000000000000000
2024-05-18T22:41:23.910020+03:00 pve2 kernel: [ 9186.693170] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000029801
2024-05-18T22:41:23.910020+03:00 pve2 kernel: [ 9186.693172] R13: ffff977b9407a900 R14: ffff977b9407a900 R15: ffff977b9463ab20
2024-05-18T22:41:23.910020+03:00 pve2 kernel: [ 9186.693174] FS:  000070869b5f8340(0000) GS:ffff978a80300000(0000) knlGS:0000000000000000
2024-05-18T22:41:23.910021+03:00 pve2 kernel: [ 9186.693177] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2024-05-18T22:41:23.910021+03:00 pve2 kernel: [ 9186.693179] CR2: 0000000000000008 CR3: 000000010c8e6004 CR4: 00000000003726f0
 
In our test cluster (3 nodes), comprising 2x AMD EPYC 7313P and 1x Intel E5-1650v3, no further crashes have occurred in the past 6+ days (the last reboot was due to unrelated configuration changes). All nodes are running Linux 6.8.4-3-pve.

All nodes in this cluster also run Ceph version 17.2.7 and host 4 OSDs each (NVMe).

Has anyone else observed similar behavior?
 
Quote:
In our test cluster (3 nodes), comprising 2x AMD EPYC 7313P and 1x Intel E5-1650v3, no further crashes have occurred in the past 6+ days (the last reboot was due to unrelated configuration changes). All nodes are running Linux 6.8.4-3-pve. All nodes in this cluster also run Ceph version 17.2.7 and host 4 OSDs each (NVMe). Has anyone else observed similar behavior?

The symptoms have changed (for me), though I’m not sure if the underlying issue has.

I installed Ubuntu as described in the first post, on kernel 6.8.4-3-pve. The only difference now is that the WebUI is still accessible, though with extremely long load times, and I can use xterm.js to access the console on the host. However, while writing this post, the WebUI also became unresponsive.
[Screenshot: dead WebUI]
Everything else remains the same:
• vncproxy is either dead or not responding (observed while the WebUI was still up; as mentioned, it has since died).
• CPU cores are locked at 100%.
• Running VMs are unresponsive via RDP/Wireguard.
• SSH to the host is dead.

It seems the only change in symptoms is that the host remains operational for a few more minutes before becoming completely unresponsive.
(I haven’t checked physical console access like the last time.)
 
Any updates or fixes?
Despite efforts, the issue persists. A suggestion to disable IOMMU, which had been automatically enabled, only temporarily alleviated the unresponsiveness.

Regrettably, the Proxmox support team appears to be unresponsive to this concern. As a subscriber without the support option, I understand their lack of direct assistance. However, it's concerning that this issue seems to affect numerous users without being addressed.
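
(For anyone who wants to try the IOMMU suggestion: on a GRUB-booted Intel host it amounts to something like the sketch below. The sed line assumes the default GRUB_CMDLINE_LINUX_DEFAULT="quiet"; systemd-boot/ZFS setups would edit /etc/kernel/cmdline and run proxmox-boot-tool refresh instead. Use amd_iommu=off on AMD hosts.)

Bash:
# append intel_iommu=off to the kernel command line (assumes the default "quiet")
sed -i 's/GRUB_CMDLINE_LINUX_DEFAULT="quiet"/GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=off"/' /etc/default/grub
update-grub
reboot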
 
Hi,
a build of kernel 6.8.8 has been available on the no-subscription repository for two weeks. It should resolve some of the issues people have reported with 6.8.4. If you have a subscription, you can temporarily enable the repository, run apt update, install the new kernel with apt install proxmox-kernel-6.8, then disable the repository again and run apt update once more.
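
For reference, a sketch of those steps, assuming PVE 8 on Debian bookworm (adjust the suite name otherwise):

Bash:
# temporarily enable the no-subscription repository
echo "deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription" \
  > /etc/apt/sources.list.d/pve-no-subscription.list
apt update
apt install proxmox-kernel-6.8
# disable the repository again and refresh the package index
rm /etc/apt/sources.list.d/pve-no-subscription.list
apt update
# reboot into the new kernel, then confirm with: uname -r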
 
Quote:
Despite efforts, the issue persists. A suggestion to disable IOMMU, which had been automatically enabled, only temporarily alleviated the unresponsiveness. Regrettably, the Proxmox support team appears to be unresponsive to this concern. As a subscriber without the support option, I understand their lack of direct assistance. However, it's concerning that this issue seems to affect numerous users without being addressed.
Hi, to add to what @fiona already wrote: In your first message in this thread, you reported a NULL pointer dereference with RIP pointing to blk_flush_complete_seq:
Code:
Apr 27 13:42:52 prxmx kernel: BUG: kernel NULL pointer dereference, address: 0000000000000008
[...]
Apr 27 13:42:52 prxmx kernel: RIP: 0010:blk_flush_complete_seq+0x291/0x2d0
This issue in particular should be fixed in kernel 6.8.8-1 (and higher). See [1] for more details. Please test the newer kernel and report back, especially if you still see crashes.

[1] https://forum.proxmox.com/threads/145760/page-8#post-674842
 
I'm still holding off on the 6.8 kernel update because of all the problems I'm seeing here in the forums, and I have pinned the 6.5.13-5-pve kernel for now, as it is working fine.
Maybe Proxmox staff can issue an "all clear" message when it's safe for us no-subscription paupers to run the new kernel?
Embarrassing, isn't it?
 
Quote:
I'm still holding off on the 6.8 kernel update because of all the problems I'm seeing here in the forums, and I have pinned the 6.5.13-5-pve kernel for now, as it is working fine. Maybe Proxmox staff can issue an "all clear" message when it's safe for us no-subscription paupers to run the new kernel? Embarrassing, isn't it?
There is never a kernel that works for every possible hardware configuration. There are always kernel regressions; it's simply too big a piece of software, written in a very old language. Monitoring the forum (in particular the kernel or release announcement threads) for issues reported by people with similar hardware is good, but it is best to test whether the new kernel works for your setup when you have a maintenance window. Nobody else can guarantee that 100% up front. We do try to minimize issues, but we can only test on the hardware we have and backport the fixes that can be identified.
 
