PVE Kernel Panics on Reboots

linuxgemini · Apr 4, 2024

I recently upgraded my Proxmox VE instance to Hetzner's AX52 platform (Ryzen 7 7700 on an Asus Pro WS 665-ACE motherboard, BIOS v1711) with 128 GB of non-ECC RAM and 4x 1 TB NVMe storage.

I installed Proxmox VE 8 on top of Debian by following the wiki page for it (https://pve.proxmox.com/wiki/Install_Proxmox_VE_on_Debian_12_Bookworm). Before I installed proxmox-default-kernel, reboots worked without problems.

Hetzner's installimage program did create a rather interesting storage layout, though: RAID1 for the ESP (md0, vfat), swap (md1) and /boot (md2, ext3), and RAID5 for the rootfs (md3; I intentionally chose xfs).

The first reboot I issued while running the PVE kernel ended in a kernel panic. Luckily I had a KVM plugged in at the time, so I have a screendump of it:

Unfortunately the panic happens right at the end of shutdown (the system panics instead of halting), so the last log lines don't give much information:

Code:

Mar 30 20:06:09.368067 hel1.domainwithe.ld kernel: pcieport 0000:00:01.4: AER: Corrected error received: 0000:03:00.0
Mar 30 20:06:09.368223 hel1.domainwithe.ld kernel: nvme 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Mar 30 20:06:09.368282 hel1.domainwithe.ld kernel: nvme 0000:03:00.0:   device [144d:a80a] error status/mask=00000001/0000e000
Mar 30 20:06:09.368330 hel1.domainwithe.ld kernel: nvme 0000:03:00.0:    [ 0] RxErr                  (First)
Mar 30 20:06:14.352074 hel1.domainwithe.ld kernel: pcieport 0000:00:01.4: AER: Corrected error received: 0000:03:00.0
Mar 30 20:06:14.352268 hel1.domainwithe.ld kernel: nvme 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Mar 30 20:06:14.352324 hel1.domainwithe.ld kernel: nvme 0000:03:00.0:   device [144d:a80a] error status/mask=00000001/0000e000
Mar 30 20:06:14.352366 hel1.domainwithe.ld kernel: nvme 0000:03:00.0:    [ 0] RxErr                  (First)
Mar 30 20:06:19.324053 hel1.domainwithe.ld kernel: EXT4-fs (md2): unmounting filesystem a6181d4d-8479-490b-839e-6332dffb668d.
Mar 30 20:06:19.376066 hel1.domainwithe.ld kernel: md0: detected capacity change from 524160 to 0
Mar 30 20:06:19.376113 hel1.domainwithe.ld kernel: md: md0 stopped.
Mar 30 20:06:19.584064 hel1.domainwithe.ld kernel: md2: detected capacity change from 2093056 to 0
Mar 30 20:06:19.584090 hel1.domainwithe.ld kernel: md: md2 stopped.
Mar 30 20:06:22.228061 hel1.domainwithe.ld kernel: tap100i0: left allmulticast mode
Mar 30 20:06:22.228131 hel1.domainwithe.ld kernel: vmbr3: port 1(tap100i0) entered disabled state
Mar 30 20:06:23.140064 hel1.domainwithe.ld kernel: tap108i0: left allmulticast mode
Mar 30 20:06:23.140135 hel1.domainwithe.ld kernel: vmbr1: port 9(tap108i0) entered disabled state
Mar 30 20:06:23.224071 hel1.domainwithe.ld kernel: tap105i0: left allmulticast mode
Mar 30 20:06:23.224127 hel1.domainwithe.ld kernel: vmbr1: port 7(tap105i0) entered disabled state
Mar 30 20:06:23.280071 hel1.domainwithe.ld kernel: tap104i0: left allmulticast mode
Mar 30 20:06:23.280120 hel1.domainwithe.ld kernel: vmbr1: port 6(tap104i0) entered disabled state
Mar 30 20:06:23.572091 hel1.domainwithe.ld kernel: tap104i1: left allmulticast mode
Mar 30 20:06:23.572156 hel1.domainwithe.ld kernel: vmbr3: port 2(tap104i1) entered disabled state
Mar 30 20:06:23.900068 hel1.domainwithe.ld kernel: tap109i0: left allmulticast mode
Mar 30 20:06:23.900145 hel1.domainwithe.ld kernel: vmbr1: port 10(tap109i0) entered disabled state
Mar 30 20:06:28.016072 hel1.domainwithe.ld kernel: tap103i0: left allmulticast mode
Mar 30 20:06:28.016156 hel1.domainwithe.ld kernel: vmbr1: port 5(tap103i0) entered disabled state
Mar 30 20:06:28.180078 hel1.domainwithe.ld kernel: tap101i0: left allmulticast mode
Mar 30 20:06:28.180144 hel1.domainwithe.ld kernel: vmbr1: port 4(tap101i0) entered disabled state
Mar 30 20:06:28.388077 hel1.domainwithe.ld kernel: tap101i1: left allmulticast mode
Mar 30 20:06:28.388147 hel1.domainwithe.ld kernel: vmbr2: port 1(tap101i1) entered disabled state
Mar 30 20:06:38.712075 hel1.domainwithe.ld kernel: tap106i0: left allmulticast mode
Mar 30 20:06:38.712147 hel1.domainwithe.ld kernel: vmbr0: port 2(tap106i0) entered disabled state
Mar 30 20:06:38.900072 hel1.domainwithe.ld kernel: tap106i1: left allmulticast mode
Mar 30 20:06:38.900134 hel1.domainwithe.ld kernel: vmbr1: port 8(tap106i1) entered disabled state
Mar 30 20:06:43.640069 hel1.domainwithe.ld kernel: tap111i0: left allmulticast mode
Mar 30 20:06:43.640144 hel1.domainwithe.ld kernel: vmbr1: port 3(tap111i0) entered disabled state
Mar 30 20:06:47.104071 hel1.domainwithe.ld kernel: pcieport 0000:00:01.4: AER: Corrected error received: 0000:03:00.0
Mar 30 20:06:47.104247 hel1.domainwithe.ld kernel: nvme 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Mar 30 20:06:47.104304 hel1.domainwithe.ld kernel: nvme 0000:03:00.0:   device [144d:a80a] error status/mask=00000001/0000e000
Mar 30 20:06:47.104344 hel1.domainwithe.ld kernel: nvme 0000:03:00.0:    [ 0] RxErr                  (First)
Mar 30 20:06:48.152263 hel1.domainwithe.ld kernel: tap115i0: left allmulticast mode
Mar 30 20:06:48.152355 hel1.domainwithe.ld kernel: vmbr1: port 2(tap115i0) entered disabled state
Mar 30 20:06:52.892250 hel1.domainwithe.ld kernel: tap102i0: left allmulticast mode
Mar 30 20:06:52.892314 hel1.domainwithe.ld kernel: vmbr1: port 1(tap102i0) entered disabled state
Mar 30 20:07:01.220269 hel1.domainwithe.ld kernel: watchdog: watchdog0: watchdog did not stop!
Mar 30 20:07:01.220373 hel1.domainwithe.ld systemd-shutdown[1]: Using hardware watchdog 'Software Watchdog', version 0, device /dev/watchdog0
Mar 30 20:07:01.220421 hel1.domainwithe.ld systemd-shutdown[1]: Watchdog running with a timeout of 10min.
Mar 30 20:07:01.224240 hel1.domainwithe.ld systemd-shutdown[1]: Syncing filesystems and block devices.
Mar 30 20:07:01.257454 hel1.domainwithe.ld systemd-shutdown[1]: Sending SIGTERM to remaining processes...
Mar 30 20:07:01.257554 hel1.domainwithe.ld systemd-journald[448]: Received SIGTERM from PID 1 (systemd-shutdow).

Hetzner told me that PCIe bus errors on NVMe devices are common on DDR5 platforms, though I'm not quite sure about that.
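
To put a number on how often these corrected errors show up, something like the following sketch could tally them per device from the kernel log. It is only an illustration: the function name count_corrected_errors is made up here, and the regular expression assumes the exact "AER: Corrected error received:" wording seen in the log above.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   pcieport 0000:00:01.4: AER: Corrected error received: 0000:03:00.0
# The captured group is the PCI address of the device that reported the error.
AER_RE = re.compile(r"AER: Corrected error received: (\S+)")


def count_corrected_errors(lines):
    """Return a Counter mapping PCI address -> number of corrected AER reports."""
    counts = Counter()
    for line in lines:
        match = AER_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Two lines copied from the log above; only the first is an AER report header.
sample = [
    "Mar 30 20:06:09.368067 hel1.domainwithe.ld kernel: pcieport 0000:00:01.4: "
    "AER: Corrected error received: 0000:03:00.0",
    "Mar 30 20:06:09.368223 hel1.domainwithe.ld kernel: nvme 0000:03:00.0: "
    "PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)",
]
for device, n in count_corrected_errors(sample).most_common():
    print(f"{device}: {n}")
```

Feeding it the lines from `journalctl -k -b -1` (kernel messages from the previous boot) instead of the sample should give one count per reporting device.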

I suspect that the RAID5 array (md3) doesn't shut down in time, and that the kernel then fails to handle this gracefully(?)

I am including a (hand-redacted) system report generated from the UI (hel1-pve-report-Thu-04-April-2024-14-52.txt). Has anyone experienced this (or something similar)? Thanks.

bjorn-helgaas · Apr 26, 2024

I don't have any ideas about the panic, but I would like to debug the PCIe Correctable Errors that are logged by the 03:00.0 NVMe device.

If you are willing, please open a bug report at https://bugzilla.kernel.org/, product Drivers/PCI, mention the hardware platform, and attach:

complete dmesg log (I assume this will include some Correctable Errors)
output of "sudo lspci -vv"

Try booting with the "pcie_aspm=off" kernel parameter to see if it makes any difference. If it does, please also attach similar dmesg and lspci output for this boot.
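
Before collecting the second set of logs, it may be worth confirming that the parameter actually reached the running kernel. On a GRUB-based Debian install it is typically added to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by update-grub and a reboot. A minimal sketch of such a check follows; the helper names has_kernel_param and running_cmdline are hypothetical and not part of any existing tool.

```python
from pathlib import Path


def has_kernel_param(cmdline: str, param: str) -> bool:
    """Return True if `param` appears as a whole token in a kernel command line."""
    # Kernel parameters are whitespace-separated, so compare whole tokens
    # rather than doing a substring search.
    return param in cmdline.split()


def running_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line of the currently running kernel (Linux only)."""
    return Path(path).read_text().strip()
```

Calling `has_kernel_param(running_cmdline(), "pcie_aspm=off")` on the host should return True once the parameter is in effect.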

This seems similar to https://bugzilla.kernel.org/show_bug.cgi?id=215027, which we originally thought was related to Intel VMD and/or the Samsung NVMe device you have, but I now suspect we might have an ASPM configuration problem.
