PVE Kernel Panics on Reboots

linuxgemini

I recently upgraded my Proxmox VE instance to Hetzner's AX52 platform (Ryzen 7 7700 on an Asus Pro WS 665-ACE motherboard, BIOS v1711) with 128 GB of (non-ECC) RAM and 4x 1 TB NVMe storage.

I installed Proxmox VE 8 as Proxmox-on-Debian following the wiki page (https://pve.proxmox.com/wiki/Install_Proxmox_VE_on_Debian_12_Bookworm). Before I installed proxmox-default-kernel, reboots worked without problems.
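
For reference, the sequence I followed was roughly the one from that wiki page (from memory, so double-check against the wiki itself; the repository line and key path here are assumptions on my part):

Code:
# add the Proxmox VE no-subscription repository and its signing key
echo "deb [arch=amd64] http://download.proxmox.com/debian/pve bookworm pve-no-subscription" \
    > /etc/apt/sources.list.d/pve-install-repo.list
wget https://enterprise.proxmox.com/debian/proxmox-release-bookworm.gpg \
    -O /etc/apt/trusted.gpg.d/proxmox-release-bookworm.gpg

apt update && apt full-upgrade

# up to this point reboots were fine
apt install proxmox-default-kernel
systemctl reboot

# after booting into the PVE kernel
apt install proxmox-ve postfix open-iscsi chrony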

Hetzner's installimage program created a rather interesting storage layout, though: RAID1 for the ESP (md0, vfat), swap (md1) and /boot (md2, ext3), and RAID5 for the root filesystem (md3; I intentionally chose XFS).
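
Roughly like this (illustrative only; the per-disk partition numbering is an assumption, the array/filesystem assignment is as described above):

Code:
# 4x 1 TB NVMe, each partitioned identically; matching partitions assembled into md arrays
nvme[0-3]n1p1  ->  md0  (RAID1, vfat)  /boot/efi  (ESP)
nvme[0-3]n1p2  ->  md1  (RAID1)        swap
nvme[0-3]n1p3  ->  md2  (RAID1, ext3)  /boot
nvme[0-3]n1p4  ->  md3  (RAID5, xfs)   /          (rootfs)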

On the first reboot I issued with the PVE kernel running, the system panicked. Luckily I had a KVM console attached at the time, so I have a screenshot of it:

[attached screenshot: screenshot_1711763874006.png]

Unfortunately the panic happens right at the end of shutdown (so instead of halting, the kernel panics), which means the last log lines don't give much information:

Code:
Mar 30 20:06:09.368067 hel1.domainwithe.ld kernel: pcieport 0000:00:01.4: AER: Corrected error received: 0000:03:00.0
Mar 30 20:06:09.368223 hel1.domainwithe.ld kernel: nvme 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Mar 30 20:06:09.368282 hel1.domainwithe.ld kernel: nvme 0000:03:00.0:   device [144d:a80a] error status/mask=00000001/0000e000
Mar 30 20:06:09.368330 hel1.domainwithe.ld kernel: nvme 0000:03:00.0:    [ 0] RxErr                  (First)
Mar 30 20:06:14.352074 hel1.domainwithe.ld kernel: pcieport 0000:00:01.4: AER: Corrected error received: 0000:03:00.0
Mar 30 20:06:14.352268 hel1.domainwithe.ld kernel: nvme 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Mar 30 20:06:14.352324 hel1.domainwithe.ld kernel: nvme 0000:03:00.0:   device [144d:a80a] error status/mask=00000001/0000e000
Mar 30 20:06:14.352366 hel1.domainwithe.ld kernel: nvme 0000:03:00.0:    [ 0] RxErr                  (First)
Mar 30 20:06:19.324053 hel1.domainwithe.ld kernel: EXT4-fs (md2): unmounting filesystem a6181d4d-8479-490b-839e-6332dffb668d.
Mar 30 20:06:19.376066 hel1.domainwithe.ld kernel: md0: detected capacity change from 524160 to 0
Mar 30 20:06:19.376113 hel1.domainwithe.ld kernel: md: md0 stopped.
Mar 30 20:06:19.584064 hel1.domainwithe.ld kernel: md2: detected capacity change from 2093056 to 0
Mar 30 20:06:19.584090 hel1.domainwithe.ld kernel: md: md2 stopped.
Mar 30 20:06:22.228061 hel1.domainwithe.ld kernel: tap100i0: left allmulticast mode
Mar 30 20:06:22.228131 hel1.domainwithe.ld kernel: vmbr3: port 1(tap100i0) entered disabled state
Mar 30 20:06:23.140064 hel1.domainwithe.ld kernel: tap108i0: left allmulticast mode
Mar 30 20:06:23.140135 hel1.domainwithe.ld kernel: vmbr1: port 9(tap108i0) entered disabled state
Mar 30 20:06:23.224071 hel1.domainwithe.ld kernel: tap105i0: left allmulticast mode
Mar 30 20:06:23.224127 hel1.domainwithe.ld kernel: vmbr1: port 7(tap105i0) entered disabled state
Mar 30 20:06:23.280071 hel1.domainwithe.ld kernel: tap104i0: left allmulticast mode
Mar 30 20:06:23.280120 hel1.domainwithe.ld kernel: vmbr1: port 6(tap104i0) entered disabled state
Mar 30 20:06:23.572091 hel1.domainwithe.ld kernel: tap104i1: left allmulticast mode
Mar 30 20:06:23.572156 hel1.domainwithe.ld kernel: vmbr3: port 2(tap104i1) entered disabled state
Mar 30 20:06:23.900068 hel1.domainwithe.ld kernel: tap109i0: left allmulticast mode
Mar 30 20:06:23.900145 hel1.domainwithe.ld kernel: vmbr1: port 10(tap109i0) entered disabled state
Mar 30 20:06:28.016072 hel1.domainwithe.ld kernel: tap103i0: left allmulticast mode
Mar 30 20:06:28.016156 hel1.domainwithe.ld kernel: vmbr1: port 5(tap103i0) entered disabled state
Mar 30 20:06:28.180078 hel1.domainwithe.ld kernel: tap101i0: left allmulticast mode
Mar 30 20:06:28.180144 hel1.domainwithe.ld kernel: vmbr1: port 4(tap101i0) entered disabled state
Mar 30 20:06:28.388077 hel1.domainwithe.ld kernel: tap101i1: left allmulticast mode
Mar 30 20:06:28.388147 hel1.domainwithe.ld kernel: vmbr2: port 1(tap101i1) entered disabled state
Mar 30 20:06:38.712075 hel1.domainwithe.ld kernel: tap106i0: left allmulticast mode
Mar 30 20:06:38.712147 hel1.domainwithe.ld kernel: vmbr0: port 2(tap106i0) entered disabled state
Mar 30 20:06:38.900072 hel1.domainwithe.ld kernel: tap106i1: left allmulticast mode
Mar 30 20:06:38.900134 hel1.domainwithe.ld kernel: vmbr1: port 8(tap106i1) entered disabled state
Mar 30 20:06:43.640069 hel1.domainwithe.ld kernel: tap111i0: left allmulticast mode
Mar 30 20:06:43.640144 hel1.domainwithe.ld kernel: vmbr1: port 3(tap111i0) entered disabled state
Mar 30 20:06:47.104071 hel1.domainwithe.ld kernel: pcieport 0000:00:01.4: AER: Corrected error received: 0000:03:00.0
Mar 30 20:06:47.104247 hel1.domainwithe.ld kernel: nvme 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Mar 30 20:06:47.104304 hel1.domainwithe.ld kernel: nvme 0000:03:00.0:   device [144d:a80a] error status/mask=00000001/0000e000
Mar 30 20:06:47.104344 hel1.domainwithe.ld kernel: nvme 0000:03:00.0:    [ 0] RxErr                  (First)
Mar 30 20:06:48.152263 hel1.domainwithe.ld kernel: tap115i0: left allmulticast mode
Mar 30 20:06:48.152355 hel1.domainwithe.ld kernel: vmbr1: port 2(tap115i0) entered disabled state
Mar 30 20:06:52.892250 hel1.domainwithe.ld kernel: tap102i0: left allmulticast mode
Mar 30 20:06:52.892314 hel1.domainwithe.ld kernel: vmbr1: port 1(tap102i0) entered disabled state
Mar 30 20:07:01.220269 hel1.domainwithe.ld kernel: watchdog: watchdog0: watchdog did not stop!
Mar 30 20:07:01.220373 hel1.domainwithe.ld systemd-shutdown[1]: Using hardware watchdog 'Software Watchdog', version 0, device /dev/watchdog0
Mar 30 20:07:01.220421 hel1.domainwithe.ld systemd-shutdown[1]: Watchdog running with a timeout of 10min.
Mar 30 20:07:01.224240 hel1.domainwithe.ld systemd-shutdown[1]: Syncing filesystems and block devices.
Mar 30 20:07:01.257454 hel1.domainwithe.ld systemd-shutdown[1]: Sending SIGTERM to remaining processes...
Mar 30 20:07:01.257554 hel1.domainwithe.ld systemd-journald[448]: Received SIGTERM from PID 1 (systemd-shutdow).
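
These lines are from the previous boot's journal; something along these lines should reproduce the same view (exact options are an assumption on my part):

Code:
# kernel and shutdown messages from the previous boot, with microsecond timestamps
journalctl -b -1 -o short-precise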

Hetzner told me that these PCIe bus errors on NVMe devices are common on DDR5 platforms, though I'm not quite sure about that.

My suspicion is that the RAID5 array (md3) isn't stopped in time during shutdown, and the kernel then trips over that instead of halting cleanly(?)
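
One way to sanity-check that before the next reboot is to look at the array state (standard mdadm/sysfs interfaces; md3 is the device name from my layout above):

Code:
cat /proc/mdstat                    # shows resync/check activity across all arrays
mdadm --detail /dev/md3             # state, clean/degraded flags, member disks
cat /sys/block/md3/md/array_state   # "clean" or "active-idle" when the array is quiesced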

I am attaching a (hand-redacted) system report generated from the UI (hel1-pve-report-Thu-04-April-2024-14-52.txt). Has anyone experienced this (or something similar)? Thanks.
 

Attachments

  • hel1-pve-report-Thu-04-April-2024-14-52.txt
    93.2 KB
I don't have any ideas about the panic, but I would like to debug the PCIe Correctable Errors that are logged by the 03:00.0 NVMe device.

If you are willing, please open a bug report at https://bugzilla.kernel.org/, product Drivers/PCI, mention the hardware platform, and attach:
  • complete dmesg log (I assume this will include some Correctable Errors)
  • output of "sudo lspci -vv"
Try booting with the "pcie_aspm=off" kernel parameter to see if it makes any difference. If it does, please also attach similar dmesg and lspci output for this boot.
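
(For a GRUB-booted Debian/Proxmox install like the one described above, the parameter would typically be added via /etc/default/grub; this is a generic sketch, not something specific to this machine:)

Code:
# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=off"

# regenerate the GRUB config and reboot
update-grub
systemctl reboot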

This seems similar to https://bugzilla.kernel.org/show_bug.cgi?id=215027, which we originally thought was related to Intel VMD and/or the Samsung NVMe device you have, but I now suspect we might have an ASPM configuration problem.
 

Hi Bjorn, sadly I have stopped using the server discussed in this topic (I migrated to an older-generation machine).
 
