Debugging a random server cold restart of a PVE node

jsabater

Member
Oct 25, 2021
Palma, Mallorca, Spain
Good day everyone!

A few days ago one node of a 3-node PVE cluster, which runs only LXC containers, restarted suddenly. The hardware is fairly new and has been working without any issue for almost a year. Fortunately, the filesystem was checked and recovered, and Proxmox started normally and resumed service, although the whole process took a while (20+ minutes). The RAID array was clean afterwards.

I checked every log I could find in /var/log but found nothing. It looks as if someone had done a hard reset: two minutes after the last logged message, the usual boot-up kernel messages appear.
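To pin down the silent window between the last logged message and the first boot message, a small script can scan a log for jumps in the timestamps. This is only a sketch under assumptions: it expects classic syslog lines that begin with a timestamp such as "Sep 12 03:14:07" (no year, English month names), the function name find_gaps is made up for illustration, and the year has to be supplied by the caller.

```python
from datetime import datetime, timedelta
from pathlib import Path


def find_gaps(lines, year, min_gap=timedelta(seconds=60)):
    """Yield (previous_ts, next_ts) for consecutive log entries that are
    at least `min_gap` apart. Lines without a parsable timestamp are skipped."""
    previous = None
    for line in lines:
        # Assumption: the first 15 characters hold "Mon DD HH:MM:SS".
        try:
            ts = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue
        if previous is not None and ts - previous >= min_gap:
            yield previous, ts
        previous = ts


if __name__ == "__main__":
    # Assumption: /var/log/syslog exists and is readable (usually needs root).
    log = Path("/var/log/syslog")
    try:
        with log.open(errors="replace") as handle:
            for before, after in find_gaps(handle, year=datetime.now().year):
                print(f"gap: {before} -> {after} ({after - before})")
    except OSError as exc:
        print(f"cannot read {log}: {exc}")
```

A gap of about two minutes that is followed directly by kernel boot messages would match what I saw: no shutdown sequence, no panic, just silence.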

I would like to know which ways are recommended to debug such scenarios (faulty RAM module, kernel bug, PVE kernel bug, etc.) in a manner that doesn't conflict with the operation of the Proxmox kernel and cluster, should it happen again.

I've been advised to use either of these two tools:
  • Kdump, a standard Linux mechanism to dump machine memory content on kernel crash based on Kexec.
  • mcelog, a user space backend for logging machine check errors reported by the hardware to the kernel.
If I've read and understood correctly:
  • Kdump requires a kernel that has been compiled with a number of flags, namely CONFIG_DEBUG_INFO, CONFIG_CRASH_DUMP and CONFIG_PROC_VMCORE, and the boot loader then has to be modified to run the kernel through kexec.
  • mcelog is recommended to run in daemon mode and does not require altering the kernel.
Questions:
  1. Does the kernel package provided by Proxmox include the flags requested by Kdump?
  2. Would, could, should using mcelog in daemon mode affect the PVE kernel?
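For question 1, the flags could be checked directly against the config file of the running kernel. The following is a minimal sketch, assuming the Debian-style location /boot/config-<kernel release> applies to the PVE kernel package; the function name check_kernel_options is made up for illustration.

```python
import platform
from pathlib import Path

# Flags listed above as Kdump prerequisites.
KDUMP_OPTIONS = ("CONFIG_DEBUG_INFO", "CONFIG_CRASH_DUMP", "CONFIG_PROC_VMCORE")


def check_kernel_options(config_text, options=KDUMP_OPTIONS):
    """Return {option: value or None} for each requested option.

    Enabled options appear as `CONFIG_FOO=y` (or `=m`, or a value);
    disabled ones appear as the comment `# CONFIG_FOO is not set`.
    """
    found = {}
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        found[name] = value
    return {opt: found.get(opt) for opt in options}


if __name__ == "__main__":
    # Assumption: the config of the running kernel lives under /boot.
    path = Path("/boot") / f"config-{platform.release()}"
    if path.exists():
        for opt, value in check_kernel_options(path.read_text()).items():
            print(f"{opt}: {value if value else 'not set'}")
    else:
        print(f"{path} not found")
```

Run on the node itself, this would print one line per flag, so any flag reported as "not set" would be the one to ask about.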

Summary of hardware:
  • Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz Hexacore
  • 128 GB RAM
  • 2x 1 TB NVMe SSD, software RAID 1, ext4 filesystem
  • Linux proxmox2 5.15.39-2-pve #1 SMP PVE 5.15.39-2
  • pve-manager/7.2-7/d0dd0e85 (running kernel: 5.15.39-2-pve), pending update to 7.2-9 (probably next week)
Thanks in advance.






https://wiki.archlinux.org/title/Kdump
https://github.com/andikleen/mcelog