Debugging a random server cold restart of a PVE node

jsabater · Sep 3, 2022

Good day everyone!

A few days ago, one node in a 3-node PVE cluster that runs only LXC containers restarted suddenly. The hardware is fairly new and has been working without any issues for almost a year. Fortunately, the filesystem was checked and recovered, and Proxmox started normally and resumed service, although the whole process took a while (more than 20 minutes). The status of the array was clean afterwards.

I checked every log I could find in /var/log but found nothing. It looks as if someone had done a hard reset: two minutes after the last logged message, the usual boot-up kernel messages appear.
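For anyone wanting to locate that kind of gap without scrolling through the log by hand, here is a minimal sketch. It assumes traditional syslog lines that begin with a timestamp such as "Sep  1 03:12:45" (no year, so the year is passed in) and that a boot can be recognised by the kernel's "Linux version" banner. The function names and the sample lines are made up for illustration.

```python
from datetime import datetime

# The kernel prints a "Linux version ..." banner early in every boot.
BOOT_MARKER = "Linux version"


def parse_timestamp(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' syslog timestamp, or return None."""
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def find_boot_gaps(lines, year):
    """Return (last line before a boot, time gap until the boot banner) pairs."""
    gaps = []
    previous = None  # (timestamp, text) of the most recent timestamped line
    for line in lines:
        stamp = parse_timestamp(line, year)
        if stamp is None:
            continue  # skip lines without a recognisable timestamp
        if BOOT_MARKER in line and previous is not None:
            gaps.append((previous[1].rstrip(), stamp - previous[0]))
        previous = (stamp, line)
    return gaps


# Illustrative, made-up log lines (not taken from the affected node).
sample = [
    "Sep  1 03:10:00 proxmox2 pvestatd[1234]: status update",
    "Sep  1 03:12:00 proxmox2 kernel: [    0.000000] Linux version 5.15.39-2-pve",
]
for last_line, gap in find_boot_gaps(sample, 2022):
    print(f"gap of {gap} after: {last_line}")
```

On a real node the lines would come from a file such as /var/log/syslog, opened with sufficient permissions. The last message before each gap is usually the most interesting one to look at.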

I would like to know which ways are recommended to debug such scenarios (faulty RAM module, kernel bug, PVE kernel bug, etc.) in a manner that doesn't conflict with the operation of the Proxmox kernel and cluster, should it happen again.

I've been advised to use either of these two tools:

Kdump, a standard Linux mechanism to dump machine memory content on kernel crash based on Kexec.
mcelog, a user space backend for logging machine check errors reported by the hardware to the kernel.

If I've read and understood correctly:

Kdump requires a kernel that has been compiled with a number of options, namely CONFIG_DEBUG_INFO, CONFIG_CRASH_DUMP and CONFIG_PROC_VMCORE, and then the boot loader has to be modified to run the kernel through kexec.
mcelog is recommended to run in daemon mode and does not require altering the kernel.
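Whether a given kernel was built with those three options can be checked against its build configuration. The sketch below assumes the distribution ships the config as /boot/config-<kernel release>, which is common on Debian-based systems but is not confirmed here for the PVE kernel package. The function names are hypothetical.

```python
import platform
from pathlib import Path

# The three options named above as Kdump prerequisites.
KDUMP_OPTIONS = ("CONFIG_DEBUG_INFO", "CONFIG_CRASH_DUMP", "CONFIG_PROC_VMCORE")


def parse_kernel_config(text):
    """Return {option: value} for every 'CONFIG_X=value' line in a kernel config."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        # Disabled options appear as '# CONFIG_X is not set' and are skipped.
        if line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def check_kdump_options(text, wanted=KDUMP_OPTIONS):
    """Map each wanted option to True if the config text sets it to 'y'."""
    options = parse_kernel_config(text)
    return {name: options.get(name) == "y" for name in wanted}


if __name__ == "__main__":
    # Assumed location of the running kernel's build configuration.
    config_path = Path("/boot") / f"config-{platform.release()}"
    if config_path.exists():
        for name, enabled in check_kdump_options(config_path.read_text()).items():
            print(f"{name}: {'enabled' if enabled else 'not set'}")
    else:
        print(f"{config_path} not found; the config location is an assumption")
```

This only reports what the kernel was compiled with. It says nothing about whether a crash kernel has actually been loaded or memory reserved for it.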

Questions:

Does the kernel package provided by Proxmox include the flags requested by Kdump?
Would, could, should using mcelog in daemon mode affect the PVE kernel?

Summary of hardware:

Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz Hexacore
128 GB RAM
2x NVMe SSD 1 TB, software RAID 1, ext4 filesystem
Linux proxmox2 5.15.39-2-pve #1 SMP PVE 5.15.39-2
pve-manager/7.2-7/d0dd0e85 (running kernel: 5.15.39-2-pve), pending update to 7.2-9 (probably next week)

Thanks in advance.

https://wiki.archlinux.org/title/Kdump
https://github.com/andikleen/mcelog
