Debugging a random server cold restart of a PVE node

jsabater

Good day everyone!

A few days ago one node in a 3-node PVE cluster, which runs only LXC containers, restarted suddenly. The hardware is fairly new and has been working without any issue for almost a year. Fortunately, the filesystem was checked and recovered, and Proxmox started normally and resumed service, although the whole process took a while (more than 20 minutes). The status of the array was clean afterwards.

I checked every log I could find in /var/log but found nothing relevant. It looks as if someone had done a hard reset: two minutes after the last logged message, the usual boot-up kernel messages appear.

I would like to know which ways are recommended to debug such scenarios (faulty RAM module, kernel bug, PVE kernel bug, etc.) in a manner that doesn't conflict with the operation of the Proxmox kernel and cluster, should it happen again.

I've been advised to use either of these two tools:
  • Kdump, a standard kexec-based Linux mechanism to dump machine memory content on a kernel crash.
  • mcelog, a user space backend for logging machine check errors reported by the hardware to the kernel.
If I've read and understood correctly:
  • Kdump requires a kernel that has been compiled with a number of flags, namely CONFIG_DEBUG_INFO, CONFIG_CRASH_DUMP and CONFIG_PROC_VMCORE. You would then have to modify the boot loader to run the kernel through kexec.
  • mcelog is best run in daemon mode and does not require altering the kernel.
Questions:
  1. Does the kernel package provided by Proxmox include the flags requested by Kdump?
  2. Would, could, should using mcelog in daemon mode affect the PVE kernel?
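
For question 1, a local check might look like the following. This is only a minimal sketch: it assumes the running kernel's build configuration is installed as /boot/config-<kernel release>, which is the usual Debian convention but is an assumption for the PVE kernel package. The function names parse_kernel_config and check_flags are made up for illustration, and the list of options is just the three named above.

```python
import platform
from pathlib import Path

# Options named in the post as Kdump requirements.
REQUIRED = ("CONFIG_DEBUG_INFO", "CONFIG_CRASH_DUMP", "CONFIG_PROC_VMCORE")


def parse_kernel_config(text):
    """Return {option: value} for every 'CONFIG_X=value' line in a kernel config."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        # Disabled options appear as comments: '# CONFIG_X is not set'.
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep:
            options[name] = value
    return options


def check_flags(text, required=REQUIRED):
    """Map each required option to True if it is built in ('y') or a module ('m')."""
    options = parse_kernel_config(text)
    return {name: options.get(name) in ("y", "m") for name in required}


if __name__ == "__main__":
    # Assumed location: Debian-style kernels ship their config next to the image.
    config_path = Path("/boot") / f"config-{platform.release()}"
    if not config_path.exists():
        print(f"{config_path} not found; the config may live elsewhere on this system")
    else:
        for name, enabled in check_flags(config_path.read_text()).items():
            print(f"{name}: {'enabled' if enabled else 'not set'}")
```

Run on the node itself, this prints one line per option. It only shows how the kernel was built; it says nothing about whether a crash kernel is actually loaded or memory has been reserved for it.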

Summary of hardware:
  • Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz Hexacore
  • 128 GB RAM
  • 2x 1 TB NVMe SSD, software RAID 1, ext4 filesystem
  • Linux proxmox2 5.15.39-2-pve #1 SMP PVE 5.15.39-2
  • pve-manager/7.2-7/d0dd0e85 (running kernel: 5.15.39-2-pve), pending update to 7.2-9 (probably next week)
Thanks in advance.



https://wiki.archlinux.org/title/Kdump
https://github.com/andikleen/mcelog
 
