Good day everyone!
A few days ago I had a sudden restart of a node in a 3-node PVE cluster that runs only LXC containers. The hardware is fairly new and has been working without any issue for almost a year now. Fortunately, the filesystem was checked and recovered, and Proxmox started normally and resumed service, although the whole process took a while (20+ minutes). The status of the array was clean afterwards.
I checked every log I could find in /var/log but found nothing. It looks as if someone had done a hard reset: two minutes after the last logged message, the usual boot-up kernel messages appear.
I would like to know which ways are recommended to debug such scenarios (faulty RAM module, kernel bug, PVE kernel bug, etc.) in a manner that doesn't conflict with the operation of the Proxmox kernel and cluster, should it happen again.
I've been advised to use either of these two tools:
- Kdump, a standard Linux mechanism to dump machine memory content on kernel crash based on Kexec.
- mcelog, a user space backend for logging machine check errors reported by the hardware to the kernel.
- Kdump requires a kernel that has been compiled with a number of flags, namely CONFIG_DEBUG_INFO, CONFIG_CRASH_DUMP and CONFIG_PROC_VMCORE; you would then have to modify the boot loader to run the kernel through kexec.
- mcelog recommends running in daemon mode and does not require altering the kernel.
- Does the kernel package provided by Proxmox include the flags requested by Kdump?
- Would, could, should using mcelog in daemon mode affect the PVE kernel?
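For the first question, a minimal sketch of how I could check the running kernel myself. It assumes the kernel package installs its build configuration as /boot/config-<kernel release>, which I have not confirmed for every PVE kernel. The helper name check_kernel_options is my own, for illustration only.

```python
import platform
import re
from pathlib import Path

# Options named above as Kdump prerequisites.
KDUMP_OPTIONS = ("CONFIG_DEBUG_INFO", "CONFIG_CRASH_DUMP", "CONFIG_PROC_VMCORE")


def check_kernel_options(config_text, options=KDUMP_OPTIONS):
    """Map each option to its configured value ('y', 'm', ...) or None if not set."""
    status = {name: None for name in options}
    for line in config_text.splitlines():
        # Lines such as "# CONFIG_FOO is not set" are comments and do not match.
        match = re.match(r"^(CONFIG_\w+)=(.*)$", line.strip())
        if match and match.group(1) in status:
            status[match.group(1)] = match.group(2)
    return status


if __name__ == "__main__":
    # Assumption: the config of the running kernel lives at /boot/config-<release>.
    config_path = Path("/boot") / f"config-{platform.release()}"
    if config_path.exists():
        for name, value in check_kernel_options(config_path.read_text()).items():
            print(f"{name}: {value if value is not None else 'not set'}")
    else:
        print(f"{config_path} not found")
```

An option reported as "not set" would suggest the stock kernel cannot be used for Kdump as-is, but this only reads the build configuration and says nothing about whether kexec is actually set up.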
Summary of hardware:
- Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz Hexacore
- 128 GB RAM
- 2x 1 TB NVMe SSD, software RAID 1, ext4 filesystem
- Linux proxmox2 5.15.39-2-pve #1 SMP PVE 5.15.39-2
- pve-manager/7.2-7/d0dd0e85 (running kernel: 5.15.39-2-pve), pending update to 7.2-9 (probably next week)
Now I'd like to know, should this ever happen again, how to
https://wiki.archlinux.org/title/Kdump
https://github.com/andikleen/mcelog