TL;DR: if one host is rebooted, the second host hard-resets. No logs in dmesg/messages.
I have a small two-node setup running on ThinkCentres, with the NVMe drives formatted as LVM. It serves a few home applications and is mainly for educational purposes, a playground so to speak. Both hosts have 16 GB RAM and 8 GB (default) swap on an LV [2].
There is no automatic HA relocation configured. Both hosts run some LXC and QEMU guests.
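For what it's worth, this is how I verify that HA really is idle on both nodes (a minimal check using the stock PVE tools; the unit names below are the HA services I believe ship with PVE, adjust if yours differ):

    ha-manager status                 # should list no resources, lrm/crm idle
    systemctl status watchdog-mux.service pve-ha-lrm.service pve-ha-crm.service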
From time to time, when I update the hosts for instance, I need to reboot just one node. I would then expect the manually rebooted node to come back up, rejoin the cluster and bring up its VMs again. What actually happens is that the second node hard-resets as soon as it can no longer reach the first one. Once both hosts have booted again (both in the state "waiting for quorum"), the VMs start up and the cluster becomes available. Memory usage on both hosts did not exceed 20% when this happened, and swap usage was 0 B out of 8 GB. Since no HA action is configured, I would not expect much I/O before the second node dies.
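To back up the numbers above: right before and after rebooting one node, I check the surviving node with standard tools (nothing fancy, just for reference):

    pvecm status              # expected/total votes, quorate: yes/no
    corosync-quorumtool -s    # the same quorum view straight from corosync
    free -h                   # memory and swap usage
    vmstat 1 5                # rough look at I/O and swapping activity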
My first question is: Is this normal behavior in a 2-node cluster? (I guess: no)
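Related to that question: as far as I understand corosync's votequorum, a plain 2-node cluster loses quorum as soon as either node disappears, unless options like two_node are set. I have not yet checked what my /etc/pve/corosync.conf actually contains; the relevant block would look something like this (illustrative only, not copied from my hosts):

    quorum {
      provider: corosync_votequorum
      # optional two-node settings, possibly not set here:
      # two_node: 1
      # wait_for_all: 1
    }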
The second question is how to gather useful information to track down why the second node's kernel just dies and reboots.
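My first idea for getting at least some post-mortem data is a kernel crash dump, roughly along these lines (a sketch based on Debian's kdump-tools package; I have not validated it on these hosts yet):

    apt install kdump-tools
    # reserve memory for the capture kernel, e.g. in /etc/default/grub:
    #   GRUB_CMDLINE_LINUX_DEFAULT="... crashkernel=256M"
    update-grub && reboot
    kdump-config show          # verify the crash kernel is loaded
    # after the next hard crash, dumps should land under /var/crash/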
Sadly, nothing is logged to /var/log/messages when this happens. It looks like the node dies so hard that it does not even get to flush anything to disk.
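Since nothing even makes it to disk, I will also try shipping the kernel log over the network with netconsole, and make the journal persistent if it is not already. The addresses below are placeholders for my LAN, not something I have tested yet:

    # stream kernel messages via UDP to another box running e.g. 'nc -u -l 6666'
    modprobe netconsole netconsole=@/,6666@192.168.1.10/aa:bb:cc:dd:ee:ff
    # keep the journal across reboots: Storage=persistent in /etc/systemd/journald.conf
    journalctl --list-boots
    journalctl -k -b -1        # kernel messages from the previous boot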
It has happened on at least kernels 5.13.19-6-pve, 5.15.39-2-pve and the current 5.15.83-1-pve.
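If the exact package versions matter, I can post the full output of pveversion:

    pveversion -v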
Maybe related to:
[1] https://forum.proxmox.com/threads/hosts-rebooting-at-the-same-time.33068/ - no cause identified
[2] https://forum.proxmox.com/threads/security-pve-can-be-crashed-by-user-on-demand.49343/ - ZFS host crashes on swap usage
I guess in a production environment this would be quite disastrous.