Hello everyone,
Thank you for this beautiful project.
I am having a really peculiar issue with a Proxmox installation.
The setup is a two-server cluster plus a Raspberry Pi as QDevice for quorum (Proxmox version 8.03). The two servers are Dell R730xd, so it's enterprise hardware. We have replication and HA running, and everything worked fine for a couple of months. Then we got alerts from Zabbix that the only VM running on node 2 had gone offline and come back online. We checked the node 2 logs and, sure enough, out of the blue we saw --Reboot-- and learned that the node had apparently rebooted itself. The VM that was running on it migrated correctly to node 1 and then back to node 2 as soon as node 2 came back online.
This kept happening at random, sometimes at short intervals (about 2 days), sometimes with weeks between events.
The cluster is in production, but the issue has only manifested outside working hours and with minimal load on the node.
We thought that replication could somehow trigger the issue during backup hours (backups start outside working hours), so we tried turning it off during those hours. No change.
The cluster has its own dedicated NIC with a direct cable between the nodes for replication.
It doesn't seem to be a hardware issue, because the node passes all its self-tests during reboot and restarts Proxmox correctly. I reckon something like this could be related to a CPU or RAM issue, but then we would expect it to hang during POST.
We thought of a power supply issue, but again: the node has a backup PSU and is connected to the same UPS as node 1, which has never rebooted. Also, the NUT server didn't report any power loss.
We tried digging into the logs of node 2, but apart from the ---reboot--- message nothing stands out.
If we inspect the logs on node 1, we can confirm that quorum is preserved and that the decision is made to migrate the VM from node 2 to node 1 and back, but that's it.
Here is a pastebin: https://pastebin.com/Y3VizsPd
We were thinking of getting a subscription, given these are production servers anyway, and trying to update to 8.1, but then again it only happens on one of the nodes and the hardware is identical.
We tried researching a bit here and there, but with no luck.
Does anyone have suggestions?