Hello all,
We are facing a strange issue. We have a two-node cluster plus QDevice on the latest PVE version, with HA enabled on the VMs. We are currently in the process of testing the cluster. To test fencing and HA recovery, we pulled the network from one of the two nodes (Node 1). Everything went to plan at first: according to the logs, the remaining node and the QDevice still had quorum. But shortly after the offline node was fenced, the second node unexpectedly rebooted as well. I'm puzzled because the logs don't indicate any reason for the reboot.
We are running all traffic over a single bond. We know this isn't best practice, but since the servers are blade systems we'd rather have the redundancy of the bond. Even when bulk-migrating all VMs from one node to the other, latency doesn't exceed 0.9 ms, so I can't really imagine the cluster losing quorum due to latency. The problem is reproducible. The logs are attached. Does anyone have an idea? Thanks in advance!