Hi,
I probably need to give some background:
- I have been running a three-node cluster (including Ceph) for a while.
- Then I wanted to add another node for testing purposes that is not online most of the time. (I was going to move the VMs from one of the original nodes to it and then redeploy the original node with a new name, but haven't got around to that yet. So the fourth node was in the cluster but was turned off.)
- At some point I enabled HA - don't know whether it was after or before adding the fourth node. In any case, the fourth node was not involved in the HA, so no VMs were configured to be moved to it in case one of the original three nodes went down.
- Normal operations worked fine for weeks (I did not test HA).
Went on vacation and, even while still on the road, realized something was wrong: None of my services were reachable anymore.
Logged in from afar (thank god I set up a VPN a while ago) and found that one of the three original nodes was down. I was expecting HA to kick in, but it didn't because there was no quorum: with the fourth node offline anyway and one of the original three nodes down as well, the HA manager was expecting four votes and only found two. I realized the problem and removed the fourth node from the cluster (pvecm delnode) so that only three votes would be expected (of which two were still present). The fourth node is physically disconnected at the moment anyway, so it cannot come online in the cluster again.
That seemed to work: the HA manager started to move the VMs off the original node that had gone down and to start them on one of the remaining two online nodes.
BUT: Now none of the VMs are running (anymore). Neither the ones that were already running on the remaining original nodes still online nor the ones that were moved from the node that went down. And neither the VMs that were configured for HA nor the ones that were not. I also removed some of the HA VMs from the HA setup and tried restarting them, but to no avail. The GUI may show that a VM is running, but at the same time it shows an error message that the VM could not be started. The console won't connect, and there is only minimal VM RAM usage and VM CPU utilization (but actually some!). So my conclusion is that the VMs are really not running.
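To spell out the vote arithmetic as I understand it, here is a small sketch. It assumes one vote per node and a plain majority rule (more than half of the expected votes must be present). The function names are made up for illustration and are not part of any Proxmox or corosync tool.

```python
def quorum_threshold(expected_votes: int) -> int:
    """Smallest number of votes that is a strict majority of expected_votes."""
    return expected_votes // 2 + 1


def is_quorate(present_votes: int, expected_votes: int) -> bool:
    """True if the votes currently present reach the majority threshold."""
    return present_votes >= quorum_threshold(expected_votes)


# Assumption: every node carries exactly one vote.
# Before removing the fourth node: 4 expected, 2 present.
print(is_quorate(2, 4))  # False (a majority of 4 is 3)
# After removing the fourth node: 3 expected, 2 present.
print(is_quorate(2, 3))  # True (a majority of 3 is 2)
```

So with four expected votes, two nodes are not enough, while with three expected votes the same two nodes are a majority.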
Any ideas what might be wrong and how I can recover the system?
Many thanks in advance!