I may have seriously f...ed up my system

proxwolfe

Renowned Member
Jun 20, 2020
Hi,

I probably need to give some background:

- I have been running a three node cluster (including CEPH) for a while.
- Then I wanted to add another node for testing purposes that is not online most of the time. (I was going to move the VMs from one of the original nodes to it and then redeploy the original node with a new name, but haven't gotten around to that yet. So the fourth node was there but was turned off.)
- At some point I enabled HA - don't know whether it was after or before adding the fourth node. In any case, the fourth node was not involved in the HA, so no VMs were configured to be moved to it in case one of the original three nodes went down.
- Normal operations worked fine for weeks (I did not test HA).

Went on vacation and, even while still on the road, realized something was wrong: None of my services were reachable anymore.

Logged in from afar (thank god I set up VPN a while ago) and found that one of the three original nodes was down. I was expecting HA to kick in, but it didn't because there was no quorum: with the fourth node offline anyway and one of the original three nodes down as well, the cluster was expecting four votes and only found two. I realized the problem and removed the fourth node from the cluster (pvecm delnode) so that only three votes would be expected (of which two were still present). The fourth node was physically disconnected at that moment anyway, so it cannot come online in the cluster again.
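
The vote arithmetic in this situation can be sketched as follows. This is a minimal illustration, assuming one vote per node and a strict-majority rule (more than half of the expected votes must be present). The function name `is_quorate` is made up for this example and is not part of any Proxmox tool.

```python
def is_quorate(expected_votes: int, votes_present: int) -> bool:
    """Return True if the votes present form a strict majority of the expected votes."""
    # Assumption: one vote per node, strict majority required.
    needed = expected_votes // 2 + 1
    return votes_present >= needed


# Four nodes registered, two online: 3 votes are needed, only 2 are present.
before = is_quorate(expected_votes=4, votes_present=2)

# After removing the fourth node: 2 of 3 votes are needed, 2 are present.
after = is_quorate(expected_votes=3, votes_present=2)

print(before, after)  # prints: False True
```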

That seemed to work, and the HA manager started to move the VMs off the original node that had gone down and start them on one of the remaining two online nodes.

BUT: Now none of the VMs are running (anymore). Neither the ones that were already running on the remaining original nodes still online nor the ones that were moved from the node that went down. And neither the VMs that were configured for HA nor the ones that were not. I also removed some of the HA VMs from the HA setup and tried restarting them, but to no avail. The GUI may show that a VM is running, but it will show an error message at the same time that the VM could not be started. And the console won't connect, and there is only minimal VM RAM usage and VM CPU utilization (but actually some!). So my conclusion is that the VMs are really not running.

Any ideas what might be wrong and how I can recover the system?

Many thanks in advance!
 
I just found something that might (help) explain my problem:

As I said above, I am also using CEPH (the VMs are stored on CEPH). And the fourth experimental node was also part of the CEPH config. But I had configured CEPH with a minimum pool size of 2 (default 3). So it did work with 3 nodes online and I expected it to work as well with 2.

But I just realized that the GUI is empty with respect to CEPH (or shows a timeout). So losing one of the original nodes, leaving only two, may have broken CEPH, and so the VMs can't start (or run) anymore. Does that sound about right?
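
One possible explanation, sketched under assumptions: Ceph monitors need a strict majority among themselves, independently of a pool's minimum size. If each of the four nodes ran one monitor (an assumption; the post only says the fourth node was part of the CEPH config), two surviving monitors out of four cannot form that majority, which would be consistent with the timeouts in the GUI. The helper names below are hypothetical and the model is deliberately simplified.

```python
def mon_quorum(mons_registered: int, mons_up: int) -> bool:
    """Monitors form a quorum only with a strict majority of all registered monitors."""
    return mons_up > mons_registered // 2


def pool_io_possible(mons_registered: int, mons_up: int,
                     replicas_up: int, min_size: int) -> bool:
    """Simplified model: I/O needs a monitor quorum AND at least min_size replicas."""
    return mon_quorum(mons_registered, mons_up) and replicas_up >= min_size


# Assumed state during the outage: 4 monitors registered, 2 up, min_size 2.
during_outage = pool_io_possible(mons_registered=4, mons_up=2,
                                 replicas_up=2, min_size=2)

# Hypothetical state with only 3 monitors registered and 2 of them up.
three_mons = pool_io_possible(mons_registered=3, mons_up=2,
                              replicas_up=2, min_size=2)

print(during_outage, three_mons)  # prints: False True
```

In this model the pool's minimum size of 2 is satisfied in both cases; only the monitor majority differs.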

If so, is there anything I can do from afar? (The one thing I can't do is restart the one original node that went down.)

One idea (the only one actually) I have is: Since I am running regular backups to a PBS (that is not part of the cluster), I might be able to restore (some of the) VMs to a local disk on one of the 2 remaining nodes (that is not committed to CEPH).

But my preference would be to bring CEPH back up (with only the two remaining nodes). Is that feasible?
 
Okay, so my idea of restoring (some of the) VMs from PBS to a local disk also doesn't work:

While my PBS is still up and running, inside the PVE GUI of the VMs I cannot access my backups on PBS.

I can only assume that this also has to do with the CEPH problem. While I don't back up to CEPH, the GUI normally shows that as an option and when accessing backups I need to select my PBS instead of the default CEPH. If so, that would seem to be a bug: I should still be able to access my PBS backups despite CEPH being down, shouldn't I?

Is that something I can overcome somehow?
 
You can maybe give one of the remaining nodes two votes, so that quorum is restored. I'm not sure though if that is possible in a non-quorate cluster.

Or you set up a qdevice for the cluster and a ceph arbiter device. However, I'm also not sure if this works while being non-quorate.
 
The PVE cluster is operable. After removing the fourth node, only three votes are expected and 2 are available. So that is working. And I was able to remove that fourth node while the cluster was not operable by reducing the number of expected votes. So your suggestion is definitely correct.

I was not aware of the arbiter device for CEPH (or may have forgotten about it). That does sound like a feasible way (assuming this can be done with a CEPH cluster out of order - as with the PVE cluster). So thank you very much for this!!! I will investigate how to do it, try it and report back.
 
I am back from my vacation and have physical access to my servers now and was able to (re)start the node that had gone astray.

After troubleshooting some other issues, all three nodes are now online.

My remaining issue is that while I got rid of the fourth node in the PVE cluster, I can't get rid of it(s remnants) in the CEPH cluster:
  • When I try to destroy the monitor registered on it, I get "Method 'DELETE /' not implemented (501)"
  • When I try to destroy the OSD registered on it, I get "hostname lookup 'fourthnode' failed - failed to get address info for: fourthnode: Name or service not known (500)"
My guess is that this is because I removed the node from the cluster and now it cannot be found anymore when I want to do anything with/to it. It would probably have been better to first destroy the CEPH monitor and OSD on it and only then remove the node from the PVE cluster. Well, next time I know.

But what can I do now to get rid of the monitor and the OSD?
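
One approach that is often used for leftovers like these is to remove them with the `ceph` command line on a surviving node rather than through the GUI. The sketch below only prints candidate commands for review and does not run anything. The monitor and host name `fourthnode` is taken from the error message above, on the assumption that the monitor is named after the host. The OSD id `3` is a placeholder, since the thread does not state it, and the helper name `cleanup_commands` is hypothetical.

```python
from typing import List


def cleanup_commands(mon_id: str, osd_id: int, host: str) -> List[str]:
    """Build (but do not execute) ceph CLI commands to drop a stale monitor and OSD."""
    osd = f"osd.{osd_id}"
    return [
        f"ceph mon remove {mon_id}",      # drop the monitor from the monitor map
        f"ceph osd out {osd_id}",         # mark the OSD out
        f"ceph osd crush remove {osd}",   # remove the OSD from the CRUSH map
        f"ceph auth del {osd}",           # delete the OSD's auth key
        f"ceph osd rm {osd_id}",          # remove the OSD itself
        f"ceph osd crush remove {host}",  # remove the now-empty host bucket
    ]


# Placeholder values: replace the OSD id with the real one before using anything.
for cmd in cleanup_commands(mon_id="fourthnode", osd_id=3, host="fourthnode"):
    print(cmd)
```

Any leftover entry for the monitor in the Ceph configuration file would probably still have to be removed by hand, and the cluster health should be checked before and after each step.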
 
Are there any messages printed in journalctl when you issue these commands?