Proxmox cluster with Ceph: need help adding new machines

DanielB.

New Member
Oct 22, 2021
Hi all,

this is the same post I made in the German forum. Maybe someone here can help me.

We are testing a Proxmox cluster (version 7.0) including Ceph (16.2) in a small environment, because we would like to use it in production.

Our current state is as follows:

A cluster of 3 nodes in an HA setup, including a Ceph installation. Everything has been running fine so far.

In the first test we simulated one defective machine in the 3-node cluster.
We then replaced the defective machine with a new node. We got the cluster running stably again, and Ceph showed no errors on the overview page.

In the next step we tried to simulate a failure of 2 nodes, in the sense that both machines have to be replaced.

We proceeded as in the first scenario, with the only difference that this time the Ceph installation on the new machine did not complete cleanly: it downloaded the packages but lost the connection before we could get to the configuration step. Ceph storage now reports "unknown" on all nodes, and a test container will not boot because it lives on the Ceph storage.
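For reference, a sketch of the retry we have in mind; that `pveceph install` simply resumes an interrupted package installation is our assumption from the docs, not something we have verified:

```shell
# Assumption on our part: re-running the Proxmox wrapper should
# resume the interrupted package download/installation on the node.
pveceph install

# Then check whether the Ceph packages actually got configured:
dpkg -l | grep -i ceph
```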

That means our simulation now has 2 defective machines, one of which has already been removed from the cluster, plus 1 machine (no. 5) with a half-finished Ceph installation.

In total there are 3 nodes in the cluster, and no container can be started anymore.

We then tried to uninstall Ceph on machine no. 5 using the pveceph purge command. Here is the output:
"Error gathering ceph info, already purged? Message: rados_conf_read_file failed - Invalid argument
Error gathering ceph info, already purged? Message: rados_conf_read_file failed - Invalid argument
Foreign MON address in ceph.conf. Keeping config & keyrings"

We also get the message "rados_conf_read_file failed - Invalid argument" when we open the Ceph panel in the Proxmox GUI.

Our questions now would be:

1. How do we get the cluster running again without reinstalling everything?
2. How do we repair (reinstall) or uninstall a Ceph installation that did not complete cleanly?
3. Is there a disaster-recovery guide for a Proxmox cluster installation including Ceph and HA groups that describes the correct procedure?

So far we have only found snippets of documentation, e.g. "Adding or Removing Nodes", but no best-practice guide.

Thanks for the help.
 
When you have lost two MONs out of three in your Ceph cluster, you can only recover by manually editing the MON map. This is a very bad situation and should be avoided. You need a majority of running MONs to have a working cluster (2 of 3, 3 of 5, etc.).
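A sketch of that MON-map surgery, following the upstream Ceph procedure for removing monitors from an unhealthy cluster; the monitor IDs (pve1 for the survivor, pve2/pve3 for the dead nodes) are placeholders, and you should have backups of the monitor store before attempting this:

```shell
# 1. Stop the last surviving monitor so its store is quiescent.
systemctl stop ceph-mon@pve1

# 2. Extract the current monitor map from its local store.
ceph-mon -i pve1 --extract-monmap /tmp/monmap

# 3. Remove the dead monitors from the map.
monmaptool /tmp/monmap --rm pve2
monmaptool /tmp/monmap --rm pve3

# 4. Inject the trimmed map and restart; the surviving monitor
#    can then form a quorum on its own.
ceph-mon -i pve1 --inject-monmap /tmp/monmap
systemctl start ceph-mon@pve1
```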