Restore nodes after loss of quorum

Hurtz1234

Oct 8, 2024
Hello, I have 6 nodes and had set up replicas: 6 and min_size = 3.

I had 3 nodes in the basement and 3 nodes on the top floor of my building. Because of flooding in the basement, I lost 3 nodes completely. I have now bought 3 new nodes. How can I bring these new nodes back into Ceph? I have no backup of the old nodes. I set the new ones up and thought I could integrate them into the cluster again to get my quorum back, but this is not working because Ceph is down. Does anybody have a tutorial on how to add nodes when the quorum is dead?
 
More info is needed.
Do you still have any Ceph MONs around or were they lost too?
Is the Proxmox VE cluster itself up and running with the replaced nodes?

Quorum for Proxmox VE and Ceph are two distinct things!
 
I have a 6-node cluster with a replica size of 6 and a minimum size of 3. The problem started when 3 nodes, including Node 0, were placed in the basement, which flooded and damaged all 3.

I replaced the damaged nodes with new hardware and expected that giving them the same names would allow Proxmox to automatically reintegrate them into the cluster. However, this didn’t work. The cluster’s quorum is broken, and my efforts to restore it haven't been successful.

While I managed to re-establish quorum and add a new node, I still can't restore the original quorum with the replaced nodes. I find this process with Proxmox and Ceph to be more complicated than expected. I understand the need for a halt when quorum is broken, but I assumed that replacing nodes with similar hardware and the same server name would allow for seamless reintegration.
 
You will have to remove the old nodes from the Proxmox VE cluster https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_remove_a_cluster_node

Setting the expected votes to a number you can actually achieve first can be useful: pvecm expected 3
After that you can re-add the replaced nodes to the Proxmox VE cluster and install the Ceph packages in the correct version (the one present on the remaining nodes).
This should get the Proxmox VE cluster back up.
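
As a rough sketch of that sequence (not an official procedure), the script below only prints the commands in order and executes nothing. The node names in LOST_NODES and the address in SURVIVOR_IP are placeholders that do not come from this thread, and `pveceph install` is assumed here as the way to install the Ceph packages; the version still has to match the remaining nodes.

```shell
# Dry-run sketch: prints the commands, executes nothing.
# LOST_NODES and SURVIVOR_IP are hypothetical placeholders.
LOST_NODES="pve1 pve2 pve3"
SURVIVOR_IP=192.0.2.14

plan_pve_recovery() {
    # On one surviving node: check state, lower expected votes, drop dead nodes.
    echo "[surviving node] pvecm status"
    echo "[surviving node] pvecm expected 3"
    for node in $LOST_NODES; do
        echo "[surviving node] pvecm delnode $node"
    done
    # On each freshly installed replacement node: join the cluster, add Ceph.
    echo "[replacement node] pvecm add $SURVIVOR_IP"
    echo "[replacement node] pveceph install"
}

plan_pve_recovery
```

The linked admin guide asks that a node is powered off before it is removed and that it never comes back with its old configuration, which is already the case for hardware that was destroyed.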

Ceph is a different story.
Data integrity should be fine with 6/3 (a bit overkill, but good in this situation ;) ).

The big question: do you still have at least one MON service running on the remaining nodes?
 
I removed the monitor and Ceph and I could bring it back, but it left a lot of issues behind. I now have a monitor that shows up as name.host.com (the full host address you type in during the installation). I can't delete it.
 
Thank you for the information. Will this removal also remove the OSDs in Ceph?
 
I removed the monitor and Ceph and I could bring it back
How many Ceph MON instances did you have originally? If all MONs are lost, then you practically lost the Ceph cluster. If you still have one Ceph MON instance present on the nodes that didn't drown, it would be a lot better.

Will this removal also remove the OSDs in Ceph?
No, that deals only with the Proxmox VE cluster. Ceph is its own clustered software stack that is separate, though managed by Proxmox VE.
 
I had 3 MONs left. I had a monitor on each node. I brought them back as well, but now I have issues bringing Ceph back.
To be honest, it would be nice to have a kill button in Proxmox that removes a node completely from Proxmox and Ceph, with all its OSDs and its monitor, and sets new quorum rules. This would be a unique feature that administrators would love: if a node has an issue, replace it with a new one and carry on.
 
I had 3 MONs left. I had a monitor on each node. I brought them back as well, but now I have issues bringing Ceph back.
Good. The issue is similar to Proxmox VE itself, not enough votes for a majority.
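
The vote arithmetic can be sketched as follows, assuming one MON per node as described above, so 6 MONs in the monmap of which 3 survived. The function name `majority` is only illustrative.

```shell
# Smallest number of votes that is a strict majority of n voters.
majority() {
    echo $(( $1 / 2 + 1 ))
}

total_mons=6      # one MON per node, as described in this thread
surviving_mons=3  # MONs on the nodes that were not flooded

needed=$(majority "$total_mons")
if [ "$surviving_mons" -ge "$needed" ]; then
    echo "quorum possible: $surviving_mons of $total_mons MONs alive, $needed needed"
else
    echo "no quorum: $surviving_mons of $total_mons MONs alive, $needed needed"
fi
```

With these numbers it prints "no quorum: 3 of 6 MONs alive, 4 needed". Once the lost MONs are gone from the monmap, the same 3 survivors are 3 of 3 and only 2 are needed.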

Ceph keeps the list of MONs in the /etc/ceph/ceph.conf file for services that are not MONs. So you could remove the sections for the lost MONs and their IP addresses in the `global.mon_host` line. This does not help for the MONs themselves though. They have their internal MONMAP where they keep track of which other MONs should be available.
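
To see which entries are affected, a small helper like the one below can list the `mon_host` line and the per-MON section headers. This is only a sketch: the helper name `list_mon_entries` is made up, and the sample file with its names and addresses is hypothetical, written in the layout Proxmox VE typically uses.

```shell
# Print the mon_host line and every [mon.*] section header of a ceph.conf.
list_mon_entries() {
    grep -E '^[[:space:]]*mon_host|^\[mon\.' "$1"
}

# Hypothetical sample file; on a real node pass /etc/ceph/ceph.conf instead.
sample=$(mktemp)
cat > "$sample" <<'EOF'
[global]
    fsid = 00000000-0000-0000-0000-000000000000
    mon_host = 192.0.2.11 192.0.2.12 192.0.2.14

[mon.pve1]
    public_addr = 192.0.2.11

[mon.pve4]
    public_addr = 192.0.2.14
EOF

list_mon_entries "$sample"
rm -f "$sample"
```

Every listed section and address that belongs to a lost MON is a candidate for removal. Keeping a copy of the file before editing it is a cheap safeguard.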

You need to manually remove the lost MONs from the monmap. To do so:
  1. stop all MONs on the remaining nodes: systemctl stop ceph-mon@$(hostname).service
  2. extract the MONMAP on one of the hosts: ceph-mon -i $(hostname) --extract-monmap /tmp/monmap
  3. use the monmaptool to print the current monmap: monmaptool --print /tmp/monmap
  4. remove the lost MONs: monmaptool --rm {mon name} /tmp/monmap
  5. copy the modified monmap file to the other nodes
  6. inject the monmap into all remaining MONs: ceph-mon -i $(hostname) --inject-monmap /tmp/monmap
  7. If you now start the remaining MONs, they know nothing about the lost MONs and should be able to form a quorum.
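
The steps above can be laid out as one plan. The sketch below is a dry run that only prints each command with the host it belongs to and executes nothing. The names in SURVIVING and LOST are hypothetical, and it assumes that each MON name equals its host name, as in the commands above; take the real names from the `monmaptool --print` output.

```shell
# Dry-run sketch: every command is only printed, nothing is executed.
# SURVIVING and LOST hold hypothetical MON/host names.
SURVIVING="pve4 pve5 pve6"
LOST="pve1 pve2 pve3"
MONMAP=/tmp/monmap

plan_monmap_repair() {
    first=${SURVIVING%% *}   # host on which the monmap is edited

    # Step 1: stop every surviving MON.
    for host in $SURVIVING; do
        echo "[$host] systemctl stop ceph-mon@$host.service"
    done

    # Steps 2-4: extract, inspect and edit the monmap on one host.
    echo "[$first] ceph-mon -i $first --extract-monmap $MONMAP"
    echo "[$first] monmaptool --print $MONMAP"
    for mon in $LOST; do
        echo "[$first] monmaptool --rm $mon $MONMAP"
    done

    # Step 5: copy the edited monmap to the other surviving hosts.
    for host in $SURVIVING; do
        [ "$host" = "$first" ] || echo "[$first] scp $MONMAP root@$host:$MONMAP"
    done

    # Steps 6-7: inject the monmap everywhere, then start the MONs again.
    for host in $SURVIVING; do
        echo "[$host] ceph-mon -i $host --inject-monmap $MONMAP"
        echo "[$host] systemctl start ceph-mon@$host.service"
    done
}

plan_monmap_repair
```

Printing the plan first makes it easy to check that no surviving MON ends up in the removal list before anything is changed.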
Ceph will throw warnings, as it lost OSDs. If the disks of the flooded machines that were used for OSDs are still good, you can run
Code:
ceph-volume lvm activate --all
on the replaced machines once they are back in the Proxmox VE cluster and the Ceph packages have been installed. It should detect the OSDs, set up the services and re-add them to the Ceph cluster.
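
A slightly more defensive variant, as a sketch: the function below (the name `activate_osds` is made up) first checks that `ceph-volume` is installed, lists the OSD volumes found on the disks, activates them and then shows the OSD tree. It assumes it is run as root on a replaced node that holds the old OSD disks.

```shell
# Sketch for a replaced node; run as root once the node is back in the
# cluster and the Ceph packages are installed.
activate_osds() {
    if ! command -v ceph-volume >/dev/null 2>&1; then
        echo "ceph-volume not found: install the Ceph packages first" >&2
        return 1
    fi
    ceph-volume lvm list            # show the OSD volumes found on the disks
    ceph-volume lvm activate --all  # set up and start the OSD services
    ceph osd tree                   # check that the OSDs show up again
}
```

Call `activate_osds` on each replaced node. If the list step shows no OSDs, the disks were probably not readable and the activate step has nothing to work with.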
 
Thank you, I got it back. But something is now wrong with my MON list, and I can't add other MONs from node1 and node 5.
[Attached screenshot: 1729697280047.png]
 
Do you only have the MONs that still exist in the monmap if you print it again? Are the MON services up and running?
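
Both questions can be checked from a shell on one of the remaining nodes. The function below is a sketch with a made-up name (`check_mons`), and it assumes the local MON is named after the host, as in the earlier commands.

```shell
# Sketch: inspect the local MON service and the cluster's view of the MONs.
check_mons() {
    if ! command -v ceph >/dev/null 2>&1; then
        echo "ceph CLI not found" >&2
        return 1
    fi
    systemctl is-active "ceph-mon@$(hostname).service"  # is the local MON running?
    ceph mon dump        # MONs listed in the current monmap
    ceph quorum_status   # which of them are in quorum
}
```

The two `ceph` calls need a working quorum to answer. If they hang, printing an extracted monmap with `monmaptool --print` still works while the MONs are stopped.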
 