[Solved] Recovering CEPH and PVE From Wiped Cluster

scruffyte

New Member
Aug 22, 2024
The Headline:

I have managed to kick all 3 of my nodes from the cluster and wipe all configuration for both PVE and CEPH. This is bad. I have configuration backups, I just don't know how to use them.

The longer story:

Prior to this mishap, I had Proxmox installed on mirrored ZFS HDDs. I planned to move the install to a single smaller SSD (to free drive bays for more CEPH OSDs). I wanted to keep the configuration to avoid having to set everything up again and to minimise downtime, so I planned to do each node one by one to keep the cluster in quorum. Starting with node 1, I installed a fresh PVE on the new SSD and attempted to copy over the relevant config files. This ALMOST worked, but I was getting host key errors and the node wouldn't re-join the cluster. I managed to get node 1 back into the cluster eventually, but CEPH was still down and something was still amiss with the fingerprints/keys from node 2/3's perspective. As a Hail Mary, I copied the entire /etc folder from the old HDD install to the new SSD install of node 1.

For reasons unbeknownst to me, this wiped node 1 - but worst of all, it propagated across the entire cluster and wiped nodes 2 & 3 too. Every node was kicked from the cluster; VM configs, PVE configs and CEPH configs were all reset to nothing. Going to the CEPH tab in the web UI prompted me to install CEPH for the first time. Completely. Wiped.

Suitably panicked at this point, I pulled the CEPH OSD drives from their bays to stop any further loss in its tracks. As far as I know, the OSDs (and all my data) are untouched - it's just the configs.

What do I Need?

I need advice, dear reader. How do I proceed? What are my options to restore my cluster? I am mainly concerned with recovering CEPH: it holds not only all my VMs, but all my persistent container data on CephFS. I'm happy to reconfigure the PVE cluster from scratch... but I need CEPH back exactly how it was. That being said, it would be a bonus to retain the VM configs so I don't have to re-create each VM manually and figure out which VHD is which. I did that a lot with XCP-NG and am sick of it. I fucked around and found out, and now I want to find out before I get fucked.

What do I Have?

Luckily, I took a backup of (what I thought were) the relevant configuration files before embarking on this shitshow of an upgrade. I have the following folders (recursive) from EACH NODE prior to poop --> fan (a sketch of the backup commands follows below):

  • /etc
  • /var/lib/ceph
  • /var/lib/pve-cluster
I ALSO have the HDD mirror from node 1 - these drives are in the untouched state from before anything changed (I unplugged them before I did the new SSD install). I can dip into that if there are any files I need - hopefully.
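For anyone wanting the same safety net, a minimal sketch of taking those backups on each node (the destination path is just an example; ship the archives off the node afterwards):

    # Run as root on each node; /root/pve-backup-... is an example path.
    BACKUP=/root/pve-backup-$(hostname)-$(date +%F)
    mkdir -p "$BACKUP"
    tar czf "$BACKUP/etc.tar.gz" /etc
    tar czf "$BACKUP/var-lib-ceph.tar.gz" /var/lib/ceph
    tar czf "$BACKUP/var-lib-pve-cluster.tar.gz" /var/lib/pve-cluster
    # Copy $BACKUP somewhere off-cluster (USB, NAS, another machine).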

What should I do? Any advice would be greatly appreciated, and I will happily buy coffees! Apologies for the coarse language... I am sleep deprived and stressed.

Many thanks,
Max
 
I managed to restore everything after a very stressful 24hrs!

For those reading in the future, don't bother backing up /etc/pve. Don't listen to what anyone says. It's useful to have, no doubt - so you can cherry-pick files you might need - but it's an ineffective disaster recovery strategy. It's simply the FUSE mount of the SQLite database: you can't even write to it properly while it's mounted, and you can't access it at all when it isn't. Instead, back up the db file /var/lib/pve-cluster/config.db and use that to restore the config.
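A minimal sketch of backing up that database, assuming the sqlite3 CLI is installed (apt install sqlite3); a plain cp also works if you stop pve-cluster first:

    # Take a consistent snapshot of the cluster config db while pmxcfs is running.
    sqlite3 /var/lib/pve-cluster/config.db ".backup /root/config.db.bak"

    # Alternative: stop the cluster filesystem and copy the file directly.
    systemctl stop pve-cluster
    cp /var/lib/pve-cluster/config.db /root/config.db.bak
    systemctl start pve-cluster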

TLDR: To completely restore all nodes, I made the following backups and copied them to a fresh install: /etc/hosts, /etc/hostname, /etc/resolv.conf, /etc/ceph, /etc/corosync, /etc/ssh (particularly the host keys), /etc/network, /var/lib/ceph, /var/lib/pve-cluster. I stopped PVE first to avoid conflicts, using "systemctl stop pve-cluster pvedaemon pveproxy pvestatd".
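Roughly, the restore on each node looked like the following; the /root/backup source path is illustrative, so substitute wherever your copies actually live:

    # Stop everything that holds the config open.
    systemctl stop pve-cluster pvedaemon pveproxy pvestatd

    # Copy the backed-up files over the fresh install (illustrative source path).
    cp -a /root/backup/etc/hosts /root/backup/etc/hostname /root/backup/etc/resolv.conf /etc/
    cp -a /root/backup/etc/ceph /root/backup/etc/corosync /root/backup/etc/ssh /root/backup/etc/network /etc/
    cp -a /root/backup/var/lib/ceph/. /var/lib/ceph/
    cp -a /root/backup/var/lib/pve-cluster/. /var/lib/pve-cluster/

    reboot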

After restoring these files and rebooting, the VM, CT, storage, etc. configs in PVE are restored. Ceph required some extra work in my case (commands collected after the list):
  1. Enable the no-subscription repository and use "pveceph install --repository no-subscription" to install Ceph (or use the web UI)
  2. Manually start and enable the manager and monitor on each node using systemctl start/enable ceph-mgr@<hostname>/ceph-mon@<hostname>
  3. Check your OSDs are detected by running "ceph-volume lvm list"
  4. Rejoin the OSDs to the cluster using "ceph-volume lvm activate --all"
  5. Profit
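Collected as a rough command sketch (replace <hostname> with each node's actual hostname, and run the systemctl lines on every node):

    # 1. Reinstall the Ceph packages from the no-subscription repo.
    pveceph install --repository no-subscription

    # 2. Start and enable the monitor and manager on each node.
    systemctl enable --now ceph-mon@<hostname>
    systemctl enable --now ceph-mgr@<hostname>

    # 3. Check that the OSD LVM volumes are detected.
    ceph-volume lvm list

    # 4. Bring all detected OSDs back up.
    ceph-volume lvm activate --all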
 
