The Headline:
I have managed to kick all 3 of my nodes from the cluster and wipe all configuration for both PVE and CEPH. This is bad. I have configuration backups, I just don't know how to use them.
The longer story:
Prior to this mishap, I had Proxmox installed on mirrored ZFS HDDs. I planned to move the install to a single, smaller SSD (to free drive bays for more CEPH OSDs). I wanted to keep the configuration to avoid having to set everything up again and to minimise downtime, so I planned to do each node one by one to keep the cluster in quorum. Starting with node 1, I installed a fresh PVE on the new SSD and attempted to copy over the relevant config files. This ALMOST worked, but I was getting host key errors and the node wouldn't re-join the cluster. I managed to get node 1 back in the cluster eventually, but CEPH was still down and something was still amiss with the fingerprints/keys from node 2 and 3's perspective. As a hail Mary, I copied the entire /etc folder from the old HDD install to the new SSD install on node 1.
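For reference, the copying went roughly like this (reconstructed from memory, so the exact file list and flags are approximate, and /mnt/oldroot is just a stand-in for wherever I had the old ZFS root mounted):

    # first pass: selectively copy what I believed were the important configs
    # (network, corosync and the pve-cluster database, among others)
    rsync -a /mnt/oldroot/etc/network/interfaces /etc/network/interfaces
    rsync -a /mnt/oldroot/etc/corosync/ /etc/corosync/
    rsync -a /mnt/oldroot/var/lib/pve-cluster/config.db /var/lib/pve-cluster/config.db

    # the hail Mary that immediately preceded the wipe: /etc wholesale
    rsync -a /mnt/oldroot/etc/ /etc/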
For reasons unbeknownst to me, this wiped node 1 - but worst of all, it propagated across the entire cluster and wiped nodes 2 & 3 too. Every node was kicked from the cluster; VM configs, PVE configs and CEPH configs were all reset to nothing. Going to the CEPH tab in the web UI prompted me to install CEPH for the first time. Completely. Wiped.
Suitably panicked at this point, I pulled the CEPH OSD drives from their bays to stop any further loss in its tracks. As far as I know, the OSDs (and all my data) are untouched - it's just the configs.
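For what it's worth, my understanding is that each OSD carries its own metadata on disk, so once someone tells me it's safe to reconnect a drive, my plan was to sanity-check it with something like the following (read-only as far as I know; the device path is just a placeholder):

    # list the ceph OSDs/LVs that ceph-volume can see on the attached disks,
    # including their OSD IDs and fsids, without starting anything
    ceph-volume lvm list

    # or dump the on-disk bluestore label of a specific OSD device/LV
    ceph-bluestore-tool show-label --dev /dev/ceph-<vg>/osd-block-<lv>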
What do I Need?
I need advice, dear reader. How do I proceed? What are my options for restoring my cluster? I am mainly concerned with recovering CEPH: it holds not only all my VMs, but also all my persistent container data on CephFS. I'm happy to reconfigure the PVE cluster from scratch... but I need CEPH back exactly how it was. That being said, it would be a bonus to retain the VM configs so I don't have to re-create each VM manually and figure out which VHD is which. I did that a lot with XCP-NG and am sick of it. I fucked around and found out, and now I want to find out before I get fucked.
What do I Have?
Luckily, I took a backup of (what I thought were) the relevant configuration files before embarking on this shitshow of an upgrade. I have the following folders (recursive) from EACH NODE, taken prior to poop --> fan:
- /etc
- /var/lib/ceph
- /var/lib/pve-cluster
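To spell out what I'm counting on being inside those backups (paths as I understand the standard PVE/Ceph layout; backup/node1 is just how I've laid the copies out on my workstation, one directory per node):

    # the pmxcfs database -- as far as I know this holds everything that lived
    # under /etc/pve: VM/CT configs, storage.cfg, corosync.conf, ceph.conf, priv keys
    ls -l backup/node1/var/lib/pve-cluster/config.db

    # corosync config and cluster auth key from the node's local copy
    ls -l backup/node1/etc/corosync/corosync.conf backup/node1/etc/corosync/authkey

    # ceph monitor store and service keyrings
    ls -ld backup/node1/var/lib/ceph/mon/ceph-*/
    ls -l backup/node1/var/lib/ceph/*/*/keyring

    # old SSH host keys (presumably what my fingerprint errors were about)
    ls -l backup/node1/etc/ssh/ssh_host_*

    # config.db is plain sqlite, so (if I've understood the schema right) the old
    # VM configs can at least be read back out of it
    sqlite3 backup/node1/var/lib/pve-cluster/config.db "SELECT name FROM tree ORDER BY name;"

Is that a sensible set of pieces to rebuild from, or am I missing something critical?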
What should I do? Any advice would be greatly appreciated, and I will happily buy coffees! Apologies for the coarse language... I am sleep deprived and stressed.
Many thanks,
Max