We had a node failure that took down the Ceph manager service. I know there should have been more than one running, but ceph -s said there were 2 on standby that never took over.
Ceph was completely pooched, so we had to restore from backup, and luckily we managed to recover some stuff from the Ceph storage pool.
I am now going through the process of removing Ceph completely so it can be rebuilt, following the steps here: https://dannyda.com/2021/04/10/how-...ph-and-its-configuration-from-proxmox-ve-pve/
I stopped all Ceph services and unmounted all OSDs (I could not run ceph osd down && ceph osd destroy, as no ceph commands would work).
I removed /etc/pve/ceph.conf, the /etc/ceph folder and the /var/lib/ceph folder on all nodes.
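For reference, the per-node cleanup I describe above corresponds roughly to the sketch below. It is not the exact command history. The helper names run and remove_ceph_local are made up for illustration, ceph.target is assumed to be the systemd target covering all Ceph daemons on the node, and DRY_RUN defaults to 1, so nothing is changed unless it is explicitly set to 0.

```shell
#!/usr/bin/env bash
# Sketch of the per-node Ceph cleanup described in this post.
# DRY_RUN=1 (the default) only prints each command; DRY_RUN=0 runs it.
# WARNING: with DRY_RUN=0 this is destructive.

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Hypothetical helper: stop Ceph, unmount OSD directories, remove local files.
remove_ceph_local() {
    # Stop every Ceph daemon on this node (assumes ceph.target exists).
    run systemctl stop ceph.target

    # Unmount any OSD directories still present under /var/lib/ceph/osd.
    for osd in /var/lib/ceph/osd/ceph-*; do
        [ -d "$osd" ] || continue   # skip if the glob matched nothing
        run umount "$osd"
    done

    # Remove the node-local Ceph configuration and state.
    run rm -rf /etc/ceph /var/lib/ceph

    # /etc/pve is shared across the whole cluster, so this file only
    # needs to be removed once, not once per node.
    run rm -f /etc/pve/ceph.conf
}
```

Calling remove_ceph_local with the default DRY_RUN=1 prints the commands that would be executed on that node, which makes it possible to review them before running anything for real.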
Once I had confirmed again across all 10 nodes that no Ceph services were running, I restarted the 10th in the list.
As soon as that happened EVERYTHING went down. After a few minutes the nodes started showing green again in the GUI and I had to go through and restart all of the VMs.
None of the steps that I took should have caused pve-cluster or corosync to freak out and drop all connections.
What. the. hell. happened?!