Removed ceph, restarted node and all nodes went down. Why?!

We had a node failure that took down the Ceph manager service. I know there should have been more than one running, but ceph -s said there were two on standby that never took over.

Ceph was completely pooched and we had to restore from backup; luckily we managed to recover some data from the Ceph storage pool.
I'm now going through the process of removing Ceph completely so it can be rebuilt, following the steps here: https://dannyda.com/2021/04/10/how-...ph-and-its-configuration-from-proxmox-ve-pve/

I stopped all Ceph services and unmounted all OSDs (I could not run ceph osd down && ceph osd destroy, as no ceph commands would work).
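
For reference, on each node that meant roughly the following (exact unit names and OSD mount paths may differ per setup):

systemctl stop ceph.target              # stops the Ceph daemons (mon, mgr, osd, mds) on the node
umount /var/lib/ceph/osd/ceph-*         # unmount every OSD data directory that is still mounted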

I removed /etc/pve/ceph.conf, the /etc/ceph folder, and the /var/lib/ceph folder on all nodes.
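
In other words, something like this (since /etc/pve is the shared cluster filesystem, ceph.conf only has to be deleted once):

rm -f /etc/pve/ceph.conf                # shared across the cluster via pmxcfs
rm -rf /etc/ceph /var/lib/ceph          # on every node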

Once I had confirmed again across all 10 nodes that no Ceph services were running, I restarted the 10th node in the list.
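
To double-check, on each node I looked for anything Ceph-related still loaded or running, roughly:

systemctl list-units --all 'ceph*'      # should show no active ceph services or targets
pgrep -a ceph                           # should return nothing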

As soon as that happened, EVERYTHING went down. After a few minutes the nodes started showing green again in the GUI, and I had to go through and restart all of the VMs.

None of the steps I took should have caused pve-cluster or corosync to freak out and drop all connections.

What. the. hell. happened?!
 
Hard to say in hindsight.

What versions are running? (pveversion -v)
How is the network in the cluster set up? Especially regarding Corosync and Ceph.
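
Output of something like the following from one of the nodes would help:

pveversion -v                           # package versions
cat /etc/pve/corosync.conf              # Corosync link / ring configuration
cat /etc/network/interfaces             # which NICs carry Corosync, Ceph and VM traffic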
 
I think what may have happened is that, after removing the 2 "failed" nodes, they were still HA targets. Right now Ceph in the cluster is totally broken and I've disabled it everywhere that I can. I won't be able to test again until the weekend, in case it takes everything down.
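
For what it's worth, this is roughly how I've been checking for (and removing) leftover HA entries; the resource ID below is just a placeholder:

ha-manager status                       # current HA state per resource
ha-manager config                       # configured HA resources
cat /etc/pve/ha/resources.cfg           # raw HA resource definitions
ha-manager remove vm:100                # example: drop a stale HA entry (vm:100 is a placeholder)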

Here is pveversion -v:

proxmox-ve: 7.1-1 (running kernel: 5.13.19-3-pve)
pve-manager: 7.1-10 (running version: 7.1-10/6ddebafe)
pve-kernel-helper: 7.1-8
pve-kernel-5.13: 7.1-6
pve-kernel-5.4: 6.4-10
pve-kernel-5.13.19-3-pve: 5.13.19-7
pve-kernel-5.13.19-2-pve: 5.13.19-4
pve-kernel-5.4.151-1-pve: 5.4.151-1
pve-kernel-5.4.143-1-pve: 5.4.143-1
pve-kernel-5.4.140-1-pve: 5.4.140-1
pve-kernel-5.4.128-1-pve: 5.4.128-2
pve-kernel-5.4.124-1-pve: 5.4.124-2
pve-kernel-5.4.119-1-pve: 5.4.119-1
pve-kernel-5.4.114-1-pve: 5.4.114-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph: 15.2.15-pve1
ceph-fuse: 15.2.15-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.1
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-6
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-2
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.0-15
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.3.0-1
openvswitch-switch: 2.15.0+ds1-2
proxmox-backup-client: 2.1.4-1
proxmox-backup-file-restore: 2.1.4-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-5
pve-cluster: 7.1-3
pve-container: 4.1-3
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-4
pve-ha-manager: 3.3-3
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.0-3
pve-xtermjs: 4.16.0-1
qemu-server: 7.1-4
smartmontools: 7.2-pve2
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.2-pve1