Need help - Failed upgrades on multiple nodes after adding a node with Ceph

gsch123

New Member
May 7, 2024
4-node cluster with a QDevice. This is a homelab.

Well, I really messed it up. I had an OS SSD fail on a node. No problem: I reinstalled the OS on a fresh SSD, brought the node back with a new name and IP, and added the OSDs back in. All was good, but I didn't delete the old node. Then I noticed the versions didn't match in Ceph, which made sense since it was a fresh install, so I did an apt upgrade. Not thinking, I did this on all the nodes at the same time. I left for a few hours, came back, and assumed it was done.
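
For reference, the upgrade on each node was nothing exotic, just a stock apt run along these lines (paraphrasing, I didn't keep the exact command):

apt update
apt upgrade
# (Proxmox docs recommend 'apt dist-upgrade' / 'apt full-upgrade' rather than plain 'apt upgrade')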

Ugh, two of the nodes won't come back online. I attached a monitor and they show a ton of Ceph errors, and I can't ping them from other machines. So now the cluster thinks I have 3 nodes down and is in a really bad state.

Any ideas on how to proceed?
 
Networking wasn't working on those nodes (ifup gave a permission error). I was able to get them back up by running dpkg --configure -a.
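
For anyone who hits the same thing, the console sequence was something like this (not an exact transcript; <another-node-ip> is a placeholder for one of the healthy nodes):

# finish whatever the interrupted upgrade left half-configured
dpkg --configure -a
# then reapply the network config (ifreload -a comes from ifupdown2, which PVE 8 uses by default)
ifreload -a
# quick sanity check that the node is reachable again
ip -br addr
ping -c 3 <another-node-ip>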

The dpkg run froze on pve-manager. I was able to Ctrl-C out of it and the rest completed.

Now my Ceph cluster is back and the VMs have started.

If I run pveversion -v everything looks OK, except:
pve-manager: not correctly installed (running version: 8.2.4/faa83925c9641325)

dpkg --configure -a pve-manager completes now but does not change the pveversion message.
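
Unless someone has a better idea, my next attempt is going to be re-running the configure step and then forcing a reinstall of the interrupted package, something like:

# re-run anything still half-configured
dpkg --configure -a
# force a reinstall of the package the Ctrl-C interrupted
apt install --reinstall pve-manager
# check whether the "not correctly installed" flag clears
pveversion -v | grep pve-manager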

Anyone have any other ideas?