Need help - Failed upgrades on multiple nodes after adding a node with Ceph

gsch123

New Member
May 7, 2024
4-node cluster with a QDevice. This is a homelab.

Well, I really messed up. I had an OS SSD fail on one node. No problem: I reinstalled the OS on a fresh SSD, brought the node back with a new name and IP, and added the OSDs back in. All was good, but I didn't delete the old node from the cluster. Then I noticed the Ceph versions didn't match, which made sense since it was a fresh install, so I ran a full apt upgrade. Not thinking, I did this on all the nodes at the same time. I left for a few hours, came back, and assumed it was done.
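For reference, I think the cleanup I skipped would have looked roughly like this (a sketch only, untested; "pve-old" is a placeholder for whatever the dead node was named):

# on a healthy, quorate node: remove the stale cluster entry for the dead node
pvecm delnode pve-old

# after upgrading, confirm every Ceph daemon reports the same version
ceph versions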

Ugh, two of the nodes won't come back online. I attached a monitor and they show a ton of Ceph errors. I can't ping them from other machines, so now the cluster thinks I have three nodes down and is in a really bad state.

Any ideas on how to proceed?
 
Update: networking wasn't working on the two dead nodes (ifup failed with a permission error). I was able to get them back up by running dpkg --configure -a.
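In case anyone else lands here with the same symptoms, this is roughly the sequence I ran from the console on each stuck node (assuming ifupdown2, which is the Proxmox VE default):

# finish any packages left half-configured by the interrupted upgrade
dpkg --configure -a

# reload the network configuration once dpkg is done
ifreload -a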

The configure run froze on pve-manager. I was able to Ctrl-C out of it and the rest completed.

Now my Ceph cluster is back and the VMs have started.

If I run pveversion -v, everything looks OK except:
pve-manager: not correctly installed (running version: 8.2.4/faa83925c9641325)

dpkg --configure pve-manager completes now but doesn't change the pveversion output.
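Posting what I'm planning to try next, in case someone can confirm it's the right direction (these are guesses, not a verified fix):

# show which packages dpkg still considers half-installed
dpkg --audit

# force a clean reinstall so the interrupted postinst script runs again
apt install --reinstall pve-manager

# check whether the version listing is clean afterwards
pveversion -v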

Anyone have any other ideas?
 
