Lost cluster config...

After doing a dist-upgrade on my nodes today, one node dropped out. I tried several things suggested in various "lost node" posts, eventually attempting "pvecm add <node> --force"... That left the node I ran the command on showing the entire cluster as blank (no storage, no VMs), while still showing the cluster as broken.
If I look on the other side of the breakage, all the VMs and storage are still present.

How can I manually re-sync this cluster so all nodes have the same config? Nothing about the VMs has changed since the break, and the listing I have from the non-blank side is still correct. I just need to get all nodes to use that config again.


Edit: I've now copied config.db from the working side to the "blank" node so it can see the VMs again (and it shows the correct states when you access the web interface on either side of the break)... I'm now back to simply having nodes that have broken away from the cluster and need the cluster syncing repaired.
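For anyone hitting the same thing: the copy itself was roughly the following, assuming root SSH between nodes ("goodnode" is a placeholder for the healthy node) and the standard database path of a PVE install. Stop the cluster filesystem first so the database isn't in use, and keep a backup:

systemctl stop pve-cluster                                 # on the broken node
cp /var/lib/pve-cluster/config.db /root/config.db.bak      # keep the old DB just in case
scp root@goodnode:/var/lib/pve-cluster/config.db /var/lib/pve-cluster/
systemctl start pve-cluster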
 
* check the status of corosync.service and pve-cluster.service
* do both run smoothly?
* check the logs, especially for both services (corosync and pve-cluster/pmxcfs)
* are all nodes on the same software versions? (pveversion -v; see the example commands just below)
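Concretely, these checks map to something like the following (standard systemd/journalctl invocations, run on each node):

systemctl status corosync pve-cluster      # both should be active (running)
journalctl -b -u corosync -u pve-cluster   # logs for corosync and pmxcfs since boot
pveversion -v                              # compare the output across all nodes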

I hope this helps!
 
I just restarted pve-cluster on the "missing" node to grab the log output it was giving me before... It had been saying something about keys or authentication when I posted... But after restarting, it now shows up as having quorum, and if I do "pvecm status" it all looks good.
Time heals all wounds? :D
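For reference, the sequence that got quorum back was just (assuming the restart was done via systemctl):

systemctl restart pve-cluster
pvecm status      # should now report Quorate: Yes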

Oddly though, the web interface still shows the node with a question mark (including when viewed via the broken node now)... But it will let me stop/start nodes, see resource usage, etc. It just won't let me create a new VM on that node. I tried restarting pveproxy on both the missing node and the one I do my admin tasks from, but no change.
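An aside for future readers: the grey question mark in the GUI usually means pvestatd (the daemon that feeds node/guest status to the web interface) has stopped reporting, so restarting it on the affected node is worth trying before a full reboot; a minimal sketch:

systemctl restart pvestatd    # on the node showing the question mark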

Could it be that some SSH keys are mismatched between the corosync-based storage (/etc/pve) and the actual node's root login?
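If keys were the issue, the usual suggestion is pvecm updatecerts, which regenerates the node certificates and (as far as I know) refreshes the shared SSH files under /etc/pve/priv; a minimal sketch:

pvecm updatecerts    # run on the affected node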


Edit: Never mind - while I was typing all that, I rebooted the problem node and that seems to have made it show up again. All looks happy now. :)
 
