Cluster broken after update to PVE 8.2?

I checked the output of systemctl status corosync and, among other things, it says on the upgraded node:
Code:
A new membership (4.6c7) was formed. Members joined: 4

whereas on the other nodes it says:
Code:
A new membership (1.6c1) was formed. Members left: 4

Since you suggested that there might be an (undocumented) protocol change in corosync, I am now wondering whether the 4.6c7 vs 1.6c1 might point towards this or whether this is just a counter?
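For completeness, these are the standard status commands I can use to compare what each node currently thinks the membership and quorum state is:

Bash:
# Proxmox view of cluster membership and quorum
pvecm status

# corosync's own quorum/membership view
corosync-quorumtool -s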
 
Unfortunately, I have no idea either. Still very new to Proxmox and the clustering side of things.

If you can wait a day or so, then some of the more experienced people can probably help out better. :)
 

Fortunately, I migrated the important VMs away from the upgraded node before upgrading.

So they now reside on the quorate two-node part of my cluster. I think I will wait and hope that tomorrow the Proxmox guys know what to do.

But thank you for your thoughts and time!!!
 
Just thought of something that we should have looked at already. What do the last few lines of the corosync service output for the upgraded node (node1) show?

Bash:
# journalctl -u corosync -n 30
 
I'm guessing you tried rebooting ALL the nodes.
It's a three node cluster and only has two nodes (with the important VMs on it) communicating.

If either of those two nodes reboots, that'll be an automatic quorum loss. And if the watchdog hasn't been defanged on the remaining one, it'll probably reboot a minute later if the other node doesn't come back online fast enough.
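If you're not sure whether the watchdog is actually armed on those two remaining nodes, the HA status output should give a rough idea (LRM nodes listed as "active" have HA resources and therefore an armed watchdog):

Bash:
# show HA manager/quorum state and which LRMs are active vs. idle
ha-manager status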
 
Yes, I have not rebooted.
 
Okay, so it's working again.

I don't know what exactly did the trick and it might not be a solution to apply in similar situations, I'm afraid...

So what did I do?

Well, I have been waiting to switch from 10GbE Ethernet to InfiniBand for a while (I opened another thread about that). My issue was that I did not know how to replace the networking hardware in a running cluster one node at a time, and I did not have enough free PCIe slots to put the new InfiniBand card in next to the Ethernet card. So I had waited.

But with the one node separated from the cluster anyway, I thought I would use the opportunity to experiment with that node. I removed another PCIe card that I could do without for a while and put in the InfiniBand card. This triggered a renaming of all my 10GbE ports (which are used for Corosync and Ceph), so I had to update the network configuration (essentially copying the entries from the old interface names over to the new ones; see the sketch below). After restarting the networking, my cluster was whole again.
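Roughly, the fix looked like this (the interface names are made up for illustration; yours will differ):

Bash:
# list the interfaces to see the new names after the PCIe change
ip -br link

# move the old stanzas in /etc/network/interfaces over to the new names,
# e.g. "iface enp5s0f0 ..." becomes "iface enp6s0f0 ..." (example names only)
nano /etc/network/interfaces

# apply the new configuration (PVE ships ifupdown2, so no reboot needed)
ifreload -a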

Come to think of it, maybe it was a networking issue after all. I had tried to ping all of the hosts from one another, and that worked. I had also tried to SSH into the other nodes from each of the nodes, and that worked too. But what I had not tried was to ping over the dedicated Corosync and Ceph networks, only over the normal admin network. So maybe the Corosync and/or Ceph networks were affected by some weird renaming problem caused by the PVE upgrade (maybe the one I described above, which I attributed to adding the InfiniBand card)?
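Next time I will test the dedicated links directly, along these lines (the address is just a placeholder for whatever ring addresses are in your corosync.conf):

Bash:
# look up the ring addresses configured for each node
grep ring0_addr /etc/pve/corosync.conf

# ping another node over the dedicated corosync network
# (10.10.10.2 is a placeholder; use an address from the output above)
ping -c 3 10.10.10.2

# show the state of the corosync (knet) links
corosync-cfgtool -s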

When I muster the courage to upgrade the next node, I will check whether the Corosync and Ceph interfaces get renamed, and then I will update this thread.
 
If you look back through the systemd messages from corosync on the node that was isolated, it'll probably have info about what it was unhappy about.

Possibly even info confirming that it couldn't contact the other nodes over the dedicated corosync network. ;)
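Something like this should narrow it down (the time range is just an example):

Bash:
# corosync log entries from the last two days, filtered for link/membership events
journalctl -u corosync --since "2 days ago" | grep -Ei 'knet|link|totem|membership'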
 
