Cluster broken after update to PVE 8.2?

I checked the output of systemctl status corosync and, among other things, it says on the upgraded node:
Code:
A new membership (4.6c7) was formed. Members joined: 4

whereas on the other nodes it says:
Code:
A new membership (1.6c1) was formed. Members left: 4

Since you suggested that there might be an (undocumented) protocol change in corosync, I am now wondering whether the 4.6c7 vs 1.6c1 might point towards this or whether this is just a counter?
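For completeness, these are the standard status commands I can use to compare what each node currently thinks the membership and quorum state is:

Bash:
# Proxmox view of cluster membership and quorum
pvecm status

# corosync's own quorum/membership view
corosync-quorumtool -s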
 
Unfortunately, I have no idea either. Still very new to Proxmox and the clustering side of things.

If you can wait a day or so, then some of the more experienced people can probably help out better. :)
 

Fortunately, I migrated the important VMs away from the upgraded node before upgrading.

So they now reside on the quorate two-node part of my cluster. I think I will wait and hope that tomorrow the Proxmox guys know what to do.

But thank you for your thoughts and time!!!
 
Just thought of something that we should have looked at already. What do the last few lines of the corosync service output for the upgraded node (node1) show?

Bash:
# journalctl -u corosync -n 30
 
I'm guessing you tried rebooting ALL the nodes.
It's a three node cluster and only has two nodes (with the important VMs on it) communicating.

If either of those two nodes reboots, that'll be an automatic quorum loss. And if the watchdog hasn't been defanged on the remaining one, it'll probably reboot a minute later if the other node doesn't come back online fast enough.
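If you're not sure whether the watchdog is actually armed on those two remaining nodes, the HA status output should give a rough idea (LRM nodes listed as "active" have HA resources and therefore an armed watchdog):

Bash:
# show HA manager/quorum state and which LRMs are active vs. idle
ha-manager status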
 
Yes, I have not rebooted.
 
Okay, so it's working again.

I don't know what exactly did the trick and it might not be a solution to apply in similar situations, I'm afraid...

So what did I do?

Well, I have been waiting to switch from 10GbE Ethernet to InfiniBand for a while (I opened another thread about that). My issue was that I did not know how to replace the networking hardware in a running cluster one node at a time, and I did not have enough free PCIe slots to put the new InfiniBand card in next to the Ethernet card. So I had waited.

But with the one node separated from the cluster anyway, I thought I would use the opportunity to experiment with that node. I removed another PCIe card that I could do without for a while and put in the InfiniBand card. This triggered a renaming of all my 10GbE ports (which are used for Corosync and Ceph), so I had to update the network configuration (essentially copying the entries from the old interface names over to the new ones; see the sketch below). After restarting the networking, my cluster was whole again.
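Roughly, the fix looked like this (the interface names are made up for illustration; yours will differ):

Bash:
# list the interfaces to see the new names after the PCIe change
ip -br link

# move the old stanzas in /etc/network/interfaces over to the new names,
# e.g. "iface enp5s0f0 ..." becomes "iface enp6s0f0 ..." (example names only)
nano /etc/network/interfaces

# apply the new configuration (PVE ships ifupdown2, so no reboot needed)
ifreload -a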

Come to think of it, maybe it was a networking issue after all. I had tried to ping all of the hosts from one another, and that worked. I had also tried to SSH into the other nodes from each of the nodes, and that worked too. But what I had not tried was to ping over the dedicated Corosync and Ceph networks, only over the normal admin network. So maybe the Corosync and/or Ceph networks were affected by some weird renaming problem caused by the PVE upgrade (maybe the one I described above, which I attributed to adding the InfiniBand card)?
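Next time I will test the dedicated links directly, along these lines (the address is just a placeholder for whatever ring addresses are in your corosync.conf):

Bash:
# look up the ring addresses configured for each node
grep ring0_addr /etc/pve/corosync.conf

# ping another node over the dedicated corosync network
# (10.10.10.2 is a placeholder; use an address from the output above)
ping -c 3 10.10.10.2

# show the state of the corosync (knet) links
corosync-cfgtool -s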

When I muster the courage to upgrade the next node, I will check whether the Corosync and Ceph interfaces get renamed, and then I will update this thread.
 
If you look back through the systemd messages from corosync on the node that was isolated, it'll probably have info about what it was unhappy about.

Possibly even info confirming that it couldn't contact the other nodes over the dedicated corosync network. ;)
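Something like this should narrow it down (the time range is just an example):

Bash:
# corosync log entries from the last two days, filtered for link/membership events
journalctl -u corosync --since "2 days ago" | grep -Ei 'knet|link|totem|membership'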
 
