Corosync fails randomly on PVE 6.0-5

gradinaruvasile

Renowned Member
Oct 22, 2015
Today we had an issue that I suppose is related to corosync.
The whole cluster disintegrated and nodes started rebooting. I even powered off all nodes and started them again, which seemed to fix the issue for the moment.

There are 4 nodes, and sometimes they just stopped seeing each other. Sometimes they saw each other in pairs, sometimes only themselves.

I looked at "corosync-quorumtool -m -a" and the Ring ID sometimes jumped quickly; it was in the 400s and is now in the 12000s.
It also displayed "Activity blocked" for Quorate.
This cluster was upgraded from 5.4 to 6. Corosync was updated according to the wiki prior to the cluster upgrade.

Attached corosync logs.

For now I had to stop pve-ha-lrm and pve-ha-crm to prevent the random rebooting, effectively disabling HA in the process.
There were no modifications to the networking; it just started happening during operation.
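In case it helps someone else, this is roughly what I ran on each node to stop the HA services (standard PVE systemd unit names):

Code:
# on every node; stops HA recovery/fencing until started again
systemctl stop pve-ha-lrm
systemctl stop pve-ha-crm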

The corosync config file ends with this section (what does the version line mean? Corosync itself is at version 3):

Code:
totem {
  cluster_name: clustername
  config_version: 18
  interface {
    bindnetaddr: 172.22.1.50
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
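As far as I understand, "version: 2" refers to the version of the totem configuration format (the config file syntax), which is currently always 2, and not to the corosync package version. On corosync 3 with the kronosnet transport I believe the interface section would normally use linknumber, with the old bindnetaddr/ringnumber lines being ignored leftovers from corosync 2, so something roughly like this (values are just our placeholders):

Code:
totem {
  cluster_name: clustername
  config_version: 18
  interface {
    linknumber: 0
  }
  ip_version: ipv4
  secauth: on
  # config format version, not the corosync package version
  version: 2
}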

Is this a bug related to corosync 3?
 


It is happening again.
I see that the nodes only see each other in pairs of two, and the two pairs have different Ring IDs, 1 and 2.
I stopped corosync on all nodes, started it again, and it is working again.
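For reference, that is all I did, on each node (plain systemd units):

Code:
systemctl stop corosync     # first on all nodes
systemctl start corosync    # then start it again everywhere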
 
It seems this was somehow related to network errors (someone created a loop in another part of the network and we missed the notification).
This is weird, since:
- the access switches have BPDU guard activated, so the loop should not propagate into the rest of the network, and
- the virtualization switches, while carrying VLANs to the core and access switches, are separate, and I assume the cluster communication should stay local (although the management interfaces are in a VLAN that is also defined on the core switches).
Later we did a test and created a loop on purpose, and the cluster communication does seem to be affected by a loop made on other switches: the second number of the Ring ID increased every time the looped ports were reactivated (we had a policy to re-enable ports after 30 seconds) and shut down again. Nothing more dramatic happened, though there may have been some odd behaviour after the ports had been flapping on and off for half an hour or so.
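If someone wants to watch for the same thing while testing, these two standard commands were enough to follow the Ring ID and the membership changes:

Code:
corosync-quorumtool -s      # one-shot status, shows Ring ID and Quorate
journalctl -u corosync -f   # follow membership/link change messages live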

Anyway, we will now create a separate VLAN locally on the virtualization switches, route the cluster communication through it, and see how that goes.
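Roughly what we have in mind per node for the dedicated cluster VLAN (VLAN ID 55 and the 172.22.55.0/24 range are placeholders, not our real values); afterwards the ring0_addr entries in the nodelist of /etc/pve/corosync.conf would be pointed at the new addresses and config_version bumped:

Code:
# /etc/network/interfaces (excerpt, node 1)
auto vmbr0.55
iface vmbr0.55 inet static
    address 172.22.55.11/24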
 
