Corosync fails randomly on PVE 6.0-5

gradinaruvasile · Aug 14, 2019

Today we had an issue that I suppose is related to corosync.
The whole cluster disintegrated and nodes started rebooting. I even powered off all nodes and started them again, which seemed to fix the issue for the moment.

There are 4 nodes, and sometimes they just stopped seeing each other. Sometimes they saw each other in pairs, or only themselves.

I looked at "corosync-quorumtool -m -a" and the Ring ID sometimes jumped fast: it was in the 400s and is now in the 12000s.
It also displayed "Activity blocked" at Quorate.
This cluster was upgraded from 5.4 to 6. Corosync was updated according to the wiki prior to the cluster upgrade.
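
To put a number on how fast the Ring ID jumps, the status output can be sampled and compared. The following is only a rough sketch: the helper names are made up for illustration, and the "Ring ID:" line format in the sample strings (a node part and a sequence part separated by "/" or ".") is an assumption that may differ between corosync versions.

```python
import re

# Assumed line shape: "Ring ID:   <node><sep><sequence>", with "/" or "." as separator.
RING_ID_RE = re.compile(r"^Ring ID:\s*(\S+?)[./](\S+)\s*$", re.MULTILINE)


def parse_ring_id(status_text):
    """Return (node_part, sequence_part) as strings, or None if no Ring ID line is found."""
    match = RING_ID_RE.search(status_text)
    if match is None:
        return None
    return match.group(1), match.group(2)


def count_ring_changes(samples):
    """Count how often the Ring ID differs between consecutive status samples."""
    ids = [parse_ring_id(text) for text in samples]
    ids = [ring_id for ring_id in ids if ring_id is not None]
    return sum(1 for prev, cur in zip(ids, ids[1:]) if prev != cur)


# Hypothetical captured outputs, not taken from the attached logs.
SAMPLE_A = "Quorum information\nRing ID:          1/412\nQuorate:          Yes\n"
SAMPLE_B = "Quorum information\nRing ID:          1/416\nQuorate:          No\n"

if __name__ == "__main__":
    print(parse_ring_id(SAMPLE_A))
    print(count_ring_changes([SAMPLE_A, SAMPLE_A, SAMPLE_B]))
```

Feeding it outputs captured every few seconds gives a count of membership changes per interval, which is easier to compare between nodes than watching the number scroll by.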

Attached corosync logs.

For now I had to stop pve-ha-lrm and pve-ha-crm to prevent random rebooting, effectively disabling HA in the process.
There were no modifications to the networking; it just started happening during operation.

The corosync conf file ends with this section (what does the version line mean? Corosync is at version 3):

Code:

totem {
  cluster_name: clustername
  config_version: 18
  interface {
    bindnetaddr: 172.22.1.50
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

Is this a bug related to corosync 3?

gradinaruvasile · Aug 14, 2019

It is happening again.
I see that the nodes see each other only in pairs of 2. They had different Ring IDs, 1 and 2.
I stopped corosync on all nodes, started it up again, and it is working again.

gradinaruvasile · Aug 16, 2019

It seems that this was somehow related to some network errors (someone made a loop in another part of the network and we missed the notification).
This is weird since
- the access switches have BPDU guard activated, so the loop is not propagating into the network, and
- the virtualization switches, while transporting VLANs to the core switch and access switches, are separate, and I assume the cluster communications should be local (although the management interfaces are in a VLAN that is raised on the core switches).
Later we did a test and created a loop, and it seems the cluster communication is affected by a loop created on other switches: the Ring ID's second number increased every time the looped ports were reactivated (we had a policy to reactivate ports after 30 seconds) and shut down again. But nothing more dramatic happened; maybe the weird things happened only after the ports had been going on and off for half an hour or so.

Anyway, we will now create a separate VLAN locally on the virtualization switches, route the cluster comms through that one, and see how that goes.
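
Once the cluster traffic is moved, one easy thing to verify is that every node's corosync address really sits in the new dedicated subnet. A minimal sketch of such a check is below; the subnet 10.10.10.0/24, the node names, and the helper name are all hypothetical, and only 172.22.1.50 comes from the config posted earlier.

```python
import ipaddress


def addrs_outside_subnet(node_addrs, cluster_cidr):
    """Return the nodes whose address is NOT inside the dedicated cluster subnet."""
    network = ipaddress.ip_network(cluster_cidr)
    return {
        name: addr
        for name, addr in node_addrs.items()
        if ipaddress.ip_address(addr) not in network
    }


# Hypothetical values: a dedicated cluster VLAN subnet and per-node corosync addresses.
CLUSTER_CIDR = "10.10.10.0/24"
NODE_ADDRS = {
    "node1": "10.10.10.1",
    "node2": "10.10.10.2",
    "node3": "10.10.10.3",
    "node4": "172.22.1.50",  # still on the old management network
}

if __name__ == "__main__":
    for name, addr in addrs_outside_subnet(NODE_ADDRS, CLUSTER_CIDR).items():
        print(f"{name} ({addr}) is not in {CLUSTER_CIDR}")
```

An empty result only means the addresses are consistent; it does not prove the VLAN is actually isolated from loops elsewhere in the network.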
