Corosync Problem - Cluster reboot

Stone

Renowned Member
Nov 18, 2016
Hi.

Today we saw some strange behavior in our cluster.
One node ran into trouble with its network (as we can see in the log files - nobody knows why) and the whole cluster crashed.

We can't understand why this happened. The log files show that the cluster still had a quorum of 13 nodes (there are 14 in total) when node 09 ran into its problem. We are fine with that part. But why do all nodes reboot when the quorum still consists of 13 nodes?
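
For reference, the quorum math as we understand it:

Code:
expected_votes = 14
quorum         = floor(14 / 2) + 1 = 8
remaining      = 13 votes >= 8     -> still quorate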

I have some config and log files and have posted them below.

If someone has an idea what caused the error, please let me know.


PVE version:

Code:
root@host14:~# pveversion
pve-manager/6.3-6/2184247e (running kernel: 5.4.106-1-pve)


PVE Cluster status:
Code:
root@host14:~# pvecm status
Cluster information
-------------------
Name:             Cluster-PVE
Config Version:   27
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Mon Oct 24 17:30:32 2022
Quorum provider:  corosync_votequorum
Nodes:            14
Node ID:          0x0000000e
Ring ID:          1.256
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   14
Highest expected: 14
Total votes:      14
Quorum:           8
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 172.28.30.20
0x00000002          1 172.28.30.21
0x00000003          1 172.28.30.22
0x00000004          1 172.28.30.23
0x00000005          1 172.28.30.24
0x00000006          1 172.28.30.25
0x00000007          1 172.28.30.26
0x00000008          1 172.28.30.27
0x00000009          1 172.28.30.28
0x0000000a          1 172.28.30.29
0x0000000b          1 172.28.30.30
0x0000000c          1 172.28.30.31
0x0000000d          1 172.28.30.32
0x0000000e          1 172.28.30.33 (local)


PVE nodes:
Code:
root@host14:~# pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         1          1 host01
         2          1 host02
         3          1 host03
         4          1 host04
         5          1 host05
         6          1 host06
         7          1 host07
         8          1 host08
         9          1 host09
        10          1 host10
        11          1 host11
        12          1 host12
        13          1 host13
        14          1 host14 (local)



Corosync config:
Code:
root@host14:~# cat /etc/corosync/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: host04
    nodeid: 4
    quorum_votes: 1
    ring0_addr: host04-ha
  }
  node {
    name: host10
    nodeid: 10
    quorum_votes: 1
    ring0_addr: host10-ha
  }
  node {
    name: host09
    nodeid: 9
    quorum_votes: 1
    ring0_addr: host09-ha
  }
  node {
    name: host08
    nodeid: 8
    quorum_votes: 1
    ring0_addr: host08-ha
  }
  node {
    name: host07
    nodeid: 7
    quorum_votes: 1
    ring0_addr: host07-ha
  }
  node {
    name: host11
    nodeid: 11
    quorum_votes: 1
    ring0_addr: host11-ha
  }
  node {
    name: host13
    nodeid: 13
    quorum_votes: 1
    ring0_addr: host13-ha
  }
  node {
    name: host03
    nodeid: 3
    quorum_votes: 1
    ring0_addr: host03-ha
  }
  node {
    name: host02
    nodeid: 2
    quorum_votes: 1
    ring0_addr: host02-ha
  }
  node {
    name: host12
    nodeid: 12
    quorum_votes: 1
    ring0_addr: host12-ha
  }
  node {
    name: host14
    nodeid: 14
    quorum_votes: 1
    ring0_addr: host14-ha
  }
  node {
    name: host06
    nodeid: 6
    quorum_votes: 1
    ring0_addr: host06-ha
  }
  node {
    name: host05
    nodeid: 5
    quorum_votes: 1
    ring0_addr: host05-ha
  }
  node {
    name: host01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: host01-ha
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Cluster-PVE
  config_version: 27
  interface {
    bindnetaddr: 172.28.30.20
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
 


Hi,

Thank you for the Syslogs!

Is HA enabled in the cluster?

The provided syslogs are a bit sparse; we would need to see what happened in the 2 to 10 minutes before the cluster crash. However, from the corosync.conf I would guess there was a lost connection on the Corosync network, since the cluster has only one ring. We recommend using a dedicated NIC for Corosync only and adding a redundant ring, as described in our docs guide [0].


[0] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_redundancy
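
For illustration, a redundant second link would roughly look like this (the 172.28.31.x address is only a placeholder for a separate, dedicated subnet; every node entry gets a ring1_addr and config_version must be increased):

Code:
node {
  name: host14
  nodeid: 14
  quorum_votes: 1
  ring0_addr: host14-ha
  ring1_addr: 172.28.31.33
}

The edit should be made via /etc/pve/corosync.conf so the changed config is propagated to all nodes.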
 
Hi guys,
There are two other threads reporting a very similar behavior.
How about taking into consideration the possibility that there is some bug in the way Corosync is acting?

https://forum.proxmox.com/threads/sudden-reboot-of-multiple-nodes-while-adding-a-new-node.116714/

https://forum.proxmox.com/threads/c...-reboots-as-soon-as-i-join-a-new-node.116804/

In both cases the problem started just recently, and both clusters had already added nodes in the past without issues.

It seems Corosync, instead of rebooting the problematic node, is rebooting other nodes...
 

I already gave you some detailed answers in your thread - your specific (broken!) network setup did trigger a bug in corosync/knet, upstream is working on a fix, and a workaround exists. this thread and the other one don't have the same symptoms - so if it is (another) bug in corosync, it is a different one. the cause in your case was very specific (your systems are configured to use a high MTU, with one joining node effectively blackholing all traffic larger than the standard MTU).
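
for the MTU variant, whether oversized frames actually pass end to end can be checked with a simple don't-fragment ping - the 9000 byte MTU and the peer address below are just examples:

Code:
# 9000 byte MTU minus 28 bytes of IP/ICMP headers = 8972 byte payload
ping -M do -s 8972 172.28.30.21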

a single-link corosync setup can quite easily cause a whole-cluster fence if there are network issues (the logs here show that single link going up and down repeatedly and the token that corosync passes around as part of its consensus algorithm timing out - if that situation persists for long enough, which seems to have been the case here, quorum will not be re-established in time to prevent a fencing event).

the logs are very incomplete unfortunately - it might be possible to say more with more log content.

there are tons of ways to break corosync that can cause the whole cluster to fence itself if HA is enabled. not each and every one of them is actually unexpected or a bug.
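
if it helps, a fuller picture could be collected on each node with something along these lines (the time window is only an example and needs to be adjusted to the actual crash):

Code:
# gather corosync / cluster / HA / watchdog logs around the crash window
journalctl -u corosync -u pve-cluster -u pve-ha-lrm -u pve-ha-crm -u watchdog-mux \
  --since "2022-10-24 16:30" --until "2022-10-24 18:00" > /root/$(hostname)-crash.log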
 

Yes, HA is enabled in the cluster.
The HA network runs over two dedicated NICs in an LACP bond that goes to two stacked switches.
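
For reference, the bond looks roughly like this in /etc/network/interfaces (interface names and exact options here are illustrative, not copied from our actual config):

Code:
# Corosync/HA network: two dedicated NICs, LACP (802.3ad) to the switch stack
auto bond0
iface bond0 inet static
        address 172.28.30.33/24
        bond-slaves eno1 eno2
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer2+3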

Hi,
are the other hosts also running Proxmox VE 6.3? Proxmox VE 6.x has been end-of-life for a few months now. It's highly recommended to upgrade to 7.x, see here for the official guide: https://pve.proxmox.com/wiki/Upgrade_from_6.x_to_7.0

Yes, I know version 6.3 is old and an update will be done this year, but this is not the reason for the crash.
 
