reboot of all cluster nodes when corosync is restarted on specific member

mensinck · Dec 7, 2021

Hey all,

I observed a strange reboot of all my cluster nodes as soon as corosync is restarted on one specific host, or that host is rebooted.

I have 7 hosts in one cluster.

Corosync has 2 links configured. ring0 is on a separate network on a separate switch. ring1 is shared as a VLAN over a 10G fiber interface.

This is a part of my corosync.conf:

Code:

.....
node {
    name: pve56
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 192.168.24.56
    ring1_addr: 192.168.25.56
  }
....
quorum {
  provider: corosync_votequorum
}
totem {
  cluster_name: HA-Cluster-A
  config_version: 13
  interface {
    knet_link_priority: 10
    linknumber: 0
  }
  interface {
    knet_link_priority: 20
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on

When I run "corosync-cfgtool -s" on pve56 (nodeid 7), I get:

Code:

Local node ID 7, transport knet
LINK ID 0
        addr    = 192.168.24.56
        status:
                nodeid:   2:    connected
                nodeid:   3:    connected
                nodeid:   5:    connected
                nodeid:   6:    localhost
                nodeid:   7:    connected
                nodeid:   8:    connected
                nodeid:   9:    connected
LINK ID 1
        addr    = 192.168.25.56
        status:
                nodeid:   2:    connected
                nodeid:   3:    connected
                nodeid:   5:    connected
                nodeid:   6:    localhost
                nodeid:   7:    disconnected
                nodeid:   8:    connected
                nodeid:   9:    connected

This looks the same on all other nodes.
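The discrepancy in the output above (local node ID 7, but "localhost" printed next to nodeid 6) can be checked mechanically. The following is only a minimal sketch: the function name `find_localhost_mismatch` is hypothetical, the sample text is a shortened copy of the output above, and the parsing assumes the output layout shown in this post. It detects the mismatch; it does not explain it.

```python
import re


def find_localhost_mismatch(cfgtool_output: str):
    """Parse `corosync-cfgtool -s` text.

    Returns (local_id, localhost_ids): the ID from the 'Local node ID'
    header and the set of nodeids that any link lists as 'localhost'.
    """
    local = re.search(r"Local node ID (\d+)", cfgtool_output)
    if local is None:
        raise ValueError("no 'Local node ID' line found")
    local_id = int(local.group(1))
    # Collect every nodeid that some link reports as 'localhost'.
    localhost_ids = {
        int(m.group(1))
        for m in re.finditer(r"nodeid:\s*(\d+):\s*localhost", cfgtool_output)
    }
    return local_id, localhost_ids


# Shortened sample taken from the output in this post.
sample = """\
Local node ID 7, transport knet
LINK ID 0
        addr    = 192.168.24.56
        status:
                nodeid:   6:    localhost
                nodeid:   7:    connected
"""

local_id, localhost_ids = find_localhost_mismatch(sample)
if localhost_ids != {local_id}:
    print(f"mismatch: local node ID {local_id}, "
          f"but localhost shown for {sorted(localhost_ids)}")
```

On a live node the text could be taken from the command's output instead of the embedded sample.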

Running corosync-cfgtool -n shows all hosts as reachable:

Code:

Local node ID 7, transport knet
nodeid: 2 reachable
   LINK: 0 (192.168.24.56->192.168.24.59) enabled connected mtu: 1397
   LINK: 1 (192.168.25.56->192.168.25.59) enabled connected mtu: 1397

nodeid: 3 reachable
   LINK: 0 (192.168.24.56->192.168.24.52) enabled connected mtu: 1397
   LINK: 1 (192.168.25.56->192.168.25.52) enabled connected mtu: 1397

nodeid: 5 reachable
   LINK: 0 (192.168.24.56->192.168.24.54) enabled connected mtu: 1397
   LINK: 1 (192.168.25.56->192.168.25.54) enabled connected mtu: 1397

nodeid: 6 reachable
   LINK: 0 (192.168.24.56->192.168.24.55) enabled connected mtu: 1397
   LINK: 1 (192.168.25.56->192.168.25.55) enabled connected mtu: 1397

nodeid: 8 reachable
   LINK: 0 (192.168.24.56->192.168.24.57) enabled connected mtu: 1397
   LINK: 1 (192.168.25.56->192.168.25.57) enabled connected mtu: 1397

nodeid: 9 reachable
   LINK: 0 (192.168.24.56->192.168.24.58) enabled connected mtu: 1397
   LINK: 1 (192.168.25.56->192.168.25.58) enabled connected mtu: 1397

As soon as corosync on pve56 (nodeid 7) is stopped, /etc/pve becomes write-protected on all other hosts (at least that is how it looks to me), and the watchdog reboots all hosts after 60 seconds.
I can stop and start corosync on all other nodes with no problem.
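For context, here is a rough sketch of the majority arithmetic that makes this behaviour surprising. It assumes that every node has quorum_votes: 1 (only the pve56 entry is shown above) and that no special votequorum options are set. The helper name `votequorum_threshold` is made up for illustration.

```python
def votequorum_threshold(expected_votes: int) -> int:
    """Simple majority: strictly more than half of the expected votes."""
    return expected_votes // 2 + 1


# Assumption: seven nodes with one vote each, as in the cluster described here.
expected_votes = 7
threshold = votequorum_threshold(expected_votes)

# One node (pve56) has corosync stopped, so its vote is missing.
remaining_votes = expected_votes - 1

print(f"votes needed for quorum: {threshold}")
print(f"votes left after stopping one node: {remaining_votes}")
print(f"still quorate: {remaining_votes >= threshold}")
```

Under these assumptions the six remaining nodes should stay quorate, so a read-only /etc/pve on all of them points to something other than a plain loss of one vote.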

What I do not understand is:

1. Why does corosync-cfgtool report pve56 as node 6 (localhost), and node 7 as disconnected on ring1?
2. Why could this have changed during one of the last upgrades? I should note that the cluster was running without any problems before. Unfortunately, I cannot determine which upgrade changed the cluster this way.

My hosts are all upgraded to these packages:

Code:

proxmox-ve: 6.4-1 (running kernel: 5.4.151-1-pve)
pve-manager: 6.4-13 (running version: 6.4-13/9f411e79)
pve-kernel-5.4: 6.4-10
pve-kernel-helper: 6.4-10
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.151-1-pve: 5.4.151-1
pve-kernel-5.4.143-1-pve: 5.4.143-1
pve-kernel-5.4.140-1-pve: 5.4.140-1
pve-kernel-5.4.128-1-pve: 5.4.128-2
pve-kernel-5.4.119-1-pve: 5.4.119-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph: 15.2.15-pve1~bpo10
ceph-fuse: 15.2.15-pve1~bpo10
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.22-pve1~bpo10+1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-4
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.13-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-1
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.3-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.6-pve1~bpo10+1

Any help identifying and fixing the problem is welcome.

Thanks in advance

Lukas

fabian · Dec 7, 2021

likely https://bugzilla.proxmox.com/show_bug.cgi?id=3672 - a fixed version for PVE 6.x is on its way through the repos (corosync 3.1.2-pve2~bpo10+1, kronosnet 1.22-pve2~bpo10+1, currently on pvetest)

mensinck · Dec 7, 2021

Hi Fabian,
Thanks for your reply.
I looked at the bug report, and it likely matches my problem.

I will test the packages when they "arrive" and report back.

Best regards
Lukas
