reboot of all cluster nodes when corosync is restarted on specific member

mensinck

Renowned Member
Oct 19, 2015
Kiel, Germany
Hey all,

I observed a strange reboot of all my cluster nodes as soon as corosync is restarted on one specific host, or that host is rebooted.

I have 7 hosts in one cluster

Corosync has 2 links configured: ring0 is on a separate network on a separate switch, and ring1 is shared as a VLAN over a 10G fiber interface.

This is a part of my corosync.conf:

Code:
.....
node {
    name: pve56
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 192.168.24.56
    ring1_addr: 192.168.25.56
  }
....
quorum {
  provider: corosync_votequorum
}
totem {
  cluster_name: HA-Cluster-A
  config_version: 13
  interface {
    knet_link_priority: 10
    linknumber: 0
  }
  interface {
    knet_link_priority: 20
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
}

When I run "corosync-cfgtool -s" on pve56 (nodeid 7), I get:
Code:
Local node ID 7, transport knet
LINK ID 0
        addr    = 192.168.24.56
        status:
                nodeid:   2:    connected
                nodeid:   3:    connected
                nodeid:   5:    connected
                nodeid:   6:    localhost
                nodeid:   7:    connected
                nodeid:   8:    connected
                nodeid:   9:    connected
LINK ID 1
        addr    = 192.168.25.56
        status:
                nodeid:   2:    connected
                nodeid:   3:    connected
                nodeid:   5:    connected
                nodeid:   6:    localhost
                nodeid:   7:    disconnected
                nodeid:   8:    connected
                nodeid:   9:    connected

This looks the same on all other nodes.

Running "corosync-cfgtool -n" shows all hosts as reachable:
Code:
Local node ID 7, transport knet
nodeid: 2 reachable
   LINK: 0 (192.168.24.56->192.168.24.59) enabled connected mtu: 1397
   LINK: 1 (192.168.25.56->192.168.25.59) enabled connected mtu: 1397

nodeid: 3 reachable
   LINK: 0 (192.168.24.56->192.168.24.52) enabled connected mtu: 1397
   LINK: 1 (192.168.25.56->192.168.25.52) enabled connected mtu: 1397

nodeid: 5 reachable
   LINK: 0 (192.168.24.56->192.168.24.54) enabled connected mtu: 1397
   LINK: 1 (192.168.25.56->192.168.25.54) enabled connected mtu: 1397

nodeid: 6 reachable
   LINK: 0 (192.168.24.56->192.168.24.55) enabled connected mtu: 1397
   LINK: 1 (192.168.25.56->192.168.25.55) enabled connected mtu: 1397

nodeid: 8 reachable
   LINK: 0 (192.168.24.56->192.168.24.57) enabled connected mtu: 1397
   LINK: 1 (192.168.25.56->192.168.25.57) enabled connected mtu: 1397

nodeid: 9 reachable
   LINK: 0 (192.168.24.56->192.168.24.58) enabled connected mtu: 1397
   LINK: 1 (192.168.25.56->192.168.25.58) enabled connected mtu: 1397

As soon as corosync on pve56 (nodeid 7) is stopped, /etc/pve becomes read-only (at least it looks that way to me) on all other hosts, and the watchdog reboots all hosts after 60 seconds.
I can stop and start corosync on all other nodes without any problem.
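
With seven votes in the cluster, quorum should be four, so stopping corosync on a single node should not cost the remaining six nodes their quorum. To narrow this down I can run a few checks on one of the surviving nodes while corosync on pve56 is stopped; a minimal sketch, assuming a standard PVE 6.4 setup (the exact output will of course vary):

Code:
# quorum as the Proxmox stack sees it
pvecm status

# low-level votequorum view straight from corosync
corosync-quorumtool -s

# state of the HA services that arm the watchdog
systemctl status pve-ha-lrm pve-ha-crm watchdog-mux

# quick check whether /etc/pve is really read-only
# (.writetest is just an arbitrary scratch file name)
touch /etc/pve/.writetest && rm /etc/pve/.writetest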

What I do not understand is:

1. Why does corosync-cfgtool report pve56 as node 6 (localhost), and show node 7 as disconnected on ring1? (A couple of diagnostic commands are sketched below.)
2. Why could this have changed during one of the last upgrades? I have to say the cluster was running without any problems before. Unfortunately I cannot determine which upgrade changed the cluster this way.
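
For question 1, a sketch of how I might compare the nodelist in the on-disk config with what corosync is actually running with (the grep patterns are only rough guesses at useful keys):

Code:
# node entries as written in the config
grep -B1 -A4 'nodeid' /etc/corosync/corosync.conf

# runtime nodelist as loaded into corosync's cmap
corosync-cmapctl | grep 'nodelist.node'

# node IDs and membership as seen by votequorum
corosync-quorumtool -l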


My hosts are all upgraded to these packages:

Code:
proxmox-ve: 6.4-1 (running kernel: 5.4.151-1-pve)
pve-manager: 6.4-13 (running version: 6.4-13/9f411e79)
pve-kernel-5.4: 6.4-10
pve-kernel-helper: 6.4-10
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.151-1-pve: 5.4.151-1
pve-kernel-5.4.143-1-pve: 5.4.143-1
pve-kernel-5.4.140-1-pve: 5.4.140-1
pve-kernel-5.4.128-1-pve: 5.4.128-2
pve-kernel-5.4.119-1-pve: 5.4.119-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph: 15.2.15-pve1~bpo10
ceph-fuse: 15.2.15-pve1~bpo10
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.22-pve1~bpo10+1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-4
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.13-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-1
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.3-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.6-pve1~bpo10+1

Any help identifying the problem and fixing it would be welcome.

Thanks in advance

Lukas
 
Hi Fabian,
Thanks for your reply.
I viewed the bug report, and it could well be right.

I will test the packages when they "arrive" and come back.

Best regards
Lukas
 
