Corosync 3, node reboot after losing quorum

Hi,
We've had issues where a single node caused all the cluster nodes to reboot. We were advised to upgrade to Corosync 3 to get rid of multicast and the problems it can bring with it.

We have successfully upgraded all 14 nodes to Corosync 3.
Now I've started upgrading them to Proxmox 6 one by one. Today, after the upgrade finished on the second node, I typed "reboot" so it would come back up with its new kernel. Then I saw another node get rebooted automatically.
After it booted up, I checked the faulty node's syslog and found this:

Code:
Jan 30 17:32:00 master14 corosync[5859]:   [KNET  ] link: host: 1 link: 0 is down
Jan 30 17:32:00 master14 corosync[5859]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jan 30 17:32:00 master14 corosync[5859]:   [KNET  ] host: host: 1 has no active links
Jan 30 17:32:16 master14 corosync[5859]:   [TOTEM ] A new membership (e.b0) was formed. Members left: 1 2 3 4 5 6 7 8 9 10 11 12 13
Jan 30 17:32:16 master14 corosync[5859]:   [TOTEM ] Failed to receive the leave message. failed: 2 3 4 5 6 7 8 9 10 11 12 13
Jan 30 17:32:20 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 10
Jan 30 17:32:21 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 20
Jan 30 17:32:22 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 30
Jan 30 17:32:40 master14 pmxcfs[5881]: [status] crit: cpg_send_message failed: 6
Jan 30 17:32:40 master14 pve-firewall[2306]: firewall update time (15.675 seconds)
Jan 30 17:32:41 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 10
Jan 30 17:32:42 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 20
Jan 30 17:32:43 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 30
Jan 30 17:32:44 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 40
Jan 30 17:32:45 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 50
Jan 30 17:32:46 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 60
Jan 30 17:32:47 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 70
Jan 30 17:32:48 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 80
Jan 30 17:32:49 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 90
Jan 30 17:32:50 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 100
Jan 30 17:32:50 master14 pmxcfs[5881]: [status] notice: cpg_send_message retried 100 times
Jan 30 17:32:50 master14 pmxcfs[5881]: [status] crit: cpg_send_message failed: 6
Jan 30 17:32:51 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 10
Jan 30 17:32:52 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 20
Jan 30 17:32:53 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 30
Jan 30 17:32:54 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 40
Jan 30 17:32:55 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 50
Jan 30 17:32:56 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 60
Jan 30 17:32:56 master14 watchdog-mux[1337]: client watchdog expired - disable watchdog updates
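As far as I can tell, the log above shows the usual HA fencing chain: the knet link drops, the node falls out of the corosync membership, pmxcfs can no longer send cluster messages, and roughly 60 seconds later watchdog-mux stops feeding the watchdog and the node hard-resets. A small sketch that pulls those four stages out of a log excerpt (the sample lines are shortened copies from the log above; on a real node you'd grep /var/log/syslog or journalctl instead of a temp file):

```shell
#!/bin/sh
# Sketch: extract the fencing timeline from a syslog excerpt.
# Sample lines copied from the log in this post.
cat <<'EOF' > /tmp/fence-sample.log
Jan 30 17:32:00 master14 corosync[5859]:   [KNET  ] host: host: 1 has no active links
Jan 30 17:32:16 master14 corosync[5859]:   [TOTEM ] A new membership (e.b0) was formed. Members left: 1 2 3 4 5 6 7 8 9 10 11 12 13
Jan 30 17:32:40 master14 pmxcfs[5881]: [status] crit: cpg_send_message failed: 6
Jan 30 17:32:56 master14 watchdog-mux[1337]: client watchdog expired - disable watchdog updates
EOF
# one line per stage: link loss, membership change, pmxcfs failure, watchdog expiry
grep -cE 'no active links|new membership|cpg_send_message failed|watchdog expired' /tmp/fence-sample.log
```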



Here is the corosync.conf:
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
...
...
...
..
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: proxmoxcluster1
  config_version: 23
  interface {
    bindnetaddr: 172.27.3.11
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
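For comparison, with Corosync 3 the knet transport takes the node addresses from the nodelist, and as far as I understand the Corosync 2-era `bindnetaddr`/`ringnumber` interface settings are simply ignored. A knet-style totem section typically looks like this (values illustrative, not my actual config):

```
totem {
  cluster_name: proxmoxcluster1
  config_version: 24
  interface {
    linknumber: 0
  }
  ip_version: ipv4
  link_mode: passive
  secauth: on
  version: 2
}
```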


I just realized that the "bindnetaddr" is the IP of the node that was just upgraded and rebooted.
But if that could cause the cluster nodes to lose quorum, why weren't the other nodes affected?
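For context (assuming the default of one vote per node and expected_votes of 14): votequorum only needs a strict majority, floor(N/2) + 1 votes, so a 14-node cluster stays quorate down to 8 votes. A single node rebooting leaves 13 votes, which is why, by my understanding, one clean reboot alone shouldn't cost the rest of the cluster quorum. A quick sanity check:

```shell
#!/bin/sh
# Majority threshold for an N-node votequorum cluster: floor(N/2) + 1.
N=14
QUORUM=$(( N / 2 + 1 ))
echo "$QUORUM"   # prints 8 -> losing 1 of 14 nodes (13 votes left) keeps quorum
```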


And here is the pveversion output of the faulty node:

Code:
proxmox-ve: 5.4-2 (running kernel: 4.15.18-24-pve)
pve-manager: 5.4-13 (running version: 5.4-13/aee6f0ec)
pve-kernel-4.15: 5.4-12
pve-kernel-4.15.18-24-pve: 4.15.18-52
ceph: 12.2.12-pve1
corosync: 3.0.2-pve4~bpo9
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: not correctly installed
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-12
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-56
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-14
libpve-storage-perl: 5.0-44
libqb0: 1.0.5-1~bpo9+2
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-7
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-28
pve-cluster: 5.0-38
pve-container: 2.0-41
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-22
pve-firmware: 2.0-7
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 3.0.1-4
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-55
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3