Hi
We've had issues where one node caused all of the cluster nodes to get rebooted. We were advised to upgrade to corosync 3 to get rid of multicast and the problems it might bring with it.
We have successfully upgraded all 14 nodes to corosync 3.
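For reference, this is roughly how I've been sanity-checking each node after the corosync upgrade (standard tools only, so take it as a sketch rather than an exact transcript):
Code:
# confirm which corosync version is actually running
corosync -v

# knet link status towards the other nodes (corosync 3)
corosync-cfgtool -s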
Now I've started upgrading them to Proxmox 6 one by one. Today, after the upgrade finished on the second node, I typed "reboot" so it could come back up with its new kernel. Then I saw that another node rebooted automatically.
After it booted up, I checked the faulty node's syslog and found this:
Code:
Jan 30 17:32:00 master14 corosync[5859]: [KNET ] link: host: 1 link: 0 is down
Jan 30 17:32:00 master14 corosync[5859]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Jan 30 17:32:00 master14 corosync[5859]: [KNET ] host: host: 1 has no active links
Jan 30 17:32:16 master14 corosync[5859]: [TOTEM ] A new membership (e.b0) was formed. Members left: 1 2 3 4 5 6 7 8 9 10 11 12 13
Jan 30 17:32:16 master14 corosync[5859]: [TOTEM ] Failed to receive the leave message. failed: 2 3 4 5 6 7 8 9 10 11 12 13
Jan 30 17:32:20 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 10
Jan 30 17:32:21 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 20
Jan 30 17:32:22 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 30
Jan 30 17:32:40 master14 pmxcfs[5881]: [status] crit: cpg_send_message failed: 6
Jan 30 17:32:40 master14 pve-firewall[2306]: firewall update time (15.675 seconds)
Jan 30 17:32:41 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 10
Jan 30 17:32:42 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 20
Jan 30 17:32:43 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 30
Jan 30 17:32:44 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 40
Jan 30 17:32:45 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 50
Jan 30 17:32:46 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 60
Jan 30 17:32:47 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 70
Jan 30 17:32:48 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 80
Jan 30 17:32:49 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 90
Jan 30 17:32:50 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 100
Jan 30 17:32:50 master14 pmxcfs[5881]: [status] notice: cpg_send_message retried 100 times
Jan 30 17:32:50 master14 pmxcfs[5881]: [status] crit: cpg_send_message failed: 6
Jan 30 17:32:51 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 10
Jan 30 17:32:52 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 20
Jan 30 17:32:53 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 30
Jan 30 17:32:54 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 40
Jan 30 17:32:55 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 50
Jan 30 17:32:56 master14 pmxcfs[5881]: [status] notice: cpg_send_message retry 60
Jan 30 17:32:56 master14 watchdog-mux[1337]: client watchdog expired - disable watchdog updates
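If I'm reading that correctly, the node dropped out of the corosync membership, pmxcfs couldn't send its messages for roughly a minute, and then watchdog-mux stopped getting updates, so the watchdog fenced the node. That's only my interpretation, though. These are the things I'd look at on the fenced node (again just the standard tools, so consider it a sketch):
Code:
# is HA active on this node, i.e. is the watchdog armed at all?
ha-manager status

# quorum / membership as this node sees it now
pvecm status

# corosync, HA and watchdog messages from the boot where it fenced
journalctl -b -1 -u corosync -u pve-ha-lrm -u watchdog-mux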
Here is the corosync.conf:
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    ...
    ...
    ...
    ..
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: proxmoxcluster1
  config_version: 23
  interface {
    bindnetaddr: 172.27.3.11
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
I just realized that the "bindnetaddr" is the IP of the node that was just upgraded and rebooted.
But if that could cause the cluster nodes to lose quorum, why weren't the other nodes affected?
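As far as I understand it, with corosync 3 / knet the member addresses come from the ring0_addr entries in the nodelist, and the interface bindnetaddr is a leftover from the old corosync 2 / multicast style of config that should simply be ignored, so my guess is it's cosmetic rather than the cause, but I'd like confirmation. For comparison, this is roughly what I'd expect a cleaned-up corosync 3 config to look like (the node entry below is a placeholder, not my real one):
Code:
nodelist {
  node {
    # placeholder values, my real entries are trimmed above
    name: master14
    nodeid: 14
    quorum_votes: 1
    ring0_addr: 172.27.3.x
  }
}

totem {
  cluster_name: proxmoxcluster1
  # bump config_version whenever the file is edited
  config_version: 23
  interface {
    linknumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}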
And here is the pveversion output of the faulty node:
Code:
proxmox-ve: 5.4-2 (running kernel: 4.15.18-24-pve)
pve-manager: 5.4-13 (running version: 5.4-13/aee6f0ec)
pve-kernel-4.15: 5.4-12
pve-kernel-4.15.18-24-pve: 4.15.18-52
ceph: 12.2.12-pve1
corosync: 3.0.2-pve4~bpo9
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: not correctly installed
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-12
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-56
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-14
libpve-storage-perl: 5.0-44
libqb0: 1.0.5-1~bpo9+2
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-7
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-28
pve-cluster: 5.0-38
pve-container: 2.0-41
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-22
pve-firmware: 2.0-7
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 3.0.1-4
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-55
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3