corosync stability issue

ianmbetts

Member
Mar 11, 2020
Hi,
I have a classic three node hyper-converged cluster.
Each node is identical HP DL380p with 96GB memory.
12 disk bays, as follows:
1 x sata 320 GB HDD (PVE host OS)
1 x sata 120GB SSD (Ceph WAL)
10 x 1TB HDD (OSDs)

Network and interfaces are as follows:
1 GbE corosync (single port on each node connected to an unmanaged 1 GbE switch)
10 GbE Ceph public network (broadcast bond, directly connected, no switch)
10 GbE Ceph cluster network (separate cluster net, also a directly connected broadcast bond)
10 GbE office network interfaces (Cisco Nexus)
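For reference, the directly connected broadcast bonds are defined roughly like this in /etc/network/interfaces (just a sketch; the interface names and addresses here are placeholders, not my real ones):

auto bond1
iface bond1 inet static
        address 10.10.10.1/24
        bond-slaves ens1f0 ens1f1
        bond-mode broadcast
        bond-miimon 100
#Ceph public network, cabled directly to the other two nodes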

I am running the latest full upgrade from no-subscription.
There are about a dozen VMs running; most are fairly lightweight. The largest is a 2 TB Linux image running ownCloud.
There are around 20 physical machines on the office network which sync to ownCloud.

I am seeing seemingly random corosync issues where quorum is momentarily lost, sometimes resulting in a node reboot.
The problems are exacerbated during backups to a PBS datastore, and are not as bad when backing up to NFS.
Having said that, removing all NFS mounts has stabilized things, but the logs show corosync still glitching every few minutes.

Couple of questions:

1) I am starting to doubt the physical corosync setup (NICs/cabling/switch).
Can I replace the corosync switch with a directly connected broadcast bond using the same topology as my 10 GbE Ceph networks
(i.e. using two 1 GbE ports in each node)?
Will the latency be acceptable?

2) Removing the NFS mounts seems to have helped, but I am wondering about PBS.
Does it use NFS under the hood?

Thanks

Here is what I see in syslog:
May 4 11:45:07 mfscn01 corosync[2087]: [MAIN ] Completed service synchronization, ready to provide service.
May 4 11:45:07 mfscn01 pmxcfs[2070]: [dcdb] notice: cpg_send_message retried 39 times
May 4 11:45:07 mfscn01 pmxcfs[2070]: [status] notice: cpg_send_message retried 39 times
May 4 11:45:07 mfscn01 pmxcfs[2070]: [dcdb] notice: cpg_send_message retried 1 times
May 4 11:45:07 mfscn01 pmxcfs[2070]: [status] notice: members: 1/2070, 2/2223, 3/2157
May 4 11:45:07 mfscn01 pmxcfs[2070]: [status] notice: starting data syncronisation
May 4 11:45:07 mfscn01 pmxcfs[2070]: [dcdb] notice: received sync request (epoch 1/2070/00000066)
May 4 11:45:07 mfscn01 pmxcfs[2070]: [status] notice: received sync request (epoch 1/2070/00000042)
May 4 11:45:07 mfscn01 pmxcfs[2070]: [dcdb] notice: received all states
May 4 11:45:07 mfscn01 pmxcfs[2070]: [dcdb] notice: leader is 1/2070
May 4 11:45:07 mfscn01 pmxcfs[2070]: [dcdb] notice: synced members: 1/2070, 2/2223
May 4 11:45:07 mfscn01 pmxcfs[2070]: [dcdb] notice: start sending inode updates
May 4 11:45:07 mfscn01 pmxcfs[2070]: [dcdb] notice: sent all (8) updates
May 4 11:45:07 mfscn01 pmxcfs[2070]: [dcdb] notice: all data is up to date
May 4 11:45:07 mfscn01 pmxcfs[2070]: [dcdb] notice: dfsm_deliver_queue: queue length 5
May 4 11:45:07 mfscn01 pve-ha-crm[3585]: loop take too long (38 seconds)
May 4 11:45:07 mfscn01 pve-ha-lrm[3618]: successfully acquired lock 'ha_agent_mfscn01_lock'
May 4 11:45:07 mfscn01 pve-ha-lrm[3618]: status change lost_agent_lock => active
May 4 11:45:07 mfscn01 pmxcfs[2070]: [status] notice: received all states
May 4 11:45:07 mfscn01 pmxcfs[2070]: [status] notice: all data is up to date
May 4 11:45:07 mfscn01 pmxcfs[2070]: [status] notice: dfsm_deliver_queue: queue length 41
May 4 11:45:07 mfscn01 pvestatd[3151]: status update time (24.251 seconds)
May 4 11:45:07 mfscn01 pmxcfs[2070]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/mfscn01/local: -1
May 4 11:45:07 mfscn01 pmxcfs[2070]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/mfscn01/pbs01: -1
May 4 11:45:07 mfscn01 pmxcfs[2070]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/mfscn01/POOL0: -1
May 4 11:45:08 mfscn01 pmxcfs[2070]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/mfscn02/local: -1
May 4 11:45:08 mfscn01 pmxcfs[2070]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/mfscn02/pbs01: -1
May 4 11:45:08 mfscn01 pmxcfs[2070]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/mfscn02/POOL0: -1
May 4 11:46:51 mfscn01 pveproxy[3610]: worker 311484 finished
May 4 11:46:51 mfscn01 pveproxy[3610]: starting 1 worker(s)
May 4 11:46:51 mfscn01 pveproxy[3610]: worker 331992 started
May 4 11:46:53 mfscn01 pveproxy[331991]: worker exit
May 4 11:47:30 mfscn01 pvedaemon[193873]: <root@pam> successful auth for user 'root@pam'
May 4 11:52:02 mfscn01 pveproxy[3610]: worker 310934 finished
May 4 11:52:02 mfscn01 pveproxy[3610]: starting 1 worker(s)
May 4 11:52:02 mfscn01 pveproxy[3610]: worker 336031 started
May 4 11:52:03 mfscn01 pveproxy[336030]: got inotify poll request in wrong process - disabling inotify
May 4 11:52:05 mfscn01 pveproxy[336030]: worker exit
May 4 11:55:06 mfscn01 pvedaemon[259441]: <root@pam> successful auth for user 'root@pam'
May 4 12:01:30 mfscn01 corosync[2087]: [KNET ] link: host: 2 link: 0 is down
May 4 12:01:30 mfscn01 corosync[2087]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
May 4 12:01:30 mfscn01 corosync[2087]: [KNET ] host: host: 2 has no active links
May 4 12:01:33 mfscn01 corosync[2087]: [TOTEM ] Token has not been received in 2737 ms
May 4 12:01:38 mfscn01 corosync[2087]: [QUORUM] Sync members[2]: 1 3
May 4 12:01:38 mfscn01 corosync[2087]: [QUORUM] Sync left[1]: 2
May 4 12:01:38 mfscn01 corosync[2087]: [TOTEM ] A new membership (1.33dd) was formed. Members left: 2
May 4 12:01:38 mfscn01 corosync[2087]: [TOTEM ] Failed to receive the leave message. failed: 2
May 4 12:01:39 mfscn01 pmxcfs[2070]: [status] notice: cpg_send_message retry 10
May 4 12:01:40 mfscn01 pmxcfs[2070]: [status] notice: cpg_send_message retry 20
May 4 12:01:41 mfscn01 pmxcfs[2070]: [status] notice: cpg_send_message retry 30
May 4 12:01:42 mfscn01 pmxcfs[2070]: [status] notice: cpg_send_message retry 40
May 4 12:01:43 mfscn01 corosync[2087]: [QUORUM] Sync members[2]: 1 3
May 4 12:01:43 mfscn01 corosync[2087]: [QUORUM] Sync left[1]: 2
May 4 12:01:43 mfscn01 corosync[2087]: [TOTEM ] A new membership (1.33e1) was formed. Members
May 4 12:01:43 mfscn01 pmxcfs[2070]: [dcdb] notice: members: 1/2070, 3/2157
May 4 12:01:43 mfscn01 pmxcfs[2070]: [dcdb] notice: starting data syncronisation
May 4 12:01:43 mfscn01 corosync[2087]: [QUORUM] Members[2]: 1 3
May 4 12:01:43 mfscn01 corosync[2087]: [MAIN ] Completed service synchronization, ready to provide service.
May 4 12:01:43 mfscn01 pmxcfs[2070]: [status] notice: cpg_send_message retried 43 times
May 4 12:01:43 mfscn01 pmxcfs[2070]: [dcdb] notice: cpg_send_message retried 1 times
May 4 12:01:43 mfscn01 pmxcfs[2070]: [status] notice: members: 1/2070, 3/2157
May 4 12:01:43 mfscn01 pmxcfs[2070]: [status] notice: starting data syncronisation
May 4 12:01:43 mfscn01 pmxcfs[2070]: [dcdb] notice: received sync request (epoch 1/2070/00000067)
May 4 12:01:43 mfscn01 pmxcfs[2070]: [status] notice: received sync request (epoch 1/2070/00000043)
May 4 12:01:43 mfscn01 pmxcfs[2070]: [dcdb] notice: received all states
May 4 12:01:43 mfscn01 pmxcfs[2070]: [dcdb] notice: leader is 1/2070
May 4 12:01:43 mfscn01 pmxcfs[2070]: [dcdb] notice: synced members: 1/2070, 3/2157
May 4 12:01:43 mfscn01 pmxcfs[2070]: [dcdb] notice: start sending inode updates
May 4 12:01:43 mfscn01 pmxcfs[2070]: [dcdb] notice: sent all (0) updates
May 4 12:01:43 mfscn01 pmxcfs[2070]: [dcdb] notice: all data is up to date
May 4 12:01:43 mfscn01 pmxcfs[2070]: [dcdb] notice: dfsm_deliver_queue: queue length 6
May 4 12:01:43 mfscn01 pmxcfs[2070]: [status] notice: received all states
May 4 12:01:43 mfscn01 pmxcfs[2070]: [status] notice: all data is up to date
May 4 12:01:43 mfscn01 pmxcfs[2070]: [status] notice: dfsm_deliver_queue: queue length 24
May 4 12:01:51 mfscn01 corosync[2087]: [QUORUM] Sync members[2]: 1 3
May 4 12:01:51 mfscn01 corosync[2087]: [TOTEM ] A new membership (1.33e5) was formed. Members
May 4 12:01:54 mfscn01 pmxcfs[2070]: [dcdb] notice: cpg_send_message retry 10
May 4 12:01:55 mfscn01 pmxcfs[2070]: [dcdb] notice: cpg_send_message retry 20
May 4 12:01:56 mfscn01 pmxcfs[2070]: [dcdb] notice: cpg_send_message retry 30
May 4 12:01:56 mfscn01 corosync[2087]: [QUORUM] Sync members[2]: 1 3
May 4 12:01:56 mfscn01 corosync[2087]: [TOTEM ] A new membership (1.33e9) was formed. Members
May 4 12:01:57 mfscn01 pmxcfs[2070]: [dcdb] notice: cpg_send_message retry 40
May 4 12:01:58 mfscn01 pmxcfs[2070]: [dcdb] notice: cpg_send_message retry 50
May 4 12:01:59 mfscn01 pmxcfs[2070]: [dcdb] notice: cpg_send_message retry 60
May 4 12:01:59 mfscn01 pmxcfs[2070]: [status] notice: cpg_send_message retry 10
May 4 12:02:00 mfscn01 pmxcfs[2070]: [dcdb] notice: cpg_send_message retry 70
May 4 12:02:00 mfscn01 pmxcfs[2070]: [status] notice: cpg_send_message retry 20
May 4 12:02:00 mfscn01 corosync[2087]: [QUORUM] Sync members[2]: 1 3
May 4 12:02:00 mfscn01 corosync[2087]: [TOTEM ] A new membership (1.33ed) was formed. Members
May 4 12:02:01 mfscn01 pmxcfs[2070]: [dcdb] notice: cpg_send_message retry 80
May 4 12:02:01 mfscn01 pmxcfs[2070]: [status] notice: cpg_send_message retry 30
May 4 12:02:02 mfscn01 pmxcfs[2070]: [dcdb] notice: cpg_send_message retry 90
May 4 12:02:02 mfscn01 pmxcfs[2070]: [status] notice: cpg_send_message retry 40
May 4 12:02:02 mfscn01 corosync[2087]: [KNET ] rx: host: 2 link: 0 is up
May 4 12:02:02 mfscn01 corosync[2087]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
May 4 12:02:02 mfscn01 corosync[2087]: [QUORUM] Sync members[3]: 1 2 3
May 4 12:02:02 mfscn01 corosync[2087]: [QUORUM] Sync joined[1]: 2
May 4 12:02:02 mfscn01 corosync[2087]: [TOTEM ] A new membership (1.33f1) was formed. Members joined: 2
May 4 12:02:02 mfscn01 pmxcfs[2070]: [dcdb] notice: members: 1/2070, 2/2223, 3/2157
May 4 12:02:02 mfscn01 pmxcfs[2070]: [dcdb] notice: starting data syncronisation
May 4 12:02:02 mfscn01 corosync[2087]: [QUORUM] Members[3]: 1 2 3
May 4 12:02:02 mfscn01 corosync[2087]: [MAIN ] Completed service synchronization, ready to provide service.
May 4 12:02:02 mfscn01 pmxcfs[2070]: [status] notice: cpg_send_message retried 42 times
May 4 12:02:02 mfscn01 pmxcfs[2070]: [dcdb] notice: cpg_send_message retried 96 times
May 4 12:02:02 mfscn01 pmxcfs[2070]: [dcdb] notice: cpg_send_message retried 1 times
May 4 12:02:02 mfscn01 pmxcfs[2070]: [status] notice: members: 1/2070, 2/2223, 3/2157
May 4 12:02:02 mfscn01 pmxcfs[2070]: [status] notice: starting data syncronisation
May 4 12:02:02 mfscn01 pmxcfs[2070]: [dcdb] notice: received sync request (epoch 1/2070/00000068)
May 4 12:02:02 mfscn01 pmxcfs[2070]: [status] notice: received sync request (epoch 1/2070/00000044)
May 4 12:02:02 mfscn01 pmxcfs[2070]: [dcdb] notice: received all states
May 4 12:02:02 mfscn01 pmxcfs[2070]: [dcdb] notice: leader is 1/2070
May 4 12:02:02 mfscn01 pmxcfs[2070]: [dcdb] notice: synced members: 1/2070, 3/2157
May 4 12:02:02 mfscn01 pmxcfs[2070]: [dcdb] notice: start sending inode updates
May 4 12:02:02 mfscn01 pmxcfs[2070]: [dcdb] notice: sent all (5) updates
May 4 12:02:02 mfscn01 pmxcfs[2070]: [dcdb] notice: all data is up to date
May 4 12:02:02 mfscn01 pmxcfs[2070]: [dcdb] notice: dfsm_deliver_queue: queue length 6
May 4 12:02:02 mfscn01 pmxcfs[2070]: [status] notice: received all states
May 4 12:02:02 mfscn01 pmxcfs[2070]: [status] notice: all data is up to date
May 4 12:02:02 mfscn01 pmxcfs[2070]: [status] notice: dfsm_deliver_queue: queue length 19
May 4 12:02:30 mfscn01 pvedaemon[259441]: <root@pam> successful auth for user 'root@pam'
May 4 12:03:13 mfscn01 corosync[2087]: [KNET ] link: host: 3 link: 0 is down
May 4 12:03:13 mfscn01 corosync[2087]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
May 4 12:03:13 mfscn01 corosync[2087]: [KNET ] host: host: 3 has no active links
May 4 12:03:13 mfscn01 corosync[2087]: [TOTEM ] Token has not been received in 2737 ms
May 4 12:03:14 mfscn01 corosync[2087]: [TOTEM ] A processor failed, forming new configuration: token timed out (3650ms), waiting 4380ms for consensus.
May 4 12:03:19 mfscn01 corosync[2087]: [QUORUM] Sync members[1]: 1
 
Please provide the output of pveversion -v.
Rather than replacing it with a bond, you could try adding a 2nd link on one or all of the other networks as fallback.

Is the PBS or any of the NFS servers connected via the corosync network?
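For example, a second link normally just means an extra ringX_addr per node in /etc/pve/corosync.conf, roughly like this (the addresses are made up, and remember to increase config_version in the totem section when you change the file):

nodelist {
  node {
    name: mfscn01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.10.1    # dedicated 1 GbE corosync network
    ring1_addr: 10.10.10.1      # fallback link over one of the 10 GbE networks
  }
  # ... matching entries for mfscn02 and mfscn03 ...
}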
 
Thanks for the response.

PBS is a physical server hanging off the 10 G Nexus switch; each node has a 10 G connection to that switch,
as does the server hosting PBS and NFS. The problem appeared with PBS before I added NFS.

The corosync network was previously also used for occasional remote management, so the BMC ports are currently
still connected to the switch, as was a remote-access gateway, but I have shut that down to eliminate it.
All that is active on the switch (that I am aware of) is two ports per node: a) the corosync port, b) the BMC port.

Since my original post I have restarted systemd-timesyncd on all nodes, as there was some clock skew.
I am still seeing the same glitches on corosync, although fortunately not enough to cause a reboot.
I will try adding some redundancy using the other networks as you suggested (thanks).
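In the meantime, a quick way to watch the knet link state from each node is something like:

corosync-cfgtool -s      # per-link status as seen by the local corosync/knet
pvecm status             # overall quorum and membership view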


Here is the pveversion output:
root@mfscn01:~# pveversion -v
proxmox-ve: 6.4-1 (running kernel: 5.4.106-1-pve)
pve-manager: 6.4-4 (running version: 6.4-4/337d6701)
pve-kernel-5.4: 6.4-1
pve-kernel-helper: 6.4-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-5.4.98-1-pve: 5.4.98-1
ceph: 15.2.11-pve1
ceph-fuse: 15.2.11-pve1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.8
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.4-1
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-2
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-1
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.5-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.5-3
pve-cluster: 6.4-1
pve-container: 3.3-5
pve-docs: 6.4-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1
 
Just to follow up on this: after adding one of the other networks into corosync as suggested, things are considerably improved.
Now I only see very occasional messages indicating that an alternative corosync network path has been selected,
e.g.

May 4 16:15:47 mfscn03 corosync[2067]: [KNET ] link: host: 1 link: 0 is down
May 4 16:15:47 mfscn03 corosync[2067]: [KNET ] host: host: 1 (passive) best link: 1 (pri: 1)
May 4 16:15:47 mfscn03 pmxcfs[2044]: [status] notice: received log
May 4 16:16:08 mfscn03 corosync[2067]: [KNET ] rx: host: 1 link: 0 is up
May 4 16:16:08 mfscn03 corosync[2067]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)

every 10 mins or so; actually, pretty much exactly every 20 mins.
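To check the cadence I am just pulling the link-down timestamps out of the journal, something like:

journalctl -u corosync --since today | grep 'link: 0 is down'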


WORD OF WARNING for anyone following this!
Be very careful editing corosync.conf; I introduced a syntax error that knocked out the cluster, which was difficult to recover
from remotely :)
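What I should have done (and what I believe the Proxmox docs recommend) is to edit a copy and only move it into place once it looks right, roughly:

cp /etc/pve/corosync.conf /root/corosync.conf.new
cp /etc/pve/corosync.conf /root/corosync.conf.bak     # keep something to roll back to
nano /root/corosync.conf.new                          # make the change and bump config_version in totem { }
mv /root/corosync.conf.new /etc/pve/corosync.conf     # pmxcfs then propagates it to the other nodes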
 
So it looks like every time the links toggle, it happens shortly after a pmxcfs "received log" message:

May 5 10:28:15 mfscn03 pmxcfs[2044]: [status] notice: received log
May 5 10:29:06 mfscn03 corosync[2067]: [KNET ] link: host: 1 link: 0 is down
May 5 10:29:06 mfscn03 corosync[2067]: [KNET ] host: host: 1 (passive) best link: 1 (pri: 1)
May 5 10:29:21 mfscn03 corosync[2067]: [KNET ] rx: host: 1 link: 0 is up
May 5 10:29:21 mfscn03 corosync[2067]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
May 5 10:43:09 mfscn03 pmxcfs[2044]: [status] notice: received log
May 5 10:47:57 mfscn03 corosync[2067]: [KNET ] link: host: 1 link: 0 is down
May 5 10:47:57 mfscn03 corosync[2067]: [KNET ] host: host: 1 (passive) best link: 1 (pri: 1)
May 5 10:48:28 mfscn03 corosync[2067]: [KNET ] rx: host: 1 link: 0 is up
May 5 10:48:28 mfscn03 corosync[2067]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
May 5 10:58:10 mfscn03 pmxcfs[2044]: [status] notice: received log
May 5 11:06:47 mfscn03 corosync[2067]: [KNET ] link: host: 1 link: 0 is down
May 5 11:06:47 mfscn03 corosync[2067]: [KNET ] host: host: 1 (passive) best link: 1 (pri: 1)
May 5 11:07:02 mfscn03 corosync[2067]: [KNET ] rx: host: 1 link: 0 is up
May 5 11:07:02 mfscn03 corosync[2067]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)

And sometimes it does not:

May 5 12:22:03 mfscn03 corosync[2067]: [KNET ] link: host: 1 link: 0 is down
May 5 12:22:03 mfscn03 corosync[2067]: [KNET ] host: host: 1 (passive) best link: 1 (pri: 1)
May 5 12:22:22 mfscn03 corosync[2067]: [KNET ] rx: host: 1 link: 0 is up
May 5 12:22:22 mfscn03 corosync[2067]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
May 5 12:40:53 mfscn03 corosync[2067]: [KNET ] link: host: 1 link: 0 is down
May 5 12:40:53 mfscn03 corosync[2067]: [KNET ] host: host: 1 (passive) best link: 1 (pri: 1)
May 5 12:41:06 mfscn03 corosync[2067]: [KNET ] rx: host: 1 link: 0 is up
May 5 12:41:06 mfscn03 corosync[2067]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
May 5 12:59:44 mfscn03 corosync[2067]: [KNET ] link: host: 1 link: 0 is down
May 5 12:59:44 mfscn03 corosync[2067]: [KNET ] host: host: 1 (passive) best link: 1 (pri: 1)
May 5 13:00:07 mfscn03 corosync[2067]: [KNET ] rx: host: 1 link: 0 is up
May 5 13:00:07 mfscn03 corosync[2067]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
 
Do you see "link: host: 1 link: 0 is down" on every node? (That would mean that host 1 (from corosync.conf) is flapping.)

If yes, maybe check with "dmesg" on host 1 to see whether there is anything strange.
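For example, on host 1 something along these lines (the interface name is just an example):

dmesg -T | grep -iE 'link|nic|eno|eth'        # look for link flaps or driver errors around those timestamps
ip -s link show eno1                          # RX/TX error and drop counters
ethtool -S eno1 | grep -iE 'err|drop|crc'     # NIC statistics, if the driver exposes them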
 
