corosync entries in journalctl link: 0 is down etc.

brucexx

We had node 5 in a 6-node cluster fenced due to excessive RAM ECC errors. HA worked great and all VMs started on other nodes. The cluster had worked with no corosync issues for the last year, since it was put into production (we had ECC errors in February, but the RAM was replaced). I should mention that for cluster I have been using bonded LACP (802.3ad) with hash 3+4.

PVE version:
proxmox-ve: 8.3.0 (running kernel: 6.8.12-4-pve)
pve-manager: 8.3.1 (running version: 8.3.1/fb48e850ef9dde27)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.12-4
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
proxmox-kernel-6.8.12-2-pve-signed: 6.8.12-2
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
ceph-fuse: 18.2.4-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.4
libpve-access-control: 8.2.0
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.10
libpve-cluster-perl: 8.0.10
libpve-common-perl: 8.2.9
libpve-guest-common-perl: 5.1.6
libpve-http-server-perl: 5.1.2
libpve-network-perl: 0.10.0
libpve-rs-perl: 0.9.0
libpve-storage-perl: 8.3.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.5.0-1
proxmox-backup-client: 3.3.2-1
proxmox-backup-file-restore: 3.3.2-2
proxmox-firewall: 0.6.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.3.1
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.3
pve-cluster: 8.0.10
pve-container: 5.2.2
pve-docs: 8.3.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.1.0
pve-firmware: 3.14-1
pve-ha-manager: 4.0.6
pve-i18n: 3.3.2
pve-qemu-kvm: 9.0.2-4
pve-xtermjs: 5.3.0-3
qemu-server: 8.3.2
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.6-pve1

While troubleshooting, I discovered corosync entries in the journal on almost all of the remaining nodes showing them randomly losing connection to other nodes. Here is an example from nodes 1 and 3:

root@pve01-nj:~# journalctl -u corosync -n 70
Apr 06 22:02:10 pve01-nj corosync[2325]: [TOTEM ] Retransmit List: 2d5fca5
Apr 06 22:02:10 pve01-nj corosync[2325]: [TOTEM ] Retransmit List: 2d5fca6
Oct 29 11:55:16 pve01-nj corosync[2325]: [TOTEM ] Retransmit List: 119219ee
Nov 23 02:32:12 pve01-nj corosync[2325]: [KNET ] link: host: 5 link: 0 is down
Nov 23 02:32:12 pve01-nj corosync[2325]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Nov 23 02:32:12 pve01-nj corosync[2325]: [KNET ] host: host: 5 has no active links
Nov 23 02:32:14 pve01-nj corosync[2325]: [TOTEM ] Token has not been received in 4200 ms
Nov 23 02:32:15 pve01-nj corosync[2325]: [TOTEM ] A processor failed, forming new configuration: token timed out (5600ms), waiting 6720ms for consensus.
Nov 23 02:32:22 pve01-nj corosync[2325]: [QUORUM] Sync members[5]: 1 2 3 4 6
Nov 23 02:32:22 pve01-nj corosync[2325]: [QUORUM] Sync left[1]: 5
Nov 23 02:32:22 pve01-nj corosync[2325]: [TOTEM ] A new membership (1.143) was formed. Members left: 5
Nov 23 02:32:22 pve01-nj corosync[2325]: [TOTEM ] Failed to receive the leave message. failed: 5
Nov 23 02:32:22 pve01-nj corosync[2325]: [QUORUM] Members[5]: 1 2 3 4 6
Nov 23 02:32:22 pve01-nj corosync[2325]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 23 03:35:43 pve01-nj corosync[2325]: [KNET ] link: host: 2 link: 0 is down
Nov 23 03:35:43 pve01-nj corosync[2325]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 23 03:35:43 pve01-nj corosync[2325]: [KNET ] host: host: 2 has no active links
Nov 23 03:35:46 pve01-nj corosync[2325]: [KNET ] rx: host: 2 link: 0 is up
Nov 23 03:35:46 pve01-nj corosync[2325]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Nov 23 03:35:46 pve01-nj corosync[2325]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 23 03:35:46 pve01-nj corosync[2325]: [KNET ] pmtud: Global data MTU changed to: 1397
Nov 23 05:53:29 pve01-nj corosync[2325]: [KNET ] link: host: 3 link: 0 is down
Nov 23 05:53:29 pve01-nj corosync[2325]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Nov 23 05:53:29 pve01-nj corosync[2325]: [KNET ] host: host: 3 has no active links
Nov 23 05:53:29 pve01-nj corosync[2325]: [KNET ] link: Resetting MTU for link 0 because host 3 joined
Nov 23 05:53:29 pve01-nj corosync[2325]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Nov 23 05:53:29 pve01-nj corosync[2325]: [KNET ] pmtud: Global data MTU changed to: 1397
Nov 23 07:21:49 pve01-nj corosync[2325]: [TOTEM ] Retransmit List: 33cda
Nov 23 07:23:17 pve01-nj corosync[2325]: [KNET ] link: host: 2 link: 0 is down
Nov 23 07:23:17 pve01-nj corosync[2325]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 23 07:23:17 pve01-nj corosync[2325]: [KNET ] host: host: 2 has no active links
Nov 23 07:23:18 pve01-nj corosync[2325]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Nov 23 07:23:18 pve01-nj corosync[2325]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 23 07:23:18 pve01-nj corosync[2325]: [KNET ] pmtud: Global data MTU changed to: 1397
Nov 23 07:42:00 pve01-nj corosync[2325]: [KNET ] link: host: 2 link: 0 is down
Nov 23 07:42:00 pve01-nj corosync[2325]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 23 07:42:00 pve01-nj corosync[2325]: [KNET ] host: host: 2 has no active links
Nov 23 07:42:01 pve01-nj corosync[2325]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Nov 23 07:42:01 pve01-nj corosync[2325]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 23 07:42:01 pve01-nj corosync[2325]: [KNET ] pmtud: Global data MTU changed to: 1397
Nov 23 07:44:18 pve01-nj corosync[2325]: [KNET ] link: host: 6 link: 0 is down
Nov 23 07:44:18 pve01-nj corosync[2325]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Nov 23 07:44:18 pve01-nj corosync[2325]: [KNET ] host: host: 6 has no active links
Nov 23 07:44:18 pve01-nj corosync[2325]: [KNET ] link: Resetting MTU for link 0 because host 6 joined
Nov 23 07:44:18 pve01-nj corosync[2325]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Nov 23 07:44:18 pve01-nj corosync[2325]: [KNET ] pmtud: Global data MTU changed to: 1397
Nov 23 08:35:16 pve01-nj corosync[2325]: [TOTEM ] Retransmit List: 40f12
Nov 23 09:05:14 pve01-nj corosync[2325]: [KNET ] link: host: 2 link: 0 is down
Nov 23 09:05:14 pve01-nj corosync[2325]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 23 09:05:14 pve01-nj corosync[2325]: [KNET ] host: host: 2 has no active links
Nov 23 09:05:14 pve01-nj corosync[2325]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Nov 23 09:05:14 pve01-nj corosync[2325]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 23 09:05:14 pve01-nj corosync[2325]: [KNET ] pmtud: Global data MTU changed to: 1397
Nov 23 10:29:55 pve01-nj corosync[2325]: [TOTEM ] Retransmit List: 556c0
Nov 23 11:26:19 pve01-nj corosync[2325]: [TOTEM ] Retransmit List: 5f813
Nov 23 13:28:08 pve01-nj corosync[2325]: [KNET ] link: host: 2 link: 0 is down
Nov 23 13:28:08 pve01-nj corosync[2325]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 23 13:28:08 pve01-nj corosync[2325]: [KNET ] host: host: 2 has no active links
Nov 23 13:28:09 pve01-nj corosync[2325]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Nov 23 13:28:09 pve01-nj corosync[2325]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 23 13:28:09 pve01-nj corosync[2325]: [KNET ] pmtud: Global data MTU changed to: 1397
Nov 23 15:25:18 pve01-nj corosync[2325]: [KNET ] rx: host: 5 link: 0 is up
Nov 23 15:25:18 pve01-nj corosync[2325]: [KNET ] link: Resetting MTU for link 0 because host 5 joined
Nov 23 15:25:18 pve01-nj corosync[2325]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Nov 23 15:25:18 pve01-nj corosync[2325]: [KNET ] pmtud: Global data MTU changed to: 1397
Nov 23 15:25:19 pve01-nj corosync[2325]: [QUORUM] Sync members[6]: 1 2 3 4 5 6
Nov 23 15:25:19 pve01-nj corosync[2325]: [QUORUM] Sync joined[1]: 5
Nov 23 15:25:19 pve01-nj corosync[2325]: [TOTEM ] A new membership (1.148) was formed. Members joined: 5
Nov 23 15:25:19 pve01-nj corosync[2325]: [QUORUM] Members[6]: 1 2 3 4 5 6
Nov 23 15:25:19 pve01-nj corosync[2325]: [MAIN ] Completed service synchronization, ready to provide service.
root@pve01-nj:~#

root@pve03-nj:~# journalctl -u corosync -n 70
Jul 15 22:30:27 pve03-nj corosync[2284]: [TOTEM ] Retransmit List: 9f8da42
Jul 15 22:30:27 pve03-nj corosync[2284]: [TOTEM ] Retransmit List: 9f8da43
Oct 01 11:26:35 pve03-nj corosync[2284]: [TOTEM ] Retransmit List: f8c1b0d
Oct 29 03:08:13 pve03-nj corosync[2284]: [TOTEM ] Retransmit List: 118b5171
Nov 23 02:32:12 pve03-nj corosync[2284]: [KNET ] link: host: 5 link: 0 is down
Nov 23 02:32:12 pve03-nj corosync[2284]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Nov 23 02:32:12 pve03-nj corosync[2284]: [KNET ] host: host: 5 has no active links
Nov 23 02:32:14 pve03-nj corosync[2284]: [TOTEM ] Token has not been received in 4200 ms
Nov 23 02:32:22 pve03-nj corosync[2284]: [QUORUM] Sync members[5]: 1 2 3 4 6
Nov 23 02:32:22 pve03-nj corosync[2284]: [QUORUM] Sync left[1]: 5
Nov 23 02:32:22 pve03-nj corosync[2284]: [TOTEM ] A new membership (1.143) was formed. Members left: 5
Nov 23 02:32:22 pve03-nj corosync[2284]: [TOTEM ] Failed to receive the leave message. failed: 5
Nov 23 02:32:22 pve03-nj corosync[2284]: [QUORUM] Members[5]: 1 2 3 4 6
Nov 23 02:32:22 pve03-nj corosync[2284]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 23 07:21:49 pve03-nj corosync[2284]: [TOTEM ] Retransmit List: 33cda
Nov 23 07:32:19 pve03-nj corosync[2284]: [KNET ] link: host: 4 link: 0 is down
Nov 23 07:32:19 pve03-nj corosync[2284]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Nov 23 07:32:19 pve03-nj corosync[2284]: [KNET ] host: host: 4 has no active links
Nov 23 07:32:19 pve03-nj corosync[2284]: [KNET ] link: Resetting MTU for link 0 because host 4 joined
Nov 23 07:32:19 pve03-nj corosync[2284]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Nov 23 07:32:20 pve03-nj corosync[2284]: [KNET ] pmtud: Global data MTU changed to: 1397
Nov 23 08:28:05 pve03-nj corosync[2284]: [TOTEM ] Retransmit List: 3faa5
Nov 23 08:28:36 pve03-nj corosync[2284]: [TOTEM ] Retransmit List: 3fc35
Nov 23 08:35:16 pve03-nj corosync[2284]: [TOTEM ] Retransmit List: 40f12
Nov 23 08:53:00 pve03-nj corosync[2284]: [TOTEM ] Retransmit List: 441ff
Nov 23 09:29:01 pve03-nj corosync[2284]: [TOTEM ] Retransmit List: 4a8cc
Nov 23 10:29:55 pve03-nj corosync[2284]: [KNET ] link: host: 6 link: 0 is down
Nov 23 10:29:55 pve03-nj corosync[2284]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Nov 23 10:29:55 pve03-nj corosync[2284]: [KNET ] host: host: 6 has no active links
Nov 23 10:29:56 pve03-nj corosync[2284]: [KNET ] link: Resetting MTU for link 0 because host 6 joined
Nov 23 10:29:56 pve03-nj corosync[2284]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Nov 23 10:29:56 pve03-nj corosync[2284]: [KNET ] pmtud: Global data MTU changed to: 1397
Nov 23 10:54:47 pve03-nj corosync[2284]: [KNET ] link: host: 6 link: 0 is down
Nov 23 10:54:47 pve03-nj corosync[2284]: [KNET ] link: host: 4 link: 0 is down
Nov 23 10:54:47 pve03-nj corosync[2284]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Nov 23 10:54:47 pve03-nj corosync[2284]: [KNET ] host: host: 6 has no active links
Nov 23 10:54:47 pve03-nj corosync[2284]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Nov 23 10:54:47 pve03-nj corosync[2284]: [KNET ] host: host: 4 has no active links
Nov 23 10:54:50 pve03-nj corosync[2284]: [KNET ] rx: host: 6 link: 0 is up
Nov 23 10:54:50 pve03-nj corosync[2284]: [KNET ] link: Resetting MTU for link 0 because host 6 joined
Nov 23 10:54:50 pve03-nj corosync[2284]: [KNET ] rx: host: 4 link: 0 is up
Nov 23 10:54:50 pve03-nj corosync[2284]: [KNET ] link: Resetting MTU for link 0 because host 4 joined
Nov 23 10:54:50 pve03-nj corosync[2284]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Nov 23 10:54:50 pve03-nj corosync[2284]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Nov 23 10:54:50 pve03-nj corosync[2284]: [KNET ] pmtud: Global data MTU changed to: 1397
Nov 23 10:58:44 pve03-nj corosync[2284]: [TOTEM ] Retransmit List: 5a8ce
Nov 23 11:26:19 pve03-nj corosync[2284]: [TOTEM ] Retransmit List: 5f813
Nov 23 11:50:36 pve03-nj corosync[2284]: [TOTEM ] Retransmit List: 63d6b
Nov 23 11:50:37 pve03-nj corosync[2284]: [TOTEM ] Retransmit List: 63d7b
Nov 23 12:44:32 pve03-nj corosync[2284]: [KNET ] link: host: 1 link: 0 is down
Nov 23 12:44:32 pve03-nj corosync[2284]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Nov 23 12:44:32 pve03-nj corosync[2284]: [KNET ] host: host: 1 has no active links
Nov 23 12:44:33 pve03-nj corosync[2284]: [KNET ] link: Resetting MTU for link 0 because host 1 joined
Nov 23 12:44:33 pve03-nj corosync[2284]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Nov 23 12:44:33 pve03-nj corosync[2284]: [KNET ] pmtud: Global data MTU changed to: 1397
Nov 23 14:11:11 pve03-nj corosync[2284]: [KNET ] link: host: 2 link: 0 is down
Nov 23 14:11:11 pve03-nj corosync[2284]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 23 14:11:11 pve03-nj corosync[2284]: [KNET ] host: host: 2 has no active links
Nov 23 14:11:12 pve03-nj corosync[2284]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Nov 23 14:11:12 pve03-nj corosync[2284]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 23 14:11:12 pve03-nj corosync[2284]: [KNET ] pmtud: Global data MTU changed to: 1397
Nov 23 15:25:17 pve03-nj corosync[2284]: [KNET ] rx: host: 5 link: 0 is up
Nov 23 15:25:17 pve03-nj corosync[2284]: [KNET ] link: Resetting MTU for link 0 because host 5 joined
Nov 23 15:25:17 pve03-nj corosync[2284]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Nov 23 15:25:17 pve03-nj corosync[2284]: [KNET ] pmtud: Global data MTU changed to: 1397
Nov 23 15:25:19 pve03-nj corosync[2284]: [QUORUM] Sync members[6]: 1 2 3 4 5 6
Nov 23 15:25:19 pve03-nj corosync[2284]: [QUORUM] Sync joined[1]: 5
Nov 23 15:25:19 pve03-nj corosync[2284]: [TOTEM ] A new membership (1.148) was formed. Members joined: 5
Nov 23 15:25:19 pve03-nj corosync[2284]: [QUORUM] Members[6]: 1 2 3 4 5 6
Nov 23 15:25:19 pve03-nj corosync[2284]: [MAIN ] Completed service synchronization, ready to provide service.
root@pve03-nj:~#


Starting at 15:25, when node 5 rejoined the cluster, all nodes stopped reporting "link is down" or retransmit entries. It is now 20:00, so that is about 4.5 hours without any. Is there a correlation between one node being down (1 of 6) and corosync reporting the links of random other nodes going down?

Thank you
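
The timeout figures in the pve01-nj log (4200 ms, 5600 ms and 6720 ms) look consistent with stock corosync 3 totem settings on a six-node cluster. The sketch below redoes that arithmetic, assuming the defaults described in corosync.conf(5): token = 3000 ms, token_coefficient = 650 ms, consensus = 1.2 x token, and a warning at 75% of the token timeout. The thread does not show the actual corosync.conf, so these values and the function name `totem_timeouts` are assumptions for illustration.

```python
def totem_timeouts(nodes, token_ms=3000, token_coefficient_ms=650,
                   consensus_factor=1.2, token_warning_pct=75):
    """Return (token timeout, consensus timeout, warning threshold) in ms.

    All defaults are assumed stock corosync 3 values, not read from
    the poster's corosync.conf.
    """
    # Each node beyond the second adds token_coefficient to the token timeout.
    extra_nodes = max(nodes - 2, 0)
    token = token_ms + extra_nodes * token_coefficient_ms
    # Consensus defaults to 1.2 times the effective token timeout.
    consensus = round(token * consensus_factor)
    # "Token has not been received" is logged at this share of the timeout.
    warning = token * token_warning_pct // 100
    return token, consensus, warning


print(totem_timeouts(6))  # (5600, 6720, 4200)
```

If token or consensus were overridden in corosync.conf, the logged values would differ from this calculation, so a match only suggests that the defaults are in use.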
 
brucexx said: "I should mention that for cluster I have been using bonded LACP (802.3ad) with hash 3+4."
That's usually not a good idea, as you will have both the bond and corosync trying to gauge the health of the link(s) and doing failover, which prolongs recovery times, often with negative consequences.

Without logs from all nodes it will be a bit hard to tell what's going on. But since the times are not aligned across nodes, this sounds more like your network sometimes not letting UDP packets through properly or in time?
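
One way to check whether link-down times line up across nodes is to collect the `journalctl -u corosync` output from every node and group the KNET "is down" lines by timestamp and peer. The following is a minimal sketch of that idea, not an official tool: the function names are made up for illustration, the sample lines are copied from the two logs posted above, and matching on the exact second is an assumption (a tolerance window of a few seconds may be more realistic).

```python
import re
from collections import defaultdict

# Matches lines such as:
# Nov 23 02:32:12 pve01-nj corosync[2325]: [KNET ] link: host: 5 link: 0 is down
LINK_DOWN = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d\d:\d\d:\d\d) (?P<node>\S+) corosync\[\d+\]: +"
    r"\[KNET *\] link: host: (?P<peer>\d+) link: (?P<link>\d+) is down"
)


def link_down_events(lines):
    """Yield (timestamp, reporting node, peer host id) per link-down line."""
    for line in lines:
        m = LINK_DOWN.match(line)
        if m:
            yield m["ts"], m["node"], int(m["peer"])


def reporters_by_event(lines):
    """Map (timestamp, peer) to the set of nodes that reported it."""
    seen = defaultdict(set)
    for ts, node, peer in link_down_events(lines):
        seen[(ts, peer)].add(node)
    return dict(seen)


sample = [
    "Nov 23 02:32:12 pve01-nj corosync[2325]: [KNET ] link: host: 5 link: 0 is down",
    "Nov 23 02:32:12 pve03-nj corosync[2284]: [KNET ] link: host: 5 link: 0 is down",
    "Nov 23 03:35:43 pve01-nj corosync[2325]: [KNET ] link: host: 2 link: 0 is down",
    "Nov 23 07:32:19 pve03-nj corosync[2284]: [KNET ] link: host: 4 link: 0 is down",
]

for (ts, peer), nodes in sorted(reporters_by_event(sample).items()):
    print(ts, "host", peer, "reported down by", sorted(nodes))
```

In the sample, only the node 5 outage at 02:32:12 is reported by both nodes; the other events are each seen by a single node, which is the kind of misalignment described above.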
 
I understand that the link-down times are not aligned across nodes, but the time frames before the node was fenced and after it rejoined are telling. From February, when we had the ECC RAM issue, up to the fencing, I do not see any corosync "link down" entries, and there have been none since the node rejoined yesterday, 18 hours ago.

I will try to do some testing on a "spare" 3 node cluster if I find time and will keep monitoring journal for any corosync issues.

Thank you
 
Update:

I see some retransmits, sometimes 1 to 3 a day on one or two nodes and nothing on the others. Sometimes it happens daily, and sometimes there are no corosync log entries for 5 months.

I actually have another cluster on that subnet (different cluster name), same version, with 5 nodes and LACP configured exactly the same way for the cluster network, and I see a combined 3 corosync retransmit messages across the 5 nodes since December 15, 2024.
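
To compare retransmit frequency between nodes or clusters, the "Retransmit List" lines can be tallied per node and day. This is a rough sketch under the same assumptions as the journal format shown earlier in the thread; `retransmits_per_day` is a hypothetical helper name, and the sample lines are taken from the pve03-nj log above.

```python
import re
from collections import Counter

# Matches lines such as:
# Nov 23 08:28:05 pve03-nj corosync[2284]: [TOTEM ] Retransmit List: 3faa5
RETRANSMIT = re.compile(
    r"^(?P<day>\w{3} +\d+) \d\d:\d\d:\d\d (?P<node>\S+) corosync\[\d+\]: +"
    r"\[TOTEM *\] Retransmit List:"
)


def retransmits_per_day(lines):
    """Count TOTEM retransmit messages per (node, day)."""
    counts = Counter()
    for line in lines:
        m = RETRANSMIT.match(line)
        if m:
            counts[(m["node"], m["day"])] += 1
    return counts


sample = [
    "Jul 15 22:30:27 pve03-nj corosync[2284]: [TOTEM ] Retransmit List: 9f8da42",
    "Jul 15 22:30:27 pve03-nj corosync[2284]: [TOTEM ] Retransmit List: 9f8da43",
    "Nov 23 08:28:05 pve03-nj corosync[2284]: [TOTEM ] Retransmit List: 3faa5",
    "Nov 23 08:28:36 pve03-nj corosync[2284]: [TOTEM ] Retransmit List: 3fc35",
    "Nov 23 08:35:16 pve03-nj corosync[2284]: [TOTEM ] Retransmit List: 40f12",
]

for (node, day), n in sorted(retransmits_per_day(sample).items()):
    print(node, day, n)
```

The journal timestamps shown here carry no year, so logs spanning more than twelve months would need the year added before days can be compared reliably.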