Node 5 in our 6-node cluster was fenced due to excessive RAM ECC errors. HA worked great and all VMs started on other nodes. The cluster has run with no corosync issues for the past year, since it was put into production (we had ECC errors in February, but the RAM was replaced). I should mention that for the cluster I have been using bonded LACP (802.3ad) with the layer3+4 hash policy.
PVE version:
proxmox-ve: 8.3.0 (running kernel: 6.8.12-4-pve)
pve-manager: 8.3.1 (running version: 8.3.1/fb48e850ef9dde27)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.12-4
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
proxmox-kernel-6.8.12-2-pve-signed: 6.8.12-2
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
ceph-fuse: 18.2.4-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.4
libpve-access-control: 8.2.0
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.10
libpve-cluster-perl: 8.0.10
libpve-common-perl: 8.2.9
libpve-guest-common-perl: 5.1.6
libpve-http-server-perl: 5.1.2
libpve-network-perl: 0.10.0
libpve-rs-perl: 0.9.0
libpve-storage-perl: 8.3.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.5.0-1
proxmox-backup-client: 3.3.2-1
proxmox-backup-file-restore: 3.3.2-2
proxmox-firewall: 0.6.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.3.1
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.3
pve-cluster: 8.0.10
pve-container: 5.2.2
pve-docs: 8.3.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.1.0
pve-firmware: 3.14-1
pve-ha-manager: 4.0.6
pve-i18n: 3.3.2
pve-qemu-kvm: 9.0.2-4
pve-xtermjs: 5.3.0-3
qemu-server: 8.3.2
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.6-pve1
While troubleshooting, I found corosync journal entries on almost all of the remaining nodes showing them randomly losing connection to other nodes. Here are examples from nodes 1 and 3:
root@pve01-nj:~# journalctl -u corosync -n 70
Apr 06 22:02:10 pve01-nj corosync[2325]: [TOTEM ] Retransmit List: 2d5fca5
Apr 06 22:02:10 pve01-nj corosync[2325]: [TOTEM ] Retransmit List: 2d5fca6
Oct 29 11:55:16 pve01-nj corosync[2325]: [TOTEM ] Retransmit List: 119219ee
Nov 23 02:32:12 pve01-nj corosync[2325]: [KNET ] link: host: 5 link: 0 is down
Nov 23 02:32:12 pve01-nj corosync[2325]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Nov 23 02:32:12 pve01-nj corosync[2325]: [KNET ] host: host: 5 has no active links
Nov 23 02:32:14 pve01-nj corosync[2325]: [TOTEM ] Token has not been received in 4200 ms
Nov 23 02:32:15 pve01-nj corosync[2325]: [TOTEM ] A processor failed, forming new configuration: token timed out (5600ms), waiting 6720ms for consensus.
Nov 23 02:32:22 pve01-nj corosync[2325]: [QUORUM] Sync members[5]: 1 2 3 4 6
Nov 23 02:32:22 pve01-nj corosync[2325]: [QUORUM] Sync left[1]: 5
Nov 23 02:32:22 pve01-nj corosync[2325]: [TOTEM ] A new membership (1.143) was formed. Members left: 5
Nov 23 02:32:22 pve01-nj corosync[2325]: [TOTEM ] Failed to receive the leave message. failed: 5
Nov 23 02:32:22 pve01-nj corosync[2325]: [QUORUM] Members[5]: 1 2 3 4 6
Nov 23 02:32:22 pve01-nj corosync[2325]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 23 03:35:43 pve01-nj corosync[2325]: [KNET ] link: host: 2 link: 0 is down
Nov 23 03:35:43 pve01-nj corosync[2325]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 23 03:35:43 pve01-nj corosync[2325]: [KNET ] host: host: 2 has no active links
Nov 23 03:35:46 pve01-nj corosync[2325]: [KNET ] rx: host: 2 link: 0 is up
Nov 23 03:35:46 pve01-nj corosync[2325]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Nov 23 03:35:46 pve01-nj corosync[2325]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 23 03:35:46 pve01-nj corosync[2325]: [KNET ] pmtud: Global data MTU changed to: 1397
Nov 23 05:53:29 pve01-nj corosync[2325]: [KNET ] link: host: 3 link: 0 is down
Nov 23 05:53:29 pve01-nj corosync[2325]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Nov 23 05:53:29 pve01-nj corosync[2325]: [KNET ] host: host: 3 has no active links
Nov 23 05:53:29 pve01-nj corosync[2325]: [KNET ] link: Resetting MTU for link 0 because host 3 joined
Nov 23 05:53:29 pve01-nj corosync[2325]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Nov 23 05:53:29 pve01-nj corosync[2325]: [KNET ] pmtud: Global data MTU changed to: 1397
Nov 23 07:21:49 pve01-nj corosync[2325]: [TOTEM ] Retransmit List: 33cda
Nov 23 07:23:17 pve01-nj corosync[2325]: [KNET ] link: host: 2 link: 0 is down
Nov 23 07:23:17 pve01-nj corosync[2325]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 23 07:23:17 pve01-nj corosync[2325]: [KNET ] host: host: 2 has no active links
Nov 23 07:23:18 pve01-nj corosync[2325]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Nov 23 07:23:18 pve01-nj corosync[2325]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 23 07:23:18 pve01-nj corosync[2325]: [KNET ] pmtud: Global data MTU changed to: 1397
Nov 23 07:42:00 pve01-nj corosync[2325]: [KNET ] link: host: 2 link: 0 is down
Nov 23 07:42:00 pve01-nj corosync[2325]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 23 07:42:00 pve01-nj corosync[2325]: [KNET ] host: host: 2 has no active links
Nov 23 07:42:01 pve01-nj corosync[2325]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Nov 23 07:42:01 pve01-nj corosync[2325]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 23 07:42:01 pve01-nj corosync[2325]: [KNET ] pmtud: Global data MTU changed to: 1397
Nov 23 07:44:18 pve01-nj corosync[2325]: [KNET ] link: host: 6 link: 0 is down
Nov 23 07:44:18 pve01-nj corosync[2325]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Nov 23 07:44:18 pve01-nj corosync[2325]: [KNET ] host: host: 6 has no active links
Nov 23 07:44:18 pve01-nj corosync[2325]: [KNET ] link: Resetting MTU for link 0 because host 6 joined
Nov 23 07:44:18 pve01-nj corosync[2325]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Nov 23 07:44:18 pve01-nj corosync[2325]: [KNET ] pmtud: Global data MTU changed to: 1397
Nov 23 08:35:16 pve01-nj corosync[2325]: [TOTEM ] Retransmit List: 40f12
Nov 23 09:05:14 pve01-nj corosync[2325]: [KNET ] link: host: 2 link: 0 is down
Nov 23 09:05:14 pve01-nj corosync[2325]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 23 09:05:14 pve01-nj corosync[2325]: [KNET ] host: host: 2 has no active links
Nov 23 09:05:14 pve01-nj corosync[2325]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Nov 23 09:05:14 pve01-nj corosync[2325]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 23 09:05:14 pve01-nj corosync[2325]: [KNET ] pmtud: Global data MTU changed to: 1397
Nov 23 10:29:55 pve01-nj corosync[2325]: [TOTEM ] Retransmit List: 556c0
Nov 23 11:26:19 pve01-nj corosync[2325]: [TOTEM ] Retransmit List: 5f813
Nov 23 13:28:08 pve01-nj corosync[2325]: [KNET ] link: host: 2 link: 0 is down
Nov 23 13:28:08 pve01-nj corosync[2325]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 23 13:28:08 pve01-nj corosync[2325]: [KNET ] host: host: 2 has no active links
Nov 23 13:28:09 pve01-nj corosync[2325]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Nov 23 13:28:09 pve01-nj corosync[2325]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 23 13:28:09 pve01-nj corosync[2325]: [KNET ] pmtud: Global data MTU changed to: 1397
Nov 23 15:25:18 pve01-nj corosync[2325]: [KNET ] rx: host: 5 link: 0 is up
Nov 23 15:25:18 pve01-nj corosync[2325]: [KNET ] link: Resetting MTU for link 0 because host 5 joined
Nov 23 15:25:18 pve01-nj corosync[2325]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Nov 23 15:25:18 pve01-nj corosync[2325]: [KNET ] pmtud: Global data MTU changed to: 1397
Nov 23 15:25:19 pve01-nj corosync[2325]: [QUORUM] Sync members[6]: 1 2 3 4 5 6
Nov 23 15:25:19 pve01-nj corosync[2325]: [QUORUM] Sync joined[1]: 5
Nov 23 15:25:19 pve01-nj corosync[2325]: [TOTEM ] A new membership (1.148) was formed. Members joined: 5
Nov 23 15:25:19 pve01-nj corosync[2325]: [QUORUM] Members[6]: 1 2 3 4 5 6
Nov 23 15:25:19 pve01-nj corosync[2325]: [MAIN ] Completed service synchronization, ready to provide service.
root@pve01-nj:~#
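The timing values in the node 1 log above (4200 ms, 5600 ms, 6720 ms) look consistent with how corosync.conf(5) describes scaling the token timeout by cluster size. The sketch below is only an illustration: the base token of 3000 ms, the coefficient of 650 ms, the 1.2x consensus factor and the 75% warning threshold are assumed defaults, not values read from this cluster's corosync.conf, and the function names are made up for the example.

```python
# Assumed corosync defaults; verify against the totem section of corosync.conf.
TOKEN_MS = 3000              # assumed base "token" value
TOKEN_COEFFICIENT_MS = 650   # assumed "token_coefficient" value


def effective_token_timeout(nodes, token=TOKEN_MS, coefficient=TOKEN_COEFFICIENT_MS):
    """Token timeout in ms, scaled by node count once there are 3 or more nodes."""
    if nodes < 3:
        return token
    return token + (nodes - 2) * coefficient


def consensus_timeout(token_timeout):
    """Consensus wait in ms, assumed to default to 1.2 times the token timeout."""
    return token_timeout * 12 // 10


def token_warning_threshold(token_timeout):
    """Point in ms at which the 'Token has not been received' warning is assumed to fire (75%)."""
    return token_timeout * 75 // 100


timeout = effective_token_timeout(6)
print(timeout, consensus_timeout(timeout), token_warning_threshold(timeout))
```

Under those assumptions a 6-node cluster gives 5600 ms, 6720 ms and 4200 ms, matching the log lines at 02:32:14 and 02:32:15.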
root@pve03-nj:~# journalctl -u corosync -n 70
Jul 15 22:30:27 pve03-nj corosync[2284]: [TOTEM ] Retransmit List: 9f8da42
Jul 15 22:30:27 pve03-nj corosync[2284]: [TOTEM ] Retransmit List: 9f8da43
Oct 01 11:26:35 pve03-nj corosync[2284]: [TOTEM ] Retransmit List: f8c1b0d
Oct 29 03:08:13 pve03-nj corosync[2284]: [TOTEM ] Retransmit List: 118b5171
Nov 23 02:32:12 pve03-nj corosync[2284]: [KNET ] link: host: 5 link: 0 is down
Nov 23 02:32:12 pve03-nj corosync[2284]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Nov 23 02:32:12 pve03-nj corosync[2284]: [KNET ] host: host: 5 has no active links
Nov 23 02:32:14 pve03-nj corosync[2284]: [TOTEM ] Token has not been received in 4200 ms
Nov 23 02:32:22 pve03-nj corosync[2284]: [QUORUM] Sync members[5]: 1 2 3 4 6
Nov 23 02:32:22 pve03-nj corosync[2284]: [QUORUM] Sync left[1]: 5
Nov 23 02:32:22 pve03-nj corosync[2284]: [TOTEM ] A new membership (1.143) was formed. Members left: 5
Nov 23 02:32:22 pve03-nj corosync[2284]: [TOTEM ] Failed to receive the leave message. failed: 5
Nov 23 02:32:22 pve03-nj corosync[2284]: [QUORUM] Members[5]: 1 2 3 4 6
Nov 23 02:32:22 pve03-nj corosync[2284]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 23 07:21:49 pve03-nj corosync[2284]: [TOTEM ] Retransmit List: 33cda
Nov 23 07:32:19 pve03-nj corosync[2284]: [KNET ] link: host: 4 link: 0 is down
Nov 23 07:32:19 pve03-nj corosync[2284]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Nov 23 07:32:19 pve03-nj corosync[2284]: [KNET ] host: host: 4 has no active links
Nov 23 07:32:19 pve03-nj corosync[2284]: [KNET ] link: Resetting MTU for link 0 because host 4 joined
Nov 23 07:32:19 pve03-nj corosync[2284]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Nov 23 07:32:20 pve03-nj corosync[2284]: [KNET ] pmtud: Global data MTU changed to: 1397
Nov 23 08:28:05 pve03-nj corosync[2284]: [TOTEM ] Retransmit List: 3faa5
Nov 23 08:28:36 pve03-nj corosync[2284]: [TOTEM ] Retransmit List: 3fc35
Nov 23 08:35:16 pve03-nj corosync[2284]: [TOTEM ] Retransmit List: 40f12
Nov 23 08:53:00 pve03-nj corosync[2284]: [TOTEM ] Retransmit List: 441ff
Nov 23 09:29:01 pve03-nj corosync[2284]: [TOTEM ] Retransmit List: 4a8cc
Nov 23 10:29:55 pve03-nj corosync[2284]: [KNET ] link: host: 6 link: 0 is down
Nov 23 10:29:55 pve03-nj corosync[2284]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Nov 23 10:29:55 pve03-nj corosync[2284]: [KNET ] host: host: 6 has no active links
Nov 23 10:29:56 pve03-nj corosync[2284]: [KNET ] link: Resetting MTU for link 0 because host 6 joined
Nov 23 10:29:56 pve03-nj corosync[2284]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Nov 23 10:29:56 pve03-nj corosync[2284]: [KNET ] pmtud: Global data MTU changed to: 1397
Nov 23 10:54:47 pve03-nj corosync[2284]: [KNET ] link: host: 6 link: 0 is down
Nov 23 10:54:47 pve03-nj corosync[2284]: [KNET ] link: host: 4 link: 0 is down
Nov 23 10:54:47 pve03-nj corosync[2284]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Nov 23 10:54:47 pve03-nj corosync[2284]: [KNET ] host: host: 6 has no active links
Nov 23 10:54:47 pve03-nj corosync[2284]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Nov 23 10:54:47 pve03-nj corosync[2284]: [KNET ] host: host: 4 has no active links
Nov 23 10:54:50 pve03-nj corosync[2284]: [KNET ] rx: host: 6 link: 0 is up
Nov 23 10:54:50 pve03-nj corosync[2284]: [KNET ] link: Resetting MTU for link 0 because host 6 joined
Nov 23 10:54:50 pve03-nj corosync[2284]: [KNET ] rx: host: 4 link: 0 is up
Nov 23 10:54:50 pve03-nj corosync[2284]: [KNET ] link: Resetting MTU for link 0 because host 4 joined
Nov 23 10:54:50 pve03-nj corosync[2284]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Nov 23 10:54:50 pve03-nj corosync[2284]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Nov 23 10:54:50 pve03-nj corosync[2284]: [KNET ] pmtud: Global data MTU changed to: 1397
Nov 23 10:58:44 pve03-nj corosync[2284]: [TOTEM ] Retransmit List: 5a8ce
Nov 23 11:26:19 pve03-nj corosync[2284]: [TOTEM ] Retransmit List: 5f813
Nov 23 11:50:36 pve03-nj corosync[2284]: [TOTEM ] Retransmit List: 63d6b
Nov 23 11:50:37 pve03-nj corosync[2284]: [TOTEM ] Retransmit List: 63d7b
Nov 23 12:44:32 pve03-nj corosync[2284]: [KNET ] link: host: 1 link: 0 is down
Nov 23 12:44:32 pve03-nj corosync[2284]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Nov 23 12:44:32 pve03-nj corosync[2284]: [KNET ] host: host: 1 has no active links
Nov 23 12:44:33 pve03-nj corosync[2284]: [KNET ] link: Resetting MTU for link 0 because host 1 joined
Nov 23 12:44:33 pve03-nj corosync[2284]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Nov 23 12:44:33 pve03-nj corosync[2284]: [KNET ] pmtud: Global data MTU changed to: 1397
Nov 23 14:11:11 pve03-nj corosync[2284]: [KNET ] link: host: 2 link: 0 is down
Nov 23 14:11:11 pve03-nj corosync[2284]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 23 14:11:11 pve03-nj corosync[2284]: [KNET ] host: host: 2 has no active links
Nov 23 14:11:12 pve03-nj corosync[2284]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Nov 23 14:11:12 pve03-nj corosync[2284]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 23 14:11:12 pve03-nj corosync[2284]: [KNET ] pmtud: Global data MTU changed to: 1397
Nov 23 15:25:17 pve03-nj corosync[2284]: [KNET ] rx: host: 5 link: 0 is up
Nov 23 15:25:17 pve03-nj corosync[2284]: [KNET ] link: Resetting MTU for link 0 because host 5 joined
Nov 23 15:25:17 pve03-nj corosync[2284]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Nov 23 15:25:17 pve03-nj corosync[2284]: [KNET ] pmtud: Global data MTU changed to: 1397
Nov 23 15:25:19 pve03-nj corosync[2284]: [QUORUM] Sync members[6]: 1 2 3 4 5 6
Nov 23 15:25:19 pve03-nj corosync[2284]: [QUORUM] Sync joined[1]: 5
Nov 23 15:25:19 pve03-nj corosync[2284]: [TOTEM ] A new membership (1.148) was formed. Members joined: 5
Nov 23 15:25:19 pve03-nj corosync[2284]: [QUORUM] Members[6]: 1 2 3 4 5 6
Nov 23 15:25:19 pve03-nj corosync[2284]: [MAIN ] Completed service synchronization, ready to provide service.
root@pve03-nj:~#
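To compare nodes more easily, the "link ... is down" events can be tallied per peer host. This is a minimal sketch that assumes the journal lines keep the exact format shown above; the regular expression and the `count_link_down` helper are hypothetical names for this example, not part of any Proxmox or corosync tooling.

```python
import re
from collections import Counter

# Matches e.g. "[KNET ] link: host: 2 link: 0 is down" and captures host and link IDs.
LINK_DOWN = re.compile(r"\[KNET\s*\] link: host: (\d+) link: (\d+) is down")


def count_link_down(lines):
    """Return a Counter mapping peer host ID to the number of link-down events."""
    counts = Counter()
    for line in lines:
        match = LINK_DOWN.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# A few lines copied from the node 1 log above.
sample = [
    "Nov 23 02:32:12 pve01-nj corosync[2325]: [KNET ] link: host: 5 link: 0 is down",
    "Nov 23 03:35:43 pve01-nj corosync[2325]: [KNET ] link: host: 2 link: 0 is down",
    "Nov 23 03:35:46 pve01-nj corosync[2325]: [KNET ] rx: host: 2 link: 0 is up",
    "Nov 23 07:23:17 pve01-nj corosync[2325]: [KNET ] link: host: 2 link: 0 is down",
]

for host, events in sorted(count_link_down(sample).items()):
    print(f"host {host}: {events} link-down event(s)")
```

Saving the corosync journal from each node to a file and passing its lines to `count_link_down` would show whether the drops cluster around particular peers or are spread evenly.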
Since 15:25, when node 5 rejoined the cluster, all nodes have stopped reporting "link is down" or retransmit messages. It is now 20:00, so that is about 4.5 hours without any. Is there a correlation between one node (1 of 6) being down and corosync reporting links to random other nodes going down?
Thank you