Hello,
I am experiencing an issue with my cluster and would appreciate your advice.
Problem description:
I am constantly seeing Corosync link up/down messages (flapping), even though there are no visible link issues. Linux does not report any port or link failures, and switch monitoring shows no port problems either.
07:59:29 node1 corosync[5629]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
07:59:29 node1 corosync[5629]: [KNET ] pmtud: Global data MTU changed to: 1397
08:21:46 node1 corosync[5629]: [KNET ] link: host: 1 link: 0 is down
08:21:46 node1 corosync[5629]: [KNET ] host: host: 1 (passive) best link: 1 (pri: 1)
08:21:47 node1 corosync[5629]: [KNET ] rx: host: 1 link: 0 is up
08:21:47 node1 corosync[5629]: [KNET ] link: Resetting MTU for link 0 because host 1 joined
08:21:47 node1 corosync[5629]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
08:21:47 node1 corosync[5629]: [KNET ] pmtud: Global data MTU changed to: 1397
09:00:27 node1 corosync[5629]: [KNET ] link: host: 3 link: 0 is down
09:00:27 node1 corosync[5629]: [KNET ] host: host: 3 (passive) best link: 1 (pri: 1)
09:00:29 node1 corosync[5629]: [KNET ] rx: host: 3 link: 0 is up
09:00:29 node1 corosync[5629]: [KNET ] link: Resetting MTU for link 0 because host 3 joined
09:00:29 node1 corosync[5629]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
09:00:29 node1 corosync[5629]: [KNET ] pmtud: Global data MTU changed to: 1397
09:28:06 node1 corosync[5629]: [KNET ] link: host: 3 link: 0 is down
09:28:06 node1 corosync[5629]: [KNET ] host: host: 3 (passive) best link: 1 (pri: 1)
09:28:08 node1 corosync[5629]: [KNET ] rx: host: 3 link: 0 is up
09:28:08 node1 corosync[5629]: [KNET ] link: Resetting MTU for link 0 because host 3 joined
09:28:08 node1 corosync[5629]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
09:28:08 node1 corosync[5629]: [KNET ] pmtud: Global data MTU changed to: 1397
09:32:14 node1 corosync[5629]: [KNET ] link: host: 3 link: 0 is down
09:32:14 node1 corosync[5629]: [KNET ] host: host: 3 (passive) best link: 1 (pri: 1)
09:32:15 node1 corosync[5629]: [KNET ] rx: host: 3 link: 0 is up
09:32:15 node1 corosync[5629]: [KNET ] link: Resetting MTU for link 0 because host 3 joined
09:32:15 node1 corosync[5629]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
09:32:15 node1 corosync[5629]: [KNET ] pmtud: Global data MTU changed to: 1397
09:54:57 node1 corosync[5629]: [KNET ] link: host: 3 link: 1 is down
09:54:57 node1 corosync[5629]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
09:54:59 node1 corosync[5629]: [KNET ] rx: host: 3 link: 1 is up
09:54:59 node1 corosync[5629]: [KNET ] link: Resetting MTU for link 1 because host 3 joined
09:54:59 node1 corosync[5629]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
09:54:59 node1 corosync[5629]: [KNET ] pmtud: Global data MTU changed to: 1397
05:59:00 node2 corosync[1946854]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
05:59:00 node2 corosync[1946854]: [KNET ] pmtud: Global data MTU changed to: 1397
06:05:18 node2 corosync[1946854]: [KNET ] link: host: 2 link: 0 is down
06:05:18 node2 corosync[1946854]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
06:05:20 node2 corosync[1946854]: [KNET ] rx: host: 2 link: 0 is up
06:05:20 node2 corosync[1946854]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
06:05:20 node2 corosync[1946854]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
06:05:20 node2 corosync[1946854]: [KNET ] pmtud: Global data MTU changed to: 1397
08:02:33 node2 corosync[1946854]: [KNET ] link: host: 2 link: 0 is down
08:02:33 node2 corosync[1946854]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
08:02:35 node2 corosync[1946854]: [KNET ] rx: host: 2 link: 0 is up
08:02:35 node2 corosync[1946854]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
08:02:35 node2 corosync[1946854]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
08:02:36 node2 corosync[1946854]: [KNET ] pmtud: Global data MTU changed to: 1397
08:06:39 node2 corosync[1946854]: [KNET ] link: host: 2 link: 0 is down
08:06:39 node2 corosync[1946854]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
08:06:41 node2 corosync[1946854]: [KNET ] rx: host: 2 link: 0 is up
08:06:41 node2 corosync[1946854]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
08:06:41 node2 corosync[1946854]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
08:06:42 node2 corosync[1946854]: [KNET ] pmtud: Global data MTU changed to: 1397
08:22:42 node2 corosync[1946854]: [KNET ] link: host: 2 link: 0 is down
08:22:42 node2 corosync[1946854]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
08:22:44 node2 corosync[1946854]: [KNET ] rx: host: 2 link: 0 is up
08:22:44 node2 corosync[1946854]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
08:22:44 node2 corosync[1946854]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
08:22:44 node2 corosync[1946854]: [KNET ] pmtud: Global data MTU changed to: 1397
08:42:38 node2 corosync[1946854]: [KNET ] link: host: 2 link: 0 is down
08:42:38 node2 corosync[1946854]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
08:42:40 node2 corosync[1946854]: [KNET ] rx: host: 2 link: 0 is up
08:42:40 node2 corosync[1946854]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
08:42:40 node2 corosync[1946854]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
08:42:40 node2 corosync[1946854]: [KNET ] pmtud: Global data MTU changed to: 1397
09:28:02 node2 corosync[1946854]: [KNET ] link: host: 2 link: 0 is down
09:28:02 node2 corosync[1946854]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
09:28:04 node2 corosync[1946854]: [KNET ] rx: host: 2 link: 0 is up
09:28:04 node2 corosync[1946854]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
09:28:04 node2 corosync[1946854]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
09:28:04 node2 corosync[1946854]: [KNET ] pmtud: Global data MTU changed to: 1397
09:31:25 node2 corosync[1946854]: [KNET ] link: host: 2 link: 0 is down
09:31:25 node2 corosync[1946854]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
09:31:27 node2 corosync[1946854]: [KNET ] rx: host: 2 link: 0 is up
09:31:27 node2 corosync[1946854]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
09:31:27 node2 corosync[1946854]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
09:31:27 node2 corosync[1946854]: [KNET ] pmtud: Global data MTU changed to: 1397
07:57:09 node3 corosync[1799824]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
07:57:11 node3 corosync[1799824]: [KNET ] rx: host: 2 link: 1 is up
07:57:11 node3 corosync[1799824]: [KNET ] link: Resetting MTU for link 1 because host 2 joined
07:57:11 node3 corosync[1799824]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
07:57:11 node3 corosync[1799824]: [KNET ] pmtud: Global data MTU changed to: 1397
07:59:05 node3 corosync[1799824]: [KNET ] link: host: 2 link: 1 is down
07:59:05 node3 corosync[1799824]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
07:59:07 node3 corosync[1799824]: [KNET ] rx: host: 2 link: 1 is up
07:59:07 node3 corosync[1799824]: [KNET ] link: Resetting MTU for link 1 because host 2 joined
07:59:08 node3 corosync[1799824]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
07:59:08 node3 corosync[1799824]: [KNET ] pmtud: Global data MTU changed to: 1397
08:04:32 node3 corosync[1799824]: [KNET ] link: host: 2 link: 1 is down
08:04:32 node3 corosync[1799824]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
08:04:34 node3 corosync[1799824]: [KNET ] rx: host: 2 link: 1 is up
08:04:34 node3 corosync[1799824]: [KNET ] link: Resetting MTU for link 1 because host 2 joined
08:04:34 node3 corosync[1799824]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
08:04:34 node3 corosync[1799824]: [KNET ] pmtud: Global data MTU changed to: 1397
08:25:06 node3 corosync[1799824]: [KNET ] link: host: 2 link: 0 is down
08:25:06 node3 corosync[1799824]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
08:25:08 node3 corosync[1799824]: [KNET ] rx: host: 2 link: 0 is up
08:25:08 node3 corosync[1799824]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
08:25:08 node3 corosync[1799824]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
08:25:08 node3 corosync[1799824]: [KNET ] pmtud: Global data MTU changed to: 1397
08:32:48 node3 corosync[1799824]: [TOTEM ] Retransmit List: 2383e
08:37:41 node3 corosync[1799824]: [TOTEM ] Retransmit List: 23cbe
08:52:18 node3 corosync[1799824]: [TOTEM ] Retransmit List: 24a5f
09:19:39 node3 corosync[1799824]: [KNET ] link: host: 2 link: 0 is down
09:19:39 node3 corosync[1799824]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
09:19:41 node3 corosync[1799824]: [KNET ] rx: host: 2 link: 0 is up
09:19:41 node3 corosync[1799824]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
09:19:41 node3 corosync[1799824]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
09:19:41 node3 corosync[1799824]: [KNET ] pmtud: Global data MTU changed to: 1397
09:41:33 node3 corosync[1799824]: [TOTEM ] Retransmit List: 2781e
10:17:48 node3 corosync[1799824]: [KNET ] link: host: 2 link: 1 is down
10:17:48 node3 corosync[1799824]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
10:17:50 node3 corosync[1799824]: [KNET ] rx: host: 2 link: 1 is up
10:17:50 node3 corosync[1799824]: [KNET ] link: Resetting MTU for link 1 because host 2 joined
10:17:50 node3 corosync[1799824]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
10:17:50 node3 corosync[1799824]: [KNET ] pmtud: Global data MTU changed to: 1397
10:37:04 node3 corosync[1799824]: [TOTEM ] Retransmit List: 2aba4
10:48:03 node3 corosync[1799824]: [TOTEM ] Retransmit List: 2b5d2
10:50:22 node3 corosync[1799824]: [TOTEM ] Retransmit List: 2b7f8
11:02:09 node3 corosync[1799824]: [TOTEM ] Retransmit List: 2c2e1
11:11:50 node3 corosync[1799824]: [TOTEM ] Retransmit List: 2cbc4
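To make the excerpts above easier to compare, here is a minimal sketch that tallies the "link ... is down" events per reporting node, peer node ID and link. It assumes the log lines have exactly the format shown above. The function name `count_link_downs` and the `sample` list are my own illustration and not part of Corosync.

```python
import re
from collections import Counter

# Matches knet "link down" lines in the format shown in the logs above.
DOWN_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}) (?P<reporter>\S+) corosync\[\d+\]: "
    r"\[KNET\s*\] link: host: (?P<host>\d+) link: (?P<link>\d+) is down$"
)

def count_link_downs(lines):
    """Count 'is down' events per (reporting node, peer nodeid, link number)."""
    counts = Counter()
    for line in lines:
        m = DOWN_RE.match(line.strip())
        if m:
            counts[(m["reporter"], int(m["host"]), int(m["link"]))] += 1
    return counts

# A few lines copied from the node1 and node3 excerpts above.
sample = [
    "09:00:27 node1 corosync[5629]: [KNET ] link: host: 3 link: 0 is down",
    "09:00:29 node1 corosync[5629]: [KNET ] rx: host: 3 link: 0 is up",
    "09:28:06 node1 corosync[5629]: [KNET ] link: host: 3 link: 0 is down",
    "09:54:57 node1 corosync[5629]: [KNET ] link: host: 3 link: 1 is down",
    "08:32:48 node3 corosync[1799824]: [TOTEM ] Retransmit List: 2383e",
]

for (reporter, host, link), n in sorted(count_link_downs(sample).items()):
    print(f"{reporter}: nodeid {host} link {link} went down {n} time(s)")
```

Feeding it the full journal output from each node should show whether the flaps concentrate on one peer or one link.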
logging {
  debug: off
  to_syslog: yes
}
nodelist {
  node {
    name: node1
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 172.16.1.1
    ring1_addr: 10.10.121.7
  }
  node {
    name: node2
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 172.16.1.2
    ring1_addr: 10.10.121.8
  }
  node {
    name: node3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 172.16.1.3
    ring1_addr: 10.10.121.21
  }
}
quorum {
  provider: corosync_votequorum
}
totem {
  cluster_name: oxy-pve-cl1
  config_version: 13
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
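One thing to keep in mind when reading the logs against this configuration: as far as I understand, the `host: N` values in the KNET messages are node IDs rather than node names, and here node1 has nodeid 2 while node2 has nodeid 1. A minimal sketch for building that mapping from the nodelist is below. The function name `nodeid_to_name` is hypothetical, and the simple regex-based parsing is an assumption that only holds for a flat `node { ... }` layout like the one above.

```python
import re

def nodeid_to_name(conf_text):
    """Map nodeid -> node name from the node { ... } blocks of a corosync.conf."""
    mapping = {}
    # \bnode\s*\{ matches "node {" but not "nodelist {" or "nodeid:".
    for block in re.findall(r"\bnode\s*\{([^{}]*)\}", conf_text):
        fields = {}
        for line in block.splitlines():
            if ":" in line:
                key, value = line.split(":", 1)
                fields[key.strip()] = value.strip()
        if "nodeid" in fields and "name" in fields:
            mapping[int(fields["nodeid"])] = fields["name"]
    return mapping

# The nodelist section as posted above, shortened to the relevant keys.
conf = """
nodelist {
  node {
    name: node1
    nodeid: 2
    ring0_addr: 172.16.1.1
  }
  node {
    name: node2
    nodeid: 1
    ring0_addr: 172.16.1.2
  }
  node {
    name: node3
    nodeid: 3
    ring0_addr: 172.16.1.3
  }
}
"""

for nodeid, name in sorted(nodeid_to_name(conf).items()):
    print(f"host {nodeid} = {name}")
```

With this mapping, a line such as `link: host: 2 link: 0 is down` logged on node2 or node3 would refer to node1.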
Environment:
- Initially: 2-node cluster (HPE servers) — stable
- Corosync traffic is isolated in a dedicated LACP bond on separate 1 Gbps NICs
- After adding the third node (not HPE), flapping messages began to appear
- The issue occurs on all three nodes
- If the third node is removed, the cluster becomes stable again
Troubleshooting performed:
- Removed bonding for Corosync on the 3rd node: messages persist
- Moved Corosync to a separate NIC: messages persist
- Switched NIC ports: messages persist
- Tried different NICs entirely: messages persist
- Ran Corosync over the main vmbr (together with VM traffic): messages persist
- Replaced the hardware of the 3rd node completely: messages persist
- Turned off EEE (Energy-Efficient Ethernet) on the Proxmox hosts: messages persist
- Removed or shut down node3: no messages, everything is fine
Observation:
No other errors are reported in the cluster
2-node cluster works perfectly fine
Issues appear only in 3-node configuration
Since this is my first cluster setup, I am unsure how critical this behavior is.
Questions:
Is it normal to see such messages in a cluster with three or more nodes?
Can the cluster still be considered stable in this state?
If not, what could be the root cause?
What would you recommend to troubleshoot or fix this issue?