Corosync link flapping with 3 nodes

dedfry · Mar 24, 2026

Hello,

I am experiencing an issue with my cluster and would appreciate your advice.

Problem description:
I am constantly seeing Corosync-related messages (flapping/instability), even though there are no visible link issues. Linux does not report any port or link failures. Switch monitoring also shows no port problems

07:59:29 node1 corosync[5629]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
07:59:29 node1 corosync[5629]: [KNET ] pmtud: Global data MTU changed to: 1397
08:21:46 node1 corosync[5629]: [KNET ] link: host: 1 link: 0 is down
08:21:46 node1 corosync[5629]: [KNET ] host: host: 1 (passive) best link: 1 (pri: 1)
08:21:47 node1 corosync[5629]: [KNET ] rx: host: 1 link: 0 is up
08:21:47 node1 corosync[5629]: [KNET ] link: Resetting MTU for link 0 because host 1 joined
08:21:47 node1 corosync[5629]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
08:21:47 node1 corosync[5629]: [KNET ] pmtud: Global data MTU changed to: 1397
09:00:27 node1 corosync[5629]: [KNET ] link: host: 3 link: 0 is down
09:00:27 node1 corosync[5629]: [KNET ] host: host: 3 (passive) best link: 1 (pri: 1)
09:00:29 node1 corosync[5629]: [KNET ] rx: host: 3 link: 0 is up
09:00:29 node1 corosync[5629]: [KNET ] link: Resetting MTU for link 0 because host 3 joined
09:00:29 node1 corosync[5629]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
09:00:29 node1 corosync[5629]: [KNET ] pmtud: Global data MTU changed to: 1397
09:28:06 node1 corosync[5629]: [KNET ] link: host: 3 link: 0 is down
09:28:06 node1 corosync[5629]: [KNET ] host: host: 3 (passive) best link: 1 (pri: 1)
09:28:08 node1 corosync[5629]: [KNET ] rx: host: 3 link: 0 is up
09:28:08 node1 corosync[5629]: [KNET ] link: Resetting MTU for link 0 because host 3 joined
09:28:08 node1 corosync[5629]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
09:28:08 node1 corosync[5629]: [KNET ] pmtud: Global data MTU changed to: 1397
09:32:14 node1 corosync[5629]: [KNET ] link: host: 3 link: 0 is down
09:32:14 node1 corosync[5629]: [KNET ] host: host: 3 (passive) best link: 1 (pri: 1)
09:32:15 node1 corosync[5629]: [KNET ] rx: host: 3 link: 0 is up
09:32:15 node1 corosync[5629]: [KNET ] link: Resetting MTU for link 0 because host 3 joined
09:32:15 node1 corosync[5629]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
09:32:15 node1 corosync[5629]: [KNET ] pmtud: Global data MTU changed to: 1397
09:54:57 node1 corosync[5629]: [KNET ] link: host: 3 link: 1 is down
09:54:57 node1 corosync[5629]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
09:54:59 node1 corosync[5629]: [KNET ] rx: host: 3 link: 1 is up
09:54:59 node1 corosync[5629]: [KNET ] link: Resetting MTU for link 1 because host 3 joined
09:54:59 node1 corosync[5629]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
09:54:59 node1 corosync[5629]: [KNET ] pmtud: Global data MTU changed to: 1397

05:59:00 node2 corosync[1946854]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
05:59:00 node2 corosync[1946854]: [KNET ] pmtud: Global data MTU changed to: 1397
06:05:18 node2 corosync[1946854]: [KNET ] link: host: 2 link: 0 is down
06:05:18 node2 corosync[1946854]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
06:05:20 node2 corosync[1946854]: [KNET ] rx: host: 2 link: 0 is up
06:05:20 node2 corosync[1946854]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
06:05:20 node2 corosync[1946854]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
06:05:20 node2 corosync[1946854]: [KNET ] pmtud: Global data MTU changed to: 1397
08:02:33 node2 corosync[1946854]: [KNET ] link: host: 2 link: 0 is down
08:02:33 node2 corosync[1946854]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
08:02:35 node2 corosync[1946854]: [KNET ] rx: host: 2 link: 0 is up
08:02:35 node2 corosync[1946854]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
08:02:35 node2 corosync[1946854]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
08:02:36 node2 corosync[1946854]: [KNET ] pmtud: Global data MTU changed to: 1397
08:06:39 node2 corosync[1946854]: [KNET ] link: host: 2 link: 0 is down
08:06:39 node2 corosync[1946854]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
08:06:41 node2 corosync[1946854]: [KNET ] rx: host: 2 link: 0 is up
08:06:41 node2 corosync[1946854]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
08:06:41 node2 corosync[1946854]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
08:06:42 node2 corosync[1946854]: [KNET ] pmtud: Global data MTU changed to: 1397
08:22:42 node2 corosync[1946854]: [KNET ] link: host: 2 link: 0 is down
08:22:42 node2 corosync[1946854]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
08:22:44 node2 corosync[1946854]: [KNET ] rx: host: 2 link: 0 is up
08:22:44 node2 corosync[1946854]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
08:22:44 node2 corosync[1946854]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
08:22:44 node2 corosync[1946854]: [KNET ] pmtud: Global data MTU changed to: 1397
08:42:38 node2 corosync[1946854]: [KNET ] link: host: 2 link: 0 is down
08:42:38 node2 corosync[1946854]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
08:42:40 node2 corosync[1946854]: [KNET ] rx: host: 2 link: 0 is up
08:42:40 node2 corosync[1946854]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
08:42:40 node2 corosync[1946854]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
08:42:40 node2 corosync[1946854]: [KNET ] pmtud: Global data MTU changed to: 1397
09:28:02 node2 corosync[1946854]: [KNET ] link: host: 2 link: 0 is down
09:28:02 node2 corosync[1946854]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
09:28:04 node2 corosync[1946854]: [KNET ] rx: host: 2 link: 0 is up
09:28:04 node2 corosync[1946854]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
09:28:04 node2 corosync[1946854]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
09:28:04 node2 corosync[1946854]: [KNET ] pmtud: Global data MTU changed to: 1397
09:31:25 node2 corosync[1946854]: [KNET ] link: host: 2 link: 0 is down
09:31:25 node2 corosync[1946854]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
09:31:27 node2 corosync[1946854]: [KNET ] rx: host: 2 link: 0 is up
09:31:27 node2 corosync[1946854]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
09:31:27 node2 corosync[1946854]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
09:31:27 node2 corosync[1946854]: [KNET ] pmtud: Global data MTU changed to: 1397

07:57:09 node3 corosync[1799824]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
07:57:11 node3 corosync[1799824]: [KNET ] rx: host: 2 link: 1 is up
07:57:11 node3 corosync[1799824]: [KNET ] link: Resetting MTU for link 1 because host 2 joined
07:57:11 node3 corosync[1799824]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
07:57:11 node3 corosync[1799824]: [KNET ] pmtud: Global data MTU changed to: 1397
07:59:05 node3 corosync[1799824]: [KNET ] link: host: 2 link: 1 is down
07:59:05 node3 corosync[1799824]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
07:59:07 node3 corosync[1799824]: [KNET ] rx: host: 2 link: 1 is up
07:59:07 node3 corosync[1799824]: [KNET ] link: Resetting MTU for link 1 because host 2 joined
07:59:08 node3 corosync[1799824]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
07:59:08 node3 corosync[1799824]: [KNET ] pmtud: Global data MTU changed to: 1397
08:04:32 node3 corosync[1799824]: [KNET ] link: host: 2 link: 1 is down
08:04:32 node3 corosync[1799824]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
08:04:34 node3 corosync[1799824]: [KNET ] rx: host: 2 link: 1 is up
08:04:34 node3 corosync[1799824]: [KNET ] link: Resetting MTU for link 1 because host 2 joined
08:04:34 node3 corosync[1799824]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
08:04:34 node3 corosync[1799824]: [KNET ] pmtud: Global data MTU changed to: 1397
08:25:06 node3 corosync[1799824]: [KNET ] link: host: 2 link: 0 is down
08:25:06 node3 corosync[1799824]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
08:25:08 node3 corosync[1799824]: [KNET ] rx: host: 2 link: 0 is up
08:25:08 node3 corosync[1799824]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
08:25:08 node3 corosync[1799824]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
08:25:08 node3 corosync[1799824]: [KNET ] pmtud: Global data MTU changed to: 1397
08:32:48 node3 corosync[1799824]: [TOTEM ] Retransmit List: 2383e
08:37:41 node3 corosync[1799824]: [TOTEM ] Retransmit List: 23cbe
08:52:18 node3 corosync[1799824]: [TOTEM ] Retransmit List: 24a5f
09:19:39 node3 corosync[1799824]: [KNET ] link: host: 2 link: 0 is down
09:19:39 node3 corosync[1799824]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
09:19:41 node3 corosync[1799824]: [KNET ] rx: host: 2 link: 0 is up
09:19:41 node3 corosync[1799824]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
09:19:41 node3 corosync[1799824]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
09:19:41 node3 corosync[1799824]: [KNET ] pmtud: Global data MTU changed to: 1397
09:41:33 node3 corosync[1799824]: [TOTEM ] Retransmit List: 2781e
10:17:48 node3 corosync[1799824]: [KNET ] link: host: 2 link: 1 is down
10:17:48 node3 corosync[1799824]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
10:17:50 node3 corosync[1799824]: [KNET ] rx: host: 2 link: 1 is up
10:17:50 node3 corosync[1799824]: [KNET ] link: Resetting MTU for link 1 because host 2 joined
10:17:50 node3 corosync[1799824]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
10:17:50 node3 corosync[1799824]: [KNET ] pmtud: Global data MTU changed to: 1397
10:37:04 node3 corosync[1799824]: [TOTEM ] Retransmit List: 2aba4
10:48:03 node3 corosync[1799824]: [TOTEM ] Retransmit List: 2b5d2
10:50:22 node3 corosync[1799824]: [TOTEM ] Retransmit List: 2b7f8
11:02:09 node3 corosync[1799824]: [TOTEM ] Retransmit List: 2c2e1
11:11:50 node3 corosync[1799824]: [TOTEM ] Retransmit List: 2cbc4

logging {
debug: off
to_syslog: yes
}

nodelist {
node {
name: node1
nodeid: 2
quorum_votes: 1
ring0_addr: 172.16.1.1
ring1_addr: 10.10.121.7
}
node {
name: node2
nodeid: 1
quorum_votes: 1
ring0_addr: 172.16.1.2
ring1_addr: 10.10.121.8
}
node {
name: node3
nodeid: 3
quorum_votes: 1
ring0_addr: 172.16.1.3
ring1_addr: 10.10.121.21
}
}

quorum {
provider: corosync_votequorum
}

totem {
cluster_name: oxy-pve-cl1
config_version: 13
interface {
linknumber: 0
}
interface {
linknumber: 1
}
ip_version: ipv4-6
link_mode: passive
secauth: on
version: 2
}

Environment:

Initially: 2-node cluster (HPE servers) — stable
Corosync traffic is isolated in a dedicated LACP bond on separate 1 Gbps NICs
After adding the third node (not hpe), flapping messages begin to appear
The issue occurs on all three nodes
If the third node is removed, the cluster becomes stable again

Troubleshooting performed:
Removed bonding for Corosync on the 3rd node - there are messages
Moved Corosync to a separate NIC - there are messages
Switched NIC ports -there are messages
Tried different NICs entirely - there are messages
Ran Corosync over the main vmbr (together with VM traffic) - there are messages
Replaced the hardware of the 3rd node completely -there are messages
Tried turning off EEE at proxmox -there are messages
Removing or shutting down node3 - no messages, everything is fine

Observation:
No other errors are reported in the cluster
2-node cluster works perfectly fine
Issues appear only in 3-node configuration
Since this is my first cluster setup, I am unsure how critical this behavior is.

Questions:
Is it normal to see such messages in a 3-node or more-node cluster?
Can the cluster still be considered stable in this state?
If not, what could be the root cause?
What would you recommend to troubleshoot or fix this issue?

spirit · Mar 24, 2026

it's look like to nic of the 3th is going down/up or flapping.

nic driver bug ? maybe bad cable ?

do you have any kernel log on the 3th node ? #dmesg ?

maybe also try without bonding/lacp with 2 corosync links

dedfry · Mar 24, 2026

spirit said:
it's look like to nic of the 3th is going down/up or flapping.

nic driver bug ? maybe bad cable ?

do you have any kernel log on the 3th node ? #dmesg ?

maybe also try without bonding/lacp with 2 corosync links

[Mon Mar 23 12:32:56 2026] vmbr0: the hash_elasticity option has been deprecated and is always 16
[Mon Mar 23 12:33:00 2026] tg3 0000:04:00.0 eno8303: Link is up at 1000 Mbps, full duplex
[Mon Mar 23 12:33:00 2026] tg3 0000:04:00.0 eno8303: Flow control is off for TX and off for RX
[Mon Mar 23 12:33:00 2026] tg3 0000:04:00.0 eno8303: EEE is disabled
[Mon Mar 23 12:36:43 2026] sctp: Hash tables configured (bind 16384/16384)
[Mon Mar 23 17:48:59 2026] perf: interrupt took too long (3181 > 3180), lowering kernel.perf_event_max_sample_rate to 62000
[Mon Mar 23 22:04:28 2026] tg3 0000:04:00.0 eno8303: Link is down
[Mon Mar 23 22:20:09 2026] tg3 0000:04:00.0 eno8303: Link is up at 1000 Mbps, full duplex
[Mon Mar 23 22:20:09 2026] tg3 0000:04:00.0 eno8303: Flow control is off for TX and off for RX
[Mon Mar 23 22:20:09 2026] tg3 0000:04:00.0 eno8303: EEE is disabled
[Tue Mar 24 04:15:10 2026] perf: interrupt took too long (4006 > 3976), lowering kernel.perf_event_max_sample_rate to 49000
[Tue Mar 24 11:26:13 2026] ahci 0000:05:00.0: Using 48-bit DMA addresses

node3 pmxcfs[1799828]: [status] notice: received log
node3 systemd[1]: Starting prometheus-node-exporter-ipmitool-sensor.service - Collect ipmitool sensor metrics for prometheus-node-e>
node3 systemd[1]: prometheus-node-exporter-ipmitool-sensor.service: Deactivated successfully.
node3 systemd[1]: Finished prometheus-node-exporter-ipmitool-sensor.service - Collect ipmitool sensor metrics for prometheus-node-e>
node3 corosync[1799824]: [KNET ] link: host: 2 link: 1 is down
node3 corosync[1799824]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
node3 pmxcfs[1799828]: [status] notice: received log
node3 corosync[1799824]: [KNET ] rx: host: 2 link: 1 is up
node3 corosync[1799824]: [KNET ] link: Resetting MTU for link 1 because host 2 joined
node3 corosync[1799824]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
node3 corosync[1799824]: [KNET ] pmtud: Global data MTU changed to: 1397
node3 corosync[1799824]: [TOTEM ] Retransmit List: 35046
node3 pmxcfs[1799828]: [status] notice: received log
node3 pmxcfs[1799828]: [status] notice: received log
node3 pmxcfs[1799828]: [status] notice: received log

as you can see, linux got no porblems with network. VMs work fine too on node1-2, didnt add VMs to node3 yet
Nic driver bug ? Previously used intel nic, changed it to broadcom, still messages
Changed dac cables too

Removed all bondings on node3, still got messages

bitranox · Mar 24, 2026

This is not random flapping
it's specifically link: 1 (the second Corosync ring/link) that is unstable, while link: 0 remains the preferred path. This is a very specific and diagnosable problem.

The MTU of 1397 is your primary smoking gun. Standard Ethernet MTU is 1500 bytes. Corosync/KNET uses Path MTU Discovery (PMTUD) when it finds 1397, it means something in the network path is fragmenting or capping packets. This causes retransmits, which triggers the flapping you're seeing.
The second key observation: only link: 1 is flapping, not link: 0. This tells you your Corosync config has two rings/links configured, and the second one, specifically from/to node3, is the problem.

Confirm what link 1 actually is on node3 : cat /etc/pve/corosync.conf

Look for the interface sections. You likely have ringnumber: 0 and ringnumber: 1 (or link_mode with two addresses per node). Find which IP corresponds to link 1 on node3.

On every node, run: ip link show # Check MTU column for each interface

They should all be identical. Then test end-to-end PMTUD manually:

Bash:

# From node1, test MTU to node3's link-1 IP
ping -M do -s 1472 <node3-link1-ip>   # 1472 + 28 overhead = 1500 bytes
ping -M do -s 1400 <node3-link1-ip>   # Should succeed if MTU is 1397

If 1472 fails but 1400 succeeds, you've confirmed a path MTU below 1500.

Check switch port configuration for node3
On your switch, compare the port config for node3 versus node1/2. Specifically look for:
- Access vs trunk mode : if node3's port is in trunk mode it'll prepend 802.1Q tags (4 bytes), reducing effective MTU to 1496 without jumbo frames
- MTU setting on the port : some managed switches have per-port MTU caps

Given that you replaced hardware entirely and still have the issue, the problem is network/config, not hardware. The tg3 link-down at 22:04 in your kernel log (15+ minutes of physical link loss) is a separate event, possibly EEE or switch-side power saving. The Corosync flapping is the MTU/link1 issue.

spirit · Mar 25, 2026

bitranox said:
The MTU of 1397 is your primary smoking gun. Standard Ethernet MTU is 1500 bytes. Corosync/KNET uses Path MTU Discovery (PMTUD) when it finds 1397, it means something in the network path is fragmenting or capping packets. This causes retransmits, which triggers the flapping you're seeing.
The second key observation: only link: 1 is flapping, not link: 0. This tells you your Corosync config has two rings/links configured, and the second one, specifically from/to node3, is the problem.

corosync mtu is always lower than real mtu.
in my production, I have also 1397 pmtud with 1500 mtu on nic

Code:

Feb 19 11:05:20 corosync[23618]:   [KNET  ] pmtud: PMTUD link change for host: 4 link: 0 from 469 to 1397

spirit · Mar 25, 2026

dedfry said:
[Mon Mar 23 12:32:56 2026] vmbr0: the hash_elasticity option has been deprecated and is always 16
[Mon Mar 23 12:33:00 2026] tg3 0000:04:00.0 eno8303: Link is up at 1000 Mbps, full duplex
[Mon Mar 23 12:33:00 2026] tg3 0000:04:00.0 eno8303: Flow control is off for TX and off for RX
[Mon Mar 23 12:33:00 2026] tg3 0000:04:00.0 eno8303: EEE is disabled
[Mon Mar 23 12:36:43 2026] sctp: Hash tables configured (bind 16384/16384)
[Mon Mar 23 17:48:59 2026] perf: interrupt took too long (3181 > 3180), lowering kernel.perf_event_max_sample_rate to 62000
[Mon Mar 23 22:04:28 2026] tg3 0000:04:00.0 eno8303: Link is down
[Mon Mar 23 22:20:09 2026] tg3 0000:04:00.0 eno8303: Link is up at 1000 Mbps, full duplex

So, here, at 22:04, why is your link down ? node reboot ? if not, you have clearly a problem with nic , the cable, or the switch port[/spoiler]

bitranox · Mar 25, 2026

spirit said:
corosync mtu is always lower than real mtu.
in my production, I have also 1397 pmtud with 1500 mtu on nic

good to know ! i did not check that on my hosts, it just caught my attention. I take everything back. ;-)

dedfry · Mar 25, 2026

spirit said:
So, here, at 22:04, why is your link down ? node reboot ? if not, you have clearly a problem with nic , the cable, or the switch port[/spoiler]

Thank u for reply, that was manual port shutdown on switch

dedfry · Mar 25, 2026

bitranox said:
This is not random flapping
it's specifically link: 1 (the second Corosync ring/link) that is unstable, while link: 0 remains the preferred path. This is a very specific and diagnosable problem.

The MTU of 1397 is your primary smoking gun. Standard Ethernet MTU is 1500 bytes. Corosync/KNET uses Path MTU Discovery (PMTUD) when it finds 1397, it means something in the network path is fragmenting or capping packets. This causes retransmits, which triggers the flapping you're seeing.
The second key observation: only link: 1 is flapping, not link: 0. This tells you your Corosync config has two rings/links configured, and the second one, specifically from/to node3, is the problem.

Confirm what link 1 actually is on node3 : cat /etc/pve/corosync.conf

Look for the interface sections. You likely have ringnumber: 0 and ringnumber: 1 (or link_mode with two addresses per node). Find which IP corresponds to link 1 on node3.

On every node, run: ip link show # Check MTU column for each interface

They should all be identical. Then test end-to-end PMTUD manually:

Bash:

# From node1, test MTU to node3's link-1 IP ping -M do -s 1472 <node3-link1-ip> # 1472 + 28 overhead = 1500 bytes ping -M do -s 1400 <node3-link1-ip> # Should succeed if MTU is 1397

If 1472 fails but 1400 succeeds, you've confirmed a path MTU below 1500.

Check switch port configuration for node3
On your switch, compare the port config for node3 versus node1/2. Specifically look for:
- Access vs trunk mode : if node3's port is in trunk mode it'll prepend 802.1Q tags (4 bytes), reducing effective MTU to 1496 without jumbo frames
- MTU setting on the port : some managed switches have per-port MTU caps

Given that you replaced hardware entirely and still have the issue, the problem is network/config, not hardware. The tg3 link-down at 22:04 in your kernel log (15+ minutes of physical link loss) is a separate event, possibly EEE or switch-side power saving. The Corosync flapping is the MTU/link1 issue.

thank u for your reply!

I tested ping ping -M do -s 1472 <node3-link1-ip> # 1472 + 28 overhead = 1500 bytes that succeded and ping -M do -s 1400 <node3-link1-ip> # Should succeed if MTU is 1397[/CODE] succeded too

We did pings with different port settings (trunk with native and access ports) but in both cases they went normal no mistake. Tried 1472 and 1400

For switch we had trunk ports for corosync with native vlan, changed it to just access ports with needed vlan as a test it may be some bug in our switch, but that didnt help and error occured again

Frankly speaking i dont see mtu as a root cause, but just in chance we are gonna increase our switch mtu tomorrow to 3000 and check if that might help (current mtu is 1500)

Also that might not get caught in logs i send earlier but with node3 link0 and link1 boht went down in corosync, so it might be not only link1 problem, but in that case

I can send any logs and any configs of my cluster or network device if u need them to help me|

As experiment i set window_size: 30 and max_messages: 10 at /etc/corosync/corosync.conf as my node3 is slightly faster and maybe thats gonna help

EDITED:
Cant increase mtu at our switch, cause it doesnt support specific port mtu options, only whole switch

dedfry · Mar 25, 2026

dedfry said:
As experiment i set window_size: 30 and max_messages: 10 at /etc/corosync/corosync.conf as my node3 is slightly faster and maybe thats gonna help

So that didnt help at all

Fr im already drained with that problem, can anyone give any advice where to troubleshoot ? Anymore ideas? would appreciate any help

bitranox · Mar 25, 2026

I will get probably a lot of backlash for that, but just try before spending endless hours :

Install Claude Code on the node and my proxmox-skill , see : https://pypi.org/project/bx-skills/
and let claude diagnose it . Just prompt : "diagnose corosync restarts, use bx-proxmox skill"

let me know how it turned out ....

Johannes S · Mar 25, 2026

bitranox said:
I will get probably a lot of backlash for that, but just try before spending endless hours :

Instead spending hours on finding the right prompt and finding out whether the ki output is correct or not?
Your post is not helpful at all. Asking an AI is something OP can do theirselv, they specifically asked here because they wanted help from humans, not some bullshit-as-a-service-generator

bitranox · Mar 25, 2026

Johannes S said:
Asking an AI is something OP can do theirselv,

thats exactly what I proposed.

dedfry · Mar 27, 2026

I finally solved problem thank u for help

The root was my network devices configurated as stack and corosync was connected to different units, thats caused link flapping, when i connected to 1 unit problem was gone

bitranox · Mar 27, 2026

dedfry said:
I finally solved problem thank u for help

You found it yourself, or used claude ?

dedfry · Mar 27, 2026

bitranox said:
You found it yourself, or used claude ?

found myself, i dont trust ai agents

Search

Search

Corosync link flapping with 3 nodes

dedfry

New Member

spirit

Distinguished Member

dedfry

New Member

bitranox

Member

spirit

Distinguished Member

spirit

Distinguished Member

bitranox

Member

dedfry

New Member

dedfry

New Member

dedfry

New Member

bitranox

Member

Johannes S

Distinguished Member

bitranox

Member

dedfry

New Member

bitranox

Member

dedfry

New Member

We value your privacy