Hi
I have an interesting question about a small Proxmox cluster.
1. Background
I have a 3-node Proxmox cluster. Each node has an internet/switch-facing NIC in a bond, plus a second NIC whose two ports are cabled directly to the other two nodes as simple peer-to-peer links, i.e.:
prox-01 <-> prox-02 <-> prox-03 <-> prox-01
This is a simple, cost-effective way to connect all hosts. The network configuration on each node looks like this:
Code:
auto enp67s0f0np0
iface enp67s0f0np0 inet manual
        mtu 9000

auto enp67s0f1np1
iface enp67s0f1np1 inet manual
        mtu 9000

auto bond1
iface bond1 inet static
        address REDACTED/25
        netmask 255.255.255.128
        bond-slaves enp67s0f0np0 enp67s0f1np1
        bond-mode broadcast
As you can see, I am using broadcast mode, so every packet is sent out on both ports (i.e. in both directions around the ring).
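Why this layout cannot detour around a failed cable can be illustrated with a simplified model (a sketch, not how the bonding driver is implemented): in broadcast mode the bond transmits each frame on every slave whose link is up, but a node that receives a frame addressed to another host does not forward it, because a bond is not a bridge. The node names come from the post; the `RING` dictionary and the `broadcast_delivers` function are hypothetical.

```python
# Hypothetical model of the 3-node ring: each node's two mesh ports go to its two neighbours.
RING = {
    "prox-01": {"prox-02", "prox-03"},
    "prox-02": {"prox-01", "prox-03"},
    "prox-03": {"prox-01", "prox-02"},
}


def broadcast_delivers(src, dst, down_links=frozenset()):
    """Return True if a frame sent by src over its broadcast bond reaches dst.

    Simplified model: the frame goes out on every slave whose cable is up, and a
    neighbour that is not the destination drops it instead of forwarding it, so
    only a direct, working cable counts.
    """
    for neighbour in RING[src]:
        if frozenset((src, neighbour)) in down_links:
            continue  # this slave has no carrier, nothing is sent on it
        if neighbour == dst:
            return True
    return False


failed = frozenset({frozenset({"prox-01", "prox-02"})})
print(broadcast_delivers("prox-01", "prox-02", failed))  # prints: False
print(broadcast_delivers("prox-01", "prox-03", failed))  # prints: True
```

With the prox-01 <-> prox-02 cable down, prox-01 can still reach prox-03 but not prox-02, which is consistent with corosync on prox-01 reporting that host 2 has no active links.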
2. Problem/Issue
Recently, the link between prox-01 and prox-02 failed:
2026-04-17T13:37:12.276667+03:00 prox-01 kernel: [3966775.451514] mlx5_core 0000:43:00.0 enp67s0f0np0: Link down
Then corosync stepped in:
2026-04-17 13:37:13.541 2026-04-17T13:37:13.311676+03:00 proxmox-01 corosync[1984]: [KNET ] link: host: 2 link: 0 is down
2026-04-17 13:37:13.541 2026-04-17T13:37:13.311983+03:00 proxmox-01 corosync[1984]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
2026-04-17 13:37:13.541 2026-04-17T13:37:13.312012+03:00 proxmox-01 corosync[1984]: [KNET ] host: host: 2 has no active links
2026-04-17 13:37:15.044 2026-04-17T13:37:14.944110+03:00 proxmox-01 corosync[1984]: [TOTEM ] Token has not been received in 2737 ms
2026-04-17 13:37:16.000 2026-04-17T13:37:15.926696+03:00 proxmox-01 kernel: [3966779.101643] mlx5_core 0000:43:00.0 enp67s0f0np0: Link up
2026-04-17 13:37:16.046 2026-04-17T13:37:15.856919+03:00 proxmox-01 corosync[1984]: [TOTEM ] A processor failed, forming new configuration: token timed out (3650ms), waiting 4380ms for consensus.
2026-04-17 13:37:16.046 2026-04-17T13:37:15.926696+03:00 proxmox-01 kernel: [3966779.101643] mlx5_core 0000:43:00.0 enp67s0f0np0: Link up
2026-04-17 13:37:18.549 2026-04-17T13:37:18.312509+03:00 proxmox-01 corosync[1984]: [KNET ] rx: host: 2 link: 0 is up
2026-04-17 13:37:18.549 2026-04-17T13:37:18.312623+03:00 proxmox-01 corosync[1984]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
2026-04-17 13:37:18.549 2026-04-17T13:37:18.312684+03:00 proxmox-01 corosync[1984]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
2026-04-17 13:37:18.549 2026-04-17T13:37:18.389680+03:00 proxmox-01 corosync[1984]: [KNET ] pmtud: Global data MTU changed to: 1397
2026-04-17 13:37:18.549 2026-04-17T13:37:18.408929+03:00 proxmox-01 corosync[1984]: [QUORUM] Sync members[3]: 1 2 3
2026-04-17 13:37:18.549 2026-04-17T13:37:18.408985+03:00 proxmox-01 corosync[1984]: [TOTEM ] A new membership (1.a3b) was formed. Members
2026-04-17 13:37:18.549 2026-04-17T13:37:18.411066+03:00 proxmox-01 corosync[1984]: [QUORUM] Members[3]: 1 2 3
2026-04-17 13:37:18.549 2026-04-17T13:37:18.411107+03:00 proxmox-01 corosync[1984]: [MAIN ] Completed service synchronization, ready to provide service.
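The 2737 ms, 3650 ms and 4380 ms figures in the log are consistent with corosync's documented defaults for a 3-node cluster: the effective token timeout is token + (nodes - 2) x token_coefficient, consensus defaults to 1.2 x token, and the "Token has not been received" warning fires at token_warning (default 75%) of the token timeout. The sketch below assumes the defaults token = 3000 ms and token_coefficient = 650 ms, which reproduce the logged values; the function names are illustrative only.

```python
def effective_token_ms(nodes, token=3000, token_coefficient=650):
    """Token timeout after corosync's per-node adjustment (applied with 3+ nodes)."""
    if nodes >= 3:
        return token + (nodes - 2) * token_coefficient
    return token


def consensus_ms(token_ms):
    """Default consensus timeout, 1.2 x token (integer maths avoids float rounding)."""
    return token_ms * 12 // 10


def token_warning_ms(token_ms, percent=75):
    """When the 'Token has not been received' warning is logged (default 75%)."""
    return token_ms * percent // 100


token = effective_token_ms(nodes=3)
print(token, consensus_ms(token), token_warning_ms(token))  # prints: 3650 4380 2737
```

For scale, the kernel log shows enp67s0f0np0 down from 13:37:12.277 to 13:37:15.927, about 3.65 s, which is roughly the same as the 3650 ms token timeout.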
It recovered, but over the next few minutes the same thing happened several more times, until once it did not recover fast enough and prox-01 and prox-02 rebooted themselves.
I found out that corosync supports multiple links, so I added a second link over the other NIC (the switch/internet-facing one) and gave it a low priority.
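Link priorities are set with `knet_link_priority` in the per-link `interface` subsections of the `totem` block in corosync.conf, and in the default `passive` link mode knet uses the highest-priority link that is currently connected. A minimal model of that selection (a simplification, not knet's actual code; the priority values 10 and 5 are hypothetical, and ties are not modelled):

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Link:
    number: int    # corosync linknumber (0 = mesh, 1 = switch in this post)
    priority: int  # knet_link_priority; higher wins in passive mode
    up: bool       # whether knet currently considers the link connected


def best_link(links: List[Link]) -> Optional[Link]:
    """Simplified passive-mode choice: the highest-priority link that is up."""
    usable = [link for link in links if link.up]
    if not usable:
        return None  # corresponds to "host: 2 has no active links"
    return max(usable, key=lambda link: link.priority)


mesh = Link(number=0, priority=10, up=False)  # direct cable to prox-02 is down
switch = Link(number=1, priority=5, up=True)  # low-priority link via the switch
print(best_link([mesh, switch]).number)  # prints: 1
```

In this model, the low-priority switch link carries corosync traffic only while the mesh link is down, which is the fallback the extra link is meant to provide.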
Is there any information on how I can redirect the traffic over the remaining path, prox-01 <-> prox-03 <-> prox-02, in a way that corosync understands?
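Corosync only sends to the address configured for each link on each node and does not relay traffic through other cluster members, so a detour via prox-03 would have to happen at the IP layer. One option is a routed mesh, where a routing daemon (for example FRR with OpenFabric or OSPF) recomputes paths when a cable fails. The path computation such a setup performs can be sketched as a breadth-first search over the ring; `RING_LINKS` and `shortest_path` are hypothetical names, and real routing protocols also weigh link costs.

```python
from collections import deque

# Hypothetical model of the mesh cabling: each frozenset is one direct cable.
RING_LINKS = frozenset({
    frozenset({"prox-01", "prox-02"}),
    frozenset({"prox-02", "prox-03"}),
    frozenset({"prox-03", "prox-01"}),
})


def shortest_path(src, dst, links):
    """Breadth-first search over undirected links; returns the hop list or None."""
    neighbours = {}
    for link in links:
        a, b = tuple(link)
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    previous = {src: None}  # node -> the node it was reached from
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:  # walk back to src
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in sorted(neighbours.get(node, ())):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None


failed = frozenset({"prox-01", "prox-02"})
print(shortest_path("prox-01", "prox-02", RING_LINKS))             # ['prox-01', 'prox-02']
print(shortest_path("prox-01", "prox-02", RING_LINKS - {failed}))  # ['prox-01', 'prox-03', 'prox-02']
```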