3 node proxmox link failure, reboots node even when second link is up

TafkaMax

New Member
Jun 17, 2025
Hi

I have an interesting question regarding a small Proxmox cluster.

1. Background

I have a 3-node Proxmox cluster: each node has its internet/switch-facing NIC in a bond, plus a second dual-port NIC whose ports run simple peer-to-peer links to the other two nodes.

E.g. prox-01 <-> prox-02 <-> prox-03 <-> prox-01

This is a simple, cost-effective way to connect all hosts. The network configuration on each node looks like this:


Code:
auto enp67s0f0np0
iface enp67s0f0np0 inet manual
    mtu 9000

auto enp67s0f1np1
iface enp67s0f1np1 inet manual
    mtu 9000

auto bond1
iface bond1 inet static
    address REDACTED/25
    netmask 255.255.255.128
    bond-slaves enp67s0f0np0 enp67s0f1np1
    bond-mode broadcast
As you can see, I am using broadcast mode, so every packet is sent out both ports (one towards each neighbour).

2. Problem/Issue

Recently I had a link failure between prox-01 and prox-02:

Code:
2026-04-17T13:37:12.276667+03:00 prox-01 kernel: [3966775.451514] mlx5_core 0000:43:00.0 enp67s0f0np0: Link down

Then corosync stepped in:

Code:
2026-04-17T13:37:13.311676+03:00 proxmox-01 corosync[1984]: [KNET ] link: host: 2 link: 0 is down
2026-04-17T13:37:13.311983+03:00 proxmox-01 corosync[1984]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
2026-04-17T13:37:13.312012+03:00 proxmox-01 corosync[1984]: [KNET ] host: host: 2 has no active links
2026-04-17T13:37:14.944110+03:00 proxmox-01 corosync[1984]: [TOTEM ] Token has not been received in 2737 ms
2026-04-17T13:37:15.926696+03:00 proxmox-01 kernel: [3966779.101643] mlx5_core 0000:43:00.0 enp67s0f0np0: Link up
2026-04-17T13:37:15.856919+03:00 proxmox-01 corosync[1984]: [TOTEM ] A processor failed, forming new configuration: token timed out (3650ms), waiting 4380ms for consensus.
2026-04-17T13:37:18.312509+03:00 proxmox-01 corosync[1984]: [KNET ] rx: host: 2 link: 0 is up
2026-04-17T13:37:18.312623+03:00 proxmox-01 corosync[1984]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
2026-04-17T13:37:18.312684+03:00 proxmox-01 corosync[1984]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
2026-04-17T13:37:18.389680+03:00 proxmox-01 corosync[1984]: [KNET ] pmtud: Global data MTU changed to: 1397
2026-04-17T13:37:18.408929+03:00 proxmox-01 corosync[1984]: [QUORUM] Sync members[3]: 1 2 3
2026-04-17T13:37:18.408985+03:00 proxmox-01 corosync[1984]: [TOTEM ] A new membership (1.a3b) was formed. Members
2026-04-17T13:37:18.411066+03:00 proxmox-01 corosync[1984]: [QUORUM] Members[3]: 1 2 3
2026-04-17T13:37:18.411107+03:00 proxmox-01 corosync[1984]: [MAIN ] Completed service synchronization, ready to provide service.

It recovered, but over the next few minutes the same thing happened a few more times, until it didn't recover fast enough and prox-01 and prox-02 rebooted themselves.

I found out that corosync supports multiple links, so I added a secondary link that goes over the other NIC (the one connected via switch/internet) and gave it a lower priority.
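For reference, the secondary link ends up in /etc/pve/corosync.conf roughly like the sketch below. The addresses and priority values here are made-up placeholders, not my actual setup; with knet in passive mode the link with the higher knet_link_priority value is preferred, and config_version in the totem section has to be bumped whenever the file is edited:

Code:
nodelist {
  node {
    name: prox-01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1    # placeholder: direct peer-to-peer link
    ring1_addr: 192.168.1.11  # placeholder: switch-facing fallback link
  }
  # ... prox-02 and prox-03 analogous ...
}

totem {
  # ...
  interface {
    linknumber: 0
    knet_link_priority: 20    # preferred link (higher value wins in passive mode)
  }
  interface {
    linknumber: 1
    knet_link_priority: 10    # fallback link
  }
}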

Is there any information on how I can make the traffic flow over the second path (prox-01 <-> prox-03 <-> prox-02) so that corosync understands it?
 
You can see the status of both "rings" this way:
Code:
~# corosync-cfgtool  -s
Local node ID 6, transport knet
LINK ID 0 udp
        addr    = 10.3.16.7
        status:
                nodeid:          1:     disconnected
                nodeid:          2:     connected
                nodeid:          4:     connected
...
LINK ID 1 udp
        addr    = 10.11.16.7
        status:
                nodeid:          1:     disconnected
                nodeid:          2:     connected
                nodeid:          4:     connected
Code:
~# corosync-cfgtool  -n
Local node ID 6, transport knet
nodeid: 2 reachable
   LINK: 0 udp (10.3.16.7->10.3.16.9) enabled connected mtu: 1397
   LINK: 1 udp (10.11.16.7->10.11.16.9) enabled connected mtu: 1397

nodeid: 4 reachable
   LINK: 0 udp (10.3.16.7->10.3.16.10) enabled connected mtu: 1397
   LINK: 1 udp (10.11.16.7->10.11.16.10) enabled connected mtu: 1397
  
...
(Important: my "nodeid 1" is disconnected on purpose. Yours should be "connected", of course.)

As long as everything is "connected", one ring may get lost without losing quorum. See also man corosync-cfgtool.
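To double-check this, you can watch the quorum state while a link is down. These commands are just an illustration; the exact output depends on your cluster:

Code:
~# corosync-quorumtool -s
~# pvecm status

If "Quorate" stays "Yes" while one link is reported down, the remaining link is carrying the cluster traffic as intended.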