Nodes constantly fencing for no obvious reason

Pintjes

Hi all,

Today I added dedicated corosync interfaces and a separate network, thinking that would fix my 'nodes fencing for no reason' problem... However, the nodes are still fencing for no obvious reason, even more than ever (like every 15 minutes).

Can anyone help me troubleshoot? This issue is driving me crazy...

Some extra info:

- 3 node cluster.
- The 2 nodes that have running VMs (and ZFS replication between them) are the ones that are constantly fencing...
- Everything (data & corosync) is connected to the same switch, but the switch should be more than capable of handling the load.

Thank you all in advance!
 
Hi,

- Everything (data & corosync) is connected to the same switch, but the switch should be more than capable of handling the load.

"Capable of handling the load" isn't the same thing as low latency. Corosync requires a low-latency connection (<5 ms normally), which is why we recommend separate hardware (NICs and switches) for this.

Have you verified that corosync indeed uses the dedicated corosync link?
Have you also verified that the ZFS replication takes the data link, not the corosync link?
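Independent of that, a quick way to get a feel for the latency on the (intended) corosync network is a short ping test between the nodes, checking that the maximum round-trip time stays well below 5 ms (the address below is just an example):
Code:
# 100 probes, 5 per second, summary only
ping -c 100 -i 0.2 -q 192.168.208.10
# check the "rtt min/avg/max/mdev" line in the output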

Please also provide the output of
- pveversion -v
- corosync-cfgtool -n
- cat /etc/pve/corosync.conf
- The corosync logs (journalctl -u corosync) around the time of such a fencing event, for example as shown below
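For the logs, you can restrict journalctl to a window around one of the fencing events to keep the output manageable, e.g. (timestamps are just placeholders, adjust to your event):
Code:
journalctl -u corosync --since "2025-06-05 03:00" --until "2025-06-05 03:30"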
 
Hi Cheiss,

As for using the dedicated interface, I have checked the routing table and that says the following:
Code:
root@pve-02:~# ip route show
default via 192.168.202.1 dev vmbr0.202 proto kernel onlink
192.168.202.0/24 dev vmbr0.202 proto kernel scope link src 192.168.202.20
192.168.207.0/24 dev vmbr0.207 proto kernel scope link src 192.168.207.20
192.168.208.0/24 dev ens2 proto kernel scope link src 192.168.208.20
Which, if I'm not mistaken, should mean it uses 'ens2' (the dedicated interface) for its corosync network, namely 192.168.208.0/24.
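To double-check for a specific peer, ip route get can be used to see which interface would actually be chosen for that destination, e.g.:
Code:
# which interface/source address would be used to reach pve-01's corosync address?
ip route get 192.168.208.10
# expected to show "dev ens2 src 192.168.208.20" if the dedicated NIC is really used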

I have also set up a separate migration network which uses the 10Gb connection; replication should use this as well, since that was the only interface available before I added the dedicated corosync interface yesterday.
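If I understand correctly, the migration network (which the ZFS replication jobs should also use) is what's set in /etc/pve/datacenter.cfg; the relevant line looks something like this (subnet only assumed here):
Code:
migration: secure,network=192.168.207.0/24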

Code:
root@pve-02:~# pveversion -v
proxmox-ve: 8.4.0 (running kernel: 6.8.12-11-pve)
pve-manager: 8.4.1 (running version: 8.4.1/2a5fa54a8503f96d)
proxmox-kernel-helper: 8.1.1
proxmox-kernel-6.8.12-11-pve-signed: 6.8.12-11
proxmox-kernel-6.8: 6.8.12-11
proxmox-kernel-6.8.12-10-pve-signed: 6.8.12-10
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
ceph-fuse: 17.2.8-pve2
corosync: 3.1.9-pve1
criu: 3.17.1-2+deb12u1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.30-pve2
libproxmox-acme-perl: 1.6.0
libproxmox-backup-qemu0: 1.5.1
libproxmox-rs-perl: 0.3.5
libpve-access-control: 8.2.2
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.1.0
libpve-cluster-perl: 8.1.0
libpve-common-perl: 8.3.1
libpve-guest-common-perl: 5.2.2
libpve-http-server-perl: 5.2.2
libpve-network-perl: 0.11.2
libpve-rs-perl: 0.9.4
libpve-storage-perl: 8.3.6
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.6.0-2
proxmox-backup-client: 3.4.1-1
proxmox-backup-file-restore: 3.4.1-1
proxmox-firewall: 0.7.1
proxmox-kernel-helper: 8.1.1
proxmox-mail-forward: 0.3.2
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.11
pve-cluster: 8.1.0
pve-container: 5.2.6
pve-docs: 8.4.0
pve-edk2-firmware: 4.2025.02-3
pve-esxi-import-tools: 0.7.4
pve-firewall: 5.1.1
pve-firmware: 3.15-4
pve-ha-manager: 4.0.7
pve-i18n: 3.4.4
pve-qemu-kvm: 9.2.0-5
pve-xtermjs: 5.5.0-2
qemu-server: 8.3.12
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.7-pve2


Code:
root@pve-02:~# corosync-cfgtool -n
Local node ID 2, transport knet
nodeid: 1 reachable
   LINK: 0 udp (192.168.208.20->192.168.208.10) enabled connected mtu: 1397

nodeid: 3 reachable
   LINK: 0 udp (192.168.208.20->192.168.208.30) enabled connected mtu: 1397


Logs: I received a 'trying to fence node...' notification at 3:18, as well as at 3:05 on another node...
Code:
Jun 05 02:01:03 pve-02 corosync[1827]:   [TOTEM ] Retransmit List: 1c276
Jun 05 03:04:33 pve-02 corosync[1827]:   [KNET  ] link: host: 3 link: 0 is down
Jun 05 03:04:33 pve-02 corosync[1827]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jun 05 03:04:33 pve-02 corosync[1827]:   [KNET  ] host: host: 3 has no active links
Jun 05 03:04:34 pve-02 corosync[1827]:   [TOTEM ] Token has not been received in 2737 ms
Jun 05 03:04:39 pve-02 corosync[1827]:   [QUORUM] Sync members[2]: 1 2
Jun 05 03:04:39 pve-02 corosync[1827]:   [QUORUM] Sync left[1]: 3
Jun 05 03:04:39 pve-02 corosync[1827]:   [TOTEM ] A new membership (1.f3f) was formed. Members left: 3
Jun 05 03:04:39 pve-02 corosync[1827]:   [TOTEM ] Failed to receive the leave message. failed: 3
Jun 05 03:04:39 pve-02 corosync[1827]:   [QUORUM] Members[2]: 1 2
Jun 05 03:04:39 pve-02 corosync[1827]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jun 05 03:08:08 pve-02 corosync[1827]:   [KNET  ] link: Resetting MTU for link 0 because host 3 joined
Jun 05 03:08:08 pve-02 corosync[1827]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jun 05 03:08:08 pve-02 corosync[1827]:   [QUORUM] Sync members[3]: 1 2 3
Jun 05 03:08:08 pve-02 corosync[1827]:   [QUORUM] Sync joined[1]: 3
Jun 05 03:08:08 pve-02 corosync[1827]:   [TOTEM ] A new membership (1.f44) was formed. Members joined: 3
Jun 05 03:08:08 pve-02 corosync[1827]:   [QUORUM] Members[3]: 1 2 3
Jun 05 03:08:08 pve-02 corosync[1827]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jun 05 03:08:09 pve-02 corosync[1827]:   [KNET  ] pmtud: Global data MTU changed to: 1397
-- Boot 6b2294757eb24d3196e95b71b1c8be4a --
Jun 05 03:22:47 pve-02 systemd[1]: Starting corosync.service - Corosync Cluster Engine...
Jun 05 03:22:47 pve-02 corosync[1902]:   [MAIN  ] Corosync Cluster Engine  starting up
Jun 05 03:22:47 pve-02 corosync[1902]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf vqsim nozzle snmp pi>
Jun 05 03:22:47 pve-02 corosync[1902]:   [TOTEM ] Initializing transport (Kronosnet).
Jun 05 03:22:47 pve-02 corosync[1902]:   [TOTEM ] totemknet initialized
Jun 05 03:22:47 pve-02 corosync[1902]:   [KNET  ] pmtud: MTU manually set to: 0
Jun 05 03:22:47 pve-02 corosync[1902]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss>
Jun 05 03:22:47 pve-02 corosync[1902]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
Jun 05 03:22:47 pve-02 corosync[1902]:   [QB    ] server name: cmap
Jun 05 03:22:47 pve-02 corosync[1902]:   [SERV  ] Service engine loaded: corosync configuration service [1]
Jun 05 03:22:47 pve-02 corosync[1902]:   [QB    ] server name: cfg
Jun 05 03:22:47 pve-02 corosync[1902]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Jun 05 03:22:47 pve-02 corosync[1902]:   [QB    ] server name: cpg
Jun 05 03:22:47 pve-02 corosync[1902]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
Jun 05 03:22:47 pve-02 corosync[1902]:   [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Jun 05 03:22:47 pve-02 corosync[1902]:   [WD    ] Watchdog not enabled by configuration
Jun 05 03:22:47 pve-02 corosync[1902]:   [WD    ] resource load_15min missing a recovery key.
Jun 05 03:22:47 pve-02 corosync[1902]:   [WD    ] resource memory_used missing a recovery key.
Jun 05 03:22:47 pve-02 corosync[1902]:   [WD    ] no resources configured.
Jun 05 03:22:47 pve-02 corosync[1902]:   [SERV  ] Service engine loaded: corosync watchdog service [7]
Jun 05 03:22:47 pve-02 corosync[1902]:   [QUORUM] Using quorum provider corosync_votequorum
Jun 05 03:22:47 pve-02 corosync[1902]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Jun 05 03:22:47 pve-02 corosync[1902]:   [QB    ] server name: votequorum
Jun 05 03:22:47 pve-02 corosync[1902]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Jun 05 03:22:47 pve-02 corosync[1902]:   [QB    ] server name: quorum
Jun 05 03:22:47 pve-02 corosync[1902]:   [TOTEM ] Configuring link 0
Jun 05 03:22:47 pve-02 corosync[1902]:   [TOTEM ] Configured link number 0: local addr: 192.168.208.20, port=5405
Jun 05 03:22:47 pve-02 corosync[1902]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 0)
Jun 05 03:22:47 pve-02 corosync[1902]:   [KNET  ] host: host: 1 has no active links
Jun 05 03:22:47 pve-02 corosync[1902]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun 05 03:22:47 pve-02 corosync[1902]:   [KNET  ] host: host: 1 has no active links
Jun 05 03:22:47 pve-02 corosync[1902]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun 05 03:22:47 pve-02 corosync[1902]:   [KNET  ] host: host: 1 has no active links
Jun 05 03:22:47 pve-02 corosync[1902]:   [KNET  ] link: Resetting MTU for link 0 because host 2 joined
Jun 05 03:22:47 pve-02 corosync[1902]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jun 05 03:22:47 pve-02 corosync[1902]:   [KNET  ] host: host: 3 has no active links
Jun 05 03:22:47 pve-02 corosync[1902]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jun 05 03:22:47 pve-02 corosync[1902]:   [KNET  ] host: host: 3 has no active links
Jun 05 03:22:47 pve-02 corosync[1902]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jun 05 03:22:47 pve-02 corosync[1902]:   [KNET  ] host: host: 3 has no active links
Jun 05 03:22:47 pve-02 corosync[1902]:   [QUORUM] Sync members[1]: 2
Jun 05 03:22:47 pve-02 corosync[1902]:   [QUORUM] Sync joined[1]: 2
Jun 05 03:22:47 pve-02 corosync[1902]:   [TOTEM ] A new membership (2.f49) was formed. Members joined: 2
Jun 05 03:22:47 pve-02 corosync[1902]:   [QUORUM] Members[1]: 2
Jun 05 03:22:47 pve-02 corosync[1902]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jun 05 03:22:47 pve-02 systemd[1]: Started corosync.service - Corosync Cluster Engine.
Jun 05 03:22:49 pve-02 corosync[1902]:   [KNET  ] rx: host: 3 link: 0 is up
Jun 05 03:22:49 pve-02 corosync[1902]:   [KNET  ] link: Resetting MTU for link 0 because host 3 joined
Jun 05 03:22:49 pve-02 corosync[1902]:   [KNET  ] rx: host: 1 link: 0 is up
Jun 05 03:22:49 pve-02 corosync[1902]:   [KNET  ] link: Resetting MTU for link 0 because host 1 joined
Jun 05 03:22:49 pve-02 corosync[1902]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jun 05 03:22:49 pve-02 corosync[1902]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun 05 03:22:50 pve-02 corosync[1902]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 0 from 469 to 1397
Jun 05 03:22:50 pve-02 corosync[1902]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
Jun 05 03:22:50 pve-02 corosync[1902]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Jun 05 03:22:50 pve-02 corosync[1902]:   [QUORUM] Sync members[3]: 1 2 3
Jun 05 03:22:50 pve-02 corosync[1902]:   [QUORUM] Sync joined[2]: 1 3
Jun 05 03:22:50 pve-02 corosync[1902]:   [TOTEM ] A new membership (1.f4d) was formed. Members joined: 1 3
Jun 05 03:22:50 pve-02 corosync[1902]:   [QUORUM] This node is within the primary component and will provide service.
Jun 05 03:22:50 pve-02 corosync[1902]:   [QUORUM] Members[3]: 1 2 3
Jun 05 03:22:50 pve-02 corosync[1902]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jun 05 03:37:47 pve-02 corosync[1902]:   [TOTEM ] Retransmit List: 185e
Jun 05 03:49:28 pve-02 corosync[1902]:   [TOTEM ] Retransmit List: 2b5a
Jun 05 04:26:51 pve-02 corosync[1902]:   [TOTEM ] Retransmit List: 6822
Jun 05 05:43:59 pve-02 corosync[1902]:   [TOTEM ] Retransmit List: e578
Jun 05 06:04:12 pve-02 corosync[1902]:   [TOTEM ] Retransmit List: 10637
Jun 05 06:44:00 pve-02 corosync[1902]:   [TOTEM ] Retransmit List: 1468c
Jun 05 06:51:58 pve-02 corosync[1902]:   [TOTEM ] Retransmit List: 15376
Jun 05 07:07:32 pve-02 corosync[1902]:   [TOTEM ] Retransmit List: 16cc4
Jun 05 07:11:28 pve-02 corosync[1902]:   [TOTEM ] Retransmit List: 17317
Jun 05 07:27:08 pve-02 corosync[1902]:   [TOTEM ] Retransmit List: 18c78
Jun 05 07:36:30 pve-02 corosync[1902]:   [TOTEM ] Retransmit List: 19b9b
Jun 05 08:40:26 pve-02 corosync[1902]:   [TOTEM ] Retransmit List: 20336
 
Just noticed I didn't add the /etc/pve/corosync.conf, so here it is (with the 2nd network added).

The 2nd network runs over the 10Gb link, which is a Linux bridge with multiple VLANs.
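For reference, a simplified sketch of that setup in /etc/network/interfaces (addresses as in the routing table above; the physical 10Gb NIC name is just a placeholder):
Code:
# dedicated corosync NIC
auto ens2
iface ens2 inet static
        address 192.168.208.20/24

# 10Gb uplink as VLAN-aware bridge (physical NIC name is a placeholder)
auto vmbr0
iface vmbr0 inet manual
        bridge-ports enp1s0f0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094

# management VLAN (default gateway)
auto vmbr0.202
iface vmbr0.202 inet static
        address 192.168.202.20/24
        gateway 192.168.202.1

# migration / replication VLAN
auto vmbr0.207
iface vmbr0.207 inet static
        address 192.168.207.20/24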

Code:
root@pve-02:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve-01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.208.10
    ring1_addr: 192.168.202.10
  }
  node {
    name: pve-02
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.208.20
    ring1_addr: 192.168.202.20
  }
  node {
    name: pve-03
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.208.30
    ring1_addr: 192.168.202.30
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Homelab
  config_version: 14
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
 
The second ring you defined (ring1) doesn't actually do anything, as it is not configured.
This can also be seen in the output of corosync-cfgtool -n above.

You need to actually configure a second corosync interface, e.g. your totem section should look more like this:
Code:
totem {
  cluster_name: Homelab
  config_version: 14
  interface {
    knet_link_priority: 100
    linknumber: 0
  }
  interface {
    knet_link_priority: 10
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

Editing /etc/pve/corosync.conf must be done with care; see the example at https://pve.proxmox.com/wiki/Separate_Cluster_Network#Configure_corosync on how to do it properly. Especially pay attention to the config_version entry when updating!
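A rough sketch of the procedure described there: work on a copy, bump config_version, then move it into place so the change is picked up in one go:
Code:
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new
# edit the copy: add the second interface block and increase config_version
nano /etc/pve/corosync.conf.new
mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf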
 
I see, I have changed this now:
Code:
totem {
  cluster_name: Homelab
  config_version: 15
  interface {
    knet_link_priority: 100
    linknumber: 0
  }
  interface {
    knet_link_priority: 99
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

However, I am still hoping to get to the underlying problem; can we troubleshoot that further?
 
What does corosync-cfgtool -n look like now? (Just to confirm :))

W.r.t. the root cause:
Code:
Jun 05 02:01:03 pve-02 corosync[1827]:   [TOTEM ] Retransmit List: 1c276
Jun 05 03:04:33 pve-02 corosync[1827]:   [KNET  ] link: host: 3 link: 0 is down
Jun 05 03:04:33 pve-02 corosync[1827]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jun 05 03:04:33 pve-02 corosync[1827]:   [KNET  ] host: host: 3 has no active links
Jun 05 03:04:34 pve-02 corosync[1827]:   [TOTEM ] Token has not been received in 2737 ms
Jun 05 03:04:39 pve-02 corosync[1827]:   [QUORUM] Sync members[2]: 1 2
Jun 05 03:04:39 pve-02 corosync[1827]:   [QUORUM] Sync left[1]: 3

This suggests that the link is flapping and/or the latency is too high.
Since everything is connected to the same switch, one possible cause could be that the switch simply does not forward packets in time under load.

Could you check whether there is maybe some QoS setting to prioritise the dedicated corosync link?
That's why we generally recommend a separate switch for corosync, at least for the primary link. Even a little, dumb L2 switch is often more than enough, since there are basically no bandwidth requirements, just latency requirements.
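If you want to keep an eye on the links over time, corosync also exposes knet statistics (per-link latency, down counters, ...) via cmap; something like this gives a quick overview (exact key names may differ slightly between versions):
Code:
# current link state as seen from this node
corosync-cfgtool -s
# knet per-link statistics
corosync-cmapctl -m stats | grep -E 'latency|down_count'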
 
This is how it looks now (I don't see any difference?):
Code:
root@pve-02:~# corosync-cfgtool -n
Local node ID 2, transport knet
nodeid: 1 reachable
   LINK: 0 udp (192.168.208.20->192.168.208.10) enabled connected mtu: 1397
   LINK: 1 udp (192.168.202.20->192.168.202.10) enabled connected mtu: 1397

nodeid: 3 reachable
   LINK: 0 udp (192.168.208.20->192.168.208.30) enabled connected mtu: 1397
   LINK: 1 udp (192.168.202.20->192.168.202.30) enabled connected mtu: 1397

Seems like I might have to look into getting a separate switch then... https://community.ui.com/questions/...ate-VLAN/dd75c456-3b0c-4e11-b382-f89b7fdf0d54

Edit: I have turned on flow control on the 10Gb interfaces to see if that fixes anything. Otherwise it seems like adding a switch is the only solution.
 
This is how it looks now (I don't see any difference?):
Yep, that looks good now.
The difference is in the number of LINK: <n> lines - with the old configuration, you had only LINK: 0 per node, now there is LINK: 0 and LINK: 1 for each node.

Looks like a similar problem; a separate switch is definitely a safe bet to avoid future problems. Especially since there are only 3 nodes, a small 4- or 8-port L2 switch would suffice. Corosync only sends small packets, so there aren't really any hard requirements for the switch if it is used only for corosync.