Nodes constantly fencing for no obvious reason

Pintjes

Hi all,

Today I added dedicated corosync interfaces and a dedicated network, thinking that would fix my 'nodes fencing for no reason' problem... However, the nodes are still fencing for no obvious reason, even more than before (like every 15 minutes).

Can anyone help me troubleshoot this? The issue is driving me crazy...

Some extra info:

- 3 node cluster.
- The 2 nodes that have running VMs (and that also have ZFS replication between them) are the ones that are constantly fencing...
- Everything (data & corosync) is connected to the same switch, but the switch should be more than capable of handling its load.

Thank you all in advance!
 
Hi,

- Everything (data & corosync) is connected to the same switch, but the switch should be more than capable of handling its load.
"Capable of handling the load" isn't the same thing as low latency. Corosync requires a low-latency connection (< 5 ms, normally), which is why we recommend separate hardware (NICs and switches) for this.

Have you verified that corosync indeed uses the dedicated corosync link?
Have you also verified that the ZFS replication takes the data link, not the corosync link?

Please also provide the output of
- pveversion -v
- corosync-cfgtool -n
- cat /etc/pve/corosync.conf
- The corosync logs (journalctl -u corosync) around the time of such a fencing event
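
As a rough first check (the IP is a placeholder for the corosync address of a peer node), link state and latency on the corosync network can be verified with something like:
Code:
# show the state and local address of each corosync/knet link on this node
corosync-cfgtool -s

# sample latency towards a peer on the corosync subnet; sustained values above a few ms are a problem
ping -c 100 -i 0.2 <corosync-IP-of-peer>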
 
Hi Cheiss,

As for using the dedicated interface, I have checked the routing table and that says the following:
Code:
root@pve-02:~# ip route show
default via 192.168.202.1 dev vmbr0.202 proto kernel onlink
192.168.202.0/24 dev vmbr0.202 proto kernel scope link src 192.168.202.20
192.168.207.0/24 dev vmbr0.207 proto kernel scope link src 192.168.207.20
192.168.208.0/24 dev ens2 proto kernel scope link src 192.168.208.20
Which, if I'm not mistaken, should mean it uses 'ens2' (the dedicated interface) for its corosync network, namely 192.168.208.0/24.
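
(As a side note, I assume a more direct way to confirm this than the routing table would be to check which local address corosync has actually bound its knet sockets to, e.g.:)
Code:
# UDP sockets opened by corosync (knet uses port 5405 by default)
ss -uapn | grep corosync

# per-link status as corosync itself sees it
corosync-cfgtool -s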

I have also set up a separate migration network that uses the 10Gb connection; replication should use this as well, since that was the only interface available before I added the dedicated corosync interface yesterday.
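
(For reference, the migration network is set cluster-wide in /etc/pve/datacenter.cfg; the entry should look roughly like this - the subnet below is just an illustration, not necessarily my actual config:)
Code:
# /etc/pve/datacenter.cfg (illustrative subnet)
migration: secure,network=192.168.207.0/24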

Code:
root@pve-02:~# pveversion -v
proxmox-ve: 8.4.0 (running kernel: 6.8.12-11-pve)
pve-manager: 8.4.1 (running version: 8.4.1/2a5fa54a8503f96d)
proxmox-kernel-helper: 8.1.1
proxmox-kernel-6.8.12-11-pve-signed: 6.8.12-11
proxmox-kernel-6.8: 6.8.12-11
proxmox-kernel-6.8.12-10-pve-signed: 6.8.12-10
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
ceph-fuse: 17.2.8-pve2
corosync: 3.1.9-pve1
criu: 3.17.1-2+deb12u1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.30-pve2
libproxmox-acme-perl: 1.6.0
libproxmox-backup-qemu0: 1.5.1
libproxmox-rs-perl: 0.3.5
libpve-access-control: 8.2.2
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.1.0
libpve-cluster-perl: 8.1.0
libpve-common-perl: 8.3.1
libpve-guest-common-perl: 5.2.2
libpve-http-server-perl: 5.2.2
libpve-network-perl: 0.11.2
libpve-rs-perl: 0.9.4
libpve-storage-perl: 8.3.6
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.6.0-2
proxmox-backup-client: 3.4.1-1
proxmox-backup-file-restore: 3.4.1-1
proxmox-firewall: 0.7.1
proxmox-kernel-helper: 8.1.1
proxmox-mail-forward: 0.3.2
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.11
pve-cluster: 8.1.0
pve-container: 5.2.6
pve-docs: 8.4.0
pve-edk2-firmware: 4.2025.02-3
pve-esxi-import-tools: 0.7.4
pve-firewall: 5.1.1
pve-firmware: 3.15-4
pve-ha-manager: 4.0.7
pve-i18n: 3.4.4
pve-qemu-kvm: 9.2.0-5
pve-xtermjs: 5.5.0-2
qemu-server: 8.3.12
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.7-pve2


Code:
root@pve-02:~# corosync-cfgtool -n
Local node ID 2, transport knet
nodeid: 1 reachable
   LINK: 0 udp (192.168.208.20->192.168.208.10) enabled connected mtu: 1397

nodeid: 3 reachable
   LINK: 0 udp (192.168.208.20->192.168.208.30) enabled connected mtu: 1397


Logs: I received a 'trying to fence node...' notification at 3:18, as well as at 3:05 on another node...
Code:
Jun 05 02:01:03 pve-02 corosync[1827]:   [TOTEM ] Retransmit List: 1c276
Jun 05 03:04:33 pve-02 corosync[1827]:   [KNET  ] link: host: 3 link: 0 is down
Jun 05 03:04:33 pve-02 corosync[1827]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jun 05 03:04:33 pve-02 corosync[1827]:   [KNET  ] host: host: 3 has no active links
Jun 05 03:04:34 pve-02 corosync[1827]:   [TOTEM ] Token has not been received in 2737 ms
Jun 05 03:04:39 pve-02 corosync[1827]:   [QUORUM] Sync members[2]: 1 2
Jun 05 03:04:39 pve-02 corosync[1827]:   [QUORUM] Sync left[1]: 3
Jun 05 03:04:39 pve-02 corosync[1827]:   [TOTEM ] A new membership (1.f3f) was formed. Members left: 3
Jun 05 03:04:39 pve-02 corosync[1827]:   [TOTEM ] Failed to receive the leave message. failed: 3
Jun 05 03:04:39 pve-02 corosync[1827]:   [QUORUM] Members[2]: 1 2
Jun 05 03:04:39 pve-02 corosync[1827]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jun 05 03:08:08 pve-02 corosync[1827]:   [KNET  ] link: Resetting MTU for link 0 because host 3 joined
Jun 05 03:08:08 pve-02 corosync[1827]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jun 05 03:08:08 pve-02 corosync[1827]:   [QUORUM] Sync members[3]: 1 2 3
Jun 05 03:08:08 pve-02 corosync[1827]:   [QUORUM] Sync joined[1]: 3
Jun 05 03:08:08 pve-02 corosync[1827]:   [TOTEM ] A new membership (1.f44) was formed. Members joined: 3
Jun 05 03:08:08 pve-02 corosync[1827]:   [QUORUM] Members[3]: 1 2 3
Jun 05 03:08:08 pve-02 corosync[1827]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jun 05 03:08:09 pve-02 corosync[1827]:   [KNET  ] pmtud: Global data MTU changed to: 1397
-- Boot 6b2294757eb24d3196e95b71b1c8be4a --
Jun 05 03:22:47 pve-02 systemd[1]: Starting corosync.service - Corosync Cluster Engine...
Jun 05 03:22:47 pve-02 corosync[1902]:   [MAIN  ] Corosync Cluster Engine  starting up
Jun 05 03:22:47 pve-02 corosync[1902]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf vqsim nozzle snmp pi>
Jun 05 03:22:47 pve-02 corosync[1902]:   [TOTEM ] Initializing transport (Kronosnet).
Jun 05 03:22:47 pve-02 corosync[1902]:   [TOTEM ] totemknet initialized
Jun 05 03:22:47 pve-02 corosync[1902]:   [KNET  ] pmtud: MTU manually set to: 0
Jun 05 03:22:47 pve-02 corosync[1902]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss>
Jun 05 03:22:47 pve-02 corosync[1902]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
Jun 05 03:22:47 pve-02 corosync[1902]:   [QB    ] server name: cmap
Jun 05 03:22:47 pve-02 corosync[1902]:   [SERV  ] Service engine loaded: corosync configuration service [1]
Jun 05 03:22:47 pve-02 corosync[1902]:   [QB    ] server name: cfg
Jun 05 03:22:47 pve-02 corosync[1902]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Jun 05 03:22:47 pve-02 corosync[1902]:   [QB    ] server name: cpg
Jun 05 03:22:47 pve-02 corosync[1902]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
Jun 05 03:22:47 pve-02 corosync[1902]:   [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Jun 05 03:22:47 pve-02 corosync[1902]:   [WD    ] Watchdog not enabled by configuration
Jun 05 03:22:47 pve-02 corosync[1902]:   [WD    ] resource load_15min missing a recovery key.
Jun 05 03:22:47 pve-02 corosync[1902]:   [WD    ] resource memory_used missing a recovery key.
Jun 05 03:22:47 pve-02 corosync[1902]:   [WD    ] no resources configured.
Jun 05 03:22:47 pve-02 corosync[1902]:   [SERV  ] Service engine loaded: corosync watchdog service [7]
Jun 05 03:22:47 pve-02 corosync[1902]:   [QUORUM] Using quorum provider corosync_votequorum
Jun 05 03:22:47 pve-02 corosync[1902]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Jun 05 03:22:47 pve-02 corosync[1902]:   [QB    ] server name: votequorum
Jun 05 03:22:47 pve-02 corosync[1902]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Jun 05 03:22:47 pve-02 corosync[1902]:   [QB    ] server name: quorum
Jun 05 03:22:47 pve-02 corosync[1902]:   [TOTEM ] Configuring link 0
Jun 05 03:22:47 pve-02 corosync[1902]:   [TOTEM ] Configured link number 0: local addr: 192.168.208.20, port=5405
Jun 05 03:22:47 pve-02 corosync[1902]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 0)
Jun 05 03:22:47 pve-02 corosync[1902]:   [KNET  ] host: host: 1 has no active links
Jun 05 03:22:47 pve-02 corosync[1902]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun 05 03:22:47 pve-02 corosync[1902]:   [KNET  ] host: host: 1 has no active links
Jun 05 03:22:47 pve-02 corosync[1902]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun 05 03:22:47 pve-02 corosync[1902]:   [KNET  ] host: host: 1 has no active links
Jun 05 03:22:47 pve-02 corosync[1902]:   [KNET  ] link: Resetting MTU for link 0 because host 2 joined
Jun 05 03:22:47 pve-02 corosync[1902]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jun 05 03:22:47 pve-02 corosync[1902]:   [KNET  ] host: host: 3 has no active links
Jun 05 03:22:47 pve-02 corosync[1902]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jun 05 03:22:47 pve-02 corosync[1902]:   [KNET  ] host: host: 3 has no active links
Jun 05 03:22:47 pve-02 corosync[1902]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jun 05 03:22:47 pve-02 corosync[1902]:   [KNET  ] host: host: 3 has no active links
Jun 05 03:22:47 pve-02 corosync[1902]:   [QUORUM] Sync members[1]: 2
Jun 05 03:22:47 pve-02 corosync[1902]:   [QUORUM] Sync joined[1]: 2
Jun 05 03:22:47 pve-02 corosync[1902]:   [TOTEM ] A new membership (2.f49) was formed. Members joined: 2
Jun 05 03:22:47 pve-02 corosync[1902]:   [QUORUM] Members[1]: 2
Jun 05 03:22:47 pve-02 corosync[1902]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jun 05 03:22:47 pve-02 systemd[1]: Started corosync.service - Corosync Cluster Engine.
Jun 05 03:22:49 pve-02 corosync[1902]:   [KNET  ] rx: host: 3 link: 0 is up
Jun 05 03:22:49 pve-02 corosync[1902]:   [KNET  ] link: Resetting MTU for link 0 because host 3 joined
Jun 05 03:22:49 pve-02 corosync[1902]:   [KNET  ] rx: host: 1 link: 0 is up
Jun 05 03:22:49 pve-02 corosync[1902]:   [KNET  ] link: Resetting MTU for link 0 because host 1 joined
Jun 05 03:22:49 pve-02 corosync[1902]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jun 05 03:22:49 pve-02 corosync[1902]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun 05 03:22:50 pve-02 corosync[1902]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 0 from 469 to 1397
Jun 05 03:22:50 pve-02 corosync[1902]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
Jun 05 03:22:50 pve-02 corosync[1902]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Jun 05 03:22:50 pve-02 corosync[1902]:   [QUORUM] Sync members[3]: 1 2 3
Jun 05 03:22:50 pve-02 corosync[1902]:   [QUORUM] Sync joined[2]: 1 3
Jun 05 03:22:50 pve-02 corosync[1902]:   [TOTEM ] A new membership (1.f4d) was formed. Members joined: 1 3
Jun 05 03:22:50 pve-02 corosync[1902]:   [QUORUM] This node is within the primary component and will provide service.
Jun 05 03:22:50 pve-02 corosync[1902]:   [QUORUM] Members[3]: 1 2 3
Jun 05 03:22:50 pve-02 corosync[1902]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jun 05 03:37:47 pve-02 corosync[1902]:   [TOTEM ] Retransmit List: 185e
Jun 05 03:49:28 pve-02 corosync[1902]:   [TOTEM ] Retransmit List: 2b5a
Jun 05 04:26:51 pve-02 corosync[1902]:   [TOTEM ] Retransmit List: 6822
Jun 05 05:43:59 pve-02 corosync[1902]:   [TOTEM ] Retransmit List: e578
Jun 05 06:04:12 pve-02 corosync[1902]:   [TOTEM ] Retransmit List: 10637
Jun 05 06:44:00 pve-02 corosync[1902]:   [TOTEM ] Retransmit List: 1468c
Jun 05 06:51:58 pve-02 corosync[1902]:   [TOTEM ] Retransmit List: 15376
Jun 05 07:07:32 pve-02 corosync[1902]:   [TOTEM ] Retransmit List: 16cc4
Jun 05 07:11:28 pve-02 corosync[1902]:   [TOTEM ] Retransmit List: 17317
Jun 05 07:27:08 pve-02 corosync[1902]:   [TOTEM ] Retransmit List: 18c78
Jun 05 07:36:30 pve-02 corosync[1902]:   [TOTEM ] Retransmit List: 19b9b
Jun 05 08:40:26 pve-02 corosync[1902]:   [TOTEM ] Retransmit List: 20336
 
To start with, I'd add your other network as another link. I'd think it's better to have two unstable links than one.
Sure, I can do that, but I would still like to uncover the root cause. Will do that though!
 
Just noticed I didn't add the /etc/pve/corosync.conf, so here it is (with the 2nd network added).

The 2nd network runs over the 10Gb networking, which is a Linux bridge with multiple VLANs.

Code:
root@pve-02:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve-01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.208.10
    ring1_addr: 192.168.202.10
  }
  node {
    name: pve-02
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.208.20
    ring1_addr: 192.168.202.20
  }
  node {
    name: pve-03
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.208.30
    ring1_addr: 192.168.202.30
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Homelab
  config_version: 14
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
 
The second ring you defined (ring1) doesn't actually do anything, as it is not configured.
This can also be seen in the output of corosync-cfgtool -n above.

You need to actually configure a second corosync interface, e.g. your totem section should look more like this:
Code:
totem {
  cluster_name: Homelab
  config_version: 14
  interface {
    knet_link_priority: 100
    linknumber: 0
  }
  interface {
    knet_link_priority: 10
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

Editing /etc/pve/corosync.conf must be done with care; see the example at https://pve.proxmox.com/wiki/Separate_Cluster_Network#Configure_corosync on how to do it properly. Especially pay attention to the config_version entry when updating!
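
The procedure from that wiki page boils down to roughly the following (a sketch - work on a quorate node and double-check before activating):
Code:
# never edit the live file directly - work on a copy
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new
nano /etc/pve/corosync.conf.new        # add the second interface block AND bump config_version

# keep a backup, then put the new config in place; it gets propagated to all nodes
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.bak
mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf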
 
I see, I have changed this now:
Code:
totem {
  cluster_name: Homelab
  config_version: 15
  interface {
    knet_link_priority: 100
    linknumber: 0
  }
  interface {
    knet_link_priority: 99
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

However, I am still hoping to get to the underlying problem - can we troubleshoot that further?
 
What does corosync-cfgtool -n look like now? (Just to confirm :))

W.r.t. the root cause:
Code:
Jun 05 02:01:03 pve-02 corosync[1827]:   [TOTEM ] Retransmit List: 1c276
Jun 05 03:04:33 pve-02 corosync[1827]:   [KNET  ] link: host: 3 link: 0 is down
Jun 05 03:04:33 pve-02 corosync[1827]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jun 05 03:04:33 pve-02 corosync[1827]:   [KNET  ] host: host: 3 has no active links
Jun 05 03:04:34 pve-02 corosync[1827]:   [TOTEM ] Token has not been received in 2737 ms
Jun 05 03:04:39 pve-02 corosync[1827]:   [QUORUM] Sync members[2]: 1 2
Jun 05 03:04:39 pve-02 corosync[1827]:   [QUORUM] Sync left[1]: 3

This suggests that the link is flappy and/or the latency is too high.
Since everything is connected to the same switch, one possible cause could be that the switch simply does not forward packets in time under load.

Can you check whether there is maybe some QoS setting to prioritise the dedicated corosync links?
That's why we generally recommend a separate switch for corosync, at least for the primary link. Even a small, dumb L2 switch is often more than enough, since there basically aren't any bandwidth requirements - only latency requirements.
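
To gather some evidence on whether it really is latency spikes, one option is to continuously ping the corosync address of another node and only log the slow replies, roughly like this (IP and threshold are just examples):
Code:
# print only pings slower than 5 ms, with a timestamp prefix (-D)
ping -D -i 0.5 192.168.208.30 | awk -F'time=' 'NF > 1 && ($2 + 0) > 5'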
 
This is how it looks now (I don't see any difference?):
Code:
root@pve-02:~# corosync-cfgtool -n
Local node ID 2, transport knet
nodeid: 1 reachable
   LINK: 0 udp (192.168.208.20->192.168.208.10) enabled connected mtu: 1397
   LINK: 1 udp (192.168.202.20->192.168.202.10) enabled connected mtu: 1397

nodeid: 3 reachable
   LINK: 0 udp (192.168.208.20->192.168.208.30) enabled connected mtu: 1397
   LINK: 1 udp (192.168.202.20->192.168.202.30) enabled connected mtu: 1397

Seems like I might have to look into getting a separate switch then... https://community.ui.com/questions/...ate-VLAN/dd75c456-3b0c-4e11-b382-f89b7fdf0d54

Edit: I have turned on flow control on the 10 Gb interfaces to see if that fixes anything. Otherwise it seems like adding a switch is the only solution.
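
(In case it helps anyone else, this is roughly how flow control can be checked and toggled per NIC - the interface name is just an example:)
Code:
# current pause-frame (flow control) settings of the 10Gb interface
ethtool -a ens1f0

# enable rx/tx flow control; the switch port has to support/allow it as well
ethtool -A ens1f0 rx on tx on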
 
This is how it looks now (I don't see any difference?):
Yep, that looks good now.
The difference is in the number of LINK: <n> lines - with the old configuration, you had only LINK: 0 per node, now there is LINK: 0 and LINK: 1 for each node.

Looks like a similar problem; a separate switch is definitely a safe bet to avoid future problems. But especially since there are only 3 nodes, a small 4 or 8 port L2 switch would suffice. Corosync only sends small packets, so there aren't really any hard requirements for the switch if it is used only for corosync.
 
Today I added a dedicated switch for the corosync network, which sadly didn't fix the issue... The corosync ports (192.168.208.0/24) are the only connected ports besides the uplink to the main switch (to which it was previously connected).

Edit: Would it be helpful to open a ticket with Unifi as well to troubleshoot further, or does the dedicated switch mean the problem lies on my Proxmox hosts?

Code:
root@pve-02:~# corosync-cfgtool -n
Local node ID 2, transport knet
nodeid: 1 reachable
   LINK: 0 udp (192.168.208.20->192.168.208.10) enabled connected mtu: 1397
   LINK: 1 udp (192.168.202.20->192.168.202.10) enabled connected mtu: 1397

Journal logs - I received the fencing notification at 19:37. (Had to shorten this message due to the char limit.)
Code:
root@pve-02:~# journalctl -r --since 19:30:00
Jun 07 19:40:24 pve-02 corosync[1886]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 100)
Jun 07 19:40:24 pve-02 corosync[1886]:   [KNET  ] link: Resetting MTU for link 1 because host 3 joined
Jun 07 19:40:24 pve-02 corosync[1886]:   [KNET  ] rx: host: 3 link: 1 is up
Jun 07 19:40:24 pve-02 corosync[1886]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 100)
Jun 07 19:40:24 pve-02 corosync[1886]:   [KNET  ] link: Resetting MTU for link 0 because host 3 joined
Jun 07 19:40:24 pve-02 corosync[1886]:   [KNET  ] rx: host: 3 link: 0 is up
Jun 07 19:39:46 pve-02 sshd[339063]: pam_env(sshd:session): deprecated reading of user environment enabled
Jun 07 19:39:46 pve-02 systemd[1]: Started session-152.scope - Session 152 of User root.
Jun 07 19:39:46 pve-02 systemd-logind[1368]: New session 152 of user root.
Jun 07 19:39:46 pve-02 sshd[339063]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Jun 07 19:39:46 pve-02 sshd[339063]: Accepted publickey for root from 192.168.1.190 port 52639 ssh2: ED25519 SHA256:ztuca9mPrV9XE6DFmNeA35wUf6qc4MwFV4ZJf4hXH7E
Jun 07 19:38:28 pve-02 pve-ha-lrm[337534]: service status vm:108 started
Jun 07 19:38:28 pve-02 pve-ha-lrm[337534]: <root@pam> end task UPID:pve-02:00052682:0033355D:684478F6:qmstart:108:root@pam: OK
Jun 07 19:38:28 pve-02 pve-ha-lrm[337538]: VM 108 started with PID 337875.
Jun 07 19:38:28 pve-02 pve-ha-lrm[337533]: service status vm:103 started
Jun 07 19:38:28 pve-02 pve-ha-lrm[337533]: <root@pam> end task UPID:pve-02:00052681:0033355D:684478F6:qmstart:103:root@pam: OK
Jun 07 19:38:28 pve-02 pve-ha-lrm[337537]: VM 103 started with PID 337872.
Jun 07 19:38:28 pve-02 kernel: vmbr0: port 5(tap108i0) entered forwarding state
Jun 07 19:38:28 pve-02 kernel: vmbr0: port 5(tap108i0) entered blocking state
Jun 07 19:38:28 pve-02 kernel: tap108i0: entered allmulticast mode
Jun 07 19:38:28 pve-02 kernel: vmbr0: port 5(tap108i0) entered disabled state
Jun 07 19:38:28 pve-02 kernel: vmbr0: port 5(tap108i0) entered blocking state
Jun 07 19:38:28 pve-02 kernel: tap108i0: entered promiscuous mode
Jun 07 19:38:28 pve-02 kernel: vmbr0: port 4(tap103i0) entered forwarding state
Jun 07 19:38:28 pve-02 kernel: vmbr0: port 4(tap103i0) entered blocking state
Jun 07 19:38:28 pve-02 kernel: tap103i0: entered allmulticast mode
Jun 07 19:38:28 pve-02 kernel: vmbr0: port 4(tap103i0) entered disabled state
Jun 07 19:38:28 pve-02 kernel: vmbr0: port 4(tap103i0) entered blocking state
Jun 07 19:38:27 pve-02 kernel: tap103i0: entered promiscuous mode
Jun 07 19:38:26 pve-02 systemd[1]: Started 108.scope.
Jun 07 19:38:26 pve-02 systemd[1]: Started 103.scope.
Jun 07 19:38:23 pve-02 pve-ha-lrm[337533]: Task 'UPID:pve-02:00052681:0033355D:684478F6:qmstart:103:root@pam:' still active, waiting
Jun 07 19:38:23 pve-02 pve-ha-lrm[337534]: Task 'UPID:pve-02:00052682:0033355D:684478F6:qmstart:108:root@pam:' still active, waiting
Jun 07 19:38:18 pve-02 pve-ha-lrm[337533]: Task 'UPID:pve-02:00052681:0033355D:684478F6:qmstart:103:root@pam:' still active, waiting
Jun 07 19:38:18 pve-02 pve-ha-lrm[337534]: Task 'UPID:pve-02:00052682:0033355D:684478F6:qmstart:108:root@pam:' still active, waiting
Jun 07 19:38:15 pve-02 pvescheduler[337635]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Jun 07 19:38:14 pve-02 perl[337634]: notified via target `Discord`
Jun 07 19:38:14 pve-02 perl[337634]: could not render value null with renderer Timestamp
Jun 07 19:38:14 pve-02 perl[337634]: notified via target `Pushover-high`
Jun 07 19:38:13 pve-02 pve-ha-lrm[337533]: Task 'UPID:pve-02:00052681:0033355D:684478F6:qmstart:103:root@pam:' still active, waiting
Jun 07 19:38:13 pve-02 pve-ha-lrm[337534]: Task 'UPID:pve-02:00052682:0033355D:684478F6:qmstart:108:root@pam:' still active, waiting
Jun 07 19:38:13 pve-02 perl[337634]: could not render value null with renderer Timestamp
Jun 07 19:38:13 pve-02 perl[337634]: notified via target `SMTP`
Jun 07 19:38:13 pve-02 perl[337634]: could not render value null with renderer Timestamp
Jun 07 19:38:13 pve-02 perl[337634]: could not render value null with renderer Timestamp
Jun 07 19:38:13 pve-02 pvescheduler[337634]: 108-0: got unexpected replication job error - command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve-03' -o 'UserKnownHostsFile=/etc/pve/nodes/pve-03/ssh_known_hosts' -o 'Glob>
Jun 07 19:38:11 pve-02 perl[337634]: notified via target `Discord`
Jun 07 19:38:10 pve-02 perl[337634]: could not render value null with renderer Timestamp
Jun 07 19:38:10 pve-02 perl[337634]: notified via target `SMTP`
Jun 07 19:38:10 pve-02 perl[337634]: could not render value null with renderer Timestamp
Jun 07 19:38:10 pve-02 perl[337634]: could not render value null with renderer Timestamp
Jun 07 19:38:09 pve-02 perl[337634]: notified via target `Pushover-high`
Jun 07 19:38:09 pve-02 perl[337634]: could not render value null with renderer Timestamp
Jun 07 19:38:09 pve-02 pvescheduler[337634]: 103-0: got unexpected replication job error - command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve-03' -o 'UserKnownHostsFile=/etc/pve/nodes/pve-03/ssh_known_hosts' -o 'Glob>
Jun 07 19:38:08 pve-02 pve-ha-lrm[337534]: Task 'UPID:pve-02:00052682:0033355D:684478F6:qmstart:108:root@pam:' still active, waiting
Jun 07 19:38:08 pve-02 pve-ha-lrm[337533]: Task 'UPID:pve-02:00052681:0033355D:684478F6:qmstart:103:root@pam:' still active, waiting
Jun 07 19:38:03 pve-02 pve-ha-lrm[337534]: Task 'UPID:pve-02:00052682:0033355D:684478F6:qmstart:108:root@pam:' still active, waiting
Jun 07 19:38:03 pve-02 pve-ha-lrm[337533]: Task 'UPID:pve-02:00052681:0033355D:684478F6:qmstart:103:root@pam:' still active, waiting
Jun 07 19:37:59 pve-02 pve-ha-lrm[337538]: Use of uninitialized value in split at /usr/share/perl5/PVE/QemuServer/Cloudinit.pm line 106.
Jun 07 19:37:59 pve-02 pve-ha-lrm[337537]: Use of uninitialized value in split at /usr/share/perl5/PVE/QemuServer/Cloudinit.pm line 106.
Jun 07 19:37:58 pve-02 pve-ha-lrm[337538]: start VM 108: UPID:pve-02:00052682:0033355D:684478F6:qmstart:108:root@pam:
Jun 07 19:37:58 pve-02 pve-ha-lrm[337537]: start VM 103: UPID:pve-02:00052681:0033355D:684478F6:qmstart:103:root@pam:
Jun 07 19:37:58 pve-02 pve-ha-lrm[337533]: <root@pam> starting task UPID:pve-02:00052681:0033355D:684478F6:qmstart:103:root@pam:
Jun 07 19:37:58 pve-02 pve-ha-lrm[337534]: <root@pam> starting task UPID:pve-02:00052682:0033355D:684478F6:qmstart:108:root@pam:
Jun 07 19:37:58 pve-02 pve-ha-lrm[337534]: starting service vm:108
Jun 07 19:37:58 pve-02 pve-ha-lrm[337533]: starting service vm:103
Jun 07 19:37:58 pve-02 pve-ha-lrm[1969]: status change wait_for_agent_lock => active
Jun 07 19:37:58 pve-02 pve-ha-lrm[1969]: watchdog active
Jun 07 19:37:58 pve-02 pve-ha-lrm[1969]: successfully acquired lock 'ha_agent_pve-02_lock'
Jun 07 19:37:15 pve-02 pvescheduler[337026]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Jun 07 19:36:52 pve-02 pmxcfs[1811]: [status] notice: dfsm_deliver_queue: queue length 2
Jun 07 19:36:52 pve-02 pmxcfs[1811]: [status] notice: all data is up to date
Jun 07 19:36:52 pve-02 pmxcfs[1811]: [status] notice: received all states
Jun 07 19:36:52 pve-02 pmxcfs[1811]: [dcdb] notice: dfsm_deliver_queue: queue length 4
Jun 07 19:36:52 pve-02 pmxcfs[1811]: [dcdb] notice: all data is up to date
Jun 07 19:36:52 pve-02 pmxcfs[1811]: [dcdb] notice: synced members: 1/1549, 2/1811
Jun 07 19:36:52 pve-02 pmxcfs[1811]: [dcdb] notice: leader is 1/1549
Jun 07 19:36:52 pve-02 pmxcfs[1811]: [dcdb] notice: received all states
Jun 07 19:36:52 pve-02 pmxcfs[1811]: [status] notice: received sync request (epoch 1/1549/00000010)
Jun 07 19:36:52 pve-02 pmxcfs[1811]: [dcdb] notice: received sync request (epoch 1/1549/00000010)
Jun 07 19:36:52 pve-02 corosync[1886]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jun 07 19:36:52 pve-02 corosync[1886]:   [QUORUM] Members[2]: 1 2
Jun 07 19:36:52 pve-02 pmxcfs[1811]: [status] notice: starting data syncronisation
Jun 07 19:36:52 pve-02 pmxcfs[1811]: [status] notice: members: 1/1549, 2/1811
Jun 07 19:36:52 pve-02 pmxcfs[1811]: [dcdb] notice: starting data syncronisation
Jun 07 19:36:52 pve-02 pmxcfs[1811]: [dcdb] notice: members: 1/1549, 2/1811
Jun 07 19:36:52 pve-02 corosync[1886]:   [TOTEM ] Failed to receive the leave message. failed: 3
Jun 07 19:36:52 pve-02 corosync[1886]:   [TOTEM ] A new membership (1.f99) was formed. Members left: 3
Jun 07 19:36:52 pve-02 corosync[1886]:   [QUORUM] Sync left[1]: 3
Jun 07 19:36:52 pve-02 corosync[1886]:   [QUORUM] Sync members[2]: 1 2
Jun 07 19:36:47 pve-02 corosync[1886]:   [TOTEM ] Token has not been received in 2737 ms
Jun 07 19:36:46 pve-02 corosync[1886]:   [KNET  ] host: host: 3 has no active links
Jun 07 19:36:46 pve-02 corosync[1886]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 100)
Jun 07 19:36:46 pve-02 corosync[1886]:   [KNET  ] host: host: 3 has no active links
Jun 07 19:36:46 pve-02 corosync[1886]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 100)
Jun 07 19:36:46 pve-02 corosync[1886]:   [KNET  ] link: host: 3 link: 1 is down
Jun 07 19:36:46 pve-02 corosync[1886]:   [KNET  ] link: host: 3 link: 0 is down
Jun 07 19:35:34 pve-02 sshd[336048]: pam_env(sshd:session): deprecated reading of user environment enabled
Jun 07 19:35:34 pve-02 systemd[1]: Started session-150.scope - Session 150 of User root.
Jun 07 19:35:34 pve-02 systemd[1]: Started user@0.service - User Manager for UID 0.
Jun 07 19:35:34 pve-02 systemd[336059]: Startup finished in 269ms.
Jun 07 19:35:34 pve-02 systemd[336059]: Reached target default.target - Main User Target.
Jun 07 19:35:34 pve-02 systemd[336059]: Reached target basic.target - Basic System.
Jun 07 19:35:34 pve-02 systemd[336059]: Reached target sockets.target - Sockets.
Jun 07 19:35:34 pve-02 systemd[336059]: Listening on gpg-agent.socket - GnuPG cryptographic agent and passphrase cache.
Jun 07 19:35:34 pve-02 systemd[336059]: Listening on gpg-agent-ssh.socket - GnuPG cryptographic agent (ssh-agent emulation).
Jun 07 19:35:34 pve-02 systemd[336059]: Listening on gpg-agent-extra.socket - GnuPG cryptographic agent and passphrase cache (restricted).
Jun 07 19:35:34 pve-02 systemd[336059]: Listening on gpg-agent-browser.socket - GnuPG cryptographic agent and passphrase cache (access for web browsers).
 
First off, I see quite a few other, seemingly unrelated errors in the log, such as

Code:
Jun 07 19:38:15 pve-02 pvescheduler[337635]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
[..]
Jun 07 19:38:14 pve-02 perl[337634]: could not render value null with renderer Timestamp
[..]
Jun 07 19:38:09 pve-02 pvescheduler[337634]: 103-0: got unexpected replication job error - command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve-03' -o 'UserKnownHostsFile=/etc/pve/nodes/pve-03/ssh_known_hosts' -o 'Glob
Jun 07 19:38:08 pve-02 pve-ha-lrm[337534]: Task 'UPID:pve-02:00052682:0033355D:684478F6:qmstart:108:root@pam:' still active, waiting
Jun 07 19:38:08 pve-02 pve-ha-lrm[337533]: Task 'UPID:pve-02:00052681:0033355D:684478F6:qmstart:103:root@pam:' still active, waiting
Jun 07 19:38:03 pve-02 pve-ha-lrm[337534]: Task 'UPID:pve-02:00052682:0033355D:684478F6:qmstart:108:root@pam:' still active, waiting
Jun 07 19:38:03 pve-02 pve-ha-lrm[337533]: Task 'UPID:pve-02:00052681:0033355D:684478F6:qmstart:103:root@pam:' still active, waiting
Jun 07 19:37:59 pve-02 pve-ha-lrm[337538]: Use of uninitialized value in split at /usr/share/perl5/PVE/QemuServer/Cloudinit.pm line 106.

Can you install debsums (apt update && apt install -y debsums) and provide the output of debsums -c?

Anyway, to the actual problem:
Code:
Jun 07 19:36:52 pve-02 corosync[1886]:   [QUORUM] Sync left[1]: 3
Jun 07 19:36:52 pve-02 corosync[1886]:   [QUORUM] Sync members[2]: 1 2
Jun 07 19:36:47 pve-02 corosync[1886]:   [TOTEM ] Token has not been received in 2737 ms

Looking at the previous log, it was node 3 here too that left / had a flappy link.

What's the hardware configuration of pve-03? Have you checked the physical network cable, in case it's faulty?
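
A quick way to look for physical-layer problems on the corosync NIC would be the error/drop counters, e.g. (ens2 is just the interface name from your routing table, adjust per node):
Code:
# driver statistics - look for CRC/rx errors, drops and carrier changes
ethtool -S ens2 | grep -iE 'err|drop|crc|carrier'

# kernel-level RX/TX error counters for the same interface
ip -s link show ens2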
 
Hi,

Can you install debsums (apt update && apt install -y debsums) and provide the output of debsums -c?
It doesn't provide any output...? Not on the other nodes either...
Code:
root@pve-02:~# debsums -c
root@pve-02:~# debsums -c
root@pve-02:~#
What's the hardware configuration of pve-03? Have you checked the physical network cable, in case it's faulty?
The hardware is the same as node pve-02. I have tried multiple cables to double-check, and sadly that didn't change anything...
 
That's good - if debsums provides no output, that means all installed files verified successfully. :)

The hardware is the same as node pve-02. I have tried multiple cables to double-check, and sadly that didn't change anything...
That it was pve-03 both times is basically the only lead here. It could potentially be a faulty NIC, but that's hard to say, especially if there are no errors/warnings in dmesg.

Looking at the logs again, what's a bit confusing here is that both links seem to fail at the same time:
Code:
Jun 07 19:36:46 pve-02 corosync[1886]:   [KNET  ] link: host: 3 link: 1 is down
Jun 07 19:36:46 pve-02 corosync[1886]:   [KNET  ] link: host: 3 link: 0 is down

What's the exact hardware model of the NICs? Generally, some more hardware information could be useful.

Also, just a sanity (network) check: what does pve-03 resolve to on pve-01 and pve-02, i.e. the output of host pve-03?
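
A sketch of how to gather that hardware information (using ens2 as an example interface name):
Code:
# driver, driver version and firmware of the corosync NIC
ethtool -i ens2

# PCI view, including which kernel driver is bound to each NIC
lspci -nnk | grep -iA3 ethernet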
 
Also, just a sanity (network) check: what does pve-03 resolve to on pve-01 and pve-02, i.e. the output of host pve-03?
Code:
root@pve-01:~# host pve-01
pve-01 has address 192.168.208.10

root@pve-02:~# host pve-02
pve-02 has address 192.168.208.20

root@pve-03:~# host pve-03
pve-03 has address 192.168.208.30

What's the exact hardware model of the NICs? Generally, some more hardware information could be useful.
pve-01: corosync network: built-in NIC; 10 Gb 'data network': Intel X520-DA2 SFP+
pve-02: corosync network: Realtek RTL8125B card; 10 Gb 'data network': HP 647581-B21 FlexFabric SFP+
pve-03: corosync network: Realtek RTL8125B card; 10 Gb 'data network': HP 647581-B21 FlexFabric SFP+

The data network is connected to the SFP+ ports on my Unifi 48-port PoE switch, and the corosync network is connected to a dedicated Unifi USW Flex Mini.

That it was pve-03 both times is basically the only lead here
The fencing, you mean? Could it be a coincidence that I just posted those?
For example this is a snippet from journal on pve-03:
Code:
root@pve-03:~# journalctl -u corosync | grep "is down"
Jan 13 19:04:38 pve-03 corosync[2022]:   [KNET  ] link: host: 2 link: 0 is down
Jan 13 19:04:38 pve-03 corosync[2022]:   [KNET  ] link: host: 1 link: 0 is down
Jan 13 19:05:27 pve-03 corosync[2022]:   [KNET  ] link: host: 2 link: 0 is down
Jan 13 19:05:27 pve-03 corosync[2022]:   [KNET  ] link: host: 1 link: 0 is down
Jan 13 19:20:18 pve-03 corosync[2022]:   [KNET  ] link: host: 2 link: 0 is down
Jan 13 19:24:56 pve-03 corosync[2022]:   [KNET  ] link: host: 1 link: 0 is down
Jan 13 19:25:42 pve-03 corosync[2022]:   [KNET  ] link: host: 2 link: 0 is down
Jan 13 19:25:42 pve-03 corosync[2022]:   [KNET  ] link: host: 1 link: 0 is down
Jan 14 08:49:34 pve-03 corosync[2022]:   [KNET  ] link: host: 2 link: 0 is down
Jan 14 11:21:40 pve-03 corosync[2022]:   [KNET  ] link: host: 1 link: 0 is down
Jan 16 10:49:39 pve-03 corosync[1733]:   [KNET  ] link: host: 1 link: 0 is down
Jan 16 10:58:52 pve-03 corosync[1716]:   [KNET  ] link: host: 1 link: 0 is down
Jan 16 12:23:13 pve-03 corosync[1717]:   [KNET  ] link: host: 1 link: 0 is down
Jan 17 19:45:52 pve-03 corosync[1732]:   [KNET  ] link: host: 2 link: 0 is down
Jan 17 19:45:52 pve-03 corosync[1732]:   [KNET  ] link: host: 1 link: 0 is down
Jan 17 20:20:44 pve-03 corosync[1734]:   [KNET  ] link: host: 1 link: 0 is down
Jan 18 08:41:51 pve-03 corosync[1717]:   [KNET  ] link: host: 2 link: 0 is down
Jan 18 08:41:51 pve-03 corosync[1717]:   [KNET  ] link: host: 1 link: 0 is down


That it was pve-03 both times is basically the only lead here. It could potentially be a faulty NIC, but that's hard to say, especially if there are no errors/warnings in dmesg.
I have created a pastebin of dmesg just to be sure:
pve-01: https://pastebin.com/rHu5HMai
pve-02: https://pastebin.com/FHdgdWb4
pve-03: https://pastebin.com/hWFxUgPm
 
The fencing, you mean? Could it be a coincidence that I just posted those?
Possible - maybe you can check on pve-01 & pve-02 whether the link to the other node was ever lost.
The next log from pve-03 just confirms that pve-03 loses the link to either one of the first two nodes.

From the pve-03 dmesg:
Code:
[68567.585519] INFO: task txg_sync:743 blocked for more than 122 seconds.
[68567.585535]       Tainted: P          IO       6.8.12-11-pve #1
[68567.585539] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[68567.585542] task:txg_sync        state:D stack:0     pid:743   tgid:743   ppid:2      flags:0x00004000
[..]
[68567.589173] INFO: task zvol_tq-1:694190 blocked for more than 122 seconds.
[68567.589178]       Tainted: P          IO       6.8.12-11-pve #1
[68567.589181] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[68567.589184] task:zvol_tq-1       state:D stack:0     pid:694190 tgid:694190 ppid:2      flags:0x00004000
Seems that some ZFS kernel threads get hung up - I'd advise checking the underlying disks.

It's possible that this triggers general system instability and thus further down the line causes these network driver/NIC hiccups on pve-03.
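
A first pass at checking the disks could look like this (device names are placeholders; behind a RAID controller, smartctl usually needs an extra -d option, e.g. -d cciss,N for HP Smart Array):
Code:
# pool-wide error counters and any scrub/resilver findings
zpool status -v

# SMART health of an individual disk (adjust the device path / -d option as needed)
smartctl -a /dev/sda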
 
Possible - maybe you can check on pve-01 & pve-02 whether the link to the other node was ever lost.
pve-01:
Code:
Jun 07 03:15:43 pve-01 corosync[1635]:   [KNET  ] link: host: 2 link: 0 is down
Jun 07 04:09:35 pve-01 corosync[1635]:   [KNET  ] link: host: 3 link: 0 is down
Jun 07 04:09:35 pve-01 corosync[1635]:   [KNET  ] link: host: 3 link: 1 is down
Jun 07 10:15:29 pve-01 corosync[1635]:   [KNET  ] link: host: 2 link: 0 is down
Jun 07 10:15:29 pve-01 corosync[1635]:   [KNET  ] link: host: 2 link: 1 is down
Jun 07 11:35:33 pve-01 corosync[1635]:   [KNET  ] link: host: 3 link: 0 is down
Jun 07 11:56:10 pve-01 corosync[1635]:   [KNET  ] link: host: 3 link: 0 is down
Jun 07 13:01:10 pve-01 corosync[1635]:   [KNET  ] link: host: 2 link: 0 is down
Jun 07 16:27:11 pve-01 corosync[1635]:   [KNET  ] link: host: 3 link: 0 is down
Jun 07 17:20:14 pve-01 corosync[1635]:   [KNET  ] link: host: 2 link: 0 is down
Jun 07 19:11:30 pve-01 corosync[1635]:   [KNET  ] link: host: 3 link: 0 is down
Jun 07 19:11:30 pve-01 corosync[1635]:   [KNET  ] link: host: 2 link: 0 is down
Jun 07 19:13:51 pve-01 corosync[1635]:   [KNET  ] link: host: 3 link: 0 is down
Jun 07 19:13:52 pve-01 corosync[1635]:   [KNET  ] link: host: 2 link: 0 is down
Jun 07 19:14:25 pve-01 corosync[1635]:   [KNET  ] link: host: 3 link: 0 is down
Jun 07 19:14:30 pve-01 corosync[1635]:   [KNET  ] link: host: 2 link: 0 is down
Jun 07 19:36:46 pve-01 corosync[1635]:   [KNET  ] link: host: 3 link: 0 is down
Jun 07 19:36:46 pve-01 corosync[1635]:   [KNET  ] link: host: 3 link: 1 is down
Jun 08 13:32:42 pve-01 corosync[1635]:   [KNET  ] link: host: 2 link: 0 is down
Jun 08 13:32:42 pve-01 corosync[1635]:   [KNET  ] link: host: 2 link: 1 is down
Jun 11 18:05:00 pve-01 corosync[1565]:   [KNET  ] link: host: 2 link: 0 is down
Jun 11 18:05:00 pve-01 corosync[1565]:   [KNET  ] link: host: 2 link: 1 is down

pve-02:
Code:
Jun 05 15:49:50 pve-02 corosync[1902]:   [KNET  ] link: host: 1 link: 0 is down
Jun 05 16:58:50 pve-02 corosync[1902]:   [KNET  ] link: host: 1 link: 0 is down
Jun 06 10:35:18 pve-02 corosync[1902]:   [KNET  ] link: host: 3 link: 0 is down
Jun 06 10:35:18 pve-02 corosync[1902]:   [KNET  ] link: host: 3 link: 1 is down
Jun 06 10:40:42 pve-02 corosync[1902]:   [KNET  ] link: host: 3 link: 1 is down
Jun 06 10:40:42 pve-02 corosync[1902]:   [KNET  ] link: host: 1 link: 1 is down
Jun 06 11:08:51 pve-02 corosync[1902]:   [KNET  ] link: host: 3 link: 0 is down
Jun 06 12:40:22 pve-02 corosync[1814]:   [KNET  ] link: host: 3 link: 0 is down
Jun 06 12:40:22 pve-02 corosync[1814]:   [KNET  ] link: host: 3 link: 1 is down
Jun 06 14:10:36 pve-02 corosync[1826]:   [KNET  ] link: host: 1 link: 0 is down
Jun 06 16:28:41 pve-02 corosync[1826]:   [KNET  ] link: host: 1 link: 0 is down
Jun 06 16:28:41 pve-02 corosync[1826]:   [KNET  ] link: host: 1 link: 1 is down
Jun 06 23:23:17 pve-02 corosync[1826]:   [KNET  ] link: host: 1 link: 0 is down
Jun 07 04:09:35 pve-02 corosync[1826]:   [KNET  ] link: host: 3 link: 0 is down
Jun 07 04:09:35 pve-02 corosync[1826]:   [KNET  ] link: host: 3 link: 1 is down
Jun 07 05:22:18 pve-02 corosync[1826]:   [KNET  ] link: host: 1 link: 0 is down
Jun 07 11:27:15 pve-02 corosync[1886]:   [KNET  ] link: host: 1 link: 0 is down
Jun 07 13:46:13 pve-02 corosync[1886]:   [KNET  ] link: host: 1 link: 0 is down
Jun 07 17:51:45 pve-02 corosync[1886]:   [KNET  ] link: host: 1 link: 0 is down
Jun 07 19:11:31 pve-02 corosync[1886]:   [KNET  ] link: host: 1 link: 0 is down
Jun 07 19:11:59 pve-02 corosync[1886]:   [KNET  ] link: host: 3 link: 0 is down
Jun 07 19:13:51 pve-02 corosync[1886]:   [KNET  ] link: host: 3 link: 0 is down
Jun 07 19:13:52 pve-02 corosync[1886]:   [KNET  ] link: host: 1 link: 0 is down
Jun 07 19:14:25 pve-02 corosync[1886]:   [KNET  ] link: host: 3 link: 0 is down
Jun 07 19:14:30 pve-02 corosync[1886]:   [KNET  ] link: host: 1 link: 0 is down
Jun 07 19:36:46 pve-02 corosync[1886]:   [KNET  ] link: host: 3 link: 0 is down
Jun 07 19:36:46 pve-02 corosync[1886]:   [KNET  ] link: host: 3 link: 1 is down
Jun 11 08:42:37 pve-02 corosync[1877]:   [KNET  ] link: host: 1 link: 0 is down
Jun 11 08:42:37 pve-02 corosync[1877]:   [KNET  ] link: host: 1 link: 1 is down


Seems that some ZFS kernel threads get hung up - I'd advise checking the underlying disks.
How would I go about that? They are normal consumer SSDs connected in the hot-swap bays of the ProLiant DL380p G8 (pve-02 & pve-03) and via SATA to the motherboard on pve-01. The ProLiants use a RAID controller, however. I also found the same ZFS error you mentioned on pve-02, but not on pve-01...

Edit: I think said RAID controller is running in RAID 0 mode, and not in HBA mode. Could that be the issue?
Edit 2: Overall it looks like there are mixed opinions, but I'm guessing I'm going to have to buy an HBA to make sure?... https://forum.proxmox.com/threads/issue-with-zfs-on-hp-smart-arraw-p420i.55368/post-440965
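
(A way to check how the disks are actually exposed to ZFS - my assumption, not verified here, is that if the controller presents RAID 0 logical drives, the MODEL column shows something like 'LOGICAL VOLUME' instead of the SSD model:)
Code:
# what the pools are built on
zpool status

# how the block devices present themselves to the OS
lsblk -o NAME,MODEL,SERIAL,SIZE,ROTA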
 
They are normal consumer SSDs
Using ZFS with consumer SSDs is often problematic on its own - due to the CoW nature of ZFS, it writes a lot more than non-CoW filesystems, and consumer SSDs can get really slow in these cases, even slower than spinning rust.
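
To see whether the SSDs themselves are the bottleneck during replication, the per-vdev latency can be watched live, e.g.:
Code:
# per-disk I/O statistics including average wait/latency, refreshed every 5 seconds
zpool iostat -vl 5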

I think said RAID controller is running in RAID 0 mode, and not in HBA mode.
Using ZFS with hardware RAID is definitely not recommended and should be avoided. It's also unnecessary, as ZFS already does all of that itself.

Overall, I can't say for sure whether these errors are indeed related to the network problems, of course, but hung kernel (I/O) tasks will definitely lead to an unstable system down the road. (At least I wouldn't trust such a system with any kind of workload.)