Hi
We are experiencing an issue with one of our clusters.
Based on our observations, TOTEM retransmits start to rise sharply on just one of the nodes. At the same time, log entries like the ones below appear on all servers.
Yesterday, a few minutes after these logs appeared, one of the machines rebooted on its own.
Our analysis suggests that this node had corosync connectivity problems with the others: it was the node with the most link-down messages across the entire cluster. Any ideas on the possible cause?
The environment uses a bond of two 40 Gb Mellanox ConnectX-3 cards as the sole network connection.
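For context, a bond of this kind is typically defined in /etc/network/interfaces roughly as follows. This is an illustrative sketch only; the interface names, bond mode, and hash policy are assumptions, not our exact configuration:

```
auto bond0
iface bond0 inet manual
    # Illustrative slave names - the real Mellanox port names differ
    bond-slaves enp1s0 enp1s0d1
    bond-miimon 100
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4
```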
Code:
cl1kvm4 corosync[1704]: [TOTEM ] Retransmit List: 81df1da
Code:
2025-02-26T11:22:19.799125+01:00 cl11kvm1 corosync[20880]: [KNET ] host: host: 6 (passive) best link: 1 (pri: 1)
2025-02-26T11:22:22.800516+01:00 cl11kvm1 corosync[20880]: [KNET ] rx: host: 6 link: 0 is up
2025-02-26T11:22:22.800615+01:00 cl11kvm1 corosync[20880]: [KNET ] link: Resetting MTU for link 0 because host 6 joined
2025-02-26T11:22:22.800644+01:00 cl11kvm1 corosync[20880]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
2025-02-26T11:22:26.802722+01:00 cl11kvm1 corosync[20880]: [KNET ] link: host: 6 link: 1 is down
2025-02-26T11:22:26.802828+01:00 cl11kvm1 corosync[20880]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
2025-02-26T11:22:29.804381+01:00 cl11kvm1 corosync[20880]: [KNET ] rx: host: 6 link: 1 is up
2025-02-26T11:22:29.804492+01:00 cl11kvm1 corosync[20880]: [KNET ] link: Resetting MTU for link 1 because host 6 joined
2025-02-26T11:22:29.804522+01:00 cl11kvm1 corosync[20880]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
2025-02-26T11:22:35.005860+01:00 cl11kvm1 corosync[20880]: [KNET ] link: host: 6 link: 1 is down
2025-02-26T11:22:35.005948+01:00 cl11kvm1 corosync[20880]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
2025-02-26T11:22:38.006877+01:00 cl11kvm1 corosync[20880]: [KNET ] link: host: 6 link: 3 is down
2025-02-26T11:22:38.006984+01:00 cl11kvm1 corosync[20880]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
2025-02-26T11:22:39.607842+01:00 cl11kvm1 corosync[20880]: [KNET ] rx: host: 6 link: 1 is up
2025-02-26T11:22:39.607907+01:00 cl11kvm1 corosync[20880]: [KNET ] link: Resetting MTU for link 1 because host 6 joined
2025-02-26T11:22:39.607936+01:00 cl11kvm1 corosync[20880]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
2025-02-26T11:22:41.008442+01:00 cl11kvm1 corosync[20880]: [KNET ] rx: host: 6 link: 3 is up
2025-02-26T11:22:41.008561+01:00 cl11kvm1 corosync[20880]: [KNET ] link: Resetting MTU for link 3 because host 6 joined
2025-02-26T11:22:41.008589+01:00 cl11kvm1 corosync[20880]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
2025-02-26T11:22:53.411874+01:00 cl11kvm1 corosync[20880]: [KNET ] link: host: 6 link: 2 is down
2025-02-26T11:22:53.411972+01:00 cl11kvm1 corosync[20880]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
2025-02-26T11:22:57.612968+01:00 cl11kvm1 corosync[20880]: [KNET ] link: host: 6 link: 3 is down
2025-02-26T11:22:57.613007+01:00 cl11kvm1 corosync[20880]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
2025-02-26T11:22:57.813255+01:00 cl11kvm1 corosync[20880]: [KNET ] rx: host: 6 link: 2 is up
2025-02-26T11:22:57.813323+01:00 cl11kvm1 corosync[20880]: [KNET ] link: Resetting MTU for link 2 because host 6 joined
2025-02-26T11:22:57.813357+01:00 cl11kvm1 corosync[20880]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
2025-02-26T11:23:02.014795+01:00 cl11kvm1 corosync[20880]: [KNET ] rx: host: 6 link: 3 is up
2025-02-26T11:23:02.014954+01:00 cl11kvm1 corosync[20880]: [KNET ] link: Resetting MTU for link 3 because host 6 joined
2025-02-26T11:23:02.014983+01:00 cl11kvm1 corosync[20880]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
2025-02-26T11:23:04.615456+01:00 cl11kvm1 corosync[20880]: [KNET ] link: host: 6 link: 1 is down
2025-02-26T11:23:04.615560+01:00 cl11kvm1 corosync[20880]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
2025-02-26T11:23:07.616839+01:00 cl11kvm1 corosync[20880]: [KNET ] rx: host: 6 link: 1 is up
2025-02-26T11:23:07.616992+01:00 cl11kvm1 corosync[20880]: [KNET ] link: Resetting MTU for link 1 because host 6 joined
2025-02-26T11:23:07.617021+01:00 cl11kvm1 corosync[20880]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
2025-02-26T11:23:42.835208+01:00 cl11kvm1 corosync[20880]: [KNET ] link: host: 6 link: 1 is down
2025-02-26T11:23:42.835325+01:00 cl11kvm1 corosync[20880]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
2025-02-26T11:23:45.636956+01:00 cl11kvm1 corosync[20880]: [KNET ] link: host: 6 link: 0 is down
2025-02-26T11:23:45.637112+01:00 cl11kvm1 corosync[20880]: [KNET ] link: host: 6 link: 2 is down
2025-02-26T11:23:45.637147+01:00 cl11kvm1 corosync[20880]: [KNET ] link: host: 6 link: 3 is down
2025-02-26T11:23:45.637192+01:00 cl11kvm1 corosync[20880]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
2025-02-26T11:23:45.637226+01:00 cl11kvm1 corosync[20880]: [KNET ] host: host: 6 has no active links
2025-02-26T11:23:45.637268+01:00 cl11kvm1 corosync[20880]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
2025-02-26T11:23:45.637296+01:00 cl11kvm1 corosync[20880]: [KNET ] host: host: 6 has no active links
2025-02-26T11:23:45.637373+01:00 cl11kvm1 corosync[20880]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
2025-02-26T11:23:45.637398+01:00 cl11kvm1 corosync[20880]: [KNET ] host: host: 6 has no active links
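To quantify which knet link is flapping most often, the log excerpt can be tallied with a short script. This is only a sketch: the sample lines below are a few state changes copied from the excerpt above, and in practice you would feed it the full syslog instead:

```python
import re
from collections import Counter

# A few KNET state-change lines copied from the excerpt above.
sample = """\
[KNET  ] link: host: 6 link: 1 is down
[KNET  ] rx: host: 6 link: 1 is up
[KNET  ] link: host: 6 link: 1 is down
[KNET  ] link: host: 6 link: 0 is down
"""

# Matches both "link: host: N link: M is down" and "rx: host: N link: M is up".
pattern = re.compile(r"host: (\d+) link: (\d+) is (up|down)")

flaps = Counter()
for line in sample.splitlines():
    m = pattern.search(line)
    if m:
        host, link, state = m.groups()
        flaps[(host, link, state)] += 1

for (host, link, state), n in sorted(flaps.items()):
    print(f"host {host} link {link}: {state} x{n}")
```

Running this over the full log makes it easy to see whether one specific link (and therefore one physical port or switch path) accounts for most of the instability.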
Output of pveversion -v:
Code:
proxmox-ve: 8.2.0 (running kernel: 6.5.13-6-pve)
pve-manager: 8.2.2 (running version: 8.2.2/9355359cd7afbae4)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.4-2
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
proxmox-kernel-6.5.13-6-pve-signed: 6.5.13-6
proxmox-kernel-6.5: 6.5.13-6
ceph: 18.2.4-pve3
ceph-fuse: 18.2.4-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.6
libpve-cluster-perl: 8.0.6
libpve-common-perl: 8.2.1
libpve-guest-common-perl: 5.1.1
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.2.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
openvswitch-switch: 3.1.0-2+deb12u1
proxmox-backup-client: 3.2.0-1
proxmox-backup-file-restore: 3.2.0-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.6
proxmox-widget-toolkit: 4.2.1
pve-cluster: 8.0.6
pve-container: 5.0.10
pve-docs: 8.2.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.0
pve-firewall: 5.0.5
pve-firmware: 3.11-1
pve-ha-manager: 4.0.4
pve-i18n: 3.2.2
pve-qemu-kvm: 8.1.5-5
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.1
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.3-pve2