link: host: 7 link: 0 is down

systemctl

Hello
I have a Proxmox cluster with (currently) 7 nodes.
Some time ago, strange things started to happen: random nodes just drop out of the cluster and then rejoin. Here are the logs:

Code:
Jun 28 19:29:11 prx1 corosync[1867440]: [KNET ] link: host: 7 link: 0 is down
Jun 28 19:29:11 prx1 corosync[1867440]: [KNET ] sctp: Notifying connect thread that sockfd 33 received a link down event
Jun 28 19:29:11 prx1 corosync[1867440]: [KNET ] host: host: 7 (passive) best link: 0 (pri: 1)
Jun 28 19:29:11 prx1 corosync[1867440]: [KNET ] host: host: 7 has no active links
Jun 28 19:29:11 prx1 corosync[1867440]: [TOTEM ] Knet host change callback. nodeid: 7 reachable: 0
Code:
Jun 28 19:29:24 prx1 corosync[1867440]:   [MAIN  ] Member left: r(0) ip(*.*.*.*)
Jun 28 19:29:24 prx1 corosync[1867440]:   [TOTEM ] waiting_trans_ack changed to 1
Jun 28 19:29:24 prx1 corosync[1867440]:   [SYNC  ] call init for locally known services
Jun 28 19:29:24 prx1 corosync[1867440]:   [QUORUM] Sync members[6]: 1 2 4 5 6 9
Jun 28 19:29:24 prx1 corosync[1867440]:   [QUORUM] Sync left[1]: 7
Jun 28 19:29:24 prx1 corosync[1867440]:   [TOTEM ] entering OPERATIONAL state.
Jun 28 19:29:24 prx1 corosync[1867440]:   [TOTEM ] A new membership (1.7b34) was formed. Members left: 7
Jun 28 19:29:24 prx1 corosync[1867440]:   [TOTEM ] Failed to receive the leave message. failed: 7

And then the node comes back:
Code:
Jun 28 19:34:51 prx1 pmxcfs[1867177]: [status] notice: cpg_send_message retry 50
Jun 28 19:34:52 prx1 pmxcfs[1867177]: [status] notice: cpg_send_message retry 60
Jun 28 19:34:53 prx1 pmxcfs[1867177]: [status] notice: cpg_send_message retry 70
Jun 28 19:34:54 prx1 pmxcfs[1867177]: [status] notice: cpg_send_message retry 80
Jun 28 19:34:55 prx1 pmxcfs[1867177]: [status] notice: cpg_send_message retry 90
Jun 28 19:34:56 prx1 pmxcfs[1867177]: [status] notice: cpg_send_message retry 100
Jun 28 19:34:56 prx1 pmxcfs[1867177]: [status] notice: cpg_send_message retried 100 times
Jun 28 19:34:56 prx1 pmxcfs[1867177]: [status] crit: cpg_send_message failed: 6
Jun 28 19:34:57 prx1 corosync[1867440]:   [TOTEM ] got commit token
Jun 28 19:34:57 prx1 corosync[1867440]:   [TOTEM ] Saving state aru 7 high seq received 7

Corosync conf:
Code:
logging {
  debug: on
  to_syslog: yes
}
...
quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: prx
  config_version: 34
  interface {
    knet_transport: sctp
    linknumber: 0
    token: 5000
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

An important note: some of the nodes are in different datacenters, so the base latency is fairly high (around 1.5 ms). There are also anti-DDoS appliances and firewalls in the path (but the provider says nothing is blocked on port 5405 etc.).
So I am thinking I might be able to handle this by increasing the timeouts?
As you can see, the 5000 ms token doesn't make a difference. What else can I try?
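(For reference, a minimal sketch of how a higher token timeout is usually set: per the corosync.conf manpage, token is a totem-level option rather than an interface option. The value below is only an example, not a recommendation, and config_version has to be incremented on every edit.)

Code:
totem {
  cluster_name: prx
  config_version: 35      # bump on every change
  token: 10000            # example value in ms
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
  interface {
    knet_transport: sctp
    linknumber: 0
  }
}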


PS: pveversion
Code:
# pveversion -v
proxmox-ve: 7.2-1 (running kernel: 5.15.35-2-pve)
pve-manager: 7.2-4 (running version: 7.2-4/ca9d43cc)
pve-kernel-5.15: 7.2-4
pve-kernel-helper: 7.2-4
pve-kernel-5.11: 7.0-10
pve-kernel-5.15.35-2-pve: 5.15.35-5
pve-kernel-5.11.22-7-pve: 5.11.22-12
ceph-fuse: 14.2.21-1
corosync: 3.1.2-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
libjs-extjs: 7.0.0-1
libknet1: 1.21-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-2
libpve-storage-perl: 7.2-4
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.12-1
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.3-1
proxmox-backup-file-restore: 2.2.3-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-1
pve-container: 4.2-1
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.4-2
pve-ha-manager: 3.3-4
pve-i18n: 2.7-2
pve-qemu-kvm: 6.2.0-10
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1
 
Hi,

One of my nodes has the same behavior:
KNET host 1 link: 1 is down
My cluster design is made with two rings, so there is no impact, but we need to understand this.
I have placed captures of the corosync traffic on each node.
On your side, did you notice anything else?
 
Hi!

It seems we've found the issue and got things working well. Long story short: the nodes had IPs from different networks. That apparently added some extra delay for routing etc. (including between DCs). After switching all nodes in the cluster to use IPs from the same network for corosync communication, we have not run into any further issues.
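(In case it helps anyone making the same change: the per-node address corosync uses is the ring0_addr entry in the nodelist section of /etc/pve/corosync.conf. The addresses below are purely hypothetical, and config_version in the totem section must be bumped when editing.)

Code:
nodelist {
  node {
    name: prx1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1   # hypothetical; same subnet on every node
  }
  # ... one node { } entry per cluster member, all on the same network
}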
 
Hi,

Thanks for your feedback.
So on your side it was an external network problem.
My nodes are located in different DCs but on the same L2 network.
I will investigate further.
 
I am facing the same issue. New cluster installation (no VMs running yet), 4 nodes, dedicated 10 Gbps network, same IP subnet for the corosync cluster. The syslog is full of messages:

Code:
Jul 31 20:22:04 node5 corosync[2103]:   [KNET  ] rx: host: 2 link: 0 is up
Jul 31 20:22:04 node5 corosync[2103]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul 31 20:27:04 node5 corosync[2103]:   [KNET  ] link: host: 3 link: 0 is down
Jul 31 20:27:04 node5 corosync[2103]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jul 31 20:27:04 node5 corosync[2103]:   [KNET  ] host: host: 3 has no active links
Jul 31 20:27:07 node5 corosync[2103]:   [KNET  ] rx: host: 3 link: 0 is up
Jul 31 20:27:07 node5 corosync[2103]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jul 31 20:27:10 node5 corosync[2103]:   [KNET  ] link: host: 2 link: 0 is down
Jul 31 20:27:10 node5 corosync[2103]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul 31 20:27:10 node5 corosync[2103]:   [KNET  ] host: host: 2 has no active links
Jul 31 20:27:13 node5 corosync[2103]:   [KNET  ] rx: host: 2 link: 0 is up
Jul 31 20:27:13 node5 corosync[2103]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul 31 20:27:28 node5 corosync[2103]:   [KNET  ] link: host: 4 link: 0 is down
Jul 31 20:27:28 node5 corosync[2103]:   [KNET  ] link: host: 3 link: 0 is down
Jul 31 20:27:28 node5 corosync[2103]:   [KNET  ] link: host: 2 link: 0 is down
Jul 31 20:27:28 node5 corosync[2103]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Jul 31 20:27:28 node5 corosync[2103]:   [KNET  ] host: host: 4 has no active links
Jul 31 20:27:28 node5 corosync[2103]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jul 31 20:27:28 node5 corosync[2103]:   [KNET  ] host: host: 3 has no active links
Jul 31 20:27:28 node5 corosync[2103]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul 31 20:27:28 node5 corosync[2103]:   [KNET  ] host: host: 2 has no active links
Jul 31 20:27:31 node5 corosync[2103]:   [KNET  ] rx: host: 3 link: 0 is up
Jul 31 20:27:31 node5 corosync[2103]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jul 31 20:27:31 node5 corosync[2103]:   [KNET  ] rx: host: 2 link: 0 is up
Jul 31 20:27:31 node5 corosync[2103]:   [KNET  ] rx: host: 4 link: 0 is up
Jul 31 20:27:31 node5 corosync[2103]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul 31 20:27:31 node5 corosync[2103]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)

Corosync runs with its default configuration. I have no idea what the issue is :( In this state, I am afraid to put the nodes into production.
 
@systemctl something is wrong with your package versions - please double-check that your repository setup is correct and that you use 'apt full-upgrade' for upgrading! Your libknet1 is outdated and a known buggy version that can lead to cluster outages upon membership changes.

@everybody: if you have frequent link fluctuations, check the stats cmap: corosync-cmapctl -m stats. Link down events are caused either by the heartbeat (a simple ping/pong over UDP) timing out or by other network-layer errors (like the other end being unreachable when corosync data is sent). By default, 2 heartbeat pongs need to be received (within the expected timespan) for the link to be marked back up. See the 'corosync.conf' manpage for details ;)
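(A minimal sketch of that check; the exact key names may vary between corosync versions.)

Code:
# dump the knet statistics and look at the per-link counters
corosync-cmapctl -m stats | grep -E 'link[0-9]+\.(connected|down_count|up_count|latency)'
# typical keys (may vary): stats.knet.node2.link0.down_count, stats.knet.node2.link0.latency_ave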
 