link: host: 7 link: 0 is down

systemctl

Hello
I have a Proxmox cluster with (currently) 7 nodes.
Some time ago, strange things started to happen: random nodes just drop out of the cluster and then rejoin. Here are the logs:

Code:
Jun 28 19:29:11 prx1 corosync[1867440]: [KNET ] link: host: 7 link: 0 is down
Jun 28 19:29:11 prx1 corosync[1867440]: [KNET ] sctp: Notifying connect thread that sockfd 33 received a link down event
Jun 28 19:29:11 prx1 corosync[1867440]: [KNET ] host: host: 7 (passive) best link: 0 (pri: 1)
Jun 28 19:29:11 prx1 corosync[1867440]: [KNET ] host: host: 7 has no active links
Jun 28 19:29:11 prx1 corosync[1867440]: [TOTEM ] Knet host change callback. nodeid: 7 reachable: 0
Code:
Jun 28 19:29:24 prx1 corosync[1867440]:   [MAIN  ] Member left: r(0) ip(*.*.*.*)
Jun 28 19:29:24 prx1 corosync[1867440]:   [TOTEM ] waiting_trans_ack changed to 1
Jun 28 19:29:24 prx1 corosync[1867440]:   [SYNC  ] call init for locally known services
Jun 28 19:29:24 prx1 corosync[1867440]:   [QUORUM] Sync members[6]: 1 2 4 5 6 9
Jun 28 19:29:24 prx1 corosync[1867440]:   [QUORUM] Sync left[1]: 7
Jun 28 19:29:24 prx1 corosync[1867440]:   [TOTEM ] entering OPERATIONAL state.
Jun 28 19:29:24 prx1 corosync[1867440]:   [TOTEM ] A new membership (1.7b34) was formed. Members left: 7
Jun 28 19:29:24 prx1 corosync[1867440]:   [TOTEM ] Failed to receive the leave message. failed: 7

And then the node comes back:
Code:
Jun 28 19:34:51 prx1 pmxcfs[1867177]: [status] notice: cpg_send_message retry 50
Jun 28 19:34:52 prx1 pmxcfs[1867177]: [status] notice: cpg_send_message retry 60
Jun 28 19:34:53 prx1 pmxcfs[1867177]: [status] notice: cpg_send_message retry 70
Jun 28 19:34:54 prx1 pmxcfs[1867177]: [status] notice: cpg_send_message retry 80
Jun 28 19:34:55 prx1 pmxcfs[1867177]: [status] notice: cpg_send_message retry 90
Jun 28 19:34:56 prx1 pmxcfs[1867177]: [status] notice: cpg_send_message retry 100
Jun 28 19:34:56 prx1 pmxcfs[1867177]: [status] notice: cpg_send_message retried 100 times
Jun 28 19:34:56 prx1 pmxcfs[1867177]: [status] crit: cpg_send_message failed: 6
Jun 28 19:34:57 prx1 corosync[1867440]:   [TOTEM ] got commit token
Jun 28 19:34:57 prx1 corosync[1867440]:   [TOTEM ] Saving state aru 7 high seq received 7

Corosync conf:
Code:
logging {
  debug: on
  to_syslog: yes
}
...
quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: prx
  config_version: 34
  interface {
    knet_transport: sctp
    linknumber: 0
    token: 5000
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

An important note: some of the nodes are in different datacenters, so the base latency is fairly high (around 1.5 ms). There are also anti-DDoS appliances and firewalls in the path (but the provider says nothing is blocked on port 5405 etc.).
So I am thinking I might be able to handle this by increasing the timeouts?
As you can see, the 5000 ms token doesn't make a difference. What else can I try?
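(For reference, a minimal sketch of how a higher token timeout is usually set: per the corosync.conf manpage, token is a totem-level option rather than an interface option. The value below is only an example, not a recommendation, and config_version has to be incremented on every edit.)

Code:
totem {
  cluster_name: prx
  config_version: 35      # bump on every change
  token: 10000            # example value in ms
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
  interface {
    knet_transport: sctp
    linknumber: 0
  }
}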


PS: pveversion
Code:
# pveversion -v
proxmox-ve: 7.2-1 (running kernel: 5.15.35-2-pve)
pve-manager: 7.2-4 (running version: 7.2-4/ca9d43cc)
pve-kernel-5.15: 7.2-4
pve-kernel-helper: 7.2-4
pve-kernel-5.11: 7.0-10
pve-kernel-5.15.35-2-pve: 5.15.35-5
pve-kernel-5.11.22-7-pve: 5.11.22-12
ceph-fuse: 14.2.21-1
corosync: 3.1.2-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
libjs-extjs: 7.0.0-1
libknet1: 1.21-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-2
libpve-storage-perl: 7.2-4
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.12-1
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.3-1
proxmox-backup-file-restore: 2.2.3-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-1
pve-container: 4.2-1
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.4-2
pve-ha-manager: 3.3-4
pve-i18n: 2.7-2
pve-qemu-kvm: 6.2.0-10
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1
 
Hi,

One of my nodes has the same behavior:
KNET host 1 link: 1 is down
My cluster design is made with two rings, so there is no impact, but we need to understand this.
I have placed captures of the corosync traffic on each node.
On your side, did you notice anything else?
 
Hi!

It seems we've found the issue and got things working well. Long story short: the nodes had IPs from different networks. That apparently added some extra delay for routing etc. (including between DCs). After switching all nodes in the cluster to use IPs from the same network for corosync communication, we have not run into any further issues.
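(In case it helps anyone making the same change: the per-node address corosync uses is the ring0_addr entry in the nodelist section of /etc/pve/corosync.conf. The addresses below are purely hypothetical, and config_version in the totem section must be bumped when editing.)

Code:
nodelist {
  node {
    name: prx1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1   # hypothetical; same subnet on every node
  }
  # ... one node { } entry per cluster member, all on the same network
}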
 
Hi,

Thanks for your feedback.
So on your side it was an external network problem.
My nodes are located in different DCs but on the same L2 network.
I will investigate further.
 
I am facing the same issue. New cluster installation (no VMs running yet), 4 nodes, dedicated 10 Gbps network, same IP subnet for the corosync cluster. The syslog is full of messages:

Code:
Jul 31 20:22:04 node5 corosync[2103]:   [KNET  ] rx: host: 2 link: 0 is up
Jul 31 20:22:04 node5 corosync[2103]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul 31 20:27:04 node5 corosync[2103]:   [KNET  ] link: host: 3 link: 0 is down
Jul 31 20:27:04 node5 corosync[2103]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jul 31 20:27:04 node5 corosync[2103]:   [KNET  ] host: host: 3 has no active links
Jul 31 20:27:07 node5 corosync[2103]:   [KNET  ] rx: host: 3 link: 0 is up
Jul 31 20:27:07 node5 corosync[2103]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jul 31 20:27:10 node5 corosync[2103]:   [KNET  ] link: host: 2 link: 0 is down
Jul 31 20:27:10 node5 corosync[2103]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul 31 20:27:10 node5 corosync[2103]:   [KNET  ] host: host: 2 has no active links
Jul 31 20:27:13 node5 corosync[2103]:   [KNET  ] rx: host: 2 link: 0 is up
Jul 31 20:27:13 node5 corosync[2103]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul 31 20:27:28 node5 corosync[2103]:   [KNET  ] link: host: 4 link: 0 is down
Jul 31 20:27:28 node5 corosync[2103]:   [KNET  ] link: host: 3 link: 0 is down
Jul 31 20:27:28 node5 corosync[2103]:   [KNET  ] link: host: 2 link: 0 is down
Jul 31 20:27:28 node5 corosync[2103]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Jul 31 20:27:28 node5 corosync[2103]:   [KNET  ] host: host: 4 has no active links
Jul 31 20:27:28 node5 corosync[2103]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jul 31 20:27:28 node5 corosync[2103]:   [KNET  ] host: host: 3 has no active links
Jul 31 20:27:28 node5 corosync[2103]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul 31 20:27:28 node5 corosync[2103]:   [KNET  ] host: host: 2 has no active links
Jul 31 20:27:31 node5 corosync[2103]:   [KNET  ] rx: host: 3 link: 0 is up
Jul 31 20:27:31 node5 corosync[2103]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jul 31 20:27:31 node5 corosync[2103]:   [KNET  ] rx: host: 2 link: 0 is up
Jul 31 20:27:31 node5 corosync[2103]:   [KNET  ] rx: host: 4 link: 0 is up
Jul 31 20:27:31 node5 corosync[2103]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jul 31 20:27:31 node5 corosync[2103]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)

Corosync runs with its default configuration. I have no idea what the issue is :( In this state, I am afraid to put the nodes into production.
 
@systemctl something is wrong with your package versions - please double-check that your repository setup is correct and that you use 'apt full-upgrade' for upgrading! Your libknet1 is outdated and a known buggy version that can lead to cluster outages upon membership changes.

@everybody: if you have frequent link fluctuations, check the stats cmap: corosync-cmapctl -m stats. Link down events are caused either by the heartbeat (a simple ping/pong over UDP) timing out or by other network-layer errors (like the other end being unreachable when corosync data is sent). By default, 2 heartbeat pongs need to be received (within the expected timespan) for the link to be marked back up. See the 'corosync.conf' manpage for details ;)
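(A minimal sketch of that check; the exact key names may vary between corosync versions.)

Code:
# dump the knet statistics and look at the per-link counters
corosync-cmapctl -m stats | grep -E 'link[0-9]+\.(connected|down_count|up_count|latency)'
# typical keys (may vary): stats.knet.node2.link0.down_count, stats.knet.node2.link0.latency_ave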
 