Lots of cluster failures after upgrading from 5.4 to 6.0.4

Shturman

Hi.
We have a cluster installation with 6 nodes: Dell servers with bonded interfaces on a private switch, almost no load, no traffic. Everything worked perfectly for years, but after the upgrade to 6.0.4, failures started happening one after another.

Sometimes nodes lose the knet connection without any apparent reason (no load at all):
Code:
Jul 28 11:36:36 pve2 corosync[28832]:   [KNET  ] link: host: 6 link: 0 is down
Jul 28 11:36:36 pve2 corosync[28832]:   [KNET  ] link: host: 3 link: 0 is down
Jul 28 11:36:36 pve2 corosync[28832]:   [KNET  ] link: host: 1 link: 0 is down
Jul 28 11:36:36 pve2 corosync[28832]:   [KNET  ] host: host: 6 (passive) best link: 0 (pri: 1)
Jul 28 11:36:36 pve2 corosync[28832]:   [KNET  ] host: host: 6 has no active links
Jul 28 11:36:36 pve2 corosync[28832]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jul 28 11:36:36 pve2 corosync[28832]:   [KNET  ] host: host: 3 has no active links
Jul 28 11:36:36 pve2 corosync[28832]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jul 28 11:36:36 pve2 corosync[28832]:   [KNET  ] host: host: 1 has no active links
Jul 28 11:36:38 pve2 corosync[28832]:   [KNET  ] rx: host: 6 link: 0 is up
Jul 28 11:36:38 pve2 corosync[28832]:   [KNET  ] rx: host: 3 link: 0 is up
Jul 28 11:36:38 pve2 corosync[28832]:   [KNET  ] host: host: 6 (passive) best link: 0 (pri: 1)
Jul 28 11:36:38 pve2 corosync[28832]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jul 28 11:36:38 pve2 corosync[28832]:   [KNET  ] rx: host: 1 link: 0 is up

Sometimes there are MTU problems:
Code:
Jul 28 09:29:16 pve3 corosync[7418]:   [KNET  ] pmtud: possible MTU misconfiguration detected. kernel is reporting MTU: 1500 bytes for host 6 link 0 but the other node is not acknowledging packets of this size.
Jul 28 09:29:16 pve3 corosync[7418]:   [KNET  ] pmtud: This can be caused by this node interface MTU too big or a network device that does not support or has been misconfigured to manage MTU of this size, or packet loss. knet will continue to run but performances might be affected.
Jul 28 09:29:42 pve3 corosync[7418]:   [KNET  ] host: host: 6 (passive) best link: 0 (pri: 1)
Jul 28 09:29:42 pve3 corosync[7418]:   [KNET  ] host: host: 6 has no active links
All nodes run with default settings; I did not change the MTU, etc.
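For what it's worth, a quick way to double-check that the cluster-facing interfaces really are at the default 1500 is to read the MTU directly; bond0 and vmbr0 below are just example interface names, substitute your own:
Code:
# print the configured MTU of the bond / bridge interfaces (names are examples)
ip link show bond0 | grep -o 'mtu [0-9]*'
cat /sys/class/net/vmbr0/mtu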

Is it possible to go back to multicast in corosync?

===
After the upgrade the server won't start, because it cannot mount /dev/pve/data from fstab. It worked perfectly before.
Actually, it looks like release 6 is the worst release ever.
 
Or maybe you just have some MTU size problem in your network...
I checked it. Many, many times.
ping -s 65507 from any node to any node works without any problems. Same thing between virtual machines. Very fast, very stable. Could you please advise how I can check further?
Each server is connected to one gigabit switch via a bonded interface (two gigabit ports). Almost no traffic, almost no load.
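One caveat: plain ping -s allows fragmentation, so it can succeed even when the path MTU is smaller than expected. A non-fragmenting sweep is more telling; 1472 assumes a 1500-byte MTU (the other 28 bytes are ICMP/IP headers), and 10.10.10.2 is a placeholder address:
Code:
# largest payload that must fit into a single 1500-byte frame (fragmentation prohibited)
ping -M do -s 1472 -c 5 10.10.10.2
# if the above fails while smaller sizes pass, something on the path drops full-size frames
ping -M do -s 1400 -c 5 10.10.10.2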
 
I see the same problem. The network is not shared, and there are no problems with MTU. Proxmox 5.x clusters run stably, and non-Proxmox clusters based on corosync 3 (multicast) run stably. One of the Proxmox 5 clusters was updated to 6.0-5 and the cluster fails a few times a day - a node can go offline even though it can still ping the other nodes in the cluster (and can be pinged from them).

Code:
Jul 31 23:26:51 xxx-cl-a-02 corosync[10533]:   [KNET  ] link: host: 4 link: 0 is down
Jul 31 23:26:51 xxx-cl-a-02 corosync[10533]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Jul 31 23:26:51 xxx-cl-a-02 corosync[10533]:   [KNET  ] host: host: 4 has no active links
Jul 31 23:26:53 xxx-cl-a-02 corosync[10533]:   [KNET  ] rx: host: 4 link: 0 is up
Jul 31 23:26:53 xxx-cl-a-02 corosync[10533]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
 
I see the same problem.
Man, no reply from the Proxmox team, but I fixed it. I changed the protocol to SCTP and increased the totem timeout. 4 days have passed without any connection problems. You can try it.
Code:
totem {
  cluster_name: tv
  config_version: 19
  interface {
    bindnetaddr: 10.10.10.0
    ringnumber: 0
    knet_transport: sctp
  }
  ip_version: ipv4
  secauth: on
  version: 2
  token: 10000
}
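If you want to try this, note that on Proxmox VE the cluster-synced copy of the file is the one to edit, and config_version has to be increased so the change propagates. A minimal sketch of the procedure (restart one node at a time so the cluster keeps quorum):
Code:
# edit the cluster-synced config and increase config_version by one
nano /etc/pve/corosync.conf
# then, on each node in turn, restart corosync to apply the change
systemctl restart corosync
# and check membership afterwards
pvecm status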
 
Man, no reply from the Proxmox team, but I fixed it. I changed the protocol to SCTP and increased the totem timeout. 4 days have passed without any connection problems. You can try it.

That did the trick; it works stably now. Thank you a lot!
 
This worked wonders for me after I upgraded.
Edit: Nope. It worked for a day but I still have issues. Any other hints, please?
 
Had the same thing on a cluster @ OVH.
The vRack is used for corosync only.

What happens is that one host suddenly decided to go for a lower MTU; while I was trying to get this host back into the cluster, it was the other way around.

I had to restart corosync on all the other nodes; after the restart they accepted the new MTU and moved on.
Now it's stable again.

ping -M do -s 1472 to and from each host was still fine, yet corosync switched to 1372 or something like that.
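If you hit the same situation, corosync can report its own view of the links, which makes a mismatch between the OS-level ping result and what knet negotiated easier to spot. A small sketch, assuming the standard corosync 3 tools are installed:
Code:
# show the local node id, link addresses and whether corosync considers each link connected
corosync-cfgtool -s
# cluster membership / quorum summary from the Proxmox side
pvecm status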
 
We are on OVH too. Do you have a solution? We get lots of:

Oct 17 18:47:48 storage2 corosync[3488]: [KNET ] link: host: 5 link: 1 is down
Oct 17 18:47:48 storage2 corosync[3488]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Oct 17 18:47:48 storage2 corosync[3488]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Oct 17 18:47:48 storage2 corosync[3488]: [KNET ] link: host: 2 link: 1 is down
Oct 17 18:47:48 storage2 corosync[3488]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Oct 17 18:47:50 storage2 corosync[3488]: [KNET ] rx: host: 5 link: 1 is up
Oct 17 18:47:50 storage2 corosync[3488]: [KNET ] rx: host: 6 link: 1 is up
Oct 17 18:47:50 storage2 corosync[3488]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Oct 17 18:47:50 storage2 corosync[3488]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Oct 17 18:47:50 storage2 corosync[3488]: [KNET ] rx: host: 2 link: 1 is up
Oct 17 18:47:50 storage2 corosync[3488]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
 
Ok, we will wait. Thanks.
What is the "best practice" configuration for networks with "slow" (higher) latency?
We use the OVH vRack as the corosync ring.
What suggestions do you have?

PS:
We have one node with libknet1/stable,now 1.11-pve1 amd64 [installed, upgradable to: 1.13-pve1]
and no link errors.
But all other nodes, with libknet1/stable,now 1.12-pve1 amd64 [installed, upgradable to: 1.13-pve1],
show lots of KNET "link errors".

We have a test node. We upgraded libknet1/stable; do we have to restart anything? Because this node is on libknet1/stable 1.13 and has lots of "link errors" too.
 
Ok, we will wait. Thanks.
What is the "best practice" configuration for networks with "slow" (higher) latency?
We use the OVH vRack as the corosync ring.
What suggestions do you have?

PS:
We have one node with libknet1/stable,now 1.11-pve1 amd64 [installed, upgradable to: 1.13-pve1]
and no link errors.
But all other nodes, with libknet1/stable,now 1.12-pve1 amd64 [installed, upgradable to: 1.13-pve1],
show lots of KNET "link errors".

We have a test node. We upgraded libknet1/stable; do we have to restart anything? Because this node is on libknet1/stable 1.13 and has lots of "link errors" too.

You really need to upgrade all nodes. If you have upgraded only libknet, you need to restart corosync manually.

Note that the link up/down log entries don't necessarily mean that the network interface is down; they can also indicate a communication error or packet loss.

About the OVH vRack: I know there are good vRacks with guaranteed bandwidth ($$$), and free/cheap 10 Mbit/s vRacks. I have seen a lot of bug reports with the latter. (I'm not sure, but it's quite possible that it's sometimes overloaded or that QoS kicks in on the OVH network.)
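For reference, a minimal per-node sketch of that upgrade-and-restart step, assuming only libknet1 was held back (a full dist-upgrade is the cleaner route); do one node at a time so the cluster keeps quorum:
Code:
apt update
apt install libknet1          # or: apt dist-upgrade, to pull all pending PVE 6.x updates
systemctl restart corosync    # corosync must be restarted to load the new library
corosync-cfgtool -s           # verify the links come back up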
 
Thanks. Ok, we will restart corosync.
"You really need to upgrade all nodes" - ok, we will start at the beginning of next week.

We do
apt install libknet1
and then restart corosync.
We have it on 2 of our 6 nodes so far.
One node is on 1.11.

We do have a vRack with 3 Gbit/s guaranteed.
We configured a tinc private network (over the 1 Gbit/s public interfaces of our servers) as ring1 as a backup.
So we get lots of downs/ups on the different rings:

Oct 18 07:12:38 cluster21 corosync[5330]: [KNET ] link: host: 1 link: 1 is down
Oct 18 07:12:38 cluster21 corosync[5330]: [KNET ] link: host: 3 link: 1 is down
Oct 18 07:12:38 cluster21 corosync[5330]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 18 07:12:38 cluster21 corosync[5330]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Oct 18 07:12:40 cluster21 corosync[5330]: [KNET ] rx: host: 1 link: 1 is up
Oct 18 07:12:40 cluster21 corosync[5330]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 18 07:12:40 cluster21 corosync[5330]: [KNET ] rx: host: 3 link: 1 is up
Oct 18 07:12:40 cluster21 corosync[5330]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Oct 18 07:23:30 cluster21 corosync[5330]: [KNET ] link: host: 4 link: 1 is down
Oct 18 07:23:30 cluster21 corosync[5330]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Oct 18 07:23:32 cluster21 corosync[5330]: [KNET ] rx: host: 4 link: 1 is up
Oct 18 07:23:32 cluster21 corosync[5330]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Oct 18 07:24:03 cluster21 corosync[5330]: [KNET ] link: host: 1 link: 1 is down
Oct 18 07:24:03 cluster21 corosync[5330]: [KNET ] link: host: 3 link: 1 is down
Oct 18 07:24:03 cluster21 corosync[5330]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 18 07:24:03 cluster21 corosync[5330]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Oct 18 07:24:05 cluster21 corosync[5330]: [KNET ] rx: host: 3 link: 1 is up
Oct 18 07:24:05 cluster21 corosync[5330]: [KNET ] rx: host: 1 link: 1 is up
Oct 18 07:24:05 cluster21 corosync[5330]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Oct 18 07:24:05 cluster21 corosync[5330]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 18 07:25:50 cluster21 corosync[5330]: [KNET ] link: host: 4 link: 1 is down
Oct 18 07:25:50 cluster21 corosync[5330]: [KNET ] link: host: 2 link: 1 is down
Oct 18 07:25:50 cluster21 corosync[5330]: [KNET ] link: host: 6 link: 1 is down
Oct 18 07:25:50 cluster21 corosync[5330]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Oct 18 07:25:50 cluster21 corosync[5330]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Oct 18 07:25:50 cluster21 corosync[5330]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Oct 18 07:25:52 cluster21 corosync[5330]: [KNET ] rx: host: 4 link: 1 is up
Oct 18 07:25:52 cluster21 corosync[5330]: [KNET ] rx: host: 2 link: 1 is up
Oct 18 07:25:52 cluster21 corosync[5330]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Oct 18 07:25:52 cluster21 corosync[5330]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Oct 18 07:25:52 cluster21 corosync[5330]: [KNET ] rx: host: 6 link: 1 is up
Oct 18 07:25:52 cluster21 corosync[5330]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)

All nodes are now on libknet1/stable,now 1.13-pve1 amd64,
but there are still lots of link down/up events.

Any suggestions for other/higher timing / timeout settings?
 
Man, no reply from the Proxmox team, but I fixed it. I changed the protocol to SCTP and increased the totem timeout. 4 days have passed without any connection problems. You can try it.
Code:
totem {
  cluster_name: tv
  config_version: 19
  interface {
    bindnetaddr: 10.10.10.0
    ringnumber: 0
    knet_transport: sctp
  }
  ip_version: ipv4
  secauth: on
  version: 2
  token: 10000
}

I know this post is kinda old, so it's my own damn fault for sure for just taking something from the internet and adding it to my server to try to shut up a warning message - but I would NOT advise making these changes - doing so left me with a very broken cluster that took me more than three hours to fix :-/
 
I know this post is kinda old, so it's my own damn fault for sure for just taking something from the internet and adding it to my server to try to shut up a warning message - but I would NOT advise making these changes - doing so left me with a very broken cluster that took me more than three hours to fix :-/

indeed, those settings are not a good idea in almost any case. tweaking the ping/pong and token timeout settings a bit can make sense if your network requires it *and* you know what each setting does, but just blindly setting the timeout to some fantasy value will very likely cause more harm than good.
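For orientation, these are the knobs that comment refers to and where they live in corosync.conf; the numbers below are placeholders, not recommendations, and the warning above stands - only change them if your network measurably needs it and you understand each setting:
Code:
totem {
  # total token timeout in milliseconds (raise in small steps, not to fantasy values)
  token: 3000
  interface {
    linknumber: 0
    # knet per-link heartbeat tuning, all in milliseconds
    knet_ping_interval: 200
    knet_ping_timeout: 500
    knet_pong_count: 2
  }
}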
 
Man, no reply from the Proxmox team, but I fixed it. I changed the protocol to SCTP and increased the totem timeout. 4 days have passed without any connection problems. You can try it.
Code:
totem {
  cluster_name: tv
  config_version: 19
  interface {
    bindnetaddr: 10.10.10.0
    ringnumber: 0
    knet_transport: sctp
  }
  ip_version: ipv4
  secauth: on
  version: 2
  token: 10000
}

Hi,

I have the same problem. Where can I do this, please?

Regards,
Chaminda
 
