Lots of cluster failures after upgrading from 5.4 to 6.0.4

Shturman

Hi.
We have a cluster installation with 6 nodes: Dell servers with bonded interfaces on a private switch, almost no load, no traffic. Everything worked perfectly for years, but after the upgrade to 6.0.4, failures started happening one after another.

Sometimes nodes lose the knet connection without any apparent reason (no load at all):
Code:
Jul 28 11:36:36 pve2 corosync[28832]:   [KNET  ] link: host: 6 link: 0 is down
Jul 28 11:36:36 pve2 corosync[28832]:   [KNET  ] link: host: 3 link: 0 is down
Jul 28 11:36:36 pve2 corosync[28832]:   [KNET  ] link: host: 1 link: 0 is down
Jul 28 11:36:36 pve2 corosync[28832]:   [KNET  ] host: host: 6 (passive) best link: 0 (pri: 1)
Jul 28 11:36:36 pve2 corosync[28832]:   [KNET  ] host: host: 6 has no active links
Jul 28 11:36:36 pve2 corosync[28832]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jul 28 11:36:36 pve2 corosync[28832]:   [KNET  ] host: host: 3 has no active links
Jul 28 11:36:36 pve2 corosync[28832]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jul 28 11:36:36 pve2 corosync[28832]:   [KNET  ] host: host: 1 has no active links
Jul 28 11:36:38 pve2 corosync[28832]:   [KNET  ] rx: host: 6 link: 0 is up
Jul 28 11:36:38 pve2 corosync[28832]:   [KNET  ] rx: host: 3 link: 0 is up
Jul 28 11:36:38 pve2 corosync[28832]:   [KNET  ] host: host: 6 (passive) best link: 0 (pri: 1)
Jul 28 11:36:38 pve2 corosync[28832]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jul 28 11:36:38 pve2 corosync[28832]:   [KNET  ] rx: host: 1 link: 0 is up

Sometimes there are MTU problems:
Code:
Jul 28 09:29:16 pve3 corosync[7418]:   [KNET  ] pmtud: possible MTU misconfiguration detected. kernel is reporting MTU: 1500 bytes for host 6 link 0 but the other node is not acknowledging packets of this size.
Jul 28 09:29:16 pve3 corosync[7418]:   [KNET  ] pmtud: This can be caused by this node interface MTU too big or a network device that does not support or has been misconfigured to manage MTU of this size, or packet loss. knet will continue to run but performances might be affected.
Jul 28 09:29:42 pve3 corosync[7418]:   [KNET  ] host: host: 6 (passive) best link: 0 (pri: 1)
Jul 28 09:29:42 pve3 corosync[7418]:   [KNET  ] host: host: 6 has no active links
All nodes run with default settings; I did not change the MTU, etc.
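For what it's worth, a quick way to double-check that the cluster-facing interfaces really are at the default 1500 is to read the MTU directly; bond0 and vmbr0 below are just example interface names, substitute your own:
Code:
# print the configured MTU of the bond / bridge interfaces (names are examples)
ip link show bond0 | grep -o 'mtu [0-9]*'
cat /sys/class/net/vmbr0/mtu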

Is it possible to go back to multicast in corosync?

===
After the upgrade the server won't start, because it cannot mount /dev/pve/data from fstab. It worked perfectly before.
Actually, it looks like release 6 is the worst release ever.
 
Or maybe you just have some MTU size problem in your network...
I checked it. Many, many times.
ping -s 65507 from any node to any node works without any problems. Same thing between virtual machines. Very fast, very stable. Could you please advise how I can check further?
Each server is connected to one gigabit switch via a bonded interface (two gigabit ports). Almost no traffic, almost no load.
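One caveat: plain ping -s allows fragmentation, so it can succeed even when the path MTU is smaller than expected. A non-fragmenting sweep is more telling; 1472 assumes a 1500-byte MTU (the other 28 bytes are ICMP/IP headers), and 10.10.10.2 is a placeholder address:
Code:
# largest payload that must fit into a single 1500-byte frame (fragmentation prohibited)
ping -M do -s 1472 -c 5 10.10.10.2
# if the above fails while smaller sizes pass, something on the path drops full-size frames
ping -M do -s 1400 -c 5 10.10.10.2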
 
I see the same problem. The network is not shared, and there are no problems with MTU. Proxmox 5.x clusters run stably, and non-Proxmox clusters based on corosync 3 (multicast) run stably. One of the Proxmox 5 clusters was updated to 6.0-5 and the cluster fails a few times a day - a node can go offline even though it can still ping the other nodes in the cluster (and can be pinged from them).

Code:
Jul 31 23:26:51 xxx-cl-a-02 corosync[10533]:   [KNET  ] link: host: 4 link: 0 is down
Jul 31 23:26:51 xxx-cl-a-02 corosync[10533]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Jul 31 23:26:51 xxx-cl-a-02 corosync[10533]:   [KNET  ] host: host: 4 has no active links
Jul 31 23:26:53 xxx-cl-a-02 corosync[10533]:   [KNET  ] rx: host: 4 link: 0 is up
Jul 31 23:26:53 xxx-cl-a-02 corosync[10533]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
 
I see the same problem.
Man, no reply from the Proxmox team, but I fixed it. I changed the protocol to SCTP and increased the totem timeout. 4 days have passed without any connection problems. You can try it.
Code:
totem {
  cluster_name: tv
  config_version: 19
  interface {
    bindnetaddr: 10.10.10.0
    ringnumber: 0
    knet_transport: sctp
  }
  ip_version: ipv4
  secauth: on
  version: 2
  token: 10000
}
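If you want to try this, note that on Proxmox VE the cluster-synced copy of the file is the one to edit, and config_version has to be increased so the change propagates. A minimal sketch of the procedure (restart one node at a time so the cluster keeps quorum):
Code:
# edit the cluster-synced config and increase config_version by one
nano /etc/pve/corosync.conf
# then, on each node in turn, restart corosync to apply the change
systemctl restart corosync
# and check membership afterwards
pvecm status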
 
Man, no reply from the Proxmox team, but I fixed it. I changed the protocol to SCTP and increased the totem timeout. 4 days have passed without any connection problems. You can try it.

That did the trick; it works stably now. Thank you a lot!
 
This worked wonders for me after I upgraded.
Edit: Nope. It worked for a day but I still have issues. Any other hints, please?
 
Had the same thing on a cluster @ OVH.
The vRack is used for corosync only.

What happens is that one host suddenly decided to go for a lower MTU; while I was trying to get this host back into the cluster, it was the other way around.

I had to restart corosync on all the other nodes; after the restart they accepted the new MTU and moved on.
Now it's stable again.

ping -M do -s 1472 to and from each host was still fine, yet corosync switched to 1372 or something like that.
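If you hit the same situation, corosync can report its own view of the links, which makes a mismatch between the OS-level ping result and what knet negotiated easier to spot. A small sketch, assuming the standard corosync 3 tools are installed:
Code:
# show the local node id, link addresses and whether corosync considers each link connected
corosync-cfgtool -s
# cluster membership / quorum summary from the Proxmox side
pvecm status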
 
We are on OVH too. Do you have a solution? We get lots of:

Oct 17 18:47:48 storage2 corosync[3488]: [KNET ] link: host: 5 link: 1 is down
Oct 17 18:47:48 storage2 corosync[3488]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Oct 17 18:47:48 storage2 corosync[3488]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Oct 17 18:47:48 storage2 corosync[3488]: [KNET ] link: host: 2 link: 1 is down
Oct 17 18:47:48 storage2 corosync[3488]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Oct 17 18:47:50 storage2 corosync[3488]: [KNET ] rx: host: 5 link: 1 is up
Oct 17 18:47:50 storage2 corosync[3488]: [KNET ] rx: host: 6 link: 1 is up
Oct 17 18:47:50 storage2 corosync[3488]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Oct 17 18:47:50 storage2 corosync[3488]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Oct 17 18:47:50 storage2 corosync[3488]: [KNET ] rx: host: 2 link: 1 is up
Oct 17 18:47:50 storage2 corosync[3488]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
 
Ok, we will wait. Thanks.
What is the "best practice" configuration for networks with "slow" (higher) latency?
We use the OVH vRack as the corosync ring.
What suggestions do you have?

PS:
We have one node with libknet1/stable,now 1.11-pve1 amd64 [installed, upgradable to: 1.13-pve1]
and no link errors.
But all other nodes, with libknet1/stable,now 1.12-pve1 amd64 [installed, upgradable to: 1.13-pve1],
show lots of KNET "link errors".

We have a test node. We upgraded libknet1/stable; do we have to restart anything? Because this node is on libknet1/stable 1.13 and has lots of "link errors" too.
 
Ok, we will wait. Thanks.
What is the "best practice" configuration for networks with "slow" (higher) latency?
We use the OVH vRack as the corosync ring.
What suggestions do you have?

PS:
We have one node with libknet1/stable,now 1.11-pve1 amd64 [installed, upgradable to: 1.13-pve1]
and no link errors.
But all other nodes, with libknet1/stable,now 1.12-pve1 amd64 [installed, upgradable to: 1.13-pve1],
show lots of KNET "link errors".

We have a test node. We upgraded libknet1/stable; do we have to restart anything? Because this node is on libknet1/stable 1.13 and has lots of "link errors" too.

You really need to upgrade all nodes. If you have upgraded only libknet, you need to restart corosync manually.

Note that the link up/down log entries don't necessarily mean that the network interface is down; they can also indicate a communication error or packet loss.

About the OVH vRack: I know there are good vRacks with guaranteed bandwidth ($$$), and free/cheap 10 Mbit/s vRacks. I have seen a lot of bug reports with the latter. (I'm not sure, but it's quite possible that it's sometimes overloaded or that QoS kicks in on the OVH network.)
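For reference, a minimal per-node sketch of that upgrade-and-restart step, assuming only libknet1 was held back (a full dist-upgrade is the cleaner route); do one node at a time so the cluster keeps quorum:
Code:
apt update
apt install libknet1          # or: apt dist-upgrade, to pull all pending PVE 6.x updates
systemctl restart corosync    # corosync must be restarted to load the new library
corosync-cfgtool -s           # verify the links come back up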
 
Thanks. Ok, we will restart corosync.
"You really need to upgrade all nodes" - ok, we will start at the beginning of next week.

We do
apt install libknet1
and then restart corosync.
We have it on 2 of our 6 nodes so far.
One node is on 1.11.

We do have a vRack with 3 Gbit/s guaranteed.
We configured a tinc private network (over the 1 Gbit/s public interfaces of our servers) as ring1 as a backup.
So we get lots of downs/ups on the different rings:

Oct 18 07:12:38 cluster21 corosync[5330]: [KNET ] link: host: 1 link: 1 is down
Oct 18 07:12:38 cluster21 corosync[5330]: [KNET ] link: host: 3 link: 1 is down
Oct 18 07:12:38 cluster21 corosync[5330]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 18 07:12:38 cluster21 corosync[5330]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Oct 18 07:12:40 cluster21 corosync[5330]: [KNET ] rx: host: 1 link: 1 is up
Oct 18 07:12:40 cluster21 corosync[5330]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 18 07:12:40 cluster21 corosync[5330]: [KNET ] rx: host: 3 link: 1 is up
Oct 18 07:12:40 cluster21 corosync[5330]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Oct 18 07:23:30 cluster21 corosync[5330]: [KNET ] link: host: 4 link: 1 is down
Oct 18 07:23:30 cluster21 corosync[5330]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Oct 18 07:23:32 cluster21 corosync[5330]: [KNET ] rx: host: 4 link: 1 is up
Oct 18 07:23:32 cluster21 corosync[5330]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Oct 18 07:24:03 cluster21 corosync[5330]: [KNET ] link: host: 1 link: 1 is down
Oct 18 07:24:03 cluster21 corosync[5330]: [KNET ] link: host: 3 link: 1 is down
Oct 18 07:24:03 cluster21 corosync[5330]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 18 07:24:03 cluster21 corosync[5330]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Oct 18 07:24:05 cluster21 corosync[5330]: [KNET ] rx: host: 3 link: 1 is up
Oct 18 07:24:05 cluster21 corosync[5330]: [KNET ] rx: host: 1 link: 1 is up
Oct 18 07:24:05 cluster21 corosync[5330]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Oct 18 07:24:05 cluster21 corosync[5330]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 18 07:25:50 cluster21 corosync[5330]: [KNET ] link: host: 4 link: 1 is down
Oct 18 07:25:50 cluster21 corosync[5330]: [KNET ] link: host: 2 link: 1 is down
Oct 18 07:25:50 cluster21 corosync[5330]: [KNET ] link: host: 6 link: 1 is down
Oct 18 07:25:50 cluster21 corosync[5330]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Oct 18 07:25:50 cluster21 corosync[5330]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Oct 18 07:25:50 cluster21 corosync[5330]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Oct 18 07:25:52 cluster21 corosync[5330]: [KNET ] rx: host: 4 link: 1 is up
Oct 18 07:25:52 cluster21 corosync[5330]: [KNET ] rx: host: 2 link: 1 is up
Oct 18 07:25:52 cluster21 corosync[5330]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Oct 18 07:25:52 cluster21 corosync[5330]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Oct 18 07:25:52 cluster21 corosync[5330]: [KNET ] rx: host: 6 link: 1 is up
Oct 18 07:25:52 cluster21 corosync[5330]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)

All nodes are now on libknet1/stable,now 1.13-pve1 amd64,
but there are still lots of link down/up events.

Any suggestions for other/higher timing / timeout settings?
 
Man, no reply from the Proxmox team, but I fixed it. I changed the protocol to SCTP and increased the totem timeout. 4 days have passed without any connection problems. You can try it.
Code:
totem {
  cluster_name: tv
  config_version: 19
  interface {
    bindnetaddr: 10.10.10.0
    ringnumber: 0
    knet_transport: sctp
  }
  ip_version: ipv4
  secauth: on
  version: 2
  token: 10000
}

I know this post is kinda old, so it's my own damn fault for sure for just taking something from the internet and adding it to my server to try to shut up a warning message - but I would NOT advise making these changes - doing so left me with a very broken cluster that took me more than three hours to fix :-/
 
I know this post is kinda old, so it's my own damn fault for sure for just taking something from the internet and adding it to my server to try to shut up a warning message - but I would NOT advise making these changes - doing so left me with a very broken cluster that took me more than three hours to fix :-/

indeed, those settings are not a good idea in almost any case. tweaking the ping/pong and token timeout settings a bit can make sense if your network requires it *and* you know what each setting does, but just blindly setting the timeout to some fantasy value will very likely cause more harm than good.
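For orientation, these are the knobs that comment refers to and where they live in corosync.conf; the numbers below are placeholders, not recommendations, and the warning above stands - only change them if your network measurably needs it and you understand each setting:
Code:
totem {
  # total token timeout in milliseconds (raise in small steps, not to fantasy values)
  token: 3000
  interface {
    linknumber: 0
    # knet per-link heartbeat tuning, all in milliseconds
    knet_ping_interval: 200
    knet_ping_timeout: 500
    knet_pong_count: 2
  }
}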
 
Man, no reply from the Proxmox team, but I fixed it. I changed the protocol to SCTP and increased the totem timeout. 4 days have passed without any connection problems. You can try it.
Code:
totem {
  cluster_name: tv
  config_version: 19
  interface {
    bindnetaddr: 10.10.10.0
    ringnumber: 0
    knet_transport: sctp
  }
  ip_version: ipv4
  secauth: on
  version: 2
  token: 10000
}

Hi,

I have the same problem. Where can I do this, please?

Regards,
Chaminda
 
