PVE 5.4-11 + Corosync 3.x: major issues

fabian

Proxmox Staff Member
Jan 7, 2016
3,474
541
113
Is this a generalized issue or only certain combination of hw/sw see this?
And what about a fresh install of V6, does it have issues with corosync v3?
only certain hardware/environment factors trigger it. all our testlab and production clusters have been running stable with Corosync 3.x and knet for quite a while (obviously, we use our own products and don't ship stuff that we know does not work..), AFAIK @spirit also has been running larger installations for quite a while without issues.

the pmtud issues should hopefully all be fixed now with the latest knet (1.11). in parallel we also fixed some unrelated issues in our cluster file system (and reworked some parts to make debugging / analysis easier). the corosync crashes are currently being analyzed using the coredumps provided by affected users.
 

spirit

Well-Known Member
Apr 2, 2010
3,529
156
63
www.odiso.com
AFAIK @spirit also has been running larger installations for quite a while without issues.
yes, indeed, I've been running the corosync3 beta for 6 months (on proxmox 5 with kernel 4.15). no problems so far. (I'm using mellanox connect-x4 cards and 2x10gb lacp bonding). The cluster has 16 nodes, without any special tuning of the corosync configuration.
 

brad_mssw

Member
Jun 13, 2014
129
7
18
our new cluster will be running mellanox connect-x4 as well, except 2x50Gbps ... so it's good to know that mellanox has been working well for a long time.
 
Jun 8, 2016
221
41
28
43
Johannesburg, South Africa
Nothing at all, only the usual boot time initialisation messages.

QLogic firmware might be newer/older than others':
Code:
[root@kvm1 ~]# ethtool -i eth0
driver: bnx2
version: 2.2.6
firmware-version: bc 5.2.3 NCSI 2.0.6
expansion-rom-version:
bus-info: 0000:03:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no
do you see any bnx2 errors in kern.log or dmesg?
 

bofh

Member
Nov 7, 2017
97
9
8
39
kernel 5.0.21-1-pve
Cluster 4 identical nodes, 2 network cards each
all fresh installs pve6

network cards
driver: ixgbe
version: 5.1.0-k
firmware-version: 0x80000aee, 1.1927.0
expansion-rom-version:
bus-info: 0000:03:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes


eno2 is dedicated to corosync - runs over vrack at ovh (i know i know)
it was stable with 4 nodes and tinc for about a week
then it was stable for 2 weeks with 3 nodes and using vrack/second nic instead of tinc

since adding node 4 and the second nic, corosync crashes the whole thing and reports mtu changes


now the weird thing is the symptoms
node 1,2,4 report 3 missing
node 3 reports being single again

but node 1,2,4 are now unresponsive for everything proxmox related. pvecm takes 10-20 seconds, qm list even freezes
node 3 is responsive.
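
For readers following along, here is a minimal sketch of the majority rule that decides which side of such a split should keep quorum. It assumes one vote per node and no special votequorum options, neither of which is confirmed by the posted configuration, and the function name is made up for illustration.

```python
def has_quorum(partition_size: int, total_nodes: int) -> bool:
    """Majority rule: a partition is quorate with more than half of all votes.

    Assumes one vote per node and default quorum settings
    (assumptions, not taken from the posted config).
    """
    votes_needed = total_nodes // 2 + 1
    return partition_size >= votes_needed


# The split described above: nodes 1, 2 and 4 see each other, node 3 is alone.
total = 4
for name, members in {"nodes 1,2,4": 3, "node 3": 1}.items():
    state = "quorate" if has_quorum(members, total) else "not quorate"
    print(f"{name}: {members}/{total} votes -> {state}")
```

Under those assumptions the three-node side should still be quorate, which is what makes the freezes on nodes 1, 2 and 4 surprising.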

trying to resolve #1
restarting corosync on 1,2,4 now puts each of them into single mode, the cluster is not being formed again
when you restart corosync on 3 the cluster is back up immediately.

trying to resolve #2
next crash, same symptoms
now we try to restart corosync on #3 alone - issue resolved, cluster back up (i still restart corosync on each node just to be safe)


since my ability to change hardware is limited (none) and i've no choice but ovh i have set up a backup ring via tinc
the backup ring works so let's see if at least the freezes stop

edit: no errors in dmesg about the network card
 

spirit

kernel 5.0.21-1-pve

eno2 is dedicated to corosync - runs over vrack at ovh (i know i know)
it was stable with 4 nodes and tinc for about a week
then it was stable for 2 weeks with 3 nodes and using vrack/second nic instead of tinc

since adding node 4 and the second nic, corosync crashes the whole thing and reports mtu changes
do you have corosync log ?
# pveversion -v ?
 

bofh

do you have corosync log ?
# pveversion -v ?
All nodes are on the same version:

proxmox-ve: 6.0-2 (running kernel: 5.0.21-1-pve)
pve-manager: 6.0-7 (running version: 6.0-7/28984024)
pve-kernel-5.0: 6.0-7
pve-kernel-helper: 6.0-7
pve-kernel-5.0.21-1-pve: 5.0.21-2
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.11-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-4
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-8
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-64
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-7
pve-cluster: 6.0-7
pve-container: 3.0-7
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-3
pve-qemu-kvm: 4.0.0-5
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-7
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve2

Attached logs are, as described, for nodes 1,2,3,4
I shortened some logs and cut out the retransmission lists; i left the first 3 and last 3 lines of each chunk to keep the timestamps in case they're interesting

at the same time there are no errors in the system logs/dmesg about any network related issue.
i don't have network monitoring on (yet), i hope i can have smokeping and alerting running by the end of the week
 


spirit

@bofh
looking at the logs, I'm seeing
"Process pause detected for xxx ms, flushing membership messages."

in the corosync code, this message comes from pause_flush() in exec/totemsrp.c:

Code:
https://github.com/corosync/corosync/blob/master/exec/totemsrp.c

static int pause_flush (struct totemsrp_instance *instance)
{
...
    if ((now_msec - timestamp_msec) > (instance->totem_config->token_timeout / 2)) {
        log_printf (instance->totemsrp_log_level_notice,
            "Process pause detected for %d ms, flushing membership messages.", (unsigned int)(now_msec - timestamp_msec));
So I think you can try configuring a higher token timeout in corosync.conf.
maybe try with 10s:

token: 10000
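
To make the relationship between the log message and the token timeout concrete, here is a rough sketch that pulls the pause durations out of log lines and applies the same "more than half the token timeout" comparison as the quoted C condition. The sample log lines and their durations are invented for illustration, the helper names are hypothetical, and the timeout corosync actually applies at runtime may differ from the bare `token` value in corosync.conf, so treat the result as approximate.

```python
import re

# Message format taken from the log_printf() call quoted above.
PAUSE_RE = re.compile(r"Process pause detected for (\d+) ms")


def pause_durations(log_lines):
    """Return the pause lengths (in ms) found in corosync log lines."""
    return [int(m.group(1)) for line in log_lines if (m := PAUSE_RE.search(line))]


def exceeds_pause_threshold(pause_ms: int, token_timeout_ms: int) -> bool:
    """Mirror the quoted condition: pause > token_timeout / 2 (integer division)."""
    return pause_ms > token_timeout_ms // 2


# Hypothetical log lines; the durations are made up for illustration.
sample = [
    "corosync[1234]: [TOTEM ] Process pause detected for 2500 ms, flushing membership messages.",
    "corosync[1234]: [TOTEM ] A new membership was formed.",
]
for ms in pause_durations(sample):
    print(ms, "ms pause, over threshold with token=10000:",
          exceeds_pause_threshold(ms, 10000))
```

Running the real log through something like this shows how long the reported pauses are, which helps to judge whether 10s is enough headroom.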
 

Whatever

Member
Nov 19, 2012
228
6
18
Could the problem be related to jumbo frames and/or dual ring configuration?
I'm facing the same issue - corosync randomly hangs on different nodes.
I've two rings 10Gbe + 1Gbe with mtu = 9000 on both nets
 

ahovda

New Member
Sep 2, 2019
13
3
3
42
I found I had a flakey LACP bond between two switches. Rebooted the switches and that seems resolved. I've added options bnx2 disable_msi=1 to /etc/modprobe.d/bnx2.conf and rebooted all hosts (checked lspci -v for MSI-X: Enable- afterwards). So far no more corosync segfaults, but I have yet to reenable HA, I don't really trust it yet.
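
For anyone repeating the check after the reboot, here is a small sketch that looks for the MSI-X flag in `lspci -v` output. The sample text is a hypothetical excerpt, not output from the hosts in this thread, and the helper only looks at the first MSI-X line it finds, so feed it the output for a single device.

```python
import re

# Matches the MSI-X capability flag as printed by `lspci -v`,
# e.g. "Capabilities: [a0] MSI-X: Enable- Count=9 Masked-".
MSIX_RE = re.compile(r"MSI-X: Enable([+-])")


def msix_enabled(lspci_text: str):
    """Return True/False for 'Enable+'/'Enable-', or None if no MSI-X line is found."""
    match = MSIX_RE.search(lspci_text)
    if match is None:
        return None
    return match.group(1) == "+"


# Hypothetical excerpt of `lspci -v` output for one NIC after the reboot.
sample = """\
03:00.0 Ethernet controller: Broadcom NetXtreme II Gigabit Ethernet
	Capabilities: [a0] MSI-X: Enable- Count=9 Masked-
	Kernel driver in use: bnx2
"""
print("MSI-X enabled:", msix_enabled(sample))
```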
 
Aug 7, 2018
15
0
1
119
Could the problem be related to jumbo frames and/or dual ring configuration?
I'm facing the same issue - corosync randomly hangs on different nodes.
I've two rings 10Gbe + 1Gbe with mtu = 9000 on both nets
I'd very much like to know too. I'm having the same issue and am about to downgrade 10 nodes to Proxmox VE 5 because of it.

Also a dual ring configuration, one ring with mtu 1500 and one with mtu 9000.
 

bofh

@bofh


So, I think you can try to configure higher token timeout in corosync.conf
maybe try with 10s:

token: 10000
thanks, will try.

little update on this. since i do not trust OVH or their ability to provide virtual networks i set up a second ring on tinc.

now it's much more stable. today there was NO corosync log entry on 2 of the nodes.
weirdly enough, while 2 nodes don't report anything at all for today, the 2 other nodes do
if there's a network issue i would have expected every node to report it
e.g. node 3 reports link0 on node 1 is down, but node 1 says nothing at all.

now my best guess, without knowing the code, would be the following
- we all might have different causes for those issues but the same symptoms
- corosync can't handle network errors properly
- corosync does not seem to be able to reliably detect network errors or issues
  i guess they still rely a lot on multicast in their error handling and detection, even though we only have unicast now
- pve doesn't handle such errors well either, resulting in freezes even when quorum still exists
- we should not rely on or trust corosync to report problems, it might or might not.



now i also had an issue on another cluster with 2 nodes, no HA.
one of the machines crashed (likely a hardware issue) and was unresponsive to anything (not even ssh) but still answered ping and seemed to answer on corosync

the result was that node 1, which worked properly, also froze pve completely.
oddly enough, even after a reboot of node 1 (i didn't look at node 2 at that time, silly me, why check the freezing node when you should check the others)
node 1 froze immediately once corosync came online

so as long as the faulty node 2 was online, node 1 was unusable. i could not even run qm list, or handle auth on the web.

regardless of what network issues we have, or how our nodes are crashing, it should never be that a faulty node takes everyone else down with it.
that might be caused by corosync, but pve also messed up here in a big way

oddly enough, while i still have said network issues and 2 nodes of the 4 machine cluster report links down,
i no longer have total freezes or the need to restart corosync manually. the second ring (even though it also reports down sometimes) did at least solve the issue with pve, and the cluster now seems to handle those errors more resiliently
 

Ivan Gersi

Member
May 29, 2016
42
4
8
49
I'm really frustrated after updating to Corosync 3. The cluster has been unstable for 2 months.
I have a 5 node cluster...2 nodes are on the latest 5 version, 3 nodes are on the latest 6 version.
Nodes get disconnected randomly; the cluster stays online sometimes for a few hours, sometimes a few minutes or a few days. Yesterday pve1 didn't want to connect to the cluster. I tried restarting corosync and the node, but with no result.
Paradoxically I had to take these steps:
I tried to move corosync to the 2nd ethernet NIC on pve1 and pve3 (another subnet)...no connection. Pve3 didn't connect to pve1, but pve1 connected to pve 2, 4 and 5! That should have been impossible because they were set to the 1st subnet, not the same one as pve1.
The next step was to edit the corosync conf on pve3 back to subnet 1, and all nodes in the cluster are online again.
I have a question...which corosync conf is the right one? There is one in /etc/pve and another one in /etc/corosync.
I edited the one in /etc/corosync...maybe this was "the miracle".
I'm going to try adding ring1_addr to the config...maybe with 2 IPs it will be ok.
 

spirit

I'm really frustrated after updating to Corosync 3. The cluster has been unstable for 2 months.
I have a 5 node cluster...2 nodes are on the latest 5 version, 3 nodes are on the latest 6 version.
have you already installed the latest libknet update (on both the proxmox5 nodes (with the corosync3 repo) and the proxmox6 nodes)?

I have a question...which corosync conf is the right one? There is one in /etc/pve and another one in /etc/corosync.
I edited the one in /etc/corosync...maybe this was "the miracle".
you need to edit /etc/pve/corosync.conf. when you save the file, each node copies it locally to /etc/corosync/corosync.conf, and then corosync reloads automatically.
but if your cluster is broken and you want to make a change in corosync.conf, maybe the best way is to edit /etc/pve/corosync.conf, manually copy the file to /etc/corosync/corosync.conf on each node (to be sure they all have the exact same version), and restart corosync manually.
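
One way to confirm that every node really ended up with the exact same file is to compare checksums. The sketch below assumes the contents of each node's local copy have already been collected somehow; the node names and file contents are hypothetical and the helper names are made up for illustration.

```python
import hashlib


def config_digests(configs: dict) -> dict:
    """Map node name -> SHA-256 hex digest of its corosync.conf contents."""
    return {node: hashlib.sha256(data).hexdigest() for node, data in configs.items()}


def all_identical(configs: dict) -> bool:
    """True when every node holds byte-for-byte the same file."""
    return len(set(config_digests(configs).values())) <= 1


# Hypothetical file contents, as if read from each node's local copy.
configs = {
    "pve1": b"totem {\n  config_version: 5\n}\n",
    "pve2": b"totem {\n  config_version: 5\n}\n",
    "pve3": b"totem {\n  config_version: 4\n}\n",
}
for node, digest in config_digests(configs).items():
    print(node, digest[:12])
print("identical:", all_identical(configs))
```

A node whose digest differs from the rest is the one still running an old copy.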
 

bofh

I'm really frustrated after updating to Corosync 3. The cluster has been unstable for 2 months.
I have a 5 node cluster...2 nodes are on the latest 5 version, 3 nodes are on the latest 6 version.
Nodes get disconnected randomly; the cluster stays online sometimes for a few hours, sometimes a few minutes or a few days. Yesterday pve1 didn't want to connect to the cluster. I tried restarting corosync and the node, but with no result.
Paradoxically I had to take these steps:
I tried to move corosync to the 2nd ethernet NIC on pve1 and pve3 (another subnet)...no connection. Pve3 didn't connect to pve1, but pve1 connected to pve 2, 4 and 5! That should have been impossible because they were set to the 1st subnet, not the same one as pve1.
The next step was to edit the corosync conf on pve3 back to subnet 1, and all nodes in the cluster are online again.
I have a question...which corosync conf is the right one? There is one in /etc/pve and another one in /etc/corosync.
I edited the one in /etc/corosync...maybe this was "the miracle".
I'm going to try adding ring1_addr to the config...maybe with 2 IPs it will be ok.
it's /etc/pve/corosync.conf; it will overwrite the copy in /etc/corosync and distribute it over the nodes

yeah, restarting the "faulty" node won't help you, you need to restart corosync on the other nodes.
it's paradoxical but this is how it works, for whatever reason.

as i wrote above, my hunch is that corosync's main issue is dealing with minor network problems (or maybe it even sees errors where there are none), after which it cannot recover. it seems like recovering from network errors doesn't work well on unicast, and we don't have multicast.
so node 1 fails and gets properly excluded from the cluster on node 1,
the other nodes can't deal with it properly; node 1 comes back up, but the other nodes can't recover and need a restart
 

Ivan Gersi

yeah, restarting the "faulty" node won't help you, you need to restart corosync on the other nodes.
it's paradoxical but this is how it works, for whatever reason.
That's not true...sometimes you have to restart the disconnected nodes, sometimes the connected ones.
I already have a conf with 2 rings on every node and I'm waiting for the 1st disconnect.
 

Jema

New Member
Jun 3, 2019
15
0
1
39
We also experience daily crashes of corosync3, and all nodes that crash are on the bnx2 driver.
We run a separate cluster network, so we don't rely on external networks. It also has its own dedicated switch.

Does proxmox already have a reason / solution for it? This has been ongoing since we moved to PVE 6.
 

Ivan Gersi

The reason is simple...corosync randomly loses the connection and doesn't reconnect. The solution is in the future. We must wait.
Edit: My 2-ring setup is not the fix..the cluster disconnected again.
 

brad_mssw

@Jema have you upgraded to the latest libknet and also tried the suggestion from @ahovda:

I've added options bnx2 disable_msi=1 to /etc/modprobe.d/bnx2.conf and rebooted all hosts (checked lspci -v for MSI-X: Enable- afterwards). So far no more corosync segfaults, but I have yet to reenable HA, I don't really trust it yet.
?

@ahovda is your cluster still stable now?
 
