[SOLVED] PVE 5.4-11 + Corosync 3.x: major issues

spirit · Sep 8, 2019

David Herselman said:
Our small HP system cluster, which has bnx2 NICs, is the only one still experiencing regular problems...

do you see any bnx2 errors in kern.log or dmesg?

fabian · Sep 9, 2019

Robert.H said:
Is this a general issue, or do only certain hw/sw combinations see it?
And what about a fresh install of V6, does it have issues with corosync v3?

only certain hardware/environment factors trigger it. all our testlab and production clusters have been running stable with Corosync 3.x and knet for quite a while (obviously, we use our own products and don't ship stuff that we know does not work..), AFAIK @spirit also has been running larger installations for quite a while without issues.

the pmtud issues should hopefully all be fixed now with the latest knet (1.11). in parallel we also fixed some unrelated issues in our cluster file system (and reworked some parts to make debugging / analysis easier). the corosync crashes are currently being analyzed using the coredumps provided by affected users.

spirit · Sep 9, 2019

fabian said:
AFAIK @spirit also has been running larger installations for quite a while without issues.

yes, indeed, I've been running the corosync3 beta for 6 months (on proxmox 5 with kernel 4.15), with no problems so far. (I'm using Mellanox ConnectX-4 cards with 2x10gb lacp bonding.) The cluster has 16 nodes, without any special tuning of the corosync configuration.

brad_mssw · Sep 9, 2019

our new cluster will be running Mellanox ConnectX-4 as well, except 2x50Gbps ... so it's good to know that mellanox has been stable for a long time.

David Herselman · Sep 10, 2019

spirit said:
do you see bnx2 error in kern.log or dmesg ?

Nothing at all, only the usual boot time initialisation messages.

The QLogic firmware might be newer/older than others':

Code:

[root@kvm1 ~]# ethtool -i eth0
driver: bnx2
version: 2.2.6
firmware-version: bc 5.2.3 NCSI 2.0.6
expansion-rom-version:
bus-info: 0000:03:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

bofh · Sep 10, 2019

kernel 5.0.21-1-pve
Cluster: 4 identical nodes, 2 network cards each
all fresh installs of pve6

network cards
driver: ixgbe
version: 5.1.0-k
firmware-version: 0x80000aee, 1.1927.0
expansion-rom-version:
bus-info: 0000:03:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

eno2 is dedicated to corosync - runs over vrack at ovh (i know i know)
it was stable with 4 nodes and tinc for about a week
then it was stable for 2 weeks with 3 nodes and using vrack/second nic instead of tinc

since adding node 4 and the second nic, corosync crashes the whole thing and reports mtu changes

now the weird thing is the symptoms
nodes 1,2,4 report node 3 missing
node 3 reports being single again

but nodes 1,2,4 are now unresponsive for everything proxmox related. pvecm takes 10-20 seconds, qm list even freezes
node 3 is responsive.
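
For reference, the split described here can be checked against the usual majority rule. This is only a minimal sketch, assuming one vote per node and a plain majority (floor(expected_votes / 2) + 1); the function names are hypothetical and not part of any corosync or pve API:

```python
def quorum_threshold(expected_votes: int) -> int:
    """Votes a partition needs to be quorate (simple majority)."""
    return expected_votes // 2 + 1


def is_quorate(visible_votes: int, expected_votes: int) -> bool:
    """True if the partition sees at least a majority of all votes."""
    return visible_votes >= quorum_threshold(expected_votes)


# The 4-node cluster from this post: nodes 1,2,4 see each other, node 3 is alone.
print(is_quorate(3, 4))  # partition with nodes 1,2,4
print(is_quorate(1, 4))  # node 3 on its own
```

Under these assumptions the 1,2,4 partition should still hold quorum and node 3 should not, which is the opposite of which side stays responsive here.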

attempt to resolve #1
restarting corosync on nodes 1,2,4 puts each of them into single mode, the cluster is not formed again
when you restart corosync on node 3 the cluster is back up immediately.

attempt to resolve #2
next crash, same symptoms
this time we restart corosync on node 3 alone - issue resolved, cluster back up (i still restart corosync on each node just to be safe)

since my ability to change hardware is limited (none) and i have no choice but ovh, i have set up a backup ring via tinc
the backup ring works, so let's see if at least the freezes stop

edit: no errors in dmesg about the network card

spirit · Sep 10, 2019

bofh said:
kernel 5.0.21-1-pve

eno2 is dedicated to corosync - runs over vrack at ovh (i know i know)
it was stable with 4 nodes and tinc for about a week
then it was stable for 2 weeks with 3 nodes and using vrack/second nic instead of tinc

since adding node 4 and the second nic, corosync crashes the whole thing and reports mtu changes

do you have the corosync logs?
# pveversion -v ?

bofh · Sep 10, 2019

spirit said:
do you have the corosync logs?
# pveversion -v ?

All nodes run the same version:

proxmox-ve: 6.0-2 (running kernel: 5.0.21-1-pve)
pve-manager: 6.0-7 (running version: 6.0-7/28984024)
pve-kernel-5.0: 6.0-7
pve-kernel-helper: 6.0-7
pve-kernel-5.0.21-1-pve: 5.0.21-2
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.11-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-4
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-8
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-64
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-7
pve-cluster: 6.0-7
pve-container: 3.0-7
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-3
pve-qemu-kvm: 4.0.0-5
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-7
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve2

Attached logs are, as described, from nodes 1,2,3,4
I shortened some logs and cut out the retransmission lists; i left the first 3 and last 3 entries of each chunk to keep the timestamps in case they're interesting

at the same time there are no errors in the system logs/dmesg about any network related issue.
i don't have network monitoring on (yet); i hope to have smokeping and alerting running by the end of the week

spirit · Sep 11, 2019

@bofh
looking at the logs, I'm seeing
"Process pause detected for xxx ms, flushing membership messages."

in the corosync code, this message comes from

Code:

https://github.com/corosync/corosync/blob/master/exec/totemsrp.c

static int pause_flush (struct totemsrp_instance *instance)
{
...
    if ((now_msec - timestamp_msec) > (instance->totem_config->token_timeout / 2)) {
        log_printf (instance->totemsrp_log_level_notice,
            "Process pause detected for %d ms, flushing membership messages.", (unsigned int)(now_msec - timestamp_msec));

So, I think you can try to configure a higher token timeout in corosync.conf
maybe try with 10s:

token: 10000
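
To put numbers on the check in the excerpt above: the message is logged once the measured pause exceeds half the token timeout. A minimal sketch of just that comparison, assuming the excerpted condition is the whole story (the helper names are hypothetical, and any adjustment corosync makes to the effective timeout is ignored):

```python
def pause_threshold_ms(token_timeout_ms: int) -> int:
    """Pause length above which the excerpted check logs the message."""
    # Integer division, as in the C expression token_timeout / 2.
    return token_timeout_ms // 2


def pause_detected(pause_ms: int, token_timeout_ms: int) -> bool:
    """Mirror of the comparison in pause_flush(): pause > token_timeout / 2."""
    return pause_ms > pause_threshold_ms(token_timeout_ms)


# With token: 10000 a pause has to exceed 5 seconds before it is reported.
print(pause_threshold_ms(10000))
print(pause_detected(3000, 10000))
```

The token value itself goes in the totem section of corosync.conf.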

Whatever · Sep 11, 2019

Could the problem be related to jumbo frames and/or dual ring configuration?
I'm facing the same issue - corosync randomly hangs on different nodes.
I've two rings 10Gbe + 1Gbe with mtu = 9000 on both nets

ahovda · Sep 11, 2019

I found I had a flakey LACP bond between two switches. Rebooted the switches and that seems resolved. I've added options bnx2 disable_msi=1 to /etc/modprobe.d/bnx2.conf and rebooted all hosts (checked lspci -v for MSI-X: Enable- afterwards). So far no more corosync segfaults, but I have yet to reenable HA, I don't really trust it yet.
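
The lspci check mentioned here can be scripted across hosts. A minimal sketch, assuming the capability lines in the `lspci -v` output look like `MSI-X: Enable- Count=9 Masked-`; the sample text is made up for illustration and the function name is hypothetical:

```python
import re

# Matches capability entries such as "MSI-X: Enable-" or "MSI: Enable+".
_MSI_RE = re.compile(r"\b(MSI(?:-X)?): Enable([+-])")


def msi_states(lspci_text: str) -> dict:
    """Map each capability found ('MSI', 'MSI-X') to True if it is enabled."""
    return {name: flag == "+" for name, flag in _MSI_RE.findall(lspci_text)}


# Made-up sample resembling one device section of `lspci -v` output.
sample = """\
03:00.0 Ethernet controller: (hypothetical bnx2 device)
\tCapabilities: [58] MSI: Enable- Count=1/16 Maskable- 64bit+
\tCapabilities: [a0] MSI-X: Enable- Count=9 Masked-
\tKernel driver in use: bnx2
"""
print(msi_states(sample))
```

Feeding it the real output for the NIC's bus address shows at a glance whether the module option took effect after the reboot.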

Menno · Sep 11, 2019

Whatever said:
Could the problem be related to jumbo frames and/or dual ring configuration?
I'm facing the same issue - corosync randomly hangs on different nodes.
I've two rings 10Gbe + 1Gbe with mtu = 9000 on both nets

I'd very much like to know too; I'm having this issue as well and am about to downgrade 10 nodes to Proxmox VE 5 because of it.

Also a dual ring configuration, one ring with mtu 1500 and one with mtu 9000.

bofh · Sep 11, 2019

spirit said:
@bofh

So, I think you can try to configure a higher token timeout in corosync.conf
maybe try with 10s:

token: 10000

thanks, will try.

little update on this. since i do not trust OVH or their ability to provide virtual networks, i set up a second ring on tinc.

now it's much more stable. today there was NO corosync log entry on 2 of the nodes.
weirdly enough, while 2 nodes don't report anything at all for today, the 2 other nodes do
if there's a network issue i would have expected every node to report it
e.g. node 3 reports that link0 to node 1 is down, but node 1 says nothing at all.

now my best guess, without knowing the code, would be the following
- we all might have different causes for those issues but the same symptoms
- corosync can't handle network errors properly
- corosync does not even seem able to reliably detect network errors or issues
i guess they still rely heavily on multicast in their error handling and detection, even though we only have unicast now
- pve doesn't handle such errors well either, resulting in freezes even when quorum still exists
- we should not rely on or trust corosync to report errors; it might or might not.

i also had an issue on another cluster with 2 nodes and no HA.
one of the machines crashed (likely a hardware issue) and was unresponsive to anything (not even ssh) but still answered pings and seemed to respond to corosync

the result was that node 1, which was working properly, also froze pve completely.
oddly enough, even after a reboot of node 1 (i didn't look at node 2 at that time; silly me, why check the freezing node when you should check the others)
node 1 froze immediately once corosync came online

so as long as the faulty node 2 was online, node 1 was unusable. i could not even run qm list or authenticate on the web interface.

regardless of what network issues we have, or how our nodes are crashing, a faulty node should never take everyone else down with it.
that might be caused by corosync, but pve also messed up here in a big way

oddly enough, while i still have said network issues and 2 nodes of the 4 machine cluster still report links down,
i no longer get total freezes or have to restart corosync manually. the second ring (even though it also reports down sometimes) did at least solve the issue with pve, and the cluster now seems to handle those errors more resiliently

Ivan Gersi · Sep 12, 2019

I'm really frustrated with the update to Corosync 3. The cluster has been unstable for 2 months.
I have a 5 node cluster...2 nodes are on the latest version 5, 3 nodes are on the latest version 6.
Nodes disconnect randomly; the cluster stays online sometimes for a few hours, sometimes a few minutes or a few days. Yesterday pve1 would not connect to the cluster. I tried restarting corosync and the node, but with no result.
Paradoxically, I had to take these steps:
I tried to run corosync over the 2nd ethernet NIC on pve1 and pve3 (another subnet)...no connection. Pve3 didn't connect to pve1, but pve1 connected to pve 2, 4 and 5! That should have been impossible, because they were set to the 1st subnet, not the same one as pve1.
The next step was to edit the corosync conf to use subnet 1 on pve3, and all nodes in the cluster are online again.
I have a question...which corosync conf is the right one? There is one in /etc/pve and another one in /etc/corosync.
I edited the one in /etc/corosync...maybe this was "the miracle".
I'm going to try adding ring1_addr to the config...maybe with 2 IPs it will be ok.

spirit · Sep 12, 2019

Ivan Gersi said:
I'm really frustrated with the update to Corosync 3. The cluster has been unstable for 2 months.
I have a 5 node cluster...2 nodes are on the latest version 5, 3 nodes are on the latest version 6.

have you already installed the latest libknet update (on both the proxmox5 (with corosync3 repo) and proxmox6 nodes)?

I have a question...which corosync conf is the right one? There is one in /etc/pve and another one in /etc/corosync.
I edited the one in /etc/corosync...maybe this was "the miracle".

you need to edit /etc/pve/corosync.conf; when you save the file, each node copies it locally to /etc/corosync/corosync.conf and then reloads corosync automatically.
but if your cluster is broken and you want to make a change in corosync.conf, maybe the best way is to edit /etc/pve/corosync.conf, manually copy the file to each node (to be sure they all have exactly the same version), and restart corosync manually.
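
One detail worth adding when editing the file by hand: the config_version value in the totem section should be increased with every change. A minimal sketch of that bump on the file's text, assuming the usual `config_version: N` line format (the helper name and the sample config are hypothetical):

```python
import re

# One "config_version: N" line, with optional leading/trailing spaces or tabs.
_VERSION_RE = re.compile(r"^([ \t]*config_version:[ \t]*)(\d+)[ \t]*$", re.MULTILINE)


def bump_config_version(conf_text: str) -> str:
    """Return conf_text with its config_version incremented by one."""
    def _inc(match):
        return f"{match.group(1)}{int(match.group(2)) + 1}"

    new_text, count = _VERSION_RE.subn(_inc, conf_text, count=1)
    if count == 0:
        raise ValueError("no config_version line found")
    return new_text


# Hypothetical, heavily shortened totem section.
sample = """\
totem {
  cluster_name: example
  config_version: 7
  token: 10000
}
"""
print(bump_config_version(sample))
```

The function only transforms text; writing the result back to /etc/pve/corosync.conf (ideally after keeping a backup copy) is left to the admin.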

bofh · Sep 12, 2019

Ivan Gersi said:
I'm really frustrated with the update to Corosync 3. The cluster has been unstable for 2 months.
I have a 5 node cluster...2 nodes are on the latest version 5, 3 nodes are on the latest version 6.
Nodes disconnect randomly; the cluster stays online sometimes for a few hours, sometimes a few minutes or a few days. Yesterday pve1 would not connect to the cluster. I tried restarting corosync and the node, but with no result.
Paradoxically, I had to take these steps:
I tried to run corosync over the 2nd ethernet NIC on pve1 and pve3 (another subnet)...no connection. Pve3 didn't connect to pve1, but pve1 connected to pve 2, 4 and 5! That should have been impossible, because they were set to the 1st subnet, not the same one as pve1.
The next step was to edit the corosync conf to use subnet 1 on pve3, and all nodes in the cluster are online again.
I have a question...which corosync conf is the right one? There is one in /etc/pve and another one in /etc/corosync.
I edited the one in /etc/corosync...maybe this was "the miracle".
I'm going to try adding ring1_addr to the config...maybe with 2 IPs it will be ok.

it's /etc/pve/corosync.conf; it will rewrite the one in /etc/corosync and distribute it over the nodes

yeah, restarting the "faulty" node won't help you, you need to restart corosync on the other nodes.
it's paradoxical but this is how it works, for whatever reason.

as i wrote above, my hunch is that corosync's main issue is that it can't deal with minor network issues (or maybe even sees errors where there are none) and then cannot recover. it seems that recovery from network errors doesn't work well over unicast, and we don't have multicast.
so node 1 fails and gets properly excluded from the cluster on node 1,
the other nodes can't deal with it properly; node 1 comes back up, but the other nodes can't recover and need a restart

Ivan Gersi · Sep 12, 2019

bofh said:
yeah, restarting the "faulty" node won't help you, you need to restart corosync on the other nodes.
it's paradoxical but this is how it works, for whatever reason.

That's not true...sometimes you have to restart the disconnected nodes, sometimes the connected ones.
I already have a conf with 2 rings on every node and I'm waiting for the 1st disconnect.

Jema · Sep 12, 2019

We also experience daily corosync3 crashes, and all nodes that crash use the bnx2 driver.
We run a separate cluster network, so we don't rely on external networks. It also has its own dedicated switch.

Has proxmox already found a cause / solution for it? This has been ongoing since we moved to PVE 6.

Ivan Gersi · Sep 12, 2019

The reason is simple...corosync randomly loses the connection and doesn't reconnect. A solution is still to come. We must wait.
Edit: My 2 ring setup is not the fix either..the cluster disconnected again.

brad_mssw · Sep 12, 2019

@Jema have you upgraded to the latest libknet and also tried the suggestion from @ahovda:

ahovda said:
I've added options bnx2 disable_msi=1 to /etc/modprobe.d/bnx2.conf and rebooted all hosts (checked lspci -v for MSI-X: Enable- afterwards). So far no more corosync segfaults, but I have yet to reenable HA, I don't really trust it yet.

?

@ahovda is your cluster still stable now?
