[SOLVED] PVE 5.4-11 + Corosync 3.x: major issues

A little update on the second ring.

I now have:
eno2 - VLAN for ring0 (10 Gbit interface to the OVH vRack)
eno1 - bridge via tinc for ring1
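
For reference, a two-link setup like this ends up in corosync.conf roughly as in the sketch below (node names, IDs and addresses are placeholders, not the actual values):

Code:
totem {
  cluster_name: examplecluster
  version: 2
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}

nodelist {
  node {
    name: h3
    nodeid: 3
    quorum_votes: 1
    # link 0: VLAN on eno2 (OVH vRack)
    ring0_addr: 10.0.0.3
    # link 1: tinc bridge on eno1
    ring1_addr: 10.99.0.3
  }
  # ... one node entry per cluster member ...
}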

While most nodes report no errors, one particular problem node does report link downs, sometimes even on both rings, at least towards some hosts:

Code:
Sep 12 14:06:49 h3 corosync[19797]:   [KNET  ] link: host: 4 link: 0 is down
Sep 12 14:06:49 h3 corosync[19797]:   [KNET  ] link: host: 1 link: 0 is down
Sep 12 14:06:49 h3 corosync[19797]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Sep 12 14:06:49 h3 corosync[19797]:   [KNET  ] host: host: 4 has no active links
Sep 12 14:06:49 h3 corosync[19797]:   [KNET  ] host: host: 1 (passive) best link: 1 (pri: 1)
Sep 12 14:06:50 h3 corosync[19797]:   [KNET  ] rx: host: 4 link: 0 is up
Sep 12 14:06:50 h3 corosync[19797]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)

Now the strange part: the cause could very well be OVH and their much-praised virtual network, but the way those errors are handled is weird.

With only one ring, a link down would freeze the whole cluster, and it would not recover (a manual restart was required).
One would expect that with both rings down the result would be the same.

Well, it isn't. It recovers immediately, and no freezes were noticed (maybe there are some, but then only for a second, since the links recover immediately anyway).


So, to summarize, there are 3 basic issues to investigate:
- the cause of the problem (maybe driver, settings or network related, possibly different for each setup, maybe a corosync bug)
- corosync's handling of the issue (the symptom) and its ability, or inability, to recover
- even if corosync can recover over the second link, the question remains - especially for HA users - whether those short outages have consequences aside from spamming the log


Edit: I will test the behaviour over the weekend by replacing tinc with a simple 2nd VLAN on eno2 (see the sketch below). In theory this should give the same result as with only one ring, since by all logic any fault in the network chain (driver - interface - cabling - switches - whatever) should affect both rings at the very same time and act as if there were only one ring.
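
Rough sketch of the test (the VLAN ID and addresses below are made up, only the idea matters): bring up a second VLAN on eno2 with iproute2, point ring1 at it, and see whether the behaviour changes.

Code:
# hypothetical VLAN ID / addresses - adjust to your own vRack setup
ip link add link eno2 name eno2.4001 type vlan id 4001
ip addr add 10.98.0.3/24 dev eno2.4001
ip link set eno2.4001 up

# quick sanity check towards another node on the new VLAN
ping -c 10 10.98.0.4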

I'm starting to think that tinc may be uniquely able to mitigate errors because of its meshed nature,
or that corosync error-corrects differently with a second ring.
Of course this would only be a workaround of the root cause, but the root cause itself seems to be quite different for most people.
 
Another hang, which even breaks the NFS connection and produces a Linux kernel trace (syslog attached).
 

Attachments

  • syslog.txt (106.5 KB)
@ahovda is your cluster still stable now?
Unfortunately, I don't think it made a difference, I still have problems, and had to recover from a cluster meltdown yesterday. I have tried both intremap=off and disable_msi=1, both with no success -- which was expected, since all the posts suggesting those flags are for older kernel versions. There are timeout errors with the bnx2 driver, supposedly in combination with the chipset (these are all Dell PE[37]10 servers). Everything is patched up, software and firmware.

I get quite a few of these: "NETDEV WATCHDOG: eno2 (bnx2): transmit queue 0 timed out", which seems to indicate that the hardware does not respond to low-level PCI messages in time (or something like that). I've seen corosync trying to resend pending messages, counting 20, 30, 40 and boom: hardware timeout and it falls apart. Apparently it is not exclusive to bnx2 and I'm now suspecting it has to do with NIC bonding, https://bugzilla.kernel.org/show_bug.cgi?id=197325 . I have been using openvswitch with active-backup bonds, but I'll try Linux bonds instead.
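
For anyone wanting to try the same, a plain Linux active-backup bond can be brought up temporarily with iproute2 before making it permanent in /etc/network/interfaces (NIC names below are placeholders, and this will briefly take the NICs down):

Code:
# non-persistent test only - replace eno1/eno2 with your own NICs
ip link add bond0 type bond mode active-backup miimon 100
ip link set eno1 down
ip link set eno2 down
ip link set eno1 master bond0
ip link set eno2 master bond0
ip link set bond0 up

# verify which slave is currently active
cat /proc/net/bonding/bond0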

What makes it worse is the automatic host reboot due to HA (which I've had to disable for now) and that it messes with pmxcfs in a way that makes /etc/pve inaccessible due to timeouts on the fuse file system. When I try to SSH to the host, it can sometimes time out too, since the /etc/ssh/ssh_known_hosts file is sitting inside /etc/pve and sshd will try to read it. I could trigger the hardware fault on another cluster node in that state by issuing find /etc/pve which would freeze at some file, before eventually completing. Access to /etc/pve was super slow.

I had to go to each node, stop pve-cluster, restart corosync, watch for cluster join (corosync-quorumtool -m), and then start pve-cluster. Once I had done that on all hosts, it finally let go and access to /etc/pve was fast as normal again.
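
In shell form, that per-node sequence is roughly (a sketch of the steps above, run node by node, not a copy-paste script):

Code:
systemctl stop pve-cluster      # stop pmxcfs first
systemctl restart corosync
corosync-quorumtool -m          # watch until the node has rejoined and quorum is there, then Ctrl+C
systemctl start pve-cluster
pvecm status                    # double-check membership from the PVE side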

I'm digging through the logs.

HTH, Ådne
 
inaccessible due to timeouts on the fuse file system.

I had to go to each node, stop pve-cluster, restart corosync, watch for cluster join (corosync-quorumtool -m), and then start pve-cluster. Once I had done that on all hosts, it finally let go and access to /etc/pve was fast as normal again.

The question is why we get those timeouts on the FUSE filesystem.
Proxmox is supposed to make /etc/pve read-only on nodes that fall out of the cluster and keep it as-is on the other nodes.
But for me, the faulty node works fine, while all the other nodes (that are still joined and working) freeze.

That intended behaviour would at least mitigate the issue without the freezes.


For me, just restarting corosync solves it; I do not need to restart the whole cluster, as it comes back on its own. I restart every node to be safe, but usually the faulty node itself is enough. Is this different for HA clusters?
 
@ahovda ah interesting, we had major issues with Open vSwitch and finally switched away from it once PVE started supporting VLAN-aware native Linux bridges (and of course after we figured out how to use them properly; we had some issues breaking out VLAN interfaces for the host).

Anyhow, once we killed OVS everything stabilized, granted we're still on PVE 5.4. I wonder how many of the people having issues with corosync are using OVS.
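
For anyone making the same switch, a VLAN-aware native Linux bridge in /etc/network/interfaces looks roughly like the sketch below (bridge name, ports, VLAN ID and address are placeholders, not taken from our setup):

Code:
auto vmbr0
iface vmbr0 inet manual
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094

# "breaking out" a VLAN interface for the host itself
auto vmbr0.100
iface vmbr0.100 inet static
        address 192.0.2.10
        netmask 255.255.255.0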
 
Another observation: in my setups only nodes with no swap (ZFS as root and an NFS share as datastore) and vm.swappiness=0 in sysctl.conf are affected.

I do remember the unresolved issue with PVE 5.x where swap was being used by PVE processes even with vm.swappiness=0. Couldn't this be the case with corosync as well?
 
I have already tried everything... set the totem token to 10k, set up 2 rings for every node, an automatic corosync restart every 12 hours... and I'm still waiting for a miracle.
Corosync's behaviour is totally illogical.
E.g.:
My node1 didn't want to connect to the cluster this morning. Restarting corosync didn't help, but I got a new issue: when I restarted corosync on node2, node2 lost its connection and couldn't connect to node3, 4, 5 anymore. Node1 was offline too. But when I restarted corosync on node1 (offline the whole time), node2 suddenly connected! I've repeated this scenario several times, with the same results. It seems like node1 blocked node2 from connecting with 3, 4, 5.
Edit: When I restarted corosync on node3, node1 suddenly connected too and all nodes are online in the cluster.
2nd edit: Node5 is online via pvecm status but offline via the web UI (resolved by restarting the pve-cluster service).
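
For reference, the two workarounds mentioned at the top of this post usually look something like the sketch below (band-aids, not fixes):

Code:
# 1) raise the totem token timeout to 10s: add "token: 10000" inside the
#    totem {} section of /etc/pve/corosync.conf and bump config_version
# 2) restart corosync every 12 hours via cron
echo '0 */12 * * * root systemctl restart corosync' > /etc/cron.d/corosync-restart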
 
If you're using their vRack backend network, make sure it's not throttled to 100 Mbps - that's not enough for cluster ops plus any sort of backend traffic.

iperf is your friend

Sorry, but this is not correct: it is more than enough for corosync alone, just make sure you don't run your migrations over it.
Corosync itself uses hardly any bandwidth, it just needs very low latency.

On a 4-node cluster we see a stable 150 kbit/s down and up on that line.
So no, you do not need more than those 100 Mbit. I don't, since I have corosync separated and run the rest of the PVE traffic over the other network card.
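
If you want to check your own line, latency and packet loss matter far more to corosync than raw bandwidth; omping is handy for that (hostnames and addresses below are placeholders):

Code:
# run the same command on all nodes at roughly the same time
omping -c 600 -i 1 -q h1 h2 h3 h4

# plain round-trip latency to another node's ring address
ping -c 100 -i 0.2 10.0.0.4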
 
Another observation: in my setups only nodes with no swap (ZFS as root and an NFS share as datastore) and vm.swappiness=0 in sysctl.conf are affected.

I do remember the unresolved issue with PVE 5.x where swap was being used by PVE processes even with vm.swappiness=0. Couldn't this be the case with corosync as well?
swappiness=0 doesn't mean swap will not be used; it can still be used if the system runs out of memory.
What you can do to test this theory is set up zram swap and give it a bit of swappiness.
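
A minimal sketch of that test, assuming the zram module is available (size and priority are arbitrary):

Code:
modprobe zram
zramctl --find --size 1G       # allocates and prints a device, e.g. /dev/zram0
mkswap /dev/zram0
swapon -p 100 /dev/zram0       # higher priority than any disk-backed swap
sysctl vm.swappiness=10        # "a bit" of swappiness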
 
The problem is still present for me, but it evolved:

Code:
Sep 16 14:36:43 hv02 corosync[10181]: [KNET ] link: host: 5 link: 0 is down
Sep 16 14:36:43 hv02 corosync[10181]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Sep 16 14:36:43 hv02 corosync[10181]: [KNET ] host: host: 5 has no active links
Sep 16 14:36:43 hv02 corosync[10181]: [TOTEM ] Retransmit List: 255
Sep 16 14:36:43 hv02 corosync[10181]: [TOTEM ] Retransmit List: 255 256
Sep 16 14:36:43 hv02 corosync[10181]: [TOTEM ] Retransmit List: 255 256 257
Sep 16 14:36:43 hv02 corosync[10181]: [TOTEM ] Retransmit List: 255 256 257

and lots of retransmits
 
The problem is still present for me, but it evolved:

Sep 16 14:36:43 hv02 corosync[10181]: [KNET ] link: host: 5 link: 0 is down
Sep 16 14:36:43 hv02 corosync[10181]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Sep 16 14:36:43 hv02 corosync[10181]: [KNET ] host: host: 5 has no active links
Sep 16 14:36:43 hv02 corosync[10181]: [TOTEM ] Retransmit List: 255
Sep 16 14:36:43 hv02 corosync[10181]: [TOTEM ] Retransmit List: 255 256
Sep 16 14:36:43 hv02 corosync[10181]: [TOTEM ] Retransmit List: 255 256 257
Sep 16 14:36:43 hv02 corosync[10181]: [TOTEM ] Retransmit List: 255 256 257

and lots of retransmits

Yup, that's how it looks after a reconnect. Same here.
 
My node1 didn't want to connect to the cluster this morning. Restarting corosync didn't help, but I got a new issue: when I restarted corosync on node2, node2 lost its connection and couldn't connect to node3, 4, 5 anymore. Node1 was offline too. But when I restarted corosync on node1 (offline the whole time), node2 suddenly connected! I've repeated this scenario several times, with the same results. It seems like node1 blocked node2 from connecting with 3, 4, 5.
Edit: When I restarted corosync on node3, node1 suddenly connected too and all nodes are online in the cluster.
2nd edit: Node5 is online via pvecm status but offline via the web UI (resolved by restarting the pve-cluster service).

Yep, down nodes block running nodes, often in a weird way. How is your network set up?
 
The problem is still present for me, but it evolved:

Sep 16 14:36:43 hv02 corosync[10181]: [KNET ] link: host: 5 link: 0 is down
Sep 16 14:36:43 hv02 corosync[10181]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Sep 16 14:36:43 hv02 corosync[10181]: [KNET ] host: host: 5 has no active links
Sep 16 14:36:43 hv02 corosync[10181]: [TOTEM ] Retransmit List: 255
Sep 16 14:36:43 hv02 corosync[10181]: [TOTEM ] Retransmit List: 255 256
Sep 16 14:36:43 hv02 corosync[10181]: [TOTEM ] Retransmit List: 255 256 257
Sep 16 14:36:43 hv02 corosync[10181]: [TOTEM ] Retransmit List: 255 256 257

and lots of retransmits

Do you have anything in /var/log/kern.log or dmesg related to the NIC?
NIC model / driver?
Do you use OVS or a Linux bridge?
Do you host locally or with a public provider (OVH, Hetzner, ...)?


(Maybe a public Google Sheet to centralize all the different user configs could help to compare setups?)
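
To make comparing setups easier, the relevant information can be collected on each node with something like this (sketch only, NIC name is a placeholder):

Code:
# NIC model, driver and firmware
lspci -nnk | grep -A3 -i ethernet
ethtool -i eno1

# kernel messages around link flaps / driver resets
dmesg -T | grep -iE 'eno|bnx|link'

# bridge/bond layout and corosync's own view of the links
cat /etc/network/interfaces
corosync-cfgtool -s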
 
Maybe this libknet patch could help with one node blocking another node?

https://github.com/kronosnet/kronosnet/commit/0f67ee86745d52d68f376c92e96e1dd6661e9f5d

I don't think so; if anything it would raise more issues.
It does not address the complete lockup of running nodes, it just addresses when we see a timeout.

As far as I understand it:

Currently they measure the maximum ping at start and set it as the default, i.e. the worst-case latency, so it won't time out the next time it hits that ceiling.
The downside is that corosync itself has a bit more latency when transferring changes,
but that latency doesn't bother us at all.

However, once they use the average, even little spikes will cause timeouts, causing link downs and making things even worse.
Right now we seem to have a problem where once in a while there is a spike, or something else, or a corosync hiccup, and then an immediate lockup once all links are down.
So by lowering the expected ping value we would get an irrelevantly faster corosync but more link downs.


And we still don't know the real source of the issues, and secondly we don't know why nodes are locking up.
 
Looking at the knet code, I see the link-down message triggered in only 2 places (libknet/threads_heartbeat.c):
Code:
static void _handle_check_each(/* knet_h, dst_host, dst_link, ... */)
{
    ...
    if (dst_link->transport_connected == 0) {
        _link_down(knet_h, dst_host, dst_link);
        return;
    }
(I think this one fires when the node is really disconnected)

and later in the same function:

Code:
    timespec_diff(pong_last, clock_now, &diff_ping);
    if ((pong_last.tv_nsec) &&
        (diff_ping >= (dst_link->pong_timeout_adj * 1000llu))) {
        _link_down(knet_h, dst_host, dst_link);
    }
(and this one when the latency is too high)

I haven't checked all the code yet, but the previously mentioned knet patch changes the method used to calculate
pong_timeout_adj, so "maybe" it could help / change the current behaviour.
 
If someone wants to test, I have built libknet with 2 upstream patches:
https://github.com/kronosnet/kronosnet/commit/f45e4c67902b95bcd212275f5f6081fa31311793.patch
https://github.com/kronosnet/kronosnet/commit/0f67ee86745d52d68f376c92e96e1dd6661e9f5d.patch

The deb is here (it needs to be installed on all nodes (dpkg -i libknet1_1.11-pve1_amd64.deb) and corosync restarted after the update):
http://odisoweb1.odiso.net/libknet1_1.11-pve1_amd64.deb

I have kept the same version as Proxmox, so to roll back from the Proxmox repo: "apt reinstall libknet1"
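
After installing, it's worth double-checking that the patched package is actually the one in use before drawing conclusions (sketch):

Code:
dpkg -s libknet1 | grep -i version      # should show the locally built 1.11-pve1
systemctl restart corosync
journalctl -u corosync -n 20 --no-pager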
 
If someone wants to test, I have built libknet with 2 upstream patches:
https://github.com/kronosnet/kronosnet/commit/f45e4c67902b95bcd212275f5f6081fa31311793.patch
https://github.com/kronosnet/kronosnet/commit/0f67ee86745d52d68f376c92e96e1dd6661e9f5d.patch

The deb is here (it needs to be installed on all nodes (dpkg -i libknet1_1.11-pve1_amd64.deb) and corosync restarted after the update):
http://odisoweb1.odiso.net/libknet1_1.11-pve1_amd64.deb

I have kept the same version as Proxmox, so to roll back from the Proxmox repo: "apt reinstall libknet1"

I would test it, but do you really think it would help to reduce even further the time a pong has to arrive?
The patch lowers this value based on the average instead of the peak.

So before, if your max ping was 1.5 ms, you wouldn't time out on a pong at 1.5 ms. With the new patch you may time out sooner,
especially if your ping has a bit of jitter.
 