[SOLVED] PVE 5.4-11 + Corosync 3.x: major issues

A little update on the second ring.

I now have:
eno2 - VLAN for ring0 (10 Gbit interface to the OVH vRack)
eno1 - bridge via tinc for ring1
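
For reference, a two-link setup like this ends up in corosync.conf roughly as in the sketch below (node names, IDs and addresses are placeholders, not the actual values):

Code:
totem {
  cluster_name: examplecluster
  version: 2
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}

nodelist {
  node {
    name: h3
    nodeid: 3
    quorum_votes: 1
    # link 0: VLAN on eno2 (OVH vRack)
    ring0_addr: 10.0.0.3
    # link 1: tinc bridge on eno1
    ring1_addr: 10.99.0.3
  }
  # ... one node entry per cluster member ...
}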

While most nodes report no errors, one particular problem node does report link downs, sometimes even on both rings, at least towards some hosts:

Code:
Sep 12 14:06:49 h3 corosync[19797]:   [KNET  ] link: host: 4 link: 0 is down
Sep 12 14:06:49 h3 corosync[19797]:   [KNET  ] link: host: 1 link: 0 is down
Sep 12 14:06:49 h3 corosync[19797]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Sep 12 14:06:49 h3 corosync[19797]:   [KNET  ] host: host: 4 has no active links
Sep 12 14:06:49 h3 corosync[19797]:   [KNET  ] host: host: 1 (passive) best link: 1 (pri: 1)
Sep 12 14:06:50 h3 corosync[19797]:   [KNET  ] rx: host: 4 link: 0 is up
Sep 12 14:06:50 h3 corosync[19797]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)

Now the strange part: the cause could very well be OVH and their much-praised virtual network, but the way those errors are handled is weird.

With only one ring, a link down would freeze the whole cluster, and it would not recover (a manual restart was required).
One would expect that with both rings down the result would be the same.

Well, it isn't. It recovers immediately, and no freezes were noticed (maybe there are some, but then only for a second, since the links recover immediately anyway).


So, to summarize, there are 3 basic issues to investigate:
- the cause of the problem (maybe driver, settings or network related, possibly different for each setup, maybe a corosync bug)
- corosync's handling of the issue (the symptom) and its ability, or inability, to recover
- even if corosync can recover over the second link, the question remains - especially for HA users - whether those short outages have consequences aside from spamming the log


Edit: I will test the behaviour over the weekend by replacing tinc with a simple 2nd VLAN on eno2 (see the sketch below). In theory this should give the same result as with only one ring, since by all logic any fault in the network chain (driver - interface - cabling - switches - whatever) should affect both rings at the very same time and act as if there were only one ring.
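
Rough sketch of the test (the VLAN ID and addresses below are made up, only the idea matters): bring up a second VLAN on eno2 with iproute2, point ring1 at it, and see whether the behaviour changes.

Code:
# hypothetical VLAN ID / addresses - adjust to your own vRack setup
ip link add link eno2 name eno2.4001 type vlan id 4001
ip addr add 10.98.0.3/24 dev eno2.4001
ip link set eno2.4001 up

# quick sanity check towards another node on the new VLAN
ping -c 10 10.98.0.4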

I'm starting to think that tinc may be uniquely able to mitigate errors because of its meshed nature,
or that corosync error-corrects differently with a second ring.
Of course this would only be a workaround of the root cause, but the root cause itself seems to be quite different for most people.
 
Another hang, which even breaks the NFS connection and produces a Linux kernel trace (syslog attached).
 

Attachments

  • syslog.txt (106.5 KB)
@ahovda is your cluster still stable now?
Unfortunately, I don't think it made a difference, I still have problems, and had to recover from a cluster meltdown yesterday. I have tried both intremap=off and disable_msi=1, both with no success -- which was expected, since all the posts suggesting those flags are for older kernel versions. There are timeout errors with the bnx2 driver, supposedly in combination with the chipset (these are all Dell PE[37]10 servers). Everything is patched up, software and firmware.

I get quite a few of these: "NETDEV WATCHDOG: eno2 (bnx2): transmit queue 0 timed out", which seems to indicate that the hardware does not respond to low-level PCI messages in time (or something like that). I've seen corosync trying to resend pending messages, counting 20, 30, 40 and boom: hardware timeout and it falls apart. Apparently it is not exclusive to bnx2 and I'm now suspecting it has to do with NIC bonding, https://bugzilla.kernel.org/show_bug.cgi?id=197325 . I have been using openvswitch with active-backup bonds, but I'll try Linux bonds instead.
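
For anyone wanting to try the same, a plain Linux active-backup bond can be brought up temporarily with iproute2 before making it permanent in /etc/network/interfaces (NIC names below are placeholders, and this will briefly take the NICs down):

Code:
# non-persistent test only - replace eno1/eno2 with your own NICs
ip link add bond0 type bond mode active-backup miimon 100
ip link set eno1 down
ip link set eno2 down
ip link set eno1 master bond0
ip link set eno2 master bond0
ip link set bond0 up

# verify which slave is currently active
cat /proc/net/bonding/bond0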

What makes it worse is the automatic host reboot due to HA (which I've had to disable for now) and that it messes with pmxcfs in a way that makes /etc/pve inaccessible due to timeouts on the fuse file system. When I try to SSH to the host, it can sometimes time out too, since the /etc/ssh/ssh_known_hosts file is sitting inside /etc/pve and sshd will try to read it. I could trigger the hardware fault on another cluster node in that state by issuing find /etc/pve which would freeze at some file, before eventually completing. Access to /etc/pve was super slow.

I had to go to each node, stop pve-cluster, restart corosync, watch for cluster join (corosync-quorumtool -m), and then start pve-cluster. Once I had done that on all hosts, it finally let go and access to /etc/pve was fast as normal again.
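
In shell form, that per-node sequence is roughly (a sketch of the steps above, run node by node, not a copy-paste script):

Code:
systemctl stop pve-cluster      # stop pmxcfs first
systemctl restart corosync
corosync-quorumtool -m          # watch until the node has rejoined and quorum is there, then Ctrl+C
systemctl start pve-cluster
pvecm status                    # double-check membership from the PVE side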

I'm digging through the logs.

HTH, Ådne
 
inaccessible due to timeouts on the fuse file system.

I had to go to each node, stop pve-cluster, restart corosync, watch for cluster join (corosync-quorumtool -m), and then start pve-cluster. Once I had done that on all hosts, it finally let go and access to /etc/pve was fast as normal again.

The question is why we get those timeouts on the FUSE filesystem.
Proxmox is supposed to make /etc/pve read-only on nodes that fall out of the cluster and keep it as-is on the other nodes.
But for me, the faulty node works fine, while all the other nodes (that are still joined and working) freeze.

That intended behaviour would at least mitigate the issue without the freezes.


For me, just restarting corosync solves it; I do not need to restart the whole cluster, as it comes back on its own. I restart every node to be safe, but usually the faulty node itself is enough. Is this different for HA clusters?
 
@ahovda ah interesting, we had major issues with Open vSwitch and finally switched away from it once PVE started supporting VLAN-aware native Linux bridges (and of course after we figured out how to use them properly; we had some issues breaking out VLAN interfaces for the host).

Anyhow, once we killed OVS everything stabilized, granted we're still on PVE 5.4. I wonder how many of the people having issues with corosync are using OVS.
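
For anyone making the same switch, a VLAN-aware native Linux bridge in /etc/network/interfaces looks roughly like the sketch below (bridge name, ports, VLAN ID and address are placeholders, not taken from our setup):

Code:
auto vmbr0
iface vmbr0 inet manual
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094

# "breaking out" a VLAN interface for the host itself
auto vmbr0.100
iface vmbr0.100 inet static
        address 192.0.2.10
        netmask 255.255.255.0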
 
Another observation: in my setups only nodes with no swap (ZFS as root and an NFS share as datastore) and vm.swappiness=0 in sysctl.conf are affected.

I do remember the unresolved issue with PVE 5.x where swap was being used by PVE processes even with vm.swappiness=0. Couldn't this be the case with corosync as well?
 
I have already tried everything... set the totem token to 10k, set up 2 rings for every node, an automatic corosync restart every 12 hours... and I'm still waiting for a miracle.
Corosync's behaviour is totally illogical.
E.g.:
My node1 didn't want to connect to the cluster this morning. Restarting corosync didn't help, but I got a new issue: when I restarted corosync on node2, node2 lost its connection and couldn't connect to node3, 4, 5 anymore. Node1 was offline too. But when I restarted corosync on node1 (offline the whole time), node2 suddenly connected! I've repeated this scenario several times, with the same results. It seems like node1 blocked node2 from connecting with 3, 4, 5.
Edit: When I restarted corosync on node3, node1 suddenly connected too and all nodes are online in the cluster.
2nd edit: Node5 is online via pvecm status but offline via the web UI (resolved by restarting the pve-cluster service).
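
For reference, the two workarounds mentioned at the top of this post usually look something like the sketch below (band-aids, not fixes):

Code:
# 1) raise the totem token timeout to 10s: add "token: 10000" inside the
#    totem {} section of /etc/pve/corosync.conf and bump config_version
# 2) restart corosync every 12 hours via cron
echo '0 */12 * * * root systemctl restart corosync' > /etc/cron.d/corosync-restart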
 
If you're using their vRack backend network, make sure it's not throttled to 100 Mbps - that's not enough for cluster ops plus any sort of backend traffic.

iperf is your friend

Sorry, but this is not correct: it is more than enough for corosync alone, just make sure you don't run your migrations over it.
Corosync itself uses hardly any bandwidth, it just needs very low latency.

On a 4-node cluster we see a stable 150 kbit/s down and up on that line.
So no, you do not need more than those 100 Mbit. I don't, since I have corosync separated and run the rest of the PVE traffic over the other network card.
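
If you want to check your own line, latency and packet loss matter far more to corosync than raw bandwidth; omping is handy for that (hostnames and addresses below are placeholders):

Code:
# run the same command on all nodes at roughly the same time
omping -c 600 -i 1 -q h1 h2 h3 h4

# plain round-trip latency to another node's ring address
ping -c 100 -i 0.2 10.0.0.4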
 
Another observation: in my setups only nodes with no swap (ZFS as root and an NFS share as datastore) and vm.swappiness=0 in sysctl.conf are affected.

I do remember the unresolved issue with PVE 5.x where swap was being used by PVE processes even with vm.swappiness=0. Couldn't this be the case with corosync as well?
swappiness=0 doesn't mean swap will not be used; it can still be used if the system runs out of memory.
What you can do to test this theory is set up zram swap and give it a bit of swappiness.
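
A minimal sketch of that test, assuming the zram module is available (size and priority are arbitrary):

Code:
modprobe zram
zramctl --find --size 1G       # allocates and prints a device, e.g. /dev/zram0
mkswap /dev/zram0
swapon -p 100 /dev/zram0       # higher priority than any disk-backed swap
sysctl vm.swappiness=10        # "a bit" of swappiness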
 
The problem is still present for me, but it evolved:

Code:
Sep 16 14:36:43 hv02 corosync[10181]: [KNET ] link: host: 5 link: 0 is down
Sep 16 14:36:43 hv02 corosync[10181]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Sep 16 14:36:43 hv02 corosync[10181]: [KNET ] host: host: 5 has no active links
Sep 16 14:36:43 hv02 corosync[10181]: [TOTEM ] Retransmit List: 255
Sep 16 14:36:43 hv02 corosync[10181]: [TOTEM ] Retransmit List: 255 256
Sep 16 14:36:43 hv02 corosync[10181]: [TOTEM ] Retransmit List: 255 256 257
Sep 16 14:36:43 hv02 corosync[10181]: [TOTEM ] Retransmit List: 255 256 257

and lots of retransmits
 
The problem is still present for me, but it evolved:

Sep 16 14:36:43 hv02 corosync[10181]: [KNET ] link: host: 5 link: 0 is down
Sep 16 14:36:43 hv02 corosync[10181]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Sep 16 14:36:43 hv02 corosync[10181]: [KNET ] host: host: 5 has no active links
Sep 16 14:36:43 hv02 corosync[10181]: [TOTEM ] Retransmit List: 255
Sep 16 14:36:43 hv02 corosync[10181]: [TOTEM ] Retransmit List: 255 256
Sep 16 14:36:43 hv02 corosync[10181]: [TOTEM ] Retransmit List: 255 256 257
Sep 16 14:36:43 hv02 corosync[10181]: [TOTEM ] Retransmit List: 255 256 257

and lots of retransmits

Yup, that's how it looks after a reconnect. Same here.
 
My node1 didn't want to connect to the cluster this morning. Restarting corosync didn't help, but I got a new issue: when I restarted corosync on node2, node2 lost its connection and couldn't connect to node3, 4, 5 anymore. Node1 was offline too. But when I restarted corosync on node1 (offline the whole time), node2 suddenly connected! I've repeated this scenario several times, with the same results. It seems like node1 blocked node2 from connecting with 3, 4, 5.
Edit: When I restarted corosync on node3, node1 suddenly connected too and all nodes are online in the cluster.
2nd edit: Node5 is online via pvecm status but offline via the web UI (resolved by restarting the pve-cluster service).

Yep, down nodes block running nodes, often in a weird way. How is your network set up?
 
The problem is still present for me, but it evolved:

Sep 16 14:36:43 hv02 corosync[10181]: [KNET ] link: host: 5 link: 0 is down
Sep 16 14:36:43 hv02 corosync[10181]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Sep 16 14:36:43 hv02 corosync[10181]: [KNET ] host: host: 5 has no active links
Sep 16 14:36:43 hv02 corosync[10181]: [TOTEM ] Retransmit List: 255
Sep 16 14:36:43 hv02 corosync[10181]: [TOTEM ] Retransmit List: 255 256
Sep 16 14:36:43 hv02 corosync[10181]: [TOTEM ] Retransmit List: 255 256 257
Sep 16 14:36:43 hv02 corosync[10181]: [TOTEM ] Retransmit List: 255 256 257

and lots of retransmits

Do you have anything in /var/log/kern.log or dmesg related to the NIC?
NIC model / driver?
Do you use OVS or a Linux bridge?
Do you host locally or with a public provider (OVH, Hetzner, ...)?


(Maybe a public Google Sheet to centralize all the different user configs could help to compare setups?)
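
To make comparing setups easier, the relevant information can be collected on each node with something like this (sketch only, NIC name is a placeholder):

Code:
# NIC model, driver and firmware
lspci -nnk | grep -A3 -i ethernet
ethtool -i eno1

# kernel messages around link flaps / driver resets
dmesg -T | grep -iE 'eno|bnx|link'

# bridge/bond layout and corosync's own view of the links
cat /etc/network/interfaces
corosync-cfgtool -s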
 
Maybe this libknet patch could help with one node blocking another node?

https://github.com/kronosnet/kronosnet/commit/0f67ee86745d52d68f376c92e96e1dd6661e9f5d

I don't think so; if anything it would raise more issues.
It does not address the complete lockup of running nodes, it just addresses when we see a timeout.

As far as I understand it:

Currently they measure the maximum ping at start and set it as the default, i.e. the worst-case latency, so it won't time out the next time it hits that ceiling.
The downside is that corosync itself has a bit more latency when transferring changes,
but that latency doesn't bother us at all.

However, once they use the average, even little spikes will cause timeouts, causing link downs and making things even worse.
Right now we seem to have a problem where once in a while there is a spike, or something else, or a corosync hiccup, and then an immediate lockup once all links are down.
So by lowering the expected ping value we would get an irrelevantly faster corosync but more link downs.


And we still don't know the real source of the issues, and secondly we don't know why nodes are locking up.
 
Looking at the knet code, I see the link-down message triggered in only 2 places (libknet/threads_heartbeat.c):
Code:
static void _handle_check_each(/* knet_h, dst_host, dst_link, ... */)
{
    ...
    if (dst_link->transport_connected == 0) {
        _link_down(knet_h, dst_host, dst_link);
        return;
    }
(I think this one fires when the node is really disconnected)

and later in the same function:

Code:
    timespec_diff(pong_last, clock_now, &diff_ping);
    if ((pong_last.tv_nsec) &&
        (diff_ping >= (dst_link->pong_timeout_adj * 1000llu))) {
        _link_down(knet_h, dst_host, dst_link);
    }
(and this one when the latency is too high)

I haven't checked all the code yet, but the previously mentioned knet patch changes the method used to calculate
pong_timeout_adj, so "maybe" it could help / change the current behaviour.
 
If someone wants to test, I have built libknet with 2 upstream patches:
https://github.com/kronosnet/kronosnet/commit/f45e4c67902b95bcd212275f5f6081fa31311793.patch
https://github.com/kronosnet/kronosnet/commit/0f67ee86745d52d68f376c92e96e1dd6661e9f5d.patch

The deb is here (it needs to be installed on all nodes (dpkg -i libknet1_1.11-pve1_amd64.deb) and corosync restarted after the update):
http://odisoweb1.odiso.net/libknet1_1.11-pve1_amd64.deb

I have kept the same version as Proxmox, so to roll back from the Proxmox repo: "apt reinstall libknet1"
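
After installing, it's worth double-checking that the patched package is actually the one in use before drawing conclusions (sketch):

Code:
dpkg -s libknet1 | grep -i version      # should show the locally built 1.11-pve1
systemctl restart corosync
journalctl -u corosync -n 20 --no-pager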
 
If someone wants to test, I have built libknet with 2 upstream patches:
https://github.com/kronosnet/kronosnet/commit/f45e4c67902b95bcd212275f5f6081fa31311793.patch
https://github.com/kronosnet/kronosnet/commit/0f67ee86745d52d68f376c92e96e1dd6661e9f5d.patch

The deb is here (it needs to be installed on all nodes (dpkg -i libknet1_1.11-pve1_amd64.deb) and corosync restarted after the update):
http://odisoweb1.odiso.net/libknet1_1.11-pve1_amd64.deb

I have kept the same version as Proxmox, so to roll back from the Proxmox repo: "apt reinstall libknet1"

I would test it, but do you really think it would help to reduce even further the time a pong has to arrive?
The patch lowers this value based on the average instead of the peak.

So before, if your max ping was 1.5 ms, you wouldn't time out on a pong at 1.5 ms. With the new patch you may time out sooner,
especially if your ping has a bit of jitter.
 