[SOLVED] PVE 5.4-11 + Corosync 3.x: major issues

bofh · Sep 20, 2019

The cluster finally broke today. Installed the patch, praying for the best.

spirit · Sep 20, 2019

bofh said:
The cluster finally broke today. Installed the patch, praying for the best.

Don't forget to restart corosync after the libknet upgrade.

bofh · Sep 20, 2019

spirit said:
Don't forget to restart corosync after the libknet upgrade.

lol ofc...

I also deactivated debugging for now, along with the second ring, so it now runs purely on the vrack ring.
That was the most unstable configuration so far, let's see.

spirit · Sep 20, 2019

bofh said:
lol ofc...

I also deactivated debugging for now, along with the second ring, so it now runs purely on the vrack ring.
That was the most unstable configuration so far, let's see.

Just curious, do you use the free vrack offer at OVH (I think it's 100 Mbit/s but not guaranteed), or the paid option with 1 Gbit/s guaranteed?

bofh · Sep 20, 2019

The free one.
The reason is simply that we wanted to leave corosync alone on a dedicated link, so a gigabit wouldn't help me at all anyway.
Also, 25€ extra for a 100€ server is kind of ridiculous.

Oh, btw, they DON'T guarantee anything on the 1 Gbit vrack either. What you maybe mean is the public interface.
There you get 1 Gbit for free, and 1 Gbit guaranteed for 100€ more.

Edit: the issues don't seem to be saturation issues on that vrack either. It's more like spikes lasting a fraction of a second, several times a day,
usually at very comparable times. So my best guess would be port resets there, which may have the same effect as saturation would.

Our line itself is squeaky clean: a stable couple of kbit/s, no traffic spikes.

David Herselman · Sep 20, 2019

The following could perhaps be a separate but related bug. These logs are from a remaining node when one of the nodes dropped off. Cluster membership updates successfully to 5/6, but notification messages then start escalating, with process pauses being reported. It eventually settles, but I'm sure the 'token: 10000' setting saved us from an escalating node fencing scenario...

The following messages are increasingly rapidly received, starting at 09:26:21:

Code:

Sep 20 09:26:27 kvm5b corosync[677695]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep 20 09:26:27 kvm5b corosync[677695]:   [TOTEM ] A new membership (2:604) was formed. Members
Sep 20 09:26:27 kvm5b corosync[677695]:   [CPG   ] downlist left_list: 0 received
Sep 20 09:26:27 kvm5b corosync[677695]:   [CPG   ] downlist left_list: 0 received
Sep 20 09:26:27 kvm5b corosync[677695]:   [CPG   ] downlist left_list: 0 received
Sep 20 09:26:27 kvm5b corosync[677695]:   [CPG   ] downlist left_list: 0 received
Sep 20 09:26:27 kvm5b corosync[677695]:   [CPG   ] downlist left_list: 0 received
Sep 20 09:26:27 kvm5b corosync[677695]:   [QUORUM] Members[5]: 2 3 4 5 6

Processes get overloaded:

Code:

Sep 20 09:26:27 kvm5b corosync[677695]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep 20 09:26:27 kvm5b corosync[677695]:   [TOTEM ] Process pause detected for 6347 ms, flushing membership messages.
Sep 20 09:26:27 kvm5b corosync[677695]:   [TOTEM ] Process pause detected for 6347 ms, flushing membership messages.
Sep 20 09:26:27 kvm5b corosync[677695]:   [TOTEM ] Process pause detected for 6347 ms, flushing membership messages.
Sep 20 09:26:27 kvm5b corosync[677695]:   [TOTEM ] Process pause detected for 6347 ms, flushing membership messages.
Sep 20 09:26:27 kvm5b corosync[677695]:   [TOTEM ] Process pause detected for 6397 ms, flushing membership messages.
Sep 20 09:26:27 kvm5b corosync[677695]:   [TOTEM ] Process pause detected for 6397 ms, flushing membership messages.
Sep 20 09:26:27 kvm5b corosync[677695]:   [TOTEM ] Process pause detected for 6397 ms, flushing membership messages.
Sep 20 09:26:27 kvm5b corosync[677695]:   [TOTEM ] Process pause detected for 6397 ms, flushing membership messages.
Sep 20 09:26:27 kvm5b corosync[677695]:   [TOTEM ] Process pause detected for 6397 ms, flushing membership messages.
Sep 20 09:26:27 kvm5b corosync[677695]:   [TOTEM ] Process pause detected for 6397 ms, flushing membership messages.
Sep 20 09:26:27 kvm5b corosync[677695]:   [TOTEM ] Process pause detected for 6397 ms, flushing membership messages.
Sep 20 09:26:27 kvm5b corosync[677695]:   [TOTEM ] Process pause detected for 6397 ms, flushing membership messages.
Sep 20 09:26:27 kvm5b corosync[677695]:   [TOTEM ] Process pause detected for 6397 ms, flushing membership messages.
Sep 20 09:26:27 kvm5b corosync[677695]:   [TOTEM ] Process pause detected for 6397 ms, flushing membership messages.
Sep 20 09:26:27 kvm5b corosync[677695]:   [TOTEM ] A new membership (2:608) was formed. Members
Sep 20 09:26:27 kvm5b corosync[677695]:   [CPG   ] downlist left_list: 0 received
Sep 20 09:26:27 kvm5b corosync[677695]:   [CPG   ] downlist left_list: 0 received
Sep 20 09:26:27 kvm5b corosync[677695]:   [CPG   ] downlist left_list: 0 received
Sep 20 09:26:27 kvm5b corosync[677695]:   [CPG   ] downlist left_list: 0 received
Sep 20 09:26:27 kvm5b corosync[677695]:   [CPG   ] downlist left_list: 0 received
Sep 20 09:26:27 kvm5b corosync[677695]:   [QUORUM] Members[5]: 2 3 4 5 6

Recovers 4 seconds later:

Code:

Sep 20 09:26:31 kvm5b corosync[677695]:   [TOTEM ] Process pause detected for 10549 ms, flushing membership messages.
Sep 20 09:26:31 kvm5b corosync[677695]:   [TOTEM ] Process pause detected for 10549 ms, flushing membership messages.
Sep 20 09:26:31 kvm5b corosync[677695]:   [TOTEM ] Process pause detected for 10551 ms, flushing membership messages.
Sep 20 09:26:32 kvm5b corosync[677695]:   [TOTEM ] A new membership (2:656) was formed. Members
Sep 20 09:26:32 kvm5b corosync[677695]:   [CPG   ] downlist left_list: 0 received
Sep 20 09:26:32 kvm5b corosync[677695]:   [CPG   ] downlist left_list: 0 received
Sep 20 09:26:32 kvm5b corosync[677695]:   [CPG   ] downlist left_list: 0 received
Sep 20 09:26:32 kvm5b corosync[677695]:   [CPG   ] downlist left_list: 0 received
Sep 20 09:26:32 kvm5b corosync[677695]:   [CPG   ] downlist left_list: 0 received
Sep 20 09:26:32 kvm5b corosync[677695]:   [QUORUM] Members[5]: 2 3 4 5 6

Continues looping until it finally settles another minute later. Node 1 rejoined cluster successfully when it finished restarting:

Code:

Sep 20 09:27:35 kvm5b corosync[677695]:   [TOTEM ] A new membership (2:816) was formed. Members left: 1
Sep 20 09:27:35 kvm5b corosync[677695]:   [TOTEM ] Failed to receive the leave message. failed: 1
Sep 20 09:27:35 kvm5b corosync[677695]:   [CPG   ] downlist left_list: 1 received
Sep 20 09:27:35 kvm5b corosync[677695]:   [CPG   ] downlist left_list: 1 received
Sep 20 09:27:35 kvm5b corosync[677695]:   [CPG   ] downlist left_list: 1 received
Sep 20 09:27:35 kvm5b corosync[677695]:   [CPG   ] downlist left_list: 1 received
Sep 20 09:27:35 kvm5b corosync[677695]:   [CPG   ] downlist left_list: 1 received
Sep 20 09:27:35 kvm5b corosync[677695]:   [QUORUM] Members[5]: 2 3 4 5 6
Sep 20 09:27:35 kvm5b corosync[677695]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep 20 09:30:24 kvm5b corosync[677695]:   [KNET  ] rx: host: 1 link: 0 is up
Sep 20 09:30:24 kvm5b corosync[677695]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Sep 20 09:30:27 kvm5b corosync[677695]:   [TOTEM ] A new membership (1:820) was formed. Members joined: 1
Sep 20 09:30:27 kvm5b corosync[677695]:   [CPG   ] downlist left_list: 0 received
Sep 20 09:30:27 kvm5b corosync[677695]:   [CPG   ] downlist left_list: 0 received
Sep 20 09:30:27 kvm5b corosync[677695]:   [CPG   ] downlist left_list: 0 received
Sep 20 09:30:27 kvm5b corosync[677695]:   [CPG   ] downlist left_list: 0 received
Sep 20 09:30:27 kvm5b corosync[677695]:   [CPG   ] downlist left_list: 0 received
Sep 20 09:30:27 kvm5b corosync[677695]:   [CPG   ] downlist left_list: 0 received
Sep 20 09:30:27 kvm5b corosync[677695]:   [QUORUM] Members[6]: 1 2 3 4 5 6
Sep 20 09:30:27 kvm5b corosync[677695]:   [MAIN  ] Completed service synchronization, ready to provide service.

spirit · Sep 20, 2019

bofh said:
The free one.
The reason is simply that we wanted to leave corosync alone on a dedicated link, so a gigabit wouldn't help me at all anyway.
Also, 25€ extra for a 100€ server is kind of ridiculous.

Oh, btw, they DON'T guarantee anything on the 1 Gbit vrack either. What you maybe mean is the public interface.
There you get 1 Gbit for free, and 1 Gbit guaranteed for 100€ more.

Sorry, this is a French page, but
https://www.ovh.com/fr/serveurs_dedies/advance/adv-1/
1Gbps unmetered and guaranteed private bandwidth - vRack option

Edit: the issues don't seem to be saturation issues on that vrack either. It's more like spikes lasting a fraction of a second, several times a day,
usually at very comparable times. So my best guess would be port resets there, which may have the same effect as saturation would.
Our line itself is squeaky clean: a stable couple of kbit/s, no traffic spikes.

What I mean is that, as their vrack is basically VXLAN overlay tunneling, I wonder whether with the 100 Mbit/s offer they skip QoS and put a lot of users inside it.
And maybe you don't see saturation on your side, but the underlay network has spikes or buffer exhaustion.
I have some coworkers who worked at OVH before, I'll try to ask them. (I'm 10 km away from the OVH RBX datacenter.)

spirit · Sep 20, 2019

David Herselman said:
The following could perhaps be a separate but related bug. These logs are from a remaining node when one of the nodes dropped off. Cluster membership updates successfully to 5/6, but notification messages then start escalating, with process pauses being reported. It eventually settles, but I'm sure the 'token: 10000' setting saved us from an escalating node fencing scenario...

Difficult to know without debug enabled.
Have you already upgraded libknet to 1.11-pve2?

http://download.proxmox.com/debian/...est/binary-amd64/libknet1_1.11-pve2_amd64.deb

bofh · Sep 20, 2019

spirit said:
What I mean is that, as their vrack is basically VXLAN overlay tunneling, I wonder whether with the 100 Mbit/s offer they skip QoS and put a lot of users inside it.
And maybe you don't see saturation on your side, but the underlay network has spikes or buffer exhaustion.
I have some coworkers who worked at OVH before, I'll try to ask them. (I'm 10 km away from the OVH RBX datacenter.)

lol

btw ask them if their colleagues in London are drinking on the job, I have a horrifying support log as testimony

But seriously, I'm aware that I could hit a bottleneck on the vrouter when some jackass saturates their line, but this doesn't really fit the error picture here. My issues were always just for a fraction of a second. A regular mtr and smokeping didn't find anything during the error times, it's that short.

It smells more like spikes in the 2-10 ms range, and that doesn't fit the "my neighbour floods the vrouter so I have no bandwidth now" bill.

Then again, I've learned anything is possible with OVH, quite the experience.

...
But hey, I need hundreds of IPs and here I don't pay monthly for them, so it's worth it already lol

spirit · Sep 20, 2019

bofh said:
But seriously, I'm aware that I could hit a bottleneck on the vrouter when some jackass saturates their line, but this doesn't really fit the error picture here.

Yes, anyway it's a corosync bug. The latency triggers the bug, but without the bug the state should come back OK afterwards.
Keep us posted on whether the patch is working fine. (The corosync devs seem to be responsive now that we can reproduce it.)

bofh · Sep 21, 2019

Looks promising: first 24 hours without any incident.

Code:

Sep 21 08:48:00 h3 systemd[1]: Started Proxmox VE replication runner.
Sep 21 08:48:59 h3 corosync[29388]:   [KNET  ] link: host: 4 link: 0 is down
Sep 21 08:48:59 h3 corosync[29388]:   [KNET  ] link: host: 1 link: 0 is down
Sep 21 08:48:59 h3 corosync[29388]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Sep 21 08:48:59 h3 corosync[29388]:   [KNET  ] host: host: 4 has no active links
Sep 21 08:48:59 h3 corosync[29388]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Sep 21 08:48:59 h3 corosync[29388]:   [KNET  ] host: host: 1 has no active links
Sep 21 08:49:00 h3 systemd[1]: Starting Proxmox VE replication runner...
Sep 21 08:49:00 h3 corosync[29388]:   [KNET  ] rx: host: 1 link: 0 is up
Sep 21 08:49:00 h3 corosync[29388]:   [KNET  ] rx: host: 4 link: 0 is up
Sep 21 08:49:00 h3 corosync[29388]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Sep 21 08:49:00 h3 corosync[29388]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Sep 21 08:49:01 h3 pvesr[29545]: trying to acquire cfs lock 'file-replication_cfg' ...
Sep 21 08:49:02 h3 systemd[1]: pvesr.service: Succeeded.

This would normally have been a crash; as we can see, it recovered immediately without a hassle.
Looks like that patch is golden.

David Herselman · Sep 21, 2019

I upgraded libknet to 1.11-pve2 and restarted corosync with debugging enabled.

Was 'lucky' to have fairly recently started concurrent pings between all three nodes when a corosync host down event occurred.

I hope there's something useful in these logs.

kvm1 = 1.1.7.9
kvm2 = 1.1.7.10
kvm3 = 1.1.7.11

Simultaneous pings between all three nodes yield (IMHO) marginal packet loss. Corosync competes with Ceph on 2x1GbE LACP (OvS).

kvm1 -> kvm2 pings lost: 2 / 8888
kvm1 -> kvm3 pings lost: 2 / 8849
kvm3 -> kvm2 pings lost: 7 / 8802

Pings are sent at a rate of 10 a second, as a 9000 byte packet (8972 byte payload + 8 byte ICMP header + 20 byte IP header)

Code:

[root@kvm1 ~]# ping -i 0.1 -s 8972 1.1.7.10
<snip>
8980 bytes from 1.1.7.10: icmp_seq=8886 ttl=64 time=0.628 ms
8980 bytes from 1.1.7.10: icmp_seq=8887 ttl=64 time=0.592 ms
8980 bytes from 1.1.7.10: icmp_seq=8888 ttl=64 time=0.606 ms
^C
--- 1.1.7.10 ping statistics ---
8888 packets transmitted, 8886 received, 0.0225023% packet loss, time 1903ms
rtt min/avg/max/mdev = 0.468/0.765/22.508/0.956 ms


[root@kvm1 ~]# ping -i 0.1 -s 8972 1.1.7.11
8980 bytes from 1.1.7.11: icmp_seq=8847 ttl=64 time=0.507 ms
8980 bytes from 1.1.7.11: icmp_seq=8848 ttl=64 time=0.521 ms
8980 bytes from 1.1.7.11: icmp_seq=8849 ttl=64 time=0.569 ms
^C
--- 1.1.7.11 ping statistics ---
8849 packets transmitted, 8847 received, 0.0226014% packet loss, time 1590ms
rtt min/avg/max/mdev = 0.478/0.651/8.176/0.449 ms


[root@kvm3 ~]# ping -i 0.1 -s 8972 1.1.7.10
<snip>
8980 bytes from 1.1.7.10: icmp_seq=8800 ttl=64 time=0.593 ms
8980 bytes from 1.1.7.10: icmp_seq=8801 ttl=64 time=0.627 ms
8980 bytes from 1.1.7.10: icmp_seq=8802 ttl=64 time=0.620 ms
^C
--- 1.1.7.10 ping statistics ---
8802 packets transmitted, 8795 received, 0.0795274% packet loss, time 1088ms
rtt min/avg/max/mdev = 0.478/0.797/12.302/0.938 ms
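
As a cross-check of the loss figures, here is a minimal sketch that recomputes the loss percentage from the transmitted/received counts in the three ping summaries above. The helper name `loss_percent` is only an illustration, not part of any tool.

```python
def loss_percent(transmitted: int, received: int) -> float:
    """Packet loss as a percentage of transmitted packets."""
    if transmitted <= 0:
        raise ValueError("transmitted must be positive")
    return 100.0 * (transmitted - received) / transmitted


# Counts copied from the three ping summaries in this post.
runs = {
    "kvm1 -> kvm2": (8888, 8886),
    "kvm1 -> kvm3": (8849, 8847),
    "kvm3 -> kvm2": (8802, 8795),
}

for path, (tx, rx) in runs.items():
    print(f"{path}: lost {tx - rx} of {tx} ({loss_percent(tx, rx):.4f}%)")
```

The results agree with what ping itself reports, so the loss is real but small: 2, 2 and 7 packets respectively.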

kvm1:

Code:

Sep 21 11:54:28 kvm1 corosync[2956667]:   [KNET  ] link: host: 2 link: 0 is down
Sep 21 11:54:28 kvm1 corosync[2956667]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep 21 11:54:28 kvm1 corosync[2956667]:   [KNET  ] host: host: 2 has no active links
Sep 21 11:54:28 kvm1 corosync[2956667]:   [TOTEM ] Knet host change callback. nodeid: 2 reachable: 0
Sep 21 11:54:28 kvm1 corosync[2956667]:   [KNET  ] rx: host: 2 link: 0 received pong: 1
Sep 21 11:54:31 kvm1 corosync[2956667]:   [KNET  ] rx: host: 2 link: 0 received pong: 2
Sep 21 11:54:34 kvm1 corosync[2956667]:   [KNET  ] rx: host: 2 link: 0 is up
Sep 21 11:54:34 kvm1 corosync[2956667]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep 21 11:54:36 kvm1 corosync[2956667]:   [TOTEM ] Token has not been received in 382 ms

kvm2:

Code:

Sep 21 11:54:36 kvm2 corosync[2513045]:   [TOTEM ] Token has not been received in 7987 ms

kvm3:

Code:

Sep 21 11:54:36 kvm3 corosync[3689364]:   [TOTEM ] Token has not been received in 382 ms
Sep 21 11:55:17 kvm3 corosync[3689364]:   [KNET  ] link: host: 2 link: 0 is down
Sep 21 11:55:17 kvm3 corosync[3689364]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep 21 11:55:17 kvm3 corosync[3689364]:   [KNET  ] host: host: 2 has no active links
Sep 21 11:55:17 kvm3 corosync[3689364]:   [KNET  ] rx: Source host 2 not reachable yet
Sep 21 11:55:17 kvm3 corosync[3689364]:   [KNET  ] rx: Source host 2 not reachable yet
Sep 21 11:55:17 kvm3 corosync[3689364]:   [KNET  ] rx: Source host 2 not reachable yet
Sep 21 11:55:17 kvm3 corosync[3689364]:   [KNET  ] rx: Source host 2 not reachable yet
Sep 21 11:55:17 kvm3 corosync[3689364]:   [KNET  ] rx: Source host 2 not reachable yet
Sep 21 11:55:17 kvm3 corosync[3689364]:   [KNET  ] rx: Source host 2 not reachable yet
Sep 21 11:55:17 kvm3 corosync[3689364]:   [KNET  ] rx: Source host 2 not reachable yet
Sep 21 11:55:17 kvm3 corosync[3689364]:   [KNET  ] rx: Source host 2 not reachable yet
Sep 21 11:55:17 kvm3 corosync[3689364]:   [KNET  ] rx: Source host 2 not reachable yet
Sep 21 11:55:17 kvm3 corosync[3689364]:   [KNET  ] rx: Source host 2 not reachable yet
Sep 21 11:55:17 kvm3 corosync[3689364]:   [KNET  ] rx: Source host 2 not reachable yet
Sep 21 11:55:17 kvm3 corosync[3689364]:   [KNET  ] rx: Source host 2 not reachable yet
Sep 21 11:55:17 kvm3 corosync[3689364]:   [KNET  ] rx: Source host 2 not reachable yet
Sep 21 11:55:17 kvm3 corosync[3689364]:   [KNET  ] rx: Source host 2 not reachable yet
Sep 21 11:55:17 kvm3 corosync[3689364]:   [KNET  ] rx: Source host 2 not reachable yet
Sep 21 11:55:17 kvm3 corosync[3689364]:   [KNET  ] rx: Source host 2 not reachable yet
Sep 21 11:55:17 kvm3 corosync[3689364]:   [KNET  ] rx: Source host 2 not reachable yet
Sep 21 11:55:17 kvm3 corosync[3689364]:   [KNET  ] rx: Source host 2 not reachable yet
Sep 21 11:55:17 kvm3 corosync[3689364]:   [KNET  ] rx: Source host 2 not reachable yet

spirit · Sep 22, 2019

David Herselman said:
I upgraded libknet to 1.11-pve2 and restarted corosync with debugging enabled.

Do you have corosync crashes, hangs, or isolated nodes with this version? Or, after the network messages "not reachable yet", does it come back automatically?

Simultaneous pings between all three nodes yield (IMHO) marginal packet loss. Corosync competes with Ceph on 2x1GbE LACP (OvS).

kvm1 -> kvm2 pings lost: 2 / 8888
kvm1 -> kvm3 pings lost: 2 / 8849
kvm3 -> kvm2 pings lost: 7 / 8802

Pings are sent at a rate of 10 a second, as a 9000 byte packet (8972 byte payload + 8 byte ICMP header + 20 byte IP header)

Sorry, but you shouldn't have any packet loss if your network is stable. Are you sure that your physical switch is not overloaded? Do you monitor your link bandwidth? (Be careful: with LACP, only one link is used for a given TCP/UDP connection, depending on the src/dst/port hash.)

Maybe you can do a more accurate ping flood with "ping -f", and see if you have jitter.

(Note that mixing corosync on the Ceph network is a bad idea, as Ceph can easily fill the links with OSD replication.)

bofh · Sep 22, 2019

@David Herselman

Also, that ping gives you only a small indication that something is wrong.
I would advise you to get some monitoring going, like https://www.librenms.org/
That should be sufficient. It would give you an exact, timestamped state of the network that you can compare to the error logs.
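
To line the error logs up against monitoring graphs, the corosync events first need timestamps in a machine-readable form. A minimal sketch, assuming syslog lines in the format pasted in this thread; the function names and the `year` parameter are hypothetical (syslog timestamps carry no year, so it has to be supplied):

```python
import re
from datetime import datetime

# Matches lines such as:
# "Sep 21 11:54:36 kvm2 corosync[2513045]:   [TOTEM ] Token has not been received in 7987 ms"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"corosync\[\d+\]:\s+\[(?P<subsys>\w+)\s*\] (?P<msg>.*)$"
)
TOKEN_RE = re.compile(r"Token has not been received in (\d+) ms")


def parse_corosync_line(line, year=2019):
    """Return (timestamp, host, subsystem, message), or None for other lines."""
    m = LINE_RE.match(line.rstrip("\n"))
    if m is None:
        return None
    ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
    return ts, m["host"], m["subsys"], m["msg"]


def token_gaps(lines, year=2019):
    """Collect (timestamp, host, gap_ms) for each 'Token has not been received' line."""
    gaps = []
    for line in lines:
        event = parse_corosync_line(line, year)
        if event is None:
            continue
        ts, host, _subsys, msg = event
        hit = TOKEN_RE.search(msg)
        if hit:
            gaps.append((ts, host, int(hit.group(1))))
    return gaps


# Two lines taken from the logs earlier in this thread.
sample = [
    "Sep 21 11:54:28 kvm1 corosync[2956667]:   [KNET  ] link: host: 2 link: 0 is down",
    "Sep 21 11:54:36 kvm2 corosync[2513045]:   [TOTEM ] Token has not been received in 7987 ms",
]
for ts, host, gap_ms in token_gaps(sample):
    print(ts.isoformat(), host, f"{gap_ms} ms without token")
```

The resulting timestamps can then be compared by hand against whatever the monitoring system recorded for the same second.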

@spirit I bet he is in a datacenter too, the ping is 0.5 ms, which isn't LAN.

David Herselman · Sep 22, 2019

Those 3 nodes are in a client's cluster using older equipment. They were previously 3 x VMware hosts using a HP SAN device where the IT guy's predecessor had set it up as RAID-0... Irrespective, their Proxmox + Ceph cluster has been super stable for the last 2 years but plagued by frequent restarts since upgrading to PVE 6.

The pings were full frame 9000 bytes on wire (excluding MAC src/dst headers) and whilst there is some jitter (22ms max latency) the packet loss was distributed and not all consecutive. Whilst the cluster didn't fence it shows Corosync 3 reporting not having received tokens for 8 seconds and node 3 having left the cluster whilst pings flowed freely during this time.

ie: As with spirit's report to the Corosync developers; a brief interruption appears to cause a cascade effect which can lead to nodes unnecessarily losing quorum and subsequently being fenced when VMs and Ceph continue to function normally up until nodes are reset by the IPMI watchdog counters.

@bofh, latency of 0.5 ms for 9000 byte frames is in line with 1 GbE trunk member capacities...
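
A rough back-of-the-envelope check of that figure, as a sketch only: it assumes a single store-and-forward switch between the nodes and ignores Ethernet framing overhead, host stack time and queueing, none of which is stated in the thread.

```python
def serialization_delay_us(frame_bytes: int, link_bps: float) -> float:
    """Time to clock one frame onto the wire, in microseconds."""
    return frame_bytes * 8 / link_bps * 1e6


def min_rtt_us(frame_bytes: int, link_bps: float, switch_hops: int = 1) -> float:
    """Lower bound on ping RTT from serialization alone.

    Per direction the sender serializes the frame once and every
    store-and-forward switch serializes it once more (assumed topology).
    """
    per_direction = (switch_hops + 1) * serialization_delay_us(frame_bytes, link_bps)
    return 2 * per_direction


for name, bps in (("1 GbE", 1e9), ("10 GbE", 1e10)):
    one_way = serialization_delay_us(9000, bps)
    rtt = min_rtt_us(9000, bps)
    print(f"{name}: {one_way:.1f} us per serialization, >= {rtt:.1f} us round trip")
```

On 1 GbE this gives roughly 72 µs per serialization and about 0.29 ms for the round trip before any host or switch processing, so a measured 0.5 ms is plausible for jumbo pings on that hardware. The same packet on 10 GbE needs a tenth of that.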

Again, to be crystal clear:
PVE 4 and 5 were rock solid with zero fencing occurring over a 2+ year period. Since upgrading to PVE 6 there have been countless false positive fencing events which result in a highly unstable environment:

Code:

[root@kvm1 ~]# last | grep -i boot
reboot   system boot  5.0.21-1-pve     Fri Sep 13 10:14   still running
reboot   system boot  5.0.21-1-pve     Fri Sep 13 04:06   still running
reboot   system boot  5.0.21-1-pve     Tue Sep 10 21:35   still running
reboot   system boot  5.0.21-1-pve     Sun Sep  8 03:34   still running
reboot   system boot  5.0.21-1-pve     Thu Sep  5 00:43   still running
reboot   system boot  5.0.21-1-pve     Wed Sep  4 03:58   still running
reboot   system boot  5.0.21-1-pve     Sun Sep  1 20:50   still running
reboot   system boot  5.0.21-1-pve     Sat Aug 31 16:06   still running
reboot   system boot  5.0.21-1-pve     Fri Aug 30 05:35   still running
reboot   system boot  5.0.18-1-pve     Sun Aug 25 03:02   still running
reboot   system boot  5.0.15-1-pve     Tue Aug 20 09:48   still running
reboot   system boot  5.0.15-1-pve     Mon Aug 19 12:51   still running
reboot   system boot  5.0.15-1-pve     Sun Aug 18 12:08   still running
reboot   system boot  5.0.15-1-pve     Sat Aug 17 07:46   still running
reboot   system boot  5.0.15-1-pve     Fri Aug  9 09:16   still running
reboot   system boot  5.0.15-1-pve     Thu Aug  8 08:49   still running
reboot   system boot  5.0.15-1-pve     Wed Aug  7 09:48   still running
reboot   system boot  5.0.15-1-pve     Tue Aug  6 07:46   still running
reboot   system boot  5.0.15-1-pve     Tue Aug  6 02:48   still running
reboot   system boot  5.0.15-1-pve     Mon Aug  5 20:21   still running
reboot   system boot  5.0.15-1-pve     Mon Aug  5 16:37   still running
reboot   system boot  5.0.15-1-pve     Mon Aug  5 16:12   still running
reboot   system boot  5.0.15-1-pve     Mon Aug  5 15:53   still running
reboot   system boot  5.0.15-1-pve     Sun Aug  4 22:29   still running
reboot   system boot  5.0.15-1-pve     Sat Aug  3 21:44   still running

The environment has been more stable since the 13th of September but there are corosync events logged almost hourly.

NB: Our own production clusters comprise separate 2 x 10 GbE LACP trunks on each node (ie 4 x 10GbE NICs in each node) and this problem occurs much less frequently there, but we've also had unnecessary disruptions. Since the problems with Corosync 3's knet protocol can be reproduced more frequently in this case, fixing them here would also stabilise other clusters with better equipment.

spirit · Sep 22, 2019

Hi David,

Do you still have quorum loss or node fencing with libknet 1.11-pve2?

The patch only fixes the bug where a random node goes into a bad state (corosync crashes, loops at 100% CPU, doesn't see other members).
But it won't help with such latency (0.5 ms seems really huge; even with a 15-year-old Cisco switch, I'm at around 0.150 ms at 1500 MTU and 0.25 ms at 9000 MTU. That, together with packet loss, seems really strange.)

With such high latency, and since there is no more multicast, it's possible that the default timeouts are too low. (I don't remember: have you already increased the token timeout value?)
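
For reference on the token timeout: as far as I understand corosync.conf(5), the configured `token` value is not the whole story once the nodelist has three or more nodes, because `token_coefficient` (650 ms by default) is added for each node beyond two. A small sketch of that arithmetic; the function name is made up, and the formula should be checked against the man page of the installed version:

```python
def effective_token_timeout_ms(token_ms: int, nodes: int,
                               token_coefficient_ms: int = 650) -> int:
    """Effective totem token timeout as described in corosync.conf(5).

    The coefficient only applies when the nodelist has at least 3 nodes.
    """
    if nodes < 3:
        return token_ms
    return token_ms + (nodes - 2) * token_coefficient_ms


# token: 10000 is the value mentioned in this thread; 3 and 6 are the
# cluster sizes that appear in the posted logs.
for nodes in (3, 6):
    print(nodes, "nodes ->", effective_token_timeout_ms(10000, nodes), "ms")
```

With `token: 10000` that works out to 10650 ms on a 3-node cluster and 12600 ms on a 6-node cluster.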

I've been running a 15-node cluster in production with 2x10G (ConnectX-4 Mellanox cards) and Mellanox switches for 5 months (LACP with Linux bridge): 0 disruptions, only 1 or 2 retransmits per day. (But latency is around 0.05 ms.)

One last thing: I have seen users report bugs with openvswitch + bond and corosync too (reverting to a Linux bridge fixed it). I can't confirm this.

David Herselman · Sep 23, 2019

We increased the totem token timeout to 10000 and yes, we're running libknet 1.11-pve2. We haven't had a false positive fencing scenario since the 13th, but the logs indicate continuous problems, so I assume it's a matter of time...

Symptoms appear very similar, in that relatively minor network congestion can cause knet to spiral.

OvS with LACP is our preferred integration and is stable on 6 other clusters (node sizes: 3, 3, 3, 6, 6, 8).

I'm fully aware that 2 x 1 GbE is not ideal but this still shouldn't cause a node to get kicked out of the cluster if jumbo sized pings run continually during one of these events...

How did you collect the detailed debug logging that you opened a fault with?

spirit · Sep 23, 2019

>>I'm fully aware that 2 x 1 GbE is not ideal but this still shouldn't cause a node to get kicked out of the cluster if jumbo sized pings run continually during one of these events...
One possibility is that the pings don't use the same link as the corosync UDP traffic (depending on the hash algorithm).
Maybe a tcpdump on each link could confirm this.
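
To illustrate the point about hashing, here is a toy sketch. It is NOT the real OVS balance-tcp or Linux bonding hash; it only assumes that the bond picks a member link from a hash of the flow tuple, so that a given flow always stays on one link while two different flows (ICMP pings vs. corosync UDP) may end up on different links. UDP port 5405 is assumed as the corosync port; the addresses are kvm1 and kvm2 from the earlier post.

```python
import zlib


def pick_link(src_ip: str, dst_ip: str, proto: str,
              src_port: int = 0, dst_port: int = 0, n_links: int = 2) -> int:
    """Toy flow hash: map a flow tuple to one member link of a bond."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) % n_links


# ICMP has no ports, so the ping flow hashes on addresses and protocol only.
ping_link = pick_link("1.1.7.9", "1.1.7.10", "icmp")
# Corosync/knet traffic between the same two hosts (port is an assumption).
knet_link = pick_link("1.1.7.9", "1.1.7.10", "udp", 5405, 5405)

print("ping uses link", ping_link)
print("corosync uses link", knet_link)
```

Whether the two flows really share a link depends on the actual hash in use, which is why a capture on each physical interface is the reliable way to find out.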

But from your last logs, host 2 was unreachable from host 1 and host 3 at the same time.

Another possibility could be a freeze or lag of the corosync process on node 2. (If you have a way to monitor the CPU usage of the corosync process, it might help.)

>>How did you collect the detailed debug logging that you opened a fault with?
I have enabled debug in the corosync config:
logging {
  debug: on
  to_syslog: yes
}

The other stack trace logs were from coredumps when the corosync process crashed. (apt install systemd-coredump; then, on a crash, a dump is generated in /var/lib/systemd/coredump/.)

ahovda · Sep 23, 2019

spirit said:
One last thing: I have seen users report bugs with openvswitch + bond and corosync too (reverting to a Linux bridge fixed it). I can't confirm this.

If that was referring to me: not really any problems with OVS. I've been running balance-tcp LACP on a 16-node cluster since the patch was installed. In fact, there hasn't been a single corosync log entry since, and I've tried to stress test it by running iperf sessions in both directions (even simultaneously, to the point where there is a bit of packet loss). What is the reasoning behind not recommending active-backup anyway?

HTH, Ådne

spirit · Sep 23, 2019

ahovda said:
If that was referring to me: not really any problems with OVS. I've been running balance-tcp LACP on a 16-node cluster since the patch was installed. In fact, there hasn't been a single corosync log entry since, and I've tried to stress test it by running iperf sessions in both directions (even simultaneously, to the point where there is a bit of packet loss). What is the reasoning behind not recommending active-backup anyway?
HTH, Ådne

Great that the patch is working fine for you.

Active-backup should work fine too, I don't see any reason not to use it. (You just can't increase bandwidth, but it should be pretty stable.)

For reference, there is an open bug about OVS and random network loss (it seems related to kernel version + openvswitch):
https://bugzilla.proxmox.com/show_bug.cgi?id=2296
