Proxmox 6.2 / Corosync 3 - rare and spontaneous disruptive UDP:5405-storm/flood

spirit

Famous Member
Apr 2, 2010
4,434
319
103
www.odiso.com
1. Yes, we tried stopping pve-cluster and pkill pmxcfs.
This stops the flood, but only for a short time; some time after restarting the service, the flood appears again.

OK. But the flood doesn't appear again until you restart pve-cluster?

I ask because I'm currently debugging a pmxcfs bug (a different bug, where /etc/pve gets locked), but who knows, maybe it could be related.

Another question: when you stop pmxcfs/pve-cluster and the flood stops, what is the state of corosync on the different nodes? Do the nodes see the other nodes again?

(I'm trying to find out whether the flood is the cause of the corosync split-brain, or the split-brain is the cause of the flood.)
 

spirit

Famous Member
Apr 2, 2010
4,434
319
103
www.odiso.com
If possible, enable debug logging for pve-cluster.

edit:

/lib/systemd/system/pve-cluster.service

and add "-d" to:

Code:
ExecStart=/usr/bin/pmxcfs -d


then,
Code:
systemctl daemon-reload
systemctl restart pve-cluster

But be careful, it's really verbose, so check your disk space for logs.
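A quick way to keep an eye on that while debug logging is enabled (just standard tools, nothing specific to pve-cluster):

Code:
# check free space and the size of the syslog file the daemons write to
df -h /var/log
du -sh /var/log/daemon.log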

Do this on all nodes, and if the problem occurs again,
send the log from the flooding node:

Code:
cat /var/log/daemon.log |grep "corosync\|pmxcfs"
 

Aminuxer

New Member
Sep 13, 2020
16
0
1
38
aminux.wordpress.com
OK. But the flood doesn't appear again until you restart pve-cluster?

Another question: when you stop pmxcfs/pve-cluster and the flood stops, what is the state of corosync on the different nodes? Do the nodes see the other nodes again?

(I'm trying to find out whether the flood is the cause of the corosync split-brain, or the split-brain is the cause of the flood.)
The flood can reappear a random time after restarting corosync/pve-cluster, even 2-3 days after the restart.

No. We previously had 22 nodes in the cluster, and after stopping corosync and doing repeated one-by-one restarts, some split-brains remain: some nodes see only themselves, so there is no quorum and activity is blocked. When we pkill pmxcfs and restart, some nodes can see the other ones again, but the cluster is not stable.

The main problem seems to be in corosync (knet?), not pve-cluster; when corosync can reach quorum, pve-cluster works fine.

At first we tried deep network diagnostics - checking pings with this script running on each node:
Bash:
#!/bin/bash

# Nodes to probe from every cluster member
servers6=(vps1 vps2 vps3 vps4 vps5 vps6 vps7 vps8 vps9 vpsN)

log='/tmp/corosync-pings-diag.txt'
echo "" > $log

pmax_val=0
pmax_node=''

pdev_val=0
pdev_node=''

for srv in "${servers6[@]}"
do
    echo "=-==-==-==-==-- $srv --==-==-==-==-="
    # 4 adaptive pings with an 800-byte payload; each pass overwrites the per-host log
    ping -A -c 4 -s 800 $srv | tee $log

    # summary line: "rtt min/avg/max/mdev = a/b/c/d ms, ipg/ewma e/f ms"
    v1=`grep rtt $log | cut -d ' ' -f 4`
    v2=`grep rtt $log | cut -d ' ' -f 7`

    pmin=`echo -n "$v1" | cut -d '/' -f 1`
    pavg=`echo -n "$v1" | cut -d '/' -f 2`
    pmax=`echo -n "$v1" | cut -d '/' -f 3`
    pdev=`echo -n "$v1" | cut -d '/' -f 4`

    # ipg/ewma values are parsed but not used further
    pipg=`echo -n "$v2" | cut -d '/' -f 1`
    pwma=`echo -n "$v2" | cut -d '/' -f 2`

    # convert ms to microseconds; the trailing /10 truncates to an integer for the -ge tests
    pmax=`echo "$pmax*10000/10" | bc`
    pdev=`echo "$pdev*10000/10" | bc`

    if [ $pmax -ge $pmax_val ]
    then
       pmax_val=$pmax
       pmax_node=$srv
    fi

    if [ $pdev -ge $pdev_val ]
    then
       pdev_val=$pdev
       pdev_node=$srv
    fi

done

echo "
================================================================"
echo "Maximum ping $pmax_val us detected for node $pmax_node"
echo "Maximum ping deviation $pdev_val us detected for node $pdev_node"
We found that the maximum ping between the server rooms never exceeded 0.6 ms, with deviation around 45-60 µs. That looks correct. We also do not have excessive utilization on the 10GbE links between the rooms.

The output of corosync-cmapctl -m stats is too long even for a spoiler.
I see counters like stats.knet.node1.link0.tx_pong_errors, stats.knet.node1.link0.tx_pong_retries, stats.knet.node1.link0.tx_ping_errors and stats.knet.node1.link0.tx_ping_retries. In the bad state these are non-zero; in the normal state they are zero.

I assume that the many retransmits are the result of packet loss under the storm.
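For reference, one way to pull just those counters out of the stats map (the grep pattern is only a convenience filter):

Code:
# non-zero tx ping/pong errors or retries indicate lost or late knet heartbeats on a link
corosync-cmapctl -m stats | grep -E 'tx_(ping|pong)_(errors|retries)'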
 

Aminuxer

New Member
Sep 13, 2020
16
0
1
38
aminux.wordpress.com
In the process of recovery and trying to understand the issue, we performed these steps:

0. Checked for errors on the switch ports and options like storm control - these are correct;

1. Disabled encryption for traffic analysis:
crypto_cipher: none
crypto_hash: none

and found this:
Code:
Sep 21 10:20:32 vps4 corosync[9641]: [TOTEM ] Message received from 37.153.1.53 has bad magic number (probably sent by encrypted Kronosnet/Corosync 2.0/2.1/1.x/OpenAIS or unknown).. Ignoring

Sep 21 10:20:33 vps4 corosync[9641]: [TOTEM ] Message received from 37.153.1.51 has bad magic number (probably sent by encrypted Kronosnet/Corosync 2.0/2.1/1.x/OpenAIS or unknown).. Ignoring
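With encryption disabled, the traffic on the corosync port can also be captured and inspected directly, for example (the interface name here is only a placeholder):

Code:
# capture a sample of the corosync/knet traffic; replace eth0 with the real cluster NIC
tcpdump -n -i eth0 -c 200 udp port 5405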

We found the related code:

http://lira.epac.to:8080/doc/corosync/api/html/totemsrp_8c_source.html
(line 4912)

It is what produces these strange errors about the bad magic number.

After this we restored the most important VMs on other hardware, forcibly removed 6 nodes, and built a new cluster from 7 nodes, switching this new one to the `sctp` transport and increasing the timeouts:

Code:
corosync.conf :

totem {
  cluster_name: TEST-20200920
  config_version: 27
  token: 5000                                <---------------
  token_retransmits_before_loss_const: 10    <---------------
  join: 150                                  <---------------
  interface {
    knet_transport: sctp                     <---------------
    linknumber: 0
  }

After this, the new small cluster reached quorum and is still working normally. We are now watching the logs and trying to find correlations and other potentially useful info. In the new SCTP cluster we have no retransmits and quorum is OK, but I don't know whether it can survive growing to 22 nodes.
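For reference, membership/quorum and per-link state can be checked with the standard tools:

Code:
# quorum/membership from the Proxmox side, link status from the corosync side
pvecm status
corosync-cfgtool -s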

Also, we can't fully exclude network issues, even though we have dedicated 10G networks and formally ideal pings and bandwidth.
For now we are watching both the old cluster with the UDP transport and the new cluster with the SCTP transport.

In any case, even under network packet loss corosync should not start such a massive and disruptive flood - this looks like a bug.
We hope to help get this problem solved in the future.

We will try to enable pve-cluster debug logging and analyze the data per your recommendations.
We will be able to answer some time later, after completing the next diagnostic steps.

Thank you very much in any case.
 

spirit

Famous Member
Apr 2, 2010
4,434
319
103
www.odiso.com
Can you send the stats for:


stats.knet.node*.link*.latency_ave - Average latency of the given link

stats.srp.time_since_token_last_received - global stats on token time


Network latency alone is not enough; you also need to account for the compute time in corosync (if, for example, you have an old server with a weak CPU in your cluster).

On my network, with a 20-node cluster working fine out of the box, I have around 0.019 ms ping latency between nodes.

The latency_ave counter is around 80.
stats.srp.time_since_token_last_received is between 100 and 200.
 

Aminuxer

New Member
Sep 13, 2020
16
0
1
38
aminux.wordpress.com
At node vps1 with sctp:

Code:
root@vps1:~#  corosync-cmapctl -m stats | grep -E "(latency_ave|time_since_token_last_received)"
stats.knet.node1.link0.latency_ave (u32) = 0
stats.knet.node2.link0.latency_ave (u32) = 254
stats.knet.node3.link0.latency_ave (u32) = 141
stats.knet.node4.link0.latency_ave (u32) = 239
stats.knet.node5.link0.latency_ave (u32) = 233
stats.knet.node6.link0.latency_ave (u32) = 348
stats.knet.node7.link0.latency_ave (u32) = 340
stats.srp.time_since_token_last_received (u64) = 241

At node vps7 with udp:
Code:
root@vps7:~# corosync-cmapctl -m stats | grep -E "(latency_ave|time_since_token_last_received)"
stats.knet.node1.link0.latency_ave (u32) = 4750
stats.knet.node10.link0.latency_ave (u32) = 5939
stats.knet.node11.link0.latency_ave (u32) = 4427
stats.knet.node12.link0.latency_ave (u32) = 168871
stats.knet.node13.link0.latency_ave (u32) = 10437
stats.knet.node14.link0.latency_ave (u32) = 5328
stats.knet.node15.link0.latency_ave (u32) = 7105
stats.knet.node16.link0.latency_ave (u32) = 13045
stats.knet.node17.link0.latency_ave (u32) = 5248
stats.knet.node18.link0.latency_ave (u32) = 4844
stats.knet.node19.link0.latency_ave (u32) = 9522
stats.knet.node2.link0.latency_ave (u32) = 371980
stats.knet.node20.link0.latency_ave (u32) = 15181
stats.knet.node21.link0.latency_ave (u32) = 15176
stats.knet.node22.link0.latency_ave (u32) = 2309
stats.knet.node3.link0.latency_ave (u32) = 0
stats.knet.node4.link0.latency_ave (u32) = 4586
stats.knet.node5.link0.latency_ave (u32) = 14547
stats.knet.node6.link0.latency_ave (u32) = 5030
stats.knet.node7.link0.latency_ave (u32) = 1371
stats.knet.node8.link0.latency_ave (u32) = 1187
stats.knet.node9.link0.latency_ave (u32) = 4298
stats.srp.time_since_token_last_received (u64) = 4383473310


My topology is somewhat spread out:
{vps1-9 nodes}--1G--{switch1}---10G-10km-fiber-link---{switch2}--{Blade* nodes}

Currently both clusters (UDP and SCTP) have nodes on both switches.
The old cluster with the UDP-based transport generates the flood and exhausts the firewalls;
the new cluster with the SCTP-based transport works fine.

At this point only 11 nodes remain in the old cluster with the UDP transport.
We set `pvecm expected 11` to try to reach quorum, but under the flood corosync cannot reach quorum because the links / firewalls are exhausted.
We will try a careful restart of corosync on the UDP-transport cluster.

P.S. We found a similar issue here: https://github.com/corosync/corosync/issues/389
but it is about an old version of corosync.
 

spirit

Famous Member
Apr 2, 2010
4,434
319
103
www.odiso.com
At node vps7 with udp:
The values are really huge here.
I'm not sure whether the average values get reset (I don't know if the flood has impacted the stats),
but these are really bad values.

My topology is somewhat spread out:
{vps1-9 nodes}--1G--{switch1}---10G-10km-fiber-link---{switch2}--{Blade* nodes}
Oh OK, you are on 2 sites 10 km apart.
Are you sure you never have small disruptions on this link? (small lag, or small saturation, ...)

Currently both clusters (UDP and SCTP) have nodes on both switches.
The old cluster with the UDP-based transport generates the flood and exhausts the firewalls;
the new cluster with the SCTP-based transport works fine.
SCTP could improve retransmits. I know it's not the default yet because it was still buggy in previous corosync 3.x versions, but they have done a lot of fixes recently.
So if it's working fine for you, maybe it could be the solution, but I don't have any experience with SCTP.


For the UDP cluster, you could try to increase the "token" value in corosync.conf.

The total token timeout is computed as "token + (number_of_nodes - 2) * token_coefficient", where token is 1000 ms by default and token_coefficient is 650 ms by default.
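For example, with the 22 nodes you originally had and the default values, that formula gives:

Code:
# token + (number_of_nodes - 2) * token_coefficient, with the 1000ms / 650ms defaults
echo $(( 1000 + (22 - 2) * 650 ))    # = 14000 ms total token timeout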

I have read on the Red Hat Bugzilla that "token: 3000" or "token: 5000" sometimes gives better results.

Code:
/etc/pve/corosync.conf

totem {
   ....
   token:3000
}

but don't increase it too much
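To apply the change, bump config_version in /etc/pve/corosync.conf and, if it is not picked up automatically, restart corosync one node at a time so the rest of the cluster keeps quorum (just a sketch, check the PVE docs for the full editing procedure):

Code:
# run on one node at a time, after the edited config has propagated
systemctl restart corosync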
 

spirit

Famous Member
Apr 2, 2010
4,434
319
103
www.odiso.com
@Aminuxer

Hi,
a new pve-cluster package is available in the pvetest repository (it should be available soon in the other repositories).
http://download.proxmox.com/debian/pve/dists/buster/pvetest/binary-amd64/pve-cluster_6.2-1_amd64.deb


It fixes a big bug that could happen on corosync join/leave:

http://download.proxmox.com/debian/pve/dists/buster/pvetest/binary-amd64/pve-cluster_6.2-1.changelog


I'm not sure it's related to your flood bug, but if you can test it, that would be great.
(It needs to be installed on all nodes, followed by a restart of the pve-cluster service.)
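One way to do it manually from the URL above, if you don't want to enable the pvetest repository on every node (just a sketch):

Code:
# on each node: download, install, then restart pve-cluster
wget http://download.proxmox.com/debian/pve/dists/buster/pvetest/binary-amd64/pve-cluster_6.2-1_amd64.deb
dpkg -i pve-cluster_6.2-1_amd64.deb
systemctl restart pve-cluster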


It fixes the bug I had 6 months ago, when my flood bug occurred after shutting down a node. Maybe it happened to you too: if a node leaves because of latency, the flood then occurs and blocks the other nodes.
 

Aminuxer

New Member
Sep 13, 2020
16
0
1
38
aminux.wordpress.com
We have now switched to SCTP and increased the timeouts. We also set this sysctl option:
Code:
net.core.netdev_max_backlog = 50000
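For completeness, one way to apply it at runtime and keep it across reboots (the drop-in file name below is just an example):

Code:
# apply now and persist via a sysctl drop-in
sysctl -w net.core.netdev_max_backlog=50000
echo 'net.core.netdev_max_backlog = 50000' > /etc/sysctl.d/90-netdev-backlog.conf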

Working over SCTP produces more readable and useful logs, so we are staying with this set of options.
We also try to stay up to date with current versions, but we are not ready to install a test version on the production cluster.

P.S. We can't live under the massive flood, so we can no longer test/reproduce the problem with the old configuration;
we had to disassemble the old cluster because of the extreme disaster.
In the smaller cluster the flood problem does not appear.

With SCTP, bigger timeouts, and the sysctl fix, our distributed cluster works fine.
I hope that my reports help to fix this very destructive bug.

Thank you very much.
 

Karbon

Member
Dec 9, 2016
1
0
6
40
Hi, I have a very similar situation: the cluster has about 20 nodes and there is a big problem with the UDP corosync flood (sometimes once per month, but sometimes daily).

I am trying to avoid cluster splitting.
Should I switch to SCTP? Is it safe, given that it isn't the default configuration?
@spirit wrote about a bugfix; where can I find details about it?
 
