Proxmox 6.2 / Corosync 3 - rare and spontaneous disruptive UDP:5405-storm/flood

spirit

Famous Member
Apr 2, 2010
4,434
319
103
www.odiso.com
1. Yes, we tried stopping pve-cluster and pkill pmxcfs.
This stops the flood, but only for a short time; some time after restarting the service, the flood appears again.

OK. But the flood doesn't appear again until you restart pve-cluster?

I ask because I'm currently debugging a pmxcfs bug (a different bug, where /etc/pve gets locked), but who knows, maybe it could be related.

Another question: when you stop pmxcfs/pve-cluster and the flood stops, what is the state of corosync on the different nodes? Do the nodes see the other nodes again?

(I'm trying to find out whether the flood is the cause of the corosync split-brain, or the split-brain is the cause of the flood.)
 

spirit

Famous Member
Apr 2, 2010
4,434
319
103
www.odiso.com
If possible, enable debug logging for pve-cluster.

edit:

/lib/systemd/system/pve-cluster.service

and add "-d" to:

Code:
ExecStart=/usr/bin/pmxcfs -d


then,
Code:
systemctl daemon-reload
systemctl restart pve-cluster

But be careful, it's really verbose, so check your disk space for logs.
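A quick way to keep an eye on that while debug logging is enabled (just standard tools, nothing specific to pve-cluster):

Code:
# check free space and the size of the syslog file the daemons write to
df -h /var/log
du -sh /var/log/daemon.log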

Do this on all nodes, and if the problem occurs again,
send the log from the flooding node:

Code:
cat /var/log/daemon.log |grep "corosync\|pmxcfs"
 

Aminuxer

New Member
Sep 13, 2020
16
0
1
38
aminux.wordpress.com
OK. But the flood doesn't appear again until you restart pve-cluster?

Another question: when you stop pmxcfs/pve-cluster and the flood stops, what is the state of corosync on the different nodes? Do the nodes see the other nodes again?

(I'm trying to find out whether the flood is the cause of the corosync split-brain, or the split-brain is the cause of the flood.)
The flood can reappear a random time after restarting corosync/pve-cluster, even 2-3 days after the restart.

No. We previously had 22 nodes in the cluster, and after stopping corosync and doing repeated one-by-one restarts, some split-brains remain: some nodes see only themselves, so there is no quorum and activity is blocked. When we pkill pmxcfs and restart, some nodes can see the other ones again, but the cluster is not stable.

The main problem seems to be in corosync (knet?), not pve-cluster; when corosync can reach quorum, pve-cluster works fine.

At first we tried deep network diagnostics - checking pings with this script running on each node:
Bash:
#!/bin/bash

# Nodes to probe from every cluster member
servers6=(vps1 vps2 vps3 vps4 vps5 vps6 vps7 vps8 vps9 vpsN)

log='/tmp/corosync-pings-diag.txt'
echo "" > $log

pmax_val=0
pmax_node=''

pdev_val=0
pdev_node=''

for srv in "${servers6[@]}"
do
    echo "=-==-==-==-==-- $srv --==-==-==-==-="
    # 4 adaptive pings with an 800-byte payload; each pass overwrites the per-host log
    ping -A -c 4 -s 800 $srv | tee $log

    # summary line: "rtt min/avg/max/mdev = a/b/c/d ms, ipg/ewma e/f ms"
    v1=`grep rtt $log | cut -d ' ' -f 4`
    v2=`grep rtt $log | cut -d ' ' -f 7`

    pmin=`echo -n "$v1" | cut -d '/' -f 1`
    pavg=`echo -n "$v1" | cut -d '/' -f 2`
    pmax=`echo -n "$v1" | cut -d '/' -f 3`
    pdev=`echo -n "$v1" | cut -d '/' -f 4`

    # ipg/ewma values are parsed but not used further
    pipg=`echo -n "$v2" | cut -d '/' -f 1`
    pwma=`echo -n "$v2" | cut -d '/' -f 2`

    # convert ms to microseconds; the trailing /10 truncates to an integer for the -ge tests
    pmax=`echo "$pmax*10000/10" | bc`
    pdev=`echo "$pdev*10000/10" | bc`

    if [ $pmax -ge $pmax_val ]
    then
       pmax_val=$pmax
       pmax_node=$srv
    fi

    if [ $pdev -ge $pdev_val ]
    then
       pdev_val=$pdev
       pdev_node=$srv
    fi

done

echo "
================================================================"
echo "Maximum ping $pmax_val us detected for node $pmax_node"
echo "Maximum ping deviation $pdev_val us detected for node $pdev_node"
We found that the maximum ping between the server rooms never exceeded 0.6 ms, with deviation around 45-60 µs. That looks correct. We also do not have excessive utilization on the 10GbE links between the rooms.

The output of corosync-cmapctl -m stats is too long even for a spoiler.
I see counters like stats.knet.node1.link0.tx_pong_errors, stats.knet.node1.link0.tx_pong_retries, stats.knet.node1.link0.tx_ping_errors and stats.knet.node1.link0.tx_ping_retries. In the bad state these are non-zero; in the normal state they are zero.

I assume that the many retransmits are the result of packet loss under the storm.
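For reference, one way to pull just those counters out of the stats map (the grep pattern is only a convenience filter):

Code:
# non-zero tx ping/pong errors or retries indicate lost or late knet heartbeats on a link
corosync-cmapctl -m stats | grep -E 'tx_(ping|pong)_(errors|retries)'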
 

Aminuxer

New Member
Sep 13, 2020
16
0
1
38
aminux.wordpress.com
In the process of recovery and trying to understand the issue, we performed these steps:

0. Checked for errors on the switch ports and options like storm control - these are correct;

1. Disabled encryption for traffic analysis:
crypto_cipher: none
crypto_hash: none

and found this:
Code:
Sep 21 10:20:32 vps4 corosync[9641]: [TOTEM ] Message received from 37.153.1.53 has bad magic number (probably sent by encrypted Kronosnet/Corosync 2.0/2.1/1.x/OpenAIS or unknown).. Ignoring

Sep 21 10:20:33 vps4 corosync[9641]: [TOTEM ] Message received from 37.153.1.51 has bad magic number (probably sent by encrypted Kronosnet/Corosync 2.0/2.1/1.x/OpenAIS or unknown).. Ignoring
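With encryption disabled, the traffic on the corosync port can also be captured and inspected directly, for example (the interface name here is only a placeholder):

Code:
# capture a sample of the corosync/knet traffic; replace eth0 with the real cluster NIC
tcpdump -n -i eth0 -c 200 udp port 5405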

We found the related code:

http://lira.epac.to:8080/doc/corosync/api/html/totemsrp_8c_source.html
(line 4912)

It is what produces these strange errors about the bad magic number.

After this we restored the most important VMs on other hardware, forcibly removed 6 nodes, and built a new cluster from 7 nodes, switching this new one to the `sctp` transport and increasing the timeouts:

Code:
corosync.conf :

totem {
  cluster_name: TEST-20200920
  config_version: 27
  token: 5000                                <---------------
  token_retransmits_before_loss_const: 10    <---------------
  join: 150                                  <---------------
  interface {
    knet_transport: sctp                     <---------------
    linknumber: 0
  }

After this, the new small cluster reached quorum and is still working normally. We are now watching the logs and trying to find correlations and other potentially useful info. In the new SCTP cluster we have no retransmits and quorum is OK, but I don't know whether it can survive growing to 22 nodes.
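For reference, membership/quorum and per-link state can be checked with the standard tools:

Code:
# quorum/membership from the Proxmox side, link status from the corosync side
pvecm status
corosync-cfgtool -s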

Also, we can't fully exclude network issues, even though we have dedicated 10G networks and formally ideal pings and bandwidth.
For now we are watching both the old cluster with the UDP transport and the new cluster with the SCTP transport.

In any case, even under network packet loss corosync should not start such a massive and disruptive flood - this looks like a bug.
We hope to help get this problem solved in the future.

We will try to enable pve-cluster debug logging and analyze the data per your recommendations.
We will be able to answer some time later, after completing the next diagnostic steps.

Thank you very much in any case.
 

spirit

Famous Member
Apr 2, 2010
4,434
319
103
www.odiso.com
Can you send the stats for:


stats.knet.node*.link*.latency_ave - Average latency of the given link

stats.srp.time_since_token_last_received - global stats on token time


Network latency alone is not enough; you also need to account for the compute time in corosync (if, for example, you have an old server with a weak CPU in your cluster).

On my network, with a 20-node cluster working fine out of the box, I have around 0.019 ms ping latency between nodes.

The latency_ave counter is around 80.
stats.srp.time_since_token_last_received is between 100 and 200.
 

Aminuxer

New Member
Sep 13, 2020
16
0
1
38
aminux.wordpress.com
At node vps1 with sctp:

Code:
root@vps1:~#  corosync-cmapctl -m stats | grep -E "(latency_ave|time_since_token_last_received)"
stats.knet.node1.link0.latency_ave (u32) = 0
stats.knet.node2.link0.latency_ave (u32) = 254
stats.knet.node3.link0.latency_ave (u32) = 141
stats.knet.node4.link0.latency_ave (u32) = 239
stats.knet.node5.link0.latency_ave (u32) = 233
stats.knet.node6.link0.latency_ave (u32) = 348
stats.knet.node7.link0.latency_ave (u32) = 340
stats.srp.time_since_token_last_received (u64) = 241

At node vps7 with udp:
Code:
root@vps7:~# corosync-cmapctl -m stats | grep -E "(latency_ave|time_since_token_last_received)"
stats.knet.node1.link0.latency_ave (u32) = 4750
stats.knet.node10.link0.latency_ave (u32) = 5939
stats.knet.node11.link0.latency_ave (u32) = 4427
stats.knet.node12.link0.latency_ave (u32) = 168871
stats.knet.node13.link0.latency_ave (u32) = 10437
stats.knet.node14.link0.latency_ave (u32) = 5328
stats.knet.node15.link0.latency_ave (u32) = 7105
stats.knet.node16.link0.latency_ave (u32) = 13045
stats.knet.node17.link0.latency_ave (u32) = 5248
stats.knet.node18.link0.latency_ave (u32) = 4844
stats.knet.node19.link0.latency_ave (u32) = 9522
stats.knet.node2.link0.latency_ave (u32) = 371980
stats.knet.node20.link0.latency_ave (u32) = 15181
stats.knet.node21.link0.latency_ave (u32) = 15176
stats.knet.node22.link0.latency_ave (u32) = 2309
stats.knet.node3.link0.latency_ave (u32) = 0
stats.knet.node4.link0.latency_ave (u32) = 4586
stats.knet.node5.link0.latency_ave (u32) = 14547
stats.knet.node6.link0.latency_ave (u32) = 5030
stats.knet.node7.link0.latency_ave (u32) = 1371
stats.knet.node8.link0.latency_ave (u32) = 1187
stats.knet.node9.link0.latency_ave (u32) = 4298
stats.srp.time_since_token_last_received (u64) = 4383473310


My topology is somewhat spread out:
{vps1-9 nodes}--1G--{switch1}---10G-10km-fiber-link---{switch2}--{Blade* nodes}

Currently both clusters (UDP and SCTP) have nodes on both switches.
The old cluster with the UDP-based transport generates the flood and exhausts the firewalls;
the new cluster with the SCTP-based transport works fine.

At this point only 11 nodes remain in the old cluster with the UDP transport.
We set `pvecm expected 11` to try to reach quorum, but under the flood corosync cannot reach quorum because the links / firewalls are exhausted.
We will try a careful restart of corosync on the UDP-transport cluster.

P.S. We found a similar issue here: https://github.com/corosync/corosync/issues/389
but it is about an old version of corosync.
 

spirit

Famous Member
Apr 2, 2010
4,434
319
103
www.odiso.com
At node vps7 with udp:
The values are really huge here.
I'm not sure whether the average values get reset (I don't know if the flood has impacted the stats),
but these are really bad values.

My topology is somewhat spread out:
{vps1-9 nodes}--1G--{switch1}---10G-10km-fiber-link---{switch2}--{Blade* nodes}
Oh OK, you are on 2 sites 10 km apart.
Are you sure you never have small disruptions on this link? (small lag, or small saturation, ...)

Currently both clusters (UDP and SCTP) have nodes on both switches.
The old cluster with the UDP-based transport generates the flood and exhausts the firewalls;
the new cluster with the SCTP-based transport works fine.
SCTP could improve retransmits. I know it's not the default yet because it was still buggy in previous corosync 3.x versions, but they have done a lot of fixes recently.
So if it's working fine for you, maybe it could be the solution, but I don't have any experience with SCTP.


For the UDP cluster, you could try to increase the "token" value in corosync.conf.

The total token timeout is computed as "token + (number_of_nodes - 2) * token_coefficient", where token is 1000 ms by default and token_coefficient is 650 ms by default.
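For example, with the 22 nodes you originally had and the default values, that formula gives:

Code:
# token + (number_of_nodes - 2) * token_coefficient, with the 1000ms / 650ms defaults
echo $(( 1000 + (22 - 2) * 650 ))    # = 14000 ms total token timeout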

I have read on the Red Hat Bugzilla that "token: 3000" or "token: 5000" sometimes gives better results.

Code:
/etc/pve/corosync.conf

totem {
   ....
   token:3000
}

but don't increase it too much
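To apply the change, bump config_version in /etc/pve/corosync.conf and, if it is not picked up automatically, restart corosync one node at a time so the rest of the cluster keeps quorum (just a sketch, check the PVE docs for the full editing procedure):

Code:
# run on one node at a time, after the edited config has propagated
systemctl restart corosync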
 

spirit

Famous Member
Apr 2, 2010
4,434
319
103
www.odiso.com
@Aminuxer

Hi,
a new pve-cluster package is available in the pvetest repository (it should be available soon in the other repositories).
http://download.proxmox.com/debian/pve/dists/buster/pvetest/binary-amd64/pve-cluster_6.2-1_amd64.deb


It fixes a big bug that could happen on corosync join/leave:

http://download.proxmox.com/debian/pve/dists/buster/pvetest/binary-amd64/pve-cluster_6.2-1.changelog


I'm not sure it's related to your flood bug, but if you can test it, that would be great.
(It needs to be installed on all nodes, followed by a restart of the pve-cluster service.)
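One way to do it manually from the URL above, if you don't want to enable the pvetest repository on every node (just a sketch):

Code:
# on each node: download, install, then restart pve-cluster
wget http://download.proxmox.com/debian/pve/dists/buster/pvetest/binary-amd64/pve-cluster_6.2-1_amd64.deb
dpkg -i pve-cluster_6.2-1_amd64.deb
systemctl restart pve-cluster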


It fixes the bug I had 6 months ago, when my flood bug occurred after shutting down a node. Maybe it happened to you too: if a node leaves because of latency, the flood then occurs and blocks the other nodes.
 

Aminuxer

New Member
Sep 13, 2020
16
0
1
38
aminux.wordpress.com
We have now switched to SCTP and increased the timeouts. We also set this sysctl option:
Code:
net.core.netdev_max_backlog = 50000
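For completeness, one way to apply it at runtime and keep it across reboots (the drop-in file name below is just an example):

Code:
# apply now and persist via a sysctl drop-in
sysctl -w net.core.netdev_max_backlog=50000
echo 'net.core.netdev_max_backlog = 50000' > /etc/sysctl.d/90-netdev-backlog.conf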

Working over SCTP produces more readable and useful logs, so we are staying with this set of options.
We also try to stay up to date with current versions, but we are not ready to install a test version on the production cluster.

P.S. We can't live under the massive flood, so we can no longer test/reproduce the problem with the old configuration;
we had to disassemble the old cluster because of the extreme disaster.
In the smaller cluster the flood problem does not appear.

With SCTP, bigger timeouts, and the sysctl fix, our distributed cluster works fine.
I hope that my reports help to fix this very destructive bug.

Thank you very much.
 

Karbon

Member
Dec 9, 2016
1
0
6
40
Hi, I have a very similar situation: the cluster has about 20 nodes and there is a big problem with the UDP corosync flood (sometimes once per month, but sometimes daily).

I am trying to avoid cluster splitting.
Should I switch to SCTP? Is it safe, given that it isn't the default configuration?
@spirit wrote about a bugfix; where can I find details about it?
 
