Hello.
We have a Proxmox VE 6.2 cluster with 22 nodes spread over an L2 network (two physical server rooms connected by a 10G link, 1G to each node).
All VMs/CTs use only local storage; we use NFS only for backups. HA containers are not used.
We host only our own services, which are inspected and trusted; there are no infected VMs or VMs from external clients.
All nodes are updated to at least the 2020-07 package level.
Since 2020-07 (when the cluster was set up) we have caught two big, disruptive UDP floods from corosync:
- In the first case the flood started right after migrating one VM.
- In the second case the flood started spontaneously: no migration, backups, or attacks were taking place.
Some data from the logs:
Two hours before the flood:
Code:
Sep 13 18:23:01 vps6 systemd[1]: Starting Proxmox VE replication runner...
Sep 13 18:23:01 vps6 systemd[1]: pvesr.service: Succeeded.
Sep 13 18:23:01 vps6 systemd[1]: Started Proxmox VE replication runner.
Sep 13 18:24:00 vps6 systemd[1]: Starting Proxmox VE replication runner...
Sep 13 18:24:01 vps6 systemd[1]: pvesr.service: Succeeded.
Sep 13 18:24:01 vps6 systemd[1]: Started Proxmox VE replication runner.
Sep 13 18:25:00 vps6 systemd[1]: Starting Proxmox VE replication runner...
Sep 13 18:25:01 vps6 systemd[1]: pvesr.service: Succeeded.
Sep 13 18:25:01 vps6 systemd[1]: Started Proxmox VE replication runner.
Sep 13 18:26:00 vps6 systemd[1]: Starting Proxmox VE replication runner...
Sep 13 18:26:01 vps6 systemd[1]: pvesr.service: Succeeded.
Sep 13 18:26:01 vps6 systemd[1]: Started Proxmox VE replication runner.
Sep 13 18:27:00 vps6 systemd[1]: Starting Proxmox VE replication runner...
Sep 13 18:27:01 vps6 pvesr[12758]: trying to acquire cfs lock 'file-replication_cfg' ...
Sep 13 18:27:02 vps6 pvesr[12758]: trying to acquire cfs lock 'file-replication_cfg' ...
Sep 13 18:27:03 vps6 pvesr[12758]: trying to acquire cfs lock 'file-replication_cfg' ...
Sep 13 18:27:04 vps6 systemd[1]: pvesr.service: Succeeded.
Sep 13 18:27:04 vps6 systemd[1]: Started Proxmox VE replication runner.
Sep 13 18:27:20 vps6 pmxcfs[32613]: [dcdb] notice: data verification successful
Sep 13 18:28:00 vps6 systemd[1]: Starting Proxmox VE replication runner...
Sep 13 18:28:00 vps6 pvesr[14883]: trying to acquire cfs lock 'file-replication_cfg' ...
One minute before the flood:
Code:
Sep 13 21:55:45 vps6 corosync[32624]: [KNET ] rx: host: 10 link: 0 is up
Sep 13 21:55:45 vps6 corosync[32624]: [KNET ] host: host: 10 (passive) best link: 0 (pri: 1)
Sep 13 21:55:48 vps6 corosync[32624]: [TOTEM ] Retransmit List: 38e 378 390 392 362 370 37f 380 389 38b
Sep 13 21:55:52 vps6 corosync[32624]: [TOTEM ] Retransmit List: 39e 398 38e
Sep 13 21:55:59 vps6 corosync[32624]: [TOTEM ] Retransmit List: 3a3 3a7
Sep 13 21:56:02 vps6 corosync[32624]: [TOTEM ] Retransmit List: 3bb 3b7 3b6 3b5 3b9
Sep 13 21:56:03 vps6 corosync[32624]: [TOTEM ] Retransmit List: 3c4 3bb
Sep 13 21:56:03 vps6 corosync[32624]: [TOTEM ] Retransmit List: 3d1 3d4 3d5 3c9 3cd 3d0 3d3 3c4 3cb 3cc 3cf
Sep 13 21:56:03 vps6 corosync[32624]: [TOTEM ] Retransmit List: 3e1 3e6 3e7 3eb 3ec 3cd 3d1 3d2 3de 3df 3e5 3ea 3ed 3c4 3c7 3cc 3e0 3e4 3e9 3c3 3d3
Sep 13 21:56:03 vps6 corosync[32624]: [TOTEM ] Retransmit List: 3fc 3fd 3fe 400 3eb 405 408 40a 40b 40c 40e 40f 3d1 3ea 3ed 40d 3c4 3e4 3e9 403 404 409 3e6 3d3
And during the flood:
Code:
in hell of storm:
Sep 13 22:12:29 vps6 pmxcfs[32613]: [dcdb] notice: all data is up to date
Sep 13 22:12:29 vps6 pmxcfs[32613]: [dcdb] notice: dfsm_deliver_queue: queue length 49
Sep 13 22:12:29 vps6 pmxcfs[32613]: [dcdb] notice: remove message from non-member 22/3180
Sep 13 22:12:29 vps6 pmxcfs[32613]: [dcdb] notice: remove message from non-member 21/1519
Sep 13 22:12:29 vps6 pmxcfs[32613]: [dcdb] notice: remove message from non-member 20/28683
Sep 13 22:12:29 vps6 pmxcfs[32613]: [dcdb] notice: remove message from non-member 20/28683
Sep 13 22:12:29 vps6 pmxcfs[32613]: [dcdb] notice: remove message from non-member 2/25172
Sep 13 22:12:29 vps6 pmxcfs[32613]: [dcdb] notice: remove message from non-member 5/2013
Sep 13 22:12:29 vps6 pmxcfs[32613]: [dcdb] notice: remove message from non-member 5/2013
Sep 13 22:12:29 vps6 pmxcfs[32613]: [dcdb] notice: remove message from non-member 5/2013
Sep 13 22:12:29 vps6 pmxcfs[32613]: [dcdb] notice: remove message from non-member 10/32092
Sep 13 22:12:29 vps6 pmxcfs[32613]: [dcdb] notice: remove message from non-member 10/32092
Sep 13 22:12:29 vps6 pmxcfs[32613]: [dcdb] notice: remove message from non-member 10/32092
Sep 13 22:12:29 vps6 pmxcfs[32613]: [dcdb] notice: remove message from non-member 13/5473
Sep 13 22:12:29 vps6 pmxcfs[32613]: [dcdb] notice: remove message from non-member 12/18037
Sep 13 22:12:29 vps6 pmxcfs[32613]: [dcdb] notice: remove message from non-member 4/23669
In the netstat output I see a huge Recv-Q value (14273280) for corosync:
Code:
A.B.C.D# netstat -naltup
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:10050 0.0.0.0:* LISTEN 15135/zabbix_agentd
tcp 0 0 0.0.0.0:8006 0.0.0.0:* LISTEN 1070/pveproxy
tcp 0 0 0.0.0.0:111 0.0.0.0:* LISTEN 1/init
tcp 0 0 127.0.0.1:85 0.0.0.0:* LISTEN 1061/pvedaemon
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 820/sshd
tcp 0 0 0.0.0.0:3128 0.0.0.0:* LISTEN 1077/spiceproxy
tcp 0 0 127.0.0.1:25 0.0.0.0:* LISTEN 1018/master
...
tcp6 0 0 :::10050 :::* LISTEN 15135/zabbix_agentd
tcp6 0 0 :::111 :::* LISTEN 1/init
tcp6 0 0 :::22 :::* LISTEN 820/sshd
tcp6 0 0 ::1:25 :::* LISTEN 1018/master
udp 0 0 0.0.0.0:111 0.0.0.0:* 1/init
udp 14273280 23040 A.B.C.D:5405 0.0.0.0:* 8093/corosync <---- RecvQ ???
udp6 0 0 :::111 :::* 1/init
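For reference, a simple way to keep an eye on that counter (just a sketch using ss instead of netstat; the grep pattern assumes corosync's default port 5405):
Bash:
# Print corosync's UDP socket once per second; a steadily growing Recv-Q
# means corosync is not draining its receive buffer fast enough.
watch -n 1 "ss -u -n -p | grep ':5405 '"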
In the monitoring system all interfaces to the nodes are extremely loaded, up to 300-800 Mbps.
Capturing 128k packets on one node, I see ~250 kpps (~519 Mbps) of UDP traffic with source and destination port 5405 and packet sizes of 250-270 bytes.
This traffic is unicast, from one node to all other nodes of the cluster.
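For reference, I measured this roughly as follows (a sketch; the interface name vmbr0 is only an example, adjust it to your cluster bridge):
Bash:
# Capture 128k packets of corosync traffic (UDP port 5405) on the cluster bridge.
tcpdump -ni vmbr0 -c 128000 -w corosync-flood.pcap 'udp port 5405'
# Summarize the capture: packet count, duration, average packet size and data rate.
capinfos corosync-flood.pcap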
This flood is very disruptive: the Juniper switches survive it, but many neighbours in the cluster VLAN, and even in some routed-accessible VLANs, exhaust their resources and degrade.
When we hard-restart corosync and pve-cluster with this command:
Bash:
killall -9 pmxcfs && killall -9 corosync && systemctl restart pve-cluster && sleep 5 && systemctl restart pve-cluster
the UDP flood stops immediately, and after some seconds/minutes the cluster recovers:
- The network returns to stable operation.
- All VMs and CTs keep working.
- The rate of UDP 5405 traffic returns to its normal, small value (~90 pps / 500 kbps per node).
1) What is this? Why does corosync generate so much UDP traffic? Has anyone seen something similar?
2) Can we prevent this by applying iptables hashlimit rules in the OUTPUT chain?
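Something like this is what I have in mind (only an untested sketch; the rate and burst values are guesses, and too low a limit would of course also break legitimate corosync traffic and could cost the node quorum):
Bash:
# Rate-limit outgoing corosync traffic (UDP dport 5405) per destination host.
# The numbers are placeholders for illustration, not tuned values.
iptables -A OUTPUT -p udp --dport 5405 \
    -m hashlimit --hashlimit-name corosync-out --hashlimit-mode dstip \
    --hashlimit-above 1000/sec --hashlimit-burst 2000 \
    -j DROP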
This looks like a relatively rare but heavily disruptive bug.
Thank you for any help.