Hello.
We have a Proxmox VE 6.2 cluster with 22 nodes spread over an L2 network (two physical server rooms connected by a 10G link, 1G to each node).
All VMs/CTs use only local storage; we use NFS only for backups. HA containers are not used.
We host only our own services, which are inspected and trusted; there are no infected VMs or VMs from external clients.
All nodes are updated to at least the 2020-07 package level.
Since 2020-07 (when the cluster was set up) we have caught two big, disruptive UDP floods from corosync:
- In the first case the flood started right after migrating one VM.
- In the second case the flood started spontaneously: no migration, backups, or attacks were taking place.
Some data from the logs:
Two hours before the flood:
Code:
Sep 13 18:23:01 vps6 systemd[1]: Starting Proxmox VE replication runner...
Sep 13 18:23:01 vps6 systemd[1]: pvesr.service: Succeeded.
Sep 13 18:23:01 vps6 systemd[1]: Started Proxmox VE replication runner.
Sep 13 18:24:00 vps6 systemd[1]: Starting Proxmox VE replication runner...
Sep 13 18:24:01 vps6 systemd[1]: pvesr.service: Succeeded.
Sep 13 18:24:01 vps6 systemd[1]: Started Proxmox VE replication runner.
Sep 13 18:25:00 vps6 systemd[1]: Starting Proxmox VE replication runner...
Sep 13 18:25:01 vps6 systemd[1]: pvesr.service: Succeeded.
Sep 13 18:25:01 vps6 systemd[1]: Started Proxmox VE replication runner.
Sep 13 18:26:00 vps6 systemd[1]: Starting Proxmox VE replication runner...
Sep 13 18:26:01 vps6 systemd[1]: pvesr.service: Succeeded.
Sep 13 18:26:01 vps6 systemd[1]: Started Proxmox VE replication runner.
Sep 13 18:27:00 vps6 systemd[1]: Starting Proxmox VE replication runner...
Sep 13 18:27:01 vps6 pvesr[12758]: trying to acquire cfs lock 'file-replication_cfg' ...
Sep 13 18:27:02 vps6 pvesr[12758]: trying to acquire cfs lock 'file-replication_cfg' ...
Sep 13 18:27:03 vps6 pvesr[12758]: trying to acquire cfs lock 'file-replication_cfg' ...
Sep 13 18:27:04 vps6 systemd[1]: pvesr.service: Succeeded.
Sep 13 18:27:04 vps6 systemd[1]: Started Proxmox VE replication runner.
Sep 13 18:27:20 vps6 pmxcfs[32613]: [dcdb] notice: data verification successful
Sep 13 18:28:00 vps6 systemd[1]: Starting Proxmox VE replication runner...
Sep 13 18:28:00 vps6 pvesr[14883]: trying to acquire cfs lock 'file-replication_cfg' ...
One minute before the flood:
Code:
Sep 13 21:55:45 vps6 corosync[32624]: [KNET ] rx: host: 10 link: 0 is up
Sep 13 21:55:45 vps6 corosync[32624]: [KNET ] host: host: 10 (passive) best link: 0 (pri: 1)
Sep 13 21:55:48 vps6 corosync[32624]: [TOTEM ] Retransmit List: 38e 378 390 392 362 370 37f 380 389 38b
Sep 13 21:55:52 vps6 corosync[32624]: [TOTEM ] Retransmit List: 39e 398 38e
Sep 13 21:55:59 vps6 corosync[32624]: [TOTEM ] Retransmit List: 3a3 3a7
Sep 13 21:56:02 vps6 corosync[32624]: [TOTEM ] Retransmit List: 3bb 3b7 3b6 3b5 3b9
Sep 13 21:56:03 vps6 corosync[32624]: [TOTEM ] Retransmit List: 3c4 3bb
Sep 13 21:56:03 vps6 corosync[32624]: [TOTEM ] Retransmit List: 3d1 3d4 3d5 3c9 3cd 3d0 3d3 3c4 3cb 3cc 3cf
Sep 13 21:56:03 vps6 corosync[32624]: [TOTEM ] Retransmit List: 3e1 3e6 3e7 3eb 3ec 3cd 3d1 3d2 3de 3df 3e5 3ea 3ed 3c4 3c7 3cc 3e0 3e4 3e9 3c3 3d3
Sep 13 21:56:03 vps6 corosync[32624]: [TOTEM ] Retransmit List: 3fc 3fd 3fe 400 3eb 405 408 40a 40b 40c 40e 40f 3d1 3ea 3ed 40d 3c4 3e4 3e9 403 404 409 3e6 3d3
And during the flood:
Code:
in hell of storm:
Sep 13 22:12:29 vps6 pmxcfs[32613]: [dcdb] notice: all data is up to date
Sep 13 22:12:29 vps6 pmxcfs[32613]: [dcdb] notice: dfsm_deliver_queue: queue length 49
Sep 13 22:12:29 vps6 pmxcfs[32613]: [dcdb] notice: remove message from non-member 22/3180
Sep 13 22:12:29 vps6 pmxcfs[32613]: [dcdb] notice: remove message from non-member 21/1519
Sep 13 22:12:29 vps6 pmxcfs[32613]: [dcdb] notice: remove message from non-member 20/28683
Sep 13 22:12:29 vps6 pmxcfs[32613]: [dcdb] notice: remove message from non-member 20/28683
Sep 13 22:12:29 vps6 pmxcfs[32613]: [dcdb] notice: remove message from non-member 2/25172
Sep 13 22:12:29 vps6 pmxcfs[32613]: [dcdb] notice: remove message from non-member 5/2013
Sep 13 22:12:29 vps6 pmxcfs[32613]: [dcdb] notice: remove message from non-member 5/2013
Sep 13 22:12:29 vps6 pmxcfs[32613]: [dcdb] notice: remove message from non-member 5/2013
Sep 13 22:12:29 vps6 pmxcfs[32613]: [dcdb] notice: remove message from non-member 10/32092
Sep 13 22:12:29 vps6 pmxcfs[32613]: [dcdb] notice: remove message from non-member 10/32092
Sep 13 22:12:29 vps6 pmxcfs[32613]: [dcdb] notice: remove message from non-member 10/32092
Sep 13 22:12:29 vps6 pmxcfs[32613]: [dcdb] notice: remove message from non-member 13/5473
Sep 13 22:12:29 vps6 pmxcfs[32613]: [dcdb] notice: remove message from non-member 12/18037
Sep 13 22:12:29 vps6 pmxcfs[32613]: [dcdb] notice: remove message from non-member 4/23669
In the netstat output I see a huge Recv-Q value (14273280) for corosync:
Code:
A.B.C.D# netstat -naltup
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:10050 0.0.0.0:* LISTEN 15135/zabbix_agentd
tcp 0 0 0.0.0.0:8006 0.0.0.0:* LISTEN 1070/pveproxy
tcp 0 0 0.0.0.0:111 0.0.0.0:* LISTEN 1/init
tcp 0 0 127.0.0.1:85 0.0.0.0:* LISTEN 1061/pvedaemon
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 820/sshd
tcp 0 0 0.0.0.0:3128 0.0.0.0:* LISTEN 1077/spiceproxy
tcp 0 0 127.0.0.1:25 0.0.0.0:* LISTEN 1018/master
...
tcp6 0 0 :::10050 :::* LISTEN 15135/zabbix_agentd
tcp6 0 0 :::111 :::* LISTEN 1/init
tcp6 0 0 :::22 :::* LISTEN 820/sshd
tcp6 0 0 ::1:25 :::* LISTEN 1018/master
udp 0 0 0.0.0.0:111 0.0.0.0:* 1/init
udp 14273280 23040 A.B.C.D:5405 0.0.0.0:* 8093/corosync <---- RecvQ ???
udp6 0 0 :::111 :::* 1/init
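For reference, a simple way to keep an eye on that counter (just a sketch using ss instead of netstat; the grep pattern assumes corosync's default port 5405):
Bash:
# Print corosync's UDP socket once per second; a steadily growing Recv-Q
# means corosync is not draining its receive buffer fast enough.
watch -n 1 "ss -u -n -p | grep ':5405 '"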
In the monitoring system all interfaces to the nodes are extremely loaded, up to 300-800 Mbps.
Capturing 128k packets on one node, I see ~250 kpps (~519 Mbps) of UDP traffic with source and destination port 5405 and packet sizes of 250-270 bytes.
This traffic is unicast, from one node to all other nodes of the cluster.
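For reference, I measured this roughly as follows (a sketch; the interface name vmbr0 is only an example, adjust it to your cluster bridge):
Bash:
# Capture 128k packets of corosync traffic (UDP port 5405) on the cluster bridge.
tcpdump -ni vmbr0 -c 128000 -w corosync-flood.pcap 'udp port 5405'
# Summarize the capture: packet count, duration, average packet size and data rate.
capinfos corosync-flood.pcap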
This flood is very disruptive: the Juniper switches survive it, but many neighbours in the cluster VLAN, and even in some routed-accessible VLANs, exhaust their resources and degrade.
When we hard-restart corosync and pve-cluster with this command:
Bash:
killall -9 pmxcfs && killall -9 corosync && systemctl restart pve-cluster && sleep 5 && systemctl restart pve-cluster
the UDP flood stops immediately, and after some seconds/minutes the cluster recovers:
- The network returns to stable operation.
- All VMs and CTs keep working.
- The rate of UDP 5405 traffic returns to its normal, small value (~90 pps / 500 kbps per node).
1) What is this? Why does corosync generate so much UDP traffic? Has anyone seen something similar?
2) Can we prevent this by applying iptables hashlimit rules in the OUTPUT chain?
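Something like this is what I have in mind (only an untested sketch; the rate and burst values are guesses, and too low a limit would of course also break legitimate corosync traffic and could cost the node quorum):
Bash:
# Rate-limit outgoing corosync traffic (UDP dport 5405) per destination host.
# The numbers are placeholders for illustration, not tuned values.
iptables -A OUTPUT -p udp --dport 5405 \
    -m hashlimit --hashlimit-name corosync-out --hashlimit-mode dstip \
    --hashlimit-above 1000/sec --hashlimit-burst 2000 \
    -j DROP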
This looks like a relatively rare but heavily disruptive bug.
Thank you for any help.