Proxmox cluster nodes DDoSing each other

sonuyos

Member
Jun 18, 2020
I have about 12 nodes in my cluster, and for some weird reason they have all gone red and are DDoSing each other (according to my provider).

https://prnt.sc/tlekgl

All the servers are pushing about 500 Mbps up and down, uploading and downloading to each other at a rapid pace.

I am not sure what is going on, any help? This is the second time this has happened.
 
Update:

Stopping corosync and starting it again fixes the bandwidth issue, but I am afraid this will come back.

And I do not have unlimited bandwidth. Any way to fix this? :(
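
For reference, this is roughly what I run on an affected node to stop and start corosync (assuming the stock systemd units Proxmox VE 6.x ships with):

systemctl stop corosync     # stop cluster communication on this node; the traffic between nodes drops back to normal
systemctl start corosync    # rejoin the cluster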
 
It is very difficult to assist without details... A starting point would be to provide the Proxmox version deployed on all the nodes; it is shown under the node's Summary as "Package Versions".
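
The same information can also be printed on the command line of each node with something like:

pveversion -v    # full list of installed Proxmox VE package versions
pveversion       # without -v, just the short pve-manager version string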

Regards,
 
Oh sorry, I am new around here.

These are the 3 versions across my 12 nodes:

pve-manager/6.2-6/ee1d7754
pve-manager/6.2-10/a20769ed
pve-manager/6.2-9/4d363c5b

The nodes just go red (all online, but red).

Stopping corosync and then starting it again fixes it, but this has now happened twice in 3 weeks.
 
Please describe the full architecture of your cluster network as well as possible.
Hi Tom,

Alright, so I have 13 nodes in my cluster - https://prnt.sc/tlyy8r


And they work perfectly fine, for the most part.

All VMs are on local disk (I manually move/create VM templates on each server).

All nodes have vmbr0.


These are the 3 versions across my 13 nodes:

pve-manager/6.2-6/ee1d7754
pve-manager/6.2-10/a20769ed
pve-manager/6.2-9/4d363c5b

But randomly (it has happened 3 times, without any intervention on my part) they all go red, and then the bandwidth goes crazy, as if they were transferring files or something between them.

Also, I assume the size (of the files?) is small, as the DC is detecting it as a DDoS attack.


Here - https://prnt.sc/tlekgl

Bandwidth graph from one of the servers - https://prnt.sc/tlz0j2
As you can see, the whole day is normal, but in the past 24 hours it spiked twice, during the issue.

It also gets fixed instantly: I stop corosync and the transfer between the nodes (I don't know what it is transferring) stops automatically.

It is not only one server; I think all of them are transferring between each other.

They go haywire, and then when I start corosync back up again, everything works just as it is supposed to. Flat bandwidth curve.
 
PVE Manager Version - pve-manager/6.2-6/ee1d7754

That is for the server I provided the bandwidth screenshot of.
Also, let me clarify: it is not only 1 node, it is almost all of them exchanging files/traffic.

I can see in nload that they are connected to each other.
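
In case anyone wants to check the same thing, I am just watching the traffic with something like:

nload vmbr0       # live in/out bandwidth on the bridge interface
iftop -i vmbr0    # per-connection view, showing which hosts are talking to each other (iftop is installed separately, e.g. apt install iftop)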
 
Thanks, but I still do not know how you configured your cluster network (the corosync links).
  1. Are all nodes in the same datacenter?
  2. How many physical corosync links did you use?
  3. Did you use physically separated corosync links?
  4. What latency do you have on these links?
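
You can see what corosync is currently using for its links on any node, e.g.:

cat /etc/pve/corosync.conf    # the nodelist section shows the ring0_addr (and any additional links) configured for each node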
 

1 - No, they are spread across 3 DCs.
2 & 3 - I have not configured corosync at all; I just installed Proxmox on all the servers and then joined them into one cluster, nothing else (essentially just the standard join commands, see below).
4 - I am not really sure.
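
(By "joined them into one cluster" I mean nothing more than the usual commands, assuming I remember them right:

pvecm create mycluster          # on the first node; "mycluster" is just a placeholder name
pvecm add <IP-of-first-node>    # on every other node, using its public IP

so corosync ended up with whatever defaults that produces.)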
 
By the looks of it, it happened earlier today too.

View from the usa19 node's Proxmox interface:
usa19 - https://prnt.sc/tlzqsh
usa01 - https://prnt.sc/tlzu0x

Same for the others too.

View from the usa01 node's Proxmox interface:
usa01 - https://prnt.sc/tlztg4
usa19 - https://prnt.sc/tlzuoa

You can see that, because corosync was down or haywire (I assume), the usa19 interface shows nothing for usa01 in its network summary during that time (it is blank), while usa01 shows bandwidth being used during that time but in turn shows usa19 as blank on its own summary page.

I know it is confusing, but I don't know how to explain it better.
 
What is the latency between the links (used by corosync)?

e.g.
NodeXinDC1 to NodeYinDC2: latency in ms/ns
NodeXinDC1 to NodeZinDC3: latency in ms/ns
NodeYinDC2 to NodeZinDC3: latency in ms/ns


Are you using a "Virtual Private Network" technology to tie the underlying network together, did you create an internal network using a separate NIC, or did you just add the nodes to the cluster using the public (internet-facing) IPs of the nodes?
 

Can you tell me the command to check? I am not sure how to check that for corosync; I can tell you iperf results between nodes if you want.

Also, they are connected on their public IPs, not an internal network, VPN, or separate NIC. That is the main issue, as it is using my bandwidth.
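
(For reference, by iperf I just mean the usual test, assuming iperf3 is installed on both nodes:

iperf3 -s                       # on one node, run as server
iperf3 -c <IP-of-that-node>     # on the other node, measures throughput between them
)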
 
A ping from Node X to Node Y on the network used by corosync will suffice. If you set up the cluster using the public (internet-facing) IPs of the hosts and did not manually configure corosync afterwards, it would be:

On host X: "ping [IP Host Y]"
On host X: "ping [IP Host Z]"
On host Y: "ping [IP Host Z]"

P.S.: this is what Tom was after in item 4:
4. What latency do you have on these links?
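
If you want to double-check which addresses corosync is actually using (and therefore which ones to ping), something like this on any node should show the membership list with each node's address:

pvecm status    # cluster/quorum state, including the membership information for the nodes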
 

Thanks @Q-wulf

Here you go. Also, I just remembered that I had not mentioned the 3rd DC in this; my servers are actually spread across 4 different locations. I have added them below.

DC-A to DC-B (LAX to Buffalo)
64 bytes from xxxxxxxxxxx: icmp_seq=1 ttl=47 time=68.8 ms
64 bytes from xxxxxxxxxxx: icmp_seq=2 ttl=47 time=68.9 ms
64 bytes from xxxxxxxxxxx: icmp_seq=3 ttl=47 time=68.9 ms
64 bytes from xxxxxxxxxxx: icmp_seq=4 ttl=47 time=68.8 ms
64 bytes from xxxxxxxxxxx: icmp_seq=5 ttl=47 time=68.8 ms
64 bytes from xxxxxxxxxxx: icmp_seq=6 ttl=47 time=68.10 ms
64 bytes from xxxxxxxxxxx: icmp_seq=7 ttl=47 time=68.9 ms
64 bytes from xxxxxxxxxxx: icmp_seq=8 ttl=47 time=68.9 ms
64 bytes from xxxxxxxxxxx: icmp_seq=9 ttl=47 time=68.8 ms


DC-A to Another DC-A (LAX to Atlanta)
64 bytes from xxxxxxxxxxx: icmp_seq=3 ttl=58 time=46.6 ms
64 bytes from xxxxxxxxxxx: icmp_seq=4 ttl=58 time=46.5 ms
64 bytes from xxxxxxxxxxx: icmp_seq=5 ttl=58 time=46.5 ms
64 bytes from xxxxxxxxxxx: icmp_seq=6 ttl=58 time=46.5 ms
64 bytes from xxxxxxxxxxx: icmp_seq=7 ttl=58 time=46.5 ms
64 bytes from xxxxxxxxxxx: icmp_seq=8 ttl=58 time=46.5 ms
64 bytes from xxxxxxxxxxx: icmp_seq=9 ttl=58 time=46.5 ms

DC-A to Another DC-A (LAX to Dallas)
64 bytes from xxxxxxxxxxx: icmp_seq=2 ttl=56 time=28.8 ms
64 bytes from xxxxxxxxxxx: icmp_seq=3 ttl=56 time=28.9 ms
64 bytes from xxxxxxxxxxx: icmp_seq=4 ttl=56 time=28.9 ms
64 bytes from xxxxxxxxxxx: icmp_seq=5 ttl=56 time=28.8 ms
64 bytes from xxxxxxxxxxx: icmp_seq=6 ttl=56 time=28.8 ms
64 bytes from xxxxxxxxxxx: icmp_seq=7 ttl=56 time=28.8 ms
64 bytes from xxxxxxxxxxx: icmp_seq=8 ttl=56 time=28.9 ms
64 bytes from xxxxxxxxxxx: icmp_seq=9 ttl=56 time=28.8 ms


DC-A to Another DC-A (Atlanta to Dallas)
64 bytes from xxxxxxxxxxx: icmp_seq=1 ttl=56 time=19.10 ms
64 bytes from xxxxxxxxxxx: icmp_seq=2 ttl=56 time=19.10 ms
64 bytes from xxxxxxxxxxx: icmp_seq=3 ttl=56 time=19.9 ms
64 bytes from xxxxxxxxxxx: icmp_seq=4 ttl=56 time=19.10 ms
64 bytes from xxxxxxxxxxx: icmp_seq=5 ttl=56 time=19.9 ms
64 bytes from xxxxxxxxxxx: icmp_seq=6 ttl=56 time=19.9 ms
64 bytes from xxxxxxxxxxx: icmp_seq=7 ttl=56 time=19.10 ms
64 bytes from xxxxxxxxxxx: icmp_seq=8 ttl=56 time=19.10 ms
64 bytes from xxxxxxxxxxx: icmp_seq=9 ttl=56 time=19.10 ms
 
New insight:

Maybe, and this is a very strong maybe: this happens when corosync crashes on one of my nodes, because I just stopped corosync on one of my servers and I can see the bandwidth going crazy.
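
If that theory holds, it should show up in the service logs on the node in question, e.g. something like:

journalctl -u corosync --since "24 hours ago"       # look for crashes/restarts of the corosync service
journalctl -u pve-cluster --since "24 hours ago"    # pmxcfs (the Proxmox cluster filesystem) runs on top of corosync and may log related errors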
 
