Proxmox cluster nodes DDoSing each other

sonuyos

Member
Jun 18, 2020
I have about 12 nodes in my cluster, and for some weird reason they have all gone red and are DDoSing each other (according to my provider).

https://prnt.sc/tlekgl

All the servers are pushing about 500 Mbps up and down, uploading and downloading to each other at a rapid pace.

I am not sure what is going on, any help? This is the second time this has happened.
 
Update:

Stopping corosync and starting it again fixes the bandwidth issue, but I am afraid this will come back.

And I do not have unlimited bandwidth. Any way to fix this? :(
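
For reference, this is roughly what I run on an affected node to stop and start corosync (assuming the stock systemd units Proxmox VE 6.x ships with):

systemctl stop corosync     # stop cluster communication on this node; the traffic between nodes drops back to normal
systemctl start corosync    # rejoin the cluster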
 
It is very difficult to assist without details... A starting point would be to provide the Proxmox version deployed on all the nodes; it is shown under the node's Summary as "Package Versions".
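
The same information can also be printed on the command line of each node with something like:

pveversion -v    # full list of installed Proxmox VE package versions
pveversion       # without -v, just the short pve-manager version string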

Regards,
 
Oh sorry, I am new around here.

These are the 3 versions across my 12 nodes:

pve-manager/6.2-6/ee1d7754
pve-manager/6.2-10/a20769ed
pve-manager/6.2-9/4d363c5b

The nodes just go red (all online, but red).

Stopping corosync and then starting it again fixes it, but this has now happened twice in 3 weeks.
 
Please describe the full architecture of your cluster network as well as possible.
Hi Tom,

Alright, so I have 13 nodes in my cluster - https://prnt.sc/tlyy8r


And they work perfectly fine, for the most part.

All VMs are on local disk (I manually move/create VM templates on each server).

All nodes have vmbr0.


These are the 3 versions across my 13 nodes:

pve-manager/6.2-6/ee1d7754
pve-manager/6.2-10/a20769ed
pve-manager/6.2-9/4d363c5b

But randomly (it has happened 3 times, without any intervention on my part) they all go red, and then the bandwidth goes crazy, as if they were transferring files or something between them.

Also, I assume the size (of the files?) is small, as the DC is detecting it as a DDoS attack.


Here - https://prnt.sc/tlekgl

Bandwidth graph from one of the servers - https://prnt.sc/tlz0j2
As you can see, the whole day is normal, but in the past 24 hours it spiked twice, during the issue.

It also gets fixed instantly: I stop corosync and the transfer between the nodes (I don't know what it is transferring) stops automatically.

It is not only one server; I think all of them are transferring between each other.

They go haywire, and then when I start corosync back up again, everything works just as it is supposed to. Flat bandwidth curve.
 
PVE Manager Version - pve-manager/6.2-6/ee1d7754

That is for the server I provided the bandwidth screenshot of.
Also, let me clarify: it is not only 1 node, it is almost all of them exchanging files/traffic.

I can see in nload that they are connected to each other.
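
In case anyone wants to check the same thing, I am just watching the traffic with something like:

nload vmbr0       # live in/out bandwidth on the bridge interface
iftop -i vmbr0    # per-connection view, showing which hosts are talking to each other (iftop is installed separately, e.g. apt install iftop)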
 
Thanks, but I still do not know how you configured your cluster network (the corosync links).
  1. Are all nodes in the same datacenter?
  2. How many physical corosync links did you use?
  3. Did you use physically separated corosync links?
  4. What latency do you have on these links?
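
You can see what corosync is currently using for its links on any node, e.g.:

cat /etc/pve/corosync.conf    # the nodelist section shows the ring0_addr (and any additional links) configured for each node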
 

1 - No, they are spread across 3 DCs.
2 & 3 - I have not configured corosync at all; I just installed Proxmox on all the servers and then joined them into one cluster, nothing else (essentially just the standard join commands, see below).
4 - I am not really sure.
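
(By "joined them into one cluster" I mean nothing more than the usual commands, assuming I remember them right:

pvecm create mycluster          # on the first node; "mycluster" is just a placeholder name
pvecm add <IP-of-first-node>    # on every other node, using its public IP

so corosync ended up with whatever defaults that produces.)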
 
By the looks of it, it happened earlier today too.

View from the usa19 node's Proxmox interface:
usa19 - https://prnt.sc/tlzqsh
usa01 - https://prnt.sc/tlzu0x

Same for the others too.

View from the usa01 node's Proxmox interface:
usa01 - https://prnt.sc/tlztg4
usa19 - https://prnt.sc/tlzuoa

You can see that, because corosync was down or haywire (I assume), the usa19 interface shows nothing for usa01 in its network summary during that time (it is blank), while usa01 shows bandwidth being used during that time but in turn shows usa19 as blank on its own summary page.

I know it is confusing, but I don't know how to explain it better.
 
What is the latency between the links (used by corosync)?

e.g.
NodeXinDC1 to NodeYinDC2: latency in ms/ns
NodeXinDC1 to NodeZinDC3: latency in ms/ns
NodeYinDC2 to NodeZinDC3: latency in ms/ns


Are you using a "Virtual Private Network" technology to tie the underlying network together, did you create an internal network using a separate NIC, or did you just add the nodes to the cluster using the public (internet-facing) IPs of the nodes?
 

Can you tell me the command to check? I am not sure how to check that for corosync; I can tell you iperf results between nodes if you want.

Also, they are connected on their public IPs, not an internal network, VPN, or separate NIC. That is the main issue, as it is using my bandwidth.
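
(For reference, by iperf I just mean the usual test, assuming iperf3 is installed on both nodes:

iperf3 -s                       # on one node, run as server
iperf3 -c <IP-of-that-node>     # on the other node, measures throughput between them
)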
 
A ping from Node X to Node Y on the network used by corosync will suffice. If you set up the cluster using the public (internet-facing) IPs of the hosts and did not manually configure corosync afterwards, it would be:

On host X: "ping [IP Host Y]"
On host X: "ping [IP Host Z]"
On host Y: "ping [IP Host Z]"

P.S.: this is what Tom was after in item 4:
4. What latency do you have on these links?
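
If you want to double-check which addresses corosync is actually using (and therefore which ones to ping), something like this on any node should show the membership list with each node's address:

pvecm status    # cluster/quorum state, including the membership information for the nodes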
 

Thanks @Q-wulf

Here you go. Also, I just remembered that I had not mentioned the 3rd DC in this; my servers are actually spread across 4 different locations. I have added them below.

DC-A to DC-B (LAX to Buffalo)
64 bytes from xxxxxxxxxxx: icmp_seq=1 ttl=47 time=68.8 ms
64 bytes from xxxxxxxxxxx: icmp_seq=2 ttl=47 time=68.9 ms
64 bytes from xxxxxxxxxxx: icmp_seq=3 ttl=47 time=68.9 ms
64 bytes from xxxxxxxxxxx: icmp_seq=4 ttl=47 time=68.8 ms
64 bytes from xxxxxxxxxxx: icmp_seq=5 ttl=47 time=68.8 ms
64 bytes from xxxxxxxxxxx: icmp_seq=6 ttl=47 time=68.10 ms
64 bytes from xxxxxxxxxxx: icmp_seq=7 ttl=47 time=68.9 ms
64 bytes from xxxxxxxxxxx: icmp_seq=8 ttl=47 time=68.9 ms
64 bytes from xxxxxxxxxxx: icmp_seq=9 ttl=47 time=68.8 ms


DC-A to Another DC-A (LAX to Atlanta)
64 bytes from xxxxxxxxxxx: icmp_seq=3 ttl=58 time=46.6 ms
64 bytes from xxxxxxxxxxx: icmp_seq=4 ttl=58 time=46.5 ms
64 bytes from xxxxxxxxxxx: icmp_seq=5 ttl=58 time=46.5 ms
64 bytes from xxxxxxxxxxx: icmp_seq=6 ttl=58 time=46.5 ms
64 bytes from xxxxxxxxxxx: icmp_seq=7 ttl=58 time=46.5 ms
64 bytes from xxxxxxxxxxx: icmp_seq=8 ttl=58 time=46.5 ms
64 bytes from xxxxxxxxxxx: icmp_seq=9 ttl=58 time=46.5 ms

DC-A to Another DC-A (LAX to Dallas)
64 bytes from xxxxxxxxxxx: icmp_seq=2 ttl=56 time=28.8 ms
64 bytes from xxxxxxxxxxx: icmp_seq=3 ttl=56 time=28.9 ms
64 bytes from xxxxxxxxxxx: icmp_seq=4 ttl=56 time=28.9 ms
64 bytes from xxxxxxxxxxx: icmp_seq=5 ttl=56 time=28.8 ms
64 bytes from xxxxxxxxxxx: icmp_seq=6 ttl=56 time=28.8 ms
64 bytes from xxxxxxxxxxx: icmp_seq=7 ttl=56 time=28.8 ms
64 bytes from xxxxxxxxxxx: icmp_seq=8 ttl=56 time=28.9 ms
64 bytes from xxxxxxxxxxx: icmp_seq=9 ttl=56 time=28.8 ms


DC-A to Another DC-A (Atlanta to Dallas)
64 bytes from xxxxxxxxxxx: icmp_seq=1 ttl=56 time=19.10 ms
64 bytes from xxxxxxxxxxx: icmp_seq=2 ttl=56 time=19.10 ms
64 bytes from xxxxxxxxxxx: icmp_seq=3 ttl=56 time=19.9 ms
64 bytes from xxxxxxxxxxx: icmp_seq=4 ttl=56 time=19.10 ms
64 bytes from xxxxxxxxxxx: icmp_seq=5 ttl=56 time=19.9 ms
64 bytes from xxxxxxxxxxx: icmp_seq=6 ttl=56 time=19.9 ms
64 bytes from xxxxxxxxxxx: icmp_seq=7 ttl=56 time=19.10 ms
64 bytes from xxxxxxxxxxx: icmp_seq=8 ttl=56 time=19.10 ms
64 bytes from xxxxxxxxxxx: icmp_seq=9 ttl=56 time=19.10 ms
 
New insight:

Maybe, and this is a very strong maybe: this happens when corosync crashes on one of my nodes, because I just stopped corosync on one of my servers and I can see the bandwidth going crazy.
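
If that theory holds, it should show up in the service logs on the node in question, e.g. something like:

journalctl -u corosync --since "24 hours ago"       # look for crashes/restarts of the corosync service
journalctl -u pve-cluster --since "24 hours ago"    # pmxcfs (the Proxmox cluster filesystem) runs on top of corosync and may log related errors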
 
