Weird Cluster bandwidth behavior

andy77

Renowned Member
Jul 6, 2016
Hi @ all,

yesterday we saw some pretty weird behaviour on our 25-node cluster. I did two things that seem to have broken the whole cluster.

1) Started a live ZFS migration from a node running version 6.1-x to a 6.2-x node.
2) Installed a new node in the meantime and added it to the cluster (because I forgot that the live migration was still running).

After adding the new node I noticed that the cluster seemed to be unhealthy. Checking the availability of all the nodes, I saw that their latency was pretty bad and sometimes there were even timeouts. This made me think I had a network problem and led my analysis in a different direction (collisions). After checking the switch status I noticed that every port with a node of this cluster connected was carrying almost 1 GB/s of traffic, which adds up to around 25 GB/s through the whole switch. To be honest, in my hurry, and still not sure what the real cause was (maybe a broken network card, or really the cluster filesystem), I shut down every node and started it again, which fixed the problem.

Now I would like to understand what happened. My assumption is that the cluster filesystem somehow got into a loop where every node changed something and then changed it back again, which led to extremely high traffic on the switch.

Any other ideas or explanations for this?

Regards
Andy
 
Hi,

do you have a dedicated, redundant network for corosync? And which network do you use for migration?
I can imagine that you migrated over the corosync network, so the latency rose and the corosync queue filled up and kept resending packages.
With large clusters there can be a point where corosync can't keep up anymore and the traffic stays at this level.
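One quick way to see whether the corosync network is being squeezed during a migration is to watch the round-trip latency of every node on that network. Below is a minimal Python sketch of such a check; the node addresses and the 5 ms warning threshold are placeholder assumptions, not values from your cluster, and it simply wraps the standard `ping` tool rather than querying corosync itself.

```python
#!/usr/bin/env python3
"""Rough latency check across cluster nodes (hypothetical addresses)."""
import re
import subprocess

# Hypothetical corosync-network addresses of the 25 nodes -- adjust to your setup.
NODES = [f"10.0.0.{i}" for i in range(1, 26)]

# Arbitrary warning threshold for this sketch; corosync wants consistently low latency.
WARN_MS = 5.0

def avg_rtt_ms(host, count=5):
    """Ping a host and return the average round-trip time in ms, or None on failure."""
    try:
        out = subprocess.run(
            ["ping", "-c", str(count), "-W", "1", host],
            capture_output=True, text=True, check=True,
        ).stdout
    except subprocess.CalledProcessError:
        return None  # no replies at all
    # iputils summary line: "rtt min/avg/max/mdev = 0.123/0.456/0.789/0.012 ms"
    match = re.search(r"= [\d.]+/([\d.]+)/", out)
    return float(match.group(1)) if match else None

if __name__ == "__main__":
    for node in NODES:
        rtt = avg_rtt_ms(node)
        if rtt is None:
            print(f"{node}: TIMEOUT")
        elif rtt > WARN_MS:
            print(f"{node}: {rtt:.1f} ms  <-- suspiciously high for a corosync link")
        else:
            print(f"{node}: {rtt:.1f} ms")
```

If the latency only climbs while a migration is running, that points to shared links: moving migration traffic to its own network (or giving corosync a dedicated NIC) avoids the contention described above.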