Weird Cluster bandwidth behavior

andy77

Renowned Member
Jul 6, 2016
Hi @ all,

yesterday we saw some pretty weird behaviour on our 25-node cluster. I did two things that seem to have broken the whole cluster.

1) Started a live ZFS migration from a node running version 6.1-x to a 6.2-x node.
2) Installed a new node in the meantime and added it to the cluster (because I forgot that the live migration was still running).

After adding the new node I noticed that the cluster seemed to be unhealthy. Checking the availability of all the nodes, I saw that their latency was pretty bad and sometimes there were even timeouts. This made me think I had a network problem and led my analysis in a different direction (collisions). After checking the switch status I noticed that every port with a node of this cluster connected was carrying almost 1 GB/s of traffic, which adds up to around 25 GB/s through the whole switch. To be honest, in my hurry, and still not sure what the real cause was (maybe a broken network card, or really the cluster filesystem), I shut down every node and started it again, which fixed the problem.

Now I would like to understand what happened. My assumption is that the cluster filesystem somehow got into a loop where every node changed something and then changed it back again, which led to extremely high traffic on the switch.

Any other ideas or explanations for this?

Regards
Andy
 
Hi,

do you have a dedicated, redundant network for corosync? And which network do you use for migration?
I can imagine that you migrated over the corosync network, so the latency rose and the corosync queue filled up and kept resending packages.
With large clusters there can be a point where corosync can't keep up anymore and the traffic stays at this level.
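One quick way to see whether the corosync network is being squeezed during a migration is to watch the round-trip latency of every node on that network. Below is a minimal Python sketch of such a check; the node addresses and the 5 ms warning threshold are placeholder assumptions, not values from your cluster, and it simply wraps the standard `ping` tool rather than querying corosync itself.

```python
#!/usr/bin/env python3
"""Rough latency check across cluster nodes (hypothetical addresses)."""
import re
import subprocess

# Hypothetical corosync-network addresses of the 25 nodes -- adjust to your setup.
NODES = [f"10.0.0.{i}" for i in range(1, 26)]

# Arbitrary warning threshold for this sketch; corosync wants consistently low latency.
WARN_MS = 5.0

def avg_rtt_ms(host, count=5):
    """Ping a host and return the average round-trip time in ms, or None on failure."""
    try:
        out = subprocess.run(
            ["ping", "-c", str(count), "-W", "1", host],
            capture_output=True, text=True, check=True,
        ).stdout
    except subprocess.CalledProcessError:
        return None  # no replies at all
    # iputils summary line: "rtt min/avg/max/mdev = 0.123/0.456/0.789/0.012 ms"
    match = re.search(r"= [\d.]+/([\d.]+)/", out)
    return float(match.group(1)) if match else None

if __name__ == "__main__":
    for node in NODES:
        rtt = avg_rtt_ms(node)
        if rtt is None:
            print(f"{node}: TIMEOUT")
        elif rtt > WARN_MS:
            print(f"{node}: {rtt:.1f} ms  <-- suspiciously high for a corosync link")
        else:
            print(f"{node}: {rtt:.1f} ms")
```

If the latency only climbs while a migration is running, that points to shared links: moving migration traffic to its own network (or giving corosync a dedicated NIC) avoids the contention described above.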