Hi all,
yesterday we had some pretty weird behaviour on our 25-node cluster. I did two things that seem to have broken the whole cluster.
1) Started a live ZFS migration from a node on a 6.1-x version to a 6.2-x node.
2) In the meantime, installed a new node and added it to the cluster (because I forgot that the live migration was still running).
After adding the new node I noticed that the cluster seemed to be unhealthy. Checking the availability of all the nodes, I saw that the latency between them was pretty bad and sometimes even timeouts happened. This made me think I had a network problem and led my analysis in a different direction (collisions). After checking the switch status I saw that on all ports where nodes from this cluster are connected, we had almost 1 GB/s of traffic per port, which adds up to around 25 GB/s through the whole switch. To be honest, in my hurry, and still unsure what the real cause was (maybe a broken network card, or really the cluster filesystem), I shut every node down and started it again, which fixed the problem.
Now I would like to understand what happened. My assumption is that the cluster filesystem somehow got into a loop where every node changed something and then re-changed it again, and that led to extremely high traffic on the switch.
Any other ideas or explanations on that?
Regards
Andy