Hi All,
I have a cluster of 7 nodes in an HP C7000 chassis. Ceph on each node is running over a 10G switch.
The Ceph cluster has been running perfectly for almost a year, but all of a sudden something is causing the nodes to lose connectivity: all nodes show as unavailable and the CTs crash.
I have found the only way to fix the issue is to reboot all the nodes.
I've read that this could be a problem with multicast; however, I have run a ten-minute test and the results seem OK:
-------
10.10.11.48 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.040/0.169/0.837/0.072
10.10.11.48 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.044/0.179/0.838/0.073
10.10.11.49 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.037/0.165/0.652/0.068
10.10.11.49 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.042/0.178/0.640/0.069
10.10.11.51 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.041/0.154/0.617/0.058
10.10.11.51 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.046/0.163/0.616/0.058
10.10.11.52 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.037/0.146/1.240/0.077
10.10.11.52 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.040/0.159/1.236/0.078
10.10.11.53 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.039/0.154/0.423/0.059
10.10.11.53 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.041/0.164/0.435/0.061
10.10.11.54 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.040/0.142/0.352/0.047
10.10.11.54 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.043/0.150/0.378/0.047
--------
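In case it helps anyone checking the same kind of output, here is a rough sketch of how the summary lines above could be scanned for packet loss or latency spikes. The function names (`parse_summary`, `find_problems`) and the thresholds are my own assumptions for illustration, not part of any tool, and the line format is taken only from the output pasted above.

```python
import re

# Matches summary lines in the format pasted above, e.g.
# 10.10.11.48 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.040/0.169/0.837/0.072
LINE_RE = re.compile(
    r"^(?P<host>\S+)\s*:\s*(?P<kind>unicast|multicast),\s*"
    r"xmt/rcv/%loss = (?P<xmt>\d+)/(?P<rcv>\d+)/(?P<loss>\d+)%"
    r".*min/avg/max/std-dev = "
    r"(?P<min>[\d.]+)/(?P<avg>[\d.]+)/(?P<max>[\d.]+)/(?P<std>[\d.]+)"
)


def parse_summary(text):
    """Return one dict per recognised summary line; other lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        rows.append({
            "host": m.group("host"),
            "kind": m.group("kind"),
            "xmt": int(m.group("xmt")),
            "rcv": int(m.group("rcv")),
            "loss_pct": int(m.group("loss")),
            "max_ms": float(m.group("max")),
        })
    return rows


def find_problems(rows, max_loss_pct=0, max_latency_ms=5.0):
    """Return rows whose loss or worst-case latency exceeds the (assumed) limits."""
    return [
        r for r in rows
        if r["loss_pct"] > max_loss_pct or r["max_ms"] > max_latency_ms
    ]


if __name__ == "__main__":
    sample = """\
10.10.11.48 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.040/0.169/0.837/0.072
10.10.11.52 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.040/0.159/1.236/0.078
"""
    for row in find_problems(parse_summary(sample)):
        print(row["host"], row["kind"], row["loss_pct"], row["max_ms"])
```

With the default limits, none of the rows in my test would be flagged, which matches my reading that the results look fine.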
Can anyone suggest how to fix this problem?