Since earlier today, my cluster keeps losing nodes/quorum. I was configuring a new VLAN on my switch and had just reconfigured the port-channels on the switches when all the trouble started.
What happens:
Corosync bails out with an error. I stop pve-cluster and corosync, then start pve-cluster; it runs for a couple of minutes and then the same thing happens again. See the log files from the time I started pve-cluster until it errored out again.
I also ran omping; see the results:
hv01:
Code:
root@hv01:~# omping -c 10000 -i 0.001 -F -q hv01 hv02 hv03
hv02 : waiting for response msg
hv03 : waiting for response msg
hv02 : waiting for response msg
hv03 : waiting for response msg
hv02 : waiting for response msg
hv03 : waiting for response msg
hv03 : joined (S,G) = (*, 232.43.211.234), pinging
hv02 : joined (S,G) = (*, 232.43.211.234), pinging
hv02 : given amount of query messages was sent
hv03 : given amount of query messages was sent
hv02 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.026/0.065/1.752/0.050
hv02 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.027/0.064/1.766/0.028
hv03 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.024/0.056/0.204/0.019
hv03 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.028/0.063/0.209/0.022
hv02:
Code:
root@hv02:~# omping -c 10000 -i 0.001 -F -q hv01 hv02 hv03
hv01 : waiting for response msg
hv03 : waiting for response msg
hv01 : joined (S,G) = (*, 232.43.211.234), pinging
hv03 : waiting for response msg
hv03 : joined (S,G) = (*, 232.43.211.234), pinging
hv01 : given amount of query messages was sent
hv03 : waiting for response msg
hv03 : server told us to stop
hv01 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.026/0.065/1.656/0.045
hv01 : multicast, xmt/rcv/%loss = 10000/9989/0% (seq>=12 0%), min/avg/max/std-dev = 0.025/0.071/0.626/0.043
hv03 : unicast, xmt/rcv/%loss = 9625/9625/0%, min/avg/max/std-dev = 0.024/0.057/0.154/0.020
hv03 : multicast, xmt/rcv/%loss = 9625/9625/0%, min/avg/max/std-dev = 0.024/0.062/0.193/0.022
hv03:
Code:
root@hv03:~# omping -c 10000 -i 0.001 -F -q hv01 hv02 hv03
hv01 : waiting for response msg
hv02 : waiting for response msg
hv02 : joined (S,G) = (*, 232.43.211.234), pinging
hv01 : joined (S,G) = (*, 232.43.211.234), pinging
hv01 : waiting for response msg
hv01 : server told us to stop
hv02 : given amount of query messages was sent
hv01 : unicast, xmt/rcv/%loss = 9854/9854/0%, min/avg/max/std-dev = 0.023/0.059/0.186/0.020
hv01 : multicast, xmt/rcv/%loss = 9854/9844/0% (seq>=11 0%), min/avg/max/std-dev = 0.023/0.064/0.192/0.022
hv02 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.023/0.055/0.206/0.019
hv02 : multicast, xmt/rcv/%loss = 10000/9990/0% (seq>=11 0%), min/avg/max/std-dev = 0.026/0.061/0.190/0.021
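omping's %loss column is rounded to a whole percent, so the multicast lines above that show received counts below the sent counts still read "0%". The small Python sketch below is a hypothetical helper, not part of omping or Proxmox. It parses the summary lines and computes the exact loss. It also assumes that the `(seq>=N …)` annotation means the first multicast packet received was sequence N. Under that assumption, a lost count equal to N - 1 means only the first packets of the run went missing. For the three multicast lines above, the lost counts (11, 10, 10) match their thresholds (12, 11, 11) this way.

```python
import re
from typing import List, NamedTuple, Optional

# Matches omping summary lines such as:
#   hv01 : multicast, xmt/rcv/%loss = 10000/9989/0% (seq>=12 0%), min/avg/...
SUMMARY_RE = re.compile(
    r"^(?P<host>\S+)\s*:\s*(?P<kind>unicast|multicast),\s*"
    r"xmt/rcv/%loss = (?P<xmt>\d+)/(?P<rcv>\d+)/\d+%"
    r"(?: \(seq>=(?P<seq>\d+) \d+%\))?"
)


class Summary(NamedTuple):
    host: str
    kind: str
    sent: int
    received: int
    first_seq: Optional[int]  # N from "(seq>=N ...)", None if not printed

    @property
    def lost(self) -> int:
        return self.sent - self.received

    @property
    def loss_pct(self) -> float:
        # Exact loss percentage; omping prints this rounded to an integer.
        return 100.0 * self.lost / self.sent if self.sent else 0.0

    @property
    def only_startup_loss(self) -> bool:
        # Assumption: if sequence N was the first packet received, packets
        # 1..N-1 are the only losses exactly when lost == N - 1.
        return self.first_seq is not None and self.lost == self.first_seq - 1


def parse(text: str) -> List[Summary]:
    """Return one Summary per unicast/multicast result line; other lines are ignored."""
    out = []
    for line in text.splitlines():
        m = SUMMARY_RE.match(line.strip())
        if m:
            seq = int(m["seq"]) if m["seq"] else None
            out.append(Summary(m["host"], m["kind"], int(m["xmt"]), int(m["rcv"]), seq))
    return out


if __name__ == "__main__":
    # Multicast lines copied from the hv02 and hv03 runs above.
    sample = """\
hv01 : multicast, xmt/rcv/%loss = 10000/9989/0% (seq>=12 0%), min/avg/max/std-dev = 0.025/0.071/0.626/0.043
hv01 : multicast, xmt/rcv/%loss = 9854/9844/0% (seq>=11 0%), min/avg/max/std-dev = 0.023/0.064/0.192/0.022
hv02 : multicast, xmt/rcv/%loss = 10000/9990/0% (seq>=11 0%), min/avg/max/std-dev = 0.026/0.061/0.190/0.021
"""
    for s in parse(sample):
        print(f"{s.host} {s.kind}: lost {s.lost} ({s.loss_pct:.2f}%), startup only: {s.only_startup_loss}")
```

Pasting a whole omping run into `parse()` works too, because the "waiting for response msg" and "joined" lines simply don't match the pattern.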
Can anyone help me understand what the problem is?