notice: [TOTEM ] Retransmit List

yavuz

Jun 22, 2014
Since earlier today my cluster keeps losing nodes and quorum. I was configuring a new VLAN on my switch and had just reconfigured the port-channels on the switches when all the trouble started.

What happens:
Corosync bails out with an error. I stop pve-cluster and corosync, then start pve-cluster again; it runs for a couple of minutes and then the same thing happens. The attached log files cover the time from when I started pve-cluster until it errors out again.
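For anyone going through the attached log, a minimal sketch of how the retransmit messages could be pulled out of a syslog file. The marker string is taken from the thread title; the function names and the idea of filtering by a plain substring are assumptions for illustration, not something corosync or Proxmox ships.

```python
# Marker copied from the thread title; corosync writes it for each retransmit notice.
MARKER = "[TOTEM ] Retransmit List"


def retransmit_entries(lines):
    """Return every log line that contains the retransmit marker."""
    return [line.rstrip("\n") for line in lines if MARKER in line]


def retransmit_entries_from_file(path):
    """Hypothetical helper: read a syslog file and return the matching lines."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return retransmit_entries(handle)
```

Calling `retransmit_entries_from_file()` on the extracted syslog and looking at the first and last matches gives a rough idea of when the retransmits start relative to the corosync restart.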

I also ran omping, see results:

hv01:
Code:
root@hv01:~# omping -c 10000 -i 0.001 -F -q hv01 hv02 hv03
hv02 : waiting for response msg
hv03 : waiting for response msg
hv02 : waiting for response msg
hv03 : waiting for response msg
hv02 : waiting for response msg
hv03 : waiting for response msg
hv03 : joined (S,G) = (*, 232.43.211.234), pinging
hv02 : joined (S,G) = (*, 232.43.211.234), pinging
hv02 : given amount of query messages was sent
hv03 : given amount of query messages was sent

hv02 :   unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.026/0.065/1.752/0.050
hv02 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.027/0.064/1.766/0.028
hv03 :   unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.024/0.056/0.204/0.019
hv03 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.028/0.063/0.209/0.022

hv02:
Code:
root@hv02:~# omping -c 10000 -i 0.001 -F -q hv01 hv02 hv03
hv01 : waiting for response msg
hv03 : waiting for response msg
hv01 : joined (S,G) = (*, 232.43.211.234), pinging
hv03 : waiting for response msg
hv03 : joined (S,G) = (*, 232.43.211.234), pinging
hv01 : given amount of query messages was sent
hv03 : waiting for response msg
hv03 : server told us to stop

hv01 :   unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.026/0.065/1.656/0.045
hv01 : multicast, xmt/rcv/%loss = 10000/9989/0% (seq>=12 0%), min/avg/max/std-dev = 0.025/0.071/0.626/0.043
hv03 :   unicast, xmt/rcv/%loss = 9625/9625/0%, min/avg/max/std-dev = 0.024/0.057/0.154/0.020
hv03 : multicast, xmt/rcv/%loss = 9625/9625/0%, min/avg/max/std-dev = 0.024/0.062/0.193/0.022

hv03:
Code:
root@hv03:~# omping -c 10000 -i 0.001 -F -q hv01 hv02 hv03
hv01 : waiting for response msg
hv02 : waiting for response msg
hv02 : joined (S,G) = (*, 232.43.211.234), pinging
hv01 : joined (S,G) = (*, 232.43.211.234), pinging
hv01 : waiting for response msg
hv01 : server told us to stop
hv02 : given amount of query messages was sent

hv01 :   unicast, xmt/rcv/%loss = 9854/9854/0%, min/avg/max/std-dev = 0.023/0.059/0.186/0.020
hv01 : multicast, xmt/rcv/%loss = 9854/9844/0% (seq>=11 0%), min/avg/max/std-dev = 0.023/0.064/0.192/0.022
hv02 :   unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.023/0.055/0.206/0.019
hv02 : multicast, xmt/rcv/%loss = 10000/9990/0% (seq>=11 0%), min/avg/max/std-dev = 0.026/0.061/0.190/0.021
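The percentage column in the summaries above is rounded, so a handful of lost multicast packets (for example 10000 sent, 9989 received) still shows as 0%. A small sketch that parses the summary lines and reports the absolute number of lost packets may make this easier to spot. The regular expression is written against the output format shown in this post only, and all names here are made up for illustration.

```python
import re
from typing import NamedTuple


class OmpingSummary(NamedTuple):
    host: str
    kind: str  # "unicast" or "multicast"
    sent: int
    received: int

    @property
    def lost(self) -> int:
        return self.sent - self.received


# Matches lines such as:
# hv01 : multicast, xmt/rcv/%loss = 10000/9989/0% (seq>=12 0%), min/avg/...
SUMMARY_RE = re.compile(
    r"^(?P<host>\S+)\s*:\s*(?P<kind>unicast|multicast),\s*"
    r"xmt/rcv/%loss\s*=\s*(?P<sent>\d+)/(?P<rcv>\d+)/"
)


def parse_omping(text):
    """Extract one OmpingSummary per summary line; other lines are ignored."""
    results = []
    for line in text.splitlines():
        match = SUMMARY_RE.match(line.strip())
        if match:
            results.append(
                OmpingSummary(
                    match["host"],
                    match["kind"],
                    int(match["sent"]),
                    int(match["rcv"]),
                )
            )
    return results


def lossy(results):
    """Return only the summaries where at least one packet was lost."""
    return [r for r in results if r.lost > 0]


# Two summary lines copied from the hv02 run above.
sample = """\
hv01 :   unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.026/0.065/1.656/0.045
hv01 : multicast, xmt/rcv/%loss = 10000/9989/0% (seq>=12 0%), min/avg/max/std-dev = 0.025/0.071/0.626/0.043
"""

for r in lossy(parse_omping(sample)):
    print(f"{r.host} {r.kind}: {r.lost} of {r.sent} packets lost")
# prints "hv01 multicast: 11 of 10000 packets lost"
```

The `(seq>=12 0%)` annotation appears to mean that the loss is 0% once the first 11 sequence numbers are ignored, which would place the missing packets at the very start of the run rather than spread across it.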

Can anyone help me understand what the problem is?
 

Attachments

  • syslog.zip
    20.5 KB
OK, did the long tests:

hv01:
Code:
root@hv01:~# omping -c 600 -i 1 -q hv01 hv02 hv03
hv02 : waiting for response msg
hv03 : waiting for response msg
hv02 : waiting for response msg
hv03 : waiting for response msg
hv02 : joined (S,G) = (*, 232.43.211.234), pinging
hv03 : waiting for response msg
hv03 : joined (S,G) = (*, 232.43.211.234), pinging
hv02 : given amount of query messages was sent
hv03 : given amount of query messages was sent

hv02 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.042/0.105/0.189/0.024
hv02 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.089/0.157/0.293/0.033
hv03 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.046/0.131/0.510/0.055
hv03 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.086/0.197/0.557/0.063

hv02:
Code:
root@hv02:~# omping -c 600 -i 1 -q hv01 hv02 hv03
hv01 : waiting for response msg
hv03 : waiting for response msg
hv01 : joined (S,G) = (*, 232.43.211.234), pinging
hv03 : waiting for response msg
hv03 : joined (S,G) = (*, 232.43.211.234), pinging
hv01 : given amount of query messages was sent
hv03 : given amount of query messages was sent

hv01 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.050/0.122/0.211/0.026
hv01 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.096/0.168/0.279/0.033
hv03 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.057/0.099/0.309/0.026
hv03 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.108/0.176/0.561/0.038

hv03:
Code:
root@hv03:~# omping -c 600 -i 1 -q hv01 hv02 hv03
hv01 : waiting for response msg
hv02 : waiting for response msg
hv02 : joined (S,G) = (*, 232.43.211.234), pinging
hv01 : joined (S,G) = (*, 232.43.211.234), pinging
hv01 : given amount of query messages was sent
hv02 : given amount of query messages was sent

hv01 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.054/0.117/1.086/0.080
hv01 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.082/0.143/0.496/0.044
hv02 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.044/0.090/0.178/0.026
hv02 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.062/0.125/0.210/0.030

This looks like it is OK. Any other suggestions (or is it not OK)?
 
Hmm, hang on... I have been away for a couple of days and it looks like the problem has resolved itself. I'm going to monitor this and will report back if it happens again.
 
