Failed to Receive

Downeast Tech

Sep 20, 2016
I am curious about my TOTEM errors. corosync spits out the Retransmit List error multiple times a day, and after each one TOTEM reports FAILED TO RECEIVE. After that it looks like a new membership is formed for the cluster, and after a few seconds it seems to correct itself. Is this normal behavior, or is there something like packet loss that needs to be addressed? Thanks for the help.

Also, how long does a properly working cluster usually run before any server maintenance is needed? Are regular restarts something that needs to be scheduled?
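
In case it is useful, this is roughly how I am pulling those messages out of the logs to see how often it happens (a minimal sketch, assuming corosync messages end up in /var/log/syslog as on a default install):

# count the TOTEM retransmit / receive failures, then show the most recent ones
grep -Ec "Retransmit List|FAILED TO RECEIVE" /var/log/syslog
grep -E "Retransmit List|FAILED TO RECEIVE" /var/log/syslog | tail -n 20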
 
This happens if you have network problems.
Test your network.

You have to execute this command on all cluster nodes:

omping -c 10000 -i 0.001 -F -q node1 node2 node3
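
If you do not want to juggle a terminal per node, a rough sketch like the following starts the test on the other nodes over SSH and then runs it locally (node names are placeholders, and it assumes passwordless root SSH between the nodes):

# start omping on the remote nodes in the background, then run it locally
for n in node2 node3; do
    ssh root@$n 'omping -c 10000 -i 0.001 -F -q node1 node2 node3' &
done
omping -c 10000 -i 0.001 -F -q node1 node2 node3
wait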
 
SERVER1
root@server1:~# omping -c 10000 -i 0.001 -F -q server1 server2 server3
server2 : waiting for response msg
server3 : waiting for response msg
server2 : waiting for response msg
server3 : waiting for response msg
server2 : waiting for response msg
server3 : waiting for response msg
server2 : joined (S,G) = (*, 232.43.211.234), pinging
server3 : waiting for response msg
server3 : joined (S,G) = (*, 232.43.211.234), pinging
server2 : given amount of query messages was sent
server3 : waiting for response msg
server3 : server told us to stop

server2 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.031/0.061/0.168/0.020
server2 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.033/0.065/0.170/0.022
server3 : unicast, xmt/rcv/%loss = 9935/9935/0%, min/avg/max/std-dev = 0.029/0.068/1.814/0.030
server3 : multicast, xmt/rcv/%loss = 9935/9935/0%, min/avg/max/std-dev = 0.034/0.072/1.815/0.032
root@server1:~#

SERVER2
root@server2:~# omping -c 10000 -i 0.001 -F -q server1 server2 server3
server1 : waiting for response msg
server3 : waiting for response msg
server1 : joined (S,G) = (*, 232.43.211.234), pinging
server3 : waiting for response msg
server3 : waiting for response msg
server3 : joined (S,G) = (*, 232.43.211.234), pinging
server1 : given amount of query messages was sent
server3 : waiting for response msg
server3 : server told us to stop

server1 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.030/0.054/0.160/0.015
server1 : multicast, xmt/rcv/%loss = 10000/9992/0% (seq>=9 0%), min/avg/max/std-dev = 0.031/0.060/0.176/0.017
server3 : unicast, xmt/rcv/%loss = 9395/9395/0%, min/avg/max/std-dev = 0.030/0.067/1.814/0.031
server3 : multicast, xmt/rcv/%loss = 9395/9395/0%, min/avg/max/std-dev = 0.034/0.071/1.857/0.037
root@server2:~#

SERVER3
root@server3:~# omping -c 10000 -i 0.001 -F -q server1 server2 server3
server1 : waiting for response msg
server2 : waiting for response msg
server2 : joined (S,G) = (*, 232.43.211.234), pinging
server1 : joined (S,G) = (*, 232.43.211.234), pinging
server1 : given amount of query messages was sent
server2 : given amount of query messages was sent

server1 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.031/0.057/0.154/0.016
server1 : multicast, xmt/rcv/%loss = 10000/9992/0% (seq>=9 0%), min/avg/max/std-dev = 0.032/0.064/0.176/0.019
server2 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.030/0.061/0.166/0.019
server2 : multicast, xmt/rcv/%loss = 10000/9992/0% (seq>=9 0%), min/avg/max/std-dev = 0.034/0.066/0.179/0.020
root@server3:~#

I had tried the omping -c 600 test from the multicast troubleshooting notes and it went through fine; I never saw any "server told us to stop" message. I ran this omping -c 10000 again, and that time the stop message was sent from server2. Is this something that jumbo frames might help with? Thanks for the help.
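
If I do end up testing jumbo frames, my understanding is the change would look roughly like this in /etc/network/interfaces (a sketch only: the interface names and addresses are examples, the switch and every node would all have to support MTU 9000, and if the goal is for corosync itself to send bigger packets the totem netmtu option would also have to change):

auto eth0
iface eth0 inet manual
    # raise the MTU on the physical port (example interface name)
    post-up ip link set dev eth0 mtu 9000

auto vmbr0
iface vmbr0 inet static
    address 192.168.1.11
    netmask 255.255.255.0
    gateway 192.168.1.1
    bridge_ports eth0
    bridge_stp off
    bridge_fd 0
    # and on the bridge carrying the cluster traffic
    mtu 9000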
 
Also, I just upgraded to 4.3 hoping these errors would go away. Before the upgrade these omping commands worked with almost no loss; now they don't work at all.
 
Did some switch modification and ran omping again.

SERVER1
server2 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.096/0.195/0.256/0.027
server2 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.100/0.200/0.258/0.027
server3 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.084/0.182/0.252/0.027
server3 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.089/0.187/0.251/0.026

SERVER2

server1 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.088/0.201/0.261/0.024
server1 : multicast, xmt/rcv/%loss = 600/599/0% (seq>=2 0%), min/avg/max/std-dev = 0.091/0.205/0.263/0.024
server3 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.089/0.186/0.235/0.022
server3 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.090/0.192/0.251/0.022

SERVER3

server1 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.093/0.188/0.251/0.028
server1 : multicast, xmt/rcv/%loss = 600/598/0% (seq>=2 0%), min/avg/max/std-dev = 0.097/0.193/0.255/0.029
server2 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.089/0.181/0.249/0.028
server2 : multicast, xmt/rcv/%loss = 600/598/0% (seq>=2 0%), min/avg/max/std-dev = 0.094/0.185/0.249/0.027

This is what I was getting before the upgrade, so I am not sure what changed. I am still getting Retransmit List errors and what looks like a new cluster membership being formed. Any other ideas? Thanks.
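
One thing I may try next is loosening the totem timing in the corosync config (on PVE 4.x the cluster-wide file should be /etc/pve/corosync.conf). This is only a sketch of the kind of change, and the values are guesses rather than something I have validated:

totem {
  # ... keep the existing settings (cluster_name, interface, etc.) ...
  # config_version has to be incremented whenever this file is edited

  # give the token more time on a flaky network (default is 1000 ms)
  token: 3000
  # allow more retransmits before the token is declared lost (default is 4)
  token_retransmits_before_loss_const: 10
}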
 
I have tried new cables, different NICs, and a different switch. I am still getting Retransmit List errors. I am at a loss on what to do next. Does anyone have any ideas? Thanks.
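
One more thing that might be worth ruling out is IGMP snooping on the switch without an IGMP querier on the segment, which is a common cause of intermittent multicast trouble. A quick, non-persistent test, assuming the cluster traffic runs over a Linux bridge called vmbr0, is to let the bridge act as the querier itself:

# check the current bridge multicast settings (0 = off, 1 = on)
cat /sys/class/net/vmbr0/bridge/multicast_snooping
cat /sys/class/net/vmbr0/bridge/multicast_querier
# temporarily make the bridge send IGMP queries
echo 1 > /sys/class/net/vmbr0/bridge/multicast_querier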
 
