Cluster Dropped Overnight

Ashley

Member
Jun 28, 2016
I set up a two-node test cluster late yesterday. Overnight the cluster broke, with both nodes unable to see each other (each shows as disconnected in the GUI).

Looking at the syslog I can see the following:

Aug 8 23:50:19 prox corosync[12516]: [TOTEM ] FAILED TO RECEIVE
Aug 8 23:50:20 prox corosync[12516]: [TOTEM ] A new membership (172.16.1.250:12) was formed. Members left: 2
Aug 8 23:50:20 prox corosync[12516]: [TOTEM ] Failed to receive the leave message. failed: 2


And then the following repeated a few times: "This node is within the non-primary component and will NOT provide any services." Since then I have tried rebooting both nodes multiple times; each time corosync comes online and "syncs", but each server only shows one node (itself).

I am using a VLAN / internal network for cluster comms. Both servers can ping each other with <0.1 ms latency, and both have the internal hostname set in /etc/hosts.

The only thing I can think of is that one has an LACP bond while the other is a single NIC. Could the "failed to receive" be where data has gone out through one bond uplink and been received on another, which corosync may not handle correctly?

Apart from that, all services are showing as running and with no errors.
 
I have tried a couple of the commands, but I am getting the following response back:

root@n1:~# omping -c 10000 -i 0.001 -F -q 172.16.1.250
omping: Can't find local address in arguments

I have noticed that even though I have added the internal DNS entries in /etc/hosts, an nslookup fails (as listed on the troubleshooting page), but a ping to the internal hostname resolves and pings fine.

root@n1:~# ping -f prox-internal
PING prox-internal (172.16.1.250) 56(84) bytes of data.
.^C
--- prox-internal ping statistics ---
121432 packets transmitted, 121431 received, 0% packet loss, time 13527ms
rtt min/avg/max/mdev = 0.053/0.094/0.523/0.020 ms, pipe 2, ipg/ewma 0.111/0.090 ms
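(For what it's worth, that nslookup/ping mismatch is expected behaviour: nslookup queries the configured DNS server directly and never reads /etc/hosts, while ping resolves names through NSS, which with the default "hosts: files dns" order in /etc/nsswitch.conf checks /etc/hosts first. A rough sketch of the hosts-file lookup ping is effectively getting, using a scratch file rather than the real /etc/hosts:)

```shell
# Mimic the NSS "files" lookup: the first non-comment line whose second
# field matches the hostname wins (scratch copy, not the real /etc/hosts).
echo "172.16.1.250 prox-internal" > /tmp/hosts_demo
awk '$1 !~ /^#/ && $2 == "prox-internal" { print $1; exit }' /tmp/hosts_demo
# prints: 172.16.1.250
```

(On the real system, `getent hosts prox-internal` exercises the same NSS path as ping and is a better check here than nslookup.)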

I've noticed that in corosync.conf one entry is using the internal hostname, while the other is using the IP address:

ring0_addr: prox-internal

&

ring0_addr: 172.16.1.1

The file is locked, so I'm unable to change them both to IPs to see if that resolves the issue.

Thanks,
Ashley
 
root@n1:~# omping -c 10000 -i 0.001 -F -q 172.16.1.250
omping: Can't find local address in arguments

Here you have only given one IP address, but you need to list the addresses of all nodes, like this:
Code:
omping -c 10000 -i 0.001 -F -q 10.0.0.10 10.0.0.20 10.0.0.30

edit: you also need to execute this command on all nodes simultaneously

I have noticed that even though I have added the internal DNS entries in /etc/hosts, an nslookup fails (as listed on the troubleshooting page), but a ping to the internal hostname resolves and pings fine.

which troubleshooting page are you referring to?
it would be helpful to post your corosync config and the /etc/hosts file
 
Hello,

This page : https://pve.proxmox.com/wiki/Troubleshooting_multicast,_quorum_and_cluster_issues

The command is now running fine:

root@n1:~# omping -c 10000 -i 0.001 -F -q 172.16.1.1 172.16.1.250
172.16.1.250 : waiting for response msg
172.16.1.250 : joined (S,G) = (*, 232.43.211.234), pinging
172.16.1.250 : given amount of query messages was sent

172.16.1.250 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.064/0.139/0.560/0.043
172.16.1.250 : multicast, xmt/rcv/%loss = 10000/9917/0% (seq>=84 0%), min/avg/max/std-dev = 0.066/0.151/0.565/0

root@prox:/var/log# omping -c 10000 -i 0.001 -F -q 172.16.1.1 172.16.1.250
172.16.1.1 : waiting for response msg
172.16.1.1 : waiting for response msg
172.16.1.1 : joined (S,G) = (*, 232.43.211.234), pinging
172.16.1.1 : waiting for response msg
172.16.1.1 : server told us to stop

172.16.1.1 : unicast, xmt/rcv/%loss = 9073/9073/0%, min/avg/max/std-dev = 0.066/0.145/0.520/0.038
172.16.1.1 : multicast, xmt/rcv/%loss = 9073/9073/0%, min/avg/max/std-dev = 0.065/0.143/0.518/0.036
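(A quick sanity check on the n1 numbers above: the raw multicast counters show 10000 sent but 9917 received, and the "(seq>=84 0%)" marker indicates the missing packets were all at the start, before the peer had finished joining the multicast group, so the steady-state loss is effectively zero:)

```python
# Multicast counters from the n1 omping run above.
sent, received = 10000, 9917
missing = sent - received
print(missing)                           # 83 packets, all with seq < 84
print(f"{missing / sent:.2%} raw loss")  # 0.83%, only during the group join
```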

/etc/hosts:

# corosync network hosts

172.16.1.250 prox-internal
172.16.1.1 n1-internal

corosync config file (n1)

logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: prox
    nodeid: 1
    quorum_votes: 1
    ring0_addr: prox-internal
  }

  node {
    name: n1
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 172.16.1.1
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: zxfra
  config_version: 2
  ip_version: ipv4
  secauth: on
  version: 2

  interface {
    bindnetaddr: 172.16.1.250
    ringnumber: 0
  }
}
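(One thing that may be worth checking in the totem section above: bindnetaddr is conventionally the network address of the ring subnet, or the local node's own IP, yet in n1's config it is set to 172.16.1.250, the other node's host address. A minimal sketch, assuming the cluster VLAN is a /24, of how the network address would be derived:)

```python
# Assumed: the cluster network is 172.16.1.0/24. Derive the network
# address that bindnetaddr conventionally refers to from a ring IP.
import ipaddress

ring_ip = ipaddress.ip_interface("172.16.1.250/24")
print(ring_ip.network.network_address)  # 172.16.1.0
```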

Thanks,
Ashley
 
/etc/hosts:

# corosync network hosts

172.16.1.250 prox-internal
172.16.1.1 n1-internal

is this the whole /etc/hosts file ?

the rest looks OK (I don't think it matters that one host is in corosync.conf with the IP and the other with the hostname, as long as you can reach all of them)

oh, also: can you post the corosync.conf from the other node?
 
