Cluster Dropped Overnight

Ashley

Member
Jun 28, 2016
I set up a two-node test cluster late yesterday. Overnight the cluster broke, with both nodes unable to see each other (each shows as disconnected in the GUI).

Looking at the syslog I can see the following:

Aug 8 23:50:19 prox corosync[12516]: [TOTEM ] FAILED TO RECEIVE
Aug 8 23:50:20 prox corosync[12516]: [TOTEM ] A new membership (172.16.1.250:12) was formed. Members left: 2
Aug 8 23:50:20 prox corosync[12516]: [TOTEM ] Failed to receive the leave message. failed: 2


And then the following repeated a few times: "This node is within the non-primary component and will NOT provide any services." Since then I have tried rebooting both nodes multiple times; each time corosync comes online and "syncs", but each server only shows one node (itself).

I am using a VLAN / internal network for cluster comms. Both servers can ping each other with <0.1 ms latency, and both have the internal hostname set in /etc/hosts.

The only thing I can think of is that one has an LACP bond while the other is a single NIC. Could the "failed to receive" be where data has gone out through one bond uplink and been received on another, which corosync may not handle correctly?

Apart from that, all services are showing as running and with no errors.
 
I have tried a couple of the commands, but I am getting the following response back:

root@n1:~# omping -c 10000 -i 0.001 -F -q 172.16.1.250
omping: Can't find local address in arguments

I have noticed that even though I have added the internal DNS entries in /etc/hosts, an nslookup fails (as listed on the troubleshooting page), but a ping to the internal hostname resolves and pings fine.

root@n1:~# ping -f prox-internal
PING prox-internal (172.16.1.250) 56(84) bytes of data.
.^C
--- prox-internal ping statistics ---
121432 packets transmitted, 121431 received, 0% packet loss, time 13527ms
rtt min/avg/max/mdev = 0.053/0.094/0.523/0.020 ms, pipe 2, ipg/ewma 0.111/0.090 ms
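(For what it's worth, that nslookup/ping mismatch is expected behaviour: nslookup queries the configured DNS server directly and never reads /etc/hosts, while ping resolves names through NSS, which with the default "hosts: files dns" order in /etc/nsswitch.conf checks /etc/hosts first. A rough sketch of the hosts-file lookup ping is effectively getting, using a scratch file rather than the real /etc/hosts:)

```shell
# Mimic the NSS "files" lookup: the first non-comment line whose second
# field matches the hostname wins (scratch copy, not the real /etc/hosts).
echo "172.16.1.250 prox-internal" > /tmp/hosts_demo
awk '$1 !~ /^#/ && $2 == "prox-internal" { print $1; exit }' /tmp/hosts_demo
# prints: 172.16.1.250
```

(On the real system, `getent hosts prox-internal` exercises the same NSS path as ping and is a better check here than nslookup.)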

I've noticed that in corosync.conf one entry is using the internal hostname, while the other is using the IP address:

ring0_addr: prox-internal

&

ring0_addr: 172.16.1.1

The file is locked, so I'm unable to change them both to IPs to see if that resolves the issue.

Thanks,
Ashley
 
root@n1:~# omping -c 10000 -i 0.001 -F -q 172.16.1.250
omping: Can't find local address in arguments

Here you have only given one IP address, but you need to list the addresses of all nodes, like this:
Code:
omping -c 10000 -i 0.001 -F -q 10.0.0.10 10.0.0.20 10.0.0.30

edit: you also need to execute this command on all nodes simultaneously

I have noticed that even though I have added the internal DNS entries in /etc/hosts, an nslookup fails (as listed on the troubleshooting page), but a ping to the internal hostname resolves and pings fine.

which troubleshooting page are you referring to?
it would be helpful to post your corosync config and the /etc/hosts file
 
Hello,

This page : https://pve.proxmox.com/wiki/Troubleshooting_multicast,_quorum_and_cluster_issues

The command is now running fine:

root@n1:~# omping -c 10000 -i 0.001 -F -q 172.16.1.1 172.16.1.250
172.16.1.250 : waiting for response msg
172.16.1.250 : joined (S,G) = (*, 232.43.211.234), pinging
172.16.1.250 : given amount of query messages was sent

172.16.1.250 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.064/0.139/0.560/0.043
172.16.1.250 : multicast, xmt/rcv/%loss = 10000/9917/0% (seq>=84 0%), min/avg/max/std-dev = 0.066/0.151/0.565/0

root@prox:/var/log# omping -c 10000 -i 0.001 -F -q 172.16.1.1 172.16.1.250
172.16.1.1 : waiting for response msg
172.16.1.1 : waiting for response msg
172.16.1.1 : joined (S,G) = (*, 232.43.211.234), pinging
172.16.1.1 : waiting for response msg
172.16.1.1 : server told us to stop

172.16.1.1 : unicast, xmt/rcv/%loss = 9073/9073/0%, min/avg/max/std-dev = 0.066/0.145/0.520/0.038
172.16.1.1 : multicast, xmt/rcv/%loss = 9073/9073/0%, min/avg/max/std-dev = 0.065/0.143/0.518/0.036
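(A quick sanity check on the n1 numbers above: the raw multicast counters show 10000 sent but 9917 received, and the "(seq>=84 0%)" marker indicates the missing packets were all at the start, before the peer had finished joining the multicast group, so the steady-state loss is effectively zero:)

```python
# Multicast counters from the n1 omping run above.
sent, received = 10000, 9917
missing = sent - received
print(missing)                           # 83 packets, all with seq < 84
print(f"{missing / sent:.2%} raw loss")  # 0.83%, only during the group join
```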

/etc/hosts:

# corosync network hosts

172.16.1.250 prox-internal
172.16.1.1 n1-internal

corosync config file (n1)

logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: prox
    nodeid: 1
    quorum_votes: 1
    ring0_addr: prox-internal
  }

  node {
    name: n1
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 172.16.1.1
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: zxfra
  config_version: 2
  ip_version: ipv4
  secauth: on
  version: 2

  interface {
    bindnetaddr: 172.16.1.250
    ringnumber: 0
  }
}
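(One thing that may be worth checking in the totem section above: bindnetaddr is conventionally the network address of the ring subnet, or the local node's own IP, yet in n1's config it is set to 172.16.1.250, the other node's host address. A minimal sketch, assuming the cluster VLAN is a /24, of how the network address would be derived:)

```python
# Assumed: the cluster network is 172.16.1.0/24. Derive the network
# address that bindnetaddr conventionally refers to from a ring IP.
import ipaddress

ring_ip = ipaddress.ip_interface("172.16.1.250/24")
print(ring_ip.network.network_address)  # 172.16.1.0
```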

Thanks,
Ashley
 
/etc/hosts:

# corosync network hosts

172.16.1.250 prox-internal
172.16.1.1 n1-internal

is this the whole /etc/hosts file ?

the rest looks OK (I don't think it matters that one host is in corosync.conf with the IP and the other with the hostname, as long as you can reach all of them)

oh, also: can you post the corosync.conf from the other node?
 
