I set up a two-node test cluster late yesterday. Overnight the cluster broke: neither node can see the other within the cluster (each shows as disconnected in the GUI).
Looking at the syslog I can see the following:
Aug 8 23:50:19 prox corosync[12516]: [TOTEM ] FAILED TO RECEIVE
Aug 8 23:50:20 prox corosync[12516]: [TOTEM ] A new membership (172.16.1.250:12) was formed. Members left: 2
Aug 8 23:50:20 prox corosync[12516]: [TOTEM ] Failed to receive the leave message. failed: 2
This was followed by the message "This node is within the non-primary component and will NOT provide any services." repeated a few times. Since then I have rebooted both nodes multiple times, but each time corosync comes online and "syncs", and each server still only shows one node (itself).
I am using a VLAN / internal network for cluster communication. Both servers can ping each other with >0.1 ms ping times, and both have the internal hostname set in /etc/hosts.
The only thing I can think of is that one node has an LACP bond while the other has a single NIC. Could the "FAILED TO RECEIVE" be a case where data went out through one bond uplink and was received on another, which corosync may not handle correctly?
Apart from that, all services are showing as running with no errors.
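In case it helps anyone compare against their own logs, here is a rough sketch of how the TOTEM lines above could be pulled out of syslog and checked for nodes leaving. This is only an illustration: the function name `totem_events` and the regular expressions are my own assumptions based on the three lines quoted above, not anything provided by corosync, and other log formats may not match.

```python
import re

# Assumed syslog layout, taken from the lines quoted in this post:
#   "Aug 8 23:50:19 prox corosync[12516]: [TOTEM ] FAILED TO RECEIVE"
TOTEM_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+"
    r"corosync\[\d+\]:\s+\[TOTEM\s*\]\s+(?P<msg>.*)$"
)
# Picks the node IDs out of "Members left: 2" (space-separated if several).
LEFT_RE = re.compile(r"Members left:\s*(?P<ids>[\d ]+)")


def totem_events(lines):
    """Return (timestamp, host, message, left_node_ids) for each TOTEM line."""
    events = []
    for line in lines:
        m = TOTEM_RE.match(line.strip())
        if not m:
            continue  # not a corosync TOTEM line in the assumed format
        left = LEFT_RE.search(m.group("msg"))
        left_ids = [int(n) for n in left.group("ids").split()] if left else []
        events.append((m.group("ts"), m.group("host"), m.group("msg"), left_ids))
    return events


# The three lines from my syslog, used as sample input.
SAMPLE = [
    "Aug 8 23:50:19 prox corosync[12516]: [TOTEM ] FAILED TO RECEIVE",
    "Aug 8 23:50:20 prox corosync[12516]: [TOTEM ] A new membership "
    "(172.16.1.250:12) was formed. Members left: 2",
    "Aug 8 23:50:20 prox corosync[12516]: [TOTEM ] Failed to receive "
    "the leave message. failed: 2",
]

if __name__ == "__main__":
    for ts, host, msg, left_ids in totem_events(SAMPLE):
        note = f"  <-- nodes left: {left_ids}" if left_ids else ""
        print(f"{ts} {host}: {msg}{note}")
```

To run it against a real log, the sample list could be swapped for the lines of the syslog file (for example by opening the file and passing the file object to `totem_events`).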