Cluster "Flapping"

Ashley

Member
Jun 28, 2016
Hello,

Currently trying to set up a 3-4 node test cluster before expanding into a production-scale cluster.

The cluster network is on a separate VLAN, with internal IPs used only for cluster communication. When first setting up the cluster all seems well: all servers report they are in the cluster and there is a quorum.

Quorum information
------------------
Date: Tue Oct 18 16:18:13 2016
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000002
Ring ID: 1/12
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 172.16.1.1
0x00000002 1 172.16.1.2 (local)
0x00000003 1 172.16.1.250
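As a side note on the numbers above: votequorum requires a strict majority of the expected votes, floor(n/2) + 1, which is why "Quorum: 2" is shown for 3 expected votes, and why a lone node can never be quorate. A minimal sketch of that arithmetic:

```python
# Majority quorum as computed by corosync votequorum: floor(expected_votes / 2) + 1
def quorum(expected_votes: int) -> int:
    return expected_votes // 2 + 1

print(quorum(3))  # 2: two of the three nodes must be in the membership
print(quorum(4))  # 3: an even-sized cluster still needs a strict majority
```

This also explains the "Activity blocked" state in the output further down: a single node has 1 vote but needs 2 to be quorate.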


A few minutes later the cluster will break, and each server will only show itself in the cluster list:

Quorum information
------------------
Date: Tue Oct 18 16:26:30 2016
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000002
Ring ID: 2/108
Quorate: No

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 1
Quorum: 2 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 172.16.1.2 (local)


At this time each server is still able to ping the others via the internal network; I have used ping -f and other tests and they all come back with zero packet loss or issues. I have reinstalled all servers multiple times and every time the same thing happens.

When I log in to the web interface I can normally still view the other nodes in the cluster (even though they show red - offline); however, every so often it fails with a "too many redirects" error.

I was once able to get back into a normal state after rebooting all 3 servers, but a few minutes later the same thing happened. I have struggled to find any log files that would show what is happening.

Are there any ideas, or has anyone seen this before? These 3 servers have just been reinstalled running the latest Proxmox and all updates, so there are no 3rd party applications or configs.

Thanks,
Ashley
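One thing worth flagging here: corosync's default transport relies on multicast, which ping -f (unicast ICMP) does not exercise at all. A sketch of the checks I would run, printed as a checklist rather than executed since they need a live PVE node; the three IPs are taken from the quorum output above, and omping ships with Proxmox VE:

```shell
# Diagnostics checklist for a flapping corosync cluster (sketch, not run here).
NODES="172.16.1.1 172.16.1.2 172.16.1.250"

# 1) A real multicast test: start simultaneously on all nodes and let it run
#    for several minutes; look for multicast (not just unicast) loss.
echo "omping -c 600 -i 1 -q $NODES"

# 2) Membership and quorum as seen by this node:
echo "pvecm status"
```

The long runtime matters: IGMP snooping problems often only bite a few minutes in, when the switch's multicast group membership times out, which would match the "works, then breaks after a few minutes" pattern described above.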
 
Will have a check of that now (Juniper switch gear). If so and I disable it, should the cluster auto-repair itself, or are there steps I need to take (or will a simple reboot of all nodes be enough)?

Thanks,
Ashley
 
I have checked and storm control (Juniper name) is disabled on the switch.

And as before, I am able to complete a ping -f and the other multicast tests without any packet drops. It just seems that randomly they lose contact and only show themselves as a member.
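If multicast does turn out to be unreliable on the switch despite those tests, corosync 2.x (as used by PVE 4.x) can be switched to unicast as a workaround. A hedged sketch of the relevant totem section in /etc/pve/corosync.conf - the cluster_name and config_version values are placeholders, every node must also appear in the nodelist section, and config_version has to be incremented on each edit before restarting corosync cluster-wide:

```
totem {
  version: 2
  config_version: 5            # placeholder: increment the existing value
  secauth: on
  cluster_name: testcluster    # placeholder
  transport: udpu              # unicast UDP instead of the default multicast
  interface {
    ringnumber: 0
    bindnetaddr: 172.16.1.0
  }
}
```

Unicast trades scalability for predictability, so it is usually only recommended for small clusters like this 3-4 node setup.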

What is even weirder is that 90% of the time the web interface continues to work between nodes, and the other 10% of the time it shows a "too many redirects" error.

Are there any logs somewhere that might give an idea of what's happening?
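On the logging question: on PVE 4.x, corosync and the cluster filesystem (pve-cluster/pmxcfs) log to the systemd journal and syslog. A sketch of where I would look, again printed as a checklist since it needs a live node:

```shell
# Where corosync membership changes and retransmit errors show up (sketch).
echo "journalctl -u corosync -u pve-cluster"          # systemd journal for both cluster services
echo "grep -i corosync /var/log/daemon.log"           # syslog copy of the same messages
echo "tail -f /var/log/syslog"                        # watch TOTEM / retransmit lines live during a flap
```

Lines mentioning "TOTEM" membership changes or repeated "Retransmit List" entries around the time of a flap would point at the network layer rather than Proxmox itself.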

Thanks,
Ashley
 
