Cluster "Flapping"

Ashley

Member
Jun 28, 2016
Hello,

Currently trying to set up a 3-4 node test cluster before expanding into a production-scale cluster.

The cluster network is on a separate VLAN, with internal IPs used only for cluster communication. When first setting up the cluster all seems well: all servers report they are in the cluster and there is a quorum.

Quorum information
------------------
Date: Tue Oct 18 16:18:13 2016
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000002
Ring ID: 1/12
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 172.16.1.1
0x00000002 1 172.16.1.2 (local)
0x00000003 1 172.16.1.250
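As a side note on the numbers above: votequorum requires a strict majority of the expected votes, floor(n/2) + 1, which is why "Quorum: 2" is shown for 3 expected votes, and why a lone node can never be quorate. A minimal sketch of that arithmetic:

```python
# Majority quorum as computed by corosync votequorum: floor(expected_votes / 2) + 1
def quorum(expected_votes: int) -> int:
    return expected_votes // 2 + 1

print(quorum(3))  # 2: two of the three nodes must be in the membership
print(quorum(4))  # 3: an even-sized cluster still needs a strict majority
```

This also explains the "Activity blocked" state in the output further down: a single node has 1 vote but needs 2 to be quorate.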


A few minutes later the cluster will break, and each server will only show itself in the cluster list:

Quorum information
------------------
Date: Tue Oct 18 16:26:30 2016
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000002
Ring ID: 2/108
Quorate: No

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 1
Quorum: 2 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 172.16.1.2 (local)


At this time each server is still able to ping the others via the internal network; I have used ping -f and other tests and they all come back with zero packet loss or issues. I have reinstalled all servers multiple times and every time the same thing happens.

When I log in to the web interface I can normally still view the other nodes in the cluster (even though they show red - offline); however, every so often it fails with a "too many redirects" error.

I was once able to get back into a normal state after rebooting all 3 servers, but a few minutes later the same thing happened. I have struggled to find any log files that would show what is happening.

Are there any ideas, or has anyone seen this before? These 3 servers have just been reinstalled running the latest Proxmox and all updates, so there are no 3rd party applications or configs.

Thanks,
Ashley
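One thing worth flagging here: corosync's default transport relies on multicast, which ping -f (unicast ICMP) does not exercise at all. A sketch of the checks I would run, printed as a checklist rather than executed since they need a live PVE node; the three IPs are taken from the quorum output above, and omping ships with Proxmox VE:

```shell
# Diagnostics checklist for a flapping corosync cluster (sketch, not run here).
NODES="172.16.1.1 172.16.1.2 172.16.1.250"

# 1) A real multicast test: start simultaneously on all nodes and let it run
#    for several minutes; look for multicast (not just unicast) loss.
echo "omping -c 600 -i 1 -q $NODES"

# 2) Membership and quorum as seen by this node:
echo "pvecm status"
```

The long runtime matters: IGMP snooping problems often only bite a few minutes in, when the switch's multicast group membership times out, which would match the "works, then breaks after a few minutes" pattern described above.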
 
Will have a check of that now (Juniper switch gear). If so and I disable it, should the cluster auto-repair itself, or are there steps I need to take (or will a simple reboot of all nodes be enough)?

Thanks,
Ashley
 
I have checked and storm control (Juniper name) is disabled on the switch.

And as before, I am able to complete a ping -f and the other multicast tests without any packet drops. It just seems that randomly they lose contact and only show themselves as a member.
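If multicast does turn out to be unreliable on the switch despite those tests, corosync 2.x (as used by PVE 4.x) can be switched to unicast as a workaround. A hedged sketch of the relevant totem section in /etc/pve/corosync.conf - the cluster_name and config_version values are placeholders, every node must also appear in the nodelist section, and config_version has to be incremented on each edit before restarting corosync cluster-wide:

```
totem {
  version: 2
  config_version: 5            # placeholder: increment the existing value
  secauth: on
  cluster_name: testcluster    # placeholder
  transport: udpu              # unicast UDP instead of the default multicast
  interface {
    ringnumber: 0
    bindnetaddr: 172.16.1.0
  }
}
```

Unicast trades scalability for predictability, so it is usually only recommended for small clusters like this 3-4 node setup.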

What is even weirder is that 90% of the time the web interface continues to work between nodes, and the other 10% of the time it shows a "too many redirects" error.

Are there any logs somewhere that might give an idea of what's happening?
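On the logging question: on PVE 4.x, corosync and the cluster filesystem (pve-cluster/pmxcfs) log to the systemd journal and syslog. A sketch of where I would look, again printed as a checklist since it needs a live node:

```shell
# Where corosync membership changes and retransmit errors show up (sketch).
echo "journalctl -u corosync -u pve-cluster"          # systemd journal for both cluster services
echo "grep -i corosync /var/log/daemon.log"           # syslog copy of the same messages
echo "tail -f /var/log/syslog"                        # watch TOTEM / retransmit lines live during a flap
```

Lines mentioning "TOTEM" membership changes or repeated "Retransmit List" entries around the time of a flap would point at the network layer rather than Proxmox itself.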

Thanks,
Ashley
 
