Adding node to cluster of 3 broke the cluster

drjaymz@ · Jun 26, 2023

Early days, and I can't make heads or tails of it, but we have a cluster of 3 nodes with HA, replication happily chugging along. When we came to add a 3rd node all hell broke loose: all 3 nodes restarted, there were red X's on the nodes, and we couldn't get in on the web interface. All VMs were abruptly stopped, causing corruption and general chaos.
The new node was at a satellite site. The main and satellite sites are the same organisational unit, so we wanted to add it so that the whole unit could be administered and backed up the same way. That remote site has a different subnet and a small amount of latency.

Problem seemed to be identical to the following post:
https://forum.proxmox.com/threads/cluster-issues-after-adding-node.89091/

That post doesn't really go anywhere: it says adding the node broke the cluster and something went wrong, then they removed it, re-added it, and it worked.
Well, we removed it and the original 3 are now happy again, and I don't fancy re-adding it until I know what happened.

But... what happened?

The internet thinks it's a corosync confusion issue, but by now that shouldn't be causing the entire thing to explode. To recover, we stopped the cluster service, then corosync, then forced local mode.

We were lucky that we could still get onto one of the nodes, because the site is a good distance away. It makes sense to me that we can add all parts of an OU to the same cluster and manage them under one pane of glass; you can group servers and have replication within the nodes that are geographically related.

Where's the best place to look for a forensic breakdown of what actually happened?

spirit · Jun 26, 2023

Hi,

"we have a cluster of 3 nodes with HA,"

"When we came to add a 3rd node"

Do you mean add a 4th node?

Also:

- Don't run a single cluster with HA across only 2 sites. If you have a split brain/network cut, HA will restart the nodes on the site without quorum (the one with fewer nodes). You need 3 different sites if you want to create a single cluster.

(And if you have the same number of nodes on both sites, all the nodes will reboot in case of a network cut.)
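
To make the vote arithmetic concrete, here is a rough Python sketch. It assumes one vote per node and a simple majority rule (more than half of all votes), and the helper names `quorum_threshold` and `quorate_sites` are made up for illustration; it is not how corosync itself is implemented.

```python
from __future__ import annotations


def quorum_threshold(total_votes: int) -> int:
    """Smallest number of votes that is a strict majority (assumed rule)."""
    return total_votes // 2 + 1


def quorate_sites(votes_per_site: dict[str, int]) -> dict[str, bool]:
    """For a network cut that isolates every site from the others,
    report which sites would still hold a majority of the votes."""
    total = sum(votes_per_site.values())
    needed = quorum_threshold(total)
    return {site: votes >= needed for site, votes in votes_per_site.items()}


if __name__ == "__main__":
    # 3 nodes at the main site plus 1 at the satellite: only the main site survives.
    print(quorate_sites({"main": 3, "satellite": 1}))
    # Same number of nodes on both sites: neither side has a majority.
    print(quorate_sites({"main": 2, "satellite": 2}))
```

With 3 + 1 nodes the threshold is 3 votes, so the satellite node alone can never be quorate; with 2 + 2 neither side reaches 3, which is the "all the nodes will reboot" case.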

"That remote site has a different subnet and a small amount of latency."

- How much latency? Corosync is very sensitive to latency (3-5 ms max, 10 ms with some tuning).
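
If you have some round-trip times measured between the sites (from ping, for example), a small sketch like this can compare them against the figures above. The 5 ms and 10 ms defaults are just the rule-of-thumb numbers from this post, not official limits, and `classify_latency` and `worst_case` are hypothetical helper names.

```python
def classify_latency(rtt_ms: float,
                     untuned_max_ms: float = 5.0,
                     tuned_max_ms: float = 10.0) -> str:
    """Bucket a round-trip time using the rule-of-thumb thresholds (assumed)."""
    if rtt_ms < 0:
        raise ValueError("round-trip time cannot be negative")
    if rtt_ms <= untuned_max_ms:
        return "ok"
    if rtt_ms <= tuned_max_ms:
        return "needs tuning"
    return "too high"


def worst_case(samples_ms: list[float]) -> str:
    """Classify a link by its slowest sample, since spikes are what hurt."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    return classify_latency(max(samples_ms))


if __name__ == "__main__":
    # Example samples in milliseconds; replace with your own measurements.
    print(worst_case([1.2, 4.0, 12.5]))
```

Judging by the slowest sample rather than the average is a deliberate choice here: a link that is usually fast but occasionally spikes can still upset the cluster.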

drjaymz@ · Jun 29, 2023

Thanks, I understand quorum.

What I was trying to do was have HA configured between the 3 nodes (on the same site) using the pool grouping, and be able to manage a 4th node (whose VMs don't need HA) via the same interface, so that I have a single pane of glass to work with for the same organisational unit. Think of it as a remote site with two buildings either side of a road that we just treat as one.

Ideally I'd like to have a single pane and see all sites, but that's not how it's been set up.

mfed · Jun 29, 2023

I had a bad experience when I tried the HA configuration, so I stopped using it. If I am not mistaken, if a node goes out of quorum even for a short while it should fence itself off, which means it needs to shut down or something like that (and of course that brings down all the VMs). I think that pretty much excludes the use case for a remote site, and it makes it of very limited use...
Maybe when you were adding the node to the cluster there was some kind of temporary loss of quorum, which triggered the fencing...
