Adding node to cluster of 3 broke the cluster

drjaymz@

Member
Jan 19, 2022
Early days, can't make heads or tails of it, but we have a cluster of 3 nodes with HA, replication happily chugging along. When we came to add a 3rd node it basically looked like all hell broke loose: all 3 nodes restarted with red X's on them, and we couldn't get in on the web interface. All VMs were abruptly stopped, causing corruption and general chaos.
The 3rd node was at a satellite site; the main and satellite sites are part of the same organisational unit, so we wanted to add it so that the whole unit could be administered and backed up the same way. That remote site has a different subnet and a small amount of latency.

Problem seemed to be identical to the following post:
https://forum.proxmox.com/threads/cluster-issues-after-adding-node.89091/

That post doesn't really go anywhere: it says the cluster broke, something went wrong, and then they removed and re-added the node and it worked.
Well, we removed it and the original 3 are now happy again, and I don't fancy re-adding it until I know what happened.
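For reference, the standard way to drop a departed node from the remaining members is roughly this (the node name is just a placeholder):

Code:
# run on one of the remaining, quorate nodes
pvecm nodes                   # check the current member list and the node's name
pvecm delnode satellite-node  # remove the departed node from the cluster configuration

The documentation recommends not re-joining the removed node with its old state; reinstall it (or at least clear its cluster config) before adding it back.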

But... what happened?

The internet thinks it's a corosync confusion issue, but by now that shouldn't be causing the entire thing to explode. We stopped the cluster service, then corosync, then forced pmxcfs into local mode.
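For anyone who hasn't done it, "forced to local mode" means roughly the documented recovery sequence:

Code:
# on the affected node
systemctl stop pve-cluster corosync   # stop the cluster filesystem and corosync
pmxcfs -l                             # restart pmxcfs in local mode so /etc/pve is writable
# inspect/fix /etc/pve/corosync.conf, then restart the services normally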

We were lucky that we could still get onto one of the nodes, because the site is a good distance away. It makes sense to me that we should be able to add all parts of an OU to the same cluster and manage them under one pane of glass; you can group servers and have replication within the nodes that are geographically related.

Where's the best place to look for a forensic breakdown of what actually happened?
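So far the obvious starting points seem to be the corosync and pve-cluster journals on each node, around the time the new node was joined, e.g.:

Code:
# time window is a placeholder
journalctl -u corosync -u pve-cluster --since "2 hours ago"
grep -iE 'quorum|fence|watchdog' /var/log/syslog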
 
Hi,
we have a cluster of 3 nodes with HA,

When we came to add a 3rd node

Do you mean add a 4th node?


Also:

- Don't do a single cluster with HA across only 2 different sites. If you have a split brain / network cut, HA will restart the nodes on the site without quorum (the one with fewer nodes). You need 3 different sites if you want to create a single cluster.

(And if you have the same number of nodes on both sites, all the nodes will reboot in case of a network cut; see the quorum arithmetic below.)
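To spell out the arithmetic (the node counts are just the example being discussed):

Code:
# quorum = floor(total_votes / 2) + 1
# 3 nodes at site A + 1 node at site B -> 4 votes, quorum = 3
#   link cut: site A keeps 3 votes (quorate), site B has 1 vote and fences itself
# 2 nodes at each site                 -> 4 votes, quorum = 3
#   link cut: neither side reaches 3 votes, so both sides fence
pvecm status   # shows Expected votes / Quorum on a running cluster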


That remote site has a different subnet and a small amount of latency.
- How much latency? Corosync is very sensitive to latency (3-5 ms max, 10 ms with some tuning).
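A quick way to measure it (the address is a placeholder for the remote node):

Code:
ping -c 20 10.0.1.10    # round-trip times should stay in the low single-digit ms range
corosync-cfgtool -s     # once joined, shows the corosync link status per ring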
 
Thanks, I understand quorum.

What I was trying to do was have HA configured between the 3 nodes (on the same site) using the pool grouping, and be able to manage a 4th node (whose VMs don't need HA) via the same interface, so that I have a single pane of glass for the same organisational unit. Think of it like a remote site with two buildings on either side of a road that we just treat as one.
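If I do re-add the remote node, my understanding is that a restricted HA group would at least keep HA resources pinned to the three local nodes (group and node names are placeholders):

Code:
ha-manager groupadd local-site --nodes nodeA,nodeB,nodeC --restricted 1
ha-manager add vm:100 --group local-site   # this VM will only ever be recovered onto local-site nodes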

Ideally I'd like to have a single pane and see all sites, but that's not how it's been set up.
 
I had a bad experience when I tried an HA configuration, so I stopped using it. If I am not mistaken, if a node drops out of quorum even for a short while it will fence itself, which means it reboots or shuts down (and of course that brings down all its VMs). I think that pretty much excludes the remote-site use case and makes HA of very limited use...
Maybe when you were adding the node to the cluster there was some kind of temporary loss of quorum which triggered the fencing...
 
