"Rejoining" a node to a cluster after hardware issue.

clusterfunned

New Member
May 26, 2023
Hello,

I inherited a 4-node Proxmox VE 3.4 cluster (node names: A, B, C, D). I know it's very old; we will be working on a migration soon.

A power outage fried Node B's motherboard, so when power came back I restored its guests to the other nodes and changed their VMIDs from 1## to 2## (101 to 201, etc.), because I feared the old numbers would conflict when Node B rejoined. Nodes A, C, and D were otherwise unaffected.
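
For context, the restores onto the other nodes looked roughly like this (the archive paths below are placeholders, not my real backup filenames):

# restore a KVM guest's vzdump backup under a new 2## VMID
qmrestore /mnt/backup/vzdump-qemu-101.vma.lzo 201

# restore an OpenVZ container's vzdump backup under a new 2## VMID
vzrestore /mnt/backup/vzdump-openvz-102.tar.gz 202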

The GUI still shows all the 1## guests, greyed out, under the (absent, red-dotted) Node B.

I have fixed Node B and deleted all the local guests from it. It is otherwise exactly as it was before the outage.

My question is, I hope, fairly simple: can I just turn Node B back on and have it rejoin happily, or will the rest of the cluster have a problem because it thinks a bunch of 1##-numbered guests still live on B? I've been reading a little about "split brain" and am worried this might somehow break the cluster. Since the cluster hosts production VMs, any downtime would be a Very Bad Thing.
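
Before powering B back on, I plan to confirm on one of the surviving nodes that the remaining three still have quorum; as far as I understand, these are the standard commands for that:

pvecm status   # shows quorum state and vote counts for the cluster
pvecm nodes    # lists cluster members and which are currently online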

Any help is appreciated. This is not urgent, but we are running into some capacity issues so getting Node B back in play would be very helpful.

Thank you!
 
Some more information:

I played around with this some more and found that if I go onto Node A, C, or D and move the /etc/pve/nodes/B/openvz/*.conf files to *.bak, the corresponding guests disappear from the UI. Is this the "database" where consensus is formed about who actually runs what, or is that information stored elsewhere? Intuitively, I feel like removing all the *.conf files from /etc/pve/nodes/B/openvz and then starting Node B up (not "rejoining" it; I now understand the node was never deleted, so my terminology was incorrect) after making sure its /etc/pve/openvz/ is clean should let it slide back into the cluster and resume operation.
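
In case the exact steps matter, what I did was along these lines (run on Node A; since /etc/pve is the cluster-wide pmxcfs mount, my understanding is the change propagates to C and D as well):

cd /etc/pve/nodes/B/openvz
# rename each stale 1## container config so the GUI stops listing it
for f in *.conf; do mv "$f" "$f.bak"; done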

Am I close on this, or am I missing some other piece of back-end state that will cause problems because I changed Node B's guest configuration (deleted the guests) while it was offline and not part of the cluster?

I have a bad feeling that if I had just left the old 1## containers on Node B and started it back up, it would have synced perfectly, since there would have been no ID collisions; I could then have deleted the 1## containers on Node B (which had already been restored to the other nodes under 2## IDs) easily and without mishap.
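
If I understand the tooling right, cleaning up those duplicates afterwards would have been something like this on Node B (assuming plain OpenVZ commands still apply on v3.4):

vzctl stop 101      # stop the stale container if it is running
vzctl destroy 101   # remove the container's private area and retire its config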
 
