"Rejoining" a node to a cluster after hardware issue.

clusterfunned

Hello,

I inherited a 4-node Proxmox VE 3.4 cluster (nodes: A, B, C, D). I know it's very old; we will be working on a migration soon.

A power outage fried Node B's motherboard, so when power came back I restored its guests to the other nodes and changed their ID numbers from 1## to 2## (101 to 201, etc.) because I feared the numbers would conflict on rejoin. Nodes A, C, and D were otherwise unaffected.
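For the record, the restores were done from backups, roughly like this (the archive paths and names below are placeholders, not the real filenames; I just passed the new 2## ID on the command line instead of restoring in place):

# on Node A/C/D: restore each guest from its backup under a new 2## ID
vzrestore /mnt/backup/vzdump-openvz-101.tar 201    # OpenVZ container 101 -> 201
qmrestore /mnt/backup/vzdump-qemu-105.vma 205      # KVM guest 105 -> 205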

The GUI still shows all the 1## VMs as greyed-out on the (absent, red-dotted) Node B.

I have since repaired Node B and deleted all the local guests from it. Otherwise it is exactly as it was before the outage.
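By "deleted" I mean the containers and disk images are gone from B's local storage. Before plugging it back in I plan to double-check that nothing is left behind, something like this (assuming the default local storage layout under /var/lib/vz):

vzlist -a                  # should list no containers at all
ls /var/lib/vz/private/    # OpenVZ container data -- should be empty
ls /var/lib/vz/images/     # KVM disk images -- should be empty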

My question is, I hope, fairly simple: can I just turn Node B back on and expect it to rejoin happily, or will the rest of the cluster have a problem because it thinks there's a bunch of 1##-numbered guests on it? I've been reading a little about "split brain" and am worried this might somehow break the cluster. Since the cluster hosts production VMs, downtime would be a Very Bad Thing.
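For what it's worth, this is what I would check from one of the surviving nodes before and right after powering B on -- just the stock pvecm commands, to confirm quorum and membership look sane:

pvecm status    # quorum info and vote counts (right now 3 of the 4 expected votes are present)
pvecm nodes     # membership list -- B should reappear here once it has rejoined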

Any help is appreciated. This is not urgent, but we are running into some capacity issues, so getting Node B back into play would be very helpful.

Thank you!
 
Some more information:

I played around with this some more and found that if I go onto Node [A|C|D] and rename the /etc/pve/nodes/B/openvz/*.conf files to .bak, the guests disappear from the UI. Is this the "database" where consensus is formed about who actually runs what, or is that information stored elsewhere? Intuitively, I feel like removing all the *.conf files from /etc/pve/nodes/B/openvz and then starting up Node B (not "rejoining" it -- I now understand the node was never deleted, so the terminology I used was incorrect) after making sure its /etc/pve/openvz/ is clean should let it slide back into the cluster and resume operation.
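Concretely, this is what I was doing, run from Node A (the rename showed up in the GUI immediately, so I assume /etc/pve is shared across all the nodes):

cd /etc/pve/nodes/B/openvz
for f in *.conf; do mv "$f" "$f.bak"; done      # the 1## guests disappear from the GUI
# and to put them back:
# for f in *.bak; do mv "$f" "${f%.bak}"; done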

Am I close on this, or am I missing some other piece of back-end management that will cause problems because I changed Node B's guest configuration (deleted the guests) while it was offline and not part of the cluster?

I have a bad feeling that if I had just left the old 1## containers on Node B and then started it back up, it would have synced up perfectly because there would have been no ID collisions. Then I could have deleted the 1## containers on Node B (which had already been restored to the other nodes with 2## IDs) easily and without mishap.