[SOLVED] Cluster in a bad way after node replacement and failed join

*Daedalus

Hi all. I've been running Proxmox for a while to tinker with, and it has mostly been very smooth, but the cluster isn't doing so well at the moment.
3 nodes, which were recently replaced in place with MS-01s. Each replacement was a reinstall of PVE onto a new boot drive, re-using the existing NVMe drives for guests.
Per node: power off the old node, move the hardware, install PVE, delnode from the cluster, power on the new node, set up interfaces/storage, add it to the cluster.
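From memory, the per-node commands were roughly this (node names/IPs are just my setup, so treat it as a sketch):
Code:
# on a surviving cluster member, once the old node is powered off for good
pvecm delnode proxN
# on the freshly installed node, after interfaces and storage are configured
pvecm add 192.168.0.171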

This worked fine for prox3.
I accidentally configured the wrong network for prox2 (I didn't notice until other issues surfaced yesterday).
The join failed for prox1: the rest of the cluster didn't see it, and prox1 saw itself as its own single-node cluster.
I manually removed prox1 from the cluster (sketch below the status output), got it back to a single node, re-added it, and here's where we are now:

Code:
Cluster information
-------------------
Name:             cluster
Config Version:   21
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Sep  4 16:35:12 2025
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000002
Ring ID:          1.2ea1
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.0.171
0x00000002          1 192.168.1.172 (local)
0x00000003          1 192.168.0.173
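For reference, "manually removed" above means I roughly followed the separate-a-node steps from the admin guide on prox1:
Code:
# on prox1: stop the cluster services and drop the corosync config
systemctl stop pve-cluster
systemctl stop corosync
pmxcfs -l
rm /etc/pve/corosync.conf
rm -r /etc/corosync/*
killall pmxcfs
systemctl start pve-cluster
# then, on a remaining cluster member
pvecm delnode prox1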

I know prox2 will need fixing (you can see in the membership list above that node 2 joined corosync on the 192.168.1.x network instead of 192.168.0.x), but the immediate issue is that replication and migration are broken, so I can't remove prox2 without losing the guests on it (I believe; correct me if I'm wrong here).

The management network is 192.168.0.17x, and migration and replication are set to use 192.168.1.17x.
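For context, that lives in /etc/pve/datacenter.cfg; mine is something like this (replication tunnels over the migration network too, as the log below shows):
Code:
# /etc/pve/datacenter.cfg
migration: secure,network=192.168.1.0/24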

A replication job fails with:
Code:
2025-09-04 16:29:00 111-2: start replication job
2025-09-04 16:29:00 111-2: guest => CT 111, running => 0
2025-09-04 16:29:00 111-2: volumes => local-nvme:subvol-111-disk-1
2025-09-04 16:29:00 111-2: end replication job with error: command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=prox1' -o 'UserKnownHostsFile=/etc/pve/nodes/prox1/ssh_known_hosts' -o 'GlobalKnownHostsFile=none' root@192.168.0.171 pvecm mtunnel -migration_network 192.168.1.171/24 -get_migration_ip' failed: exit code 255
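Exit code 255 is ssh itself failing rather than the remote command, so the same connection can be tested by hand with -v to see the underlying error, e.g. (same options PVE uses, with a harmless remote command swapped in):
Code:
/usr/bin/ssh -v -e none -o 'BatchMode=yes' -o 'HostKeyAlias=prox1' \
  -o 'UserKnownHostsFile=/etc/pve/nodes/prox1/ssh_known_hosts' \
  -o 'GlobalKnownHostsFile=none' root@192.168.0.171 /bin/true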

I've tried so much I've lost track. Lots of
Code:
ssh-keygen -f "/etc/pve/nodes/prox3/ssh_known_hosts" -R "prox3"
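plus, on each node, the command that (if I'm reading the docs right) regenerates the cluster-wide certificates and SSH known-hosts data:
Code:
pvecm updatecerts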

I'm a little in over my head at this point and could do with some help. :)
 
I have everything on PBS if need be. Is it easier at this point to just delnode everything, delete the corosync data, and form a fresh cluster?
I've never done it, so I'm not sure whether it will ruin guest configs. I could restore everything from backup, but I'd rather not if I have another option.
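If I do go the rebuild route, my understanding is that guest configs are just plain files in the cluster filesystem, so copying them somewhere safe first should cover me (paths assume the defaults):
Code:
# VM configs live in qemu-server/, container configs in lxc/
mkdir -p /root/pve-config-backup
cp -a /etc/pve/nodes /root/pve-config-backup/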
 
PSA: Always check the obvious first. I had an interface on prox1 set with prox3's IP. That's about 6 hours I won't get back. :D
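For anyone who lands here later: with two hosts answering on the same address, the migration/replication tunnel was presumably hitting the wrong machine, hence the host-key failures and exit code 255. The actual repair was just correcting the address on prox1 and reloading, roughly:
Code:
# fix the duplicate address in /etc/network/interfaces on prox1, then apply
ifreload -a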