Hi all. I've been running Proxmox for a while to tinker with, and it has mostly been very smooth, but the cluster isn't doing so well at the moment.
Three nodes, which have recently been replaced in place with MS-01s. Each was a reinstall of PVE onto a new drive, re-using the existing NVMe drives for the guests.
Per node: power off the old node, move the hardware, install PVE, delnode from the cluster, power on the new node, set up interfaces/storage, add it to the cluster.
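From memory, the cluster side of that was roughly the following per node (node names/IPs here are placeholders, and I may be misremembering details):
Code:
# on one of the remaining nodes, after powering off the old node
pvecm delnode proxN

# on the freshly installed node, after setting up interfaces/storage
pvecm add <IP-of-an-existing-cluster-node>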
This worked fine for prox3.
I accidentally joined prox2 on the wrong network (I didn't notice until other issues came up yesterday).
The join failed for prox1: the cluster didn't see prox1, and prox1 saw itself as its own single-node cluster.
I manually removed prox1 from the cluster, got it back to single-node, and re-added it.
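For reference, I believe getting prox1 back to single-node was roughly the "separate a node without reinstalling" sequence from the docs; I'm going from memory, so the exact steps may be slightly off:
Code:
# on prox1 -- rough reconstruction, not an exact record of what I ran
systemctl stop pve-cluster
systemctl stop corosync
pmxcfs -l
rm /etc/pve/corosync.conf
rm -r /etc/corosync/*
killall pmxcfs
systemctl start pve-cluster
After re-adding it, this is where we are now: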
Code:
Cluster information
-------------------
Name: cluster
Config Version: 21
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Thu Sep 4 16:35:12 2025
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000002
Ring ID: 1.2ea1
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.0.171
0x00000002 1 192.168.1.172 (local)
0x00000003 1 192.168.0.173
I know prox2 will need fixing, but the issue at the moment is that replication and migration are broken, so I can't remove prox2 without losing the guests on it (I believe. Correct me if I'm wrong here).
The management network is 192.168.0.17x, and the migration and replication networks are set to 192.168.1.17x.
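For completeness, I believe the relevant setting lives in /etc/pve/datacenter.cfg, with a line along these lines (from memory, so the exact values may be off):
Code:
# /etc/pve/datacenter.cfg -- dedicated migration network (assumed values)
migration: secure,network=192.168.1.0/24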
A replication job fails with:
Code:
2025-09-04 16:29:00 111-2: start replication job
2025-09-04 16:29:00 111-2: guest => CT 111, running => 0
2025-09-04 16:29:00 111-2: volumes => local-nvme:subvol-111-disk-1
2025-09-04 16:29:00 111-2: end replication job with error: command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=prox1' -o 'UserKnownHostsFile=/etc/pve/nodes/prox1/ssh_known_hosts' -o 'GlobalKnownHostsFile=none' root@192.168.0.171 pvecm mtunnel -migration_network 192.168.1.171/24 -get_migration_ip' failed: exit code 255
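For what it's worth, that's just an SSH failure (exit code 255 is SSH's own error code), so the same command can be run by hand to see the underlying error. This is the command from the log above, only reformatted:
Code:
/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=prox1' \
  -o 'UserKnownHostsFile=/etc/pve/nodes/prox1/ssh_known_hosts' \
  -o 'GlobalKnownHostsFile=none' \
  root@192.168.0.171 pvecm mtunnel -migration_network 192.168.1.171/24 -get_migration_ip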
I've tried so much that I've lost track. Lots of:
Code:
ssh-keygen -f "/etc/pve/nodes/prox3/ssh_known_hosts" -R "prox3"
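I think I also ran pvecm updatecerts a few times to regenerate the cluster-wide known hosts/certificates, though honestly I can't remember every combination I've tried:
Code:
pvecm updatecerts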
I'm a bit in over my head at this point. Could do with some help.
