[SOLVED] Replace node in a 2-node cluster

Elleni

Well-Known Member
Jul 6, 2020
I migrated all guests to the remaining node and removed the empty node from the cluster. I then tried to add a new PVE server to the cluster. Unfortunately, I had not deleted the entry of the removed node from the known_hosts file on the remaining node. As the new node has the same hostname as the removed node, I was not successful.
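For anyone hitting the same thing, something along these lines should clear the stale host key on the remaining node before re-adding a node that reuses the old hostname/IP (names below are placeholders, and the last line assumes the usual PVE layout with the cluster-wide known_hosts under /etc/pve/priv):

Code:
# remove the stale key for the old node from root's known_hosts
ssh-keygen -R <nodename>
ssh-keygen -R <node-ip>
# the cluster-wide file may hold a copy as well
ssh-keygen -f /etc/pve/priv/known_hosts -R <nodename>
# regenerate/redistribute SSH keys and certificates afterwards
pvecm updatecerts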

The situation now is that:
- the datacenter view on the remaining node's web interface shows the new node name, but it is red
- the new node is no longer accessible via the web interface, but I can still log in via SSH

Is this a problem of an unsuccessful exchange of certificates? Or did it not work because I should have issued pvecm expected 1 first? Although I am prepared to reinstall everything and restore a bunch of VMs, I would love to fix this instead, as a reinstall would take a lot more time.

Thanks for any help on how to restore this cluster. I will provide any information needed to fix this. By the way, the 2-node cluster has a third server with a qdevice installed as the third vote for quorum.

Logging in via SSH and looking at pvecm status, I saw that the new node had somehow already created a cluster config of its own, so I went through a guide for removing the cluster config altogether and rebooted. Now web interface access is restored.
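For reference, the steps from such a guide (the "separate a node without reinstalling" procedure) boil down to roughly this on the node whose cluster config should be wiped - use with care, it removes the corosync configuration for good:

Code:
systemctl stop pve-cluster corosync
pmxcfs -l                  # mount /etc/pve in local mode
rm /etc/pve/corosync.conf
rm -rf /etc/corosync/*
killall pmxcfs
systemctl start pve-cluster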

I now want to remove the appearance of the non-functional node2 from the web interface of the remaining node (the one with all the VMs on it), to then be able to add the new node to that cluster - while the new node will have the same IP and name as the removed node.

Realized that the removed node1 was still showing in the web interface because it was still listed in /etc/corosync/corosync.conf. Removing it from there and restarting corosync and pvestatd made it disappear from the web interface.
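Instead of hand-editing corosync.conf, the supported way should be pvecm delnode, and the stale entry in the GUI usually also needs the leftover node directory under /etc/pve/nodes removed. A sketch (node name is a placeholder):

Code:
# on the remaining node
pvecm delnode <oldnodename>
# drop the leftover node directory so the GUI entry disappears
rm -rf /etc/pve/nodes/<oldnodename>
systemctl restart pvestatd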
 
I made some progress. By stopping the corosync, pvestatd and pve-cluster services and issuing pmxcfs -l, I was also able to edit /etc/pve/corosync.conf. That came in handy, because the server needs some physical rebuilding for the cluster network (direct link), so in the meantime I can move the cluster network to the already existing link and switch it back later, once I have installed an additional NIC in the new PVE node for the direct connection.
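As a side note, /etc/pve/corosync.conf can normally be edited without stopping anything; the admin guide approach, as far as I understand it, is to edit a copy and bump config_version, roughly:

Code:
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new
nano /etc/pve/corosync.conf.new   # change the addresses, increase config_version by 1
mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf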

But then, when starting corosync, pvestatd and pve-cluster again, only the first two started. For pve-cluster I got:

Job for pve-cluster.service failed because the control process exited with error code.
See "systemctl status pve-cluster.service" and "journalctl -xeu pve-cluster.service" for details.

notice: unable to acquire pmxcfs lock - trying again
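For reference, that lock message usually means a pmxcfs instance is still running - in this case most likely the one started by hand with pmxcfs -l. Something like this should clear it:

Code:
ps -C pmxcfs -o pid,args   # is the manually started local-mode instance still there?
killall pmxcfs             # stop it so the service can take the lock
systemctl start pve-cluster
systemctl status pve-cluster corosync pvestatd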

Rebooting both servers and joining the cluster this time worked. Checking if everything is working as intended.
 
It looks like everything is OK, though I am not sure about the votes, as I had set pvecm expected 1 while troubleshooting.

Code:
root@PVE002:~# pvecm status
Cluster information
-------------------
Name:             PVExCL001
Config Version:   6
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Sat Oct 11 19:57:21 2025
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000002
Ring ID:          1.240
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1         NR 10.3.8.6
0x00000002          1    A,V,NMW 10.3.8.7 (local)
0x00000000          1            Qdevice

Code:
Membership information
----------------------
    Nodeid      Votes    Qdevice Name
         1          1         NR PVE001
         2          1    A,V,NMW PVE002 (local)
         0          1            Qdevice
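Regarding the votes: as far as I understand, pvecm expected only changes the live value in votequorum, and after corosync is restarted the expected votes are recalculated from corosync.conf plus the qdevice, which is why the output above shows 3 again. If it ever stayed at 1, it could be raised back by hand:

Code:
pvecm expected 3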

Is it a correct assumption that once the direct link is established by adding a dedicated network card for cluster networking, I can switch over by just adapting the network configuration on the new node (modifying /etc/network/interfaces and /etc/hosts) and then finally /etc/pve/corosync.conf, and restarting the services? Or is there more to do, or a more elegant way, like via the web interface or CLI?
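Concretely, what I have in mind is roughly this (the interface name and subnet are just placeholders):

Code:
# 1) bring up the new NIC on both nodes via /etc/network/interfaces, e.g. ens19
#    with an address in the dedicated subnet (10.10.10.0/24 is only an example)
ifreload -a
# 2) if the ring addresses are names, make them resolvable (e.g. via /etc/hosts)
# 3) in /etc/pve/corosync.conf change the ring0_addr of both nodes to the new
#    addresses and increase config_version by one
# 4) restart corosync on both nodes and verify
systemctl restart corosync
pvecm status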
 
After having installed and configured a separate cluster network again, at first everything seems OK in pvecm status, and one node also starts its VMs that are set to autostart.

I changed /etc/corosync/corosync.conf and /etc/pve/corosync.conf back to a separate cluster network, after making sure that the nodes can reach each other via SSH over the cluster network.

On the second node though, as soon as I try to start a VM, the IP (not the cluster IP but the IP that is bridged to vmbr0) goes into disabled state and the web interface becomes unavailable. I can still SSH into it through the configured cluster network IP, though.

Please advise on how to fix this, and/or ask for any information you might need to find the problem.

The corosync, pvestatd and pve-cluster services are up and running.
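A few generic checks I can run and post output from on the affected node (vmbr0 being the bridge that carries the management IP):

Code:
ip -br link show vmbr0             # is the bridge itself up?
bridge link show                   # which ports are attached and in what state?
journalctl -k -b | grep -i vmbr0   # kernel messages about the bridge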
 
I don't understand this: I can successfully configure replication of VMs in both directions, and the jobs finish successfully. I was also able to successfully migrate a VM that was running on node2 over to node1.

But as soon as I start a VM on node1, the web interface address stops responding. Looking at the serial interface I see that it goes:
entered blocking state
entered forwarding state
entered disabling state

What could cause this?
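As far as I can tell, those blocking/forwarding/disabled lines are just the kernel adding and removing the VM's tap port on the bridge, so to narrow it down I am watching them live while starting the VM and comparing the bridge definition with the working node (hostname as in the status output above):

Code:
journalctl -kf | grep -Ei 'vmbr0|tap'
diff <(ssh PVE001 cat /etc/network/interfaces) /etc/network/interfaces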
 
Uff, I even removed it from the cluster and re-joined it, and it keeps showing the same behaviour - the network entering blocking and disabling state.
 
I finally found (my) error - it had nothing to do with the cluster reconfiguration. When it did the same thing after deleting one node from the cluster and cleaning its cluster configuration, I finally realized that, as it is a network issue, it might have something to do with /etc/network/interfaces. And guess what - checking it, I realized that I had forgotten the vlan-aware line in the vmbr0 config of said cluster node, so the behaviour actually made total sense :)

So after adding it and restarting the network, everything worked fine.
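For completeness, the relevant part of /etc/network/interfaces now looks roughly like this (address and NIC name replaced with placeholders); the bridge-vlan-aware line is the one that had been forgotten, bridge-vids is its usual companion, and restarting the network (e.g. ifreload -a) applied the change without a reboot:

Code:
auto vmbr0
iface vmbr0 inet static
    address 192.0.2.10/24     # placeholder, the node's real management IP
    bridge-ports eno1         # placeholder physical NIC
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes     # this was the forgotten line
    bridge-vids 2-4094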