[SOLVED] How to replace a failed node in a two-node VE 6 cluster?

peterwargo · Apr 1, 2020

I've been trawling the forums, but my search-fu seems to be lacking today. Here's the problem:

We had a nicely functional Proxmox VE 6 cluster. Identical machines, nice and new, with both local (lvm-thin) and NFS storage. Let's call them NODE1 and NODE2. Unfortunately, NODE2 ran into some really unusual hardware problems (it's a fairly new platform), and the manufacturer ended up replacing the whole system, disks and all. To make a shipping deadline, I did what I thought was necessary to remove the node, then scrubbed the disks and shipped it out. It took some time to get replaced, and when the new machine arrived, we were (and still are) shut down for COVID-19 safety.

However, since we are an essential industry, I was called in this weekend for another matter, and managed to get the "replacement" NODE2 in the rack and remote access configured. Once I could, I remotely set it up, installed Proxmox VE 6, relicensed it, and then tried to add it back to the cluster. No love.

Establishing API connection with host '172.19.68.211'
Login succeeded.
Request addition of this node
TASK ERROR: 500 cluster not ready - no quorum?

I believe the old system is still showing up as part of the cluster - it's there in the list when I log into NODE1 with an "x" on its icon. If I click on NODE2 and try to get the status, I get:

tls_process_server_certificate: certificate verify failed (596)

If I go to the datacenter view, the cluster status shows NODE2 as offline and Quorate: no.

So, the question is: can I get NODE2 to come back into the cluster?
-or-
Since it's a two-node cluster, can I safely destroy it and re-create it?

Thanks for any help, and if this *is* an FAQ and I missed it, I apologize.

fabian · Apr 2, 2020

https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_remove_a_cluster_node <- did you follow all these steps when removing the 'old' node2?

peterwargo · Apr 2, 2020

Hi Fabian,

Thanks. However, those are the instructions I followed before removing the node. Apparently, it didn't work as expected. I've gone back and run the commands again to get output, and provided a copy of my corosync.conf file.

NODE1 = vms01 (existing system, the one that we kept)
NODE2 = vms02 (the node that was removed, and the hardware replaced.)

Is it safe for me to back up and then edit the corosync.conf file and restart the daemon? Would that get it out of what looks to be read-only mode? Or is there a way to (at least for the time being) set the required quorum votes to one? Or give vms01 more votes?

Here's the output from the commands used to remove the node. This is recent; I didn't save the output when I originally did it. My bad, I was under a time crunch.

root@vms01:/etc/pve# pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         1          1 vms01 (local)
root@vms01:/etc/pve# pvecm delnode vms02
cluster not ready - no quorum?
root@vms01:/etc/pve# pvecm status
Quorum information
------------------
Date:             Thu Apr 2 10:24:20 2020
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          1/84
Quorate:          No

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      1
Quorum:           2 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 172.19.68.

Contents of corosync.conf:
-----

logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: vms01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 172.19.68.211
  }
  node {
    name: vms02
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 172.19.68.213
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: vms-cluster
  config_version: 2
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  secauth: on
  version: 2
}
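
A note for anyone reading along: the "Expected votes: 2" in the status output lines up with the two quorum_votes: 1 entries in this nodelist. Below is a minimal sketch for adding up those entries from a corosync.conf-style file. The function name expected_votes is made up for illustration, and it assumes each node block carries a single "quorum_votes: N" line as in the file above.

```shell
#!/bin/sh
# Sum every "quorum_votes: N" line found in the given file(s),
# or on stdin when no file is given.
# Assumption: one quorum_votes line per node block, as in the config above.
expected_votes() {
    awk '$1 == "quorum_votes:" { sum += $2 } END { print sum + 0 }' "$@"
}

# Demo with two nodes carrying one vote each (mirrors vms01 and vms02).
printf 'quorum_votes: 1\nquorum_votes: 1\n' | expected_votes   # prints 2
```

On a real node this could be pointed at the cluster config, for example expected_votes /etc/pve/corosync.conf, assuming that is where the file lives on your system.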

ermanishchawla · Apr 2, 2020

Just do the following:
1. On a working node, run pvecm expected 1
2. Delete the old node (pvecm delnode <old node>)
3. Run pvecm updatecerts
4. Add the new node
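
Why step 1 unblocks things, as a rough sketch: votequorum needs a majority of the expected votes, i.e. floor(expected / 2) + 1. With 2 expected votes and only 1 present, the threshold of 2 is never reached; lowering the expected votes to 1 drops the threshold to 1. The helper names quorum_threshold and is_quorate below are hypothetical and only model the arithmetic; they do not talk to corosync.

```shell
#!/bin/sh
# Majority threshold: floor(expected / 2) + 1.
quorum_threshold() {
    echo $(( $1 / 2 + 1 ))
}

# Succeeds (exit status 0) when the votes present reach the threshold.
# Args: total_votes expected_votes
is_quorate() {
    [ "$1" -ge "$(quorum_threshold "$2")" ]
}

# Before: 1 vote present, 2 expected -> threshold 2 -> blocked.
is_quorate 1 2 && echo "quorate" || echo "blocked"
# After "pvecm expected 1": 1 vote present, 1 expected -> threshold 1 -> quorate.
is_quorate 1 1 && echo "quorate" || echo "blocked"
```

This matches the status output earlier in the thread: Expected votes 2, Total votes 1, Quorum 2, activity blocked.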

peterwargo · Apr 2, 2020

Thanks ermanishchawla! That did the trick.

To summarize, if you have a two-node cluster and one node needs to be completely replaced, the steps to follow are just a bit modified from the documentation section 5.5. Remove a Cluster Node:

(All commands are on the "remaining" single node. In other words, the one you want to keep of the two.)

1. List the nodes: pvecm nodes
2. Power off the other node (if it isn't already dead).
3. On the only remaining node: pvecm expected 1
4. Follow that with: pvecm delnode <dead node>
5. Finally, run: pvecm updatecerts

Now, once the new node is built, and the OS is installed *from scratch*, it can be re-added to the cluster the normal way.
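
For reference, the steps above could be wrapped in a small script along these lines. This is only a sketch: the run and remove_dead_node helpers and the DRY_RUN switch are made up for illustration, and the removal subcommand used is pvecm delnode, as in the output earlier in the thread. By default it only prints the commands, so nothing touches the cluster until DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
# Set DRY_RUN=0 to actually run them on the surviving node.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Arg: name of the dead node as listed by "pvecm nodes" (e.g. vms02).
remove_dead_node() {
    dead_node="$1"
    [ -n "$dead_node" ] || { echo "usage: remove_dead_node <node>" >&2; return 1; }
    run pvecm nodes                  # list current members
    run pvecm expected 1             # let the lone survivor reach quorum
    run pvecm delnode "$dead_node"   # drop the dead node from the cluster
    run pvecm updatecerts            # refresh the cluster certificates
}

# Example: show what would be run for the node named vms02.
remove_dead_node vms02
```

The dead node should already be powered off before any of this is run for real. Once the replacement is installed from scratch, the usual join is run on the new node itself, e.g. pvecm add 172.19.68.211 (the address of the surviving node in this thread).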

As a side note, I believe the lack of quorum was causing issues with my backups (not being able to remove old ones, etc.), since it was in read-only mode.

Of course, normally it wouldn't take months to replace a failed machine, but the combination of an unusual hardware failure, a scarcity of that machine, and COVID-19 led to a huge delay. But my "cluster of two" is now happy, I migrated a test VM without an issue, and I feel much, much better.

oah433 · Apr 8, 2021

That worked fine, though for step 4 the command I ran was pvecm delnode pve3 (pve3 was my node name).
