Advice after node hardware failure - how to re-add server in cluster after reinstall

alain

Hi all,

We had two disks fail in a row on a server, and we lost the RAID array (RAID 10). I replaced the disks and reinstalled Proxmox on the server with the same IP and hostname (srv-virt2) as before; this is perhaps not the best option...

There are three nodes in the cluster (srv-virt1, srv-virt2 and srv-virt3), so the quorum is 2, and we still have it (two nodes remaining). The failed node still appears in the cluster:
Code:
# pvecm nodes
Node  Sts   Inc   Joined               Name
   1   M    308   2013-10-20 17:29:19  srv-virt1
   2   X    320                        srv-virt2
   3   M  14628   2013-12-07 17:35:29  srv-virt3

I can no longer delete the VMs that were on the failed node. I have VM backups. I read the wiki on the PVE 2.0 cluster, and it seems I have to be careful. See:
https://pve.proxmox.com/wiki/Proxmox_VE_2.0_Cluster

There seem to be two possible options: remove the node, which permanently removes it, and then re-add the server as a new node (so with a new name?). But I cannot delete the ghost VMs. What will happen to these VM IDs? Is it possible to force-delete the node?

The second option would be "Re-installing a cluster node", but even though I have a backup of /etc, I don't have a backup of /var/lib/pve-cluster or /root/.ssh, so it does not seem to be a viable option.
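
As I understand the wiki, that second option would have needed a pre-failure backup along these lines, which we simply don't have:
Code:
# the kind of backup the "re-installing a cluster node" procedure relies on (my reading of the wiki)
tar czf /root/pve-node-backup.tar.gz /var/lib/pve-cluster /root/.ssh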

What would be the best procedure to make sure I end up with a sane cluster?

Proxmox version is 3.1:
Code:
# pveversion
pve-manager/3.1-20/c3aa0f1a (running kernel: 2.6.32-26-pve)

Thanks for any advice.

Alain
 
Re: Advice after node hardware failure - how to re-add server in cluster after reinst

If your newly installed node has the same IP and hostname, just add it to the cluster using the -force flag:

> pvecm add IP-of-cluster -force
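
For example, run on the reinstalled srv-virt2, pointing at a node that still has quorum (the address below is only a placeholder for your srv-virt1):
Code:
# run on the reinstalled srv-virt2; replace 192.168.1.10 with the real IP of srv-virt1
pvecm add 192.168.1.10 -force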
 
Re: Advice after node hardware failure - how to re-add server in cluster after reinst

...
I can no longer delete the VMs that were on the failed node.
Hi,
but as long as you have quorum you can delete the VM config files inside the pve directory:
Code:
cd /etc/pve/nodes/srv-virt2/qemu-server
rm *.conf
# or move them to an existing node: mv *.conf ../../srv-virt1/qemu-server/
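Before deleting anything, it is worth confirming that the remaining nodes still have quorum:
Code:
# run on srv-virt1 or srv-virt3; the output shows whether the cluster is quorate
pvecm status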
There seem to be two possible options: remove the node, which permanently removes it, and then re-add the server as a new node (so with a new name?). But I cannot delete the ghost VMs. What will happen to these VM IDs? Is it possible to force-delete the node?
...
Re-adding with -force should work.

Udo
 
Re: Advice after node hardware failure - how to re-add server in cluster after reinst

Hi Udo and Dietmar,

Thanks for the answers. It is reassuring to know that -force should work. Do I first have to delete the previous node with pvecm delnode srv-virt2 before re-adding the reinstalled node?
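
In other words, something like this on one of the remaining nodes (just my understanding of the wiki, not run yet), before doing the add with -force on srv-virt2:
Code:
# run on a node that still has quorum, e.g. srv-virt1; removes the failed node from the cluster config
pvecm delnode srv-virt2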
 
Re: Advice after node hardware failure - how to re-add server in cluster after reinst

Just to add some context: we don't have fencing configured, and the VMs were stored locally. So when we re-add the node to the cluster, it will not find the VMs locally. Most of the VMs have been restored on the other nodes for now. Is that a problem?
 
Re: Advice after node hardware failure - how to re-add server in cluster after reinst

Is that a problem?

I assume you used other VMIDs, so that is no problem. You can simply delete the unused VM configuration files later.
 
Re: Advice after node hardware failure - how to re-add server in cluster after reinst

I did a pvecm add IP-of-cluster -force, and indeed it seems to have worked. No errors:
Code:
# pvecm nodes
Node  Sts   Inc   Joined               Name
   1   M  14632   2013-12-09 10:33:50  srv-virt1
   2   M  14632   2013-12-09 10:33:50  srv-virt2
   3   M  14632   2013-12-09 10:33:50  srv-virt3

I restarted some VMs on this node, and everything was OK.

Thanks a lot Dietmar and all !

Alain
 
Re: Advice after node hardware failure - how to re-add server in cluster after reinst

I assume you used other VMIDs, so that is no problem. You can simply delete the unused VM configuration files later.

Yes, I used other VMIDs; it was just impossible to restore a VM to its old VMID (ghost VM). Now that the srv-virt2 node is back in the cluster, I was able to delete these ghost VMs and reuse the IDs to restore VMs to their old VMIDs.
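
For the record, the restore itself was just a normal restore from the backup archive, something along these lines (the archive name below is only an example):
Code:
# restore the backup archive to the old VMID (file name is just an example)
qmrestore /var/lib/vz/dump/vzdump-qemu-101-2013_12_08-22_00_01.vma.lzo 101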

It was a somewhat stressful experience, as we had some production machines on the failed node, but in the end it was very instructive on how to recover from such a situation.

I noticed that I thanked Dietmar for the first answer instead of Tom, who was the first to answer, so thanks Tom!