How to recover from HW failure on a 3 nodes cluster - VMs are hidden

francoisd

Hi,

How should we recover from a node lost to a hardware failure on a 3-node cluster?
I can't see its VMs anymore, and I have no direct way to restart them on the remaining nodes.

1743104573788.png

My best guess is that I should remove the failed node from the cluster, but unfortunately the cluster view does not really help:
1743104721824.png

Since the cluster does not seem to handle this situation by itself, you should add a "Recovery" section to the default manual: https://pve/pve-docs/index.html

Surprisingly, pvecm nodes does not list node 1, while the web UI still does:
Code:
root@pve2:~# pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         2          1 pve2 (local)
         3          1 pve3
We can, however, delete the hidden pve1:
Code:
root@pve2:~# pvecm delnode pve1
Could not kill node (error = CS_ERR_NOT_EXIST)
Killing node 1

Despite the removal of the faulty pve1 node:
  • The VMs are still hidden, and we don't know how to start them on the remaining nodes.
  • pve1 is still displayed under Datacenter in the web UI, but not in the Cluster panel.
1743108217335.png


A useful link to remove a node: https://forum.proxmox.com/threads/remove-node-from-cluster.98752/
Hi,
have a look at /etc/pve/nodes/pve1/qemu-server/. The config files of the VMs that were running on pve1 should be stored there. Move them to the qemu-server directory of one of your other nodes (for example /etc/pve/nodes/pve2/qemu-server/). As long as the VMs were not configured to use local storage or other local resources, they should then show up in the GUI and can be started on the new node.
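
The move described above could be scripted roughly as follows. This is only a sketch, not an official Proxmox tool: the function name move_vm_configs and the PVE_NODES_DIR variable are made up for illustration, and it assumes the remaining nodes still have quorum (so that /etc/pve is writable) and that pve1 is really powered off and will not come back with those VMs running.

```shell
#!/bin/sh
# Sketch: move the VM config files of a dead node to a surviving node.
# PVE_NODES_DIR and move_vm_configs are illustrative names (assumptions),
# not part of Proxmox VE. The default path is the one named in this thread.
PVE_NODES_DIR="${PVE_NODES_DIR:-/etc/pve/nodes}"

move_vm_configs() {
    # $1 = failed node name (e.g. pve1), $2 = surviving node name (e.g. pve2)
    src="$PVE_NODES_DIR/$1/qemu-server"
    dst="$PVE_NODES_DIR/$2/qemu-server"

    # Refuse to continue if either directory is missing.
    [ -d "$src" ] || { echo "no such directory: $src" >&2; return 1; }
    [ -d "$dst" ] || { echo "no such directory: $dst" >&2; return 1; }

    for conf in "$src"/*.conf; do
        # If the glob matched nothing, the literal pattern is left; skip it.
        [ -e "$conf" ] || continue
        echo "moving $conf -> $dst/"
        mv -- "$conf" "$dst/" || return 1
    done
}

# Example call, to be run on a surviving node:
#   move_vm_configs pve1 pve2
```

Nothing is moved until the function is called explicitly. Before calling it, check each config for disks on local storage or other local resources, since those VMs will not start on another node.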