Recover failed host back into cluster

xcj

Hi there,

I had a working cluster with two hosts and HA, using a Raspberry Pi as a QDevice.

Host 1's boot drive failed (I use a USB stick as a boot drive, something I learned here on the forum is risky, and I am waiting for hardware to change that).
I now have Host 2 and the QDevice running, and I reinstalled Proxmox on a new USB drive on Host 1, giving it the same FQDN and the same IP, but I have not tried to reintegrate it into the cluster yet.

I was trying to follow this guide from ServeTheHome https://www.servethehome.com/9-step-calm-and-easy-proxmox-ve-boot-drive-failure-recovery/2/ and I was at the part where I could import the ZFS filesystems back onto the host; since the data drives did not have any malfunction, all their contents should still be on them.

I have two pools, "ZFS_SSD" on an SSD and "ZFS_HDD" on a hard disk. The guide says you should see the pools using the command "zpool status", but it just says:

Code:
root@node1:~# zpool status
no pools available

Maybe I am missing a step?

Under Disks, I can see the filesystems.
Screenshot 2021-08-03 at 12.08.31.png

Also, do you recommend following the STH guide? I didn't find anything similar to it here on the forum.

Please excuse any lack of basic knowledge; I am here to learn.

Thanks in advance!
 
Hi, I found this outside the forum:

Code:
zpool import -f -d /dev/sdX PoolName

root@node1:/# zpool import -f -d /dev/sdb1 ZFS_SSD
root@node1:/# zpool import -f -d /dev/sda1 ZFS_HDD

I now have the pools added and I can see the VMs' disks on them.
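One follow-up I found mentioned elsewhere (treat it as a sketch, I have not verified it on my setup): after a manual import it may be worth refreshing the ZFS cachefile, so the pools come back automatically on the next boot instead of needing another manual import.

Code:
# confirm the pools are back and healthy
zpool status

# refresh the cachefile so zfs-import-cache.service picks the pools up at boot
zpool set cachefile=/etc/zfs/zpool.cache ZFS_SSD
zpool set cachefile=/etc/zfs/zpool.cache ZFS_HDD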

Now, at this point, I could start recreating the VMs using the existing disks. But since the node was in a cluster before, don't the configurations already exist on the second host? And since I used the same IP and hostname/FQDN for this node, if I recreate everything and then add it to the cluster, won't there be a lot of problems, given that the second host still lists it as an offline host?

The STH guide recommended using a different hostname, recreating the VMs, and adding it as a new host. What is the best approach here?
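For anyone comparing options later: one approach that seems to be commonly suggested (a sketch only, not from the STH guide; the node name and the host2 IP below are placeholders from my setup) is to remove the dead node from the cluster on the surviving host and then join the freshly installed host back in, which should also sync the guest configs from /etc/pve again.

Code:
# on host2: drop the stale membership of the failed node
# (only once you are sure the old node1 install will never come back)
pvecm delnode node1

# on the reinstalled host1 (must be empty, no guests yet): join the existing cluster
pvecm add 192.168.1.12

# check membership and quorum afterwards
pvecm status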
 
Host 2, alone in its cluster, missing Host 1:
Screenshot 2021-08-03 at 19.44.18.png

Host 1, with all storage configured, no VMs running, operating standalone:
Screenshot 2021-08-03 at 19.46.30.png

Host 1, with the VMs' disks all ready and accessible:

Screenshot 2021-08-03 at 19.49.19.png

Host 2, with all the information needed to get Host 1 back running:

Code:
root@node2:/etc/pve/nodes/node1# ls -l
total 2
-rw-r----- 1 root www-data  145 Aug 2 22:18 lrm_status
drwxr-xr-x 2 root www-data    0 Jul 9 19:19 lxc
drwxr-xr-x 2 root www-data    0 Jul 9 19:19 openvz
drwx------ 2 root www-data    0 Jul 9 19:19 priv
-rw-r----- 1 root www-data 1679 Jul 9 19:19 pve-ssl.key
-rw-r----- 1 root www-data 1712 Jul 9 19:19 pve-ssl.pem
drwxr-xr-x 2 root www-data    0 Jul 9 19:19 qemu-server
root@node2:/etc/pve/nodes/node1# cd qemu-server/
root@node2:/etc/pve/nodes/node1/qemu-server# ls
101.conf  105.conf  108.conf  109.conf  112.conf  115.conf

This all feels like a bad romantic movie where the hosts run back into each other's arms at the end and are complete again :rolleyes:

The thing is, the solution seems to be only a small step away, but it's not clear to me, and one misstep could leave me in an even worse state.

Any help on what my next step should be, or pointers in the right direction?

Thanks in advance!
 
I tried to follow the recommendations at https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_remove_a_cluster_node:

Code:
6.5. Recovery

If you have major problems with your Proxmox VE host, e.g. hardware issues, it could be helpful to just copy the pmxcfs database file /var/lib/pve-cluster/config.db and move it to a new Proxmox VE host. On the new host (with nothing running), you need to stop the pve-cluster service and replace the config.db file (needed permissions 0600). Second, adapt /etc/hostname and /etc/hosts according to the lost Proxmox VE host, then reboot and check. (And don’t forget your VM/CT data)

This led to a critical failure on boot on host1, so I reinstalled it again and redid the steps to get back to how it was before the last post.
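For reference, what I attempted roughly translates to something like this (a sketch; where the saved copy of config.db comes from is just a placeholder, and nothing should be running on the target host):

Code:
# stop the cluster filesystem on the freshly installed host
systemctl stop pve-cluster

# replace the pmxcfs database with the saved copy (source path is a placeholder)
cp /root/config.db.saved /var/lib/pve-cluster/config.db
chown root:root /var/lib/pve-cluster/config.db
chmod 0600 /var/lib/pve-cluster/config.db

# adapt /etc/hostname and /etc/hosts to match the lost node, then reboot
reboot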

If no one has any pointers, I will probably save the ".conf" files from each host, reinstall host2, and set up a new cluster from the two freshly installed standalone hosts. Maybe when I come back from holidays.

Since I needed some of the VMs on host1 running again, I used the .conf files for each VM that were in /etc/pve/nodes/node1 on host2 and got them running again on host1 (I picked up the existing disks with "qm rescan").
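For anyone searching later, that boiled down to roughly the following per VM (a sketch; VMID 101 and the paths are examples from my setup):

Code:
# on host2: copy the orphaned VM config out of the clustered filesystem
scp /etc/pve/nodes/node1/qemu-server/101.conf root@node1:/root/

# on host1 (standalone): put the config where the node expects it
cp /root/101.conf /etc/pve/qemu-server/101.conf

# rescan the storages so the existing disks get picked up again
qm rescan --vmid 101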

But host1 is still standalone, and host2 is still in a cluster reporting host1 as failed.

I have a VM that I need running on host 1, which got migrated through HA to host 2. I can't remove it from host 2 because it has a replication job to host 1, and I can't delete the replication job because the cluster can't perform the job deletion on host 1.
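In case it helps someone in the same spot, there seems to be a way to drop a replication job even when the target node is unreachable (a sketch; the job ID 100-0 is an example, check your real one with pvesr list):

Code:
# list the configured replication jobs and their IDs
pvesr list

# remove the job config without trying to clean up on the unreachable target
pvesr delete 100-0 --force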
 
