Recover failed host back into cluster

xcj

Hi there,

I had a working cluster with two hosts and HA, using a Raspberry Pi as a QDevice.

Host 1's boot drive failed (I use a USB stick as a boot drive, something I learned here on the forum is risky, and I am waiting for hardware to change that).
I now have Host 2 and the QDevice running, and I reinstalled Proxmox on a new USB drive on Host 1, giving it the same FQDN and the same IP, but I have not tried to reintegrate it into the cluster yet.

I was trying to follow this guide from ServeTheHome https://www.servethehome.com/9-step-calm-and-easy-proxmox-ve-boot-drive-failure-recovery/2/ and I was at the part where I could import the ZFS filesystems back onto the host; since the data drives did not have any malfunction, all their contents should still be on them.

I have two pools, "ZFS_SSD" on an SSD and "ZFS_HDD" on a hard disk. The guide says you should see the pools using the command "zpool status", but it just says:

Code:
root@node1:~# zpool status
no pools available

Maybe I am missing a step?

Under Disks, I can see the filesystems.
Screenshot 2021-08-03 at 12.08.31.png

Also, do you recommend following the STH guide? I didn't find anything similar to it here on the forum.

Please excuse any lack of basic knowledge; I am here to learn.

Thanks in advance!
 
Hi, I found this outside the forum:

Code:
zpool import -f -d /dev/sdX PoolName

root@node1:/# zpool import -f -d /dev/sdb1 ZFS_SSD
root@node1:/# zpool import -f -d /dev/sda1 ZFS_HDD

I now have the pools added and I can see the VMs' disks on them.
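One follow-up I found mentioned elsewhere (treat it as a sketch, I have not verified it on my setup): after a manual import it may be worth refreshing the ZFS cachefile, so the pools come back automatically on the next boot instead of needing another manual import.

Code:
# confirm the pools are back and healthy
zpool status

# refresh the cachefile so zfs-import-cache.service picks the pools up at boot
zpool set cachefile=/etc/zfs/zpool.cache ZFS_SSD
zpool set cachefile=/etc/zfs/zpool.cache ZFS_HDD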

Now, at this point, I could start recreating the VMs using the existing disks. But since the node was in a cluster before, don't the configurations already exist on the second host? And since I used the same IP and hostname/FQDN for this node, if I recreate everything and then add it to the cluster, won't there be a lot of problems, given that the second host still lists it as an offline host?

The STH guide recommended using a different hostname, recreating the VMs, and adding it as a new host. What is the best approach here?
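For anyone comparing options later: one approach that seems to be commonly suggested (a sketch only, not from the STH guide; the node name and the host2 IP below are placeholders from my setup) is to remove the dead node from the cluster on the surviving host and then join the freshly installed host back in, which should also sync the guest configs from /etc/pve again.

Code:
# on host2: drop the stale membership of the failed node
# (only once you are sure the old node1 install will never come back)
pvecm delnode node1

# on the reinstalled host1 (must be empty, no guests yet): join the existing cluster
pvecm add 192.168.1.12

# check membership and quorum afterwards
pvecm status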
 
Host 2, alone in its cluster, missing Host 1:
Screenshot 2021-08-03 at 19.44.18.png

Host 1, with all storage configured, no VMs running, operating standalone:
Screenshot 2021-08-03 at 19.46.30.png

Host 1, with the VMs' disks all ready and accessible:

Screenshot 2021-08-03 at 19.49.19.png

Host 2, with all the information needed to get Host 1 back running:

Code:
root@node2:/etc/pve/nodes/node1# ls -l
total 2
-rw-r----- 1 root www-data  145 Aug 2 22:18 lrm_status
drwxr-xr-x 2 root www-data    0 Jul 9 19:19 lxc
drwxr-xr-x 2 root www-data    0 Jul 9 19:19 openvz
drwx------ 2 root www-data    0 Jul 9 19:19 priv
-rw-r----- 1 root www-data 1679 Jul 9 19:19 pve-ssl.key
-rw-r----- 1 root www-data 1712 Jul 9 19:19 pve-ssl.pem
drwxr-xr-x 2 root www-data    0 Jul 9 19:19 qemu-server
root@node2:/etc/pve/nodes/node1# cd qemu-server/
root@node2:/etc/pve/nodes/node1/qemu-server# ls
101.conf  105.conf  108.conf  109.conf  112.conf  115.conf

This all feels like a bad romantic movie where the hosts run back into each other's arms at the end and are complete again :rolleyes:

The thing is, the solution seems to be only a small step away, but it's not clear to me, and one misstep could leave me in an even worse state.

Any help on what my next step should be, or pointers in the right direction?

Thanks in advance!
 
I tried to follow the recommendations at https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_remove_a_cluster_node:

Code:
6.5. Recovery

If you have major problems with your Proxmox VE host, e.g. hardware issues, it could be helpful to just copy the pmxcfs database file /var/lib/pve-cluster/config.db and move it to a new Proxmox VE host. On the new host (with nothing running), you need to stop the pve-cluster service and replace the config.db file (needed permissions 0600). Second, adapt /etc/hostname and /etc/hosts according to the lost Proxmox VE host, then reboot and check. (And don’t forget your VM/CT data)

This led to a critical failure on boot on host1, so I reinstalled it again and redid the steps to get back to how it was before the last post.
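For reference, what I attempted roughly translates to something like this (a sketch; where the saved copy of config.db comes from is just a placeholder, and nothing should be running on the target host):

Code:
# stop the cluster filesystem on the freshly installed host
systemctl stop pve-cluster

# replace the pmxcfs database with the saved copy (source path is a placeholder)
cp /root/config.db.saved /var/lib/pve-cluster/config.db
chown root:root /var/lib/pve-cluster/config.db
chmod 0600 /var/lib/pve-cluster/config.db

# adapt /etc/hostname and /etc/hosts to match the lost node, then reboot
reboot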

If no one has any pointers, I will probably save the ".conf" files from each host, reinstall host2, and set up a new cluster from the two freshly installed standalone hosts. Maybe when I come back from holidays.

Since I needed some of the VMs on host1 running again, I used the .conf files for each VM that were in /etc/pve/nodes/node1 on host2 and got them running again on host1 (I picked up the existing disks with "qm rescan").
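For anyone searching later, that boiled down to roughly the following per VM (a sketch; VMID 101 and the paths are examples from my setup):

Code:
# on host2: copy the orphaned VM config out of the clustered filesystem
scp /etc/pve/nodes/node1/qemu-server/101.conf root@node1:/root/

# on host1 (standalone): put the config where the node expects it
cp /root/101.conf /etc/pve/qemu-server/101.conf

# rescan the storages so the existing disks get picked up again
qm rescan --vmid 101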

But host1 is still standalone, and host2 is still in a cluster reporting host1 as failed.

I have a VM that I need running on host 1, which got migrated through HA to host 2. I can't remove it from host 2 because it has a replication job to host 1, and I can't delete the replication job because the cluster can't perform the job deletion on host 1.
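In case it helps someone in the same spot, there seems to be a way to drop a replication job even when the target node is unreachable (a sketch; the job ID 100-0 is an example, check your real one with pvesr list):

Code:
# list the configured replication jobs and their IDs
pvesr list

# remove the job config without trying to clean up on the unreachable target
pvesr delete 100-0 --force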
 
