Rebuild node in a cluster

CycloneB

I have a 3-node cluster in my homelab and lost one of the nodes this weekend due to a sudden SSD failure. I had recent backups of all the CTs and VMs and restored them to new IDs on the other two nodes. I'm ready to replace the failed drive and reinstall PVE 8.2 from scratch. I will indeed use the same hostname, IP address, and root password that I used for this node previously.

Once the install is completed and I run through an apt update/dist-upgrade, do I just copy over the following directories from one of the other cluster nodes?
  • /etc/pve/
  • /etc/corosync/
Or do I copy over just /etc/corosync and copy the corosync.conf file to /etc/pve on the rebuilt node? I can rebuild the symlinks if necessary, but no sense in going through all that if I don't need anything except corosync to get the node back into the cluster and resynchronize it.
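
Something like this is what I'm picturing (just my rough plan, not a verified procedure; the hostname is a placeholder):

# on the freshly reinstalled node, pull the corosync config from a healthy node
scp root@<healthy-node>:/etc/corosync/corosync.conf /etc/corosync/
scp root@<healthy-node>:/etc/corosync/authkey /etc/corosync/
# and, if that alone isn't enough, also drop a copy into /etc/pve
cp /etc/corosync/corosync.conf /etc/pve/corosync.conf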

Also, once the files are in place, do I just cycle pve-cluster and corosync via systemctl restart corosync pve-cluster?

Once everything is complete, I will do a pvecm updatecerts on each node.
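
In other words, roughly this (again, my plan as I understand it, not a confirmed procedure):

# on the rebuilt node, once the config files are in place
systemctl restart corosync pve-cluster
# then on each node in the cluster
pvecm updatecerts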

Fundamentally, the steps above are from reading https://forum.proxmox.com/threads/reinstall-node-in-cluster.64281/ . I will also probably need to update some flags, like turning on AMD IOMMU, redefining static names for USB drives to pass to CTs, etc. Those shouldn't be too bad, as I managed to do the same on the other two nodes to limp along while I waited for parts.
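
For the static USB names, a udev rule keyed on the drive's serial is one way to do it. The serial and symlink name below are made up for illustration:

# /etc/udev/rules.d/99-usb-disks.rules (example values only)
# give the whole USB disk a stable /dev/usb-backup symlink based on its serial
SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", ENV{ID_SERIAL}=="Example_USB_Drive_1234567890", SYMLINK+="usb-backup"

Then the CT config can point at /dev/usb-backup instead of whatever sdX it happens to land on.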
 
Well, it will take a little longer than expected. My replacement SSD arrived, but I was sent a different one than I ordered, so now I'm waiting for another replacement. In the meantime, do the steps in the prior post sound correct for recovery?
 
Things went decently smoothly. I did have to manually remove the lost node from the other two nodes' SSH known hosts in a few places, and on the rebuilt node change /etc/ssh/known_hosts to be a symlink to /etc/pve/priv/ssh_known_hosts. Until I did those steps I could get shells going (after the cert updates), but I couldn't get migrations to occur due to "Host Key Failed". Otherwise, everything went fine.
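
For anyone hitting the same thing, the cleanup was roughly along these lines (the hostname is a placeholder, and the exact file names may differ on your setup):

# on the surviving nodes, drop the stale host keys for the rebuilt node
# (repeat with its IP address as well)
ssh-keygen -R <rebuilt-node> -f /root/.ssh/known_hosts
ssh-keygen -R <rebuilt-node> -f /etc/pve/priv/ssh_known_hosts
# on the rebuilt node, make the system known_hosts follow the cluster-wide copy
ln -sf /etc/pve/priv/ssh_known_hosts /etc/ssh/known_hosts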

For future reference, I replaced the entire /etc/corosync directory on the rebuilt node and added /etc/pve/corosync.conf from one of the other nodes to it. Then restarting the servers got connectivity back within the cluster.
 
Hmmm, if you don't have a backup of the /etc directory on the original host, then probably the easier way would have been to delete the dead host from the cluster (including from the /etc/pve/nodes dir), then install Proxmox on the replacement node and add that to the cluster as a new host.

Doing that, Proxmox would automatically populate the /etc/pve directory on the new node and things should "just work".
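
Roughly something like this (node name and IP are placeholders; worth double-checking against the pvecm docs before running it):

# on one of the surviving nodes: remove the dead node from the cluster
pvecm delnode <dead-node-name>
rm -r /etc/pve/nodes/<dead-node-name>
# on the freshly installed replacement: join it to the cluster
pvecm add <ip-of-an-existing-node>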
 

Wish I had come across that before. I was worried that reusing the same hostname + IP would run into issues. If I ever have to do it again, it sounds like that's the path to take.
 
No worries. Trying things out is what test labs are for. :)

With your nodes, are they not able to have at least 2 or more SSDs in them?

If they can, you can do the Proxmox installation onto multiple SSDs at the same time (ie mirrored drives). Then if one dies, the other drive still has all your stuff.
 
On two of the nodes, that's not an option. They are HP 730 Thin Clients with only one M.2 SATA SSD slot (not even NVMe). The third one can support more, but it currently has three SSDs for NAS purposes and one as a boot disk, plus four rust spinners.

Thankfully, I take routine backups of the containers and VMs. Hopefully it will soon be easy to take backups of the PVE hosts themselves. I should probably add the PBS client on them in the meantime.
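
Something like this with the PBS client would probably be my starting point for the hosts (repository user, host, and datastore names are made up):

# back up /etc from a PVE host to a PBS datastore
proxmox-backup-client backup etc.pxar:/etc --repository backup@pbs@<pbs-host>:<datastore>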
 
Mirrored drives are a good guard against a drive failure, but I just had a cluster node become completely unbootable during the upgrade from version 8 to 9 because it looks like GRUB was installed in the wrong place. Now I've got a mess on my hands, and I'm here in the forums looking for a way to repair or reinstall it.
 
The PVE installer ISO includes a "rescue boot" mode that worked for me. It found the existing Proxmox install and Just Booted It, and I was able to reinstall GRUB from there after logging in as root.
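
In case it helps someone else, the reinstall from there is usually something along these lines on a legacy BIOS/GRUB setup (replace /dev/sdX with your boot disk; UEFI or proxmox-boot-tool setups differ):

# reinstall the bootloader and regenerate its config
grub-install /dev/sdX
update-grub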

For future reference, Rescatux and Super Grub Disc may also be useful for non-Proxmox Linux installs.
 
Thanks for the super tip! This is actually the only thing that ended up working for me. I had to completely reformat the 512 MB boot partition and re-init it. I was prompted to install additional packages in order to do so, which created an additional obstacle: getting networking reconfigured.
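
For anyone else in this spot, the reformat/re-init is typically done with proxmox-boot-tool, something like this (the partition is an example; check lsblk first to find the right ~512 MB ESP):

# adjust the partition to match your disk layout
proxmox-boot-tool format /dev/sdX2
proxmox-boot-tool init /dev/sdX2
proxmox-boot-tool status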

When I have my monitor plugged into the onboard motherboard VGA port, my system firmware renames my NIC from enp3s0 to enp2s0 because of the kernel naming policies. (Probably a unique quirk of my hardware and the GPU in one of the PCIe slots.)

Once I configured the new interface name through the CLI by adding it to /etc/network/interfaces and running systemctl restart networking, I was able to get networking back up, download and install the packages, and finally use the boot tool to fix the drive.

For the future, I've now reconfigured all the nodes in my cluster to add the additional interface name to vmbr0, so if this happens again, networking will automatically work with a monitor attached.
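
In case it helps anyone, the relevant part of /etc/network/interfaces ends up looking roughly like this (the address is a made-up example; both possible NIC names are listed on the bridge, and ifupdown2 should just skip whichever one isn't present):

# /etc/network/interfaces (excerpt)
iface enp2s0 inet manual
iface enp3s0 inet manual

auto vmbr0
iface vmbr0 inet static
        address 192.168.1.10/24
        gateway 192.168.1.1
        bridge-ports enp2s0 enp3s0
        bridge-stp off
        bridge-fd 0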
 