Rebuild node in a cluster

I have a 3-node cluster in my homelab and lost one of the nodes this weekend to a sudden SSD failure. I had recent backups of all the CTs and VMs and restored them to new IDs on the other two nodes. I'm now ready to replace the failed drive and reinstall PVE 8.2 from scratch. I will use the same hostname, IP address, and root password that this node had previously.

Once the install is completed and I run through an apt update/dist-upgrade, do I just copy over the following directories from one of the other cluster nodes?
  • /etc/pve/
  • /etc/corosync/
Or do I copy over just /etc/corosync and copy the corosync.conf file to /etc/pve on the rebuilt node? I can rebuild the symlinks if necessary, but there's no sense in going through all that if I don't need anything except corosync to get the node back into the cluster and resynchronized.
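
To make that second variant concrete, here's the rough sketch I have in mind (untested; pve2 stands in for one of the healthy nodes):

    # On the rebuilt node: stop the cluster stack before touching the config
    systemctl stop pve-cluster corosync
    # Pull the corosync config and authkey from a healthy member
    scp -r root@pve2:/etc/corosync /etc/
    # /etc/pve is the pmxcfs mount, so I'd expect it to repopulate on its
    # own once corosync and pve-cluster come back up and the node rejoins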

Also, once the files are in place, do I just cycle pve-cluster and corosync via systemctl restart corosync pve-cluster?

Once everything is complete, I will run pvecm updatecerts on each node.
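
Spelled out, the sequence I'm picturing (again, untested until the new drive arrives):

    # Restart the cluster stack so it picks up the copied config
    systemctl restart corosync pve-cluster
    # Then refresh certificates and SSH known-hosts entries on each node
    pvecm updatecerts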

Fundamentally, the steps above come from reading https://forum.proxmox.com/threads/reinstall-node-in-cluster.64281/ . I will also probably need to restore some host-specific settings, like enabling AMD IOMMU and redefining static names for the USB drives passed to CTs. Those shouldn't be too bad, as I managed to do the same on the other two nodes to limp along while I waited for parts.
 
Well, this will take a little longer than expected. My replacement SSD arrived, but I was sent a different model than the one I ordered, so now I'm waiting for another replacement. In the meantime, does the prior post sound correct as my recovery steps?
 
Things went decently smoothly. I did have to manually remove the lost node's entries from the other two nodes' SSH known_hosts files in a few places, and on the rebuilt node, change /etc/ssh/ssh_known_hosts to be a symlink to /etc/pve/priv/known_hosts. Until I did those steps, while I could get shells going (after the cert updates), I couldn't get migrations to occur due to "Host key verification failed" errors. But otherwise, everything was pretty smooth.
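
In case it helps anyone else, this is roughly what I ran (hostname and IP are placeholders for my failed node):

    # On each surviving node: drop the stale host key entries
    ssh-keygen -R pve3
    ssh-keygen -R 192.168.1.13
    ssh-keygen -f /etc/ssh/ssh_known_hosts -R pve3
    # On the rebuilt node: point the global known_hosts at the cluster-wide copy
    ln -sf /etc/pve/priv/known_hosts /etc/ssh/ssh_known_hosts
    # Then refresh certs and known-hosts entries once more
    pvecm updatecerts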

For future reference: I replaced the entire /etc/corosync directory on the rebuilt node with a copy from one of the other nodes, and copied /etc/pve/corosync.conf over as well. Then restarting the services (corosync and pve-cluster) got connectivity back within the cluster.
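
A quick sanity check after the restart:

    # Confirm the node rejoined and the cluster is quorate
    pvecm status
    pvecm nodes
    # If membership doesn't come up, watch corosync for clues
    journalctl -u corosync -f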
 
Hmmm, if you don't have a backup of the /etc directory on the original host, then probably the easier way would have been to delete the dead host from the cluster (including from the /etc/pve/nodes dir), then install Proxmox on the replacement node and add that to the cluster as a new host.

Doing that, Proxmox would automatically populate the /etc/pve directory on the new node and things should "just work".
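
Something along these lines, with node name and IP as placeholders:

    # On a surviving node: remove the dead node from the cluster
    pvecm delnode pve3
    # Clean up its leftover config directory
    rm -r /etc/pve/nodes/pve3
    # Then, on the freshly installed node: join via an existing member's IP
    pvecm add 192.168.1.11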
 
Wish I had come across that before. I was worried that reusing the same hostname + IP would run into issues. If I ever have to do it again, it sounds like that's the path to take.
 
No worries. Trying things out is what test labs are for. :)

With your nodes, are they able to hold two or more SSDs?

If they can, you can do the Proxmox installation onto multiple SSDs at the same time (i.e. mirrored drives). Then if one dies, the other drive still has all your stuff.
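
If you do go the mirror route (ZFS RAID1 in the installer), checking on it and swapping a dead drive later is straightforward. A sketch, assuming the installer-default pool name rpool and placeholder device names:

    # Check mirror health
    zpool status rpool
    # Replace a failed member with the new drive
    zpool replace rpool /dev/disk/by-id/ata-OLD-SSD /dev/disk/by-id/ata-NEW-SSD
    # For a boot pool, the new drive also needs the partition layout and
    # bootloader copied over (see proxmox-boot-tool)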
 
On two of the nodes, that's not an option. They are HP t730 thin clients with only one M.2 SATA slot (not even NVMe). The third one can support more, but it currently has three SSDs for NAS purposes and one as a boot disk, plus four spinning-rust drives.

Thankfully, I take routine backups of the containers and VMs. Hopefully it will soon be easy to take backups of the PVE hosts themselves. I should probably add the PBS client to them in the meantime.
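
Until that's built in, something like this with the client might at least cover the host config (repository and datastore names are made up):

    # Install the client, then back up the host's /etc as a pxar archive
    apt install proxmox-backup-client
    proxmox-backup-client backup etc.pxar:/etc \
        --repository backup@pbs@192.168.1.20:homelab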
 
