Hello, I am looking for a common and reliable way to recover/roll back my 4-node homelab if one (or all) hosts needs a full reinstall (hardware failure or self-induced). I used Proxmox Backup Server (as a VM) to back up my VMs in the past and it was great. However, I never implemented anything for host/cluster recovery. I will have 4 nodes in a PVE 8 cluster running Ceph Reef across all of them (2.5GbE or 10Gb DAC for the internal network), with a totally separate target for host file backups as necessary. It seems like removing/re-adding a host in a PVE 8 cluster is much simpler than in the past, which is good. However, I have never successfully recovered a host that was also part of a Ceph cluster. I only tried once, after incorrectly using sed against an interfaces file, and things went downhill from there.
I said "common" and reliable because I want to either be able to find relevant posts on how to resolve things if something happens, OR have it be so simple that anyone who can follow instructions can do it without being a Linux admin.
Also, I have used GRUB/ext4 in the past, but (as I put in a recent post) I for some reason need to use systemd-boot (to get IOMMU working), so I'm planning on using ZFS RAID 0 on a single SSD/NVMe for each host install. I've used Clonezilla in the past (easy), but I would "like" to (if necessary) be able to reinstall all 4 nodes, run a script to restore copies of files (/etc/... whatever is necessary) that I would periodically write out to a separate solution/target, and have the Ceph cluster healthy again. I would also have my Proxmox Backup Server VM replicate occasionally to a remote host that is not part of the cluster, so I could easily get my VMs back. I'm a newb when it comes to Ceph - I did get it working and healthy, but when I made a mistake on the manager node, I tried hacking at it to get it back and it didn't go well. I would still like to run Ceph (more complicated) because I want to learn it anyway - that's part of the homelab's purpose. Any and all feedback appreciated.
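To show what I mean by "a script to write out copies of files", here's a rough sketch of the periodic config dump I have in mind (the path list and the /mnt/backup-target mount are just my guesses at what's needed for a typical PVE + Ceph host - not a definitive list):

```python
#!/usr/bin/env python3
"""Sketch of a periodic host-config backup for a PVE + Ceph node.
The BACKUP_PATHS list and TARGET mount are assumptions; adjust for your setup.
Intended to be run from cron or a systemd timer on each node."""
import datetime
import os
import socket
import tarfile

# Paths that commonly matter when rebuilding a PVE + Ceph host (assumed list).
BACKUP_PATHS = [
    "/etc/pve",                  # cluster config (pmxcfs FUSE mount)
    "/etc/ceph",                 # ceph.conf and keyrings
    "/etc/network/interfaces",   # the file I broke with sed last time
    "/etc/hosts",
    "/etc/hostname",
]
TARGET = "/mnt/backup-target"    # hypothetical mount of the separate backup target

def backup(paths=BACKUP_PATHS, target=TARGET):
    """Tar the given paths into a timestamped, hostname-tagged archive on target."""
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    archive = os.path.join(target, f"{socket.gethostname()}-config-{stamp}.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        for path in paths:
            if os.path.exists(path):  # skip paths that don't exist on this node
                tar.add(path)
    return archive

if __name__ == "__main__":
    print(f"wrote {backup()}")
```

The restore side would be the reverse (extract onto a fresh install before rejoining the cluster), though I understand /etc/pve is special since it's the clustered pmxcfs mount, so blindly restoring it is probably exactly the kind of thing I'd want feedback on.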
Also, I have used Ansible and plan on using it from my desktop to update various VMs. If I could also use it to recover all 4 nodes in the PVE/Ceph cluster (after fresh installs and updating known hosts/keys) to a known/healthy state, that would be ideal.

