[SOLVED] damaged rpool - data "recovery"

I had Proxmox 7.3 running on a Dell R820 as a mirror (rpool). The rpool status is "clean", but...
after several power outages (South Africa has "load-shedding", which makes for interesting times) my server wouldn't boot.

I managed to rescue-boot it and all seemed well... but obviously wasn't. It has slowly deteriorated to the point of no longer being bootable (automatically or via rescue).

I thought I could do a fresh install onto an alternative non-redundant disk, mount the zpool(s) under a different path, copy the essential data off to a safe location, then wipe/reinstall on the original pair and copy the data back.
I used the info from this post to get them mounted...
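Roughly what that boiled down to (pool name and the altroot mount point below are just examples, adjust to your setup):

# import the old pool read-only under an alternate root so nothing mounts over the rescue system
zpool import -f -R /mnt/oldpool -o readonly=on rpool
zfs list -r rpool                      # check which datasets came along
# copy the essential data off to a safe location (paths are just examples)
rsync -aHAX /mnt/oldpool/path/to/data/ /mnt/safe-location/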

The install (via an external USB SSD drive) is a bit slow and absolutely not ideal, but it's temporary, so I don't particularly care.
I have 99.9% of it back (by size), including all the CT/VM images etc., but the /etc/pve/*.conf files are not all showing up in the mounted rpool. Specifically, the qemu/lxc config files have vanished (/etc/pve/nodes/* is not there in the mounted rpool).
I did build all of them from the command line and have copies of the scripts, so I could probably rebuild them from there, but I'm wondering if anyone knows of, or could point me toward, some cached copy or location to extract them from. (I do admit to being an idiot for not having backed up the .conf files, which was part of the plan right up to the point when it was too late.) The price of stupidity is, indeed, high.

1) Is it correct to assume that once I have the configs back, I can overwrite the stored images from the backup and they should be fine?
2) Alternatively, if I have to recreate them, could I use the vzdumps I have to restore them to almost as good as new? (i.e. would the system detect the associated backups and allow a restore?)

Thanks
 
The /etc/pve/ directory is actually a database pretending to be a filesystem. It is indeed not there when the Proxmox services have not started (correctly).
If you install a new Proxmox and attach/connect the old backup storage, Proxmox should see the backups and be able to restore them.
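For example, assuming the new install sees the old backup storage under a name like "backup" (the storage name, paths, VMIDs and target storage below are only examples), the restore is basically:

# list the vzdump archives Proxmox can see on that storage
pvesm list backup --content backup
# restore a VM from its archive to a VMID and target storage of your choice
qmrestore /mnt/pve/backup/dump/vzdump-qemu-100-<timestamp>.vma.zst 100 --storage local-zfs
# or a container
pct restore 101 /mnt/pve/backup/dump/vzdump-lxc-101-<timestamp>.tar.zst --storage local-zfs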
ZFS really works best with enterprise SSDs that have PLP (Power Loss Protection) built in. Maybe your situation does justify their higher up-front cost? They last much longer and are faster because, thanks to the PLP, they can safely cache sync writes.
 
You should also run your server on a UPS that can power it for some minutes during an outage. You can then use a tool like NUT to monitor the UPS and shut the server down gracefully before the battery runs empty.
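A minimal sketch of the NUT side (UPS name, driver and password are placeholders, adjust for your hardware):

# /etc/nut/ups.conf - define the UPS (usbhid-ups covers most USB-attached units)
[myups]
    driver = usbhid-ups
    port = auto

# /etc/nut/upsmon.conf - shut the host down cleanly when the battery runs low
# (the upsmon user/password also need a matching entry in upsd.users)
MONITOR myups@localhost 1 upsmon secretpass primary
SHUTDOWNCMD "/sbin/shutdown -h +0"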

What you can do is back up the database itself and restore it on a new PVE host. See here: https://pve.proxmox.com/wiki/Proxmox_Cluster_File_System_(pmxcfs)#_recovery
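In practice that means grabbing the sqlite database behind /etc/pve. Roughly (the /mnt/oldpool path is just wherever the old root happens to be mounted):

# copy the database off the dead install
cp /mnt/oldpool/var/lib/pve-cluster/config.db /safe/place/
# on the new host: stop pmxcfs, drop the copy in place, start it again
systemctl stop pve-cluster
cp /safe/place/config.db /var/lib/pve-cluster/config.db
systemctl start pve-cluster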

When the pool itself is healthy and just the bootloader is damaged, you could boot a Live Linux, chroot into your non-bootable rpool, then rebuild the bootloader and initramfs and write them to disk. For the initramfs I wrote that up here today: https://forum.proxmox.com/threads/fix-stuck-initramfs.120192/post-522569
When chrooted, you could also sync the ESP or reinstall grub as described in the wiki under "Changing a failed bootable device": https://pve.proxmox.com/wiki/ZFS_on_Linux#_zfs_administration
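A rough outline of that chroot dance (the dataset names are the Proxmox defaults, check yours first):

# from the live system: import the pool without mounting, then mount root first
zpool import -f -N -R /mnt rpool
zfs mount rpool/ROOT/pve-1
zfs mount -a
# bind the virtual filesystems and enter the chroot
for d in proc sys dev run; do mount --rbind /$d /mnt/$d; done
chroot /mnt /bin/bash
# inside the chroot: rebuild the initramfs and refresh the ESPs
update-initramfs -u -k all
proxmox-boot-tool refresh    # on systems booted via proxmox-boot-tool; otherwise grub-install per the wiki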

And in case just one bootloader is damaged, you could go to the BIOS and change the boot order to boot from the second disk first.
 
Hey guys - I have had some fibre problems and have been offline for a few days. It was obviously not my intention to abuse your knowledge without thanks.
I can now offer my thanks and appreciation to you for the inputs - I have managed to recover fully from the problem thanks to the guidance.

Just to explain - I do have 2 relatively large UPS systems (4x100Ah each) feeding the server - they can last about 9 hours, but we had a ~20hr stint of no power because the provider's substation blew. In all, a comedy of errors: first the power, then the server configs, then the fibre.
Thankfully all sorted now.
Much obliged for the time, knowledge and patience that the experts and the community share with us n00bs!