[SOLVED] Trouble getting proxmox to work after crash - what is the right method?

zahnfee

New Member
Jan 19, 2024
Hi,

After my root disk broke and had to be replaced (containers kept working, by the way; only the VMs went down), I did some research on how to restore the system.
Since I have an identical system that was cloned from the original (using dd and then adjusting hostnames/IPs afterwards), I thought this ought to be an easy job.
I also have all important files backed up, including the VM images and container images.

So far it has been a nightmare:

I installed Debian bookworm from scratch and then used apt-clone to clone all packages from the back-up system to the new install.
This went well after some hiccups with package management due to /var being on a separate disk.

The first thing I did was restore all images back to the respective locations (VM, containers, templates, iso...).

Now I found some posts claiming that it would be sufficient to replace /etc/pve with the backed-up version and all should be fine
(see https://forum.proxmox.com/threads/how-to-restore-config-db-or-list-contents.72491/ )

Well, this led to a bunch of errors which sent me in a completely wrong direction, until in my desperation I removed the entire /etc/pve and rebooted.
At least I had a working web interface now, but I was nowhere close to my goal of restoring the old system.

Since I have the old /var which contains /var/lib/pve-cluster/config.db from the time of the crash, I tried the other suggestion that I found
( https://forum.proxmox.com/threads/w...-proxmox-os-crash-due-to-power-failure.74000/ )
Note that again copying /etc/pve is mentioned - I find this very misleading, because it appears clearly wrong.

So I restored all files mentioned in that post, generated an empty /etc/pve again and put the old config.db in the correct location on the new system, then rebooted.
It looks promising - however the entire network config (although networking appears to function) is not visible anymore in the web interface.

===> How can that be corrected? /etc/network/interfaces and /etc/network/interfaces.d have been restored and as mentioned networking seems to work (I can reach all containers/VMs from the outside via ssh)

===> Also, what *is* the correct and fastest way to restore the machines?
I feel uneasy relying on a config.db file which I cannot easily read or check for correct information.
Is there at least a description of the format, so that it can be edited and, if need be, corrected?

kind regards,

z.
 
Another thing, that bothers me:

After the restore, the VM comes up with very little memory, so it crashes on boot with OOM. A reset then succeeds and it boots.
The free command shows 512MB of memory although it should be 4GB.
Even after changing the memory in the web interface and rebooting the VM, it still does not show the configured memory.
That is different on my backup system, where it shows exactly what is configured.

My confidence in the restore process is waning...
 
Unfortunately there is no "good" way to restore a node from a host backup (yet). There is no official way to create a host backup in the first place :-( (See for example https://bugzilla.proxmox.com/show_bug.cgi?id=2287)

While the CLI tool "proxmox-backup-client" helps with some parts, it is not a full solution for disaster recovery or a replacement installation.

I would go with a clean install and fiddle around manually with those files from "/etc" in my multiple backups. Exactly like you did, with the exception that I would not copy the full config.db but go for individual "vi /etc/pve/storage.cfg" and so on...
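For anyone who wants to look inside config.db before trusting it: the file is an SQLite database, so a copy of it can be inspected with standard tools. The following is only a minimal sketch in Python. It assumes the data lives in a single table called `tree` with the columns `inode`, `parent`, `type`, `name` and `data`; the function names are made up for this example. It opens the file read-only and should only ever be pointed at a copy, not at the live database.

```python
import sqlite3


def list_entries(db_path):
    """Return (inode, parent, type, name, size) rows from a config.db copy.

    Assumption: entries are stored in a table named `tree`.
    Directories have no data, so their size comes back as None.
    """
    # mode=ro opens the file read-only, so the copy cannot be changed by accident.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        return conn.execute(
            "SELECT inode, parent, type, name, length(data) "
            "FROM tree ORDER BY parent, name"
        ).fetchall()
    finally:
        conn.close()


def read_file(db_path, name):
    """Return the contents of the first file entry called `name`, or None."""
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        row = conn.execute(
            "SELECT data FROM tree WHERE name = ? AND data IS NOT NULL",
            (name,),
        ).fetchone()
    finally:
        conn.close()
    if row is None:
        return None
    data = row[0]
    # Depending on how the value was stored, SQLite may hand back str or bytes.
    return data.encode() if isinstance(data, str) else bytes(data)
```

Calling `read_file("/root/config.db.copy", "storage.cfg")` (the path is a placeholder) would return the stored storage.cfg as bytes, which can then be compared with a backup or copied by hand into a fresh /etc/pve.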

So..., you're not alone. Nevertheless: have fun!

Disclaimer: I am running some clusters and zero stand-alone PVEs. Some things are easier with a cluster...
 
It also helps to have all disks in RAID1, so you won't need to restore PVE in the first place when the system disk fails.
The 10-40€ you save by not buying a second system disk usually isn't worth all the downtime and work hours you need to get everything working again.
 
Thanks,
Yes, I thought about having mirrored disks, but the entire disk (except the ESP) is luks2 encrypted with lvm and has a patched grub-improved.
That makes it a bit more challenging... another layer between luks and lvm.

The good news is I got my networking display back.
It was my own fault - I use ansible (rsync) to push recovery files to the server, once the basic system is up.
I thought I had diligently configured everything during the initial installation. However...

I discovered that /etc/network had 700 permissions because I recently switched to a different Linux flavour with a more restrictive umask.
So networking worked, but PVE had no permission to read the directory.
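This kind of problem can be spotted with a quick scan after an rsync-style restore. Below is a rough sketch in Python that walks a directory and reports everything that users outside the owner and group cannot read. That "other" access is the relevant bit here is an assumption based on the symptom (networking itself worked, only the display was missing); the function name is made up for this example.

```python
import os
import stat


def unreadable_for_others(root):
    """Return paths at or below `root` that 'other' users cannot read.

    Directories need both r and x for others, regular files need r.
    Symlinks and special files are skipped.
    """
    paths = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, n) for n in dirnames + filenames)

    problems = []
    for path in paths:
        mode = os.lstat(path).st_mode
        if stat.S_ISDIR(mode):
            needed = stat.S_IROTH | stat.S_IXOTH
        elif stat.S_ISREG(mode):
            needed = stat.S_IROTH
        else:
            continue
        if mode & needed != needed:
            problems.append(path)
    return sorted(problems)
```

Run as root, `unreadable_for_others("/etc/network")` would have listed /etc/network itself in the situation described above, since a mode of 700 gives others neither read nor execute.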

As to the memory issue, I am not sure if it has to do with memory ballooning enabled. Currently the VM starts again without crashing.
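One way to check the ballooning theory is to look at the `memory` and `balloon` lines in the VM's config file. The sketch below, in Python, assumes the usual `key: value` layout of the files under /etc/pve/qemu-server/ and assumes that snapshot sections start with a `[name]` line; the function name is made up for this example. If `balloon` is set to 512 while `memory` is 4096, that might explain a guest reporting only 512MB, but that is a guess and not something confirmed in this thread.

```python
def memory_settings(conf_text):
    """Return the 'memory' and 'balloon' values from the main config section.

    Values are returned as strings, exactly as written in the file.
    Keys that are not present are simply missing from the result.
    """
    settings = {}
    for line in conf_text.splitlines():
        line = line.strip()
        if line.startswith("["):
            # Assumption: snapshot sections follow; only the current config matters.
            break
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep and key.strip() in ("memory", "balloon"):
            settings[key.strip()] = value.strip()
    return settings
```

Feeding it the contents of the VM's .conf file on both the restored host and the backup system makes it easy to compare the two.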

So the advice in the second post (above) seems sound:

start with an empty /etc/pve and use the restored /var/lib/pve-cluster/config.db to populate the /etc/pve.
Of course the vm and container images have to be backed up as well.
And watch the permissions :)
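The copy step itself can be sketched as a small Python helper. This is only an illustration under assumptions: the function name is made up, the 0600 mode is an assumption about what the file should have, and the pve-cluster service should be stopped before the file is swapped and started again afterwards (or the machine rebooted, as was done above). Any existing config.db is kept as a timestamped backup instead of being overwritten.

```python
import os
import shutil
import time


def install_config_db(src, dest_dir="/var/lib/pve-cluster"):
    """Copy a backed-up config.db into dest_dir.

    An existing config.db is moved aside first. Returns the path of that
    backup, or None if there was nothing to back up.
    """
    dest = os.path.join(dest_dir, "config.db")
    backup = None
    if os.path.exists(dest):
        backup = f"{dest}.{time.strftime('%Y%m%d-%H%M%S')}.bak"
        shutil.move(dest, backup)
    shutil.copy2(src, dest)
    # Assumption: the database should only be accessible to its owner.
    os.chmod(dest, 0o600)
    return backup
```

Keeping the previous file around makes it possible to go back to the freshly installed state if the restored database turns out to be unusable.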

All in all, I'm glad this crash did not cause more problems and that my backup strategy basically worked.
I've learned exactly which files are important to have and will tweak my rsnapshot config further.
 
