Recovery disk?

venquessa

New Member
Aug 25, 2023
TL;DR

Does anyone make, or have a guide on making, a "shadow boot disc" for their PVE install? A shadow boot disk is basically an external boot disk which has everything needed to boot, access, fix, repair, start, stop, debug, and inspect the real system. If you have large enough media (like a 32 GB flash drive) and enough RAM, you can just put the whole OS on it and make a "Live ISO" image of the running machine.

-----

So I perma-crashed my first PVE box on day one. I tested what would happen if I assigned the PCIe SATA controller (I have 3 controllers) to a VM.

Well, what happened, apparently, is that the PVE node itself crashed. Even when I plugged a monitor and keyboard into it, it was dead. When I used the power button it shut down cleanly and powered back up cleanly until it launched that VM, at which point it hung. It was interesting that it could shut down and start up with console output, but as soon as that VM ran, the PVE node was completely unresponsive to everything.

I switched DNS over to 8.8.8.8 for some searching, but I couldn't quickly find a way to boot into a recovery or safe mode. No bootloader options that I could see, etc. I found a post on here suggesting using the installer, which I still had on a USB key, but in debug mode, which drops you to a console at each stage.

Being new to ZFS it was a little daunting at first, but I finally got the ROOT volume mounted and.... WTF?!? Panic mode. /etc/pve empty.

I hunted high and low and ended up back here to find that /etc/pve is a virtual mount point materialised by the cluster manager from its database, which lives in /var/lib/pve-cluster.

So, I made a copy of that, edited the binary file to remove the hostpci: config line from the offending VM, and rebooted.

I kinda figured I wouldn't get away with this, but I was a little desperate, even saying out loud: you are going to have to reinstall it and spend the evening trying to re-attach/restore the VMs. After having to go back and change the mountpoint on /rpool/ROOT again (oops)... Indeed, "Database corrupt" on pve-cluster startup.

However. This got me to exactly where I needed to be in the first place, in a working, fully initialised shell with running base services. So I could put the backed up config.db back in place, start the cluster manager alone to materialise the /etc/pve folder and fix the VM config.
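For anyone landing here with the same problem, the "fix the VM config" step can be sketched in Python. This is only an illustration, assuming the VM config is plain text with one `key: value` option per line and that the passthrough entries are the lines starting with `hostpci` followed by a number; the PCI address in the sample is made up.

```python
import re

# Passthrough entries look like "hostpci0: ..." in the VM config (assumed format).
HOSTPCI_LINE = re.compile(r"^hostpci\d+:")


def strip_hostpci(config_text: str) -> str:
    """Return config_text with every hostpciN line removed.

    Note: this drops matching lines everywhere in the file, including any
    snapshot sections, which may or may not be what you want.
    """
    kept = [
        line
        for line in config_text.splitlines(keepends=True)
        if not HOSTPCI_LINE.match(line)
    ]
    return "".join(kept)


if __name__ == "__main__":
    # Hypothetical config contents, for illustration only.
    sample = (
        "cores: 2\n"
        "hostpci0: 0000:03:00.0\n"
        "memory: 4096\n"
    )
    print(strip_hostpci(sample), end="")
```

Editing the file by hand in /etc/pve does the same thing, of course; the function is just handy if you want to check the result before writing it back.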

reboot

Back to normal. Phew.
 
Does anyone make, or have a guide on making, a "shadow boot disc" for their PVE install?
We have our own Debian-based live Linux for fixing systems, so I always use that as a basis ... yet in the end, it's the same process as you described, even with decades' worth of experience fixing Linux-based systems.

You could fix your system a little bit faster by:
  • boot your OS from GRUB with init=/bin/bash as a kernel command line argument
  • import your pool
  • open your PVE config database with sqlite3
  • update the VM config (fix the passthrough setting or disable auto boot)
  • reboot and enjoy
Often you don't need a separate boot option, just a working kernel, yet it is always a good idea to have a working fallback.
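The "sqlite3 your pve config" and "disable auto boot" steps could look roughly like the Python sketch below. It rests on assumptions that are not confirmed in this thread: that config.db contains a table called `tree` with a `name` column (the file name) and a `data` column (the file contents), and that auto boot is controlled by an `onboot:` line in the VM config. Check the real layout with `.schema` in sqlite3 first, and only ever run something like this against a copy, with the cluster service stopped. Whether pmxcfs needs anything else updated alongside the data is not something this sketch addresses.

```python
import re
import sqlite3

# An "onboot: 1" line is assumed to be what makes the VM start at boot.
ONBOOT_LINE = re.compile(r"^onboot:.*$", re.MULTILINE)


def disable_onboot(db_path: str, conf_name: str) -> bool:
    """Rewrite 'onboot: ...' to 'onboot: 0' in one stored config file.

    ASSUMPTION: table `tree` with columns `name` and `data` (not verified here).
    Returns True if the stored contents were changed, False otherwise.
    """
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT rowid, data FROM tree WHERE name = ?", (conf_name,)
        ).fetchall()
        if len(rows) != 1:
            raise LookupError(
                f"expected exactly one entry named {conf_name!r}, found {len(rows)}"
            )
        rowid, data = rows[0]
        was_bytes = isinstance(data, bytes)
        text = data.decode("utf-8") if was_bytes else str(data or "")

        new_text, count = ONBOOT_LINE.subn("onboot: 0", text)
        if count == 0 or new_text == text:
            return False  # no onboot line, or it was already 0

        # Store the value back in the same form (bytes or text) it was read in.
        new_data = new_text.encode("utf-8") if was_bytes else new_text
        con.execute("UPDATE tree SET data = ? WHERE rowid = ?", (new_data, rowid))
        con.commit()
        return True
    finally:
        con.close()
```

Usage would be something like `disable_onboot("/root/config.db.copy", "100.conf")`, where both the path and the VM ID 100 are placeholders. Going through SQLite instead of a text or hex editor means the database file itself stays structurally valid, since SQLite handles the on-disk layout.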

Note: Your example is a very good one. If you had asked me how to totally destroy your PVE node, I would not have thought of that.
 