Create Test Scenario for PVE boot failure

mw88

New Member
Oct 8, 2022
I am working on setting up a home server, which will mainly serve as a file server and also run some virtual machines and containers for home automation, routing, media, etc.

For the files, I have a raidz2 with four HDDs planned and ready, and I have played around with restoring files and datasets and with loading the pool on different machines, so I feel I can trust my data to this setup.

However, for the VMs, containers and main PVE config I am using a raidz2 boot pool consisting of four SSDs. My reasoning is that raidz2 leaves a suitably low risk, since I would only face a full recovery if three disks failed at once. However, as a beginner I always run the risk of misconfiguration, so I would like to create a test scenario where I manually render my PVE unbootable and then restore the rpool to an earlier snapshot, before I finally trust the system with all my config and data.

I imagine that running
Bash:
wipefs -a /dev/sdaX
would be sufficient to render the disks unbootable, but what is the procedure to restore the pool to an earlier snapshot without PVE running?

Also, do you have any suggestions how I would/should test a bootloader failure?
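Before destroying anything, it helps to know exactly which partitions are involved. A hedged sketch for identifying the boot partitions, assuming a default PVE ZFS install (where each boot disk typically carries a small BIOS boot partition, an ESP, and the ZFS partition):

```shell
# List the ESPs that Proxmox keeps in sync (PVE's own boot maintenance tool)
proxmox-boot-tool status

# Show partition layout and filesystem types to identify which
# partition on each disk is the ESP vs. the ZFS member partition
lsblk -o NAME,SIZE,FSTYPE,PARTTYPE,MOUNTPOINT
```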
In this case I would do the following:

* Install Proxmox VE in a VM and configure it exactly as you plan it for your physical server
* Then play around by removing (virtual) disks etc. and observe the effect
* Take snapshots so that you can go back to a given state of your experiments
* As soon as you are confident and familiar with how it works, install your physical server accordingly

Btw: RAIDZ2 needs a lot of performance; it is recommended to check whether your storage access speed is sufficient.
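The snapshot step above can be sketched as follows; "rpool" is the default pool name of a PVE ZFS install, and the snapshot name is an assumption:

```shell
# Take a recursive snapshot of the whole root pool before experimenting,
# so there is a known-good state to roll back to
zfs snapshot -r rpool@pre-experiment

# Confirm the snapshots exist
zfs list -t snapshot -r rpool
```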
Thank you for the suggestions, however I have concerns:

* Running Proxmox in a VM is something I did before actually deciding on Proxmox about a year ago.
* I successfully understood how to resilver, and did that.
* For most configuration problems, going back to a VM state is fine; however, I want to prepare for the case when Proxmox does not boot anymore. Resetting the VM is not something I can do on a real system. This is the main concern I have and want to test, hence my question.
* I have the physical server anyway, so why should I not test everything on there? Virtualization always introduces or hides some aspects of the real system.

Storage access speed should be fine, I'm running this on a TR3960.
 
* For most configuration problems, going back to a VM state is fine; however, I want to prepare for the case when Proxmox does not boot anymore. Resetting the VM is not something I can do on a real system. This is the main concern I have and want to test, hence my question.
Sure, that is why I suggested proving the concept for your setup in a VM first, so you can avoid ever needing to "reset" the real system; in other words: get familiar with how to recover a "damaged" RAIDZ2 by trying it out in a VM. The principle is explained in https://docs.oracle.com/cd/E19253-01/819-5461/gcfhw/index.html ; the difficulty is what to do when the RAIDZ2 pool is needed for booting: you have to boot the system from alternative media, repair the pool using that alternative OS, and then make the system bootable again. Once you have gained confidence by practising in a VM, you can be sure you will also succeed in case of a failure on the bare-metal system.
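The recovery path described above might look like this from a live system with ZFS support (e.g. the PVE installer's debug shell); pool, dataset, and snapshot names are assumptions based on a default PVE install:

```shell
# Import the pool under an alternate root so it does not
# mount over the live system's own filesystem
zpool import -f -R /mnt rpool

# Roll the root dataset back to the known-good snapshot
# (rpool/ROOT/pve-1 is the default root dataset name on PVE)
zfs rollback -r rpool/ROOT/pve-1@pre-experiment

# Repair the bootloader if needed (see proxmox-boot-tool),
# then export the pool cleanly and reboot
zpool export rpool
```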
 
Thank you. Although I was sceptical at first and didn't like your answer, I now realize this is a great way to test what I had in mind.

My own research shows I should be able to perform the following steps:
  1. destroy the OS (PVE), then
  2. boot from a (virtual) flash drive and
  3. try a rollback
Are these the recommended steps? Then afterwards I can experiment on destroying the bootloader.
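For the bootloader experiment, a hedged sketch of breaking and then repairing one disk's ESP; the partition numbering is an assumption (on a default PVE ZFS install the ESP is usually partition 2), and /dev/sdX2 is a placeholder:

```shell
# Break the bootloader on a single disk by wiping its ESP signatures
wipefs -a /dev/sdX2

# After verifying the system still boots from the other raidz2 members,
# re-create the ESP and re-register it with PVE's boot tool
proxmox-boot-tool format /dev/sdX2 --force
proxmox-boot-tool init /dev/sdX2
```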

I will try it out during the upcoming weeks and will most probably document my results here (for myself and anyone else interested).