Create Test Scenario for PVE boot failure

mw88

New Member
Oct 8, 2022
I am working on setting up a home server, which will mainly serve as a file server and also run some virtual machines and containers for home automation, routing, media, etc.

For the files, I have a raidz2 with four HDDs planned and ready, and I have played around with restoring files and datasets and with loading the pool on different machines, so I feel I can trust my data to this setup.

However, for the VMs, containers and main PVE config I am using a raidz2 boot pool consisting of four SSDs. My reasoning is that raidz2 leaves a suitably low risk, since I would only face a full recovery if three disks failed at once. However, as a beginner I always run the risk of misconfiguration, so I would like to create a test scenario where I manually render my PVE unbootable and then restore the rpool to an earlier snapshot, before I finally trust the system with all my config and data.

I imagine that running
Bash:
wipefs -a /dev/sdaX
would be sufficient to render the disks unbootable, but what is the procedure to restore the pool to an earlier snapshot without PVE running?

Also, do you have any suggestions how I would/should test a bootloader failure?
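Before destroying anything, it helps to know exactly which partitions are involved. A hedged sketch for identifying the boot partitions, assuming a default PVE ZFS install (where each boot disk typically carries a small BIOS boot partition, an ESP, and the ZFS partition):

```shell
# List the ESPs that Proxmox keeps in sync (PVE's own boot maintenance tool)
proxmox-boot-tool status

# Show partition layout and filesystem types to identify which
# partition on each disk is the ESP vs. the ZFS member partition
lsblk -o NAME,SIZE,FSTYPE,PARTTYPE,MOUNTPOINT
```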
In this case I would do the following:

* Install Proxmox VE in a VM and configure it exactly as you plan it for your physical server
* Then play around by removing (virtual) disks etc. and observe the effect
* Take snapshots so that you can go back to a given state of your experiments
* As soon as you are confident and familiar with how it works, install your physical server accordingly

Btw: RAIDZ2 needs a lot of performance; it is recommended to check whether your storage access speed is sufficient.
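The snapshot step above can be sketched as follows; "rpool" is the default pool name of a PVE ZFS install, and the snapshot name is an assumption:

```shell
# Take a recursive snapshot of the whole root pool before experimenting,
# so there is a known-good state to roll back to
zfs snapshot -r rpool@pre-experiment

# Confirm the snapshots exist
zfs list -t snapshot -r rpool
```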
Thank you for the suggestions, however I have concerns:

* Running Proxmox in a VM is something I did before actually deciding on Proxmox about a year ago.
* I successfully understood how to resilver, and did that.
* For most configuration problems, going back to a VM state is fine; however, I want to prepare for the case when Proxmox does not boot anymore. Resetting the VM is not something I can do on a real system. This is the main concern I have and want to test, hence my question.
* I have the physical server anyway, so why should I not test everything on there? Virtualization always introduces or hides some aspects of the real system.

Storage access speed should be fine, I'm running this on a TR3960.
 
* For most configuration problems, going back to a VM state is fine; however, I want to prepare for the case when Proxmox does not boot anymore. Resetting the VM is not something I can do on a real system. This is the main concern I have and want to test, hence my question.
Sure, that is why I suggested proving the concept for your setup in a VM first, so you can avoid ever needing to "reset" the real system; in other words: get familiar with how to recover a "damaged" RAIDZ2 by trying it out in a VM. The principle is explained in https://docs.oracle.com/cd/E19253-01/819-5461/gcfhw/index.html ; the difficulty is what to do when the RAIDZ2 pool is needed for booting: you have to boot the system from alternative media, repair the pool using that alternative OS, and then make the system bootable again. Once you have gained confidence by practising in a VM, you can be sure you will also succeed in case of a failure on the bare-metal system.
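The recovery path described above might look like this from a live system with ZFS support (e.g. the PVE installer's debug shell); pool, dataset, and snapshot names are assumptions based on a default PVE install:

```shell
# Import the pool under an alternate root so it does not
# mount over the live system's own filesystem
zpool import -f -R /mnt rpool

# Roll the root dataset back to the known-good snapshot
# (rpool/ROOT/pve-1 is the default root dataset name on PVE)
zfs rollback -r rpool/ROOT/pve-1@pre-experiment

# Repair the bootloader if needed (see proxmox-boot-tool),
# then export the pool cleanly and reboot
zpool export rpool
```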
 
Thank you. Although I was sceptical at first and didn't like your answer, I now realize this is a great way to test what I had in mind.

My own research shows I should be able to perform the following steps:
  1. destroy the OS (PVE), then
  2. boot from a (virtual) flash drive and
  3. try a rollback
Are these the recommended steps? Then afterwards I can experiment on destroying the bootloader.
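For the bootloader experiment, a hedged sketch of breaking and then repairing one disk's ESP; the partition numbering is an assumption (on a default PVE ZFS install the ESP is usually partition 2), and /dev/sdX2 is a placeholder:

```shell
# Break the bootloader on a single disk by wiping its ESP signatures
wipefs -a /dev/sdX2

# After verifying the system still boots from the other raidz2 members,
# re-create the ESP and re-register it with PVE's boot tool
proxmox-boot-tool format /dev/sdX2 --force
proxmox-boot-tool init /dev/sdX2
```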

I will try it out during the upcoming weeks and will most probably document my results here (for myself and anyone else interested).