What is the best way to recover quickly from a Proxmox/OS crash (due to power failure)?

Mani · Aug 5, 2020

We are planning to set up Proxmox on 20 nodes. We are studying it and working on a pilot project. Does anyone already have a solution or a better approach for recovering quickly in the event of Proxmox/OS corruption due to power failure or some other reason? (I am not talking about VM data.)

I think it can be done in the following ways, but I am looking for a better approach.

1. Copy the contents of "/etc/pve/" to some other server once every few hours. When the OS drive is corrupted, reinstall Proxmox on the OS drive, copy "/etc/pve/" back from the remote server, and bring up the server. Expected downtime: approx. 1 to 2 hours.
2. Have two identical drives (A and B) for the OS (maybe 200 GB each). Keep drive A as the primary boot device and sync the data between the two drives once every few hours. (Is that possible?) When the data on drive A is corrupted, boot from drive B. Note that drives A and B are not in any RAID. Expected downtime: approx. 15 minutes.

Is there any better way to recover quickly when the OS drive is corrupted?

One more question: which method is better?
1. Have two logical partitions, one for the OS and one for data (the data of all VMs).
2. Have two sets of disks, e.g. 200 GB disks for the OS and larger 2 TB disks for data.

Suggestion: I am learning and playing with Proxmox and am already happy with it. If Proxmox takes care of this problem, it will become a great platform, and more and more people will start using it.

aaron · Aug 5, 2020

Have you considered using two smaller disks for the OS and installing it in a mirrored RAID? If you want to use a software RAID, you can do so via the installer by using ZFS. The installer will make sure that both disks are bootable should one fail.

Otherwise, backing up /etc/pve and /etc/network/interfaces is a good approach for getting a freshly installed system back to its previous state.

Mani said:
One more question: which method is better?
1. Have two logical partitions, one for the OS and one for data (the data of all VMs).
2. Have two sets of disks, e.g. 200 GB disks for the OS and larger 2 TB disks for data.

How much data storage do you plan per server?

If you plan to have such a large cluster, you will want to think about the storage system [0]. If you want to use Ceph, you will need additional disks for each Ceph OSD anyway.

[0] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#chapter_storage

Mani · Aug 5, 2020

aaron said:
Have you considered using two smaller disks for the OS and installing it in a mirrored RAID? If you want to use a software RAID, you can do so via the installer by using ZFS. The installer will make sure that both disks are bootable should one fail.

I have handled 100+ servers in live environments. On Linux systems, the OS partition or disk can get corrupted for many reasons. With software RAID, won't both disks be corrupted as well? (If one gets corrupted, the other one will be corrupted too.)

I am looking for a solution where one disk holds a replica that is a few hours old and the active one has up-to-date data.
I think with ZFS we can replicate the data once every few hours. Let me try it and post the solution here.

aaron · Aug 5, 2020

Mani said:
On Linux systems, the OS partition or disk can get corrupted for many reasons. With software RAID, won't both disks be corrupted as well? (If one gets corrupted, the other one will be corrupted too.)

Corrupted how? In an MD RAID?

If you haven't used ZFS yet, take a look at its features. One of the really nice ones is that it checksums everything, data as well as metadata. Thus, if it encounters a block of data for which the checksum does not match, it will try to rewrite that block from a good copy (from the mirror or another level of redundancy).
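To make the checksum-and-repair idea concrete, here is a toy model in Python. It is only an illustration of the principle, not how ZFS is implemented: the `ToyMirror` class and its fields are made up for this sketch, and SHA-256 is used purely for convenience.

```python
import hashlib


def checksum(block: bytes) -> str:
    """Checksum of one block (SHA-256 chosen only for this illustration)."""
    return hashlib.sha256(block).hexdigest()


class ToyMirror:
    """Hypothetical two-way mirror: two copies per block, checksums kept separately."""

    def __init__(self, blocks):
        self.disk_a = list(blocks)
        self.disk_b = list(blocks)
        self.sums = [checksum(b) for b in blocks]

    def read(self, index: int) -> bytes:
        expected = self.sums[index]
        copies = (self.disk_a, self.disk_b)
        for i, disk in enumerate(copies):
            if checksum(disk[index]) == expected:
                # Rewrite every copy that failed verification before this one.
                for bad in copies[:i]:
                    bad[index] = disk[index]
                return disk[index]
        raise OSError(f"all copies of block {index} are corrupt")
```

If one copy of a block is damaged, `read` returns the good copy and rewrites the bad one. Only when both copies fail verification does the read fail, which is the case a plain mirror without checksums cannot even detect.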

Mani said:
I am looking for a solution where one disk holds a replica that is a few hours old and the active one has up-to-date data.

If you fear that some operations might render the system faulty, you can think about auto-snapshotting the ZFS datasets; if you encounter such a problem, you can roll the dataset back to a good snapshot.

That has saved me a few times by now on my personal computers, where I do not always think too hard about whether I should delete something.
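A minimal sketch of what such auto-snapshotting could look like, in Python. The dataset name `rpool/ROOT/pve-1`, the `auto-` prefix, and the retention count are assumptions for illustration; adjust them to your pool layout. Only the standard `zfs snapshot`, `zfs list`, and `zfs destroy` subcommands are used.

```python
import subprocess
from datetime import datetime

DATASET = "rpool/ROOT/pve-1"  # assumption: root dataset name, check with `zfs list`
PREFIX = "auto-"              # hypothetical naming prefix for these snapshots
KEEP = 12                     # hypothetical retention: keep the newest 12


def snapshot_name(dataset: str, now: datetime) -> str:
    """Build a name whose timestamp sorts chronologically as plain text."""
    return f"{dataset}@{PREFIX}{now:%Y%m%d-%H%M}"


def snapshots_to_destroy(names, dataset: str, keep: int):
    """Return the oldest auto snapshots of `dataset` beyond the newest `keep`."""
    wanted = f"{dataset}@{PREFIX}"
    auto = sorted(n for n in names if n.startswith(wanted))
    return auto[:-keep] if keep > 0 else auto


def run_once(dataset: str = DATASET, keep: int = KEEP) -> None:
    """Take one snapshot, then prune old ones. Needs root and the zfs CLI."""
    subprocess.run(
        ["zfs", "snapshot", snapshot_name(dataset, datetime.now())], check=True
    )
    listing = subprocess.run(
        ["zfs", "list", "-H", "-t", "snapshot", "-o", "name", "-r", dataset],
        check=True, capture_output=True, text=True,
    ).stdout.split()
    for name in snapshots_to_destroy(listing, dataset, keep):
        subprocess.run(["zfs", "destroy", name], check=True)
```

Calling `run_once()` from a cron job or systemd timer every few hours would give the "few hours old" copy asked about above, and `zfs rollback` can then return the dataset to one of those snapshots. Existing tools such as zfs-auto-snapshot or sanoid cover the same ground more robustly.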

spirit · Aug 5, 2020

You don't need to copy /etc/pve.

Just back up /var/lib/pve-cluster/config.db (/etc/pve is a FUSE directory built on top of this DB).

Reinstall a new server with the same hostname, then copy back

/var/lib/pve-cluster/config.db
/etc/corosync.conf
/etc/hostname
/etc/hosts
/etc/network/interfaces

and it should be OK.
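The backup half of this could be scripted roughly as follows. This is a sketch, assuming config.db is a SQLite file (so it is copied through Python's `sqlite3` backup API rather than a plain file copy while the node is running). The function name `backup_node_config` and the archive layout are made up for the example.

```python
import sqlite3
import tarfile
import tempfile
from pathlib import Path

CONFIG_DB = "/var/lib/pve-cluster/config.db"
PLAIN_FILES = [
    "/etc/corosync.conf",
    "/etc/hostname",
    "/etc/hosts",
    "/etc/network/interfaces",
]


def backup_node_config(archive_path, config_db=CONFIG_DB, plain_files=PLAIN_FILES):
    """Write a .tar.gz with a consistent copy of config_db plus the plain files.

    Returns the member names stored in the archive; missing plain files are skipped.
    """
    archived = []
    config_db = Path(config_db)
    with tempfile.TemporaryDirectory() as tmp, tarfile.open(archive_path, "w:gz") as tar:
        # Copy the database via the SQLite backup API so the copy is consistent.
        db_copy = Path(tmp) / "config.db"
        src = sqlite3.connect(f"file:{config_db}?mode=ro", uri=True)
        dst = sqlite3.connect(db_copy)
        try:
            src.backup(dst)
        finally:
            dst.close()
            src.close()
        member = str(config_db).lstrip("/")
        tar.add(db_copy, arcname=member)
        archived.append(member)

        for name in plain_files:
            path = Path(name)
            if path.exists():
                member = str(path).lstrip("/")
                tar.add(path, arcname=member)
                archived.append(member)
    return archived
```

Run as root every few hours, e.g. `backup_node_config("/root/node-config.tar.gz")`, and ship the archive to another machine. Restoring is then the copy-back step described above on a freshly installed node with the same hostname.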

guletz · Aug 5, 2020

Mani said:
We are planning to set up Proxmox on 20 nodes. We are studying it and working on a pilot project. Does anyone already have a solution or a better approach for recovering quickly in the event of Proxmox/OS corruption due to power failure or some other reason? (I am not talking about VM data.)

Hi,

My largest PMX cluster currently has only 7 nodes. Over several months I tested various scenarios in which one node crashes for whatever reason (a power problem was only one scenario), and I ran countless failure tests on different PMX nodes.
My standard nodes have only one SSD for the PMX OS and several HDDs for data (VM and CT), using AMD or Intel CPUs.

My own conclusion was this (with a less-than-minimal budget for hardware and software):

- It is more efficient to reinstall a PMX node and use a simple script that is run after installation (swap, sysctl, ...); maybe Ansible is better in your case.
- During this reinstall process, all the VMs/CTs can be started on other PMX nodes (ZFS replication and HA), and after the node is resurrected they can be migrated back.

With so many nodes, you can designate one of them as a reserve that you can use if some bad event happens.

Good luck / Bafta !

Mani · Aug 6, 2020

spirit said:
You don't need to copy /etc/pve.

Just back up /var/lib/pve-cluster/config.db (/etc/pve is a FUSE directory built on top of this DB).

Reinstall a new server with the same hostname, then copy back

/var/lib/pve-cluster/config.db
/etc/corosync.conf
/etc/hostname
/etc/hosts
/etc/network/interfaces

and it should be OK.

Thank you, I will test this before we go into production.

Also, we are thinking of having two logical volumes for the OS and syncing the data between them, with the following logical volumes/drives:
boot1 ==> This will be used all the time.
boot2 ==> This will hold the latest copy of the files mentioned above. If boot1 does not load, we will try boot2. This way, we can bring up the system within 10 minutes.
data
