Hello everyone,
I have a fundamental question about Proxmox that has been on my mind for quite some time, and I just can't seem to figure it out. It's about how to enable robust and reasonably fast failover even in small businesses with limited financial resources.
In my experience, small businesses often have lower requirements, meaning that downtime does not need to be zero or less than five minutes. However, it should not take several days to restore the failed services. The server should deliver sufficient performance for “normal operation,” but during “emergency operation” with a replacement server, compromises can be made for a certain period of time.
My considerations focus on the classic scenario, which does occur in practice: the server/Proxmox host fails completely and is no longer functional (defective hardware, compromised OS, power surge, water damage, whatever).
Basically, I would run Proxmox in such companies on rather simple server hardware with local ZFS storage on SSDs. Normal backups should be performed via Proxmox Backup Server in the LAN and nightly pull synchronisation to an external PBS, each with HDD NAS as cost-effective storage. This would ensure normal operation and a good backup with manageable resources. But what about recovery in the event of a server failure?
For cost reasons, cluster solutions with two or three identical hardware nodes with ZFS replication or Ceph are out of the question. In the event of a failure, getting a new server, setting everything up again, and restoring the VMs from the backup is also out of the question because it would take far too long. Just restoring a 2 TB VM (unfortunately, even small companies generate large amounts of data) takes an eternity in a classic environment without “all-flash, all-NVMe, Gbit/s LAN, etc.”
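A rough back-of-the-envelope estimate illustrates the problem. The throughput figures below are assumptions chosen for illustration, not measurements from any real setup, and the function name is made up for this sketch.

```python
def restore_hours(size_tb: float, throughput_mb_s: float) -> float:
    """Hours needed to copy size_tb terabytes at a sustained rate in MB/s."""
    size_mb = size_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return size_mb / throughput_mb_s / 3600


# Hypothetical sustained restore rates (assumptions, not benchmarks)
scenarios = {
    "HDD NAS, slow restore": 50,
    "1 Gbit/s LAN close to line rate": 110,
    "10 Gbit/s LAN with fast storage": 800,
}

for label, rate in scenarios.items():
    print(f"{label}: {restore_hours(2, rate):.1f} h")
```

Under these assumptions, a 2 TB restore at about 110 MB/s needs roughly five hours for the data copy alone, before any hardware procurement or setup time is counted.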
I have yet to find a viable solution that lies between these two extremes and enables a recovery time of 2-4 hours. Not everything has to be (semi-)automatic; some manual work according to an emergency plan is perfectly acceptable in this situation.
My ideal scenario would be: a slimmed-down replacement server without local storage (only very small storage for the OS) with Proxmox pre-installed, kept switched off as a “cold standby”. The VMs are regularly synchronised to a NAS (at least once a day, possibly more often), keeping several backup states. If the primary server fails, I switch on the replacement server, integrate the VMs from the NAS (preferably already prepared) and start them under Proxmox. Functionality would be verified by occasionally (2-3 times a year) booting up the replacement server, installing updates, and testing whether the VM replicas can be started from any of the backup states.
We have often implemented a solution like this with VMware and Veeam in the past, and it has worked quite well in practice. In a very simple variant, VMware could even be started from a prepared USB stick on any hardware, and the VMs could be mounted and booted from the backup on an external USB hard drive using Veeam Instant Recovery. The performance was generally not outstanding, but it was absolutely usable for the transition.
With Proxmox, however, I have not yet been able to develop a complete solution that fully implements the concept described above. I have pursued the following three approaches, but none of them has been successful:
1. True replication of the VMs is only possible at the block level on a second active server with local ZFS. Even if this were a slimmed-down server with HDDs instead of SSDs, and if it were only woken up once a day to sync and then returned to standby mode to save power, this seems impractical and error-prone to me. And as far as I know, this solution only allows you to fail over to the last replicated state. If this is also faulty (damaged or compromised), you have a problem because there are no previous states of the VMs on the replacement server, only in the normal backup. Or am I getting something wrong?
2. There are also various limitations with normal file- or block-level backups, e.g., to an NFS share on a NAS or to a Proxmox Backup Server. At the file level, the backup takes a long time and consumes a lot of space (always a complete backup without deduplication), and the file format is VMA rather than a directly usable native VM format such as RAW or QCOW2. At the block level, you don't have VM files either, only a large number of chunks. In neither case can a backup be directly integrated into PVE as a VM and started. The only option is a live restore, which does allow the VM to be started directly, but is very resource-intensive and always involves restoring to PVE storage.
3. According to my tests and research, there is also no suitable solution from third-party providers for storage and backup such as TrueNAS, Blockbridge, SEP, Veeam, etc. Either the storage on the primary server must already be external (e.g., TrueNAS with NFS or iSCSI), only the same mechanisms as PBS itself can be used (e.g., SEP with Live Restore), or you can certainly make a backup of Proxmox VMs, but instant recovery is only possible to other hypervisors (the current status at Veeam is to VMware or Hyper-V).
I would now be interested to know:
a) Have I understood the basic conditions correctly, or is there a major error in my thinking somewhere, and
b) What options are there for implementing a simple but functional recovery concept for temporary use until the primary server is repaired or replaced, as described?
Actually, I think it would be sufficient if, based on point 2, a VM with usable performance could be started from the backup without restoring it. This could be done either by creating the backup in a file-based format that PVE can use directly, such as RAW or QCOW2, preferably incrementally and with multiple states as with snapshots, or by ensuring that, for block-based backups, the requested blocks are not restored locally to PVE but are read directly from the backup for the selected backup state. Changes to the VM data during this temporary operating state would, of course, have to be written to a separate location, as with a snapshot or backup fleecing. And the backup would have to be created differently during the failover state, preferably still incrementally to the normal backup destination.
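To make the second variant more concrete, here is a toy model of what I have in mind. It is only a conceptual sketch in Python, not how PBS, QEMU or any existing tool actually works; the class `OverlayDisk` and its methods are hypothetical names invented for this illustration. Reads are served from the read-only backup state unless the block has been changed, and all writes land in a separate overlay.

```python
class OverlayDisk:
    """Toy model: reads fall through to a read-only backup, writes go to an overlay."""

    def __init__(self, backup_blocks: dict[int, bytes], block_size: int = 4096):
        self._backup = backup_blocks           # blocks of the selected backup state, never modified
        self._overlay: dict[int, bytes] = {}   # blocks changed during emergency operation
        self._block_size = block_size

    def read(self, index: int) -> bytes:
        # Prefer the changed block; otherwise read from the backup.
        if index in self._overlay:
            return self._overlay[index]
        # Blocks missing from the backup are treated as zero-filled.
        return self._backup.get(index, bytes(self._block_size))

    def write(self, index: int, data: bytes) -> None:
        if len(data) != self._block_size:
            raise ValueError("data must be exactly one block")
        self._overlay[index] = data            # the backup itself stays untouched

    def changed_blocks(self) -> list[int]:
        # These blocks would need to be backed up or merged after the failover.
        return sorted(self._overlay)


backup = {0: b"A" * 4096}
disk = OverlayDisk(backup)
disk.write(1, b"B" * 4096)
```

In this model the selected backup state never changes, so switching to an older state would only mean pointing at a different set of blocks, and the overlay holds exactly the data that has to be carried over once the primary server is back.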
Would something like this be feasible, or is it completely hopeless with current means? And what might be possible alternatives that I haven't thought of yet? I'm very excited to hear your answers!
Best regards,
Thorger