Simple concept for manual, short-term failover without significant downtime?

tahrens

New Member
Feb 4, 2026
Hamburg - Germany
Hello everyone,

I have a fundamental question about Proxmox that has been on my mind for quite some time, and I just can't seem to figure it out. It's about how to enable robust and reasonably fast failover even in small businesses with limited financial resources.
In my experience, small businesses often have lower requirements, meaning that downtime does not need to be zero or less than five minutes. However, it should not take several days to restore the failed services. The server should deliver sufficient performance for “normal operation,” but during “emergency operation” with a replacement server, compromises can be made for a certain period of time.
My considerations focus on the classic scenario, which does occur in practice, namely that the server/Proxmox host fails completely and is no longer functional (defective hardware, compromised OS, power surge, water damage, whatever...)

Basically, I would run Proxmox in such companies on rather simple server hardware with local ZFS storage on SSDs. Normal backups should be performed via Proxmox Backup Server in the LAN and nightly pull synchronisation to an external PBS, each with HDD NAS as cost-effective storage. This would ensure normal operation and a good backup with manageable resources. But what about recovery in the event of a server failure?

For cost reasons, cluster solutions with two or three identical hardware nodes with ZFS replication or Ceph are out of the question. In the event of a failure, getting a new server, setting everything up again, and restoring the VMs from the backup is also out of the question because it would take far too long. Just restoring a 2 TB VM (unfortunately, even small companies generate large amounts of data) takes an eternity in a classic environment without “all-flash, all-NVMe, Gbit/s LAN, etc.”
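To put a rough number on "takes an eternity", here is a back-of-the-envelope sketch. The throughput figures are assumptions for illustration (roughly what a 1 Gbit/s LAN or a slow HDD NAS might sustain), not measurements from a real setup.

```python
def restore_hours(size_tb: float, throughput_mb_s: float) -> float:
    """Hours needed to copy size_tb terabytes at a sustained rate in MB/s."""
    size_mb = size_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return size_mb / throughput_mb_s / 3600


if __name__ == "__main__":
    # Assumed sustained rates, purely illustrative.
    scenarios = [("1 Gbit/s LAN, ~110 MB/s", 110), ("slow HDD NAS, ~60 MB/s", 60)]
    for label, rate in scenarios:
        print(f"{label}: {restore_hours(2, rate):.1f} h for a 2 TB VM")
```

Even at the assumed 110 MB/s, a single 2 TB VM needs about five hours of pure transfer time, which already exceeds a 2-4 hour recovery target before any setup work is counted.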

I have yet to find a viable solution that lies between these two extremes and enables a recovery time of 2-4 hours. However, not everything has to be (semi-)automatic, some manual work according to an emergency plan is perfectly acceptable in this situation.

My ideal scenario would be: A slimmed-down replacement server without local storage (only very small storage for the OS) with Proxmox pre-installed (not switched on, only as a “cold standby”), regularly synchronising the VMs to a NAS (at least once a day, possibly more often), creating different backup states. If the primary server fails, switch on the replacement server, integrate the VMs from the NAS (preferably already prepared) and start them under Proxmox. Functionality will be ensured by occasionally (2-3 times a year) booting up the replacement server, installing updates, and testing whether the VM replicas can be started with any backup states from the backup.

We have often implemented a solution like this with VMware and Veeam in the past, and it has worked quite well in practice. In a very simple variant, VMware could even be started from a prepared USB stick on any hardware, and the VMs could be mounted and booted from the backup on an external USB hard drive using Veeam Instant Recovery. The performance was generally not outstanding, but it was absolutely usable for the transition.

However, based on Proxmox, I have not yet been able to develop a complete solution that would allow the described concept to be fully implemented. I have pursued the following three approaches, but none of them have been successful:

1. True replication of the VMs is only possible at the block level on a second active server with local ZFS. Even if this were a slimmed-down server with HDDs instead of SSDs, and if it were only woken up once a day to sync and then returned to standby mode to save power, this seems impractical and error-prone to me. And as far as I know, this solution only allows you to fail over to the last replicated state. If this is also faulty (damaged or compromised), you have a problem because there are no previous states of the VMs on the replacement server, only in the normal backup. Or am I getting something wrong?

2. There are also various limitations with normal file- or block-level backups, e.g., to an NFS share on a NAS or to a Proxmox Backup Server. At the file level, the backup takes a long time and consumes a lot of space (always a complete backup without deduplication), and the file format is VMA and not a directly usable native VM format such as RAW or QCOW2. At the block level, you also don't have VM files, but rather a large number of chunks. In neither case can a backup be directly integrated into PVE as a VM and started. The only option is a live restore, which does allow the VM to be started directly, but is very resource-intensive and always involves restoring to PVE storage.

3. According to my tests and research, there is also no suitable solution from third-party providers for storage and backup such as TrueNAS, Blockbridge, SEP, Veeam, etc. Either the storage on the primary server must already be external (e.g., TrueNAS with NFS or iSCSI), only the same mechanisms as PBS itself can be used (e.g., SEP with Live Restore), or you can certainly make a backup of Proxmox VMs, but instant recovery is only possible to other hypervisors (the current status at Veeam is to VMware or Hyper-V).

I would now be interested to know:
a) Have I understood the basic conditions correctly, or is there a major error in my thinking somewhere, and
b) What options are there for implementing a simple but functional recovery concept for temporary use until the primary server is repaired or replaced, as described?

Actually, I think it would be sufficient if, based on point 2, a VM with usable performance could be started from the backup without restoring it. This could be done either by creating the backup in a file-based format that can be used by PVE, such as RAW or QCOW2, preferably incrementally and with multiple states as with snapshots. Or by ensuring that, in the case of block-based backups, the requested blocks are not restored locally to PVE, but are read directly from the backup according to the selected backup state. Changes to the VM data during this temporary operating state would, of course, have to be written away separately, as with a snapshot or backup fleecing. And the backup would have to be created differently during the failover state, preferably still incrementally to the normal backup destination.
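To make the second idea concrete, here is a purely conceptual sketch of what I mean: reads are served from the unchanged backup state, and writes land in a separate overlay. The class and method names are hypothetical, and this is not how PBS or QEMU implement anything; it only illustrates the read/write split.

```python
class OverlayDisk:
    """Read-only backing blocks plus a separate store for blocks written during failover."""

    def __init__(self, backing: dict, block_size: int = 4096):
        self._backing = backing    # stands in for the selected backup state; never modified
        self._overlay = {}         # blocks changed during emergency operation
        self._block_size = block_size

    def read(self, block_no: int) -> bytes:
        # Prefer a block written during failover, else fall back to the backup.
        if block_no in self._overlay:
            return self._overlay[block_no]
        # Blocks missing from the backup are treated as zero-filled.
        return self._backing.get(block_no, bytes(self._block_size))

    def write(self, block_no: int, data: bytes) -> None:
        if len(data) != self._block_size:
            raise ValueError("data must be exactly one block")
        self._overlay[block_no] = data  # the backup itself stays untouched

    def dirty_blocks(self) -> list:
        """Block numbers changed since failover, i.e. what an incremental backup would need."""
        return sorted(self._overlay)
```

The list returned by `dirty_blocks()` corresponds to the data that would have to be backed up incrementally during the failover state and later merged back onto the repaired primary server.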

Would something like this be feasible, or is it completely hopeless with current means? And what might be possible alternatives that I haven't thought of yet? I'm very excited to hear your answers!

Best regards,
Thorger
 
One very low-end approach, with these assumptions:
  • you have two nearly identical PVE servers, not clustered
  • one is doing its job, the other is turned off
  • you have a sane PBS (with SSDs only, at least for the latest backup) - as you need to have daily backups anyway, right?
When the primary machine dies you turn on the standby machine. You'll need to restore the backup from last night, right?

Here comes the magic I wanted to mention: you can start the VM at the beginning of the restore process - it is called "live restore". It will run slowly, as the PBS has to deliver the required data on demand, but it will nevertheless start up immediately - without the need to restore 2 TB first.
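A minimal sketch of how such a live restore could be triggered from the standby node's shell, wrapped in Python so it can go into an emergency runbook. The storage names `pbs-lan` and `local-zfs`, the VM ID and the snapshot timestamp are placeholders I made up; check `qmrestore help` on your PVE version before relying on the exact options.

```python
import shlex


def live_restore_command(backup_volid: str, vmid: int, target_storage: str) -> list:
    """Build a qmrestore call that boots the VM while its data is still being restored."""
    return [
        "qmrestore", backup_volid, str(vmid),
        "--storage", target_storage,  # where the restored disks end up
        "--live-restore", "1",        # start the VM immediately, fetch blocks on demand
    ]


if __name__ == "__main__":
    # Placeholder values, not taken from a real system.
    cmd = live_restore_command(
        "pbs-lan:backup/vm/100/2026-02-03T22:00:00Z", 100, "local-zfs"
    )
    print(shlex.join(cmd))  # review first, then run on the standby PVE node
```

The script only prints the command instead of executing it, so it is safe to try out on any machine; the same restore can also be started from the web UI by ticking "Live restore" in the restore dialog.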

That's the smallest concept I can think of. For a commercial use case I would always recommend a cluster with at least two active nodes, plus the PBS to provide quorum. In this concept I would use ZFS replication, of course. A cluster has so many advantages...!

Just my 2 €¢ ;-)
 
The relevant questions to ask:
1. Would you want the failover to occur automatically, or with user intervention?
2. What is an acceptable outage period?
3. What is acceptable in terms of minimum performance (specifically, disk performance)?

IF the acceptable outage period is not too short AND requiring user intervention isn't a limitation (it can add hours, if not days, to failover), @UdoB outlined an effective approach.
 