Restore VM data after HA / replication sync issue – node failure scenario

RLMPve
New Member
Jan 30, 2026
Hi all,



I need some guidance on restoring VM data after an HA/replication related issue in a 2-node Proxmox cluster.



Environment


  • Proxmox VE cluster: 2 nodes
  • HA enabled
  • Storage used for VMs: ZFS / LVM-thin
  • Replication configured between nodes
  • Affected VM: VMID 2210

What happened

We experienced an issue where one node went down unexpectedly. Before the failure:



  • HA was configured for the VM
  • Replication jobs were active between nodes
  • The VM was running on Node A
  • Replicated copy existed on Node B

During/after the incident:

  • The node hosting the active VM failed
  • HA attempted to recover the VM
  • Now I’m in a situation where:
    • The VM disk on one side seems out of sync / inconsistent / missing recent data
    • I suspect the replicated volume may have a more recent or at least usable state


Status:
  • HA service for the VM is currently disabled to avoid further changes.
  • I have not restarted replication jobs yet.

What I need:


I want to understand the safest way to restore the VM using the replicated data without corrupting anything.


Specifically:


  1. How can I verify which replica snapshot is the latest consistent one? (I’ve sketched what I think I should check right after this list.)
  2. Is it safe to:
    • Detach the current disk
    • Promote the replicated volume on the secondary node
    • Attach it back to the VM config?
  3. With Proxmox replication, is there a recommended way to:
    • Break the replication relationship
    • Make the replica the new primary disk
  4. Are there logs that clearly show the last successful replication snapshot?
    • /var/log/pve/tasks/
    • journalctl
    • other?

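For reference, this is roughly what I think I should be checking; the dataset names are guesses based on the default naming, so please correct me if this is the wrong approach:

    # on Node B: list replication snapshots for VMID 2210
    # (Proxmox replication snapshots are named __replicate_<job>_<timestamp>__)
    zfs list -t snapshot -o name,creation | grep vm-2210

    # current state of the replication jobs on this node
    pvesr status

    # logs of the replication runner
    journalctl -u pvesr.service --since "2 days ago"

    # per-task logs (replication tasks show up here as well)
    ls -lt /var/log/pve/tasks/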

Goal


Bring the VM up using the most recent consistent disk state from replication, and then re-establish replication cleanly afterward.

I want to avoid:

  • Starting the VM on a partially synced disk
  • Accidentally overwriting the good replica with an older state



Any step-by-step guidance for this recovery scenario would be greatly appreciated.

Thanks!
 
In a two-node cluster, once one node fails, the remaining node no longer has quorum, so no fail-over from one host to the other will occur. HA might have tried to start the replica, but it will not have succeeded in doing so because of this, unless you configured external vote support [0].
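A quick way to confirm that on the surviving node (the fields below are from a typical pvecm status run; exact values will differ):

    # on the surviving node
    pvecm status
    # look for something like:
    #   Quorate:          No
    #   Expected votes:   2
    #   Total votes:      1
    #   Quorum:           2 Activity blocked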

Another option is to lower the expected quorum votes to 1 [1]. This can lead to data corruption if done improperly, especially on shared storage, so you should make sure that the failed node is powered off and cannot come back up on the network. It should be a last-resort step to bring essential VMs up or restore quorum.

So in theory, host A should have the most recent data, but it might be corrupted due to the host failure, whereas host B has the last replica, which will most likely be consistent.
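You can check how old the last replica on host B is by looking at the creation time of its replication snapshot, roughly like this (the dataset name is just an example for a default local-zfs setup):

    # on host B: the __replicate_* snapshot marks the last completed sync
    zfs list -t snapshot -o name,creation -s creation rpool/data/vm-2210-disk-0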

If the replica on B got started, the replication job will have switched automatically and now runs from B -> A. The VM running on B will stay there, barring any affinity rules [2].

If your VM state on node A is inconsistent or bad, you can force a migration to node B; this worked on my test setup (a command sketch follows the list below):

* Make sure node A is powered off; pull the network cable to be sure.
* Lower the expected quorum votes on node B to 1: pvecm expected 1
* Wait until the replica has started on node B; the replication job on node B will have switched to target node A and will probably fail since A is still down
* Start node A and check the cluster status with pvecm status; the expected votes should be back at 2
* You might have to configure replication again, though on my test setup it worked without manual intervention
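
Put together as commands, the sequence looked roughly like this on my test setup; treat it as a sketch rather than a recipe and verify each step against your own cluster state:

    # 1. node A powered off and disconnected from the network (verify physically)

    # 2. on node B: allow the single remaining node to act on its own
    pvecm expected 1

    # 3. watch HA recover the VM onto node B
    ha-manager status
    qm status 2210

    # 4. check the replication job (it now points at node A and will fail while A is down)
    pvesr status

    # 5. once node A is back up: confirm quorum is restored (expected votes back at 2)
    pvecm status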

[0]: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_corosync_external_vote_support
[1]: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_write_configuration_when_not_quorate
[2]: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#ha_manager_node_affinity_rules