We're currently testing ZFS replication in order to use it on our production servers.
I made a test cluster of 3 nested Proxmox VMs: at1, at2 and at3. Each node has a single 16GB drive with a ZFS pool.

I created a VM and a container (referred to as "VMs" from now on) on at1, enabled replication to at2 and at3, and configured HA so that at1 is the preferred node for the VMs.
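For reference, this is roughly how the replication jobs and the HA resource were set up from the CLI. The schedule and group name below are placeholders, not the exact values from my cluster:
Code:
# replicate VM 100 from at1 to at2 and at3 (the container got the same treatment)
pvesr create-local-job 100-0 at2 --schedule '*/15'
pvesr create-local-job 100-1 at3 --schedule '*/15'

# HA group with at1 as the highest-priority node
ha-manager groupadd prefer-at1 --nodes "at1:3,at2:2,at3:1"
ha-manager add vm:100 --group prefer-at1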
Then I simulate node failure by shutting down at1. The VMs migrate to at2, as expected. Replication also changes to at1 and at3, as expected. But often, about half of the time, the replication job for at3 fails with the following error log:
Code:
2022-02-01 06:05:00 100-1: start replication job
2022-02-01 06:05:00 100-1: guest => VM 100, running => 90283
2022-02-01 06:05:00 100-1: volumes => zfs:vm-100-disk-0
2022-02-01 06:05:02 100-1: end replication job with error: No common base to restore the job state
please delete jobid: 100-1 and create the job again
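As far as I understand, the error means the incremental send can't find a replication snapshot (the __replicate_100-1_* ones) that still exists on both the current source and at3. This is how I compare the two sides; the dataset path below is just an assumption based on my storage name, adjust it to your pool layout:
Code:
# on the node currently holding the guest (at2 after the failover):
zfs list -t snapshot -o name zfs/vm-100-disk-0
# on the replication target at3:
zfs list -t snapshot -o name zfs/vm-100-disk-0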
The only way to recover from it is to remove the volumes from at3 and run the job again, but that's obviously not practical for big volumes. Deleting and recreating the job, as the error message suggests, doesn't help.
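Concretely, the recovery that works for me amounts to something like this (again, the dataset path is assumed):
Code:
# on at3: destroy the stale replicated volume and its snapshots
zfs destroy -r zfs/vm-100-disk-0

# on the node currently running the guest: trigger the job again
pvesr schedule-now 100-1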
And for some reason it only happens for at3; at2 is always fine. Interestingly, the same doesn't seem to happen to at2 when I change the HA priority from at1→at2→at3 to at1→at3→at2, but maybe I just didn't test it for long enough.

Any ideas on what's causing the issue would be great!