We're currently testing ZFS replication in order to use it on our production servers.
I made a test cluster of 3 nested Proxmox VMs: at1, at2 and at3. Each node has a single 16GB drive with a ZFS pool.

I created a VM and a container (referred to as "VMs" from now on) on at1, enabled replication to at2 and at3, and configured HA so that at1 is the preferred node for the VMs.
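For reference, this is roughly how the replication jobs and the HA resource were set up from the CLI. The schedule and group name below are placeholders, not the exact values from my cluster:
Code:
# replicate VM 100 from at1 to at2 and at3 (the container got the same treatment)
pvesr create-local-job 100-0 at2 --schedule '*/15'
pvesr create-local-job 100-1 at3 --schedule '*/15'

# HA group with at1 as the highest-priority node
ha-manager groupadd prefer-at1 --nodes "at1:3,at2:2,at3:1"
ha-manager add vm:100 --group prefer-at1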
Then I simulate node failure by shutting down at1. The VMs migrate to at2, as expected. Replication also changes to at1 and at3, as expected. But often, about half of the time, the replication job for at3 fails with the following error log:
Code:
2022-02-01 06:05:00 100-1: start replication job
2022-02-01 06:05:00 100-1: guest => VM 100, running => 90283
2022-02-01 06:05:00 100-1: volumes => zfs:vm-100-disk-0
2022-02-01 06:05:02 100-1: end replication job with error: No common base to restore the job state
please delete jobid: 100-1 and create the job again
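As far as I understand, the error means the incremental send can't find a replication snapshot (the __replicate_100-1_* ones) that still exists on both the current source and at3. This is how I compare the two sides; the dataset path below is just an assumption based on my storage name, adjust it to your pool layout:
Code:
# on the node currently holding the guest (at2 after the failover):
zfs list -t snapshot -o name zfs/vm-100-disk-0
# on the replication target at3:
zfs list -t snapshot -o name zfs/vm-100-disk-0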
The only way to recover from it is to remove the volumes from at3 and run the job again, but that's obviously not practical for big volumes. Deleting and recreating the job, as the error message suggests, doesn't help.
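Concretely, the recovery that works for me amounts to something like this (again, the dataset path is assumed):
Code:
# on at3: destroy the stale replicated volume and its snapshots
zfs destroy -r zfs/vm-100-disk-0

# on the node currently running the guest: trigger the job again
pvesr schedule-now 100-1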
And for some reason it only happens for at3; at2 is always fine. Interestingly, the same doesn't seem to happen to at2 when I change the HA priority from at1→at2→at3 to at1→at3→at2, but maybe I just didn't test it for long enough.

Any ideas on what's causing the issue would be great!