ZFS replication sometimes fails

thokr · New Member · Jan 24, 2022
We're currently testing ZFS replication in order to use it on our production servers.

I made a test cluster of 3 nested Proxmox VMs: at1, at2 and at3. Each node has a single 16GB drive with a ZFS pool.

I created a VM and a container (referred to as "VMs" from now on) on at1, enabled replication to at2 and at3, and configured HA so that at1 is the preferred node for the VMs.

Then I simulate node failure by shutting down at1. The VMs migrate to at2, as expected. Replication also changes to at1 and at3, as expected. But often, roughly half the time, the replication job for at3 fails with the following error log:

Code:
2022-02-01 06:05:00 100-1: start replication job
2022-02-01 06:05:00 100-1: guest => VM 100, running => 90283
2022-02-01 06:05:00 100-1: volumes => zfs:vm-100-disk-0
2022-02-01 06:05:02 100-1: end replication job with error: No common base to restore the job state
please delete jobid: 100-1 and create the job again

The only way to recover from it is to remove the volumes from at3 and run the job again, but that's obviously not practical for big volumes. Deleting and re-creating the job as the error suggests doesn't help.
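For reference, the manual recovery looks roughly like this on my test cluster (dataset names are the ones from my setup; the job ID is the one from the error log):

Code:
# on at3: remove the stale replicated volumes of the guest
zfs destroy -r zfs/vm-100-disk-0
zfs destroy -r zfs/subvol-101-disk-0
# on the node currently running the guest: run the job again, which now has to do a full send
pvesr run -id 100-1 -verbose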

And for some reason it only happens for at3; at2 is always fine. Interestingly, the same doesn't seem to happen to at2 when I change the HA priority order from at1 > at2 > at3 to at1 > at3 > at2. But maybe I just didn't test it for long enough.

Any ideas on what's causing the issue would be great!
 
I've also just noticed this on my PVE 7.1-8 cluster. I'm also running 3 PVE nodes. Several production machines replicate between all 3 nodes in the cluster. Only one, my Nextcloud CT, shows an error when replicating from PVE1 to PVE2; there is no error replicating the CT from PVE1 to PVE3.
The other 2 PVE nodes are replicating VMs and CTs fine between themselves.

(screenshot attached: Proxmox Replication Error with Nextcloud Svr.png)
 
This issue is persistent on my 3-node cluster. I delete the replication job, and it comes back with errors a few days later. It happens across all my VMs and replication tasks.
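For reference, deleting and re-creating the job from the CLI looks roughly like this; the job ID, target node and schedule below are just examples, not my exact values:

Code:
# remove the existing replication job
pvesr delete 100-1
# re-create it towards the same target node
pvesr create-local-job 100-1 pve2 --schedule '*/15'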
 
hi,
sorry for the late answer...

can you post the vm config, as well as the output of
Code:
zfs list -t all
from all nodes?
 
Here's the config of a VM:
Code:
boot: order=scsi0;net0
cores: 1
memory: 1024
meta: creation-qemu=6.1.0,ctime=1643655674
name: vm1
net0: virtio=1E:EB:1B:A7:14:27,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: zfs:vm-100-disk-0,backup=0,size=4100M
scsihw: virtio-scsi-pci
smbios1: uuid=5c2930a4-2e93-433e-ac89-e1121606e504
sockets: 1
vmgenid: 89a2ce97-1ba1-4478-9b6e-e78bef0e91f2

Here's the config of a container:
Code:
arch: amd64
cores: 1
features: nesting=1
hostname: ct1
memory: 512
net0: name=eth0,bridge=vmbr0,firewall=1,hwaddr=E6:0E:2E:C7:8B:76,ip=dhcp,type=veth
ostype: debian
rootfs: zfs:subvol-101-disk-0,size=1052408K
swap: 512
unprivileged: 1

Here's the output of the command from all nodes:
Code:
NAME                                                   USED  AVAIL     REFER  MOUNTPOINT
zfs                                                   6.20G  8.81G       96K  /zfs
zfs/subvol-101-disk-0                                  437M   591M      437M  /zfs/subvol-101-disk-0
zfs/subvol-101-disk-0@__replicate_101-1_1651503185__     0B      -      437M  -
zfs/subvol-101-disk-0@__replicate_101-0_1651503190__     0B      -      437M  -
zfs/vm-100-disk-0                                     5.76G  12.9G     1.63G  -
zfs/vm-100-disk-0@__replicate_100-0_1651503170__         0B      -     1.63G  -
zfs/vm-100-disk-0@__replicate_100-1_1651503178__         0B      -     1.63G  -
Code:
NAME                                                   USED  AVAIL     REFER  MOUNTPOINT
zfs                                                   6.20G  8.81G      104K  /zfs
zfs/subvol-101-disk-0                                  437M   591M      437M  /zfs/subvol-101-disk-0
zfs/subvol-101-disk-0@__replicate_101-1_1651503185__     0B      -      437M  -
zfs/subvol-101-disk-0@__replicate_101-0_1651503190__     0B      -      437M  -
zfs/vm-100-disk-0                                     5.76G  12.9G     1.63G  -
zfs/vm-100-disk-0@__replicate_100-1_1650573900__         0B      -     1.63G  -
zfs/vm-100-disk-0@__replicate_100-0_1651503170__         0B      -     1.63G  -
Code:
NAME                                                   USED  AVAIL     REFER  MOUNTPOINT
zfs                                                   6.20G  8.82G       96K  /zfs
zfs/subvol-101-disk-0                                  437M   591M      437M  /zfs/subvol-101-disk-0
zfs/subvol-101-disk-0@__replicate_101-0_1650576606__     0B      -      437M  -
zfs/subvol-101-disk-0@__replicate_101-1_1651503185__     0B      -      437M  -
zfs/vm-100-disk-0                                     5.76G  12.9G     1.63G  -
zfs/vm-100-disk-0@__replicate_100-0_1651503170__         0B      -     1.63G  -
zfs/vm-100-disk-0@__replicate_100-1_1651503178__         0B      -     1.63G  -
 
The output from the 2nd and 3rd nodes (the 1st is turned off) when the error is present:
Code:
NAME                                                   USED  AVAIL     REFER  MOUNTPOINT
zfs                                                   6.20G  8.81G      104K  /zfs
zfs/subvol-101-disk-0                                  439M   590M      437M  /zfs/subvol-101-disk-0
zfs/subvol-101-disk-0@__replicate_101-0_1651508113__  1.41M      -      437M  -
zfs/vm-100-disk-0                                     5.77G  12.9G     1.64G  -
zfs/vm-100-disk-0@__replicate_100-0_1651508108__      2.14M      -     1.63G  -
Code:
NAME                                                   USED  AVAIL     REFER  MOUNTPOINT
zfs                                                   6.20G  8.82G       96K  /zfs
zfs/subvol-101-disk-0                                  437M   590M      437M  /zfs/subvol-101-disk-0
zfs/subvol-101-disk-0@__replicate_101-0_1651507745__   168K      -      437M  -
zfs/vm-100-disk-0                                     5.77G  12.9G     1.63G  -
zfs/vm-100-disk-0@__replicate_100-0_1651507740__       620K      -     1.63G  -

And after turning the 1st node back on:
Code:
NAME                                                   USED  AVAIL     REFER  MOUNTPOINT
zfs                                                   6.20G  8.81G       96K  /zfs
zfs/subvol-101-disk-0                                  438M   590M      437M  /zfs/subvol-101-disk-0
zfs/subvol-101-disk-0@__replicate_101-0_1651509907__   168K      -      437M  -
zfs/vm-100-disk-0                                     5.77G  12.9G     1.64G  -
zfs/vm-100-disk-0@__replicate_100-0_1651509902__         0B      -     1.64G  -
Code:
NAME                                                   USED  AVAIL     REFER  MOUNTPOINT
zfs                                                   6.21G  8.81G      104K  /zfs
zfs/subvol-101-disk-0                                  439M   590M      437M  /zfs/subvol-101-disk-0
zfs/subvol-101-disk-0@__replicate_101-0_1651508113__  1.41M      -      437M  -
zfs/subvol-101-disk-0@__replicate_101-0_1651508695__     0B      -      437M  -
zfs/vm-100-disk-0                                     5.77G  12.9G     1.64G  -
zfs/vm-100-disk-0@__replicate_100-0_1651508108__      2.15M      -     1.63G  -
zfs/vm-100-disk-0@__replicate_100-0_1651508692__         0B      -     1.64G  -
Code:
NAME                                                   USED  AVAIL     REFER  MOUNTPOINT
zfs                                                   6.20G  8.82G       96K  /zfs
zfs/subvol-101-disk-0                                  437M   590M      437M  /zfs/subvol-101-disk-0
zfs/subvol-101-disk-0@__replicate_101-0_1651507745__   168K      -      437M  -
zfs/vm-100-disk-0                                     5.77G  12.9G     1.63G  -
zfs/vm-100-disk-0@__replicate_100-0_1651507740__       620K      -     1.63G  -
 
thanks, i'll see if i can reproduce that (when i have time soon)
 
That's great! Thank you for your work! Please let everyone here know when it makes its way to the enterprise repository.
 
Is there any way to fix this before the patch goes live? Or do you expect it to go live shortly? Thank you.

Edit: Nuking the subvol on the broken destination node via zfs destroy fixed the issue for now. Replication job now functions.
 
Is there an update on this? I've been struggling with it for a bit.

Is there any way to fix this before the patch goes live? Or do you expect it to go live shortly? Thank you.

Edit: Nuking the subvol on the broken destination node via zfs destroy fixed the issue for now. Replication job now functions.
Do you mind passing on the steps to fix this manually?
 
After using it for a few weeks, I am facing this issue as well.

So is it confirmed that this patch will fix it?

Code:
src/PVE/ReplicationState.pm | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/PVE/ReplicationState.pm b/src/PVE/ReplicationState.pm
index 0a5e410..8eebb42 100644
--- a/src/PVE/ReplicationState.pm
+++ b/src/PVE/ReplicationState.pm
@@ -215,7 +215,7 @@ sub purge_old_states {
     my $tid = $plugin->get_unique_target_id($jobcfg);
     my $vmid = $jobcfg->{guest};
     $used_tids->{$vmid}->{$tid} = 1
-        if defined($vms->{ids}->{$vmid}); # && $vms->{ids}->{$vmid}->{node} eq $local_node;
+        if defined($vms->{ids}->{$vmid}) && $vms->{ids}->{$vmid}->{node} eq $local_node;
     }
 
     my $purge_state = sub {
--
2.30.2
 

(attachment: Screenshot from 2022-09-08 16-53-27.png)
So is it confirmed that this patch will fix it?
At least I couldn't reproduce the issue with this patch applied.

it's already applied, but pve-guest-common is not bumped yet. should be fixed with libpve-guest-common 4.1-3 (or higher)
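once the bumped package lands in your configured repository, something along these lines should confirm you have it (4.1-3 or higher, as mentioned above):

Code:
# show the currently installed version
pveversion -v | grep libpve-guest-common-perl
# pull the update when it is available
apt update && apt full-upgrade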
 
Thanks. How can I get it to update to libpve-guest-common 4.1-3? It says everything is up to date.

What repo do I need to enable?

Code:
proxmox-ve: 7.2-1 (running kernel: 5.15.39-4-pve)
pve-manager: 7.2-7 (running version: 7.2-7/d0dd0e85)
pve-kernel-5.15: 7.2-10
pve-kernel-helper: 7.2-10
pve-kernel-5.15.53-1-pve: 5.15.53-1
pve-kernel-5.15.39-4-pve: 5.15.39-4
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph-fuse: 15.2.16-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-3
libpve-storage-perl: 7.2-8
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.5-1
proxmox-backup-file-restore: 2.2.5-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-2
pve-container: 4.2-2
pve-docs: 7.2-2
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-6
pve-firmware: 3.5-1
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 7.0.0-3
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.5-pve1

Also, my attempt to delete the snapshot and start again seems to have failed.

(screenshots attached: Screenshot from 2022-09-09 06-33-16.png, Screenshot from 2022-09-09 06-30-11.png)
 
In my case, deleting the subvol once did not resolve the issue; I kept deleting it until it told me everything was fine. Not sure if this will help someone.
 
So I nuked everything containing "rpool/data/vm-100" on the failed node. I'm also not sure why the GUI doesn't remove these datasets, or even show the ZFS structure apart from just the pool.

Code:
zfs destroy -r rpool/data/vm-100-state-MigratedPreDataUpdate
zfs destroy -r rpool/data/vm-100
zfs destroy -r rpool/data/vm-100-disk-1
zfs destroy -r rpool/data/vm-100-disk-0
.. etc
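To double-check before re-running the job, something like this should do it (the job ID here is just a placeholder for mine):

Code:
# on the wiped node: confirm nothing of the guest's datasets is left
zfs list -t all | grep vm-100
# on the node currently running the guest: trigger a full resync
pvesr run -id 100-0 -verbose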


I'm also unsure what's causing the migration issues that were failing before. I wanted to pull the load from the laptop back to the virtual machines, so I took the laptop node (which was running vm-100) offline and started vm-100 up on another node:

Code:
#pvecm expected 1
#mv /etc/pve/nodes/pvelaptop/qemu-server/* /etc/pve/nodes/pveextra/qemu-server/

I redid the replication job, but I'm still getting migration errors, and I'm not even trying a migration.

(screenshot attached: Screenshot from 2022-09-09 08-45-29.png)
 
Sometimes I get a replication error; about 30% of the runs fail. Example at 16:00:
Error:
command 'set -o pipefail && pvesm export local-zfs:vm-8113-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_8113-0_1699891200__ -base __replicate_8113-0_1699884000__ | /usr/bin/cstream -t 50000000 | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=nl-144013' root@IP -- pvesm import local-zfs:vm-8113-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_8113-0_1699891200__' failed: exit code 255
But at 18:00 there was no error.
(screenshot attached: Screenshot 2023-11-13 at 18.42.27.png)
There is only one VM on this node and several VMs on the target node. There is no high load on any of them.
Proxmox 8 is installed on this node and version 7 on the target node.
 
And the strangest thing is that if I run the command from the console 20-30 times at intervals of every 15-30 minutes, the error does not occur.
Code:
pvesr run -id 8113-0 -verbose
But when the system runs it by itself, it sometimes returns an error. Why?
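In case it helps with debugging, I also look at the scheduler's view of the jobs; the state file path below is my assumption about where PVE keeps it:

Code:
# list local replication jobs with last sync time and failure count
pvesr status
# raw per-job replication state kept by pve-manager (path assumed)
cat /var/lib/pve-manager/pve-replication-state.json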
 
