ZFS replication sometimes fails

thokr

New Member
We're currently testing ZFS replication in order to use it on our production servers.

I made a test cluster of 3 nested Proxmox VMs: at1, at2 and at3. Each node has a single 16GB drive with a ZFS pool.

I created a VM and a container (referred to collectively as "VMs" from now on) on at1, enabled replication to at2 and at3, and configured HA so that at1 is the preferred node for the VMs.
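
For anyone wanting to reproduce a setup like this from the CLI, here is a rough sketch (the job IDs, schedule and HA group name are illustrative, not taken from my actual configuration):

Code:
# Replication jobs for guest 100 from at1 to the two other nodes, every 15 minutes
pvesr create-local-job 100-0 at2 --schedule '*/15'
pvesr create-local-job 100-1 at3 --schedule '*/15'

# HA group preferring at1, then register the guest as an HA resource
ha-manager groupadd prefer-at1 --nodes "at1:2,at2:1,at3:1"
ha-manager add vm:100 --group prefer-at1 --state started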

Then I simulate a node failure by shutting down at1. The VMs migrate to at2, as expected, and the replication targets change to at1 and at3, also as expected. But often, about half the time, the replication job for at3 fails with the following error log:

Code:
2022-02-01 06:05:00 100-1: start replication job
2022-02-01 06:05:00 100-1: guest => VM 100, running => 90283
2022-02-01 06:05:00 100-1: volumes => zfs:vm-100-disk-0
2022-02-01 06:05:02 100-1: end replication job with error: No common base to restore the job state
please delete jobid: 100-1 and create the job again

The only way I've found to recover is to remove the volumes from at3 and run the job again, but that's obviously not practical for big volumes. Deleting and recreating the job, as the error suggests, doesn't help.
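
In case it helps, the manual recovery boils down to something like this (dataset and job names match the test setup above; pvesr schedule-now is just one way to re-trigger the job, waiting for the next scheduled run works too):

Code:
# On at3, the failing target: remove the leftover replicated volume and its snapshots
zfs destroy -r zfs/vm-100-disk-0

# On the node currently running the guest: run the job again so it does a full resync
pvesr schedule-now 100-1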

And for some reason it only happens for at3; at2 is always fine. Interestingly, the same doesn't seem to happen to at2 when I change the HA priority order from at1, at2, at3 to at1, at3, at2. But maybe I just haven't tested it for long enough.

Any ideas on what's causing the issue would be great!
 

raistlinkell

Member
I've also just noticed this on my PVE 7.1-8 cluster. I'm also running 3 PVE nodes, with several production machines replicating between all 3 nodes. Only one, my Nextcloud CT, shows an error when replicating from PVE1 to PVE2; there is no error when replicating the same CT from PVE1 to PVE3.
The other 2 PVE nodes are replicating VMs and CTs fine between themselves.

[Attachment: Proxmox Replication Error with Nextcloud Svr.png]
 

timdonovan

Member
This issue is persistent on my 3-node cluster. I delete the replication job, and it comes back with errors a few days later. It happens across all my VMs and replication tasks.
 

dcsapak

Proxmox Staff Member
Hi,
sorry for the late answer...

Can you post the VM config, as well as the output of
Code:
zfs list -t all
from all nodes?
 

thokr

New Member
Here's the config of a VM:
Code:
boot: order=scsi0;net0
cores: 1
memory: 1024
meta: creation-qemu=6.1.0,ctime=1643655674
name: vm1
net0: virtio=1E:EB:1B:A7:14:27,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: zfs:vm-100-disk-0,backup=0,size=4100M
scsihw: virtio-scsi-pci
smbios1: uuid=5c2930a4-2e93-433e-ac89-e1121606e504
sockets: 1
vmgenid: 89a2ce97-1ba1-4478-9b6e-e78bef0e91f2

Here's the config of a container:
Code:
arch: amd64
cores: 1
features: nesting=1
hostname: ct1
memory: 512
net0: name=eth0,bridge=vmbr0,firewall=1,hwaddr=E6:0E:2E:C7:8B:76,ip=dhcp,type=veth
ostype: debian
rootfs: zfs:subvol-101-disk-0,size=1052408K
swap: 512
unprivileged: 1

Here's the output of the command from all nodes:
Code:
NAME                                                   USED  AVAIL     REFER  MOUNTPOINT
zfs                                                   6.20G  8.81G       96K  /zfs
zfs/subvol-101-disk-0                                  437M   591M      437M  /zfs/subvol-101-disk-0
zfs/subvol-101-disk-0@__replicate_101-1_1651503185__     0B      -      437M  -
zfs/subvol-101-disk-0@__replicate_101-0_1651503190__     0B      -      437M  -
zfs/vm-100-disk-0                                     5.76G  12.9G     1.63G  -
zfs/vm-100-disk-0@__replicate_100-0_1651503170__         0B      -     1.63G  -
zfs/vm-100-disk-0@__replicate_100-1_1651503178__         0B      -     1.63G  -
Code:
NAME                                                   USED  AVAIL     REFER  MOUNTPOINT
zfs                                                   6.20G  8.81G      104K  /zfs
zfs/subvol-101-disk-0                                  437M   591M      437M  /zfs/subvol-101-disk-0
zfs/subvol-101-disk-0@__replicate_101-1_1651503185__     0B      -      437M  -
zfs/subvol-101-disk-0@__replicate_101-0_1651503190__     0B      -      437M  -
zfs/vm-100-disk-0                                     5.76G  12.9G     1.63G  -
zfs/vm-100-disk-0@__replicate_100-1_1650573900__         0B      -     1.63G  -
zfs/vm-100-disk-0@__replicate_100-0_1651503170__         0B      -     1.63G  -
Code:
NAME                                                   USED  AVAIL     REFER  MOUNTPOINT
zfs                                                   6.20G  8.82G       96K  /zfs
zfs/subvol-101-disk-0                                  437M   591M      437M  /zfs/subvol-101-disk-0
zfs/subvol-101-disk-0@__replicate_101-0_1650576606__     0B      -      437M  -
zfs/subvol-101-disk-0@__replicate_101-1_1651503185__     0B      -      437M  -
zfs/vm-100-disk-0                                     5.76G  12.9G     1.63G  -
zfs/vm-100-disk-0@__replicate_100-0_1651503170__         0B      -     1.63G  -
zfs/vm-100-disk-0@__replicate_100-1_1651503178__         0B      -     1.63G  -
 

thokr

New Member
The output from the 2nd and 3rd nodes (the 1st is shut down) while the error is present:
Code:
NAME                                                   USED  AVAIL     REFER  MOUNTPOINT
zfs                                                   6.20G  8.81G      104K  /zfs
zfs/subvol-101-disk-0                                  439M   590M      437M  /zfs/subvol-101-disk-0
zfs/subvol-101-disk-0@__replicate_101-0_1651508113__  1.41M      -      437M  -
zfs/vm-100-disk-0                                     5.77G  12.9G     1.64G  -
zfs/vm-100-disk-0@__replicate_100-0_1651508108__      2.14M      -     1.63G  -
Code:
NAME                                                   USED  AVAIL     REFER  MOUNTPOINT
zfs                                                   6.20G  8.82G       96K  /zfs
zfs/subvol-101-disk-0                                  437M   590M      437M  /zfs/subvol-101-disk-0
zfs/subvol-101-disk-0@__replicate_101-0_1651507745__   168K      -      437M  -
zfs/vm-100-disk-0                                     5.77G  12.9G     1.63G  -
zfs/vm-100-disk-0@__replicate_100-0_1651507740__       620K      -     1.63G  -

And after turning the 1st node back on:
Code:
NAME                                                   USED  AVAIL     REFER  MOUNTPOINT
zfs                                                   6.20G  8.81G       96K  /zfs
zfs/subvol-101-disk-0                                  438M   590M      437M  /zfs/subvol-101-disk-0
zfs/subvol-101-disk-0@__replicate_101-0_1651509907__   168K      -      437M  -
zfs/vm-100-disk-0                                     5.77G  12.9G     1.64G  -
zfs/vm-100-disk-0@__replicate_100-0_1651509902__         0B      -     1.64G  -
Code:
NAME                                                   USED  AVAIL     REFER  MOUNTPOINT
zfs                                                   6.21G  8.81G      104K  /zfs
zfs/subvol-101-disk-0                                  439M   590M      437M  /zfs/subvol-101-disk-0
zfs/subvol-101-disk-0@__replicate_101-0_1651508113__  1.41M      -      437M  -
zfs/subvol-101-disk-0@__replicate_101-0_1651508695__     0B      -      437M  -
zfs/vm-100-disk-0                                     5.77G  12.9G     1.64G  -
zfs/vm-100-disk-0@__replicate_100-0_1651508108__      2.15M      -     1.63G  -
zfs/vm-100-disk-0@__replicate_100-0_1651508692__         0B      -     1.64G  -
Code:
NAME                                                   USED  AVAIL     REFER  MOUNTPOINT
zfs                                                   6.20G  8.82G       96K  /zfs
zfs/subvol-101-disk-0                                  437M   590M      437M  /zfs/subvol-101-disk-0
zfs/subvol-101-disk-0@__replicate_101-0_1651507745__   168K      -      437M  -
zfs/vm-100-disk-0                                     5.77G  12.9G     1.63G  -
zfs/vm-100-disk-0@__replicate_100-0_1651507740__       620K      -     1.63G  -
 

dcsapak

Proxmox Staff Member
Thanks, I'll see if I can reproduce that (when I have time soon).
 

thokr

New Member
That's great! Thank you for your work! Please let everyone here know when it makes its way to the enterprise repository.
 

argylesocks

New Member
Is there any way to fix this before the patch goes live? Or do you expect it to go live shortly? Thank you.

Edit: Nuking the subvol on the broken destination node via zfs destroy fixed the issue for now. Replication job now functions.
 

bzb-rs

Member
Is there an update on this? I've been struggling with it for a bit.

argylesocks said:
    Is there any way to fix this before the patch goes live? Or do you expect it to go live shortly? Thank you.
    Edit: Nuking the subvol on the broken destination node via zfs destroy fixed the issue for now. Replication job now functions.

Do you mind passing on the steps to fix this manually?
 

MasterCATZ

New Member
After a few weeks of use I'm facing this issue as well.

So is it confirmed that this code will fix it?

Code:
src/PVE/ReplicationState.pm | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/PVE/ReplicationState.pm b/src/PVE/ReplicationState.pm
index 0a5e410..8eebb42 100644
--- a/src/PVE/ReplicationState.pm
+++ b/src/PVE/ReplicationState.pm
@@ -215,7 +215,7 @@ sub purge_old_states {
     my $tid = $plugin->get_unique_target_id($jobcfg);
     my $vmid = $jobcfg->{guest};
     $used_tids->{$vmid}->{$tid} = 1
-        if defined($vms->{ids}->{$vmid}); # && $vms->{ids}->{$vmid}->{node} eq $local_node;
+        if defined($vms->{ids}->{$vmid}) && $vms->{ids}->{$vmid}->{node} eq $local_node;
     }
 
     my $purge_state = sub {
--
2.30.2
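
As a quick way to check whether a node's installed module already contains this one-line change, something like the following should work (the path is where libpve-guest-common-perl normally installs its Perl modules, so treat it as an assumption):

Code:
# Show the condition the patch touches; in the unpatched module the
# "eq $local_node" check is commented out, in the fixed one it is active
grep -n -A 1 'used_tids' /usr/share/perl5/PVE/ReplicationState.pm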
 

[Attachment: Screenshot from 2022-09-08 16-53-27.png]

dcsapak

Proxmox Staff Member
MasterCATZ said:
    So is it confirmed that this code will fix it?

At least I couldn't reproduce the issue anymore with this patch applied.

It's already applied (in git), but the pve-guest-common package is not bumped yet. It should be fixed with libpve-guest-common 4.1-3 (or higher).
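
To see what is currently installed and pick up the bumped package once it reaches your configured repository, the usual commands apply:

Code:
pveversion -v | grep guest-common
apt policy libpve-guest-common-perl

# once 4.1-3 (or higher) shows up as a candidate
apt update && apt full-upgrade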
 

MasterCATZ

New Member
Thanks. How can I get it to update to libpve-guest-common 4.1-3? It says everything is up to date.

What repo do I need to enable?

Code:
proxmox-ve: 7.2-1 (running kernel: 5.15.39-4-pve)
pve-manager: 7.2-7 (running version: 7.2-7/d0dd0e85)
pve-kernel-5.15: 7.2-10
pve-kernel-helper: 7.2-10
pve-kernel-5.15.53-1-pve: 5.15.53-1
pve-kernel-5.15.39-4-pve: 5.15.39-4
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph-fuse: 15.2.16-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-3
libpve-storage-perl: 7.2-8
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.5-1
proxmox-backup-file-restore: 2.2.5-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-2
pve-container: 4.2-2
pve-docs: 7.2-2
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-6
pve-firmware: 3.5-1
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 7.0.0-3
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.5-pve1

Also, my attempt to delete the snapshot and start again seems to have failed (see attached screenshots).

[Attachments: Screenshot from 2022-09-09 06-33-16.png, Screenshot from 2022-09-09 06-30-11.png]
 

bzb-rs

Member
In my case, deleting the subvol once did not resolve the issue; I kept deleting the subvol until it told me everything was fine. Not sure if this will help someone.
 

MasterCATZ

New Member
So I nuked everything containing "rpool/data/vm-100" on the failed node. I'm also not sure why the GUI does not remove them, or even show the ZFS structure in the GUI beyond just the pool.

Code:
zfs destroy -r rpool/data/vm-100-state-MigratedPreDataUpdate
zfs destroy -r rpool/data/vm-100
zfs destroy -r rpool/data/vm-100-disk-1
zfs destroy -r rpool/data/vm-100-disk-0
# ...and so on for any remaining vm-100 datasets

I'm also unsure what was causing the migration failures from before. I wanted to pull the load from the laptop back to the virtual machines, so I took the laptop node (which was running vm-100) offline and started vm-100 up on another node:

Code:
pvecm expected 1
mv /etc/pve/nodes/pvelaptop/qemu-server/* /etc/pve/nodes/pveextra/qemu-server/

I redid the replication job, but I'm still getting migration errors, and I'm not even trying a migration.

[Attachment: Screenshot from 2022-09-09 08-45-29.png]
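
For what it's worth, when chasing leftover replication state like this, a couple of read-only checks can help narrow things down (the state file path below is the one a standard PVE install uses; treat it as an assumption and only read it, don't edit it):

Code:
# Replication jobs and their last run/status on this node
pvesr status

# Per-node replication state kept by pve-manager
cat /var/lib/pve-manager/pve-replication-state.json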
 
