CT Replication Job Failing After the First One

Lopicl

Hello everyone,

I am running into an issue with replication of one Debian LXC container.

I have 3 nodes in my PVE cluster:
- pve (primary, master)
- dr (secondary failover)
- Raspberry Pi Zero (QDevice)

This container (CT 100) has two subvols attached to it:
- One on local-zfs
- One on data-zfs

The issue I am having is that after the first replication job completes successfully, all the subsequent ones always fail with an I/O error on the data-zfs subvol.
This is the replication log:

Code:
2023-05-18 12:37:05 100-0: start replication job
2023-05-18 12:37:05 100-0: guest => CT 100, running => 1
2023-05-18 12:37:05 100-0: volumes => data-zfs:subvol-100-disk-0,local-zfs:subvol-100-disk-0
2023-05-18 12:37:06 100-0: freeze guest filesystem
2023-05-18 12:37:06 100-0: create snapshot '__replicate_100-0_1684406225__' on data-zfs:subvol-100-disk-0
2023-05-18 12:37:06 100-0: create snapshot '__replicate_100-0_1684406225__' on local-zfs:subvol-100-disk-0
2023-05-18 12:37:06 100-0: thaw guest filesystem
2023-05-18 12:37:06 100-0: using secure transmission, rate limit: none
2023-05-18 12:37:06 100-0: full sync 'data-zfs:subvol-100-disk-0' (__replicate_100-0_1684406225__)
2023-05-18 12:37:07 100-0: full send of data-zfs/subvol-100-disk-0@__replicate_100-0_1684406225__ estimated size is 246G
2023-05-18 12:37:07 100-0: total estimated size is 246G
2023-05-18 12:37:08 100-0: volume 'data-zfs/subvol-100-disk-0' already exists
2023-05-18 12:37:08 100-0: warning: cannot send 'data-zfs/subvol-100-disk-0@__replicate_100-0_1684406225__': signal received
2023-05-18 12:37:08 100-0: cannot send 'data-zfs/subvol-100-disk-0': I/O error
2023-05-18 12:37:08 100-0: command 'zfs send -Rpv -- data-zfs/subvol-100-disk-0@__replicate_100-0_1684406225__' failed: exit code 1
2023-05-18 12:37:08 100-0: delete previous replication snapshot '__replicate_100-0_1684406225__' on data-zfs:subvol-100-disk-0
2023-05-18 12:37:08 100-0: delete previous replication snapshot '__replicate_100-0_1684406225__' on local-zfs:subvol-100-disk-0
2023-05-18 12:37:08 100-0: end replication job with error: command 'set -o pipefail && pvesm export data-zfs:subvol-100-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_100-0_1684406225__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=dr' root@10.10.0.15 -- pvesm import data-zfs:subvol-100-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_100-0_1684406225__ -allow-rename 0' failed: exit code 255

I already tried deleting the subvol on the receiving data-zfs storage, but the result is always the same. Restarting the whole cluster doesn't change anything.
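
To double-check what is actually still present for CT 100 on the receiving node, something like this can be run from the source node (a sketch, assuming dr's address 10.10.0.15 and the dataset names from the log above):

Code:
# List the CT 100 datasets and any snapshots that still exist on the target (dr)
ssh root@10.10.0.15 zfs list -r -t all data-zfs/subvol-100-disk-0
ssh root@10.10.0.15 zfs list -r -t all rpool/data/subvol-100-disk-0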
 
Hi,
that is strange. The replication seems to think there is no previous replication and tries to do a full send, which then fails of course. Are you doing anything with the volumes or snapshots in between the first and second replication?

After cleaning the target and doing the first replication, can you run zfs list -t snapshot <zfs path to the subvol> and cat /var/lib/pve-manager/pve-replication-state.json? Please also post the output of pveversion -v from both involved nodes and your storage configuration (cat /etc/pve/storage.cfg).
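
For reference, the requested diagnostics in one place; a sketch, using the CT 100 subvol paths that come up later in this thread:

Code:
# On the source node (pve), after cleaning the target and running the first replication
zfs list -t snapshot data-zfs/subvol-100-disk-0
zfs list -t snapshot rpool/data/subvol-100-disk-0
cat /var/lib/pve-manager/pve-replication-state.json

# On both involved nodes
pveversion -v
cat /etc/pve/storage.cfg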
 
OK, so basically I started a new replication job after deleting the data-zfs volume on the receiving end; at the end, when it was replicating the local-zfs volume that I hadn't cleared, it gave the same error.

Running zfs list -t snapshot data-zfs/subvol-100-disk-0 I get "no datasets available", and the same for the local-zfs subvol (rpool/data/subvol-100-disk-0).
However, when running zfs list -t snapshot rpool/data/subvol-200-disk-0, which is another CT's disk, I get:
Code:
NAME                                                          USED  AVAIL     REFER  MOUNTPOINT
rpool/data/subvol-200-disk-0@__replicate_200-0_1684777807__   108K      -     1.29G  -
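
To compare across all guests at once, every replication snapshot on the node can also be listed in one go (a sketch):

Code:
# List all __replicate_ snapshots on this node, oldest first
zfs list -t snapshot -o name,creation -s creation | grep __replicate_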

This is the output of cat /var/lib/pve-manager/pve-replication-state.json:
Code:
{"100":{"local/dr":{"last_iteration":1684778403,"last_try":1684778418,"last_node":"pve","last_sync":0,"error":"command 'set -o pipefail && pvesm export data-zfs:subvol-100-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_100-0_1684778418__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=dr' root@10.10.0.15 -- pvesm import data-zfs:subvol-100-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_100-0_1684778418__ -allow-rename 0' failed: exit code 255","fail_count":3,"storeid_list":["data-zfs","local-zfs"],"duration":3.034062}},"250":{"local/dr":{"last_iteration":1684778403,"last_try":1684778407,"last_node":"pve","last_sync":1684778407,"fail_count":0,"storeid_list":["local-zfs"],"duration":4.145558}},"200":{"local/dr":{"last_iteration":1684778403,"last_try":1684778403,"last_node":"pve","last_sync":1684778403,"fail_count":0,"storeid_list":["local-zfs"],"duration":4.403275}},"500":{"local/dr":{"last_iteration":1684778403,"last_try":1684778411,"last_node":"pve","last_sync":1684778411,"fail_count":0,"storeid_list":["local-zfs"],"duration":6.613591}}}


pveversion -v of pve:
Code:
proxmox-ve: 7.4-1 (running kernel: 6.2.11-2-pve)
pve-manager: 7.4-3 (running version: 7.4-3/9002ab8a)
pve-kernel-6.2: 7.4-3
pve-kernel-5.15: 7.4-3
pve-kernel-6.2.11-2-pve: 6.2.11-2
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.15.102-1-pve: 5.15.102-1
ceph-fuse: 15.2.17-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-1
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.6
libpve-storage-perl: 7.4-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.1-1
proxmox-backup-file-restore: 2.4.1-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.6.5
pve-cluster: 7.3-3
pve-container: 4.4-3
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-2
pve-firewall: 4.3-2
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-1
qemu-server: 7.4-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1

pveversion -v of dr:
Code:
proxmox-ve: 7.4-1 (running kernel: 6.2.11-2-pve)
pve-manager: 7.4-3 (running version: 7.4-3/9002ab8a)
pve-kernel-6.2: 7.4-3
pve-kernel-5.15: 7.4-3
pve-kernel-6.2.11-2-pve: 6.2.11-2
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.15.102-1-pve: 5.15.102-1
ceph-fuse: 15.2.17-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-1
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.6
libpve-storage-perl: 7.4-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.1-1
proxmox-backup-file-restore: 2.4.1-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.6.5
pve-cluster: 7.3-3
pve-container: 4.4-3
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-2
pve-firewall: 4.3-2
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-1
qemu-server: 7.4-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1

cat /etc/pve/storage.cfg on pve:
Code:
dir: local
        path /var/lib/vz
        content vztmpl,backup,iso

zfspool: local-zfs
        pool rpool/data
        content rootdir,images
        sparse 1

zfspool: data-zfs
        pool data-zfs
        content images,rootdir
        mountpoint /data-zfs
        sparse 0

cat /etc/pve/storage.cfg on dr:
Code:
dir: local
        path /var/lib/vz
        content vztmpl,backup,iso

zfspool: local-zfs
        pool rpool/data
        content rootdir,images
        sparse 1

zfspool: data-zfs
        pool data-zfs
        content images,rootdir
        mountpoint /data-zfs
        sparse 0

I want to point out that I had the same problem even before updating to the 6.1 kernel.
 
OK, so basically I started a new replication job after deleting the data-zfs volume on the receiving end; at the end, when it was replicating the local-zfs volume that I hadn't cleared, it gave the same error.
You need to clear both, of course. Otherwise the error is fully expected, because there is a conflicting volume on the target.
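
In practice that means removing the leftover CT 100 volumes on the receiving node before the next run; a sketch, assuming the dataset names from this thread (destructive, so make sure nothing else on dr uses these datasets):

Code:
# On the receiving node (dr): destroy the stale copies of both subvols,
# then let the replication job do a fresh full sync of each
zfs destroy -r data-zfs/subvol-100-disk-0
zfs destroy -r rpool/data/subvol-100-disk-0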
 
OK, the problem seems fixed now. I thought it was only a data-zfs problem, but I had to delete both volumes on the receiving end and recreate the replication job to fix it.

Thank you!
 
