ZFS replication error after snapshot rollback

mbosma

I have a few clusters that use ZFS replication and HA.
After a client rolled back a snapshot on one of their VMs, a strange error popped up on the next replication run.
Removing the replication task and recreating it fixed the issue.
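For reference, this is roughly the CLI equivalent of what I did (job ID 102-0 and target node pve2 are from my setup, the schedule is just an example; the same can be done from the GUI under Replication):
Code:
pvesr list                                            # list the configured replication jobs
pvesr delete 102-0                                    # remove the broken job
pvesr create-local-job 102-0 pve2 --schedule '*/15'   # recreate it towards the target node

This is the failing replication log: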

Code:
2020-06-25 11:36:01 102-0: start replication job
2020-06-25 11:36:01 102-0: guest => VM 102, running => 38013
2020-06-25 11:36:01 102-0: volumes => local-zfs:vm-102-disk-0,local-zfs:vm-102-state-Update30CtoA
2020-06-25 11:36:02 102-0: create snapshot '__replicate_102-0_1593077761__' on local-zfs:vm-102-disk-0
2020-06-25 11:36:02 102-0: create snapshot '__replicate_102-0_1593077761__' on local-zfs:vm-102-state-Update30CtoA
2020-06-25 11:36:02 102-0: using insecure transmission, rate limit: none
2020-06-25 11:36:02 102-0: incremental sync 'local-zfs:vm-102-disk-0' (Update30CtoA => __replicate_102-0_1593077761__)
2020-06-25 11:36:03 102-0: delete previous replication snapshot '__replicate_102-0_1593077761__' on local-zfs:vm-102-disk-0
2020-06-25 11:36:03 102-0: delete previous replication snapshot '__replicate_102-0_1593077761__' on local-zfs:vm-102-state-Update30CtoA
2020-06-25 11:36:03 102-0: end replication job with error: no tunnel IP received

Rollback log:
Code:
delete stale replication snapshot '__replicate_102-1_1593076501__' on local-zfs:vm-102-disk-0
delete stale replication snapshot '__replicate_102-0_1593076506__' on local-zfs:vm-102-disk-0
delete stale replication snapshot '__replicate_102-1_1593076501__' on local-zfs:vm-102-state-Update30CtoA
delete stale replication snapshot '__replicate_102-0_1593076506__' on local-zfs:vm-102-state-Update30CtoA
TASK OK

I tried to reproduce this error on a different cluster but was not able to trigger the same one.
I did, however, come across another issue in this scenario.

After rolling back to a snapshot, the following replications fail because the ZFS dataset already exists on the destination node.

Code:
delete stale replication snapshot '__replicate_102-0_1593084601__' on local-zfs:vm-102-disk-0
delete stale replication snapshot '__replicate_102-0_1593084601__' on local-zfs:vm-102-state-test2
TASK OK
Code:
2020-06-25 13:37:01 102-0: start replication job
2020-06-25 13:37:01 102-0: guest => VM 102, running => 68768
2020-06-25 13:37:01 102-0: volumes => local-zfs:vm-102-disk-0,local-zfs:vm-102-state-test2
2020-06-25 13:37:02 102-0: create snapshot '__replicate_102-0_1593085021__' on local-zfs:vm-102-disk-0
2020-06-25 13:37:02 102-0: create snapshot '__replicate_102-0_1593085021__' on local-zfs:vm-102-state-test2
2020-06-25 13:37:02 102-0: using secure transmission, rate limit: none
2020-06-25 13:37:02 102-0: incremental sync 'local-zfs:vm-102-disk-0' (test2 => __replicate_102-0_1593085021__)
2020-06-25 13:37:03 102-0: send from @test2 to rpool/data/vm-102-disk-0@__replicate_102-0_1593085021__ estimated size is 336K
2020-06-25 13:37:03 102-0: total estimated size is 336K
2020-06-25 13:37:03 102-0: TIME        SENT   SNAPSHOT rpool/data/vm-102-disk-0@__replicate_102-0_1593085021__
2020-06-25 13:37:03 102-0: rpool/data/vm-102-disk-0@test2    name    rpool/data/vm-102-disk-0@test2    -
2020-06-25 13:37:03 102-0: successfully imported 'local-zfs:vm-102-disk-0'
2020-06-25 13:37:03 102-0: full sync 'local-zfs:vm-102-state-test2' (__replicate_102-0_1593085021__)
2020-06-25 13:37:04 102-0: full send of rpool/data/vm-102-state-test2@__replicate_102-0_1593085021__ estimated size is 254M
2020-06-25 13:37:04 102-0: total estimated size is 254M
2020-06-25 13:37:04 102-0: TIME        SENT   SNAPSHOT rpool/data/vm-102-state-test2@__replicate_102-0_1593085021__
2020-06-25 13:37:04 102-0: rpool/data/vm-102-state-test2    name    rpool/data/vm-102-state-test2    -
2020-06-25 13:37:04 102-0: volume 'rpool/data/vm-102-state-test2' already exists
2020-06-25 13:37:04 102-0: warning: cannot send 'rpool/data/vm-102-state-test2@__replicate_102-0_1593085021__': signal received
2020-06-25 13:37:04 102-0: cannot send 'rpool/data/vm-102-state-test2': I/O error
2020-06-25 13:37:04 102-0: command 'zfs send -Rpv -- rpool/data/vm-102-state-test2@__replicate_102-0_1593085021__' failed: exit code 1
2020-06-25 13:37:04 102-0: delete previous replication snapshot '__replicate_102-0_1593085021__' on local-zfs:vm-102-disk-0
2020-06-25 13:37:04 102-0: delete previous replication snapshot '__replicate_102-0_1593085021__' on local-zfs:vm-102-state-test2
2020-06-25 13:37:04 102-0: end replication job with error: command 'set -o pipefail && pvesm export local-zfs:vm-102-state-test2 zfs - -with-snapshots 1 -snapshot __replicate_102-0_1593085021__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve2' root@192.168.100.232 -- pvesm import local-zfs:vm-102-state-test2 zfs - -with-snapshots 1 -allow-rename 0' failed: exit code 255

After manually removing the dataset on the destination node, the issue was resolved; that is probably what fixed my earlier issue as well.
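
In case someone runs into the same thing, this is roughly how I checked for and removed the leftover state dataset on the destination node (the dataset names are from my setup, so double-check the zfs list output before destroying anything):
Code:
# on the destination node: show all datasets and snapshots belonging to the VM
zfs list -t all -r rpool/data | grep vm-102
# the leftover state dataset blocks the full send; removing it and its
# snapshots lets the next replication run through
zfs destroy -r rpool/data/vm-102-state-test2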

Code:
root@pve202:~# pveversion -v
proxmox-ve: 6.2-1 (running kernel: 5.4.41-1-pve)
pve-manager: 6.2-4 (running version: 6.2-4/9824574a)
pve-kernel-5.4: 6.2-2
pve-kernel-helper: 6.2-2
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-5.3.13-3-pve: 5.3.13-3
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-1
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-2
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-8
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve2
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-1
pve-cluster: 6.1-8
pve-container: 3.1-6
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-2
pve-qemu-kvm: 5.0.0-2
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1

Code:
root@pve3:~# pveversion -v
proxmox-ve: 6.2-1 (running kernel: 5.4.44-1-pve)
pve-manager: 6.2-6 (running version: 6.2-6/ee1d7754)
pve-kernel-5.4: 6.2-3
pve-kernel-helper: 6.2-3
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.44-1-pve: 5.4.44-1
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph: 14.2.9-pve1
ceph-fuse: 14.2.9-pve1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-1
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-3
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-8
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve2
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-7
pve-cluster: 6.1-8
pve-container: 3.1-8
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-3
pve-qemu-kvm: 5.0.0-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-3
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1

Does anyone know what could have caused the "no tunnel IP received" error?
And is anyone able to reproduce my second error after rolling back a snapshot?

I would like to file a bug report, but only if nothing is wrong on my side.
 
Hi,

The error "no tunnel IP received" occurs if the migration network in datacenter.cfg is not correct.

The odd thing is that the sync works again after removing the destination image, even though there is no information about the tunnel IP in the replicated image.
 
the error "no tunnel IP received" occurs if the migration network in the datacenter.cfg is not correct.
The replication was working rock solid until someone had to roll back a snapshot.
This is the content of datacenter.cfg:
Code:
keyboard: en-us
migration: insecure,network=10.1.25.0/24
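
As far as I can tell the migration network itself is fine. A quick way to check that each node actually has an address inside 10.1.25.0/24 (the network from the config above) is something like:
Code:
# run on every cluster node; it should print an address in the migration network
ip -4 -o addr show | grep '10\.1\.25\.'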


The odd thing is that after removing the destination image the sync is working again. There is no information about the tunnel IP in the replicated image.
This was related to the second issue:
volume 'rpool/data/vm-102-state-test2' already exists

The "no tunnel IP received" issue was resolved simply by deleting the replication job and recreating it.

Is there anything I can try to fix it (apart from recreating the job), or any information I can provide to help investigate further?
 
Another error after a snapshot rollback, on the first cluster (production).

Code:
2020-07-10 12:22:00 103-0: start replication job
2020-07-10 12:22:00 103-0: guest => VM 103, running => 40137
2020-07-10 12:22:00 103-0: volumes => local-zfs:vm-103-disk-0,local-zfs:vm-103-state-voor_2_3_install
2020-07-10 12:22:01 103-0: (remote_prepare_local_job) delete stale replication snapshot '__replicate_103-0_1594375260__' on local-zfs:vm-103-disk-0
2020-07-10 12:22:01 103-0: create snapshot '__replicate_103-0_1594376520__' on local-zfs:vm-103-disk-0
2020-07-10 12:22:01 103-0: create snapshot '__replicate_103-0_1594376520__' on local-zfs:vm-103-state-voor_2_3_install
2020-07-10 12:22:01 103-0: using insecure transmission, rate limit: none
2020-07-10 12:22:01 103-0: incremental sync 'local-zfs:vm-103-disk-0' (voor_2_3_install => __replicate_103-0_1594376520__)
2020-07-10 12:22:03 103-0: send from @voor_2_3_install to rpool/data/vm-103-disk-0@__replicate_103-0_1594376520__ estimated size is 6.66G
2020-07-10 12:22:03 103-0: total estimated size is 6.66G
2020-07-10 12:22:03 103-0: TIME        SENT   SNAPSHOT rpool/data/vm-103-disk-0@__replicate_103-0_1594376520__
2020-07-10 12:22:04 103-0: 12:22:04    630M   rpool/data/vm-103-disk-0@__replicate_103-0_1594376520__
Truncated...
2020-07-10 12:22:16 103-0: 12:22:16   6.36G   rpool/data/vm-103-disk-0@__replicate_103-0_1594376520__
2020-07-10 12:22:17 103-0: [pve202] rpool/data/vm-103-disk-0@voor_2_3_install    name    rpool/data/vm-103-disk-0@voor_2_3_install    -
2020-07-10 12:22:20 103-0: [pve202] successfully imported 'local-zfs:vm-103-disk-0'
2020-07-10 12:22:20 103-0: full sync 'local-zfs:vm-103-state-voor_2_3_install' (__replicate_103-0_1594376520__)
2020-07-10 12:22:21 103-0: full send of rpool/data/vm-103-state-voor_2_3_install@__replicate_103-0_1594376520__ estimated size is 1023M
2020-07-10 12:22:21 103-0: total estimated size is 1023M
2020-07-10 12:22:21 103-0: TIME        SENT   SNAPSHOT rpool/data/vm-103-state-voor_2_3_install@__replicate_103-0_1594376520__
2020-07-10 12:22:21 103-0: warning: cannot send 'rpool/data/vm-103-state-voor_2_3_install@__replicate_103-0_1594376520__': Broken pipe
2020-07-10 12:22:21 103-0: cannot send 'rpool/data/vm-103-state-voor_2_3_install': I/O error
2020-07-10 12:22:21 103-0: command 'zfs send -Rpv -- rpool/data/vm-103-state-voor_2_3_install@__replicate_103-0_1594376520__' failed: exit code 1
2020-07-10 12:22:21 103-0: delete previous replication snapshot '__replicate_103-0_1594376520__' on local-zfs:vm-103-disk-0
2020-07-10 12:22:21 103-0: delete previous replication snapshot '__replicate_103-0_1594376520__' on local-zfs:vm-103-state-voor_2_3_install
2020-07-10 12:22:21 103-0: end replication job with error: command 'set -o pipefail && pvesm export local-zfs:vm-103-state-voor_2_3_install zfs - -with-snapshots 1 -snapshot __replicate_103-0_1594376520__' failed: exit code 1

Rollback log:
Code:
delete stale replication snapshot '__replicate_103-1_1594373422__' on local-zfs:vm-103-disk-0
delete stale replication snapshot '__replicate_103-0_1594373407__' on local-zfs:vm-103-disk-0
delete stale replication snapshot '__replicate_103-1_1594373422__' on local-zfs:vm-103-state-voor_2_3_install
delete stale replication snapshot '__replicate_103-0_1594373407__' on local-zfs:vm-103-state-voor_2_3_install
TASK OK

zfs list on target node:
Code:
root@pve203:~# zfs list
NAME                                       USED  AVAIL     REFER  MOUNTPOINT
rpool                                      169G  4.63T      222K  /rpool
rpool/ROOT                                3.03G  4.63T      205K  /rpool/ROOT
rpool/ROOT/pve-1                          3.03G  4.63T     3.03G  /
rpool/data                                 166G  4.63T      205K  /rpool/data
Truncated....
rpool/data/vm-103-disk-0                  15.3G  4.63T     14.5G  -
rpool/data/vm-103-state-voor_2_3_install  1.34G  4.63T     1.34G  -
Truncated....

root@pve203:~# zfs get written | grep vm-103
rpool/data/vm-103-disk-0                                                 written   0        -
rpool/data/vm-103-disk-0@voor_2_3_install                                written   2.49G    -
rpool/data/vm-103-disk-0@__replicate_103-1_1594376541__                  written   12.8G    -
rpool/data/vm-103-state-voor_2_3_install                                 written   0        -
rpool/data/vm-103-state-voor_2_3_install@__replicate_103-0_1594373407__  written   1.34G    -
rpool/data/vm-103-state-voor_2_3_install@__replicate_103-1_1594373422__  written   0        -

It seems the replication creates a separate dataset on the target for the state of the snapshot taken in the GUI.

After deleting that state dataset on the target node, the replication starts working again:
Code:
zfs destroy -r rpool/data/vm-103-state-voor_2_3_install
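
If it helps anyone else: before running the destroy I double-checked what was actually left of that dataset on the target, roughly like this:
Code:
# on the target node: show the state dataset and all of its snapshots
zfs list -t all -r rpool/data/vm-103-state-voor_2_3_install

After the destroy, the next replication run completed without errors: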

Code:
2020-07-10 12:34:00 103-1: start replication job
2020-07-10 12:34:00 103-1: guest => VM 103, running => 40137
2020-07-10 12:34:00 103-1: volumes => local-zfs:vm-103-disk-0,local-zfs:vm-103-state-voor_2_3_install
2020-07-10 12:34:01 103-1: (remote_prepare_local_job) delete stale replication snapshot '__replicate_103-1_1594376541__' on local-zfs:vm-103-disk-0
2020-07-10 12:34:01 103-1: create snapshot '__replicate_103-1_1594377240__' on local-zfs:vm-103-disk-0
2020-07-10 12:34:01 103-1: create snapshot '__replicate_103-1_1594377240__' on local-zfs:vm-103-state-voor_2_3_install
2020-07-10 12:34:01 103-1: using insecure transmission, rate limit: none
2020-07-10 12:34:01 103-1: incremental sync 'local-zfs:vm-103-disk-0' (voor_2_3_install => __replicate_103-1_1594377240__)
2020-07-10 12:34:03 103-1: send from @voor_2_3_install to rpool/data/vm-103-disk-0@__replicate_103-0_1594377020__ estimated size is 6.67G
2020-07-10 12:34:03 103-1: send from @__replicate_103-0_1594377020__ to rpool/data/vm-103-disk-0@__replicate_103-1_1594377240__ estimated size is 3.53M
2020-07-10 12:34:03 103-1: total estimated size is 6.67G
2020-07-10 12:34:03 103-1: TIME        SENT   SNAPSHOT rpool/data/vm-103-disk-0@__replicate_103-0_1594377020__
2020-07-10 12:34:04 103-1: 12:34:04    551M   rpool/data/vm-103-disk-0@__replicate_103-0_1594377020__
Truncated...
2020-07-10 12:34:18 103-1: 12:34:18   6.58G   rpool/data/vm-103-disk-0@__replicate_103-0_1594377020__
2020-07-10 12:34:18 103-1: TIME        SENT   SNAPSHOT rpool/data/vm-103-disk-0@__replicate_103-1_1594377240__
2020-07-10 12:34:18 103-1: [pve203] rpool/data/vm-103-disk-0@voor_2_3_install    name    rpool/data/vm-103-disk-0@voor_2_3_install    -
2020-07-10 12:34:23 103-1: [pve203] successfully imported 'local-zfs:vm-103-disk-0'
2020-07-10 12:34:23 103-1: full sync 'local-zfs:vm-103-state-voor_2_3_install' (__replicate_103-1_1594377240__)
2020-07-10 12:34:24 103-1: full send of rpool/data/vm-103-state-voor_2_3_install@__replicate_103-0_1594377020__ estimated size is 1023M
2020-07-10 12:34:24 103-1: send from @__replicate_103-0_1594377020__ to rpool/data/vm-103-state-voor_2_3_install@__replicate_103-1_1594377240__ estimated size is 624B
2020-07-10 12:34:24 103-1: total estimated size is 1023M
2020-07-10 12:34:24 103-1: TIME        SENT   SNAPSHOT rpool/data/vm-103-state-voor_2_3_install@__replicate_103-0_1594377020__
2020-07-10 12:34:25 103-1: 12:34:25    588M   rpool/data/vm-103-state-voor_2_3_install@__replicate_103-0_1594377020__
2020-07-10 12:34:26 103-1: TIME        SENT   SNAPSHOT rpool/data/vm-103-state-voor_2_3_install@__replicate_103-1_1594377240__
2020-07-10 12:34:27 103-1: [pve203] successfully imported 'local-zfs:vm-103-state-voor_2_3_install'
2020-07-10 12:34:27 103-1: delete previous replication snapshot '__replicate_103-1_1594373422__' on local-zfs:vm-103-disk-0
2020-07-10 12:34:28 103-1: end replication job
 
