I have a few clusters that use ZFS replication and HA.
After a client rolled back a snapshot on one of their VMs, a weird error popped up on the next replication.
Removing the replication task and recreating it fixed the issue.
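In case it's useful, removing and recreating the job can also be done on the CLI with pvesr; this is roughly what it looks like for my setup (job ID 102-0 and target node pve2, the schedule is just an example):
Code:
# remove the existing replication job
pvesr delete 102-0
# recreate it towards the target node, replicating every 15 minutes
pvesr create-local-job 102-0 pve2 --schedule '*/15'
# verify the job runs again
pvesr status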
Code:
2020-06-25 11:36:01 102-0: start replication job
2020-06-25 11:36:01 102-0: guest => VM 102, running => 38013
2020-06-25 11:36:01 102-0: volumes => local-zfs:vm-102-disk-0,local-zfs:vm-102-state-Update30CtoA
2020-06-25 11:36:02 102-0: create snapshot '__replicate_102-0_1593077761__' on local-zfs:vm-102-disk-0
2020-06-25 11:36:02 102-0: create snapshot '__replicate_102-0_1593077761__' on local-zfs:vm-102-state-Update30CtoA
2020-06-25 11:36:02 102-0: using insecure transmission, rate limit: none
2020-06-25 11:36:02 102-0: incremental sync 'local-zfs:vm-102-disk-0' (Update30CtoA => __replicate_102-0_1593077761__)
2020-06-25 11:36:03 102-0: delete previous replication snapshot '__replicate_102-0_1593077761__' on local-zfs:vm-102-disk-0
2020-06-25 11:36:03 102-0: delete previous replication snapshot '__replicate_102-0_1593077761__' on local-zfs:vm-102-state-Update30CtoA
2020-06-25 11:36:03 102-0: end replication job with error: no tunnel IP received
Rollback log:
Code:
delete stale replication snapshot '__replicate_102-1_1593076501__' on local-zfs:vm-102-disk-0
delete stale replication snapshot '__replicate_102-0_1593076506__' on local-zfs:vm-102-disk-0
delete stale replication snapshot '__replicate_102-1_1593076501__' on local-zfs:vm-102-state-Update30CtoA
delete stale replication snapshot '__replicate_102-0_1593076506__' on local-zfs:vm-102-state-Update30CtoA
TASK OK
I tried to reproduce this error on a different cluster but was not able to get the same error.
I did, however, come across another issue in this scenario.
When rolling back to a snapshot, the subsequent replications will fail because the ZFS dataset already exists on the destination node.
Code:
delete stale replication snapshot '__replicate_102-0_1593084601__' on local-zfs:vm-102-disk-0
delete stale replication snapshot '__replicate_102-0_1593084601__' on local-zfs:vm-102-state-test2
TASK OK
Code:
2020-06-25 13:37:01 102-0: start replication job
2020-06-25 13:37:01 102-0: guest => VM 102, running => 68768
2020-06-25 13:37:01 102-0: volumes => local-zfs:vm-102-disk-0,local-zfs:vm-102-state-test2
2020-06-25 13:37:02 102-0: create snapshot '__replicate_102-0_1593085021__' on local-zfs:vm-102-disk-0
2020-06-25 13:37:02 102-0: create snapshot '__replicate_102-0_1593085021__' on local-zfs:vm-102-state-test2
2020-06-25 13:37:02 102-0: using secure transmission, rate limit: none
2020-06-25 13:37:02 102-0: incremental sync 'local-zfs:vm-102-disk-0' (test2 => __replicate_102-0_1593085021__)
2020-06-25 13:37:03 102-0: send from @test2 to rpool/data/vm-102-disk-0@__replicate_102-0_1593085021__ estimated size is 336K
2020-06-25 13:37:03 102-0: total estimated size is 336K
2020-06-25 13:37:03 102-0: TIME SENT SNAPSHOT rpool/data/vm-102-disk-0@__replicate_102-0_1593085021__
2020-06-25 13:37:03 102-0: rpool/data/vm-102-disk-0@test2 name rpool/data/vm-102-disk-0@test2 -
2020-06-25 13:37:03 102-0: successfully imported 'local-zfs:vm-102-disk-0'
2020-06-25 13:37:03 102-0: full sync 'local-zfs:vm-102-state-test2' (__replicate_102-0_1593085021__)
2020-06-25 13:37:04 102-0: full send of rpool/data/vm-102-state-test2@__replicate_102-0_1593085021__ estimated size is 254M
2020-06-25 13:37:04 102-0: total estimated size is 254M
2020-06-25 13:37:04 102-0: TIME SENT SNAPSHOT rpool/data/vm-102-state-test2@__replicate_102-0_1593085021__
2020-06-25 13:37:04 102-0: rpool/data/vm-102-state-test2 name rpool/data/vm-102-state-test2 -
2020-06-25 13:37:04 102-0: volume 'rpool/data/vm-102-state-test2' already exists
2020-06-25 13:37:04 102-0: warning: cannot send 'rpool/data/vm-102-state-test2@__replicate_102-0_1593085021__': signal received
2020-06-25 13:37:04 102-0: cannot send 'rpool/data/vm-102-state-test2': I/O error
2020-06-25 13:37:04 102-0: command 'zfs send -Rpv -- rpool/data/vm-102-state-test2@__replicate_102-0_1593085021__' failed: exit code 1
2020-06-25 13:37:04 102-0: delete previous replication snapshot '__replicate_102-0_1593085021__' on local-zfs:vm-102-disk-0
2020-06-25 13:37:04 102-0: delete previous replication snapshot '__replicate_102-0_1593085021__' on local-zfs:vm-102-state-test2
2020-06-25 13:37:04 102-0: end replication job with error: command 'set -o pipefail && pvesm export local-zfs:vm-102-state-test2 zfs - -with-snapshots 1 -snapshot __replicate_102-0_1593085021__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve2' root@192.168.100.232 -- pvesm import local-zfs:vm-102-state-test2 zfs - -with-snapshots 1 -allow-rename 0' failed: exit code 255
After manually removing the dataset on the destination node the issue is resolved; that's probably what fixed my earlier issue too.
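For anyone running into the same thing, this is roughly what I did on the destination node to clean up (dataset name is from my test VM, adjust accordingly):
Code:
# on the destination node: check for the leftover state dataset
zfs list -r rpool/data | grep vm-102-state
# remove it together with its snapshots so the next full sync can recreate it
zfs destroy -r rpool/data/vm-102-state-test2
# back on the source node: trigger the replication job again
pvesr schedule-now 102-0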
root@pve202:~# pveversion -v
proxmox-ve: 6.2-1 (running kernel: 5.4.41-1-pve)
pve-manager: 6.2-4 (running version: 6.2-4/9824574a)
pve-kernel-5.4: 6.2-2
pve-kernel-helper: 6.2-2
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-5.3.13-3-pve: 5.3.13-3
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-1
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-2
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-8
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve2
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-1
pve-cluster: 6.1-8
pve-container: 3.1-6
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-2
pve-qemu-kvm: 5.0.0-2
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1
root@pve3:~# pveversion -v
proxmox-ve: 6.2-1 (running kernel: 5.4.44-1-pve)
pve-manager: 6.2-6 (running version: 6.2-6/ee1d7754)
pve-kernel-5.4: 6.2-3
pve-kernel-helper: 6.2-3
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.44-1-pve: 5.4.44-1
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph: 14.2.9-pve1
ceph-fuse: 14.2.9-pve1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-1
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-3
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-8
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve2
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-7
pve-cluster: 6.1-8
pve-container: 3.1-8
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-3
pve-qemu-kvm: 5.0.0-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-3
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1
Does anyone know what could have caused the "no tunnel IP received" error?
And is anyone able to reproduce my second error after rolling back a snapshot?
I would like to file a bug report, but only if nothing is wrong on my side.