Hi there,
I have a 3-node cluster; 2 of the nodes have 4 x SSDs each and use ZFS with RAIDZ-1. These pools are at 13% and 22% capacity.
I've recently started converting some containers to QEMU VMs to take advantage of the near-realtime migration available with ZFS, and I have also enabled replication to help with this.
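For what it's worth, the capacity figures come from a quick check along these lines:
Code:
# pool usage at a glance; "cap" is the percent-used column
zpool list -o name,size,alloc,free,cap
# print only unhealthy pools, so silence means all pools are fine
zpool status -x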
Here's a high-level overview:
node1 has vm1
node2 has vm2
Replication runs every 15 minutes and copies vm2 to node1 and vm1 to node2 (rough config sketch below).
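The jobs look roughly like this in /etc/pve/replication.cfg (vm1 is VM 101 per the alert below; vm2's ID here is just a placeholder):
Code:
local: 101-0
        target node2
        schedule */15

local: 102-0
        target node1
        schedule */15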
The vm2 replication was set up first and has never had an issue.
vm1 has an intermittent issue; the email alert is fairly basic:
Code:
Subject: Replication Job: 101-0 failed
import failed: exit code 29
The log from the web interface shows this:
Code:
2019-05-13 10:00:01 101-0: start replication job
2019-05-13 10:00:01 101-0: guest => VM 101, running => 35788
2019-05-13 10:00:01 101-0: volumes => local-zfs:vm-101-disk-0
2019-05-13 10:00:02 101-0: freeze guest filesystem
2019-05-13 10:00:02 101-0: create snapshot '__replicate_101-0_1557741601__' on local-zfs:vm-101-disk-0
2019-05-13 10:00:02 101-0: thaw guest filesystem
2019-05-13 10:00:02 101-0: incremental sync 'local-zfs:vm-101-disk-0' (__replicate_101-0_1557740701__ => __replicate_101-0_1557741601__)
2019-05-13 10:00:04 101-0: delete previous replication snapshot '__replicate_101-0_1557741601__' on local-zfs:vm-101-disk-0
2019-05-13 10:00:04 101-0: end replication job with error: import failed: exit code 29
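My next step is probably to replay the failing incremental by hand to get a real error message instead of just "exit code 29". Something like this, assuming the default local-zfs dataset path of rpool/data and using the snapshot names from the log above:
Code:
# on node1: resend the incremental between the two replication snapshots,
# piped into a dry-run receive on node2 to surface the underlying zfs error
# (rpool/data is an assumption -- the default dataset for local-zfs storage)
zfs send -v -i rpool/data/vm-101-disk-0@__replicate_101-0_1557740701__ \
    rpool/data/vm-101-disk-0@__replicate_101-0_1557741601__ \
  | ssh root@node2 zfs receive -nFv rpool/data/vm-101-disk-0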
I've read some other threads on this, and they all hint at the storage being overloaded; however, there is no sign of that on these nodes:
CPU: ~1.5%
Load: ~0.4
RAM: ~18%
IO: essentially idle
I adjusted the schedule of the failing job (vm1) so that it wouldn't conflict with the incoming vm2 sync, but this has not helped.
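Concretely, I staggered it with something like this (if I've read the calendar syntax right, 5/15 fires at minutes 5, 20, 35 and 50, while the incoming vm2 job stays on */15):
Code:
# shift vm1's job so it no longer lines up with the incoming vm2 sync
pvesr update 101-0 --schedule '5/15'
# confirm both jobs and their next sync times
pvesr status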
I captured the output from iotop on node2 during a sync; it does not look overloaded at all:
Code:
Total DISK READ : 272.67 K/s | Total DISK WRITE : 386.59 K/s
Actual DISK READ: 422.08 K/s | Actual DISK WRITE: 24.41 M/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
929 be/4 root 0.00 B/s 0.00 B/s 0.00 % 8.47 % [txg_sync]
1514 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.44 % [kworker/u98:3]
43974 be/4 root 0.00 B/s 33.62 K/s 0.00 % 0.22 % kvm -id 100 -name db0~=300 -machine type=pc
42476 be/4 root 0.00 B/s 48.56 K/s 0.00 % 0.16 % kvm -id 100 -name db0~=300 -machine type=pc
42768 be/4 root 0.00 B/s 123.26 K/s 0.00 % 0.16 % kvm -id 100 -name db0~=300 -machine type=pc
43973 be/4 root 0.00 B/s 37.35 K/s 0.00 % 0.16 % kvm -id 100 -name db0~=300 -machine type=pc
42473 be/4 root 0.00 B/s 82.17 K/s 0.00 % 0.08 % kvm -id 100 -name db0~=300 -machine type=pc
2302 be/4 root 0.00 B/s 956.21 B/s 0.00 % 0.00 % rsyslogd -n [rs:main Q:Reg]
735 be/0 root 48.56 K/s 0.00 B/s 0.00 % 0.00 % [z_rd_iss]
36951 be/0 root 115.79 K/s 0.00 B/s 0.00 % 0.00 % [z_rd_iss]
36953 be/0 root 108.32 K/s 0.00 B/s 0.00 % 0.00 % [z_rd_iss]
2849 be/4 root 0.00 B/s 60.70 K/s 0.00 % 0.00 % pmxcfs [cfs_loop]
1 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % init
2 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [kthreadd]
22531 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [kworker/17:1]
4 be/0 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [kworker/0:0H]
43938 be/4 postfix 0.00 B/s 0.00 B/s 0.00 % 0.00 % qmgr -l -t unix -u
7 be/0 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [mm_percpu_wq]
8 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [ksoftirqd/0]
9 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [rcu_sched]
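I also plan to watch the pool itself during a sync window with something like:
Code:
# per-vdev throughput on the receiving pool, sampled every 5 seconds
# ("rpool" is a placeholder for the actual pool name)
zpool iostat -v rpool 5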
The failures appear random; I cannot infer any pattern from them.
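One thing I still want to rule out is a stale replication snapshot on the receiving side, since a leftover __replicate_ snapshot on the target can break an incremental receive. A check like this on both nodes should show any leftovers:
Code:
# list replication snapshots with creation times; compare node1 vs node2
zfs list -t snapshot -o name,creation | grep __replicate_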
Any ideas much appreciated!
Thanks!
Package versions:
Code:
proxmox-ve: 5.4-1 (running kernel: 4.15.18-13-pve)
pve-manager: 5.4-5 (running version: 5.4-5/c6fdb264)
pve-kernel-4.15: 5.4-1
pve-kernel-4.15.18-13-pve: 4.15.18-37
pve-kernel-4.15.18-12-pve: 4.15.18-36
pve-kernel-4.15.18-9-pve: 4.15.18-30
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-51
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-13
libpve-storage-perl: 5.0-41
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-26
pve-cluster: 5.0-36
pve-container: 2.0-37
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-20
pve-firmware: 2.0-6
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 2.12.1-3
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-50
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.13-pve1~bpo2