Error with ZFS replication jobs

techd00d

Member
Aug 10, 2017
I have a replication job set up through the GUI between two servers with identical specs, configuration, and Proxmox versions.

I have about 20 KVM VMs, and at random some of them fail to replicate with the following in the replication logs.

If I go to the destination server, manually delete the replicated ZFS volume and its snapshots (zfs destroy -r hdd-storage/vmdata/vm-115-disk-1), and then delete and re-create the replication job from the GUI, replication succeeds again. Some VMs then replicate fine for a day or two before breaking again; others have not broken since they were fixed. It seems random, but every VM seems to have failed at least once.
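For anyone who wants to script the workaround described above, here is a minimal dry-run sketch. It only prints the commands and executes nothing. The function name is made up for illustration, and the target node name `vmhost3` is an assumption taken from the `HostKeyAlias` in the log. `pvesr delete` and `pvesr create-local-job` are used here as the CLI counterparts of the GUI steps; as far as I know the delete is only carried out on the next replication run, so re-creating the job with the same ID may need a short wait.

```shell
#!/bin/sh
# Dry-run sketch of the manual workaround: prints the commands, runs nothing.
# All arguments are illustrative; substitute values from your own setup.
#   $1 = ZFS dataset on the destination node
#   $2 = replication job ID
#   $3 = target node name (assumed, not confirmed by the original post)
print_recovery_steps() {
    dataset=$1
    job=$2
    target=$3
    echo "# on the destination node:"
    echo "zfs destroy -r $dataset"
    echo "# on the source node (CLI counterpart of delete + re-create in the GUI):"
    echo "pvesr delete $job"
    echo "pvesr create-local-job $job $target"
}

print_recovery_steps hdd-storage/vmdata/vm-115-disk-1 115-1 vmhost3
```

Review the printed commands before running any of them by hand; `zfs destroy -r` is not reversible.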

The interface they replicate over is dedicated and not on a public network. We have run ping tests and can confirm we are not losing connectivity for any period of time.

Any ideas would be appreciated!

Code:
Virtual Environment 5.0-32

2018-03-12 12:19:00 115-1: start replication job
2018-03-12 12:19:00 115-1: guest => VM 115, running => 4547
2018-03-12 12:19:00 115-1: volumes => hdd-vmdata:vm-115-disk-1
2018-03-12 12:19:01 115-1: create snapshot '__replicate_115-1_1520882340__' on hdd-vmdata:vm-115-disk-1
2018-03-12 12:19:04 115-1: full sync 'hdd-vmdata:vm-115-disk-1' (__replicate_115-1_1520882340__)
2018-03-12 12:19:06 115-1: delete previous replication snapshot '__replicate_115-1_1520882340__' on hdd-vmdata:vm-115-disk-1
2018-03-12 12:19:06 115-1: end replication job with error: command 'set -o pipefail && pvesm export hdd-vmdata:vm-115-disk-1 zfs - -with-snapshots 1 -snapshot __replicate_115-1_1520882340__ | /usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=vmhost3' root@10.9.8.3 -- pvesm import hdd-vmdata:vm-115-disk-1 zfs - -with-snapshots 1' failed: exit code 255
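With about 20 jobs failing at random, it can help to pull just the failures out of the logs. Below is a small sketch that reads log lines in the format shown above from stdin and prints the job ID and exit code for every job that ended with an error. The function name `failed_jobs` is made up for illustration, and the sample input is a shortened copy of the log above with the command text elided.

```shell
#!/bin/sh
# Reads replication log lines on stdin and prints "<job-id> <exit-code>"
# for each line of the form
#   <date> <time> <job-id>: end replication job with error: ... exit code <n>
failed_jobs() {
    sed -n 's/^[0-9-]* [0-9:]* \([0-9]*-[0-9]*\): end replication job with error: .*exit code \([0-9]*\)$/\1 \2/p'
}

# Shortened sample input (command text elided).
failed_jobs <<'EOF'
2018-03-12 12:19:00 115-1: start replication job
2018-03-12 12:19:06 115-1: end replication job with error: command '...' failed: exit code 255
EOF
# prints: 115-1 255
```

To run it against real logs, pipe them in, for example `cat /path/to/logs/* | failed_jobs | sort | uniq -c`, where the log path is a placeholder for wherever your node keeps the per-job replication logs.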
 
Hi,

Is the target storage under load?
 
Yes. I wouldn't call the load high on average, but it is the target for all replication jobs from the server in question, so I suppose many jobs could hit it at once? The server itself is also in production, though with pretty low IO on the VMs on that machine.

Is there a more reliable way to utilize ZFS for replication? I prefer to keep it in the GUI if possible but I am open to the most reliable solution.
 
