I have a replication job set up through the GUI between two servers with identical specs, configuration, and Proxmox version.
I have about 20 KVM VMs, and at random some of them fail to replicate with the error shown in the replication log below.
If I go to the destination server and manually destroy the replicated ZFS volume along with its snapshots (zfs destroy -r hdd-storage/vmdata/vm-115-disk-1), then delete and re-create the replication job from the GUI, replication succeeds again. Some VMs then replicate fine for a day or two before breaking again; others have not broken since they were fixed. It seems random, but every VM seems to have failed at least once.
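For reference, the cleanup step on the destination can be sketched roughly as below. This is only an illustration: the dataset path comes from the command quoted above, while the helper names (`dataset_for`, `destroy_replica`) and the `DRY_RUN` switch are hypothetical additions so the command is printed before anything is actually destroyed.

```shell
#!/bin/sh
# Sketch of the manual cleanup run on the destination node.
# POOL_PATH is taken from the zfs destroy command quoted above;
# the function names and DRY_RUN are illustrative only.
POOL_PATH="hdd-storage/vmdata"

# Build the dataset name for a given VM ID and disk index.
dataset_for() {
    printf '%s/vm-%s-disk-%s\n' "$POOL_PATH" "$1" "$2"
}

# Print the destroy command by default; only run it when DRY_RUN=0.
destroy_replica() {
    ds=$(dataset_for "$1" "$2")
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "zfs destroy -r $ds"
    else
        zfs destroy -r "$ds"
    fi
}

destroy_replica 115 1
```

With the default `DRY_RUN=1` this only echoes the command; the replication job still has to be deleted and re-created from the GUI afterwards.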
The interface they replicate over is dedicated and not on a public network. We have run ping tests and can confirm that we are not losing connectivity for any period of time.
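A minimal sketch of that kind of ping check, assuming a Linux `ping` that prints the usual "N% packet loss" summary line. The helper name `loss_percent`, the packet count, and the interval are illustrative and not the exact test that was run.

```shell
#!/bin/sh
# Extract the packet-loss percentage from ping output read on stdin.
# Assumes a summary line containing "<number>% packet loss".
loss_percent() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Only ping when a host is passed as the first argument,
# e.g. the replication address that appears in the log.
if [ -n "${1:-}" ]; then
    ping -c 100 -i 0.2 "$1" | loss_percent
fi
```

Saved under a name of your choice and called with the replication address (10.9.8.3 here) as its only argument, it prints just the loss percentage for that run.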
Any ideas would be appreciated!
Code:
Virtual Environment 5.0-32
2018-03-12 12:19:00 115-1: start replication job
2018-03-12 12:19:00 115-1: guest => VM 115, running => 4547
2018-03-12 12:19:00 115-1: volumes => hdd-vmdata:vm-115-disk-1
2018-03-12 12:19:01 115-1: create snapshot '__replicate_115-1_1520882340__' on hdd-vmdata:vm-115-disk-1
2018-03-12 12:19:04 115-1: full sync 'hdd-vmdata:vm-115-disk-1' (__replicate_115-1_1520882340__)
2018-03-12 12:19:06 115-1: delete previous replication snapshot '__replicate_115-1_1520882340__' on hdd-vmdata:vm-115-disk-1
2018-03-12 12:19:06 115-1: end replication job with error: command 'set -o pipefail && pvesm export hdd-vmdata:vm-115-disk-1 zfs - -with-snapshots 1 -snapshot __replicate_115-1_1520882340__ | /usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=vmhost3' root@10.9.8.3 -- pvesm import hdd-vmdata:vm-115-disk-1 zfs - -with-snapshots 1' failed: exit code 255
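To compare which replication snapshots exist on each node for a job, something like the following filter could be used. It is a sketch that assumes the `__replicate_<jobid>_<timestamp>__` naming seen in the log above; `replica_snapshots_for` is a hypothetical helper name.

```shell
#!/bin/sh
# Keep only the replication snapshots of one job from a list of
# snapshot names on stdin (one "dataset@snapshot" name per line).
# Assumes the '__replicate_<jobid>_<timestamp>__' naming from the log.
replica_snapshots_for() {
    grep -F "@__replicate_${1}_"
}

# Only query ZFS when a dataset and a job ID are passed as arguments.
if [ -n "${1:-}" ] && [ -n "${2:-}" ]; then
    zfs list -H -t snapshot -o name -r "$1" | replica_snapshots_for "$2"
fi
```

Run on both nodes with the dataset (hdd-storage/vmdata/vm-115-disk-1) and the job ID (115-1) as arguments, it lists the replication snapshots each side currently holds for that job.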