Replication Failure

Manny Vazquez

Hi,

I have a 3-node system, all Dell 610s with 128 GB of RAM and a 3 TB HD on ZFS.

I had started several VMs on all 3 servers (2 or 3 each), each VM replicating to the other 2 servers, and I also set up HA for all the VMs in a circular fashion: VMs on 1 would fail over to 2 and then 3, VMs on 2 to 3 and then 1, and VMs on 3 to 1 and then 2.

Today when I did my daily check-up, all the VMs had moved off node 1. I looked around for the reason but could not find anything, other than that the server had rebooted (I only saw this in the iDRAC "boot capture" of that server; nothing weird, and still no reason why it rebooted).

When I checked the 'replication' of the VMs that were on server 1, I saw errors replicating back to 1.

I disabled the replications and sent the server for a nice reboot.

Once it was back up, I tried setting up the replications again, but they fail with a cryptic error.
Can anyone shed some light on this?
=========================================

Virtual Environment 5.0-23/af4267bf



Virtual Machine 101 ('ubuntu-ws') on node 'vm3'
2017-08-08 09:37:00 101-1: start replication job
2017-08-08 09:37:00 101-1: guest => VM 101, running => 0
2017-08-08 09:37:00 101-1: volumes => local-zfs:vm-101-disk-2
2017-08-08 09:37:01 101-1: create snapshot '__replicate_101-1_1502199420__' on local-zfs:vm-101-disk-2
2017-08-08 09:37:01 101-1: full sync 'local-zfs:vm-101-disk-2' (__replicate_101-1_1502199420__)
2017-08-08 09:37:02 101-1: delete previous replication snapshot '__replicate_101-1_1502199420__' on local-zfs:vm-101-disk-2
2017-08-08 09:37:02 101-1: end replication job with error: command 'set -o pipefail && pvesm export local-zfs:vm-101-disk-2 zfs - -with-snapshots 1 -snapshot __replicate_101-1_1502199420__ | /usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=vm1' root@172.21.82.21 -- pvesm import local-zfs:vm-101-disk-2 zfs - -with-snapshots 1' failed: exit code 255

============================================
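As a side note, the generic "exit code 255" in that log hides the underlying zfs error from the receiving side. One way to surface it is to run the two halves of the logged command by hand on the source node; a rough sketch, reusing only the volume name and target from the log above:
=================
# on vm3: what do the source volume and its replication snapshots look like?
zfs list -t all -r rpool/data/vm-101-disk-2

# on vm3: does the target node already hold a copy of the volume?
ssh -o BatchMode=yes -o HostKeyAlias=vm1 root@172.21.82.21 \
    zfs list -r rpool/data/vm-101-disk-2
=================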

This VM (101) is the only Linux VM and it is only used for testing.
I originally tried with the VM on; this error now occurs with the VM off.

I have also tried a manual migration (VM off, 'online' not checked):
=================
Requesting HA migration for VM 101 to node vm1
TASK OK
=================
But it does not migrate.

This is the log:
==============================
task started by HA resource agent
2017-08-08 09:44:53 starting migration of VM 101 to node 'vm1' (172.21.82.21)
2017-08-08 09:44:53 found local disk 'local-zfs:vm-101-disk-2' (in current VM config)
2017-08-08 09:44:53 copying disk images
2017-08-08 09:44:53 start replication job
2017-08-08 09:44:53 guest => VM 101, running => 0
2017-08-08 09:44:53 volumes => local-zfs:vm-101-disk-2
2017-08-08 09:44:54 create snapshot '__replicate_101-1_1502199893__' on local-zfs:vm-101-disk-2
2017-08-08 09:44:54 full sync 'local-zfs:vm-101-disk-2' (__replicate_101-1_1502199893__)
send from @ to rpool/data/vm-101-disk-2@__replicate_101-0_1502164859__ estimated size is 7.34G
send from @__replicate_101-0_1502164859__ to rpool/data/vm-101-disk-2@__replicate_101-1_1502199893__ estimated size is 1.48M
total estimated size is 7.34G
TIME SENT SNAPSHOT
rpool/data/vm-101-disk-2 name rpool/data/vm-101-disk-2 -
volume 'rpool/data/vm-101-disk-2' already exists
command 'zfs send -Rpv -- rpool/data/vm-101-disk-2@__replicate_101-1_1502199893__' failed: got signal 13
send/receive failed, cleaning up snapshot(s)..
2017-08-08 09:44:55 delete previous replication snapshot '__replicate_101-1_1502199893__' on local-zfs:vm-101-disk-2
2017-08-08 09:44:55 end replication job with error: command 'set -o pipefail && pvesm export local-zfs:vm-101-disk-2 zfs - -with-snapshots 1 -snapshot __replicate_101-1_1502199893__ | /usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=vm1' root@172.21.82.21 -- pvesm import local-zfs:vm-101-disk-2 zfs - -with-snapshots 1' failed: exit code 255
2017-08-08 09:44:55 ERROR: Failed to sync data - command 'set -o pipefail && pvesm export local-zfs:vm-101-disk-2 zfs - -with-snapshots 1 -snapshot __replicate_101-1_1502199893__ | /usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=vm1' root@172.21.82.21 -- pvesm import local-zfs:vm-101-disk-2 zfs - -with-snapshots 1' failed: exit code 255
2017-08-08 09:44:55 aborting phase 1 - cleanup resources
2017-08-08 09:44:55 ERROR: migration aborted (duration 00:00:02): Failed to sync data - command 'set -o pipefail && pvesm export local-zfs:vm-101-disk-2 zfs - -with-snapshots 1 -snapshot __replicate_101-1_1502199893__ | /usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=vm1' root@172.21.82.21 -- pvesm import local-zfs:vm-101-disk-2 zfs - -with-snapshots 1' failed: exit code 255
TASK ERROR: migration aborted
====================================

I am completely at a loss here, and I am afraid that if this continues I cannot rely on Proxmox for my production environment (so far this is almost all testing; I only have a couple of development machines that are really in use).
 
First, replication is a technology preview: see https://pve.proxmox.com/wiki/Roadmap#Proxmox_VE

Also, the problem with HA is known and documented:

High-Availability is allowed in combination with storage replication, but it has the following implications:

  • redistributing services after a more preferred node comes online will lead to errors.

  • recovery works, but there may be some data loss between the last synced time and the time a node failed.

see https://pve.proxmox.com/wiki/Storage_Replication

Those problems are known and we will eventually fix them, but for now you could try to manually delete the disk 'rpool/data/vm-101-disk-2' on vm1.
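A minimal sketch of that cleanup, assuming the replica ended up in the default rpool/data dataset on vm1 (check with zfs list first so nothing else gets destroyed):
=================
# on vm1: confirm the stale replica and its snapshots
zfs list -t all -r rpool/data | grep vm-101

# on vm1: destroy it; -r also removes its __replicate_*__ snapshots
zfs destroy -r rpool/data/vm-101-disk-2
=================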
 
Hi,

Thanks for the reply and for the info about the known problems. I could live with them if the suggested solution of manually deleting the previous replications worked, but it seems it does not.

I deleted all the vm-101-* entries, both from "/dev/zvol/rpool/data" and from "/dev/rpool/data".

[screenshot: del vm.png]

Then I tried to create the replication again:

[screenshot: rep 1.png]

But as you can see, the error is still there; it reads:
=======================================
2017-08-08 10:09:00 101-1: start replication job
2017-08-08 10:09:00 101-1: guest => VM 101, running => 2939
2017-08-08 10:09:00 101-1: volumes => local-zfs:vm-101-disk-2
2017-08-08 10:09:01 101-1: create snapshot '__replicate_101-1_1502201340__' on local-zfs:vm-101-disk-2
2017-08-08 10:09:01 101-1: full sync 'local-zfs:vm-101-disk-2' (__replicate_101-1_1502201340__)
2017-08-08 10:09:02 101-1: delete previous replication snapshot '__replicate_101-1_1502201340__' on local-zfs:vm-101-disk-2
2017-08-08 10:09:02 101-1: end replication job with error: command 'set -o pipefail && pvesm export local-zfs:vm-101-disk-2 zfs - -with-snapshots 1 -snapshot __replicate_101-1_1502201340__ | /usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=vm1' root@172.21.82.21 -- pvesm import local-zfs:vm-101-disk-2 zfs - -with-snapshots 1' failed: exit code 255

======================================
Any other ideas?
 
You have to delete those with the 'zfs' tool, not in /dev.
Then I would remove the replication job and wait until that has gone through,

then create it anew.
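Roughly, with the pvesr CLI (syntax as of PVE 5.x; the job ID matches the logs above and the schedule value is only an example):
=================
# on the node that currently owns VM 101 (vm3)
pvesr delete 101-1                                  # marks the job for removal
pvesr list                                          # wait until the job no longer shows up
pvesr create-local-job 101-1 vm1 --schedule '*/15'  # recreate replication to vm1
pvesr status                                        # watch the first full sync
=================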
 

Hmm, the ZFS tool... I have no idea about that :)

I guess I have some research to do. Did I mess it up by deleting in the terminal with a simple rm?
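For what it's worth, the entries under /dev/zvol are just device nodes managed by udev; removing them with rm does not destroy the underlying volumes or snapshots. Whether those still exist can be checked with zfs itself, for example:
=================
# list the volumes and snapshots ZFS actually knows about
zfs list -t volume -r rpool/data
zfs list -t snapshot -r rpool/data | grep vm-101
=================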
 
I am going crazy and cannot find a solution.

Since I could no longer use zfs destroy to delete the datasets (and I did not know which ones I had to delete anyhow), I tried following what was suggested in another thread: create a backup, move the file, and restore it with a new ID. All of that worked perfectly, but now I want to get rid of the 'garbage' and I can't.
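For anyone following the same workaround, a rough sketch of that backup-and-restore cycle (the dump filename and the new VMID 201 are placeholders, not values from this thread):
=================
# on vm3: full backup of the stuck VM, then restore it under a new VMID
vzdump 101 --mode stop --storage local --compress lzo
qmrestore /var/lib/vz/dump/vzdump-qemu-101-<timestamp>.vma.lzo 201 --storage local-zfs
=================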

[screenshot: rep fail.png]
The original 101 (stopped on vm3) has no active replication, but when I try to remove it, it tells me that there is one.

Does anyone know how to get rid of that ghost replication job?

 
You can check whether they are simply defined on another node via Datacenter -> Replication.
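If the GUI view is inconclusive, the job definitions can also be checked from the shell; replication jobs live in the cluster-wide config, so a quick look from any node would be something like:
=================
cat /etc/pve/replication.cfg   # all replication jobs defined in the cluster
pvesr list                     # jobs for guests on this node
pvesr status                   # current state of the jobs on this node
=================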
 
