Replication to one node failing

fjtrooper
Hello, I have a 3-node cluster and created some replication jobs between the nodes. Suddenly replication to node1 stopped working.
I deleted the job and created it again, but it still doesn't work.
The only workaround was to back up the VM, delete it, and restore it to a different ID.
Is it possible to solve this without deleting and restoring?
Here is my error log.
THX!!!!


2017-07-13 14:40:00 101-0: start replication job
2017-07-13 14:40:00 101-0: guest => VM 101, running => 0
2017-07-13 14:40:00 101-0: volumes => zfspool:vm-101-disk-1
2017-07-13 14:40:01 101-0: create snapshot '__replicate_101-0_1499953200__' on zfspool:vm-101-disk-1
2017-07-13 14:40:01 101-0: full sync 'zfspool:vm-101-disk-1' (__replicate_101-0_1499953200__)
2017-07-13 14:40:02 101-0: delete previous replication snapshot '__replicate_101-0_1499953200__' on zfspool:vm-101-disk-1
2017-07-13 14:40:02 101-0: end replication job with error: command 'set -o pipefail && pvesm export zfspool:vm-101-disk-1 zfs - -with-snapshots 1 -snapshot __replicate_101-0_1499953200__ | /usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=pvetest1' root@10.199.0.21 -- pvesm import zfspool:vm-101-disk-1 zfs - -with-snapshots 1' failed: exit code 255
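For reference, the backup/delete/restore workaround mentioned above looks roughly like this (a minimal sketch only; the new VM ID 201, the dump file path and the storage names are examples, not taken from this setup):

Code:
# back up the VM, remove it, then restore it under a new ID
vzdump 101 --mode stop --storage local
qm destroy 101
qmrestore /var/lib/vz/dump/vzdump-qemu-101-*.vma 201 --storage zfspool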
 
Hi,

when you delete a job you have to wait until the nodes get synced again (about 1 minute).
Then create a new job and it will work.
 
After more than 12 hours I created the replication job again, but I still get the error. Replication to one node fails while replication to the other works fine. The cluster works perfectly and I can SSH between the nodes in both directions.

2017-07-14 06:59:00 101-0: start replication job
2017-07-14 06:59:00 101-0: guest => VM 101, running => 0
2017-07-14 06:59:00 101-0: volumes => zfspool:vm-101-disk-1
2017-07-14 06:59:01 101-0: create snapshot '__replicate_101-0_1500011940__' on zfspool:vm-101-disk-1
2017-07-14 06:59:01 101-0: full sync 'zfspool:vm-101-disk-1' (__replicate_101-0_1500011940__)
2017-07-14 06:59:02 101-0: delete previous replication snapshot '__replicate_101-0_1500011940__' on zfspool:vm-101-disk-1
2017-07-14 06:59:02 101-0: end replication job with error: command 'set -o pipefail && pvesm export zfspool:vm-101-disk-1 zfs - -with-snapshots 1 -snapshot __replicate_101-0_1500011940__ | /usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=pvetest1' root@10.199.0.21 -- pvesm import zfspool:vm-101-disk-1 zfs - -with-snapshots 1' failed: exit code 255
 
SOLVED!

I don't know why, but the ZFS pools on the replica nodes still held replicated images that were out of sync. I manually deleted those images on the replica nodes with "zfs destroy -r <image>", created the jobs again, and now it is working again.
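In case it helps someone, the cleanup looked roughly like this (a sketch; the dataset path below is an example, check "zfs list" for the real names on your nodes):

Code:
# on the replica node that refuses the sync
zfs list -t all | grep vm-101-disk-1     # find the stale replicated image and its snapshots
zfs destroy -r rpool/data/vm-101-disk-1  # remove it, including all snapshots
# then re-create the replication job on the source node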

THX!!
 
Still having replication problems, and I can now tell you how to reproduce it.
I'm testing with 3 nodes and replicating 2 VMs to all nodes. When I configure it for the first time, everything works.
When I test HA by powering off one node (cutting the power), the VM that was running on that node starts on one of the other nodes and runs perfectly, but after that, replication for that VM never works again.
I also can't migrate the VM to another node.
To solve that, I have to manually delete the replicated disk and the VM disk on the other nodes (I use local ZFS), keeping only the VM disk on the node where it is running; then replication and migration work again.

Is this normal, or could it be a bug?

FYI, after I force an HA failover and delete all jobs from the GUI, one job still shows up when I run pvesr list, and I have to force-delete it from the console because it no longer appears in the GUI.
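The console cleanup looks roughly like this (a sketch; the job ID 101-0 is an example, use the ID shown by pvesr list):

Code:
# list the replication jobs this node still knows about
pvesr list
# remove the leftover job; --force removes the job config even if the
# normal cleanup on the (dead) target cannot run
pvesr delete 101-0 --force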


Thanks and good job!!
 
@fjtrooper - I can say that in my case I see the same. Maybe after ZFS 0.7.x becomes available in Proxmox it will be possible to resume an interrupted zfs send/receive, or maybe to use bookmarks... if the devs want to make it more resilient in case of errors.
 
I have the same problems with replication, but I do not use HA, so I don't really understand why the replication job becomes faulty.

After this happens I tried to remove the replication job and also destroy the ZFS image on the replication server, as mentioned above. After doing this I was able to recreate the replication job and everything worked fine for a day. Unfortunately exactly the same error occurred again the next day.

So I don't know how to fix this permanently. Having to delete the job and destroy the ZFS image is not really a solution!

Any ideas or hints?

Thanks a lot
 
Hi @andi77,

If you do not want to re-create the replication tasks again and again, use pve-zsync. I can tell you that it is rock solid. I set it up, forgot about the problem, and have been using it for many months on several different Proxmox clusters.
Try it and then tell me if I was wrong ;)
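A minimal sketch of how such a job can be set up (the target IP, pool/dataset, job name and snapshot count below are examples, adjust them to your setup):

Code:
# create a recurring sync job for VM 101 towards the other node
pve-zsync create --source 101 --dest 10.199.0.21:rpool/backup --name vm101sync --maxsnap 7 --verbose
# check the configured jobs and their state
pve-zsync list
pve-zsync status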
 
Well, I usually like to use "out of the box" functions. So it would for sure be better to have the problem with Proxmox Replication just fixed :)
 
Nobody is perfect, and I am sure that the Proxmox devs will try to do the best job. The good news is that we can use another tool for the same task.
 
Sure, in my opinion the Proxmox guys did a fantastic job!

The fact is that the new replication feature is very nice and I would love to continue to use it. Unfortunately it seems to be a bit buggy.

I have now checked pve-zsync for the first time and it seems to be similar to the replication feature. I would like to stay with the GUI replication feature, but if we can't find a solution I may have to switch to pve-zsync.
 
@wolfgang

Any idea how to solve the "recurring" problem?

Here is my log:
Code:
2017-11-20 09:14:00 506-0: start replication job
2017-11-20 09:14:00 506-0: guest => VM 506, running => 6157
2017-11-20 09:14:00 506-0: volumes => local-zfs:vm-506-disk-1
2017-11-20 09:14:01 506-0: create snapshot '__replicate_506-0_1511165640__' on local-zfs:vm-506-disk-1
2017-11-20 09:14:01 506-0: full sync 'local-zfs:vm-506-disk-1' (__replicate_506-0_1511165640__)
2017-11-20 09:14:01 506-0: full send of rpool/data/vm-506-disk-1@__replicate_506-0_1511165640__ estimated size is 1.69G
2017-11-20 09:14:01 506-0: total estimated size is 1.69G
2017-11-20 09:14:01 506-0: TIME        SENT   SNAPSHOT
2017-11-20 09:14:01 506-0: rpool/data/vm-506-disk-1    name    rpool/data/vm-506-disk-1    -
2017-11-20 09:14:01 506-0: volume 'rpool/data/vm-506-disk-1' already exists
2017-11-20 09:14:01 506-0: warning: cannot send 'rpool/data/vm-506-disk-1@__replicate_506-0_1511165640__': signal received
2017-11-20 09:14:01 506-0: cannot send 'rpool/data/vm-506-disk-1': I/O error
2017-11-20 09:14:01 506-0: command 'zfs send -Rpv -- rpool/data/vm-506-disk-1@__replicate_506-0_1511165640__' failed: exit code 1
2017-11-20 09:14:01 506-0: delete previous replication snapshot '__replicate_506-0_1511165640__' on local-zfs:vm-506-disk-1
2017-11-20 09:14:01 506-0: end replication job with error: command 'set -o pipefail && pvesm export local-zfs:vm-506-disk-1 zfs - -with-snapshots 1 -snapshot __replicate_506-0_1511165640__ | /usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=c2b1px' root@10.0.0.100 -- pvesm import local-zfs:vm-506-disk-1 zfs - -with-snapshots 1' failed: exit code 255
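The line "volume 'rpool/data/vm-506-disk-1' already exists" suggests that a stale copy of the disk is still sitting on the target node, so the full sync cannot import it. A sketch of how this could be checked and cleaned up on the target (c2b1px in this log), assuming the VM is not running from that copy:

Code:
# on the target node
zfs list -t all -r rpool/data | grep vm-506-disk-1   # look for the stale replica and its snapshots
zfs destroy -r rpool/data/vm-506-disk-1              # remove it, snapshots included
# then let the job run at its next scheduled time, trigger it via
# "Schedule now" in the GUI, or (if your pvesr version supports it):
pvesr schedule-now 506-0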
 
I now have the same problem on a different node with another VM. :-(
And there seems to be no "permanent" or even "semi-permanent" solution for it.

If I remove the replication job and also the ZFS volume from the backup server, and then recreate the replication, it works for a day, and then the problem occurs again.

The strange thing is that this actually happens only for one VM per node.
 
I have to withdraw my statement about "only one VM per node".
A few seconds ago another VM's replication broke with the same error, on a node where another VM's replication was already broken.

So it seems that, as time passes, all replications will end up in a failed state :-(
 
Because it seems that there is no solution for the replication problem, I configured pve-zsync tasks instead of replication.
On the first few syncs everything seemed to work fine, but after a while one VM started showing an "ERROR" state.

Here is the full error on a manual sync:
Code:
cannot receive new filesystem stream: destination has snapshots (eg. rpool/data/vm-150-disk-1)


I tried to delete the snapshots with:
Code:
zfs destroy rpool/data/vm-150-disk-1@rep_test_12017-11-24_14:28:40
zfs destroy -R vm-150-disk-1

After doing this, the first sync works well, but at the end it tells me:
Code:
GET ERROR:
        could not find any snapshots to destroy; check snapshot names.

The next time you run the sync, the "cannot receive new filesystem stream: destination has snapshots" error occurs again.

What am I doing wrong?

Please help

thx for any hint
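For reference, the "destination has snapshots" error generally means the destination dataset still carries snapshots from an earlier replication or pve-zsync run that the new job does not know about. A sketch of how to check and clean the destination (the dataset path is taken from the error message above; the backup host and job name are placeholders):

Code:
# on the backup/destination node
zfs list -t snapshot -r rpool/data/vm-150-disk-1   # see which old snapshots are still there
zfs destroy -r rpool/data/vm-150-disk-1            # remove the whole stale replica with its snapshots
# then run the sync again from the source node, e.g.:
pve-zsync sync --source 150 --dest <backup-host>:rpool/data --name <jobname> --verbose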
 
