Replication to one node failing

fjtrooper
Hello, I have a 3-node cluster and created some replication jobs between the nodes. Suddenly replication to node1 stopped working.
I deleted the job and created it again, but it still doesn't work.
The only workaround was to back up the VM, delete it, and restore it to a different ID.
Is it possible to solve this without deleting and restoring?
Here is my error log.
THX!!!!


2017-07-13 14:40:00 101-0: start replication job
2017-07-13 14:40:00 101-0: guest => VM 101, running => 0
2017-07-13 14:40:00 101-0: volumes => zfspool:vm-101-disk-1
2017-07-13 14:40:01 101-0: create snapshot '__replicate_101-0_1499953200__' on zfspool:vm-101-disk-1
2017-07-13 14:40:01 101-0: full sync 'zfspool:vm-101-disk-1' (__replicate_101-0_1499953200__)
2017-07-13 14:40:02 101-0: delete previous replication snapshot '__replicate_101-0_1499953200__' on zfspool:vm-101-disk-1
2017-07-13 14:40:02 101-0: end replication job with error: command 'set -o pipefail && pvesm export zfspool:vm-101-disk-1 zfs - -with-snapshots 1 -snapshot __replicate_101-0_1499953200__ | /usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=pvetest1' root@10.199.0.21 -- pvesm import zfspool:vm-101-disk-1 zfs - -with-snapshots 1' failed: exit code 255
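For reference, the backup/delete/restore workaround mentioned above looks roughly like this (a minimal sketch only; the new VM ID 201, the dump file path and the storage names are examples, not taken from this setup):

Code:
# back up the VM, remove it, then restore it under a new ID
vzdump 101 --mode stop --storage local
qm destroy 101
qmrestore /var/lib/vz/dump/vzdump-qemu-101-*.vma 201 --storage zfspool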
 
Hi,

when you delete a job you have to wait until the nodes get synced again (about 1 minute).
Then create a new job and it will work.
 
After more than 12 hours I created the replication job again, but I still get the error. Replication to one node fails while replication to the other works fine. The cluster works perfectly and I can SSH between the nodes in both directions.

2017-07-14 06:59:00 101-0: start replication job
2017-07-14 06:59:00 101-0: guest => VM 101, running => 0
2017-07-14 06:59:00 101-0: volumes => zfspool:vm-101-disk-1
2017-07-14 06:59:01 101-0: create snapshot '__replicate_101-0_1500011940__' on zfspool:vm-101-disk-1
2017-07-14 06:59:01 101-0: full sync 'zfspool:vm-101-disk-1' (__replicate_101-0_1500011940__)
2017-07-14 06:59:02 101-0: delete previous replication snapshot '__replicate_101-0_1500011940__' on zfspool:vm-101-disk-1
2017-07-14 06:59:02 101-0: end replication job with error: command 'set -o pipefail && pvesm export zfspool:vm-101-disk-1 zfs - -with-snapshots 1 -snapshot __replicate_101-0_1500011940__ | /usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=pvetest1' root@10.199.0.21 -- pvesm import zfspool:vm-101-disk-1 zfs - -with-snapshots 1' failed: exit code 255
 
SOLVED!

I don't know why, but the ZFS pools on the replica nodes still held replicated images that were out of sync. I manually deleted those images on the replica nodes with "zfs destroy -r <image>", created the jobs again, and now it is working again.
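In case it helps someone, the cleanup looked roughly like this (a sketch; the dataset path below is an example, check "zfs list" for the real names on your nodes):

Code:
# on the replica node that refuses the sync
zfs list -t all | grep vm-101-disk-1     # find the stale replicated image and its snapshots
zfs destroy -r rpool/data/vm-101-disk-1  # remove it, including all snapshots
# then re-create the replication job on the source node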

THX!!
 
Still having replication problems, and I can now tell you how to reproduce it.
I'm testing with 3 nodes and replicating 2 VMs to all nodes. When I configure it for the first time, everything works.
When I test HA by powering off one node (cutting the power), the VM that was running on that node starts on one of the other nodes and runs perfectly, but after that, replication for that VM never works again.
I also can't migrate the VM to another node.
To solve that, I have to manually delete the replicated disk and the VM disk on the other nodes (I use local ZFS), keeping only the VM disk on the node where it is running; then replication and migration work again.

Is this normal, or could it be a bug?

FYI, after I force an HA failover and delete all jobs from the GUI, one job still shows up when I run pvesr list, and I have to force-delete it from the console because it no longer appears in the GUI.
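The console cleanup looks roughly like this (a sketch; the job ID 101-0 is an example, use the ID shown by pvesr list):

Code:
# list the replication jobs this node still knows about
pvesr list
# remove the leftover job; --force removes the job config even if the
# normal cleanup on the (dead) target cannot run
pvesr delete 101-0 --force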


Thanks and good job!!
 
@fjtrooper - I can say that in my case I see the same. Maybe after ZFS 0.7.x becomes available in Proxmox it will be possible to resume an interrupted zfs send/receive, or maybe to use bookmarks... if the devs want to make it more resilient in case of errors.
 
I have the same problems with replication, but I do not use HA, so I don't really understand why the replication job becomes faulty.

After this happens I tried to remove the replication job and also destroy the ZFS image on the replication server, as mentioned above. After doing this I was able to recreate the replication job and everything worked fine for a day. Unfortunately exactly the same error occurred again the next day.

So I don't know how to fix this permanently. Having to delete the job and destroy the ZFS image is not really a solution!

Any ideas or hints?

Thanks a lot
 
Hi @andi77,

If you do not want to re-create the replication tasks again and again, use pve-zsync. I can tell you that it is rock solid. I set it up, forgot about the problem, and have been using it for many months on several different Proxmox clusters.
Try it and then tell me if I was wrong ;)
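A minimal sketch of how such a job can be set up (the target IP, pool/dataset, job name and snapshot count below are examples, adjust them to your setup):

Code:
# create a recurring sync job for VM 101 towards the other node
pve-zsync create --source 101 --dest 10.199.0.21:rpool/backup --name vm101sync --maxsnap 7 --verbose
# check the configured jobs and their state
pve-zsync list
pve-zsync status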
 
Well, I usually like to use "out of the box" functions. So it would for sure be better to have the problem with Proxmox Replication just fixed :)
 
Nobody is perfect, and I am sure that the Proxmox devs will try to do the best job. The good news is that we can use another tool for the same task.
 
Sure, in my opinion the Proxmox guys did a fantastic job!

The fact is that the new replication feature is very nice and I would love to continue to use it. Unfortunately it seems to be a bit buggy.

I have now checked pve-zsync for the first time and it seems to be similar to the replication feature. I would like to stay with the GUI replication feature, but if we can't find a solution I may have to switch to pve-zsync.
 
@wolfgang

Any idea how to solve the "recurring" problem?

Here is my log:
Code:
2017-11-20 09:14:00 506-0: start replication job
2017-11-20 09:14:00 506-0: guest => VM 506, running => 6157
2017-11-20 09:14:00 506-0: volumes => local-zfs:vm-506-disk-1
2017-11-20 09:14:01 506-0: create snapshot '__replicate_506-0_1511165640__' on local-zfs:vm-506-disk-1
2017-11-20 09:14:01 506-0: full sync 'local-zfs:vm-506-disk-1' (__replicate_506-0_1511165640__)
2017-11-20 09:14:01 506-0: full send of rpool/data/vm-506-disk-1@__replicate_506-0_1511165640__ estimated size is 1.69G
2017-11-20 09:14:01 506-0: total estimated size is 1.69G
2017-11-20 09:14:01 506-0: TIME        SENT   SNAPSHOT
2017-11-20 09:14:01 506-0: rpool/data/vm-506-disk-1    name    rpool/data/vm-506-disk-1    -
2017-11-20 09:14:01 506-0: volume 'rpool/data/vm-506-disk-1' already exists
2017-11-20 09:14:01 506-0: warning: cannot send 'rpool/data/vm-506-disk-1@__replicate_506-0_1511165640__': signal received
2017-11-20 09:14:01 506-0: cannot send 'rpool/data/vm-506-disk-1': I/O error
2017-11-20 09:14:01 506-0: command 'zfs send -Rpv -- rpool/data/vm-506-disk-1@__replicate_506-0_1511165640__' failed: exit code 1
2017-11-20 09:14:01 506-0: delete previous replication snapshot '__replicate_506-0_1511165640__' on local-zfs:vm-506-disk-1
2017-11-20 09:14:01 506-0: end replication job with error: command 'set -o pipefail && pvesm export local-zfs:vm-506-disk-1 zfs - -with-snapshots 1 -snapshot __replicate_506-0_1511165640__ | /usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=c2b1px' root@10.0.0.100 -- pvesm import local-zfs:vm-506-disk-1 zfs - -with-snapshots 1' failed: exit code 255
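The line "volume 'rpool/data/vm-506-disk-1' already exists" suggests that a stale copy of the disk is still sitting on the target node, so the full sync cannot import it. A sketch of how this could be checked and cleaned up on the target (c2b1px in this log), assuming the VM is not running from that copy:

Code:
# on the target node
zfs list -t all -r rpool/data | grep vm-506-disk-1   # look for the stale replica and its snapshots
zfs destroy -r rpool/data/vm-506-disk-1              # remove it, snapshots included
# then let the job run at its next scheduled time, trigger it via
# "Schedule now" in the GUI, or (if your pvesr version supports it):
pvesr schedule-now 506-0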
 
I now have the same problem on a different node with another VM. :-(
And there seems to be no "permanent" or even "semi-permanent" solution for it.

If I remove the replication job and also the ZFS volume from the backup server, and then recreate the replication, it works for a day, and then the problem occurs again.

The strange thing is that this actually happens only for one VM per node.
 
I have to withdraw my statement about "only one VM per node".
A few seconds ago another VM's replication broke with the same error, on a node where another VM's replication was already broken.

So it seems that, as time passes, all replications will end up in a failed state :-(
 
Because it seems that there is no solution for the replication problem, I configured pve-zsync tasks instead of replication.
On the first few syncs everything seemed to work fine, but after a while one VM started showing an "ERROR" state.

Here is the full error on a manual sync:
Code:
cannot receive new filesystem stream: destination has snapshots (eg. rpool/data/vm-150-disk-1)


I tried to delete the snapshots with:
Code:
zfs destroy rpool/data/vm-150-disk-1@rep_test_12017-11-24_14:28:40
zfs destroy -R vm-150-disk-1

After doing this, the first sync works well, but at the end it tells me:
Code:
GET ERROR:
        could not find any snapshots to destroy; check snapshot names.

The next time you run the sync, the "cannot receive new filesystem stream: destination has snapshots" error occurs again.

What am I doing wrong?

Please help

thx for any hint
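For reference, the "destination has snapshots" error generally means the destination dataset still carries snapshots from an earlier replication or pve-zsync run that the new job does not know about. A sketch of how to check and clean the destination (the dataset path is taken from the error message above; the backup host and job name are placeholders):

Code:
# on the backup/destination node
zfs list -t snapshot -r rpool/data/vm-150-disk-1   # see which old snapshots are still there
zfs destroy -r rpool/data/vm-150-disk-1            # remove the whole stale replica with its snapshots
# then run the sync again from the source node, e.g.:
pve-zsync sync --source 150 --dest <backup-host>:rpool/data --name <jobname> --verbose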
 
