Hi
I am trying to use Storage Replication.
I have deployed two servers: nodeA and nodeB.
I activated Storage Replication for a container (CT 101).
The first replication ran OK...
I can migrate the container between nodeA and nodeB.
I also activated HA between these two nodes! I know, a two-node HA setup is not recommended... but don't worry, it's just a lab environment!
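For reference, the setup was roughly along these lines (just a sketch from memory; the job ID, the schedule and nodeB's hostname 'pve02' are examples, only 'pve01' is nodeA's real name):

# on nodeA (pve01), where CT 101 lives on the local ZFS storage 'STG1'
pvesr create-local-job 101-0 pve02 --schedule '*/15'   # replicate CT 101 to nodeB every 15 minutes
ha-manager add ct:101 --state started                  # put the container under HA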
So when nodeA crashed, the container was restarted on nodeB by HA. That is the expected behavior.
But when nodeA came back online, I could not migrate the container from nodeB back to nodeA.
I got this error:
task started by HA resource agent
2017-11-01 14:44:52 starting migration of CT 101 to node 'pve01' (172.16.0.10)
2017-11-01 14:44:52 found local volume 'STG1:subvol-101-disk-1' (in current VM config)
full send of ZFS-LOCAL/subvol-101-disk-1@__replicate_101-0_1509553286__ estimated size is 419M
send from @__replicate_101-0_1509553286__ to ZFS-LOCAL/subvol-101-disk-1@__migration__ estimated size is 6.80M
total estimated size is 426M
TIME SENT SNAPSHOT
14:44:54 11.3M ZFS-LOCAL/subvol-101-disk-1@__replicate_101-0_1509553286__
14:44:55 34.6M ZFS-LOCAL/subvol-101-disk-1@__replicate_101-0_1509553286__
14:44:56 53.4M ZFS-LOCAL/subvol-101-disk-1@__replicate_101-0_1509553286__
14:44:57 69.3M ZFS-LOCAL/subvol-101-disk-1@__replicate_101-0_1509553286__
14:44:58 89.2M ZFS-LOCAL/subvol-101-disk-1@__replicate_101-0_1509553286__
14:44:59 111M ZFS-LOCAL/subvol-101-disk-1@__replicate_101-0_1509553286__
14:45:00 121M ZFS-LOCAL/subvol-101-disk-1@__replicate_101-0_1509553286__
14:45:01 124M ZFS-LOCAL/subvol-101-disk-1@__replicate_101-0_1509553286__
cannot receive new filesystem stream: checksum mismatch or incomplete stream
cannot open 'ZFS-LOCAL/subvol-101-disk-1': dataset does not exist
command 'zfs recv -F -- ZFS-LOCAL/subvol-101-disk-1' failed: exit code 1
command 'zfs send -Rpv -- ZFS-LOCAL/subvol-101-disk-1@__migration__' failed: got signal 13
send/receive failed, cleaning up snapshot(s)..
2017-11-01 14:45:02 ERROR: command 'set -o pipefail && pvesm export STG1:subvol-101-disk-1 zfs - -with-snapshots 0 -snapshot __migration__ | /usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=pve01' root@172.16.0.10 -- pvesm import STG1:subvol-101-disk-1 zfs - -with-snapshots 0 -delete-snapshot __migration__' failed: exit code 1
2017-11-01 14:45:02 aborting phase 1 - cleanup resources
2017-11-01 14:45:02 ERROR: found stale volume copy 'STG1:subvol-101-disk-1' on node 'pve01'
2017-11-01 14:45:02 start final cleanup
2017-11-01 14:45:02 ERROR: migration aborted (duration 00:00:10): command 'set -o pipefail && pvesm export STG1:subvol-101-disk-1 zfs - -with-snapshots 0 -snapshot __migration__ | /usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=pve01' root@172.16.0.10 -- pvesm import STG1:subvol-101-disk-1 zfs - -with-snapshots 0 -delete-snapshot __migration__' failed: exit code 1
TASK ERROR: migration aborted
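The stale copy that the log complains about can be checked on pve01 with plain ZFS tools, something like this (pool and dataset names taken from the log above):

ssh root@pve01 zfs list -r -t all ZFS-LOCAL
# the leftover ZFS-LOCAL/subvol-101-disk-1 (and any replication snapshots that survived the crash) show up here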
To work around this BAD behavior, I made a script on nodeB that recreates the ZFS pool on nodeA (wiping the stale copy along with it) and then performs the migration. The script below does the job:
#!/bin/bash
# destroy and recreate the pool on nodeA (pve01), which also removes the stale subvol-101-disk-1
ssh root@pve01 zpool destroy ZFS-LOCAL -f
ssh root@pve01 zpool create ZFS-LOCAL /dev/vdb -f
# with the stale copy gone, the migration back to nodeA succeeds
ha-manager migrate ct:101 pve01
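(A less brutal variant would probably be to destroy only the stale dataset instead of the whole pool, something like the sketch below, though I have not tested whether the migration then works:)

ssh root@pve01 zfs destroy -r ZFS-LOCAL/subvol-101-disk-1   # remove just the stale copy and its snapshots
ha-manager migrate ct:101 pve01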
I don't know if this is a bug or what!
Whatever it is, I don't have a clue how to fix it properly!
Can somebody help???
Thanks a lot