I have a test cluster with 3 nodes running PVE 5.0, and I've managed to set up ZFS replication and HA.
1.) HA failover was not working until I created an HA group and put my CT in it (originally I assumed that any running node would be used when no group is assigned to the resource; see the sketch after this list).
2.) When I manually migrated the CT from node1 to node2, the replication changed direction from node1->node2 to node2->node1. This is cool. But when HA failover finally took place after node1 failed and started my replica on node2, the direction did not change. Now it's not even possible to migrate the CT back to the node where it ran before the HA failover.
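For reference, this is roughly how I created the group and registered the CT as an HA resource (the group name is mine and the node names are from my test setup, so adjust as needed):

# create an HA group restricted to the three test nodes
ha-manager groupadd testgroup --nodes "virt1,virt2,virt3"

# register the container as an HA resource and pin it to the group
ha-manager add ct:100 --group testgroup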
This is the output:
2017-10-11 09:38:07 shutdown CT 100
2017-10-11 09:38:07 # lxc-stop -n 100 --timeout 180
2017-10-11 09:38:12 # lxc-wait -n 100 -t 5 -s STOPPED
2017-10-11 09:38:12 starting migration of CT 100 to node 'virt1' (10.11.56.141)
2017-10-11 09:38:12 found local volume 'vps:subvol-100-disk-1' (in current VM config)
send from @ to tank/vps/subvol-100-disk-1@__replicate_100-0_1507705201__ estimated size is 437M
send from @__replicate_100-0_1507705201__ to tank/vps/subvol-100-disk-1@__migration__ estimated size is 1.27M
total estimated size is 438M
tank/vps/subvol-100-disk-1 name tank/vps/subvol-100-disk-1 -
volume 'tank/vps/subvol-100-disk-1' already exists
TIME SENT SNAPSHOT
command 'zfs send -Rpv -- tank/vps/subvol-100-disk-1@__migration__' failed: got signal 13
send/receive failed, cleaning up snapshot(s)..
2017-10-11 09:38:14 ERROR: command 'set -o pipefail && pvesm export vps:subvol-100-disk-1 zfs - -with-snapshots 0 -snapshot __migration__ | /usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=virt1' root@10.11.56.141 -- pvesm import vps:subvol-100-disk-1 zfs - -with-snapshots 0 -delete-snapshot __migration__' failed: exit code 255
2017-10-11 09:38:14 aborting phase 1 - cleanup resources
2017-10-11 09:38:14 ERROR: found stale volume copy 'vps:subvol-100-disk-1' on node 'virt1'
2017-10-11 09:38:14 start final cleanup
2017-10-11 09:38:14 start container on target node
2017-10-11 09:38:14 # /usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=virt1' root@10.11.56.141 pct start 100
2017-10-11 09:38:15 Configuration file 'nodes/virt1/lxc/100.conf' does not exist
2017-10-11 09:38:15 ERROR: command '/usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=virt1' root@10.11.56.141 pct start 100' failed: exit code 255
2017-10-11 09:38:15 ERROR: migration aborted (duration 00:00:09): command 'set -o pipefail && pvesm export vps:subvol-100-disk-1 zfs - -with-snapshots 0 -snapshot __migration__ | /usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=virt1' root@10.11.56.141 -- pvesm import vps:subvol-100-disk-1 zfs - -with-snapshots 0 -delete-snapshot __migration__' failed: exit code 255
TASK ERROR: migration aborted
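If I read the error right, the stale replica dataset on virt1 (left over from the pre-failover replication job) is what blocks the 'pvesm import' step. My plan, untested and with the dataset/resource names taken from my setup, is to check the job state, destroy the stale dataset on virt1, and then retry the migration through the HA stack:

# check the state and direction of the replication job for CT 100
pvesr status

# on virt1: remove the leftover replica dataset that makes 'pvesm import' fail
zfs destroy -r tank/vps/subvol-100-disk-1

# back on node2: request the migration via HA, since the CT is an HA resource
ha-manager migrate ct:100 virt1

Is this the right way to recover, or is there a supported way to make the replication job follow an HA failover?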