Migration failed (while replication seemed ok)

le_top

Renowned Member
Sep 6, 2013
I tried to migrate some servers off a node ahead of a hardware intervention.
The migration failed, and it was yet another nightmare to understand why.
Replication seemed to be working, yet the migration of several machines still failed.
I now have to figure out (with the servers down) which replica to recover from and how... I am not sure which server holds the best copy...
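For reference, something like the following (run on each node) should show which side holds the freshest replication snapshots - the dataset name is the one from my setup and pvesr/zfs are the standard Proxmox/ZFS tools, so adjust as needed:

# compare the replication snapshots per node to see which copy is most recent
zfs list -t snapshot -o name,creation -r vmpool/subvol-104-disk-2
# and check the state of the replication jobs from the node that owns the guest
pvesr status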


task started by HA resource agent
2017-12-20 21:53:58 starting migration of CT 104 to node 'p4' (10.0.0.4)
2017-12-20 21:53:58 found local volume 'ZfsStorage:subvol-104-disk-2' (in current VM config)
2017-12-20 21:53:58 start replication job
2017-12-20 21:53:58 guest => CT 104, running => 0
2017-12-20 21:53:58 volumes => ZfsStorage:subvol-104-disk-2
2017-12-20 21:53:58 create snapshot '__replicate_104-2_1513803238__' on ZfsStorage:subvol-104-disk-2
2017-12-20 21:53:58 full sync 'ZfsStorage:subvol-104-disk-2' (__replicate_104-2_1513803238__)
2017-12-20 21:53:59 full send of vmpool/subvol-104-disk-2@__replicate_104-0_1513802705__ estimated size is 4.19G
2017-12-20 21:53:59 send from @__replicate_104-0_1513802705__ to vmpool/subvol-104-disk-2@__replicate_104-2_1513803238__ estimated size is 624B
2017-12-20 21:53:59 total estimated size is 4.19G
2017-12-20 21:53:59 TIME SENT SNAPSHOT
2017-12-20 21:53:59 vmpool/subvol-104-disk-2 name vmpool/subvol-104-disk-2 -
2017-12-20 21:53:59 volume 'vmpool/subvol-104-disk-2' already exists
2017-12-20 21:53:59 command 'zfs send -Rpv -- vmpool/subvol-104-disk-2@__replicate_104-2_1513803238__' failed: got signal 13
send/receive failed, cleaning up snapshot(s)..
2017-12-20 21:53:59 delete previous replication snapshot '__replicate_104-2_1513803238__' on ZfsStorage:subvol-104-disk-2
2017-12-20 21:53:59 end replication job with error: command 'set -o pipefail && pvesm export ZfsStorage:subvol-104-disk-2 zfs - -with-snapshots 1 -snapshot __replicate_104-2_1513803238__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=p4' root@10.0.0.4 -- pvesm import ZfsStorage:subvol-104-disk-2 zfs - -with-snapshots 1' failed: exit code 255
2017-12-20 21:53:59 ERROR: command 'set -o pipefail && pvesm export ZfsStorage:subvol-104-disk-2 zfs - -with-snapshots 1 -snapshot __replicate_104-2_1513803238__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=p4' root@10.0.0.4 -- pvesm import ZfsStorage:subvol-104-disk-2 zfs - -with-snapshots 1' failed: exit code 255
2017-12-20 21:53:59 aborting phase 1 - cleanup resources
2017-12-20 21:53:59 start final cleanup
2017-12-20 21:53:59 ERROR: migration aborted (duration 00:00:02): command 'set -o pipefail && pvesm export ZfsStorage:subvol-104-disk-2 zfs - -with-snapshots 1 -snapshot __replicate_104-2_1513803238__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=p4' root@10.0.0.4 -- pvesm import ZfsStorage:subvol-104-disk-2 zfs - -with-snapshots 1' failed: exit code 255
TASK ERROR: migration aborted
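The key line is volume 'vmpool/subvol-104-disk-2' already exists: the target node already holds a copy of the dataset, so the full zfs send is refused and the sender dies with signal 13 (SIGPIPE, because the receiving end of the pipe closed). Something like this should show what is actually sitting on the target (IP taken from the log above):

# inspect the dataset and all its snapshots on the target node p4
ssh root@10.0.0.4 zfs list -r -t all vmpool/subvol-104-disk-2
# compare with the same dataset on the source node
zfs list -r -t all vmpool/subvol-104-disk-2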
 
The root cause is better found in the report of the migration that failed just before this one...
 
# ha-manager status
quorum OK
master p3 (active, Wed Dec 20 22:04:13 2017)
lrm p1 (idle, Wed Dec 20 22:04:18 2017)
lrm p3 (active, Wed Dec 20 22:04:09 2017)
lrm p4 (active, Wed Dec 20 22:04:16 2017)
service ct:100 (p4, started)
service ct:102 (p3, relocate)
service ct:104 (p3, relocate)
service ct:105 (p3, relocate)
service ct:109 (p4, started)
service ct:223 (p4, started)
service ct:224 (p4, started)
 
Proxmox keeps trying to migrate the machines back every 15 minutes.
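To stop this retry loop while investigating, I believe the HA state of the affected services can be set to 'disabled' and later back to 'started' (service IDs taken from the ha-manager output above):

# temporarily take a service out of HA management so it stops bouncing
ha-manager set ct:104 --state disabled
# when the underlying problem is fixed, hand it back to HA
ha-manager set ct:104 --state started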

The restore of a 300 GB machine is currently running - I am hoping that this is the cause, and I am waiting for it to finish before taking further action.

This is a sample of one of the (failing) relocations initiated by Proxmox:

task started by HA resource agent
2017-12-20 22:13:39 starting migration of CT 104 to node 'p4' (10.0.0.4)
2017-12-20 22:13:39 found local volume 'ZfsStorage:subvol-104-disk-2' (in current VM config)
full send of vmpool/subvol-104-disk-2@__replicate_104-0_1513802705__ estimated size is 4.19G
send from @__replicate_104-0_1513802705__ to vmpool/subvol-104-disk-2@__migration__ estimated size is 624B
total estimated size is 4.19G
TIME SENT SNAPSHOT
vmpool/subvol-104-disk-2 name vmpool/subvol-104-disk-2 -
volume 'vmpool/subvol-104-disk-2' already exists
command 'zfs send -Rpv -- vmpool/subvol-104-disk-2@__migration__' failed: got signal 13
send/receive failed, cleaning up snapshot(s)..
2017-12-20 22:13:39 ERROR: command 'set -o pipefail && pvesm export ZfsStorage:subvol-104-disk-2 zfs - -with-snapshots 0 -snapshot __migration__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=p4' root@10.0.0.4 -- pvesm import ZfsStorage:subvol-104-disk-2 zfs - -with-snapshots 0 -delete-snapshot __migration__' failed: exit code 255
2017-12-20 22:13:39 aborting phase 1 - cleanup resources
2017-12-20 22:13:39 ERROR: found stale volume copy 'ZfsStorage:subvol-104-disk-2' on node 'p4'
2017-12-20 22:13:39 start final cleanup
2017-12-20 22:13:39 ERROR: migration aborted (duration 00:00:01): command 'set -o pipefail && pvesm export ZfsStorage:subvol-104-disk-2 zfs - -with-snapshots 0 -snapshot __migration__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=p4' root@10.0.0.4 -- pvesm import ZfsStorage:subvol-104-disk-2 zfs - -with-snapshots 0 -delete-snapshot __migration__' failed: exit code 255
TASK ERROR: migration aborted
 
I "fixed" my issues:

- For safety, I first took a ZFS backup on the server that was not designated to run the virtual machine (the one holding the stale copy):

zfs snapshot <pool>/<disk>@now
( zfs send <pool>/<disk>@now | zfs receive <pool>/<disk>-backup ) &
# When finished - destroy the blocking ZFS dataset:
zfs destroy -r <pool>/<disk>

This allowed the machine to be migrated back.

I then fixed the replication jobs as usual and was able to migrate the machines again.
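For completeness, "fixing the replications" essentially means cleaning up the diverged dataset on the target so the next run can do a clean full sync, and then re-triggering the job. If I understand the tooling correctly, that looks roughly like this (job ID 104-2 taken from the snapshot names in the log):

# list the configured replication jobs and their current state
pvesr list
pvesr status
# once the target side is clean, trigger the job immediately instead of waiting for the schedule
pvesr schedule-now 104-2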


Some of the things that caused this:
- The first machine (p3) had been removed from the HA group in the past. This happened while that machine was not operational and the HA group configuration was committed.
- As a result, after migrating from machine p4 to machine p3, the HA manager decided to migrate the machine back to p4, because p3 was not in the HA group (membership can be checked as shown below).
- However, synchronisation from p3 to p4 was broken for reasons I do not fully understand yet (snapshots, etc.?). Therefore the migration failed.
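If I am right about the cause, the group membership can be verified directly; as far as I know this dumps all HA groups and their member nodes:

# show all HA groups and their member nodes - p3 should be listed in the relevant group
ha-manager groupconfig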

I think this can be avoided by:
- Not allowing a migration from one machine to another if an HA group exists and/or if the migration would result in moving the machine back to the physical server it came from.
- Performing checks on the ZFS configuration/status (snapshots, etc.), warning about potential future replication issues, and proposing ((semi-)automated) fixes - see the sketch below.
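As a rough sketch of such a check (not an official tool, just a diff of the snapshot lists on two nodes; the dataset and target IP are the ones from this thread):

#!/bin/bash
# crude replication sanity check: compare snapshot names of a dataset on two nodes
DATASET="vmpool/subvol-104-disk-2"   # dataset to check (example from this thread)
TARGET="10.0.0.4"                    # node that is supposed to hold the replica

diff <(zfs list -H -t snapshot -o name -r "$DATASET" | sort) \
     <(ssh root@"$TARGET" zfs list -H -t snapshot -o name -r "$DATASET" | sort) \
  && echo "snapshots match" \
  || echo "WARNING: snapshot mismatch - replication may have diverged"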
 
