Migration failed (while replication seemed ok)

le_top

Renowned Member
Sep 6, 2013
I tried to migrate some servers off a node ahead of a hardware intervention.
The migration failed, and it was yet another nightmare to understand why.
Replication seemed to be working, yet the migration of several machines still failed.
I now have to figure out (with the servers down) which replica to recover from and how... I am not sure which server holds the best copy...
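For reference, something like the following (run on each node) should show which side holds the freshest replication snapshots - the dataset name is the one from my setup and pvesr/zfs are the standard Proxmox/ZFS tools, so adjust as needed:

# compare the replication snapshots per node to see which copy is most recent
zfs list -t snapshot -o name,creation -r vmpool/subvol-104-disk-2
# and check the state of the replication jobs from the node that owns the guest
pvesr status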


task started by HA resource agent
2017-12-20 21:53:58 starting migration of CT 104 to node 'p4' (10.0.0.4)
2017-12-20 21:53:58 found local volume 'ZfsStorage:subvol-104-disk-2' (in current VM config)
2017-12-20 21:53:58 start replication job
2017-12-20 21:53:58 guest => CT 104, running => 0
2017-12-20 21:53:58 volumes => ZfsStorage:subvol-104-disk-2
2017-12-20 21:53:58 create snapshot '__replicate_104-2_1513803238__' on ZfsStorage:subvol-104-disk-2
2017-12-20 21:53:58 full sync 'ZfsStorage:subvol-104-disk-2' (__replicate_104-2_1513803238__)
2017-12-20 21:53:59 full send of vmpool/subvol-104-disk-2@__replicate_104-0_1513802705__ estimated size is 4.19G
2017-12-20 21:53:59 send from @__replicate_104-0_1513802705__ to vmpool/subvol-104-disk-2@__replicate_104-2_1513803238__ estimated size is 624B
2017-12-20 21:53:59 total estimated size is 4.19G
2017-12-20 21:53:59 TIME SENT SNAPSHOT
2017-12-20 21:53:59 vmpool/subvol-104-disk-2 name vmpool/subvol-104-disk-2 -
2017-12-20 21:53:59 volume 'vmpool/subvol-104-disk-2' already exists
2017-12-20 21:53:59 command 'zfs send -Rpv -- vmpool/subvol-104-disk-2@__replicate_104-2_1513803238__' failed: got signal 13
send/receive failed, cleaning up snapshot(s)..
2017-12-20 21:53:59 delete previous replication snapshot '__replicate_104-2_1513803238__' on ZfsStorage:subvol-104-disk-2
2017-12-20 21:53:59 end replication job with error: command 'set -o pipefail && pvesm export ZfsStorage:subvol-104-disk-2 zfs - -with-snapshots 1 -snapshot __replicate_104-2_1513803238__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=p4' root@10.0.0.4 -- pvesm import ZfsStorage:subvol-104-disk-2 zfs - -with-snapshots 1' failed: exit code 255
2017-12-20 21:53:59 ERROR: command 'set -o pipefail && pvesm export ZfsStorage:subvol-104-disk-2 zfs - -with-snapshots 1 -snapshot __replicate_104-2_1513803238__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=p4' root@10.0.0.4 -- pvesm import ZfsStorage:subvol-104-disk-2 zfs - -with-snapshots 1' failed: exit code 255
2017-12-20 21:53:59 aborting phase 1 - cleanup resources
2017-12-20 21:53:59 start final cleanup
2017-12-20 21:53:59 ERROR: migration aborted (duration 00:00:02): command 'set -o pipefail && pvesm export ZfsStorage:subvol-104-disk-2 zfs - -with-snapshots 1 -snapshot __replicate_104-2_1513803238__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=p4' root@10.0.0.4 -- pvesm import ZfsStorage:subvol-104-disk-2 zfs - -with-snapshots 1' failed: exit code 255
TASK ERROR: migration aborted
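The key line is volume 'vmpool/subvol-104-disk-2' already exists: the target node already holds a copy of the dataset, so the full zfs send is refused and the sender dies with signal 13 (SIGPIPE, because the receiving end of the pipe closed). Something like this should show what is actually sitting on the target (IP taken from the log above):

# inspect the dataset and all its snapshots on the target node p4
ssh root@10.0.0.4 zfs list -r -t all vmpool/subvol-104-disk-2
# compare with the same dataset on the source node
zfs list -r -t all vmpool/subvol-104-disk-2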
 
The root cause is better found in the report of the migration that failed just before this one...
 
# ha-manager status
quorum OK
master p3 (active, Wed Dec 20 22:04:13 2017)
lrm p1 (idle, Wed Dec 20 22:04:18 2017)
lrm p3 (active, Wed Dec 20 22:04:09 2017)
lrm p4 (active, Wed Dec 20 22:04:16 2017)
service ct:100 (p4, started)
service ct:102 (p3, relocate)
service ct:104 (p3, relocate)
service ct:105 (p3, relocate)
service ct:109 (p4, started)
service ct:223 (p4, started)
service ct:224 (p4, started)
 
Proxmox keeps trying to migrate the machines back every 15 minutes.
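To stop this retry loop while investigating, I believe the HA state of the affected services can be set to 'disabled' and later back to 'started' (service IDs taken from the ha-manager output above):

# temporarily take a service out of HA management so it stops bouncing
ha-manager set ct:104 --state disabled
# when the underlying problem is fixed, hand it back to HA
ha-manager set ct:104 --state started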

The restore of a 300 GB machine is currently running - I am hoping that this is the cause, and I am waiting for it to finish before taking further action.

This is a sample of one of the (failing) relocations initiated by Proxmox:

task started by HA resource agent
2017-12-20 22:13:39 starting migration of CT 104 to node 'p4' (10.0.0.4)
2017-12-20 22:13:39 found local volume 'ZfsStorage:subvol-104-disk-2' (in current VM config)
full send of vmpool/subvol-104-disk-2@__replicate_104-0_1513802705__ estimated size is 4.19G
send from @__replicate_104-0_1513802705__ to vmpool/subvol-104-disk-2@__migration__ estimated size is 624B
total estimated size is 4.19G
TIME SENT SNAPSHOT
vmpool/subvol-104-disk-2 name vmpool/subvol-104-disk-2 -
volume 'vmpool/subvol-104-disk-2' already exists
command 'zfs send -Rpv -- vmpool/subvol-104-disk-2@__migration__' failed: got signal 13
send/receive failed, cleaning up snapshot(s)..
2017-12-20 22:13:39 ERROR: command 'set -o pipefail && pvesm export ZfsStorage:subvol-104-disk-2 zfs - -with-snapshots 0 -snapshot __migration__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=p4' root@10.0.0.4 -- pvesm import ZfsStorage:subvol-104-disk-2 zfs - -with-snapshots 0 -delete-snapshot __migration__' failed: exit code 255
2017-12-20 22:13:39 aborting phase 1 - cleanup resources
2017-12-20 22:13:39 ERROR: found stale volume copy 'ZfsStorage:subvol-104-disk-2' on node 'p4'
2017-12-20 22:13:39 start final cleanup
2017-12-20 22:13:39 ERROR: migration aborted (duration 00:00:01): command 'set -o pipefail && pvesm export ZfsStorage:subvol-104-disk-2 zfs - -with-snapshots 0 -snapshot __migration__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=p4' root@10.0.0.4 -- pvesm import ZfsStorage:subvol-104-disk-2 zfs - -with-snapshots 0 -delete-snapshot __migration__' failed: exit code 255
TASK ERROR: migration aborted
 
I "fixed" my issues:

- For safety, I first took a ZFS backup on the server that was not designated to run the virtual machine (the one holding the stale copy):

zfs snapshot <pool>/<disk>@now
( zfs send <pool>/<disk>@now | zfs receive <pool>/<disk>-backup ) &
# When finished - destroy the blocking ZFS dataset:
zfs destroy -r <pool>/<disk>

This allowed the machine to be migrated back.

I then fixed the replication jobs as usual and was able to migrate the machines again.
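For completeness, "fixing the replications" essentially means cleaning up the diverged dataset on the target so the next run can do a clean full sync, and then re-triggering the job. If I understand the tooling correctly, that looks roughly like this (job ID 104-2 taken from the snapshot names in the log):

# list the configured replication jobs and their current state
pvesr list
pvesr status
# once the target side is clean, trigger the job immediately instead of waiting for the schedule
pvesr schedule-now 104-2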


Some of the things that caused this:
- The first machine (p3) had been removed from the HA group in the past. This happened while that machine was not operational and the HA group configuration was committed.
- As a result, after migrating from machine p4 to machine p3, the HA manager decided to migrate the machine back to p4, because p3 was not in the HA group (membership can be checked as shown below).
- However, synchronisation from p3 to p4 was broken for reasons I do not fully understand yet (snapshots, etc.?). Therefore the migration failed.
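If I am right about the cause, the group membership can be verified directly; as far as I know this dumps all HA groups and their member nodes:

# show all HA groups and their member nodes - p3 should be listed in the relevant group
ha-manager groupconfig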

I think this can be avoided by:
- Not allowing a migration from one machine to another if an HA group exists and/or if the migration would result in moving the machine back to the physical server it came from.
- Performing checks on the ZFS configuration/status (snapshots, etc.), warning about potential future replication issues, and proposing ((semi-)automated) fixes - see the sketch below.
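As a rough sketch of such a check (not an official tool, just a diff of the snapshot lists on two nodes; the dataset and target IP are the ones from this thread):

#!/bin/bash
# crude replication sanity check: compare snapshot names of a dataset on two nodes
DATASET="vmpool/subvol-104-disk-2"   # dataset to check (example from this thread)
TARGET="10.0.0.4"                    # node that is supposed to hold the replica

diff <(zfs list -H -t snapshot -o name -r "$DATASET" | sort) \
     <(ssh root@"$TARGET" zfs list -H -t snapshot -o name -r "$DATASET" | sort) \
  && echo "snapshots match" \
  || echo "WARNING: snapshot mismatch - replication may have diverged"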
 
