Hi Forum.
Today, as the OVH network was struck by a major incident, HA on the cluster went crazy moving VMs around.
Given the severity of the situation, it coped quite well so far... however, I have one remaining VM that is left in no-man's land on HA:
it was moved to an alternate node, but upon network re-establishment it is stuck in an endless failed/aborted migration loop.
I have tried to handle it manually from the HA section of the GUI, to no avail, and I don't know what to do.
The failed migrations are logged as follows:
Code:
task started by HA resource agent
2021-10-13 14:24:44 use dedicated network address for sending migration traffic (10.100.10.51)
2021-10-13 14:24:44 starting migration of VM 30100 to node 'lnd202011a' (10.100.10.51)
2021-10-13 14:24:45 found local, replicated disk 'local-zfs:vm-30100-disk-0' (in current VM config)
2021-10-13 14:24:45 replicating disk images
2021-10-13 14:24:45 start replication job
2021-10-13 14:24:45 guest => VM 30100, running => 0
2021-10-13 14:24:45 volumes => local-zfs:vm-30100-disk-0
2021-10-13 14:24:46 create snapshot '__replicate_30100-0_1634127885__' on local-zfs:vm-30100-disk-0
2021-10-13 14:24:46 using insecure transmission, rate limit: 50 MByte/s
2021-10-13 14:24:46 full sync 'local-zfs:vm-30100-disk-0' (__replicate_30100-0_1634127885__)
2021-10-13 14:24:46 using a bandwidth limit of 50000000 bps for transferring 'local-zfs:vm-30100-disk-0'
volume 'rpool/vm-30100-disk-0' already exists
2021-10-13 14:24:47 file /etc/pve/storage.cfg line 12 (section 'local') - unable to parse value of 'prune-backups': invalid format - format error
2021-10-13 14:24:47 keep-all: property is not defined in schema and the schema does not allow additional properties
2021-10-13 14:24:47 full send of rpool/vm-30100-disk-0@__replicate_30100-0_1604404823__ estimated size is 20.1G
2021-10-13 14:24:47 send from @__replicate_30100-0_1604404823__ to rpool/vm-30100-disk-0@__replicate_30100-0_1634127885__ estimated size is 106M
2021-10-13 14:24:47 total estimated size is 20.2G
2021-10-13 14:24:47 TIME SENT SNAPSHOT rpool/vm-30100-disk-0@__replicate_30100-0_1604404823__
2021-10-13 14:24:47 command 'zfs send -Rpv -- rpool/vm-30100-disk-0@__replicate_30100-0_1634127885__' failed: got signal 13
send/receive failed, cleaning up snapshot(s)..
2021-10-13 14:24:47 delete previous replication snapshot '__replicate_30100-0_1634127885__' on local-zfs:vm-30100-disk-0
2021-10-13 14:24:47 end replication job with error: command 'set -o pipefail && pvesm export local-zfs:vm-30100-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_30100-0_1634127885__ | /usr/bin/cstream -t 50000000' failed: exit code 141
2021-10-13 14:24:47 ERROR: Failed to sync data - command 'set -o pipefail && pvesm export local-zfs:vm-30100-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_30100-0_1634127885__ | /usr/bin/cstream -t 50000000' failed: exit code 141
2021-10-13 14:24:47 aborting phase 1 - cleanup resources
2021-10-13 14:24:47 ERROR: migration aborted (duration 00:00:03): Failed to sync data - command 'set -o pipefail && pvesm export local-zfs:vm-30100-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_30100-0_1634127885__ | /usr/bin/cstream -t 50000000' failed: exit code 141
TASK ERROR: migration aborted
The VMs use local (ZFS) storage with replication... so far, that has worked very well... however, I don't know the right way to recover from this situation, or what causes it.
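For reference, here is roughly what I have been looking at on both nodes to compare the replication state. This is just a sketch of read-only diagnostic commands (the dataset name `rpool/vm-30100-disk-0` is taken from the log above; adjust to your pool layout):

```shell
# Show the state of all replication jobs on this node (last sync, failures)
pvesr status

# List the replication snapshots present for the affected disk,
# to compare them between source and target node
zfs list -t snapshot -o name,creation rpool/vm-30100-disk-0

# Check where HA currently thinks the VM lives
ha-manager status
```

The `volume 'rpool/vm-30100-disk-0' already exists` line makes me suspect a stale copy of the disk on the target node from an earlier replication run, but I would rather confirm before deleting anything.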
Best regards.