VM Lost after HA Fail

cl4x

Member
Dec 17, 2019
Scenario (PVE 6.1):

PVE1 host:
- no vm -

PVE2 host:
- VM100: on shared iSCSI, HA enabled
- VM101: on local LVM, HA enabled

PVE2 failed, then VM100 moved to PVE1 and works fine, but VM101 also appears on PVE1 and, of course, goes into an error state (because its storage is on PVE2).

Now PVE2 is back up and running fine, but I have a problem recovering the VM, because its config is on PVE1 while the disk is on PVE2's local LVM storage.

Any operation fails with this error:

2019-12-17 17:51:53 ERROR: Failed to sync data - command 'set -o pipefail && pvesm export local-lvm:vm-101-disk-0 raw+size - -with-snapshots 0 | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve2' root@192.168.4.82 -- pvesm import local-lvm:vm-101-disk-0 raw+size - -with-snapshots 0' failed: exit code 255
2019-12-17 17:51:53 aborting phase 1 - cleanup resources
2019-12-17 17:51:53 ERROR: found stale volume copy 'local-lvm:vm-101-disk-0' on node 'pve2'
2019-12-17 17:51:53 ERROR: migration aborted (duration 00:00:02): Failed to sync data - command 'set -o pipefail && pvesm export local-lvm:vm-101-disk-0 raw+size - -with-snapshots 0 | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve2' root@192.168.4.82 -- pvesm import local-lvm:vm-101-disk-0 raw+size - -with-snapshots 0' failed: exit code 255

TASK ERROR: migration aborted

Moving the disk is impossible, and a storage rescan shows no results. Any ideas how to unlock the VM and reassign it to PVE2?
Thanks
 
The VM config is just a flat text file sitting in /etc/pve/nodes/hostname/qemu-server/. Manually move it.
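
A minimal sketch of that, using the node names and VMID from this thread (adjust to your setup); if the aborted migration left the VM locked, clear the lock first:

# on pve1, which currently holds the config
qm unlock 101
mv /etc/pve/nodes/pve1/qemu-server/101.conf /etc/pve/nodes/pve2/qemu-server/101.conf

Since /etc/pve is the clustered pmxcfs, the move is visible on all nodes immediately; you probably also want to take the VM out of HA beforehand so the HA stack doesn't act on it again while you move the file.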
 
OK, I'll try. But this is a bug: why does the HA system move the VM if it cannot be moved?
 
I guess it depends on how you look at it. Why add the VM to HA in the first place if the storage only exists on one node? Don't add it to HA and you won't have this issue.
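
For reference, taking the resource out of HA is a single command (sketch, with the VM ID from this thread):

ha-manager remove vm:101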
 
Mmh. If I don't use virtualization, I don't have this issue either.
lol

Then don't use it; you also can't run multiple OSes if you don't use virtualization. Your other option is to create an HA group with the restricted flag set, so that the VM is only allowed to run on specific nodes. However, I'd just not put the VM in HA, as it's pointless here. This isn't a bug.
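
A rough sketch of the restricted-group option, with the node name and VM ID from this thread (the group name is made up, pick your own):

ha-manager groupadd only-pve2 --nodes pve2 --restricted 1
ha-manager add vm:101 --group only-pve2

With --restricted set, the resource is only ever started on nodes that are members of the group, so a VM whose disk exists only on pve2 won't be recovered onto a node that can't see its storage.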
 
Then don't use it; you also can't run multiple OSes if you don't use virtualization.

I know the advantages of virtualization well; at the moment I have almost 100 nodes deployed at customer sites. But now I'm trying to evaluate whether I can also propose an alternative to the classic VMware.

This isn't a bug.

If you want to talk seriously, let's talk seriously and without jokes.

If the system allows me to put a machine in HA and then erroneously moves the VM config, causing me to lose it permanently due to a split, this is a bug. Without a doubt it is a bug.

It could be solved by:

- preventing the user from putting a machine into HA when it would be permanently damaged (warning me that HA can only work with shared storage, and that otherwise I have to provide a replication mechanism)

- or not starting the HA mechanism at all if it would damage the VM (by investing more in the accuracy of the preliminary HA checks)

This is my simple advice. I have worked on enterprise virtualization for 12 years; it is absurd that a config error can destroy a VM. It is impossible not to agree with this and still insist that it is not a bug.
 
