VM Lost after HA Fail

cl4x

Member
Dec 17, 2019
Scenario (PVE 6.1):

PVE1 host:
- no vm -

PVE2 host:
- VM100: on shared iSCSI, HA enabled
- VM101: on local LVM, HA enabled

PVE2 failed, then VM100 moved to PVE1 and works fine, but VM101 also appears on PVE1 and, of course, goes into an error state (because its storage is on PVE2).

Now PVE2 is back up and running fine, but I have a problem recovering the VM, because its config is on PVE1 while the disk is on PVE2's local LVM storage.

Any operation fails with this error:

2019-12-17 17:51:53 ERROR: Failed to sync data - command 'set -o pipefail && pvesm export local-lvm:vm-101-disk-0 raw+size - -with-snapshots 0 | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve2' root@192.168.4.82 -- pvesm import local-lvm:vm-101-disk-0 raw+size - -with-snapshots 0' failed: exit code 255
2019-12-17 17:51:53 aborting phase 1 - cleanup resources
2019-12-17 17:51:53 ERROR: found stale volume copy 'local-lvm:vm-101-disk-0' on node 'pve2'
2019-12-17 17:51:53 ERROR: migration aborted (duration 00:00:02): Failed to sync data - command 'set -o pipefail && pvesm export local-lvm:vm-101-disk-0 raw+size - -with-snapshots 0 | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve2' root@192.168.4.82 -- pvesm import local-lvm:vm-101-disk-0 raw+size - -with-snapshots 0' failed: exit code 255

TASK ERROR: migration aborted

Moving the disk is impossible, and a storage rescan shows no results. Any ideas how to unlock the VM and reassign it to PVE2?
Thanks
 
The VM config is just a flat text file sitting in /etc/pve/nodes/hostname/qemu-server/. Manually move it.
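
A minimal sketch of that, using the node names and VMID from this thread (adjust to your setup); if the aborted migration left the VM locked, clear the lock first:

# on pve1, which currently holds the config
qm unlock 101
mv /etc/pve/nodes/pve1/qemu-server/101.conf /etc/pve/nodes/pve2/qemu-server/101.conf

Since /etc/pve is the clustered pmxcfs, the move is visible on all nodes immediately; you probably also want to take the VM out of HA beforehand so the HA stack doesn't act on it again while you move the file.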
 
OK, I'll try. But this is a bug: why does the HA system move the VM if it cannot be moved?
 
I guess it depends on how you look at it. Why add the VM to HA in the first place if the storage only exists on one node? Don't add it to HA and you won't have this issue.
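
For reference, taking the resource out of HA is a single command (sketch, with the VM ID from this thread):

ha-manager remove vm:101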
 
Mmh. If I don't use virtualization, I don't have this issue either.
lol

Then don't use it; you also can't run multiple OSes if you don't use virtualization. Your other option is to create an HA group with the restricted flag set, so that the VM is only allowed to run on specific nodes. However, I'd just not put the VM in HA, as it's pointless here. This isn't a bug.
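
A rough sketch of the restricted-group option, with the node name and VM ID from this thread (the group name is made up, pick your own):

ha-manager groupadd only-pve2 --nodes pve2 --restricted 1
ha-manager add vm:101 --group only-pve2

With --restricted set, the resource is only ever started on nodes that are members of the group, so a VM whose disk exists only on pve2 won't be recovered onto a node that can't see its storage.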
 
Then don't use it; you also can't run multiple OSes if you don't use virtualization.

I know the advantages of virtualization well; at the moment I have almost 100 nodes deployed at customer sites. But now I'm trying to evaluate whether I can also propose an alternative to the classic VMware.

This isn't a bug.

If you want to talk seriously, let's talk seriously and without jokes.

If the system allows me to put a machine in HA and then erroneously moves the VM config, causing me to lose it permanently due to a split, this is a bug. Without a doubt it is a bug.

It could be solved by:

- preventing the user from putting a machine into HA when it would be permanently damaged (warning me that HA can only work with shared storage, and that otherwise I have to provide a replication mechanism)

- or not starting the HA mechanism at all if it would damage the VM (by investing more in the accuracy of the preliminary HA checks)

This is my simple advice. I have worked on enterprise virtualization for 12 years; it is absurd that a config error can destroy a VM. It is impossible not to agree with this and still insist that it is not a bug.
 
