VMs migrate partially, leaving disks behind

Manny Vazquez

I had a problem with one of the hosts. HA worked fine for 5 VMs, but for 2 others the disks did not migrate (they stayed on the original host).
The host is now OK.
Now when I try to start the VMs, I get the following error.
================================================
(screenshots of the error attached)

TEXT COPIED HERE FROM ONE OF THEM.
-----------------------------------------------------------------------
kvm: -drive file=/dev/zvol/rpool/data/vm-305-disk-1,if=none,id=drive-ide0,format=raw,cache=none,aio=native,detect-zeroes=on: Could not open '/dev/zvol/rpool/data/vm-305-disk-1': No such file or directory
TASK ERROR: start failed: command '/usr/bin/kvm -id 305 -chardev 'socket,id=qmp,path=/var/run/qemu-server/305.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -pidfile /var/run/qemu-server/305.pid -daemonize -smbios 'type=1,uuid=5de2283d-d8bc-4aaa-ae79-7945ebd8d17a' -name w2k8-164-DEV-3395 -smp '4,sockets=2,cores=2,maxcpus=4' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vga std -vnc unix:/var/run/qemu-server/305.vnc,x509,password -no-hpet -cpu 'kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,hv_spinlocks=0x1fff,hv_vapic,hv_time,hv_reset,hv_vpindex,hv_runtime,enforce' -m 8196 -k en-us -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -chardev 'socket,path=/var/run/qemu-server/305.qga,server,nowait,id=qga0' -device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' -device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:2ba1d1feeb0' -drive 'file=/dev/zvol/rpool/data/vm-305-disk-1,if=none,id=drive-ide0,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'ide-hd,bus=ide.0,unit=0,drive=drive-ide0,id=ide0,bootindex=100' -drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' -drive 'file=/dev/zvol/rpool/data/vm-305-disk-2,if=none,id=drive-virtio1,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'virtio-blk-pci,drive=drive-virtio1,id=virtio1,bus=pci.0,addr=0xb' -netdev 'type=tap,id=net0,ifname=tap305i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=52:42:17:2F:14:65,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' -rtc 'driftfix=slew,base=localtime' -global 'kvm-pit.lost_tick_policy=discard'' failed: exit code 1
=========================================================

I see that the problem is that the disks are not on the current host, but I see no way either to move the disks to the current host or to return the VMs to where the disks are. Can someone please help with either scenario?

 
I found somewhat of a solution.

The drives were still on the original host, but I couldn't move them. I tried moving the config files, which did not work either, so I made a copy of the content of the original conf file (cat xxxx.conf), created a new one with a new ID on the original host where the disks are (vi zzzz.conf), and pasted the content in.

They started right away with the NEW IDs and are fully functional, or so they seem.
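
In case it helps anyone, roughly what that looked like (just a sketch; I am using the 305/9305 IDs and node names from the configs posted further down, and assuming the original config had ended up under the other node's directory after the HA move):
Code:
# on vm2, the node that still holds the ZFS volumes
cat /etc/pve/nodes/vm4/qemu-server/305.conf   # read the stranded VM's config via the cluster filesystem
vi /etc/pve/qemu-server/9305.conf             # create a config under a new, unused VMID on this node
# paste the content, keeping the existing disk references
# (ide0: local-zfs:vm-305-disk-1, virtio1: local-zfs:vm-305-disk-2)
qm start 9305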

Now I have another problem: I can't add these to HA; they try to migrate right away and fail, so I guess I will have to live with backups only.
 
What are you trying to do? HA with local disks does not do anything on its own (how can the hosts that are still up get the data?),
and with replication there are a few caveats, see https://pve.proxmox.com/wiki/Storage_Replication
I appreciate that all the support here is free, and I really appreciate the great work that Proxmox has done for the community, but I do not appreciate the lack of interest that you guys (Proxmox Staff Members) show toward the community of people who, for now, cannot afford to pay for support. It is a real letdown. What can we expect if the answers we get (when we get any) are always dismissive, talk down to us, and are actually offensive to our intelligence?

What part of my post is not clear? Why are you even questioning whether replication was working? You can read that 5 VMs migrated fine and only 2 failed. Should that not tell you that replication was working properly, and that the problem is that the other 2 did not replicate / migrate?

In any case, as posted above, I found a workaround myself.

In another post I had asked about the replication issue: why would it work from one host to the others, but not from the other hosts? Nothing but crickets on that post.

Again, I understand this is free and that you guys have no obligation to reply, BUT if you are not going to help at all, please just do not say anything, or close the forum altogether. Your behavior is offensive to the community. Imagine if my team (when we worked on Asterisk) had done the same thing instead of giving it to the community.

I know I am quite possibly going to get deleted from the community, but I am not going to be quiet for fear of that.

Put yourself in the situation of people who have dedicated their lives to giving to the world and therefore cannot afford to pay for support, at least not for now.
 
I appreciate that all the support here is free, and I really appreciate the great work that Proxmox has done for the community, but I do not appreciate the lack of interest that you guys (Proxmox Staff Members) show toward the community of people who, for now, cannot afford to pay for support. It is a real letdown. What can we expect if the answers we get (when we get any) are always dismissive, talk down to us, and are actually offensive to our intelligence?
Huh? Sorry if my post sounded dismissive, that was really not my intention (it probably does not help that my first language is not English). I honestly asked those questions because some things are not clear from your post.

What part of my post is not clear? Why are you even questioning whether replication was working? You can read that 5 VMs migrated fine and only 2 failed. Should that not tell you that replication was working properly, and that the problem is that the other 2 did not replicate / migrate?
You never mentioned replication, only migration (those are two different things).

From your post I gathered this:

you have 3 hosts, at least one with ZFS (not clear if there are other storages)
you have HA enabled
you migrated 5 VMs (online or offline? you did not specify), of which 2 could not start on the new host since the disks were missing

Are all of your VMs on ZFS?
Can you post the content of:

the relevant VM configs (and one from a working VM, to compare)
the content of /etc/pve/storage.cfg

My guess was that you had marked the ZFS storage 'shared' and now wondered why the disks were not migrated, hence my answer.
HA with local storage only makes (partial) sense if you have enabled replication (which you did not mention), and even then with some caveats (e.g. error recovery, as outlined in the link I sent).
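
For reference, replication for VMs on local ZFS is configured per VM, either in the GUI (Datacenter -> Replication) or on the CLI with pvesr. Roughly something like this (a sketch; the VMID, target node and schedule are only examples):
Code:
# create a replication job for VM 305 towards node vm3, running every 15 minutes
# (the job id has the form <vmid>-<jobnumber>)
pvesr create-local-job 305-0 vm3 --schedule '*/15'
pvesr status    # check that the job runs and LastSync advances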
 
Huh? Sorry if my post sounded dismissive, that was really not my intention (it probably does not help that my first language is not English). I honestly asked those questions because some things are not clear from your post.

English is not my mother tongue or first language either, hence more emphasis on care.



You never mentioned replication, only migration (those are two different things).

==> You are correct, maybe I misused the word, but I assumed that in order to migrate automatically (as stated, due to a problem with the host), replication had to be working.
As explained, the host that failed (call it host2) was replicating perfectly fine to the other 2 (call them host3 and host4), although the others do not want to replicate to each other or to this one.

From your post I gathered this:

you have 3 hosts, at least one with ZFS (not clear if there are other storages)
==> All 3 are on ZFS, nothing else.
you have HA enabled
==> Correct, and it had worked fine before. When a host (call it host1) died a while back, it was, after reformatting and fixing the problem, added back into the cluster as what is now host4.
you migrated 5 VMs (online or offline? you did not specify), of which 2 could not start on the new host since the disks were missing
==> I did not migrate them; HA migrated them to HOST4 when HOST2 died. 3 of them came up perfectly fine (about 3 minutes in total), but the other 2 did not; upon investigation, I saw that the hard drives were still on host2.

Are all of your VMs on ZFS?
==> Yes
Can you post the content of:

the relevant VM configs (and one from a working VM, to compare)
==> At this moment all of them are working again, but I will post one from each host.
==============================
9305=> the one that did not migrate on HOST2
----------------------------------------------
root@vm2:/etc/pve/qemu-server# cat 9305.conf
agent: 1
balloon: 1024
bootdisk: ide0
cores: 2
ide0: local-zfs:vm-305-disk-1,size=50G
ide2: none,media=cdrom
memory: 8196
name: w2k8-164-DEV-3395
net0: virtio=52:42:17:2F:14:65,bridge=vmbr0
numa: 0
ostype: w2k8
scsihw: virtio-scsi-pci
smbios1: uuid=5de2283d-d8bc-4aaa-ae79-7945ebd8d17a
sockets: 2
virtio1: local-zfs:vm-305-disk-2,size=300G
root@vm2:/etc/pve/qemu-server#
------------------------------------------------------------
3391 => working VM, no problem => Host3
--------------------------------------------------------
root@vm3:/etc/pve/qemu-server# cat 3391.conf
balloon: 2048
bootdisk: ide0
cores: 4
ide0: local-zfs:vm-3391-disk-1,size=100G
ide2: none,media=cdrom
memory: 12288
name: 3391w2k8
net0: virtio=AA:A4:DF:69:A8:C6,bridge=vmbr0
numa: 0
ostype: w2k8
scsihw: virtio-scsi-pci
smbios1: uuid=dec7c05a-c78a-4504-8fee-0489f209cb0b
sockets: 2
root@vm3:/etc/pve/qemu-server#
--------------------------------------------------------
1601 => working fine, one of the VMs that migrated via HA by itself, host4
---------------------------------------------
root@vm4:/etc/pve/qemu-server# cat 1601.conf
balloon: 2024
bootdisk: ide0
cores: 4
ide0: local-zfs:vm-1601-disk-1,size=100G
ide2: none,media=cdrom
memory: 8192
name: 3394w2k8
net0: virtio=86:51:AA:A3:84:07,bridge=vmbr0
numa: 0
ostype: w2k8
scsihw: virtio-scsi-pci
smbios1: uuid=25c48319-504d-4bd8-8e8d-34fd679d04da
sockets: 2

[PENDING]
memory: 12000
root@vm4:/etc/pve/qemu-server#

==========================
For this one I just increased the RAM; the change is still pending.
==========================



the content of /etc/pve/storage.cfg
==> Host4
root@vm4:/etc/pve# cat storage.cfg
dir: local
        path /var/lib/vz
        content vztmpl,backup,iso
        maxfiles 30
        shared 1

zfspool: local-zfs
        pool rpool/data
        content images,rootdir
        sparse 1
root@vm4:/etc/pve#
=============
Host3
root@vm3:/etc/pve/qemu-server# cat /etc/pve/storage.cfg
dir: local
        path /var/lib/vz
        content vztmpl,backup,iso
        maxfiles 30
        shared 1

zfspool: local-zfs
        pool rpool/data
        content images,rootdir
        sparse 1
root@vm3:/etc/pve/qemu-server#
==================
Host2
root@vm2:/etc/pve/qemu-server# cat /etc/pve/storage.cfg
dir: local
        path /var/lib/vz
        content vztmpl,backup,iso
        maxfiles 30
        shared 1

zfspool: local-zfs
        pool rpool/data
        content images,rootdir
        sparse 1
root@vm2:/etc/pve/qemu-server#
======================================





My guess was that you had marked the ZFS storage 'shared' and now wondered why the disks were not migrated, hence my answer.
HA with local storage only makes (partial) sense if you have enabled replication (which you did not mention), and even then with some caveats (e.g. error recovery, as outlined in the link I sent).

Thanks..
 
OK, I guess in that case replication did not do its job.
Can you post the output of
Code:
pvesr status
on all nodes?
 
OK, I guess in that case replication did not do its job.
Can you post the output of
Code:
pvesr status
on all nodes?
It seems like I have a major problem... though the VMs still seem to work fine.
=======================
HOST2: no response after 5 minutes to
pvesr status

=======================
HOST3
root@vm3:~# pvesr status
JobID   Enabled  Target     LastSync  NextSync  Duration  FailCount  State
3381-0  Yes      local/vm2  -         pending   -         0          OK
3381-1  Yes      local/vm4  -         pending   -         0          OK
3382-0  Yes      local/vm2  -         pending   -         0          OK
3382-1  Yes      local/vm4  -         pending   -         0          OK
3391-0  Yes      local/vm4  -         pending   -         0          OK
3393-0  Yes      local/vm2  -         pending   -         0          OK
================================
HOST4
root@vm4:~# pvesr status

JobID   Enabled  Target     LastSync  NextSync  Duration  FailCount  State
401-0   Yes      local/vm3  -         pending   1.511765  1          zfs error: For the delegated permission list, run: zfs allow|unallow
402-1   Yes      local/vm3  -         pending   -         0          OK
1601-0  Yes      local/vm3  -         pending   -         0          OK
1604-0  No       local/vm2  -         pending   -         0          OK
1604-1  No       local/vm3  -         pending   -         0          OK
3380-0  Yes      local/vm3  -         pending   -         0          OK
3399-0  Yes      local/vm2  -         pending   -         0          OK
3399-1  Yes      local/vm3  -         pending   -         0          OK
6000-0  Yes      local/vm2  -         pending   -         0          OK
6001-0  Yes      local/vm3  -         pending   -         0          OK


=============================
What do I have to do to get HOST2 OK?

============================
root@vm2:~# pvesr status
^C
root@vm2:~# pvecm status

Quorum information
------------------
Date: Mon Nov 5 10:43:34 2018
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000002
Ring ID: 2/184
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 172.21.82.22 (local)
0x00000003 1 172.21.82.23
0x00000001 1 172.21.82.24

root@vm2:~#
====================
Same answer, I think, if requested from HOST3:
===================
root@vm3:~# pvecm status
Quorum information
------------------
Date: Mon Nov 5 10:45:03 2018
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000003
Ring ID: 2/184
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 172.21.82.22
0x00000003 1 172.21.82.23 (local)
0x00000001 1 172.21.82.24
root@vm3:~#
==================================
What is your recommendation?
Should I schedule downtime for my whole setup and upgrade? I was told this could all be because I am running 5.0-23/af4267bf; is that possible?
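
In the meantime, here is a rough checklist of things I plan to look at on HOST2 while pvesr status hangs (just notes, I have not run these yet):
Code:
# hypothetical checklist, not run yet
ps aux | grep -E 'zfs (send|recv)|pvesr'     # is an old replication run or a zfs send/recv stuck?
zpool status rpool                           # any pool errors or a hung scrub/resilver?
journalctl -u pvesr.service --since today    # recent replication runner messages
systemctl list-timers pvesr.timer            # is the replication timer still firing?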
 
