VM migration between nodes hanging

criva
Hi, we have some issues with our cluster and I need to migrate some VMs.
One node in particular is giving us trouble.
Right now a migration is stuck at the "stopping NBD migration server on target" stage.
The VM is running and present only on the target, but it still shows as being in migration.

What is safe to do? Normally, when a migration got stuck at an earlier stage, I deleted the locks on the source and re-ran the migration, but now I am not sure what to do.
Any suggestion is appreciated (screenshot attached for reference).
 
Hi,
please share the output of pveversion -v from both the source and the target node, as well as the VM configuration (qm config <ID>). Can you check on the target with ps faxl whether there is a qm nbdstop process running?
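For reference, a minimal sketch of those checks, assuming the VM ID is 107 as it turns out to be below:
Code:
pveversion -v
qm config 107
ps faxl | grep nbdstop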
 
Hi Fiona,
thanks for your reply.
On the source the VM is no longer present; on the target:
TARGET ~ # qm config 107
bootdisk: ide0
cores: 8
ide0: Data2-lvm:vm-107-disk-0,format=raw,size=150G
ide1: none,media=cdrom
lock: migrate
memory: 16384
name: *namevm--STTD*
net0: e1000=42:64:AA:00:6F:EB,bridge=vmbr888
numa: 0
ostype: win10
scsihw: virtio-scsi-pci
smbios1: uuid=c8bce94b-9051-484a-b75b-ed284519ecbd
sockets: 1
tags: TARGET
vmgenid: 213e78d1-2c0f-49d6-ad6a-d329c0ec1d39

The pveversion outputs from the source and the target are attached.

This machine was first migrated yesterday from the target to the source; essentially, I need to put it back.
 

Attachments

  • pvever_SOURCE.TXT
  • pvever_TARGET.TXT
Can you check on the target with ps faxl whether there is a qm nbdstop process running?
If yes, you could try to send an interrupt to that process. And you can run qm nbdstop 107 yourself.

If no, or if it doesn't help, you should be able to unlock the VM on the target and resume it, as the migration of the state has already finished. Stopping NBD etc. happens afterwards as cleanup.
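A minimal sketch of those steps, assuming VM ID 107 and that the PID of a hanging qm nbdstop process was found with the ps command above:
Code:
# on the target node
kill -INT <nbdstop-PID>   # interrupt a hanging 'qm nbdstop' process, if any
qm nbdstop 107            # or stop the NBD server manually
# if that does not help:
qm unlock 107             # clear the 'migrate' lock
qm resume 107             # resume the already-migrated VM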
 
Hi Fiona,
I did manage to unlock the VM on the target, however on the source I still have this:

1 0 1079 1 20 0 5308 1920 do_epo Ss ? 0:05 /usr/sbin/qmeventd /var/run/qmeventd.sock
4 0 2759790 2759158 20 0 347640 139216 do_sel S pts/0 0:15 | \_ /usr/bin/perl /usr/sbin/qm migrate 107 TARGET --online --with-local-disks
1 0 2759791 2759790 20 0 354904 117552 do_sel S+ pts/0 0:07 | \_ task UPID:inia:002A1C6F:2542B1C4:66B52CCF:qmigrate:107:root@pam:
0 0 2871857 2871729 20 0 6332 2176 pipe_r S+ pts/1 0:00 \_ grep 107

Is it safe to just kill this process?
thanks.
Clara
 
It should be safe, but I'd try to send it an interrupt or terminate signal first, so it has the chance to do other cleanups. Afterwards, it might be good to check if there are local disks from the VM left over on the source node.
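For example, using the PID of the qm migrate process from the ps output above (2759790 here); the storage name is only taken from the VM config in this thread, so adjust it to whatever local storage the source node actually uses:
Code:
# on the source node: ask the stuck migration task to clean up and exit
kill -INT 2759790    # fall back to 'kill -TERM 2759790' if it ignores the interrupt
# afterwards, look for leftover local disks of VM 107
lvs | grep vm-107
pvesm list Data2-lvm | grep 107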
 
We found another machine with a migration hanging on one of the two nodes; the VM was running properly, but the hypervisor was showing all sorts of question marks. We stopped one of the VMs (215), but now it does not want to start. At first it seemed to be a lock issue, but now I get the error: unable to read tail (got 0 bytes instead)

a /var/lock/qemu-server # qm showcmd 215 --pretty
/usr/bin/kvm \
-id 215 \
-name 'glpi--SV36-127--web-helpdesk,debug-threads=on' \
-no-shutdown \
-chardev 'socket,id=qmp,path=/var/run/qemu-server/215.qmp,server=on,wait=off' \
-mon 'chardev=qmp,mode=control' \
-chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' \
-mon 'chardev=qmp-event,mode=control' \
-pidfile /var/run/qemu-server/215.pid \
-daemonize \
-smbios 'type=1,uuid=b350704f-e282-4461-83e3-7fe86f6ab037' \
-smp '4,sockets=1,cores=4,maxcpus=4' \
-nodefaults \
-boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' \
-vnc 'unix:/var/run/qemu-server/215.vnc,password=on' \
-cpu kvm64,enforce,+kvm_pv_eoi,+kvm_pv_unhalt,+lahf_lm,+sep \
-m 8192 \
-device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' \
-device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' \
-device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' \
-device 'usb-tablet,id=tablet,bus=uhci.0,port=1' \
-device 'VGA,id=vga,bus=pci.0,addr=0x2' \
-chardev 'socket,path=/var/run/qemu-server/215.qga,server=on,wait=off,id=qga0' \
-device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' \
-device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' \
-device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3,free-page-reporting=on' \
-iscsi 'initiator-name=iqn.1993-08.org.debian:01:a27619c4c6c1' \
-drive 'if=none,id=drive-ide2,media=cdrom,aio=io_uring' \
-device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' \
-drive 'file=/dev/data2/vm-215-disk-0,if=none,id=drive-virtio0,format=raw,cache=none,aio=io_uring,detect-zeroes=on' \
-device 'virtio-blk-pci,drive=drive-virtio0,id=virtio0,bus=pci.0,addr=0xa,bootindex=100' \
-netdev 'type=tap,id=net0,ifname=tap215i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' \
-device 'virtio-net-pci,mac=7E:49:E5:1D:50:DE,netdev=net0,bus=pci.0,addr=0x12,id=net0,rx_queue_size=1024,tx_queue_size=256,bootindex=300' \
-machine 'type=pc+pve0'
 
Please share the output of qm start 215 as well as the system log/journal from around the time the issue happens.
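For example, something along these lines (the time window is only illustrative):
Code:
qm start 215
journalctl --since "1 hour ago" > journal_excerpt.txt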
 
Dear Fiona,
this is the event for VM 215; I do not get any specific CLI output.
[screenshot of the task event attached]
Also attached is the journal for this afternoon; there are a few errors corresponding to our attempts to start and stop the VM.

thx
 

Attachments

  • journalctl_afternoon_13_08.txt
Code:
Aug 13 15:16:46 inia qm[3557979]: <root@pam> end task UPID:inia:00364A5C:27AD5FB0:66BB5CA1:qmstart:215:root@pam: command '/sbin/lvs --separator : --noheadings --units b --unbuffered --nosuffix --config 'report/time_format="%s"' --options vg_name,lv_name,lv_size,lv_attr,pool_lv,data_percent,metadata_percent,snap_percent,uuid,tags,metadata_size,time' failed: received interrupt
What if you manually run lvs? Maybe there's an issue with the disks/LVM.

If you double click on the start task you should see the task log. Does that contain anything more?
 
lvs indeed gives an error on data1:
inia ~ # lvs
^C Interrupted...
Giving up waiting for lock.
Can't get lock for data1.
Cannot process volume group data1
Interrupted...
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
vm-105-disk-0 data2 Vwi-aotz-- 40.00g vm-thin2 100.00
vm-107-disk-0 data2 Vwi-a-tz-- 150.00g vm-thin2 100.00
vm-122-disk-0 data2 Vwi-a-tz-- 4.00m vm-thin2 25.00
vm-122-disk-1 data2 Vwi-a-tz-- 40.00g vm-thin2 100.00
vm-173-disk-0 data2 Vwi-a-tz-- 50.00g vm-thin2 60.09
vm-210-disk-0 data2 Vwi-a-tz-- 60.00g vm-thin2 100.00
vm-210-disk-1 data2 Vwi-a-tz-- 119.00g vm-thin2 100.00
vm-215-disk-0 data2 Vwi-a-tz-- 50.00g vm-thin2 100.00
vm-234-disk-0 data2 Vwi-a-tz-- 100.00g vm-thin2 90.00
vm-236-disk-0 data2 Vwi-a-tz-- 60.00g vm-thin2 100.00
vm-236-disk-1 data2 Vwi-a-tz-- 10.00g vm-thin2 2.51
vm-thin2 data2 twi-aotz-- 2.00t 31.22 23.39
vms2 data2 -wi-ao---- 2.00t
From the GUI I get no information on the task; it is totally messed up:
[screenshot of the GUI task view attached]
 
Check your boot log for any errors about the volume group data1, and check the health of the physical disk that the volume group is on, e.g. with smartctl -a /dev/XYZ (check the output of pvs and vgs to see which one it is). You can also try running lvs -vvv to get more verbose output.
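For example (the device path below is only a placeholder; replace it with the disk that pvs reports for data1):
Code:
journalctl -b | grep -i data1   # boot log entries mentioning the VG
pvs                             # which physical disk backs data1?
vgs
smartctl -a /dev/sdX            # health of that disk
lvs -vvv                        # verbose LVM output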
 
Hi Fiona,
Thanks for the answer. Here is the output of smartctl -a /dev/sdb1; I think there is no issue with the physical disk.
[screenshot of the smartctl output attached]




I have also checked lvm2-lvmpolld.service (it is up and running), but lvm2-monitor.service is failing to monitor the data1 VG. Here is the output of systemctl restart lvm2-monitor.service:
[screenshot of the systemctl output attached]
 
Hi Fiona,
I think I solved the issue. There was a stale lock file (V_data1) in /var/lock/lvm/. After deleting this file I could restart lvm2-monitor.service. The storage is now online and I can list vgs and lvs without any issue. Thanks for your help, I think we can close the ticket.
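For anyone hitting the same symptom, the fix described above boils down to roughly this (only remove the lock file if you are sure no LVM command is still running against that VG):
Code:
rm /var/lock/lvm/V_data1
systemctl restart lvm2-monitor.service
vgs   # verify the VG is reachable again
lvs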
 
Hi Fiona, the storage is back online, but I ran into another issue: when I try to access a VM on the node I get "596: connection timeout". Maybe you have an idea, thanks.
[screenshot attached]
 
Can you see anything in the system logs/journal? Is the guest on the same node you are accessing the web interface from?
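A quick way to check, assuming a standard PVE setup (pveproxy and pvedaemon are the services behind the web interface and API):
Code:
journalctl -e
journalctl -u pveproxy -u pvedaemon --since "1 hour ago"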
 
