VM migration between nodes hanging

criva

Hi, we have some issues with our cluster and I need to migrate some VMs.
There is one node that is giving us quite some trouble.
The migration is now stuck at the "stopping NBD migration server on target" stage.
The VM is running and present only on the target, but it still shows as being in migration?

What is safe to do? Normally, when a migration got stuck at an earlier stage, I deleted the locks on the source and re-ran the migration, but now I am not sure what to do.
Any suggestion is appreciated. (screenshot for reference)
 
Hi,
please share the output of pveversion -v from both the source and the target node, as well as the VM configuration (qm config <ID>). Can you check on the target with ps faxl if there is a qm nbdstop process running?
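For example, on both nodes (107 here is just the VM ID from your screenshot, so adjust it if needed):
Code:
pveversion -v
qm config 107
ps faxl | grep nbdstop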
 
Hi Fiona,
thanks for your reply.
The VM is no longer present on the source; on the target:
TARGET ~ # qm config 107
bootdisk: ide0
cores: 8
ide0: Data2-lvm:vm-107-disk-0,format=raw,size=150G
ide1: none,media=cdrom
lock: migrate
memory: 16384
name: *namevm--STTD*
net0: e1000=42:64:AA:00:6F:EB,bridge=vmbr888
numa: 0
ostype: win10
scsihw: virtio-scsi-pci
smbios1: uuid=c8bce94b-9051-484a-b75b-ed284519ecbd
sockets: 1
tags: TARGET
vmgenid: 213e78d1-2c0f-49d6-ad6a-d329c0ec1d39

The pveversion -v outputs from the source and the target are attached.

This machine was first migrated yesterday from the target to the source; essentially I need to put it back.
 

Can you check on the target with ps faxl if there is a qm nbdstop process running?
If yes, you could try to send an interrupt to that process. And you can run qm nbdstop 107 yourself.

If not, or if it doesn't help, you should be able to unlock the VM on the target and resume it, as the migration of the state has already finished. Stopping NBD etc. happens afterwards as clean-up.
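Roughly, assuming VM 107 as above (the unlock/resume part only if no qm nbdstop process is running or if stopping it does not help):
Code:
# on the target node
qm nbdstop 107
# if the lock is still there afterwards:
qm unlock 107
qm resume 107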
 
Hi Fiona,
I did manage to unlock the VM on the target, however on the source I still have this:

1 0 1079 1 20 0 5308 1920 do_epo Ss ? 0:05 /usr/sbin/qmeventd /var/run/qmeventd.sock
4 0 2759790 2759158 20 0 347640 139216 do_sel S pts/0 0:15 | \_ /usr/bin/perl /usr/sbin/qm migrate 107 TARGET --online --with-local-disks
1 0 2759791 2759790 20 0 354904 117552 do_sel S+ pts/0 0:07 | \_ task UPID:inia:002A1C6F:2542B1C4:66B52CCF:qmigrate:107:root@pam:
0 0 2871857 2871729 20 0 6332 2176 pipe_r S+ pts/1 0:00 \_ grep 107

Is it safe to just kill this process?
thanks.
Clara
 
It should be safe, but I'd try sending it an interrupt or terminate signal first, so it has the chance to do other cleanups. Afterwards, it might be good to check whether there are any local disks from the VM left over on the source node.
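For example, using the PID of the qm migrate process from your ps output (2759790 in this case; please double-check it before sending any signal):
Code:
kill -TERM 2759790    # or: kill -INT 2759790
# afterwards, look for left-over local disks of VM 107 on the source
lvs | grep vm-107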
 
We found another machine with a migration hanging on one of the two nodes; the VM was running properly, but the hypervisor was showing all sorts of question marks. We stopped one of the VMs (215), but now it does not want to start. At first it seemed to be a lock issue, but now I get the error "unable to read tail (got 0 bytes instead)".

a /var/lock/qemu-server # qm showcmd 215 --pretty
/usr/bin/kvm \
-id 215 \
-name 'glpi--SV36-127--web-helpdesk,debug-threads=on' \
-no-shutdown \
-chardev 'socket,id=qmp,path=/var/run/qemu-server/215.qmp,server=on,wait=off' \
-mon 'chardev=qmp,mode=control' \
-chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' \
-mon 'chardev=qmp-event,mode=control' \
-pidfile /var/run/qemu-server/215.pid \
-daemonize \
-smbios 'type=1,uuid=b350704f-e282-4461-83e3-7fe86f6ab037' \
-smp '4,sockets=1,cores=4,maxcpus=4' \
-nodefaults \
-boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' \
-vnc 'unix:/var/run/qemu-server/215.vnc,password=on' \
-cpu kvm64,enforce,+kvm_pv_eoi,+kvm_pv_unhalt,+lahf_lm,+sep \
-m 8192 \
-device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' \
-device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' \
-device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' \
-device 'usb-tablet,id=tablet,bus=uhci.0,port=1' \
-device 'VGA,id=vga,bus=pci.0,addr=0x2' \
-chardev 'socket,path=/var/run/qemu-server/215.qga,server=on,wait=off,id=qga0' \
-device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' \
-device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' \
-device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3,free-page-reporting=on' \
-iscsi 'initiator-name=iqn.1993-08.org.debian:01:a27619c4c6c1' \
-drive 'if=none,id=drive-ide2,media=cdrom,aio=io_uring' \
-device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' \
-drive 'file=/dev/data2/vm-215-disk-0,if=none,id=drive-virtio0,format=raw,cache=none,aio=io_uring,detect-zeroes=on' \
-device 'virtio-blk-pci,drive=drive-virtio0,id=virtio0,bus=pci.0,addr=0xa,bootindex=100' \
-netdev 'type=tap,id=net0,ifname=tap215i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' \
-device 'virtio-net-pci,mac=7E:49:E5:1D:50:DE,netdev=net0,bus=pci.0,addr=0x12,id=net0,rx_queue_size=1024,tx_queue_size=256,bootindex=300' \
-machine 'type=pc+pve0'
 
Please share the output of qm start 215 as well as the system log/journal from around the time the issue happens.
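Something along these lines (adjust the time window to when you tried to start the VM):
Code:
qm start 215
journalctl --since "1 hour ago" > journal.txt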
 
Dear Fiona,
this is the event for VM 215; I do not get any specific CLI output.
[screenshot of the task event attached]
I have also attached the journal for this afternoon; there are a few errors corresponding to our attempts to start and stop the VM.

thx
 

Code:
Aug 13 15:16:46 inia qm[3557979]: <root@pam> end task UPID:inia:00364A5C:27AD5FB0:66BB5CA1:qmstart:215:root@pam: command '/sbin/lvs --separator : --noheadings --units b --unbuffered --nosuffix --config 'report/time_format="%s"' --options vg_name,lv_name,lv_size,lv_attr,pool_lv,data_percent,metadata_percent,snap_percent,uuid,tags,metadata_size,time' failed: received interrupt
What if you manually run lvs? Maybe there's an issue with the disks/LVM.

If you double-click on the start task, you should see the task log. Does that contain anything more?
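If the GUI is being difficult, the task log can presumably also be fetched on the CLI via the API, e.g. with the UPID from the journal line above (assuming the node is called inia):
Code:
pvesh get /nodes/inia/tasks/UPID:inia:00364A5C:27AD5FB0:66BB5CA1:qmstart:215:root@pam:/log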
 
lvs indeed gives an error on data1:
inia ~ # lvs
^C Interrupted...
Giving up waiting for lock.
Can't get lock for data1.
Cannot process volume group data1
Interrupted...
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
vm-105-disk-0 data2 Vwi-aotz-- 40.00g vm-thin2 100.00
vm-107-disk-0 data2 Vwi-a-tz-- 150.00g vm-thin2 100.00
vm-122-disk-0 data2 Vwi-a-tz-- 4.00m vm-thin2 25.00
vm-122-disk-1 data2 Vwi-a-tz-- 40.00g vm-thin2 100.00
vm-173-disk-0 data2 Vwi-a-tz-- 50.00g vm-thin2 60.09
vm-210-disk-0 data2 Vwi-a-tz-- 60.00g vm-thin2 100.00
vm-210-disk-1 data2 Vwi-a-tz-- 119.00g vm-thin2 100.00
vm-215-disk-0 data2 Vwi-a-tz-- 50.00g vm-thin2 100.00
vm-234-disk-0 data2 Vwi-a-tz-- 100.00g vm-thin2 90.00
vm-236-disk-0 data2 Vwi-a-tz-- 60.00g vm-thin2 100.00
vm-236-disk-1 data2 Vwi-a-tz-- 10.00g vm-thin2 2.51
vm-thin2 data2 twi-aotz-- 2.00t 31.22 23.39
vms2 data2 -wi-ao---- 2.00t
From the GUI I get no information on the task; it is totally messed up:
[screenshot attached]
 
Check your boot log for any errors about the volume group data1 and check the health of the physical disk that the volume group is on, e.g. with smartctl -a /dev/XYZ (check the output of pvs and vgs to see which). You can also try running lvs -vvv to get more verbose output.
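Roughly like this (/dev/sdb is only an example here; use whichever device pvs reports for data1):
Code:
journalctl -b | grep -i data1
pvs
vgs
smartctl -a /dev/sdb
lvs -vvv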
 
Hi Fiona,
Thanks for the answer. Here is the output of smartctl -a /dev/sdb1; I think there is no issue with the physical disk.
[screenshot of smartctl output attached]

I have also checked the services: lvm2-lvmpolld.service is up and running, but lvm2-monitor.service fails to monitor the data1 VG. Here is the output of systemctl restart lvm2-monitor.service:
[screenshot attached]
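The service checks were along these lines:
Code:
systemctl status lvm2-lvmpolld.service
systemctl status lvm2-monitor.service
journalctl -u lvm2-monitor.service -b
systemctl restart lvm2-monitor.service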
 
Hi Fiona,
I think I solved the issue: there was a stale lock file (V_data1) in /var/lock/lvm/. After deleting this file I could restart lvm2-monitor.service. The storage is now online and I can list the VGs and LVs without any issue. Thanks for your help, I think we can close the ticket.
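For anyone running into the same thing, the fix was essentially:
Code:
# stale LVM lock file left behind by the interrupted operation
rm /var/lock/lvm/V_data1
systemctl restart lvm2-monitor.service
# verify that LVM responds again
vgs
lvs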
 
Hi Fiona, the storage is back online, but I ran into another issue: when I try to access a VM on the node I get "596: connection timeout". Maybe you have an idea? Thanks.
[screenshot attached]
 
Can you see anything in the system logs/journal? Is the guest on the same node you are accessing the web interface from?
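For example, something like this around the time of the timeout (adjust the window as needed):
Code:
journalctl -u pveproxy -u pvedaemon --since "1 hour ago"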
 
