random VM ends up in paused state after "Bulk migrate"

May 17, 2019
Hello,

We are running a Proxmox cluster with 5 hosts. Live migration works fine until we do a "Bulk migrate" to another host: a random VM ends up in a paused state, sometimes more than one. The task says "OK" at the end and I'm able to resume the paused VM. We are running Virtual Environment 5.4-5. Is this a known issue? Any log files I can provide?

Thanks & Cheers,
Daniel
 
Hi,

Any log files I can provide?
Try another bulk migrate and check the task logs in /var/log/pve/tasks (search for the file of the migration task inside the directories).
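For example, something along these lines should turn up the relevant files (a sketch, assuming the default task log layout under /var/log/pve/tasks):
Code:
# task log files belonging to migration tasks (the UPID filename contains "qmigrate")
find /var/log/pve/tasks -name '*qmigrate*'
# the index/active files record the final status of each task
grep "migration problems" /var/log/pve/tasks/index /var/log/pve/tasks/active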
 
Hi,

currently I'm unable to reproduce this (I did two bulk migrations to test).

Here is the log entry from one of the machines that ended up in a paused state the last time it happened:
Code:
UPID:pm-04:000030BC:07BDA5F4:5CB5EEDE:qmstart:111:root@pam: 5CB5EEDF start failed: command '/usr/bin/kvm -id 111 -name munin-master -chardev 'socket,id=qmp,path=/var/run/qemu-server/111.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/111.pid -daemonize -smbios 'type=1,uuid=d8ff93c1-bc3e-47aa-a6fb-a1b36e458ba3' -smp '4,sockets=1,cores=4,maxcpus=4' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc unix:/var/run/qemu-server/111.vnc,x509,password -cpu kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,enforce -m 8192 -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'vmgenid,guid=b2288d0a-b9e1-4aba-a814-a5833fc59f62' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -device 'VGA,id=vga,bus=pci.0,addr=0x2' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:5ddedba5dc5e' -drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' -device 'pvscsi,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=/dev/vg-cluster01-storage01/vm-111-disk-0,if=none,id=drive-scsi0,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap111i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=AE:5D:9F:54:03:F4,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' -machine 'type=pc-i440fx-2.12' -incoming unix:/run/qemu-server/111.migrate -S' failed: exit code 1

I'll do some more bulk migrations this week to test this.

Cheers,
 
Hi,

Today I ran into this issue again:


Code:
root@pm-05:/var/log/pve/tasks# grep -iR 5CE3FBA5 *
active:UPID:pm-05:0000D83C:0298D730:5CE3FBA5:migrateall::root@pam: 1 5CE40274 OK
active:UPID:pm-05:0000D840:0298D735:5CE3FBA5:qmigrate:105:root@pam: 1 5CE40274 OK
active:UPID:pm-05:0000D83D:0298D734:5CE3FBA5:qmigrate:102:root@pam: 1 5CE40138 OK
active:UPID:pm-05:0000D83E:0298D734:5CE3FBA5:qmigrate:103:root@pam: 1 5CE3FEE3 OK
active:UPID:pm-05:0000D842:0298D736:5CE3FBA5:qmigrate:106:root@pam: 1 5CE3FE85 migration problems
index:UPID:pm-05:0000D842:0298D736:5CE3FBA5:qmigrate:106:root@pam: 5CE3FE85 migration problems
index:UPID:pm-05:0000D83E:0298D734:5CE3FBA5:qmigrate:103:root@pam: 5CE3FEE3 OK
index:UPID:pm-05:0000D83D:0298D734:5CE3FBA5:qmigrate:102:root@pam: 5CE40138 OK
index:UPID:pm-05:0000D83C:0298D730:5CE3FBA5:migrateall::root@pam: 5CE40274 OK
index:UPID:pm-05:0000D840:0298D735:5CE3FBA5:qmigrate:105:root@pam: 5CE40274 OK
root@pm-05:/var/log/pve/tasks#

Code:
root@pm-05:/var/log/pve/tasks/5# cat UPID:pm-05:0000D842:0298D736:5CE3FBA5:qmigrate:106:root@pam:
2019-05-21 15:22:45 starting migration of VM 106 to node 'pm-04' (x.x.x.x)
2019-05-21 15:22:45 copying disk images
2019-05-21 15:22:45 starting VM 106 on remote node 'pm-04'
2019-05-21 15:22:47 start remote tunnel
2019-05-21 15:22:47 ssh tunnel ver 1
2019-05-21 15:22:47 starting online/live migration on unix:/run/qemu-server/106.migrate
2019-05-21 15:22:47 migrate_set_speed: 8589934592
2019-05-21 15:22:47 migrate_set_downtime: 0.1
2019-05-21 15:22:47 set migration_caps
2019-05-21 15:22:47 set cachesize: 4294967296
2019-05-21 15:22:47 start migrate command to unix:/run/qemu-server/106.migrate
2019-05-21 15:22:48 migration status: active (transferred 38526092, remaining 33482719232), total 34377375744)
2019-05-21 15:22:48 migration xbzrle cachesize: 4294967296 transferred 0 pages 0 cachemiss 0 overflow 0
2019-05-21 15:22:49 migration status: active (transferred 77836005, remaining 32512753664), total 34377375744)
[...]
2019-05-21 15:34:57 migration xbzrle cachesize: 4294967296 transferred 0 pages 0 cachemiss 56483 overflow 0
2019-05-21 15:34:57 migration status: active (transferred 28547736624, remaining 6930432), total 34377375744)
2019-05-21 15:34:57 migration xbzrle cachesize: 4294967296 transferred 0 pages 0 cachemiss 57461 overflow 0
2019-05-21 15:34:57 migration speed: 44.89 MB/s - downtime 18 ms
2019-05-21 15:34:57 migration status: completed
2019-05-21 15:34:57 ERROR: tunnel replied 'ERR: resume failed - unable to find configuration file for VM 106 - no such machine' to command 'resume 106'
2019-05-21 15:35:00 ERROR: migration finished with problems (duration 00:12:15)
TASK ERROR: migration problems

Cheers
 
A bit more on the setup; maybe someone has an idea how to debug further:
* 5 Dell servers with Proxmox VE 5.4-5, connected to a flash storage via Fibre Channel (multipath), using LVM thick
* NIC: 1Gbit/s (checked TX/RX error counts as sketched below; both are 0)

Any more logs to have a look at? We are considering upgrading to 10Gbit/s NICs in the future and also a paid subscription for Proxmox. But I somehow can't imagine the paid subscription making my problem go away.
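The error counters were checked roughly like this (the interface name is just an example):
Code:
# per-interface RX/TX statistics, including error and drop counters
ip -s link show eno1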
 
All of our front ends have a dedicated 10G backend for the cluster and we are on paid subscriptions as well. All of them are HP DL380 Gen9s. I hit this issue on almost every bulk live migrate. Our cluster is in production though, so I can't do much testing. I have been dealing with it by hitting resume on the VMs as they end up in the paused state. During my next round of maintenance I will try to capture some logs as well.
 
this is quite weird.

18ms is a bit too slow,

* NIC: 1Gbit/s (checked TX/RX error count and both are 0)

it's likely that it has to do with your network (1Gbit/s).

All of our front ends have a dedicated 10G backend for the cluster

In your case, the network might be less likely. I'll be looking forward to your logs.

How big are the VMs you are migrating?
 
They range a bit, but most of them are around the 10G size range. We do use KSM heavily and the hosts almost always have 150G in KSM sharing.
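That figure can be read from sysfs; a rough sketch, assuming the usual 4 KiB page size:
Code:
# approximate memory saved by KSM in GiB (pages_sharing * page size)
echo "$(( $(cat /sys/kernel/mm/ksm/pages_sharing) * 4096 / 1073741824 )) GiB"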

We just moved to Proxmox 5.4 on these hosts. On the next big round of updates, I will make sure to catch some logs, as this has been a thorn in my side for quite some time.
 
it's likely that it has to do with your network (1Gbit/s).

I hope to get 10GBit soonish, but since adamb has the same problem I doubt it will help. Also, I've seen VMs migrate with a downtime a lot higher than 18 ms before and they were fine. My VMs are 50GB - 1TB in size. The failed VM is 510GB, 32GB RAM.
 
I think the following patch should fix the issue - although I was not able to reproduce this just with bulk migration alone:
https://pve.proxmox.com/pipermail/pve-devel/2019-May/037084.html

When the issue occurs, do you also see high CPU load and/or network load on the corosync/cluster links?
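Something along these lines, run on the involved nodes while a bulk migration is active, should give a rough picture (just a sketch, the exact tooling does not matter much):
Code:
# quick snapshot of CPU load and cluster health during a bulk migrate
uptime                  # load averages
corosync-cfgtool -s     # corosync ring/link status
pvecm status            # cluster quorum and membership overview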
 
Nice, sounds promising. While my cluster is kind of in production, I can still run tests on it. Looking forward to doing so!

I'm unsure about the CPU & network load when this happens, but I'll have a look at it during the next bulk migrate.
 
A package which includes a fix for this issue is now available through our pvetest repository [0]; the package in question is "qemu-server" in version "5.0-52".

If you want to test it, you can either (temporarily) add the pvetest repository or, if you're on the latest 5.4 (Enterprise or no-subscription repository), install it manually:
Code:
# cd /tmp; wget http://download.proxmox.com/debian/pve/dists/stretch/pvetest/binary-amd64/qemu-server_5.0-52_amd64.deb
# sha512sum qemu-server_5.0-52_amd64.deb
bd736b17b28981355124be95efb2f5592f4446acb23c9a8e7a306c0aad61a6923fbe8b78af902e133903bc6eb405e20341123ef6005afc954886c6a2558911ea  qemu-server_5.0-52_amd64.deb
# apt install ./qemu-server_5.0-52_amd64.deb

[0]: https://pve.proxmox.com/wiki/Package_Repositories#_proxmox_ve_test_repository
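After installing, the version can be double-checked, e.g.:
Code:
# confirm the installed qemu-server version
pveversion -v | grep qemu-server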
 
I hit it this morning (twice) while doing some maintenance on one of my nodes. I didn't notice any high CPU, but I am sure the live migration was chewing up 50% of the 10G link.

2019-05-29 05:41:27 migration status: active (transferred 11611027342, remaining 32604160), total 32230154240)
2019-05-29 05:41:27 migration xbzrle cachesize: 4294967296 transferred 0 pages 0 cachemiss 1238 overflow 0
2019-05-29 05:41:27 migration speed: 512.00 MB/s - downtime 32 ms
2019-05-29 05:41:27 migration status: completed
2019-05-29 05:41:27 ERROR: tunnel replied 'ERR: resume failed - unable to find configuration file for VM 127 - no such machine' to command 'resume 127'
2019-05-29 05:41:30 ERROR: migration finished with problems (duration 00:01:07)
TASK ERROR: migration problems

2019-05-29 05:09:07 migration status: active (transferred 9645033842, remaining 33140736), total 10503659520)
2019-05-29 05:09:07 migration xbzrle cachesize: 1073741824 transferred 0 pages 0 cachemiss 41483 overflow 0
2019-05-29 05:09:07 migration speed: 384.62 MB/s - downtime 23 ms
2019-05-29 05:09:07 migration status: completed
2019-05-29 05:09:07 ERROR: tunnel replied 'ERR: resume failed - unable to find configuration file for VM 111 - no such machine' to command 'resume 111'
2019-05-29 05:09:10 ERROR: migration finished with problems (duration 00:00:32)
TASK ERROR: migration problems

When I manually hit resume, all is well. It's almost like the live migration is hitting resume before the VM config file is moved into place.
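For reference, the manual recovery is just a resume on the target node (GUI or CLI); a minimal sketch, with 127 as an example VMID:
Code:
# on the target node: check whether the VM config is visible there yet, then resume
ls -l /etc/pve/nodes/$(hostname)/qemu-server/127.conf
qm resume 127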
 
is this with the patched qemu-server version or without?
 
I've upgraded two of my servers to qemu-server 5.0-52 and did 4 bulk migrations between those hosts without running into the issue.
I've tested bulk migration on two servers with qemu-server 5.0-51 and hit the bug on the first bulk migration.

So far I would say this looks really good with qemu-server 5.0-52.
 
thanks to both of you for reporting back! I am fairly certain that the issue is resolved with the patch :)
 
Today I had this issue on multiple VMs. It seems to happen more often when using parallel jobs on bulk migrate: with 1 parallel job, only 1 VM failed and needed a resume; with 4 parallel jobs, 4 VMs had to be resumed.

Code:
root@pve02:~# qm config 108
boot: order=scsi0;net0
cores: 2
memory: 4000
name: np-ubuntu-01
net0: virtio=E6:BE:1C:A4:92:AD,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: vm_nvme:vm-108-disk-0,size=64G
scsihw: virtio-scsi-pci
smbios1: uuid=e65f56b8-49c9-4932-ba1e-c953e90be155
sockets: 1
vmgenid: 6c4d5c46-1458-487d-a69a-4e0ae4437e6e

Code:
task started by HA resource agent
2022-03-08 13:17:28 use dedicated network address for sending migration traffic (10.100.100.232)
2022-03-08 13:17:28 starting migration of VM 108 to node 'pve02' (10.100.100.232)
2022-03-08 13:17:28 starting VM 108 on remote node 'pve02'
2022-03-08 13:17:30 [pve02] Dump was interrupted and may be inconsistent.
2022-03-08 13:17:30 start remote tunnel
2022-03-08 13:17:31 ssh tunnel ver 1
2022-03-08 13:17:31 starting online/live migration on unix:/run/qemu-server/108.migrate
2022-03-08 13:17:31 set migration capabilities
2022-03-08 13:17:31 migration downtime limit: 100 ms
2022-03-08 13:17:31 migration cachesize: 512.0 MiB
2022-03-08 13:17:31 set migration parameters
2022-03-08 13:17:31 start migrate command to unix:/run/qemu-server/108.migrate
2022-03-08 13:17:32 migration active, transferred 505.5 MiB of 3.9 GiB VM-state, 563.2 MiB/s
2022-03-08 13:17:33 migration active, transferred 1.1 GiB of 3.9 GiB VM-state, 593.8 MiB/s
2022-03-08 13:17:34 average migration speed: 1.3 GiB/s - downtime 118 ms
2022-03-08 13:17:34 migration status: completed
2022-03-08 13:17:37 ERROR: tunnel replied 'ERR: resume failed - VM 108 qmp command 'query-status' failed - got timeout' to command 'resume 108'
2022-03-08 13:17:57 ERROR: migration finished with problems (duration 00:00:29)
TASK ERROR: migration problems

1 parallel job = 1 VM needed to be resumed
2 parallel jobs = 3 VMs needed to be resumed
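For completeness: as far as I understand, the parallel job count in the GUI corresponds to the maxworkers parameter of the migrateall API call. A hedged sketch (node names are examples, and the parameter names should be verified with "pvesh usage /nodes/pve02/migrateall" first):
Code:
# bulk-migrate all VMs from pve02 to another node with a single worker (parameter names assumed)
pvesh create /nodes/pve02/migrateall --target pve01 --maxworkers 1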
 