Migration fails (PVE 6.2-12): Query migrate failed

Nov 10, 2020
Hello!

First post, yay! We are in the process of setting up a cluster using this awesome software.

We've set up a 3-node cluster. As the backend for our VM disks we are currently running GlusterFS (a move to Ceph is underway).

While live-migrating a VM we bump into the following error every now and then:

Code:
2020-11-12 20:32:17 starting migration of VM 101 to node 'XXXXXXX' (x.x.x.x)
2020-11-12 20:32:17 starting VM 101 on remote node 'XXXXXXX'
2020-11-12 20:32:18 start remote tunnel
2020-11-12 20:32:19 ssh tunnel ver 1
2020-11-12 20:32:19 starting online/live migration on unix:/run/qemu-server/101.migrate
2020-11-12 20:32:19 set migration_caps
2020-11-12 20:32:19 migration speed limit: 8589934592 B/s
2020-11-12 20:32:19 migration downtime limit: 100 ms
2020-11-12 20:32:19 migration cachesize: 268435456 B
2020-11-12 20:32:19 set migration parameters
2020-11-12 20:32:19 start migrate command to unix:/run/qemu-server/101.migrate
2020-11-12 20:32:20 migration status: active (transferred 116961080, remaining 2042331136), total 2165121024)
2020-11-12 20:32:20 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 1816 overflow 0
2020-11-12 20:32:21 migration status: active (transferred 227347565, remaining 1926090752), total 2165121024)
2020-11-12 20:32:21 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 1816 overflow 0
2020-11-12 20:32:22 migration status: active (transferred 335118928, remaining 1810997248), total 2165121024)
2020-11-12 20:32:22 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 1816 overflow 0
2020-11-12 20:32:23 migration status: active (transferred 449655259, remaining 1690288128), total 2165121024)
2020-11-12 20:32:23 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 1816 overflow 0
2020-11-12 20:32:24 migration status: active (transferred 561048878, remaining 1577897984), total 2165121024)
2020-11-12 20:32:24 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 1816 overflow 0
2020-11-12 20:32:25 migration status: active (transferred 666718984, remaining 1470496768), total 2165121024)
2020-11-12 20:32:25 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 1816 overflow 0
2020-11-12 20:32:26 migration status: active (transferred 781827768, remaining 1352048640), total 2165121024)
2020-11-12 20:32:26 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 1816 overflow 0
2020-11-12 20:32:27 migration status: active (transferred 892257326, remaining 1238573056), total 2165121024)
2020-11-12 20:32:27 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 1816 overflow 0
2020-11-12 20:32:28 migration status: active (transferred 997598483, remaining 1131782144), total 2165121024)
2020-11-12 20:32:28 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 1816 overflow 0
2020-11-12 20:32:29 migration status: active (transferred 1113164386, remaining 1012162560), total 2165121024)
2020-11-12 20:32:29 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 1816 overflow 0
2020-11-12 20:32:30 migration status: active (transferred 1223998457, remaining 899096576), total 2165121024)
2020-11-12 20:32:30 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 1816 overflow 0
2020-11-12 20:32:31 migration status: active (transferred 1329081638, remaining 792301568), total 2165121024)
2020-11-12 20:32:31 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 1816 overflow 0
2020-11-12 20:32:32 migration status: active (transferred 1442324470, remaining 673230848), total 2165121024)
2020-11-12 20:32:32 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 1816 overflow 0
2020-11-12 20:32:33 migration status: active (transferred 1552935405, remaining 559210496), total 2165121024)
2020-11-12 20:32:33 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 1816 overflow 0
2020-11-12 20:32:34 migration status: active (transferred 1659646306, remaining 442191872), total 2165121024)
2020-11-12 20:32:34 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 1816 overflow 0
2020-11-12 20:32:35 migration status: active (transferred 1775577033, remaining 324268032), total 2165121024)
2020-11-12 20:32:35 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 1816 overflow 0
2020-11-12 20:32:36 migration status: active (transferred 1888253567, remaining 209465344), total 2165121024)
2020-11-12 20:32:36 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 1816 overflow 0
2020-11-12 20:32:37 migration status: active (transferred 2002826716, remaining 83193856), total 2165121024)
2020-11-12 20:32:37 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 1816 overflow 0
2020-11-12 20:32:37 migration status: active (transferred 2014116080, remaining 70406144), total 2165121024)
2020-11-12 20:32:37 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 1816 overflow 0
2020-11-12 20:32:37 migration status: active (transferred 2025801264, remaining 58253312), total 2165121024)
2020-11-12 20:32:37 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 1816 overflow 0
2020-11-12 20:32:37 migration status: active (transferred 2037826621, remaining 45969408), total 2165121024)
2020-11-12 20:32:37 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 1816 overflow 0
2020-11-12 20:32:37 migration status: active (transferred 2049654365, remaining 34164736), total 2165121024)
2020-11-12 20:32:37 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 1816 overflow 0
2020-11-12 20:32:37 migration status: active (transferred 2061482109, remaining 22360064), total 2165121024)
2020-11-12 20:32:37 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 1816 overflow 0
2020-11-12 20:32:37 migration status: active (transferred 2073500001, remaining 13221888), total 2165121024)
2020-11-12 20:32:37 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 3482 overflow 0
query migrate failed: VM 101 qmp command 'query-migrate' failed - client closed connection

2020-11-12 20:32:38 query migrate failed: VM 101 qmp command 'query-migrate' failed - client closed connection
query migrate failed: VM 101 not running

2020-11-12 20:32:39 query migrate failed: VM 101 not running
query migrate failed: VM 101 not running

2020-11-12 20:32:40 query migrate failed: VM 101 not running
query migrate failed: VM 101 not running

2020-11-12 20:32:41 query migrate failed: VM 101 not running
query migrate failed: VM 101 not running

2020-11-12 20:32:42 query migrate failed: VM 101 not running
query migrate failed: VM 101 not running

2020-11-12 20:32:43 query migrate failed: VM 101 not running
2020-11-12 20:32:43 ERROR: online migrate failure - too many query migrate failures - aborting
2020-11-12 20:32:43 aborting phase 2 - cleanup resources
2020-11-12 20:32:43 migrate_cancel
2020-11-12 20:32:43 migrate_cancel error: VM 101 not running
2020-11-12 20:32:45 ERROR: migration finished with problems (duration 00:00:28)
TASK ERROR: migration problems
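Incidentally, consecutive one-second samples from the log above suggest the RAM transfer is running at roughly wire speed for 1GbE. A quick sketch (Python; the two sample lines are copied verbatim from the task log):

```python
import re

# Two consecutive one-second samples from the migration log above.
log = """\
2020-11-12 20:32:20 migration status: active (transferred 116961080, remaining 2042331136), total 2165121024)
2020-11-12 20:32:21 migration status: active (transferred 227347565, remaining 1926090752), total 2165121024)
"""

# Pull the cumulative "transferred" byte counters out of each line.
transferred = [int(m) for m in re.findall(r"transferred (\d+)", log)]

# Delta over the 1 s sampling interval, in MB/s.
rate_mb_s = (transferred[1] - transferred[0]) / 1e6
print(f"{rate_mb_s:.0f} MB/s")  # prints "110 MB/s"
```

About 110 MB/s is close to the practical ceiling of a single 1GbE link (~117 MB/s after overhead), so the migration stream alone is saturating the shared NIC.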

And the above error causes the VM to go offline as well.

The servers are running everything over a single 1GbE NIC at the moment. We are currently waiting for a batch of new 2x10GbE NICs for the servers, and then we will separate all traffic. Could it have something to do with that? Then again, the error always happens after the transfer has completed.
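For when the new NICs arrive: migration traffic can be pinned to a dedicated network cluster-wide via /etc/pve/datacenter.cfg. A sketch, where the subnet is only a placeholder for the future migration network:

```
# /etc/pve/datacenter.cfg
# 10.10.10.0/24 is a placeholder for the dedicated migration subnet;
# "secure" keeps the migration tunneled over SSH (the default type).
migration: secure,network=10.10.10.0/24
```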

Sometimes it just says something like this (but the VM stays online when this happens):
Code:
2020-11-12 20:14:21 starting migration of VM 101 to node 'XXXXXXXXX' (x.x.x.x)
2020-11-12 20:14:22 starting VM 101 on remote node 'XXXXXXXXXX'
2020-11-12 20:14:23 start remote tunnel
2020-11-12 20:14:24 ssh tunnel ver 1
2020-11-12 20:14:24 starting online/live migration on unix:/run/qemu-server/101.migrate
2020-11-12 20:14:24 set migration_caps
2020-11-12 20:14:24 migration speed limit: 8589934592 B/s
2020-11-12 20:14:24 migration downtime limit: 100 ms
2020-11-12 20:14:24 migration cachesize: 268435456 B
2020-11-12 20:14:24 set migration parameters
2020-11-12 20:14:24 start migrate command to unix:/run/qemu-server/101.migrate
2020-11-12 20:14:25 migration status: active (transferred 112043967, remaining 2047471616), total 2165121024)
2020-11-12 20:14:25 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 0 overflow 0
2020-11-12 20:14:26 migration status: active (transferred 217177637, remaining 1936338944), total 2165121024)
2020-11-12 20:14:26 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 0 overflow 0
2020-11-12 20:14:27 migration status: active (transferred 329286990, remaining 1816891392), total 2165121024)
2020-11-12 20:14:27 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 0 overflow 0
2020-11-12 20:14:28 migration status: active (transferred 442813862, remaining 1697144832), total 2165121024)
2020-11-12 20:14:28 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 0 overflow 0
2020-11-12 20:14:29 migration status: active (transferred 547443685, remaining 1591676928), total 2165121024)
2020-11-12 20:14:29 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 0 overflow 0
2020-11-12 20:14:30 migration status: active (transferred 663283881, remaining 1473953792), total 2165121024)
2020-11-12 20:14:30 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 0 overflow 0
2020-11-12 20:14:31 migration status: active (transferred 775638773, remaining 1358303232), total 2165121024)
2020-11-12 20:14:31 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 0 overflow 0
2020-11-12 20:14:32 migration status: active (transferred 885616774, remaining 1245331456), total 2165121024)
2020-11-12 20:14:32 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 0 overflow 0
2020-11-12 20:14:33 migration status: active (transferred 994590204, remaining 1134813184), total 2165121024)
2020-11-12 20:14:33 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 0 overflow 0
2020-11-12 20:14:34 migration status: active (transferred 1106552588, remaining 1018884096), total 2165121024)
2020-11-12 20:14:34 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 0 overflow 0
2020-11-12 20:14:35 migration status: active (transferred 1211361304, remaining 912138240), total 2165121024)
2020-11-12 20:14:35 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 0 overflow 0
2020-11-12 20:14:36 migration status: active (transferred 1324329033, remaining 797130752), total 2165121024)
2020-11-12 20:14:36 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 0 overflow 0
2020-11-12 20:14:37 migration status: active (transferred 1438043897, remaining 677556224), total 2165121024)
2020-11-12 20:14:37 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 0 overflow 0
2020-11-12 20:14:38 migration status: active (transferred 1543380904, remaining 568930304), total 2165121024)
2020-11-12 20:14:38 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 0 overflow 0
2020-11-12 20:14:39 migration status: active (transferred 1657495205, remaining 444620800), total 2165121024)
2020-11-12 20:14:39 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 0 overflow 0
2020-11-12 20:14:40 migration status: active (transferred 1771698661, remaining 328187904), total 2165121024)
2020-11-12 20:14:40 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 0 overflow 0
2020-11-12 20:14:41 migration status: active (transferred 1873987359, remaining 224030720), total 2165121024)
2020-11-12 20:14:41 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 0 overflow 0
2020-11-12 20:14:42 migration status: active (transferred 1987290428, remaining 99909632), total 2165121024)
2020-11-12 20:14:42 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 0 overflow 0
2020-11-12 20:14:42 migration status: active (transferred 1998258528, remaining 87965696), total 2165121024)
2020-11-12 20:14:42 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 0 overflow 0
2020-11-12 20:14:42 migration status: active (transferred 2010311056, remaining 74498048), total 2165121024)
2020-11-12 20:14:42 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 0 overflow 0
2020-11-12 20:14:42 migration status: active (transferred 2022132059, remaining 62033920), total 2165121024)
2020-11-12 20:14:42 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 0 overflow 0
2020-11-12 20:14:42 migration status: active (transferred 2034128796, remaining 49729536), total 2165121024)
2020-11-12 20:14:42 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 0 overflow 0
2020-11-12 20:14:42 migration status: active (transferred 2045931916, remaining 37949440), total 2165121024)
2020-11-12 20:14:42 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 0 overflow 0
2020-11-12 20:14:42 migration status: active (transferred 2057759660, remaining 26144768), total 2165121024)
2020-11-12 20:14:42 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 0 overflow 0
2020-11-12 20:14:43 migration status error: failed
2020-11-12 20:14:43 ERROR: online migrate failure - aborting
2020-11-12 20:14:43 aborting phase 2 - cleanup resources
2020-11-12 20:14:43 migrate_cancel
2020-11-12 20:14:44 ERROR: migration finished with problems (duration 00:00:23)
TASK ERROR: migration problems

Any pointers?
Tell me if you need anything else; I'm pretty new to Proxmox :)

Best regards
Marcus
 
can you post your vm configs (qm config ID) as well as the output of 'pveversion -v' of both nodes?
is it always the same vm? always the same nodes?
 
During that maintenance day we had two nodes fail live migration. It's not always the same VM, nor is it always the same node/server.

Here's the vm config of one that failed:
Code:
bootdisk: scsi0
cores: 2
memory: 2048
name: log-01
net0: e1000=C6:F0:1D:8D:BF:AB,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: gv_vm-storage:101/vm-101-disk-0.qcow2,size=42G
scsihw: pvscsi
smbios1: uuid=9e333b2b-a790-4fe7-bf07-a2075f69c515
sockets: 1
vmgenid: 31be4304-443f-48e8-a735-d1fd1d710445

And here's the config of the other one that failed:
Code:
bootdisk: scsi0
cores: 2
memory: 2048
name: app-01
net0: e1000=26:94:03:2A:D3:7D,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: gv_vm-storage:100/vm-100-disk-0.qcow2,size=32G
scsihw: pvscsi
smbios1: uuid=3cfd0067-957d-40d9-bb9b-d9e5a59fbd88
sockets: 1
vmgenid: a096582c-dbe2-4ea2-a4d0-a45a0d721dd6

Here's the output of pveversion -v for server1:
Code:
proxmox-ve: 6.2-2 (running kernel: 5.4.65-1-pve)
pve-manager: 6.2-12 (running version: 6.2-12/b287dd27)
pve-kernel-5.4: 6.2-7
pve-kernel-helper: 6.2-7
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-9
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 0.9.1-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.3-1
pve-cluster: 6.2-1
pve-container: 3.2-2
pve-docs: 6.2-6
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-1
pve-qemu-kvm: 5.1.0-3
pve-xtermjs: 4.7.0-2
qemu-server: 6.2-15
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.4-pve2

Here's the output of pveversion -v for server2:
Code:
proxmox-ve: 6.2-2 (running kernel: 5.4.65-1-pve)
pve-manager: 6.2-12 (running version: 6.2-12/b287dd27)
pve-kernel-5.4: 6.2-7
pve-kernel-helper: 6.2-7
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-9
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 0.9.1-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.3-1
pve-cluster: 6.2-1
pve-container: 3.2-2
pve-docs: 6.2-6
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-1
pve-qemu-kvm: 5.1.0-3
pve-xtermjs: 4.7.0-2
qemu-server: 6.2-15
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.4-pve2

There's really nothing fancy going on inside the VMs either. Standard Debian 10 Buster with vanilla kernel 4.19...



Marcus
 
Hi,

we have the same situation. Only one VM in the cluster is affected (3-node cluster, local ZFS). All the other machines can migrate without a problem. Every VM has the qemu-guest-agent (all of them are Debian).

syslog
Code:
Dec 01 09:46:53 proxmox- QEMU[23726]: kvm: Bitmap 'repl_scsi1' is currently in use by another operation and cannot be used
Dec 01 09:47:00 proxmox- systemd[1]: Starting Proxmox VE replication runner...
Dec 01 09:47:01 proxmox- pvesr[38394]: trying to acquire lock...
Dec 01 09:47:03 proxmox- pvesr[38394]: 104-0: got unexpected replication job error - can't lock file '/var/lock/pve-manager/pve-migrate-104' - got timeout
Dec 01 09:47:03 proxmox- pvesr[38394]: trying to acquire lock...
Dec 01 09:47:05 proxmox- pvesr[38394]: 104-0: got unexpected replication job error - can't lock file '/var/lock/pve-manager/pve-migrate-104' - got timeout
.
.

Dec 01 09:48:10 proxmox- QEMU[23726]: kvm: block/io.c:1891: bdrv_co_write_req_prepare: Assertion `!(bs->open_flags & BDRV_O_INACTIVE)' failed.
Dec 01 09:48:10 proxmox- pvedaemon[36769]: VM 104 qmp command failed - VM 104 not running
Dec 01 09:48:10 proxmox- pvedaemon[36769]: query migrate failed: VM 104 not running
Dec 01 09:48:11 proxmox- pvesr[38394]: 104-0: got unexpected replication job error - can't lock file '/var/lock/pve-manager/pve-migrate-104' - got timeout
Dec 01 09:48:11 proxmox- pvesr[38394]: trying to acquire lock...
Dec 01 09:48:11 proxmox- pvedaemon[36769]: VM 104 qmp command failed - VM 104 not running
Dec 01 09:48:11 proxmox- pvedaemon[36769]: query migrate failed: VM 104 not running
Dec 01 09:48:12 proxmox- kernel: vmbr1v19: port 4(tap104i0) entered disabled state
Dec 01 09:48:12 proxmox- kernel: vmbr1v19: port 4(tap104i0) entered disabled state
Dec 01 09:48:12 proxmox- systemd[1]: 104.scope: Succeeded.
Dec 01 09:48:12 proxmox- pvedaemon[36769]: VM 104 qmp command failed - VM 104 not running
Dec 01 09:48:12 proxmox- pvedaemon[36769]: query migrate failed: VM 104 not running
.
.
Dec 01 09:48:16 proxmox- pvedaemon[36769]: VM 104 qmp command failed - VM 104 not running
Dec 01 09:48:16 proxmox- pvedaemon[36769]: VM 104 qmp command failed - VM 104 not running
Dec 01 09:48:16 proxmox- pvedaemon[36769]: VM 104 qmp command failed - VM 104 not running
Dec 01 09:48:17 proxmox- pmxcfs[2917]: [status] notice: received log
Dec 01 09:48:17 proxmox- pmxcfs[2917]: [status] notice: received log
Dec 01 09:48:17 proxmox- pvesr[38394]: 104-0: got unexpected replication job error - can't lock file '/var/lock/pve-manager/pve-migrate-104' - got timeout
Dec 01 09:48:17 proxmox- pvesr[38394]: trying to acquire lock...
Dec 01 09:48:18 proxmox- qmeventd[2403]:  OK
Dec 01 09:48:18 proxmox- pvesr[38394]:  OK
Dec 01 09:48:18 proxmox- pvedaemon[36769]: migration problems
Dec 01 09:48:18 proxmox- pvedaemon[9156]: <> end task UPID:proxmox-:00008FA1:00649CBD:5FC602F2:qmigrate:104:@: migration problems
Dec 01 09:48:18 proxmox- qmeventd[2403]: Finished cleanup for 104

Proxmox version (newest in community subscription repository)

Code:
proxmox-ve: 6.3-1 (running kernel: 5.4.73-1-pve)
pve-manager: 6.3-2 (running version: 6.3-2/22f57405)
pve-kernel-5.4: 6.3-1
pve-kernel-helper: 6.3-1
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.73-1-pve: 5.4.73-1
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.4.55-1-pve: 5.4.55-1
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-6
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.3-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.5-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-cluster: 6.2-1
pve-container: 3.3-1
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-7
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-1
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve
 
OK, I think I found the cause of the problem. The migrating machine was Debian 7 with qemu agent package 1:2.1+dfsg-12+deb8u5a~bpo70+1. I can confirm this worked as expected prior to the newest update. After I disabled the qemu agent at the VM level, migration goes as expected (local ZFS). I also tested this with shared storage (a VM with the same version) and it works without issue with the agent enabled.
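For anyone wanting to try the same workaround: the agent option can be toggled per VM from the host shell (VMID 104 below is just an example; the change only takes effect after a full stop/start of the guest, not a reboot from inside it):

```
# Example VMID -- disable the QEMU guest agent option for this VM
qm set 104 --agent 0

# Confirm the current setting
qm config 104 | grep ^agent
```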

BR

Michael
 
Same problem here, but with Debian 10. Unable to reproduce the bug reliably; it happens now and then, with different VMs and different hosts. Not sure if an "fstrim -av" before migration helps or if it's just a "homeopathic" solution.

On source:

Mar 27 16:09:28 QEMU[19778]: kvm: ../block/io.c:1810: bdrv_co_write_req_prepare: Assertion `!(bs->open_flags & BDRV_O_INACTIVE)' failed.
Mar 27 16:09:28 pvedaemon[10521]: VM 107 qmp command failed - VM 107 not running
Mar 27 16:09:28 pvedaemon[10521]: VM 107 qmp command failed - VM 107 not running

On destination:

Mar 27 16:09:28 QEMU[11891]: kvm: Disconnect client, due to: Failed to read request: Unexpected end-of-file before all bytes were read
Mar 27 16:09:28 QEMU[11891]: kvm: Disconnect client, due to: Failed to read request: Unexpected end-of-file before all bytes were read

Both machines have the latest versions of all packages (updated minutes ago). Storage is ZFS with replication on both sides (so the migration is a delta).
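When it happens again, the QEMU assertion and the qmp failures on the source node can be pulled from the journal around the failure timestamp; a sketch (the times are examples matching the excerpt above):

```
# Times are placeholders matching the log excerpt above
journalctl --since "16:09:00" --until "16:10:00" \
  | grep -E "Assertion|qmp command failed|BDRV_O_INACTIVE"
```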
 
I just encountered a similar issue. Proxmox 6.4.1.3, up to date, VM Ubuntu 20.04 with qemu-guest-agent activated. During migration, the VM just stopped. I'll disable all guest agents on the VMs and see if the problem remains.
 
I just encountered a similar issue. Proxmox 6.4.1.3, up to date, VM Ubuntu 20.04 with qemu-guest-agent activated. During migration, the VM just stopped. I'll disable all guest agents on the VMs and see if the problem remains.
Me too, yesterday, but without issuing an fstrim -av before moving that VM (but like I said above, I'm not sure it does the trick); still can't reproduce the failure.
 
We're seeing this as well, though without the overflow during RAM transmission. More info further into this reply.

(The SSD moved successfully as part of live migration ahead of this log excerpt, with VM 427 running fine as always.)

Code:
all 'mirror' jobs are ready
2021-11-12 11:20:25 starting online/live migration on unix:/run/qemu-server/427.migrate
2021-11-12 11:20:25 set migration capabilities
2021-11-12 11:20:25 migration speed limit: 600.0 MiB/s
2021-11-12 11:20:25 migration downtime limit: 100 ms
2021-11-12 11:20:25 migration cachesize: 4.0 GiB
2021-11-12 11:20:25 set migration parameters
2021-11-12 11:20:25 start migrate command to unix:/run/qemu-server/427.migrate
query migrate failed: VM 427 not running

2021-11-12 11:20:26 query migrate failed: VM 427 not running
query migrate failed: VM 427 not running

2021-11-12 11:20:28 query migrate failed: VM 427 not running
query migrate failed: VM 427 not running

2021-11-12 11:20:30 query migrate failed: VM 427 not running
query migrate failed: VM 427 not running

2021-11-12 11:20:32 query migrate failed: VM 427 not running
query migrate failed: VM 427 not running

2021-11-12 11:20:34 query migrate failed: VM 427 not running
query migrate failed: VM 427 not running

2021-11-12 11:20:36 query migrate failed: VM 427 not running
2021-11-12 11:20:36 ERROR: online migrate failure - too many query migrate failures - aborting
2021-11-12 11:20:36 aborting phase 2 - cleanup resources
2021-11-12 11:20:36 migrate_cancel
2021-11-12 11:20:36 migrate_cancel error: VM 427 not running
drive-scsi0: Cancelling block job
2021-11-12 11:20:36 ERROR: VM 427 not running
2021-11-12 11:20:59 ERROR: migration finished with problems

The VM was then powered down (crashed, etc) on the old host node. To clarify, this VM is ordinarily stable.

We were able to replicate the problem with another VM of different SSD/RAM sizes.

For reference:

This problem was discovered following a cluster-of-4 upgrade wave from PVE 6.4 latest to 7.0 latest.

The upgrades went through successfully, but the problem has emerged since completion. Many VMs moved around before, during, and after the upgrade without problems; however, the issue we can replicate occurs between 2 of the 4 nodes in particular, and only in one direction.

Code:
root@node1:~# qm config 427
balloon: 0
bootdisk: scsi0
cores: 5
memory: 24576
name: vm427
net0: e1000=*removed*,bridge=vmbr0,rate=200
numa: 0
ostype: l26
scsi0: local-lvm:vm-427-disk-0,backup=0,format=raw,size=500G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=3dcb69d6-671a-4b23-8ed4-6cfcfc85683d
sockets: 2
vmgenid: 5c7e1dbb-a9e4-4517-bb95-030748de1db1
root@node1:~#

pveversion -v as below, noting that node1 had an extra older kernel that node 3 did not. It has been removed, so the output is now identical on both; the problem was re-verified to exist after removing that extra kernel version (pve-kernel-5.4.34-1-pve) (which autoremove wasn't flagging... odd).

Code:
# pveversion -v
proxmox-ve: 7.0-2 (running kernel: 5.11.22-7-pve)
pve-manager: 7.0-14+1 (running version: 7.0-14+1/08975a4c)
pve-kernel-helper: 7.1-4
pve-kernel-5.11: 7.0-10
pve-kernel-5.4: 6.4-7
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.4.143-1-pve: 5.4.143-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
ceph-fuse: 14.2.21-1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.0-6
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-12
libpve-guest-common-perl: 4.0-2
libpve-http-server-perl: 4.0-3
libpve-storage-perl: 7.0-13
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.0.13-1
proxmox-backup-file-restore: 2.0.13-1
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.3-6
pve-cluster: 7.0-3
pve-container: 4.1-1
pve-docs: 7.0-5
pve-edk2-firmware: 3.20210831-1
pve-firewall: 4.2-5
pve-firmware: 3.3-3
pve-ha-manager: 3.3-1
pve-i18n: 2.5-1
pve-qemu-kvm: 6.1.0-1
pve-xtermjs: 4.12.0-1
qemu-server: 7.0-18
smartmontools: 7.2-pve2
spiceterm: 3.2-2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve3

No VM has the QEMU guest agent installed or configured, so I'd say it's not relevant here.
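For the record, whether any VM on a node has the agent option set can be checked in one go (empty output means none have it):

```
# Each PVE VM config lives in /etc/pve/qemu-server/<vmid>.conf;
# an enabled agent shows up as an "agent: ..." line.
grep -H '^agent' /etc/pve/qemu-server/*.conf
```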
 
