Proxmox 5.2 bulk migrate causes VM lock

helben.cai

When I use bulk migrate, the virtual machines easily end up locked. I also tried a "for ...; do qm migrate ...; done" loop, with the same result; but when I use "for ...; do qm migrate ... && sleep 20; done", it works fine. Is this a bug?
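For reference, here is roughly what the two loops look like (the VM IDs and the target node name below are just placeholders, not my real ones):

# several VMs end up locked when run back to back (placeholder IDs and node):
for vmid in 101 102 103; do
  qm migrate $vmid targetnode --online
done

# works fine when I wait between migrations:
for vmid in 101 102 103; do
  qm migrate $vmid targetnode --online && sleep 20
done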
proxmox-ve: 5.2-2 (running kernel: 4.15.18-7-pve)
pve-manager: 5.2-10 (running version: 5.2-10/6f892b40)
pve-kernel-4.15: 5.2-10
pve-kernel-4.15.18-7-pve: 4.15.18-27
pve-kernel-4.15.17-1-pve: 4.15.17-9
ceph: 12.2.8-pve1
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-40
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-30
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-3
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-20
pve-cluster: 5.0-30
pve-container: 2.0-29
pve-docs: 5.2-8
pve-firewall: 3.0-14
pve-firmware: 2.0-5
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-38
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.11-pve1~bpo1
 
The correct way to do a bulk migration is either via the web interface or

pvenode migrateall

See man pvenode for details and options.
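For example, a serialized bulk migration could look roughly like this (the target node name is a placeholder; check man pvenode that the --maxworkers option is available in your version):

pvenode migrateall targetnode --maxworkers 1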
 
I used the web bulk migrate; after the migration completes successfully, the VM is locked. I don't know what happened. I also tried the "qm migrate" command, and it locks too.
 
Can you post the task log of the migration from the locked VM?
 
No, I meant the task log.

Each migration has a task log in the web interface; when you double-click on it, you should see the migration output.
 
2018-11-27 12:43:22 migration status: completed
2018-11-27 12:43:22 ERROR: tunnel replied 'ERR: resume failed - unable to find configuration file for VM 120 - no such machine' to command 'resume 120'
2018-11-27 12:43:25 ERROR: migration finished with problems (duration 00:00:19)
TASK ERROR: migration problems
 
Can you please post the complete output?
 
task started by HA resource agent
2018-11-27 12:43:06 starting migration of VM 120 to node 'cuvmpssvr04' (10.86.12.87)
2018-11-27 12:43:06 copying disk images
2018-11-27 12:43:06 starting VM 120 on remote node 'cuvmpssvr04'
2018-11-27 12:43:08 start remote tunnel
2018-11-27 12:43:09 ssh tunnel ver 1
2018-11-27 12:43:09 starting online/live migration on unix:/run/qemu-server/120.migrate
2018-11-27 12:43:09 migrate_set_speed: 8589934592
2018-11-27 12:43:09 migrate_set_downtime: 0.1
2018-11-27 12:43:09 set migration_caps
2018-11-27 12:43:09 set cachesize: 536870912
2018-11-27 12:43:09 start migrate command to unix:/run/qemu-server/120.migrate
2018-11-27 12:43:10 migration status: active (transferred 97308060, remaining 1646395392), total 4312604672)
2018-11-27 12:43:10 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
2018-11-27 12:43:11 migration status: active (transferred 213105462, remaining 1512734720), total 4312604672)
2018-11-27 12:43:11 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
2018-11-27 12:43:12 migration status: active (transferred 307794498, remaining 1412001792), total 4312604672)
2018-11-27 12:43:12 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
2018-11-27 12:43:13 migration status: active (transferred 378614976, remaining 1338679296), total 4312604672)
2018-11-27 12:43:13 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
2018-11-27 12:43:14 migration status: active (transferred 449595538, remaining 1265180672), total 4312604672)
2018-11-27 12:43:14 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
2018-11-27 12:43:15 migration status: active (transferred 520749394, remaining 1191092224), total 4312604672)
2018-11-27 12:43:15 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
2018-11-27 12:43:16 migration status: active (transferred 583013276, remaining 302448640), total 4312604672)
2018-11-27 12:43:16 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
2018-11-27 12:43:17 migration status: active (transferred 642844358, remaining 239587328), total 4312604672)
2018-11-27 12:43:17 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
2018-11-27 12:43:18 migration status: active (transferred 703589436, remaining 174489600), total 4312604672)
2018-11-27 12:43:18 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
2018-11-27 12:43:19 migration status: active (transferred 757859113, remaining 109940736), total 4312604672)
2018-11-27 12:43:19 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
2018-11-27 12:43:20 migration status: active (transferred 797484447, remaining 62455808), total 4312604672)
2018-11-27 12:43:20 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
2018-11-27 12:43:20 migration status: active (transferred 801695167, remaining 58253312), total 4312604672)
2018-11-27 12:43:20 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
2018-11-27 12:43:20 migration status: active (transferred 806037215, remaining 53919744), total 4312604672)
2018-11-27 12:43:20 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
2018-11-27 12:43:20 migration status: active (transferred 809985279, remaining 49979392), total 4312604672)
2018-11-27 12:43:20 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
2018-11-27 12:43:20 migration status: active (transferred 814327327, remaining 45645824), total 4312604672)
2018-11-27 12:43:20 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
2018-11-27 12:43:20 migration status: active (transferred 818406719, remaining 41574400), total 4312604672)
2018-11-27 12:43:20 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
2018-11-27 12:43:21 migration status: active (transferred 822617439, remaining 37371904), total 4312604672)
2018-11-27 12:43:21 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
2018-11-27 12:43:21 migration status: active (transferred 826828159, remaining 33169408), total 4312604672)
2018-11-27 12:43:21 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
2018-11-27 12:43:21 migration status: active (transferred 830907551, remaining 29097984), total 4312604672)
2018-11-27 12:43:21 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
2018-11-27 12:43:21 migration status: active (transferred 835249626, remaining 24752128), total 4312604672)
2018-11-27 12:43:21 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
2018-11-27 12:43:21 migration status: active (transferred 839571748, remaining 20168704), total 4312604672)
2018-11-27 12:43:21 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
2018-11-27 12:43:21 migration status: active (transferred 843758061, remaining 12165120), total 4312604672)
2018-11-27 12:43:21 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
2018-11-27 12:43:21 migration status: active (transferred 847849154, remaining 16842752), total 4312604672)
2018-11-27 12:43:21 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 673 overflow 0
2018-11-27 12:43:21 migration status: active (transferred 852191202, remaining 12509184), total 4312604672)
2018-11-27 12:43:21 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 1731 overflow 0
2018-11-27 12:43:21 migration status: active (transferred 856336258, remaining 8372224), total 4312604672)
2018-11-27 12:43:21 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 2741 overflow 0
2018-11-27 12:43:21 migration status: active (transferred 860448482, remaining 4268032), total 4312604672)
2018-11-27 12:43:21 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 3743 overflow 0
2018-11-27 12:43:22 migration speed: 315.08 MB/s - downtime 22 ms
2018-11-27 12:43:22 migration status: completed
2018-11-27 12:43:22 ERROR: tunnel replied 'ERR: resume failed - unable to find configuration file for VM 120 - no such machine' to command 'resume 120'
2018-11-27 12:43:25 ERROR: migration finished with problems (duration 00:00:19)
TASK ERROR: migration problems
 
Maybe the node you want to migrate to is not in the HA group for this VM?
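You can cross-check that on the CLI, for example (assuming the usual ha-manager subcommands):

ha-manager groupconfig    # which nodes each HA group contains
ha-manager config         # which group each HA resource (VM) is assigned to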
 
I tested migrating 15 VMs: with the network in active-backup mode, about 10 VMs ended up locked; in another test, about 5 VMs were locked.
 
It sounds like your cluster network is overloaded. What do pvecm status and ha-manager status say?
What does your network setup look like?

Maybe you can try to reduce the number of HA workers (see man datacenter.cfg for that).
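A minimal sketch of what that could look like in /etc/pve/datacenter.cfg (the value 1 is just an example to serialize the workers; the option name should be max_workers, but double-check man datacenter.cfg):

max_workers: 1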
 
bwlimit:clone=LIMIT,default=LIMIT,migration=142MiB/s,move=LIMIT,restore=LIMIT
Like this format?
 
pveproxy[2590481]: parse error in '/etc/pve/datacenter.cfg' - 'bwlimit': invalid format - format error
bwlimit.migration: type check ('number') failed - got '142MiB/s'
bwlimit.restore: type check ('number') failed - got 'LIMIT'
bwlimit.move: type check ('number') failed - got 'LIMIT'
bwlimit.clone: type check ('number') failed - got 'LIMIT'
bwlimit.default: type check ('number') failed - got 'LIMIT'
 
I think /etc/pve/datacenter.cfg may not be taking effect: I put 'bwlimit: migration=142' in it, but the migration speed is still more than 200 MB/s.
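For the record, the parse error above shows the values must be plain numbers; a valid line would look roughly like this (the unit is documented in man datacenter.cfg, as far as I know it is KiB/s, so roughly 142 MiB/s would be):

bwlimit: migration=145408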
 
Hey guys,

I am encountering the same error with a much more up-to-date version of PVE.

task started by HA resource agent
2019-05-15 16:55:38 starting migration of VM 30406 to node 'pve06' (10.20.31.116)
2019-05-15 16:55:38 copying disk images
2019-05-15 16:55:38 starting VM 30406 on remote node 'pve06'
2019-05-15 16:55:40 start remote tunnel
2019-05-15 16:55:41 ssh tunnel ver 1
2019-05-15 16:55:41 starting online/live migration on tcp:10.20.31.116:60000
2019-05-15 16:55:41 migrate_set_speed: 8589934592
2019-05-15 16:55:41 migrate_set_downtime: 0.1
2019-05-15 16:55:41 set migration_caps
2019-05-15 16:55:41 set cachesize: 134217728
2019-05-15 16:55:41 start migrate command to tcp:10.20.31.116:60000
2019-05-15 16:55:42 migration status: active (transferred 445338360, remaining 640929792), total 1091379200)
2019-05-15 16:55:42 migration xbzrle cachesize: 134217728 transferred 0 pages 0 cachemiss 0 overflow 0
2019-05-15 16:55:43 migration status: active (transferred 796825997, remaining 288079872), total 1091379200)
2019-05-15 16:55:43 migration xbzrle cachesize: 134217728 transferred 0 pages 0 cachemiss 0 overflow 0
2019-05-15 16:55:44 migration speed: 341.33 MB/s - downtime 11 ms
2019-05-15 16:55:44 migration status: completed
2019-05-15 16:55:44 ERROR: tunnel replied 'ERR: resume failed - unable to find configuration file for VM 30406 - no such machine' to command 'resume 30406'
2019-05-15 16:55:46 ERROR: migration finished with problems (duration 00:00:08)
TASK ERROR: migration problems

proxmox-ve: 5.2-2 (running kernel: 4.15.18-14-pve)
pve-manager: 5.4-5 (running version: 5.4-5/c6fdb264)
pve-kernel-4.15: 5.4-2
pve-kernel-4.15.18-14-pve: 4.15.18-39
pve-kernel-4.15.18-12-pve: 4.15.18-36
pve-kernel-4.15.18-9-pve: 4.15.18-30
pve-kernel-4.15.17-1-pve: 4.15.17-9
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-9
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-51
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-13
libpve-storage-perl: 5.0-42
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
openvswitch-switch: 2.7.0-3
proxmox-widget-toolkit: 1.0-26
pve-cluster: 5.0-37
pve-container: 2.0-37
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-20
pve-firmware: 2.0-6
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 3.0.1-2
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-51
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.13-pve1~bpo2

Environment details:
  • 7 Proxmox nodes running the same version
  • each node connects via dual 10 Gbit NICs to 2 switches, forming an MLAG port channel
  • each traffic type (including the cluster/corosync traffic) is separated onto its own VLAN
  • Open vSwitch 2.7.0-3 with balance-tcp and LACP rate set to fast
datacenter.cfg looks like:

keyboard: en-us
migration: type=insecure
bwlimit: move=8192
bwlimit: migration=8192
bwlimit: clone=8192

So, I haven't touched any of the datacenter.cfg settings related to ha-manager in this file.

Quorum information
------------------
Date: Wed May 15 17:11:41 2019
Quorum provider: corosync_votequorum
Nodes: 7
Node ID: 0x00000006
Ring ID: 1/1576
Quorate: Yes

Votequorum information
----------------------
Expected votes: 7
Highest expected: 7
Total votes: 7
Quorum: 4
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.20.31.111
0x00000002 1 10.20.31.112
0x00000003 1 10.20.31.113
0x00000004 1 10.20.31.114
0x00000005 1 10.20.31.115
0x00000006 1 10.20.31.116 (local)
0x00000007 1 10.20.31.117

quorum OK
master pve04 (active, Wed May 15 17:12:17 2019)
lrm pve01 (active, Wed May 15 17:12:17 2019)
lrm pve02 (active, Wed May 15 17:12:17 2019)
lrm pve03 (active, Wed May 15 17:12:17 2019)
lrm pve04 (active, Wed May 15 17:12:17 2019)
lrm pve05 (idle, Wed May 15 17:12:13 2019)
lrm pve06 (active, Wed May 15 17:12:09 2019)
lrm pve07 (active, Wed May 15 17:12:08 2019)
service vm:1001 (pve01, started)
service vm:103 (pve07, stopped)
service vm:1709 (pve03, started)
service vm:1710 (pve07, started)
service vm:1712 (pve02, started)
service vm:30402 (pve06, started)
service vm:30403 (pve06, started)
service vm:30404 (pve06, started)
service vm:30405 (pve06, started)
service vm:30406 (pve06, started)
service vm:30407 (pve06, started)
service vm:30408 (pve06, started)
service vm:30409 (pve06, started)
service vm:30416 (pve06, started)
service vm:30417 (pve06, started)
service vm:30418 (pve06, started)
service vm:30429 (pve04, started)

Note: the HA group is created with all nodes included; restricted is unchecked and nofailback is checked.

The VM migration is done via the GUI using the bulk migrate action, selecting the specific instance IDs and setting 2+ parallel migration jobs.
I am under the impression that running multiple bulk migrations at the same time for HA-managed VM IDs causes the VMs to enter a stalled state, as if the VM arrives on the new node but its config is left behind :)
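In case it helps with debugging, this is roughly how I check on the target node whether the config file has actually arrived and whether the VM is still locked (VM ID and node name taken from the log above):

qm config 30406 | grep lock
ls -l /etc/pve/nodes/pve06/qemu-server/30406.conf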

Let me know if I can provide you with more information.

Thanks,
Alex
 
I have a cluster with a large number of VMs, and I also hit issues on large bulk migrates where some VMs end up in a "freeze" state and stay that way until I hit resume. I'm running a dedicated 10 G backend network as well; the only traffic on it is the cluster.
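What I do to recover them manually is roughly this (replace <vmid> with the affected VM's ID):

qm status <vmid>
qm resume <vmid>
qm unlock <vmid>

qm resume un-pauses the guest on the target node, and qm unlock clears a leftover migrate lock if one is still set.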
 
