Live migration fails for high-memory VMs

opn

Hi all.

I am experiencing an issue that occurs on almost every attempt to live-migrate a virtual machine with a high memory allocation (32 GB).

The live migration starts, then the VM suddenly powers off and restarts.

We are using a 3-node HA Proxmox cluster with Ceph storage.

Here is the task log output:

Code:
task started by HA resource agent
2020-05-07 06:48:40 starting migration of VM 102 to node 'clusternode1' (10.81.250.100)
2020-05-07 06:48:40 starting VM 102 on remote node 'clusternode1'
2020-05-07 06:48:42 start remote tunnel
2020-05-07 06:48:43 ssh tunnel ver 1
2020-05-07 06:48:43 starting online/live migration on unix:/run/qemu-server/102.migrate
2020-05-07 06:48:43 set migration_caps
2020-05-07 06:48:43 migration speed limit: 8589934592 B/s
2020-05-07 06:48:43 migration downtime limit: 100 ms
2020-05-07 06:48:43 migration cachesize: 4294967296 B
2020-05-07 06:48:43 set migration parameters
2020-05-07 06:48:43 start migrate command to unix:/run/qemu-server/102.migrate
2020-05-07 06:48:44 migration status: active (transferred 28051947, remaining 34337816576), total 34377441280)
2020-05-07 06:48:44 migration xbzrle cachesize: 4294967296 transferred 0 pages 0 cachemiss 0 overflow 0
2020-05-07 06:48:45 migration status: active (transferred 57369637, remaining 34305478656), total 34377441280)
2020-05-07 06:48:45 migration xbzrle cachesize: 4294967296 transferred 0 pages 0 cachemiss 0 overflow 0
2020-05-07 06:48:46 migration status: active (transferred 86952918, remaining 34275262464), total 34377441280)
...
2020-05-07 06:55:41 migration xbzrle cachesize: 4294967296 transferred 0 pages 0 cachemiss 0 overflow 0
2020-05-07 06:55:42 migration status: active (transferred 31143197399, remaining 354840576), total 34377441280)
2020-05-07 06:55:42 migration xbzrle cachesize: 4294967296 transferred 0 pages 0 cachemiss 0 overflow 0
2020-05-07 06:55:43 migration status: active (transferred 31260030231, remaining 238235648), total 34377441280)
2020-05-07 06:55:43 migration xbzrle cachesize: 4294967296 transferred 0 pages 0 cachemiss 0 overflow 0
2020-05-07 06:55:44 migration status: active (transferred 31377121607, remaining 121372672), total 34377441280)
2020-05-07 06:55:44 migration xbzrle cachesize: 4294967296 transferred 0 pages 0 cachemiss 0 overflow 0
query migrate failed: VM 102 not running

2020-05-07 06:55:45 query migrate failed: VM 102 not running
query migrate failed: VM 102 not running

2020-05-07 06:55:47 query migrate failed: VM 102 not running
query migrate failed: VM 102 not running

2020-05-07 06:55:49 query migrate failed: VM 102 not running
query migrate failed: VM 102 not running

2020-05-07 06:55:51 query migrate failed: VM 102 not running
query migrate failed: VM 102 not running

2020-05-07 06:55:53 query migrate failed: VM 102 not running
query migrate failed: VM 102 not running

2020-05-07 06:55:55 query migrate failed: VM 102 not running
2020-05-07 06:55:55 ERROR: online migrate failure - too many query migrate failures - aborting
2020-05-07 06:55:55 aborting phase 2 - cleanup resources
2020-05-07 06:55:55 migrate_cancel
2020-05-07 06:55:55 migrate_cancel error: VM 102 not running
2020-05-07 06:55:57 ERROR: migration finished with problems (duration 00:07:18)
TASK ERROR: migration problems

Is there anything I can do to make live migration more reliable than this? At the moment I am better off powering the virtual machine off and starting it back up on a different host.

Kind regards
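
P.S. For what it's worth, the "migration speed limit" and "migration downtime limit" values in the log appear to map to the per-VM migrate_speed and migrate_downtime options in qemu-server. A sketch of how they could be tuned (illustrative values, not a confirmed fix for this crash):

Code:
# cap migration bandwidth at 500 MB/s (default 0 = unlimited) so the
# memory copy does not saturate the source node
qm set 102 --migrate_speed 500

# allow up to 400 ms of final-pause downtime (default is 0.1 s) so the
# last dirty pages of a busy 32 GB guest can converge
qm set 102 --migrate_downtime 0.4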
 
Can you please post your package versions: pveversion -v

Code:
proxmox-ve: 6.1-2 (running kernel: 5.3.18-3-pve)
pve-manager: 6.1-8 (running version: 6.1-8/806edfe1)
pve-kernel-helper: 6.1-8
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-1-pve: 5.3.18-1
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.21-3-pve: 5.0.21-7
pve-kernel-5.0.21-1-pve: 5.0.21-2
pve-kernel-5.0.18-1-pve: 5.0.18-3
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph: 14.2.9-pve1
ceph-fuse: 14.2.9-pve1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-17
libpve-guest-common-perl: 3.0-5
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 4.0.1-pve1
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-23
pve-docs: 6.1-6
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.0-10
pve-firmware: 3.0-7
pve-ha-manager: 3.0-9
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-7
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1
 
I would like to report that we had to attempt another live migration today due to a hardware issue with one of our servers, with the same result: the virtual machine "died" during the live migration and had to be restarted on a different node.
 
I would like to bump this up. Am I the only one with this problem at the moment?
 
Is there any error or warning in the syslog/journal during that time? Maybe the QEMU process crashed. Did you upgrade to the latest 6.2? The QEMU shipped with it (version 5.0) has quite a few fixes included as well.
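
For example, something along these lines on the source node should show whether the kernel OOM killer or a crash took out the QEMU process (time window taken from your log, adjust as needed):

Code:
# kernel messages around the failure (OOM killer, segfaults, ...)
journalctl -k --since "2020-05-07 06:48" --until "2020-05-07 07:00"

# everything mentioning QEMU/KVM or OOM in the same window
journalctl --since "2020-05-07 06:48" --until "2020-05-07 07:00" | grep -Ei 'qemu|kvm|oom'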

Further, you say you're using 32 GB in this VM; how much does the host system have, and what else runs on it?
 
I have attached two screenshots with log entries from both Proxmox and syslog from the time in question. I can't see any QEMU-related error messages.

These crashes obviously occurred with version 6.1. Upgrading to 6.2 will have to be done on a weekend, since we can't use the live migration feature reliably at the moment.

We have three nodes with ~200 GB of RAM each; each node uses ~50 GB for VMs, so there are plenty of resources available at any time.
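
For anyone who wants to verify that kind of headroom themselves, e.g. (commands only, exact output omitted; the node name is ours):

Code:
# memory and swap on the local node
free -h

# node status (memory, load, uptime) via the Proxmox API
pvesh get /nodes/clusternode1/status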
 

Attachments

  • Screen Shot 2020-05-28 at 7.37.31 am.png (132.9 KB)
  • Screen Shot 2020-05-28 at 7.37.36 am.png (90.1 KB)
Any update on this? Is it a Windows- or Linux-based VM? I'm just curious whether you found a solution.
 
I have not received any helpful feedback except "update to the latest version". We are still having this issue, but we are moving to a different kind of solution in the future, so solving this problem is quite low priority for us at the moment.
 
Thanks for the follow-up.
 
