Slow Live Migration when guest has been running for a while

Nov 23, 2017
I'm running into an issue where live migration between nodes in a 3-node cluster is terribly slow if the guest has been running for a couple of weeks. The network is 10 Gb, and shared storage is on an NFS storage server.

This happens on both Windows and Linux VM guests.

Here is an example of a Linux VM that I tried to live migrate, and eventually gave up on and shut down from inside the guest at approx. 14:48, after about 14 minutes had elapsed. The VM doesn't have any snapshots.
Code:
task started by HA resource agent
2018-10-09 14:34:12 use dedicated network address for sending migration traffic (10.99.99.105)
2018-10-09 14:34:12 starting migration of VM 119 to node 'vmserver5' (10.99.99.105)
2018-10-09 14:34:12 copying disk images
2018-10-09 14:34:12 starting VM 119 on remote node 'vmserver5'
2018-10-09 14:34:15 start remote tunnel
2018-10-09 14:34:16 ssh tunnel ver 1
2018-10-09 14:34:16 starting online/live migration on tcp:10.99.99.105:60000
2018-10-09 14:34:16 migrate_set_speed: 8589934592
2018-10-09 14:34:16 migrate_set_downtime: 0.1
2018-10-09 14:34:16 set migration_caps
2018-10-09 14:34:16 set cachesize: 134217728
2018-10-09 14:34:16 start migrate command to tcp:10.99.99.105:60000
2018-10-09 14:34:19 migration status: active (transferred 78118, remaining 1091235840), total 1091379200)
2018-10-09 14:34:19 migration xbzrle cachesize: 134217728 transferred 0 pages 0 cachemiss 0 overflow 0
2018-10-09 14:34:20 migration status: active (transferred 328470, remaining 1090985984), total 1091379200)
2018-10-09 14:34:20 migration xbzrle cachesize: 134217728 transferred 0 pages 0 cachemiss 0 overflow 0
2018-10-09 14:34:21 migration status: active (transferred 1383789, remaining 1089675264), total 1091379200)
2018-10-09 14:34:21 migration xbzrle cachesize: 134217728 transferred 0 pages 0 cachemiss 0 overflow 0
2018-10-09 14:34:22 migration status: active (transferred 1810621, remaining 1089249280), total 1091379200)
2018-10-09 14:34:22 migration xbzrle cachesize: 134217728 transferred 0 pages 0 cachemiss 0 overflow 0
2018-10-09 14:34:23 migration status: active (transferred 2175885, remaining 1088884736), total 1091379200)
2018-10-09 14:34:23 migration xbzrle cachesize: 134217728 transferred 0 pages 0 cachemiss 0 overflow 0
2018-10-09 14:34:24 migration status: active (transferred 2483693, remaining 1088577536), total 1091379200)
2018-10-09 14:34:24 migration xbzrle cachesize: 134217728 transferred 0 pages 0 cachemiss 0 overflow 0
2018-10-09 14:34:25 migration status: active (transferred 2758669, remaining 1088303104), total 1091379200)
2018-10-09 14:34:25 migration xbzrle cachesize: 134217728 transferred 0 pages 0 cachemiss 0 overflow 0
2018-10-09 14:34:26 migration status: active (transferred 3045957, remaining 1088016384), total 1091379200)
2018-10-09 14:34:26 migration xbzrle cachesize: 134217728 transferred 0 pages 0 cachemiss 0 overflow 0
2018-10-09 14:34:27 migration status: active (transferred 4055943, remaining 1086836736), total 1091379200)
2018-10-09 14:34:27 migration xbzrle cachesize: 134217728 transferred 0 pages 0 cachemiss 0 overflow 0
2018-10-09 14:34:28 migration status: active (transferred 4409075, remaining 1086402560), total 1091379200)
2018-10-09 14:34:28 migration xbzrle cachesize: 134217728 transferred 0 pages 0 cachemiss 0 overflow 0
2018-10-09 14:34:29 migration status: active (transferred 4749849, remaining 1086005248), total 1091379200)
2018-10-09 14:34:29 migration xbzrle cachesize: 134217728 transferred 0 pages 0 cachemiss 0 overflow 0
.
.
.
2018-10-09 14:48:29 migration status: active (transferred 649486833, remaining 421224448), total 1091379200)
2018-10-09 14:48:29 migration xbzrle cachesize: 134217728 transferred 0 pages 0 cachemiss 0 overflow 0
2018-10-09 14:48:30 migration status: active (transferred 650537534, remaining 420155392), total 1091379200)
2018-10-09 14:48:30 migration xbzrle cachesize: 134217728 transferred 0 pages 0 cachemiss 0 overflow 0
2018-10-09 14:48:31 migration status: active (transferred 651653863, remaining 419037184), total 1091379200)
2018-10-09 14:48:31 migration xbzrle cachesize: 134217728 transferred 0 pages 0 cachemiss 0 overflow 0
query migrate failed: VM 119 qmp command 'query-migrate' failed - client closed connection

2018-10-09 14:48:34 query migrate failed: VM 119 qmp command 'query-migrate' failed - client closed connection
query migrate failed: unable to open monitor socket

2018-10-09 14:48:36 query migrate failed: unable to open monitor socket
query migrate failed: VM 119 not running

2018-10-09 14:48:38 query migrate failed: VM 119 not running
query migrate failed: VM 119 not running

2018-10-09 14:48:40 query migrate failed: VM 119 not running
query migrate failed: VM 119 not running

2018-10-09 14:48:42 query migrate failed: VM 119 not running
query migrate failed: VM 119 not running

2018-10-09 14:48:44 query migrate failed: VM 119 not running
2018-10-09 14:48:44 ERROR: online migrate failure - too many query migrate failures - aborting
2018-10-09 14:48:44 aborting phase 2 - cleanup resources
2018-10-09 14:48:44 migrate_cancel
2018-10-09 14:48:44 migrate_cancel error: VM 119 not running
2018-10-09 14:48:46 ERROR: migration finished with problems (duration 00:14:35)
TASK ERROR: migration problems


After the guest VM was restarted by HA on the same node, transferring to the same destination node:
Code:
task started by HA resource agent
2018-10-09 14:50:32 use dedicated network address for sending migration traffic (10.99.99.105)
2018-10-09 14:50:32 starting migration of VM 119 to node 'vmserver5' (10.99.99.105)
2018-10-09 14:50:32 copying disk images
2018-10-09 14:50:32 starting VM 119 on remote node 'vmserver5'
2018-10-09 14:50:36 start remote tunnel
2018-10-09 14:50:37 ssh tunnel ver 1
2018-10-09 14:50:37 starting online/live migration on tcp:10.99.99.105:60000
2018-10-09 14:50:37 migrate_set_speed: 8589934592
2018-10-09 14:50:37 migrate_set_downtime: 0.1
2018-10-09 14:50:37 set migration_caps
2018-10-09 14:50:37 set cachesize: 134217728
2018-10-09 14:50:37 start migrate command to tcp:10.99.99.105:60000
2018-10-09 14:50:38 migration status: active (transferred 299569750, remaining 470827008), total 1091379200)
2018-10-09 14:50:38 migration xbzrle cachesize: 134217728 transferred 0 pages 0 cachemiss 0 overflow 0
2018-10-09 14:50:39 migration speed: 512.00 MB/s - downtime 34 ms
2018-10-09 14:50:39 migration status: completed
2018-10-09 14:50:42 migration finished successfully (duration 00:00:11)
TASK OK

For guests whose migration I let finish without rebooting, I am seeing migration speeds as low as 1 MB/s. For freshly rebooted guests, I have seen live migration speeds of up to 2 GB/s.
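To put a number on the slow case, here is a rough sketch that computes the average throughput between the first and last `migration status: active` lines of the task log above. It assumes the Proxmox task-log format shown in this thread; `avg_speed_mb_s` is a hypothetical helper for illustration, not part of any Proxmox tooling.

```python
import re
from datetime import datetime

# Matches Proxmox task-log lines like:
# "2018-10-09 14:34:19 migration status: active (transferred 78118, ..."
LOG_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) migration status: active "
    r"\(transferred (\d+),"
)

def avg_speed_mb_s(log_lines):
    """Average transfer speed in MB/s between the first and last
    'migration status: active' samples found in the log."""
    samples = []
    for line in log_lines:
        m = LOG_RE.search(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            samples.append((ts, int(m.group(2))))
    (t0, b0), (t1, b1) = samples[0], samples[-1]
    elapsed = (t1 - t0).total_seconds()
    return (b1 - b0) / elapsed / (1024 * 1024)

# First and last status samples from the slow migration above:
log = [
    "2018-10-09 14:34:19 migration status: active (transferred 78118, remaining 1091235840), total 1091379200)",
    "2018-10-09 14:48:31 migration status: active (transferred 651653863, remaining 419037184), total 1091379200)",
]
print(round(avg_speed_mb_s(log), 2))  # -> 0.73 (MB/s)
```

So the slow migration averaged well under 1 MB/s over its whole 14-minute run, consistent with the speeds described above.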

pve-manager is at: 5.2-9/4b30e8f9
This was happening earlier today, when all kernels were at 4.15.19-4-pve, and it continued after upgrading 2 of the nodes to 4.15.18-5-pve and migrating from an older-kernel node to a newer one.

Additionally, I just finished live migrating our VoIP PBX, which I couldn't shut down. The migration took 38 minutes at 1.34 MB/s. After rebooting on the final node, live migrating it back took 19 seconds at 384 MB/s.

Any help in resolving this would be appreciated. If any more info is needed, let me know.
 
I don't quite understand what your question is. But comparing the values at migration start (below), the amount of memory that has to be transferred between the nodes is different. And depending on how much the memory has changed during the migration window, the migration needs to re-transfer those pages.
2018-10-09 14:34:19 migration status: active (transferred 78118, remaining 1091235840), total 1091379200)
2018-10-09 14:34:19 migration xbzrle cachesize: 134217728 transferred 0 pages 0 cachemiss 0 overflow 0

2018-10-09 14:50:38 migration status: active (transferred 299569750, remaining 470827008), total 1091379200)
2018-10-09 14:50:38 migration xbzrle cachesize: 134217728 transferred 0 pages 0 cachemiss 0 overflow 0
 
Thanks for the response. I understand that an active guest would have more changed memory to be migrated. I would expect that to affect total time of transfer, but should it affect transfer speed?

To clarify my initial post, my first question is: why is the live migration transfer speed so much slower (less than 10 MB/s) when the guest has been running for a period of time, yet if I restart the guest, the transfer speed always increases to at least 256 MB/s, if not more?

My second question is: Why do transfer speeds improve after a node has been rebooted?
Additionally, I just finished live migrating our VoIP PBX, which I couldn't shut down. The migration took 38 minutes at 1.34 MB/s. After rebooting on the final node, live migrating it back took 19 seconds at 384 MB/s.
In this example, the PBX guest was similarly loaded with calls.


Regarding how much memory needs to be transferred: for both transfers, isn't the total 1091379200, or am I misreading this information? In the first live migration, the amount transferred on the first line is so small because the transfer speed is so slow.
 
My second question is: Why do transfer speeds improve after a node has been rebooted?
I assume you mean the guest, not the host itself? But AFAIU, the bandwidth is calculated as the amount of RAM divided by the time it took to transfer. With the QEMU monitor you can query the migration info and get some stats.
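As a sanity check on that "RAM divided by time" calculation, the fast migration earlier in the thread can be worked through by hand. This is a sketch, not a definitive account of how QEMU computes the number; the ~2 s elapsed time (14:50:37 start to 14:50:39 completed) is an assumption read off the task log.

```python
# Rough bandwidth estimate: total guest RAM divided by elapsed wall-clock time.
# Values taken from the fast migration log in this thread.
total_bytes = 1091379200          # "total 1091379200" from the log (~1 GiB guest)
elapsed_seconds = 2               # 14:50:37 start -> 14:50:39 completed (assumed)
speed_mb_s = total_bytes / elapsed_seconds / (1024 * 1024)
print(round(speed_mb_s))  # -> 520 (MB/s)
```

That lands in the same ballpark as the 512.00 MB/s the task log reports, which suggests the reported speed is indeed total memory over elapsed time rather than instantaneous link throughput.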