Slow live-migration performance since 5.0

Sralityhe

Well-Known Member
Jul 5, 2017
Dear Proxmox Team,

First of all, thanks for the great work, as always!

Is it normal that live-migration performance is way slower than before? I'm using local disks and, for example,
"qm migrate 107 hostnamehere --online --with-local-disks" to migrate a VM. The copy itself runs at full speed (gbit) as before, but the downtime seems to have increased a lot.

Code:
2017-07-05 04:52:15 starting online/live migration on unix:/run/qemu-server/107.migrate
2017-07-05 04:52:15 migrate_set_speed: 8589934592
2017-07-05 04:52:15 migrate_set_downtime: 0.1
2017-07-05 04:52:15 set migration_caps
2017-07-05 04:52:15 set cachesize: 53687091
2017-07-05 04:52:15 start migrate command to unix:/run/qemu-server/107.migrate
2017-07-05 04:52:17 migration status: active (transferred 235529221, remaining 56500224), total 554508288)
2017-07-05 04:52:17 migration xbzrle cachesize: 33554432 transferred 0 pages 0 cachemiss 0 overflow 0
2017-07-05 04:52:19 migration speed: 2.02 MB/s - downtime 71 ms
2017-07-05 04:52:19 migration status: completed
drive-virtio0: transferred: 42954326016 bytes remaining: 0 bytes total: 42954326016 bytes progression: 100.00 % busy: 0 ready: 1
all mirroring jobs are ready
drive-virtio0: Completing block job...
drive-virtio0: Completed successfully.
drive-virtio0 : finished
2017-07-05 04:52:31 # /usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=hostnamehere' root@internal-ip-here pvesr set-state 107 \''{}'\'
2017-07-05 04:52:36 migration finished successfully (duration 00:04:34)

It says 71 ms, but it seems to take from 2017-07-05 04:52:19 until 2017-07-05 04:52:31 to get it started on the other node (or to set its power state?), so it's ~12 seconds instead of <100 ms like before.
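
Just to show how I'm reading the numbers: the ~12 seconds comes straight from the two log timestamps (a quick Python sketch, with the timestamps copied from the log above):

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# From the log above: "migration status: completed" vs. the later pvesr command.
completed = datetime.strptime("2017-07-05 04:52:19", FMT)
pvesr_cmd = datetime.strptime("2017-07-05 04:52:31", FMT)

gap = (pvesr_cmd - completed).total_seconds()
print(gap)  # 12.0
```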

Can you confirm this, and is there anything I can change about it?

Kind regards
 
and is there anything I can change about it?

People use shared storage if they want fast live migration. It usually also helps if you use a faster network (>= 10 Gbit).

Besides, the effective downtime was 71 ms, which is quite good!
 
The thing is, it's not 71 ms; that would be great! It takes the 12 seconds between the two commands, since the VM seems down on the other node until then!

It worked fine until the 5.0 upgrade, but since then the reported downtime isn't accurate anymore.
Kind regards
 
The thing is, it's not 71 ms; that would be great! It takes the 12 seconds between the two commands, since the VM seems down on the other node until then!

No, the VM is only down for 71ms. The other time is the overall time to transfer the storage content.
 
No, the VM is only down for 71ms. The other time is the overall time to transfer the storage content.


Thanks for your answer. Don't get me wrong, I know this sounds crazy and you know your code, but trust me when I say it's not 71 ms. I can reproduce it for every VM and every migration: I lose all connections (TeamSpeak, FTP, SCP, SSH). I tested with sites like isitdownrightnow.com; no ICMP request gets answered. Everyone else gets a timeout too, so it's not just me. That was not the case with PVE 4.4! If I open a Windows shell and "ping -t" the IP address, it gets almost 10 timed-out requests before answering again.

I'm not complaining about the overall transfer speed; sure, it can't exceed 1 Gbit when I'm only using 2x1 Gbit LACP per node. That is more than enough, since I don't really use this feature often, and only for one VM with a 40 GB disk, which then takes about 5 minutes to copy.

What does the command at 2017-07-05 04:52:31 do?

2017-07-05 04:52:31 # /usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=hostnamehere' root@internal-ip-here pvesr set-state 107 \''{}'\'

If I read it correctly, it opens an SSH session to the other node and sets the power state to on? The VM is definitely offline before that, and I think that is the reason why every application loses its connections. During these 12 seconds, the VM is offline on the node!

I used PVE 4.4 before and it worked great there; I had 8 ms once, sometimes ~150 ms, but overall well under 12 seconds.

Kind regards
 
I think your network is simply overloaded (or maybe the local disks), and this is why the VM does not respond. Maybe it helps if you set a rate limit?
 
What does the command at 2017-07-05 04:52:31 do?

2017-07-05 04:52:31 # /usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=hostnamehere' root@internal-ip-here pvesr set-state 107 \''{}'\'

If I read it correctly, it opens an SSH session to the other node and sets the power state to on?

No, the VM is started long before that. This is something else (transferring the replication state after the migration has finished).

I used PVE 4.4 before and it worked great there; I had 8 ms once, sometimes ~150 ms, but overall well under 12 seconds.

Using '--with-local-disks' ?
 
I think your network is simply overloaded (or maybe the local disks), and this is why the VM does not respond. Maybe it helps if you set a rate limit?

Thanks for your answer!

The network shouldn't be overloaded. I have roughly 2 Mbit of traffic in and out in total for my entire small network, with a 4x1G uplink for the switch and 2x1G for each node (the IGMP resolver is also within my network).

During the transfer, one NIC on each node gets full load, but besides that there should be more than enough bandwidth left (1G per node) for other connections or management commands like setting the power state.

I just made another test, and the ICMP response time doesn't seem to increase during a transfer (I know, just one indicator).

I'm pretty much out of ideas. The fact that it worked before 5.0, and that the VM responds again at exactly the same time the SSH command gets executed, made me think that it maybe gets executed way too late in the transfer?

Kind regards
 
No, the VM is started long before that. This is something else (transferring the replication state after the migration has finished).

I see, that sounds strange! I don't know anything about coding, but maybe it's somehow related?

Using '--with-local-disks' ?

Yes indeed. The setup itself has not changed: no additional VMs, nothing within the network, and I am migrating in exactly the same way (with local disks). I just did the 4.4 to 5.0 upgrade as your documentation shows, and besides that everything works great.

I'll try it with a rate limit now! Ty!

Edit: How do I migrate with a rate limit set? I just found this:
Code:
USAGE: qm migrate <vmid> <target> [OPTIONS]

  Migrate virtual machine. Creates a new migration task.

  <vmid>     <integer> (1 - N)

             The (unique) ID of the VM.

  <target>   <string>

             Target node.

  -force     <boolean>

             Allow to migrate VMs which use local devices. Only root may
             use this option.

  -migration_network <string>

             CIDR of the (sub) network that is used for migration.

  -migration_type <insecure | secure>

             Migration traffic is encrypted using an SSH tunnel by default.
             On secure, completely private networks this can be disabled to
             increase performance.

  -online    <boolean>

             Use online/live migration.

  -targetstorage <string>

             Default target storage.

  -with-local-disks <boolean>

             Enable live storage migration for local disk

Here you can see the load on network and nodes:

Code:
> show interfaces ae2 | find rate
  Input rate     : 392384 bps (216 pps)
  Output rate    : 295088 bps (142 pps)

> show interfaces ae0 | find rate
  Input rate     : 910136 bps (577 pps)
  Output rate    : 1006328 bps (639 pps)

uptime
 17:35:45 up 2 days, 17:38,  1 user,  load average: 0.48, 0.45, 0.42

uptime
 17:36:01 up 2 days, 12:51,  1 user,  load average: 0.40, 0.49, 0.51


dd if=/dev/zero of=/root/testfile bs=1G count=1 oflag=dsync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 2.79875 s, 384 MB/s

dd if=/dev/zero of=/root/testfile bs=1G count=1 oflag=dsync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 2.71752 s, 395 MB/s
 
I see, that sounds strange! I don't know anything about coding, but maybe it's somehow related?

No, it is absolutely unrelated.

Yes indeed. The setup itself has not changed: no additional VMs, nothing within the network, and I am migrating in exactly the same way (with local disks). I just did the 4.4 to 5.0 upgrade as your documentation shows, and besides that everything works great.

and the behaviour is reproducible?

I'll try it with a rate limit now! Ty!

Edit: How do I migrate with a rate limit set?

To limit it to 5 MB/s:

# qm set <VMID> --migrate_speed 5
 
No, it is absolutely unrelated.

Hmm, but how come it gets back online at exactly the same time?


and the behaviour is reproducible?

I'll set up a new cluster and get back to this ASAP. I'll also try installing 5.0 directly and give that a try.

To limit it to 5 MB/s:

# qm set <VMID> --migrate_speed 5

That doesn't seem to affect a migration, perhaps just the replication?
Code:
qm set 114 --migrate_speed 25
update VM 114: -migrate_speed 25
Afterwards I started to migrate it, and the speed went up to 980 Mbit.
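
Just to double-check the units (--migrate_speed is in MB/s, per the example above): 25 MB/s would be about 200 Mbit/s, and the observed 980 Mbit is far above that, so the limit clearly was not applied. A quick arithmetic sketch:

```python
limit_mb_s = 25                  # value passed to --migrate_speed (MB/s)
limit_mbit_s = limit_mb_s * 8    # 1 byte = 8 bits
observed_mbit_s = 980            # transfer rate seen during the migration

print(limit_mbit_s)                    # 200
print(observed_mbit_s > limit_mbit_s)  # True: the limit had no effect
```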

Kind Regards
 
Still testing, but so far I got:


Fresh 4.4 cluster, installed via the ISO from you guys:

Code:
Jul 07 19:09:07 starting online/live migration on unix:/run/qemu-server/100.migrate
Jul 07 19:09:07 migrate_set_speed: 8589934592
Jul 07 19:09:07 migrate_set_downtime: 0.1
Jul 07 19:09:07 set migration_caps
Jul 07 19:09:07 set cachesize: 107374182
Jul 07 19:09:07 start migrate command to unix:/run/qemu-server/100.migrate
Jul 07 19:09:09 migration status: active (transferred 410373256, remaining 667066368), total 1082990592)
Jul 07 19:09:09 migration xbzrle cachesize: 67108864 transferred 0 pages 0 cachemiss 0 overflow 0
Jul 07 19:09:11 migration status: active (transferred 663043934, remaining 413294592), total 1082990592)
Jul 07 19:09:11 migration xbzrle cachesize: 67108864 transferred 0 pages 0 cachemiss 0 overflow 0
Jul 07 19:09:13 migration status: active (transferred 1035620523, remaining 28966912), total 1082990592)
Jul 07 19:09:13 migration xbzrle cachesize: 67108864 transferred 0 pages 0 cachemiss 0 overflow 0
Jul 07 19:09:13 migration speed: 22.26 MB/s - downtime 16 ms
Jul 07 19:09:13 migration status: completed
drive-sata0: transferred: 5368774656 bytes remaining: 0 bytes total: 5368774656 bytes progression: 100.00 % busy: false ready: true
all mirroring jobs are ready
drive-sata0: Completing block job...
drive-sata0: Completed successfully.
drive-sata0 : finished
Jul 07 19:09:28 migration finished successfully (duration 00:01:04)

The 16 ms is definitely accurate again. No loss, no lag, no timeout. Perfect!


After the upgrade to 5.0:
Code:
sed -i 's/jessie/stretch/g' /etc/apt/sources.list
sed -i 's/jessie/stretch/g' /etc/apt/sources.list.d/pve-enterprise.list
apt-get update && apt-get dist-upgrade
reboot


Code:
2017-07-07 19:52:11 starting online/live migration on unix:/run/qemu-server/100.migrate
2017-07-07 19:52:11 migrate_set_speed: 8589934592
2017-07-07 19:52:11 migrate_set_downtime: 0.1
2017-07-07 19:52:11 set migration_caps
2017-07-07 19:52:11 set cachesize: 107374182
2017-07-07 19:52:11 start migrate command to unix:/run/qemu-server/100.migrate
2017-07-07 19:52:13 migration status: active (transferred 389242786, remaining 683032576), total 1091379200)
2017-07-07 19:52:13 migration xbzrle cachesize: 67108864 transferred 0 pages 0 cachemiss 0 overflow 0
2017-07-07 19:52:15 migration status: active (transferred 735550058, remaining 335929344), total 1091379200)
2017-07-07 19:52:15 migration xbzrle cachesize: 67108864 transferred 0 pages 0 cachemiss 0 overflow 0
2017-07-07 19:52:17 migration status: active (transferred 972642897, remaining 97275904), total 1091379200)
2017-07-07 19:52:17 migration xbzrle cachesize: 67108864 transferred 0 pages 0 cachemiss 0 overflow 0
2017-07-07 19:52:17 migration status: active (transferred 1011161604, remaining 57643008), total 1091379200)
2017-07-07 19:52:17 migration xbzrle cachesize: 67108864 transferred 0 pages 0 cachemiss 0 overflow 0
2017-07-07 19:52:17 migration status: active (transferred 1046032404, remaining 21401600), total 1091379200)
2017-07-07 19:52:17 migration xbzrle cachesize: 67108864 transferred 0 pages 0 cachemiss 0 overflow 0
2017-07-07 19:52:18 migration speed: 28.44 MB/s - downtime 67 ms
2017-07-07 19:52:18 migration status: completed
drive-virtio0: transferred: 5368774656 bytes remaining: 0 bytes total: 5368774656 bytes progression: 100.00 % busy: 0 ready: 1
all mirroring jobs are ready
drive-virtio0: Completing block job...
drive-virtio0: Completed successfully.
drive-virtio0 : finished
2017-07-07 19:52:30 # /usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=pve04' root@10.14.0.21 pvesr set-state 100 \''{}'\'
2017-07-07 19:52:34 migration finished successfully (duration 00:00:54)

The time is not accurate anymore: the downtime runs from 2017-07-07 19:52:18 to 2017-07-07 19:52:30, like in my other cluster.

I'll try a from-scratch 5.0 installation tomorrow.

Kind regards
 
Hello!

I just tested with a directly installed 5.0 ISO, and the problem seems to be the same.

Code:
2017-07-08 20:21:40 starting online/live migration on unix:/run/qemu-server/100.migrate
2017-07-08 20:21:40 migrate_set_speed: 8589934592
2017-07-08 20:21:40 migrate_set_downtime: 0.1
2017-07-08 20:21:40 set migration_caps
2017-07-08 20:21:40 set cachesize: 107374182
2017-07-08 20:21:40 start migrate command to unix:/run/qemu-server/100.migrate
2017-07-08 20:21:42 migration status: active (transferred 225417227, remaining 860528640), total 1091379200)
2017-07-08 20:21:42 migration xbzrle cachesize: 67108864 transferred 0 pages 0 cachemiss 0 overflow 0
2017-07-08 20:21:44 migration status: active (transferred 464941119, remaining 620638208), total 1091379200)
2017-07-08 20:21:44 migration xbzrle cachesize: 67108864 transferred 0 pages 0 cachemiss 0 overflow 0
2017-07-08 20:21:46 migration status: active (transferred 702130050, remaining 382984192), total 1091379200)
2017-07-08 20:21:46 migration xbzrle cachesize: 67108864 transferred 0 pages 0 cachemiss 0 overflow 0
2017-07-08 20:21:48 migration status: active (transferred 916509546, remaining 164085760), total 1091379200)
2017-07-08 20:21:48 migration xbzrle cachesize: 67108864 transferred 0 pages 0 cachemiss 0 overflow 0
2017-07-08 20:21:49 migration status: active (transferred 937403177, remaining 143171584), total 1091379200)
2017-07-08 20:21:49 migration xbzrle cachesize: 67108864 transferred 0 pages 0 cachemiss 0 overflow 0
2017-07-08 20:21:49 migration status: active (transferred 956421423, remaining 124067840), total 1091379200)
2017-07-08 20:21:49 migration xbzrle cachesize: 67108864 transferred 0 pages 0 cachemiss 0 overflow 0
2017-07-08 20:21:49 migration status: active (transferred 964236175, remaining 115941376), total 1091379200)
2017-07-08 20:21:49 migration xbzrle cachesize: 67108864 transferred 0 pages 0 cachemiss 0 overflow 0
2017-07-08 20:21:50 migration status: active (transferred 982525368, remaining 96899072), total 1091379200)
2017-07-08 20:21:50 migration xbzrle cachesize: 67108864 transferred 0 pages 0 cachemiss 0 overflow 0
2017-07-08 20:21:50 migration status: active (transferred 994857939, remaining 84578304), total 1091379200)
2017-07-08 20:21:50 migration xbzrle cachesize: 67108864 transferred 0 pages 0 cachemiss 0 overflow 0
2017-07-08 20:21:50 migration status: active (transferred 1004071758, remaining 75239424), total 1091379200)
2017-07-08 20:21:50 migration xbzrle cachesize: 67108864 transferred 0 pages 0 cachemiss 0 overflow 0
2017-07-08 20:21:50 migration status: active (transferred 1034717715, remaining 44044288), total 1091379200)
2017-07-08 20:21:50 migration xbzrle cachesize: 67108864 transferred 0 pages 0 cachemiss 0 overflow 0
2017-07-08 20:21:51 migration speed: 26.95 MB/s - downtime 35 ms
2017-07-08 20:21:51 migration status: completed
drive-virtio0: transferred: 5368971264 bytes remaining: 0 bytes total: 5368971264 bytes progression: 100.00 % busy: 0 ready: 1
all mirroring jobs are ready
drive-virtio0: Completing block job...
drive-virtio0: Completed successfully.
drive-virtio0 : finished
2017-07-08 20:22:03 # /usr/bin/ssh -o 'BatchMode=yes' -o 'HostKeyAlias=pve04' root@10.14.0.21 pvesr set-state 100 \''{}'\'
2017-07-08 20:22:07 migration finished successfully (duration 00:00:56)

Downtime of 12 seconds, like after an upgrade. I also find it kind of strange that it always seems to be exactly 12 seconds.

Kind regards
 
Hi,

on Proxmox 5.0,
can you try to edit (on the source and the target host, to be sure)

/usr/share/perl5/PVE/QemuMigrate.pm

and comment out:

Code:
# foreach my $drive (keys %{$self->{target_drive}}){
#     PVE::QemuServer::vm_mon_cmd_nocheck($vmid, "device_del", id => $drive);
# }

then restart pvedaemon:

Code:
systemctl restart pvedaemon

and try the migration again?
 
Hi,

on Proxmox 5.0,
can you try to edit (on the source and the target host, to be sure)

/usr/share/perl5/PVE/QemuMigrate.pm

and comment out:

Code:
# foreach my $drive (keys %{$self->{target_drive}}){
#     PVE::QemuServer::vm_mon_cmd_nocheck($vmid, "device_del", id => $drive);
# }

then restart pvedaemon:

Code:
systemctl restart pvedaemon

and try the migration again?

Thank you for your answer!

Unfortunately, the changes didn't seem to make any difference.

Kind Regards
 
The problem is that the tunnel is not closed, and we wait 10 seconds before we kill it externally.
I'm not sure why the tunnel is not closing once the mirror process is finished.
But I am investigating.
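
For anyone trying to picture it, the wait-then-kill behaviour described above could be sketched roughly like this (a hedged shell sketch of the pattern only, not the actual pvedaemon code; "wait_or_kill" is a made-up helper name):

```shell
# Rough sketch of the pattern: poll a child process once per second and,
# if it is still alive after TIMEOUT seconds, kill it externally.
wait_or_kill() {
    pid=$1
    timeout=$2
    i=0
    while kill -0 "$pid" 2>/dev/null && [ "$i" -lt "$timeout" ]; do
        sleep 1
        i=$((i + 1))
    done
    if kill -0 "$pid" 2>/dev/null; then
        kill "$pid"
        echo "killed after ${timeout}s"
    else
        echo "exited cleanly"
    fi
}

# Demo: a "tunnel" that never closes on its own gets killed after the timeout.
sleep 60 &
wait_or_kill $! 2   # prints: killed after 2s
```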
 
The problem is that the tunnel is not closed, and we wait 10 seconds before we kill it externally.
I'm not sure why the tunnel is not closing once the mirror process is finished.
But I am investigating.

Thank you for your answer, and even more for your work and investigation. Great work!

Kind regards
 
