VM migration speed question

Pavletto

New Member
Sep 13, 2023
Hi colleagues,
I would like to ask you about migration speed between PVE cluster nodes.
I have a 3-node PVE 8 cluster with 2x40G network links: one for the Ceph cluster network (1) and another for the PVE cluster/Ceph public network (2).
The Ceph OSDs are all NVMe.
In the cluster options I've also set one of these 40G networks for migration (2).

When I migrate a VM with 32G RAM from one node to another I get these results:
Code:
2023-09-18 11:01:18 starting migration of VM 100 to node 'pve-up-1' (10.100.41.30)
2023-09-18 11:01:18 starting VM 100 on remote node 'pve-up-1'
2023-09-18 11:01:20 start remote tunnel
2023-09-18 11:01:20 ssh tunnel ver 1
2023-09-18 11:01:20 starting online/live migration on unix:/run/qemu-server/100.migrate
2023-09-18 11:01:20 set migration capabilities
2023-09-18 11:01:20 migration downtime limit: 100 ms
2023-09-18 11:01:20 migration cachesize: 4.0 GiB
...
2023-09-18 11:02:20 xbzrle: send updates to 65487 pages in 7.9 MiB encoded memory, cache-miss 84.62%, overflow 604
2023-09-18 11:02:21 auto-increased downtime to continue migration: 200 ms
2023-09-18 11:02:22 migration active, transferred 28.6 GiB of 32.1 GiB VM-state, 287.9 MiB/s
2023-09-18 11:02:22 xbzrle: send updates to 190269 pages in 50.6 MiB encoded memory, cache-miss 34.56%, overflow 861
2023-09-18 11:02:24 migration active, transferred 28.6 GiB of 32.1 GiB VM-state, 151.9 MiB/s, VM dirties lots of memory: 299.1 MiB/s
2023-09-18 11:02:24 xbzrle: send updates to 306352 pages in 60.5 MiB encoded memory, cache-miss 27.66%, overflow 1180
2023-09-18 11:02:25 auto-increased downtime to continue migration: 400 ms
2023-09-18 11:02:27 average migration speed: 491.1 MiB/s - downtime 531 ms
2023-09-18 11:02:27 migration status: completed
2023-09-18 11:02:28 Waiting for spice server migration
2023-09-18 11:02:30 migration finished successfully (duration 00:01:13)

After some googling I set the migration type to "insecure":

Code:
/etc/pve/datacenter.cfg

migration: network=10.100.41.10/24,type=insecure

After this change the result is:

Code:
2023-09-18 11:18:03 use dedicated network address for sending migration traffic (10.100.41.20)
2023-09-18 11:18:03 starting migration of VM 100 to node 'pve-down-2' (10.100.41.20)
2023-09-18 11:18:03 starting VM 100 on remote node 'pve-down-2'
2023-09-18 11:18:05 start remote tunnel
2023-09-18 11:18:06 ssh tunnel ver 1
2023-09-18 11:18:06 starting online/live migration on tcp:10.100.41.20:60000
2023-09-18 11:18:06 set migration capabilities
2023-09-18 11:18:06 migration downtime limit: 100 ms
2023-09-18 11:18:06 migration cachesize: 4.0 GiB
...
2023-09-18 11:18:37 xbzrle: send updates to 47084 pages in 4.1 MiB encoded memory, cache-miss 78.66%, overflow 294
2023-09-18 11:18:38 auto-increased downtime to continue migration: 200 ms
2023-09-18 11:18:39 migration active, transferred 28.2 GiB of 32.1 GiB VM-state, 179.1 MiB/s
2023-09-18 11:18:39 xbzrle: send updates to 129779 pages in 20.8 MiB encoded memory, cache-miss 16.74%, overflow 357
2023-09-18 11:18:40 auto-increased downtime to continue migration: 400 ms
2023-09-18 11:18:41 migration active, transferred 28.2 GiB of 32.1 GiB VM-state, 104.6 MiB/s, VM dirties lots of memory: 177.4 MiB/s
2023-09-18 11:18:41 xbzrle: send updates to 220757 pages in 42.1 MiB encoded memory, cache-miss 20.16%, overflow 424
2023-09-18 11:18:42 auto-increased downtime to continue migration: 800 ms
2023-09-18 11:18:44 migration active, transferred 28.3 GiB of 32.1 GiB VM-state, 339.8 MiB/s
2023-09-18 11:18:44 xbzrle: send updates to 339327 pages in 50.7 MiB encoded memory, cache-miss 21.37%, overflow 620
2023-09-18 11:18:46 average migration speed: 822.5 MiB/s - downtime 472 ms
2023-09-18 11:18:46 migration status: completed
2023-09-18 11:18:47 Waiting for spice server migration
2023-09-18 11:18:49 migration finished successfully (duration 00:00:47)

The speed has increased, but this is still far from the capabilities of the 40G network and the disk speed:

Code:
root@pve-down-1:/tmp# rados bench -p scbench 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_pve-down-1_1588020
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16       615       599   2395.85      2396   0.0213653   0.0262351
    2      16      1238      1222    2443.8      2492    0.164208   0.0239879
    3      16      1605      1589   2118.48      1468   0.0379306   0.0300604
    4      16      2261      2245    2244.8      2624   0.0292071   0.0283965
    5      16      2949      2933   2346.19      2752    0.015009   0.0272138
    6      16      3687      3671   2447.11      2952   0.0131768   0.0261067
    7      16      4417      4401   2514.63      2920   0.0262535     0.02521
    8      16      5089      5073   2536.27      2688    0.010768   0.0251883
    9      16      5756      5740   2550.88      2668   0.0152786   0.0250424
   10      15      6397      6382   2552.56      2568  0.00827939   0.0247853
   11       6      6397      6391   2323.79        36    0.218062   0.0248901
   12       6      6397      6391   2130.14         0           -   0.0248901
Total time run:         12.5638
Total writes made:      6397
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     2036.65
Stddev Bandwidth:       1056.8
Max bandwidth (MB/sec): 2952
Min bandwidth (MB/sec): 0
Average IOPS:           509
Stddev IOPS:            264.2
Max IOPS:               738
Min IOPS:               0
Average Latency(s):     0.0274665
Stddev Latency(s):      0.0937761
Max latency(s):         3.00742
Min latency(s):         0.00576705

Maybe there are more parameters I should use to speed up the migration process?
 
Hi,

Thank you for sharing the outputs!

Regarding "After some googling i've set "insecure" type for migration": this is described in `man datacenter.cfg`.

May I ask about the MTU on the 10.100.41.10/24 network? If it's 1500, consider configuring jumbo frames, e.g., an MTU of 9000, and see if that helps.
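
One way to check whether jumbo frames actually work end to end is sketched below. It is only an illustration: the helper names `icmp_payload_for_mtu` and `check_jumbo_path` are made up for this example, and the interface name and peer address in the usage note are taken from the outputs posted in this thread. The payload size is the MTU minus 28 bytes (20 bytes IPv4 header plus 8 bytes ICMP header), and `ping -M do` forbids fragmentation, so the ping fails if any hop has a smaller MTU.

```shell
#!/bin/sh
# Largest ICMP echo payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes (IPv4 header) minus 8 bytes (ICMP header).
icmp_payload_for_mtu() {
    echo $(( $1 - 20 - 8 ))
}

# Hypothetical helper: show the local MTU of an interface, then ping the
# peer with the "don't fragment" bit set and a full-size payload.
# Arguments: interface name, peer IP address, expected MTU.
check_jumbo_path() {
    iface=$1
    peer=$2
    mtu=$3
    ip -o link show dev "$iface"
    ping -M do -c 3 -s "$(icmp_payload_for_mtu "$mtu")" "$peer"
}
```

Example call, assuming the names from this thread: `check_jumbo_path vlan41 10.100.41.30 9000`. If the ping reports that the message is too long, some interface, bridge, or switch port on the path is still at a smaller MTU.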
 
Thanks for your reply.
I've set MTU 9000 on the interface, then on the vmbr, and finally on the VLAN interface:

Code:
auto ens6f1
iface ens6f1 inet static
        address 10.100.40.20/24
#40G iface CEPH

auto vmbr0
iface vmbr0 inet static
        address 192.168.10.160/23
        gateway 192.168.10.145
        bridge-ports ens9f0np0
        bridge-stp off
        bridge-fd 0
#10G MGMT

auto vmbr10
iface vmbr10 inet manual
        bridge-ports ens9f1np1
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094
#10G Trunk NW

auto vmbr40
iface vmbr40 inet manual
        bridge-ports ens6f0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094
        mtu 9000
#40G Trunk NW

auto vlan41
iface vlan41 inet static
        address 10.100.41.20/24
        mtu 9000
        vlan-raw-device vmbr40
#40G VLAN41 CLUSTER_NW

auto vlan9
iface vlan9 inet static
        address 192.168.25.20/24
        vlan-raw-device vmbr40
#40G VLAN9 192.168.25.20

Here are the results:
Code:
2023-09-18 13:09:37 starting migration of VM 100 to node 'pve-up-1' (10.100.41.30)
2023-09-18 13:09:37 starting VM 100 on remote node 'pve-up-1'
2023-09-18 13:09:39 start remote tunnel
2023-09-18 13:09:40 ssh tunnel ver 1
2023-09-18 13:09:40 starting online/live migration on tcp:10.100.41.30:60000
2023-09-18 13:09:40 set migration capabilities
2023-09-18 13:09:40 migration downtime limit: 100 ms
2023-09-18 13:09:40 migration cachesize: 4.0 GiB
...
2023-09-18 13:10:00 auto-increased downtime to continue migration: 200 ms
2023-09-18 13:10:01 migration active, transferred 28.2 GiB of 32.1 GiB VM-state, 397.5 MiB/s
2023-09-18 13:10:01 xbzrle: send updates to 92212 pages in 6.2 MiB encoded memory, cache-miss 26.87%, overflow 208
2023-09-18 13:10:02 auto-increased downtime to continue migration: 400 ms
2023-09-18 13:10:03 migration active, transferred 28.3 GiB of 32.1 GiB VM-state, 156.4 MiB/s, VM dirties lots of memory: 190.0 MiB/s
2023-09-18 13:10:03 xbzrle: send updates to 188758 pages in 13.6 MiB encoded memory, cache-miss 20.51%, overflow 354
2023-09-18 13:10:04 average migration speed: 1.3 GiB/s - downtime 457 ms
2023-09-18 13:10:04 migration status: completed
2023-09-18 13:10:06 Waiting for spice server migration
2023-09-18 13:10:07 migration finished successfully (duration 00:00:30)

So more progress has been achieved, but I always want some more :-)
 
VM memory transfer can reach around 10 Gbit/s (it seems you already get 1.3 GiB/s, so around 10 Gbit/s).
So the maximum VM migration speed is limited to approx. 10 Gbit/s? Did I understand correctly?
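
For comparing the log figures with the link speed, the unit conversion can be sketched as below. This is a rough illustration only: the function name `mib_per_s_to_gbit` is made up for this example, and it assumes the migration log reports binary units (1 MiB = 1048576 bytes) while link speeds are quoted in decimal Gbit/s.

```shell
#!/bin/sh
# Convert a rate in MiB/s (binary megabytes) to Gbit/s (decimal gigabits).
# 1 MiB = 1048576 bytes, 1 byte = 8 bits, 1 Gbit = 1e9 bits.
mib_per_s_to_gbit() {
    awk -v m="$1" 'BEGIN { printf "%.2f\n", m * 1048576 * 8 / 1e9 }'
}

# Average speeds reported in the three migration logs above.
mib_per_s_to_gbit 491.1    # secure (SSH tunnel) migration
mib_per_s_to_gbit 822.5    # insecure migration, MTU 1500
mib_per_s_to_gbit 1331.2   # insecure migration, MTU 9000 (1.3 GiB/s = 1331.2 MiB/s)
```

This gives roughly 4.12, 6.90 and 11.17 Gbit/s, so the last run is in the region of the 10 Gbit/s mentioned above and well below the 40 Gbit/s line rate.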
 
