"Slow" VM Migration Speeds

wbarnard81
Hi All.

It feels like I am complaining with a golden spoon in my mouth, but these are our production servers...

When I migrate a VM from one node to another in the cluster, it seems to top out at around 800 Mbps. Some information first...

Hardware I am using:
AMD EPYC 7702P 64-Core Processor
512 GB DDR4-3200
2x Micron 7300 480 GB (ZFS RAID1 mirror for boot)
6x Kioxia KCD6XLUL960G 960 GB NVMe
Mellanox ConnectX-4 (25 Gbps) NICs, connected with DAC cables.

Disk setup for the 6x NVMe:
Code:
zpool create -f -o ashift=12 houmyvas mirror /dev/nvme0n1 /dev/nvme1n1 mirror /dev/nvme2n1 /dev/nvme3n1 mirror /dev/nvme6n1 /dev/nvme7n1

The source node is running Proxmox 7.2 with kernel 5.15, and the destination node is running Proxmox 7.3 with kernel 6.1.

[Screenshots attached showing the transfer speeds when migrating the disk and the memory]

Yes, the screenshot above shows the speed when migrating the memory, but the ones before it should show that it is the same when migrating the disk.

So my questions:
1. How can I figure out what the bottleneck is here?
2. What happens to data that is written to the disk while the migration takes place?
 
How did you achieve these speeds? Is it an --online migration or offline? Just out of curiosity. I have a cluster with 1 Gig for corosync, with 2 interfaces on each node connected as redundant (active-backup). However, I am achieving migration speeds of around 100 MiB/s max. What am I doing wrong here?
 
Um, 1 gigabit = 1000 megabits = 1000/8 megabytes = 125 megabytes per second. With overhead and the required dead time between frames, 100 MiB/s is about what you would expect on a gigabit LAN.

ETA: The OP has a 25 gigabit LAN.
Thanks for the math. I will get to buying some 25G cards then. :p
 
please post the full migration task log and "pveversion -v"
 
The VM was migrated from h1021 to h1013. Specs on both servers are the same.
VM specs: 32 vCPUs, 32 GB RAM and a 200 GB disk.

h1013:
Code:
root@h1013:~# pveversion -v
proxmox-ve: 8.2.0 (running kernel: 6.8.4-2-pve)
pve-manager: 8.2.2 (running version: 8.2.2/9355359cd7afbae4)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.4-2
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.6
libpve-cluster-perl: 8.0.6
libpve-common-perl: 8.2.1
libpve-guest-common-perl: 5.1.1
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.2.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.0-1
proxmox-backup-file-restore: 3.2.0-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.6
proxmox-widget-toolkit: 4.2.1
pve-cluster: 8.0.6
pve-container: 5.0.10
pve-docs: 8.2.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.0
pve-firewall: 5.0.5
pve-firmware: 3.11-1
pve-ha-manager: 4.0.4
pve-i18n: 3.2.2
pve-qemu-kvm: 8.1.5-5
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.1
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.3-pve2

h1021:

Code:
root@h1021:~# pveversion -v
proxmox-ve: 8.2.0 (running kernel: 6.8.4-3-pve)
pve-manager: 8.2.2 (running version: 8.2.2/9355359cd7afbae4)
proxmox-kernel-helper: 8.1.0
pve-kernel-6.2: 8.0.5
pve-kernel-5.15: 7.4-13
proxmox-kernel-6.8: 6.8.4-3
proxmox-kernel-6.8.4-3-pve-signed: 6.8.4-3
proxmox-kernel-6.2.16-20-pve: 6.2.16-20
proxmox-kernel-6.2: 6.2.16-20
pve-kernel-6.2.16-20-bpo11-pve: 6.2.16-20~bpo11+1
pve-kernel-6.2.11-1-pve: 6.2.11-1
pve-kernel-5.15.152-1-pve: 5.15.152-1
pve-kernel-5.15.107-1-pve: 5.15.107-1
pve-kernel-5.15.102-1-pve: 5.15.102-1
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.6
libpve-cluster-perl: 8.0.6
libpve-common-perl: 8.2.1
libpve-guest-common-perl: 5.1.2
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.2.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.3-1
proxmox-backup-file-restore: 3.2.3-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.6
pve-container: 5.1.10
pve-docs: 8.2.2
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.0
pve-firewall: 5.0.7
pve-firmware: 3.11-1
pve-ha-manager: 4.0.4
pve-i18n: 3.2.2
pve-qemu-kvm: 8.1.5-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.1
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.3-pve2
 

Attachments

  • task-h1021-qmigrate-2024-10-26T09_57_42Z.log (42.8 KB)
it's possible that this is the max speed that SSH achieves on your host. if you trust your local network between the nodes, you could try an "insecure" migration, which uses a plain TCP connection without any encryption or authentication, and see if that is faster.
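
for a one-off test, something along these lines should do it from the CLI (the VMID is a placeholder; --with-local-disks is only needed because the disks are on local ZFS):

Code:
# hypothetical VMID; online migration of a running VM with local disks over plain TCP
qm migrate 100 h1013 --online --with-local-disks --migration_type insecure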
 
I don't think so...

iperf3 test:
Code:
root@h1013:~# iperf3 -c 102.xxx.xxx.98
Connecting to host 102.xxx.xxx.98, port 5201
[  5] local 102.xxx.xxx.18 port 41018 connected to 102.xxx.xxx.98 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   892 MBytes  7.48 Gbits/sec  108   1001 KBytes
[  5]   1.00-2.00   sec  1.08 GBytes  9.29 Gbits/sec    2   1.01 MBytes
[  5]   2.00-3.00   sec  1.05 GBytes  8.99 Gbits/sec   36    865 KBytes
[  5]   3.00-4.00   sec  1.55 GBytes  13.3 Gbits/sec  141    902 KBytes
[  5]   4.00-5.00   sec  1.63 GBytes  14.0 Gbits/sec   43   1.29 MBytes
[  5]   5.00-6.00   sec  1.85 GBytes  15.9 Gbits/sec   96   1.10 MBytes
[  5]   6.00-7.00   sec  1.72 GBytes  14.8 Gbits/sec   58   1.07 MBytes
[  5]   7.00-8.00   sec  2.19 GBytes  18.9 Gbits/sec  147   1.07 MBytes
[  5]   8.00-9.00   sec  2.34 GBytes  20.1 Gbits/sec  181    831 KBytes
[  5]   9.00-10.00  sec  1.86 GBytes  16.0 Gbits/sec  152    953 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  16.2 GBytes  13.9 Gbits/sec  964             sender
[  5]   0.00-10.00  sec  16.1 GBytes  13.9 Gbits/sec                  receiver

iperf Done.

and an scp test:
Code:
root@h1013:/var/lib/vz/template/iso# scp 102.xxx.xxx.98:/var/lib/vz/template/iso/ubuntu-22.04.4-live-server-amd64.iso .
The authenticity of host '102.xxx.xxx.98 (102.xxx.xxx.98)' can't be established.
ED25519 key fingerprint is SHA256:xxxxxxxxxxxxxxxxxW5HcyrEkJaulBJVvcJ3qpdM.
This key is not known by any other names.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '102.xxx.xxx.98' (ED25519) to the list of known hosts.
ubuntu-22.04.4-live-server-amd64.iso                                                                                                                      100% 2007MB 224.7MB/s   00:08
root@h1013:/var/lib/vz/template/iso#

224 MB/s × 8 ≈ 1,800 Mbps, give or take?
 
please try what I asked you to try - all you showed us is that your effective line speed is faster (but also not able to fully load your 25 Gbps link), and that scp is even slower ;)
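
(as an aside: a single TCP stream rarely fills a 25 Gbps link on its own, so for a better idea of the raw link capacity, iperf3 can be run with parallel streams - same endpoint as your single-stream test above:)

Code:
# 8 parallel TCP streams instead of one
iperf3 -c 102.xxx.xxx.98 -P 8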
 
SCP is going to be limited by the NVMe speeds. Since there are VMs running on these hosts, I think that is a decent speed for scp. Since I can only do insecure migrations via the CLI, even if it is faster, I do not see it as a solution to my original question.

If someone else has a similar setup and is also getting the migration speeds that I am getting, then it is fine and I will accept it.
 
you can configure migrations to be insecure by default via datacenter.cfg.
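
for reference, a minimal sketch of the relevant line in /etc/pve/datacenter.cfg (the migration network is a placeholder and optional):

Code:
# /etc/pve/datacenter.cfg
# send migration traffic over plain TCP, optionally pinned to a dedicated network
migration: type=insecure,network=10.0.0.0/24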

your migration task log shows:
the 200 GB volume in ~10 min ≈ 341 MB/s ≈ 2,730 Mbps
the ~32 GB of RAM+state at 399 MB/s ≈ 3,192 Mbps

which is roughly the same ballpark, and nowhere near the (effective) line speed. of course the throughput will always be determined by whichever component is the bottleneck, and it may well be that if you switch to insecure migration, the disk part becomes limited by your source or target storage.
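
one way to check whether storage is the limiting factor is to watch per-vdev throughput on both nodes while the migration runs (assuming the pool layout from the first post, i.e. a pool called houmyvas):

Code:
# print pool-wide and per-vdev bandwidth every second during the migration
zpool iostat -v houmyvas 1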

there is also the multi-fd feature that might bring additional speedup once it's implemented:

https://bugzilla.proxmox.com/show_bug.cgi?id=5766
 
