[SOLVED] slow migrations

RobFantini

We have had very slow migrations for a few months now.

Our storage is Ceph on NVMe.

I would say the migration speed is as if the storage were local and not shared.

I ran into this a few months back and posted a crazy-man thread, as I was in the middle of a few things after a 15-hour day. Still crazy now, but one thing at a time and plenty of sleep, so operator error is less likely, although always possible.

The local and local-lvm storages are disabled; I saw that suggested on another thread.


Code:
# pveversion -v
proxmox-ve: 7.1-1 (running kernel: 5.13.19-3-pve)
pve-manager: 7.1-10 (running version: 7.1-10/6ddebafe)
pve-kernel-helper: 7.1-8
pve-kernel-5.13: 7.1-6
pve-kernel-5.11: 7.0-10
pve-kernel-5.4: 6.4-5
pve-kernel-5.13.19-3-pve: 5.13.19-7
pve-kernel-5.13.19-2-pve: 5.13.19-4
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.4.128-1-pve: 5.4.128-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
ceph: 15.2.15-pve1
ceph-fuse: 15.2.15-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.1
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-6
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-2
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.0-15
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.3.0-1
proxmox-backup-client: 2.1.4-1
proxmox-backup-file-restore: 2.1.4-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-5
pve-cluster: 7.1-3
pve-container: 4.1-3
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-4
pve-ha-manager: 3.3-3
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.0-3
pve-xtermjs: 4.16.0-1
pve-zsync: 2.2.1
qemu-server: 7.1-4
smartmontools: 7.2-pve2
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.2-pve1

Code:
# cat storage.cfg
dir: local
        disable
        path /var/lib/vz
        content vztmpl
        prune-backups keep-last=1
        shared 0

lvmthin: local-lvm
        disable
        thinpool data
        vgname pve
        content rootdir,images
        nodes pve15

rbd: nvme-4tb
        content images,rootdir
        krbd 0
        pool nvme-4tb

dir: y-nfs-share
        path /media/pbs-nfs
        content iso,vztmpl,backup
        prune-backups keep-last=1
        shared 1

dir: z-local-nvme
        path /nvme-ext4
        content images,snippets,vztmpl,rootdir,backup,iso
        prune-backups keep-last=1
        shared 0
 
This VM takes over 15 minutes to migrate:
Code:
bootdisk: scsi0
cores: 4
lock: migrate
memory: 32765
name: imap
net0: virtio=72:3B:6D:47:A5:CB,bridge=vmbr3,tag=3
numa: 0
onboot: 1
ostype: l26
protection: 1
scsi0: nvme-4tb:vm-216-disk-0,discard=on,size=16G,ssd=1
scsi2: nvme-4tb:vm-216-disk-1,discard=on,size=500G
scsihw: virtio-scsi-pci
smbios1: uuid=19ee107f-c8ff-4806-9d60-3d8b76904fc9
sockets: 1
vmgenid: 67026d11-b273-447b-88d7-2fbc062902cc

Code:
Task started by HA resource agent
2022-02-06 10:27:05 starting migration of VM 216 to node 'pve2' (10.10.0.2)
2022-02-06 10:27:05 starting VM 216 on remote node 'pve2'
2022-02-06 10:27:07 start remote tunnel
2022-02-06 10:27:08 ssh tunnel ver 1
2022-02-06 10:27:08 starting online/live migration on unix:/run/qemu-server/216.migrate
2022-02-06 10:27:08 set migration capabilities
2022-02-06 10:27:08 migration downtime limit: 100 ms
2022-02-06 10:27:08 migration cachesize: 4.0 GiB
2022-02-06 10:27:08 set migration parameters
2022-02-06 10:27:08 start migrate command to unix:/run/qemu-server/216.migrate
2022-02-06 10:27:09 migration active, transferred 28.2 MiB of 32.0 GiB VM-state, 24.9 MiB/s
2022-02-06 10:27:10 migration active, transferred 54.3 MiB of 32.0 GiB VM-state, 29.2 MiB/s
2022-02-06 10:27:11 migration active, transferred 79.9 MiB of 32.0 GiB VM-state, 25.7 MiB/s
2022-02-06 10:27:12 migration active, transferred 106.6 MiB of 32.0 GiB VM-state, 24.5 MiB/s
2022-02-06 10:27:13 migration active, transferred 131.9 MiB of 32.0 GiB VM-state, 20.5 MiB/s
2022-02-06 10:27:14 migration active, transferred 158.0 MiB of 32.0 GiB VM-state, 25.3 MiB/s
2022-02-06 10:27:15 migration active, transferred 183.6 MiB of 32.0 GiB VM-state, 25.5 MiB/s
2022-02-06 10:27:16 migration active, transferred 206.2 MiB of 32.0 GiB VM-state, 25.5 MiB/s
2022-02-06 10:27:17 migration active, transferred 233.9 MiB of 32.0 GiB VM-state, 25.3 MiB/s
..
2022-02-06 10:47:20 migration active, transferred 31.9 GiB of 32.0 GiB VM-state, 45.1 MiB/s
2022-02-06 10:47:21 migration active, transferred 32.0 GiB of 32.0 GiB VM-state, 41.4 MiB/s
2022-02-06 10:47:22 migration active, transferred 32.0 GiB of 32.0 GiB VM-state, 36.3 MiB/s
2022-02-06 10:47:23 migration active, transferred 32.1 GiB of 32.0 GiB VM-state, 40.1 MiB/s
2022-02-06 10:47:24 migration active, transferred 32.1 GiB of 32.0 GiB VM-state, 40.4 MiB/s
2022-02-06 10:47:25 average migration speed: 26.9 MiB/s - downtime 56 ms
2022-02-06 10:47:25 migration status: completed
2022-02-06 10:47:28 migration finished successfully (duration 00:20:23)

OK, so the issue is slow MEMORY transfer, since that VM has 32 GB of RAM; I thought it was slow disk transfer.

So perhaps there is no way to get around the slow memory transfer?
 

Is it possible to set some VMs to shutdown, migrate, and start?

Only a few VMs, like the phone system and the main database, require live migration.
 
So perhaps there is no way to get around the slow memory transfer?
How much network bandwidth do you have?

It's possible to speed up the transfer by disabling the SSH tunnel; edit:

/etc/pve/datacenter.cfg
migration: insecure
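
As a side note, /etc/pve/datacenter.cfg lives on the clustered pmxcfs, so the change applies to every node at once. With insecure the VM memory is sent unencrypted, so it should only be used on a trusted migration network. The option also accepts an optional network CIDR to pin migration traffic to a dedicated subnet; the line below is a sketch and the subnet is only an example, not taken from this cluster:
Code:
migration: insecure,network=10.10.0.0/24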

Is it possible to set some VMs to shutdown, migrate, and start?
I don't think that's possible; you need to stop/migrate/start manually.
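
A rough sketch of that manual sequence with qm; the VMID (216) and target node (pve2) are example values from this thread, and an HA-managed guest would also need its HA state taken into account, which this sketch ignores:
Code:
# clean shutdown, offline migration, then start on the target node
qm shutdown 216
qm migrate 216 pve2
ssh pve2 qm start 216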
 
Hello Spirit!
Bandwidth is 40G: ConnectX-5 cards and two MLAG'ed Mellanox 40G switches.

I'll test migration: insecure.

migration: insecure solved the issue. 30 seconds:
Code:
Task started by HA resource agent
2022-02-06 12:39:15 starting migration of VM 216 to node 'pve11' (10.1.10.11)
2022-02-06 12:39:15 starting VM 216 on remote node 'pve11'
2022-02-06 12:39:16 start remote tunnel
2022-02-06 12:39:17 ssh tunnel ver 1
2022-02-06 12:39:17 starting online/live migration on tcp:10.1.10.11:60000
2022-02-06 12:39:17 set migration capabilities
2022-02-06 12:39:17 migration downtime limit: 100 ms
2022-02-06 12:39:17 migration cachesize: 4.0 GiB
2022-02-06 12:39:17 set migration parameters
2022-02-06 12:39:17 start migrate command to tcp:10.1.10.11:60000
2022-02-06 12:39:18 migration active, transferred 1.1 GiB of 32.0 GiB VM-state, 1.1 GiB/s
2022-02-06 12:39:19 migration active, transferred 2.1 GiB of 32.0 GiB VM-state, 1.1 GiB/s
2022-02-06 12:39:20 migration active, transferred 3.2 GiB of 32.0 GiB VM-state, 1.1 GiB/s
2022-02-06 12:39:21 migration active, transferred 4.2 GiB of 32.0 GiB VM-state, 1.1 GiB/s
2022-02-06 12:39:22 migration active, transferred 5.3 GiB of 32.0 GiB VM-state, 1.1 GiB/s
2022-02-06 12:39:23 migration active, transferred 6.3 GiB of 32.0 GiB VM-state, 1.1 GiB/s
2022-02-06 12:39:24 migration active, transferred 7.4 GiB of 32.0 GiB VM-state, 1.1 GiB/s
2022-02-06 12:39:25 migration active, transferred 8.4 GiB of 32.0 GiB VM-state, 1.1 GiB/s
2022-02-06 12:39:26 migration active, transferred 9.5 GiB of 32.0 GiB VM-state, 1.1 GiB/s
2022-02-06 12:39:27 migration active, transferred 10.6 GiB of 32.0 GiB VM-state, 1.1 GiB/s
2022-02-06 12:39:28 migration active, transferred 11.5 GiB of 32.0 GiB VM-state, 507.4 MiB/s
2022-02-06 12:39:29 migration active, transferred 12.9 GiB of 32.0 GiB VM-state, 1.4 GiB/s
2022-02-06 12:39:30 migration active, transferred 14.4 GiB of 32.0 GiB VM-state, 1.4 GiB/s
2022-02-06 12:39:31 migration active, transferred 15.8 GiB of 32.0 GiB VM-state, 1.4 GiB/s
2022-02-06 12:39:32 migration active, transferred 17.2 GiB of 32.0 GiB VM-state, 1.4 GiB/s
2022-02-06 12:39:33 migration active, transferred 18.5 GiB of 32.0 GiB VM-state, 1.3 GiB/s
2022-02-06 12:39:34 migration active, transferred 19.8 GiB of 32.0 GiB VM-state, 1.3 GiB/s
2022-02-06 12:39:35 migration active, transferred 21.1 GiB of 32.0 GiB VM-state, 1.4 GiB/s
2022-02-06 12:39:36 migration active, transferred 22.6 GiB of 32.0 GiB VM-state, 1.4 GiB/s
2022-02-06 12:39:37 migration active, transferred 24.0 GiB of 32.0 GiB VM-state, 1.4 GiB/s
2022-02-06 12:39:38 migration active, transferred 25.4 GiB of 32.0 GiB VM-state, 1.4 GiB/s
2022-02-06 12:39:39 migration active, transferred 26.8 GiB of 32.0 GiB VM-state, 1.4 GiB/s
2022-02-06 12:39:40 migration active, transferred 28.1 GiB of 32.0 GiB VM-state, 1.4 GiB/s
2022-02-06 12:39:41 migration active, transferred 29.5 GiB of 32.0 GiB VM-state, 1.4 GiB/s
2022-02-06 12:39:42 migration active, transferred 31.7 GiB of 32.0 GiB VM-state, 1.3 GiB/s
2022-02-06 12:39:43 average migration speed: 1.2 GiB/s - downtime 137 ms
2022-02-06 12:39:43 migration status: completed
2022-02-06 12:39:45 migration finished successfully (duration 00:00:30)
TASK OK
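
Comparing the two task logs, the transport line shows which path the state transfer actually took:
Code:
starting online/live migration on unix:/run/qemu-server/216.migrate   (secure, via SSH tunnel)
starting online/live migration on tcp:10.1.10.11:60000                (insecure, direct TCP)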


Thank you Spirit!
 
Hello,
I am seeing this:

Fast migrations while the system moves VMs off a node prior to a reboot.

Slow migrations when the VMs migrate back after the reboot.
 
Hello,

We are back to seeing slow migrations on this. I double-checked, and this setting is still set:

/etc/pve/datacenter.cfg
migration: insecure

Running proxmox-ve: 7.1-1 (running kernel: 5.13.19-5-pve)


Thanks
 
I've noticed something similar on our 5-node cluster with a 10 Gbps migration network. When I update my nodes I use the following process: migrate all VMs from node 1 to node 2, update and reboot node 1, then migrate the VMs back to node 1. If my nodes have been running for a while, I see slow migration from node 1 to node 2, but after rebooting node 1 for the upgrade, migrating the same VMs back to node 1 is WAY faster: 20-50 MiB/s from node 1 to node 2, and 1-1.4 GiB/s from node 2 to node 1 after node 1 has been rebooted. This theme continues until all of the nodes have been updated and rebooted. When I migrate node 5's VMs to node 1 prior to the update and reboot of node 5, the transfer runs at 1-1.4 GiB/s, as node 1 was rebooted earlier in the process.

I'm doing an update today and noticed the transfers from node 2 to node 3 went fast. Then I remembered I had rebooted node 3 a couple of days ago so the Dell server could recover and repair a correctable memory issue. So this definitely appears to be something that manifests itself after the nodes have been running for <some> amount of time.
 
I hope y'all don't mind me bumping this thread, but I'm seeing several responses here that match my experience.

Migration is FAST between recently rebooted nodes. Even with encryption I see ~400-600 MB/s on our 10Gb network. We have two corosync networks, both 10Gb, and we use the secondary network for migration traffic.

Migration is slower between nodes that have been running a long time, with big variability in transfer speed from moment to moment: ~25-400 MB/s.

It's still functional, but it kinda scares me sometimes, because we do have a few VMs that are assigned lots of memory and dirty it very rapidly (128 GB per VM, Security Onion search nodes). These often struggle to migrate successfully if the nodes haven't been rebooted in a few weeks.
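
For guests like that, qemu-server also exposes per-VM knobs that can help a pre-copy converge, at the cost of a slightly longer switch-over pause; this is only a hedged sketch with example values and a hypothetical VMID 100:
Code:
# allow up to 0.5 s of downtime at switch-over (the default is 0.1 s)
qm set 100 --migrate_downtime 0.5
# remove any per-VM transfer rate cap (MB/s, 0 = unlimited)
qm set 100 --migrate_speed 0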
 
What gets me is: if they are replicated, and replication takes only a few seconds and I run it every minute, why does HA migration take hours? Why does it seem to re-copy all the data before migrating?

Using ZFS.
 
What gets me is: if they are replicated, and replication takes only a few seconds and I run it every minute, why does HA migration take hours? Why does it seem to re-copy all the data before migrating?

Using ZFS.
A live migration needs to transfer the state of the running VM. If the VM has a lot of memory, or its memory is changing rapidly, it can take a long time or might never finish, because the migration might not be able to catch up with the changing RAM of the VM to get the difference small enough that the switch to the other node only takes a few milliseconds.

You can post the logs of such a migration; it should be visible where the migration spends the most time.
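
As a rough worked example with the numbers from earlier in this thread, treating it as a single pre-copy pass and ignoring pages that get re-dirtied and resent:
Code:
32 GiB VM state / 26.9 MiB/s  =  32768 / 26.9  ~ 1218 s  ~ 20 min   (SSH tunnel, first log)
32 GiB VM state / 1.2 GiB/s   =  32 / 1.2      ~ 27 s               (insecure TCP, second log)
And the migration can only converge at all while the available bandwidth stays above the rate at which the guest dirties its memory, which is why large, busy guests are the first to struggle.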
 
What is the best way to collect the logs, and which ones?

Any way for it to fall back without doing a memory sync, as that is not needed? Not that it ever has; it has always just booted the image.
 

Attachments

  • syslog.txt (1,018.9 KB)
  • replicate.zip (6.5 KB)
What is the best way to collect the logs, and which ones?
If it is a live migration, you should see a task (bottom panel in the UI) called "VM <vmid> - Migrate". Double-click on it and you should see the logs; that is what would be interesting to see.
Any way for it to fall back without doing a memory sync, as that is not needed? Not that it ever has; it has always just booted the image.

Not sure I fully understand the overall situation, but in order to migrate a VM without transferring state, it needs to be powered off for an offline migration. In that case, a VM that is already replicated should be done almost instantaneously.
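
That is also why the replicated case is fast: the storage side only has to ship the delta between the last replicated snapshot and a fresh one, not the whole disk. A hedged sketch of such an incremental ZFS send (dataset, snapshot names and target node are made up for illustration; it is not the exact command the replication stack runs):
Code:
# send only the blocks that changed between the two snapshots
zfs snapshot rpool/data/vm-100-disk-0@repl_new
zfs send -i rpool/data/vm-100-disk-0@repl_prev rpool/data/vm-100-disk-0@repl_new \
  | ssh pve2 zfs recv -F rpool/data/vm-100-disk-0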
 
What gets me is: if they are replicated, and replication takes only a few seconds and I run it every minute, why does HA migration take hours? Why does it seem to re-copy all the data before migrating?

Using ZFS.
I think HA and replication (ZFS) are two different things here. As far as I know, you have to use shared storage for this.
 
