[SOLVED] slow migrations

RobFantini

We have had very slow migrations for a few months now.

Our storage is Ceph on NVMe.

I would say the migration speed is as if the storage were local and not shared.

I ran into this a few months back and posted a crazy-man thread, as I was in the middle of a few things after a 15-hour day. Still crazy now, but one thing at a time and plenty of sleep, so operator error is less likely, although always possible.

The local and local-lvm storages are disabled; I saw that suggested on another thread.


Code:
# pveversion -v
proxmox-ve: 7.1-1 (running kernel: 5.13.19-3-pve)
pve-manager: 7.1-10 (running version: 7.1-10/6ddebafe)
pve-kernel-helper: 7.1-8
pve-kernel-5.13: 7.1-6
pve-kernel-5.11: 7.0-10
pve-kernel-5.4: 6.4-5
pve-kernel-5.13.19-3-pve: 5.13.19-7
pve-kernel-5.13.19-2-pve: 5.13.19-4
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.4.128-1-pve: 5.4.128-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
ceph: 15.2.15-pve1
ceph-fuse: 15.2.15-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.1
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-6
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-2
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.0-15
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.3.0-1
proxmox-backup-client: 2.1.4-1
proxmox-backup-file-restore: 2.1.4-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-5
pve-cluster: 7.1-3
pve-container: 4.1-3
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-4
pve-ha-manager: 3.3-3
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.0-3
pve-xtermjs: 4.16.0-1
pve-zsync: 2.2.1
qemu-server: 7.1-4
smartmontools: 7.2-pve2
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.2-pve1

Code:
# cat storage.cfg
dir: local
        disable
        path /var/lib/vz
        content vztmpl
        prune-backups keep-last=1
        shared 0

lvmthin: local-lvm
        disable
        thinpool data
        vgname pve
        content rootdir,images
        nodes pve15

rbd: nvme-4tb
        content images,rootdir
        krbd 0
        pool nvme-4tb

dir: y-nfs-share
        path /media/pbs-nfs
        content iso,vztmpl,backup
        prune-backups keep-last=1
        shared 1

dir: z-local-nvme
        path /nvme-ext4
        content images,snippets,vztmpl,rootdir,backup,iso
        prune-backups keep-last=1
        shared 0
 
This VM takes over 15 minutes to migrate:
Code:
bootdisk: scsi0
cores: 4
lock: migrate
memory: 32765
name: imap
net0: virtio=72:3B:6D:47:A5:CB,bridge=vmbr3,tag=3
numa: 0
onboot: 1
ostype: l26
protection: 1
scsi0: nvme-4tb:vm-216-disk-0,discard=on,size=16G,ssd=1
scsi2: nvme-4tb:vm-216-disk-1,discard=on,size=500G
scsihw: virtio-scsi-pci
smbios1: uuid=19ee107f-c8ff-4806-9d60-3d8b76904fc9
sockets: 1
vmgenid: 67026d11-b273-447b-88d7-2fbc062902cc

Code:
Task started by HA resource agent
2022-02-06 10:27:05 starting migration of VM 216 to node 'pve2' (10.10.0.2)
2022-02-06 10:27:05 starting VM 216 on remote node 'pve2'
2022-02-06 10:27:07 start remote tunnel
2022-02-06 10:27:08 ssh tunnel ver 1
2022-02-06 10:27:08 starting online/live migration on unix:/run/qemu-server/216.migrate
2022-02-06 10:27:08 set migration capabilities
2022-02-06 10:27:08 migration downtime limit: 100 ms
2022-02-06 10:27:08 migration cachesize: 4.0 GiB
2022-02-06 10:27:08 set migration parameters
2022-02-06 10:27:08 start migrate command to unix:/run/qemu-server/216.migrate
2022-02-06 10:27:09 migration active, transferred 28.2 MiB of 32.0 GiB VM-state, 24.9 MiB/s
2022-02-06 10:27:10 migration active, transferred 54.3 MiB of 32.0 GiB VM-state, 29.2 MiB/s
2022-02-06 10:27:11 migration active, transferred 79.9 MiB of 32.0 GiB VM-state, 25.7 MiB/s
2022-02-06 10:27:12 migration active, transferred 106.6 MiB of 32.0 GiB VM-state, 24.5 MiB/s
2022-02-06 10:27:13 migration active, transferred 131.9 MiB of 32.0 GiB VM-state, 20.5 MiB/s
2022-02-06 10:27:14 migration active, transferred 158.0 MiB of 32.0 GiB VM-state, 25.3 MiB/s
2022-02-06 10:27:15 migration active, transferred 183.6 MiB of 32.0 GiB VM-state, 25.5 MiB/s
2022-02-06 10:27:16 migration active, transferred 206.2 MiB of 32.0 GiB VM-state, 25.5 MiB/s
2022-02-06 10:27:17 migration active, transferred 233.9 MiB of 32.0 GiB VM-state, 25.3 MiB/s
..
2022-02-06 10:47:20 migration active, transferred 31.9 GiB of 32.0 GiB VM-state, 45.1 MiB/s
2022-02-06 10:47:21 migration active, transferred 32.0 GiB of 32.0 GiB VM-state, 41.4 MiB/s
2022-02-06 10:47:22 migration active, transferred 32.0 GiB of 32.0 GiB VM-state, 36.3 MiB/s
2022-02-06 10:47:23 migration active, transferred 32.1 GiB of 32.0 GiB VM-state, 40.1 MiB/s
2022-02-06 10:47:24 migration active, transferred 32.1 GiB of 32.0 GiB VM-state, 40.4 MiB/s
2022-02-06 10:47:25 average migration speed: 26.9 MiB/s - downtime 56 ms
2022-02-06 10:47:25 migration status: completed
2022-02-06 10:47:28 migration finished successfully (duration 00:20:23)

OK, so the issue is slow MEMORY transfer, since that VM has 32 GB of RAM; I thought it was slow disk transfer.

So perhaps there is no way to get around the slow memory transfer?
 

Is it possible to set some VMs to shutdown, migrate, and start?

Only a few VMs, like the phone system and the main database, require live migration.
 
So perhaps there is no way to get around the slow memory transfer?
How much network bandwidth do you have?

It's possible to speed up the transfer by disabling the SSH tunnel; edit:

/etc/pve/datacenter.cfg
migration: insecure
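
As a side note, /etc/pve/datacenter.cfg lives on the clustered pmxcfs, so the change applies to every node at once. With insecure the VM memory is sent unencrypted, so it should only be used on a trusted migration network. The option also accepts an optional network CIDR to pin migration traffic to a dedicated subnet; the line below is a sketch and the subnet is only an example, not taken from this cluster:
Code:
migration: insecure,network=10.10.0.0/24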

Is it possible to set some VMs to shutdown, migrate, and start?
I don't think that's possible; you need to stop/migrate/start manually.
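
A rough sketch of that manual sequence with qm; the VMID (216) and target node (pve2) are example values from this thread, and an HA-managed guest would also need its HA state taken into account, which this sketch ignores:
Code:
# clean shutdown, offline migration, then start on the target node
qm shutdown 216
qm migrate 216 pve2
ssh pve2 qm start 216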
 
Hello Spirit!
Bandwidth is 40G: ConnectX-5 cards and two MLAG'ed Mellanox 40G switches.

I'll test migration: insecure.

migration: insecure solved the issue. 30 seconds:
Code:
Task started by HA resource agent
2022-02-06 12:39:15 starting migration of VM 216 to node 'pve11' (10.1.10.11)
2022-02-06 12:39:15 starting VM 216 on remote node 'pve11'
2022-02-06 12:39:16 start remote tunnel
2022-02-06 12:39:17 ssh tunnel ver 1
2022-02-06 12:39:17 starting online/live migration on tcp:10.1.10.11:60000
2022-02-06 12:39:17 set migration capabilities
2022-02-06 12:39:17 migration downtime limit: 100 ms
2022-02-06 12:39:17 migration cachesize: 4.0 GiB
2022-02-06 12:39:17 set migration parameters
2022-02-06 12:39:17 start migrate command to tcp:10.1.10.11:60000
2022-02-06 12:39:18 migration active, transferred 1.1 GiB of 32.0 GiB VM-state, 1.1 GiB/s
2022-02-06 12:39:19 migration active, transferred 2.1 GiB of 32.0 GiB VM-state, 1.1 GiB/s
2022-02-06 12:39:20 migration active, transferred 3.2 GiB of 32.0 GiB VM-state, 1.1 GiB/s
2022-02-06 12:39:21 migration active, transferred 4.2 GiB of 32.0 GiB VM-state, 1.1 GiB/s
2022-02-06 12:39:22 migration active, transferred 5.3 GiB of 32.0 GiB VM-state, 1.1 GiB/s
2022-02-06 12:39:23 migration active, transferred 6.3 GiB of 32.0 GiB VM-state, 1.1 GiB/s
2022-02-06 12:39:24 migration active, transferred 7.4 GiB of 32.0 GiB VM-state, 1.1 GiB/s
2022-02-06 12:39:25 migration active, transferred 8.4 GiB of 32.0 GiB VM-state, 1.1 GiB/s
2022-02-06 12:39:26 migration active, transferred 9.5 GiB of 32.0 GiB VM-state, 1.1 GiB/s
2022-02-06 12:39:27 migration active, transferred 10.6 GiB of 32.0 GiB VM-state, 1.1 GiB/s
2022-02-06 12:39:28 migration active, transferred 11.5 GiB of 32.0 GiB VM-state, 507.4 MiB/s
2022-02-06 12:39:29 migration active, transferred 12.9 GiB of 32.0 GiB VM-state, 1.4 GiB/s
2022-02-06 12:39:30 migration active, transferred 14.4 GiB of 32.0 GiB VM-state, 1.4 GiB/s
2022-02-06 12:39:31 migration active, transferred 15.8 GiB of 32.0 GiB VM-state, 1.4 GiB/s
2022-02-06 12:39:32 migration active, transferred 17.2 GiB of 32.0 GiB VM-state, 1.4 GiB/s
2022-02-06 12:39:33 migration active, transferred 18.5 GiB of 32.0 GiB VM-state, 1.3 GiB/s
2022-02-06 12:39:34 migration active, transferred 19.8 GiB of 32.0 GiB VM-state, 1.3 GiB/s
2022-02-06 12:39:35 migration active, transferred 21.1 GiB of 32.0 GiB VM-state, 1.4 GiB/s
2022-02-06 12:39:36 migration active, transferred 22.6 GiB of 32.0 GiB VM-state, 1.4 GiB/s
2022-02-06 12:39:37 migration active, transferred 24.0 GiB of 32.0 GiB VM-state, 1.4 GiB/s
2022-02-06 12:39:38 migration active, transferred 25.4 GiB of 32.0 GiB VM-state, 1.4 GiB/s
2022-02-06 12:39:39 migration active, transferred 26.8 GiB of 32.0 GiB VM-state, 1.4 GiB/s
2022-02-06 12:39:40 migration active, transferred 28.1 GiB of 32.0 GiB VM-state, 1.4 GiB/s
2022-02-06 12:39:41 migration active, transferred 29.5 GiB of 32.0 GiB VM-state, 1.4 GiB/s
2022-02-06 12:39:42 migration active, transferred 31.7 GiB of 32.0 GiB VM-state, 1.3 GiB/s
2022-02-06 12:39:43 average migration speed: 1.2 GiB/s - downtime 137 ms
2022-02-06 12:39:43 migration status: completed
2022-02-06 12:39:45 migration finished successfully (duration 00:00:30)
TASK OK
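
Comparing the two task logs, the transport line shows which path the state transfer actually took:
Code:
starting online/live migration on unix:/run/qemu-server/216.migrate   (secure, via SSH tunnel)
starting online/live migration on tcp:10.1.10.11:60000                (insecure, direct TCP)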


Thank you Spirit!
 
Hello,
I am seeing this:

Fast migrations while the system moves VMs off a node prior to a reboot.

Slow migrations when the VMs migrate back after the reboot.
 
Hello,

We are back to seeing slow migrations on this. I double-checked, and this setting is still set:

/etc/pve/datacenter.cfg
migration: insecure

Running proxmox-ve: 7.1-1 (running kernel: 5.13.19-5-pve)


Thanks
 
I've noticed something similar on our 5-node cluster with a 10 Gbps migration network. When I update my nodes I use the following process: migrate all VMs from node 1 to node 2, update and reboot node 1, then migrate the VMs back to node 1. If my nodes have been running for a while, I see slow migration from node 1 to node 2, but after rebooting node 1 for the upgrade, migrating the same VMs back to node 1 is WAY faster: 20-50 MiB/s from node 1 to node 2, and 1-1.4 GiB/s from node 2 to node 1 after node 1 has been rebooted. This theme continues until all of the nodes have been updated and rebooted. When I migrate node 5's VMs to node 1 prior to the update and reboot of node 5, the transfer runs at 1-1.4 GiB/s, as node 1 was rebooted earlier in the process.

I'm doing an update today and noticed the transfers from node 2 to node 3 went fast. Then I remembered I had rebooted node 3 a couple of days ago so the Dell server could recover and repair a correctable memory issue. So this definitely appears to be something that manifests itself after the nodes have been running for <some> amount of time.
 
I hope y'all don't mind me bumping this thread, but I'm seeing several responses here that match my experience.

Migration is FAST between recently rebooted nodes. Even with encryption I see ~400-600 MB/s on our 10Gb network. We have two corosync networks, both 10Gb, and we use the secondary network for migration traffic.

Migration is slower between nodes that have been running a long time, with big variability in transfer speed from moment to moment: ~25-400 MB/s.

It's still functional, but it kinda scares me sometimes, because we do have a few VMs that are assigned lots of memory and dirty it very rapidly (128 GB per VM, Security Onion search nodes). These often struggle to migrate successfully if the nodes haven't been rebooted in a few weeks.
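
For guests like that, qemu-server also exposes per-VM knobs that can help a pre-copy converge, at the cost of a slightly longer switch-over pause; this is only a hedged sketch with example values and a hypothetical VMID 100:
Code:
# allow up to 0.5 s of downtime at switch-over (the default is 0.1 s)
qm set 100 --migrate_downtime 0.5
# remove any per-VM transfer rate cap (MB/s, 0 = unlimited)
qm set 100 --migrate_speed 0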
 
What gets me is: if they are replicated, and replication takes only a few seconds and I run it every minute, why does HA migration take hours? Why does it seem to re-copy all the data before migrating?

Using ZFS.
 
What gets me is: if they are replicated, and replication takes only a few seconds and I run it every minute, why does HA migration take hours? Why does it seem to re-copy all the data before migrating?

Using ZFS.
A live migration needs to transfer the state of the running VM. If the VM has a lot of memory, or its memory is changing rapidly, it can take a long time or might never finish, because the migration might not be able to catch up with the changing RAM of the VM to get the difference small enough that the switch to the other node only takes a few milliseconds.

You can post the logs of such a migration; it should be visible where the migration spends the most time.
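
As a rough worked example with the numbers from earlier in this thread, treating it as a single pre-copy pass and ignoring pages that get re-dirtied and resent:
Code:
32 GiB VM state / 26.9 MiB/s  =  32768 / 26.9  ~ 1218 s  ~ 20 min   (SSH tunnel, first log)
32 GiB VM state / 1.2 GiB/s   =  32 / 1.2      ~ 27 s               (insecure TCP, second log)
And the migration can only converge at all while the available bandwidth stays above the rate at which the guest dirties its memory, which is why large, busy guests are the first to struggle.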
 
What is the best way to collect the logs, and which ones?

Any way for it to fall back without doing a memory sync, as that is not needed? Not that it ever has; it has always just booted the image.
 

Attachments

  • syslog.txt (1,018.9 KB)
  • replicate.zip (6.5 KB)
What is the best way to collect the logs, and which ones?
If it is a live migration, you should see a task (bottom panel in the UI) called "VM <vmid> - Migrate". Double-click on it and you should see the logs; that is what would be interesting to see.
Any way for it to fall back without doing a memory sync, as that is not needed? Not that it ever has; it has always just booted the image.

Not sure I fully understand the overall situation, but in order to migrate a VM without transferring state, it needs to be powered off for an offline migration. In that case, a VM that is already replicated should be done almost instantaneously.
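
That is also why the replicated case is fast: the storage side only has to ship the delta between the last replicated snapshot and a fresh one, not the whole disk. A hedged sketch of such an incremental ZFS send (dataset, snapshot names and target node are made up for illustration; it is not the exact command the replication stack runs):
Code:
# send only the blocks that changed between the two snapshots
zfs snapshot rpool/data/vm-100-disk-0@repl_new
zfs send -i rpool/data/vm-100-disk-0@repl_prev rpool/data/vm-100-disk-0@repl_new \
  | ssh pve2 zfs recv -F rpool/data/vm-100-disk-0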
 
What gets me is: if they are replicated, and replication takes only a few seconds and I run it every minute, why does HA migration take hours? Why does it seem to re-copy all the data before migrating?

Using ZFS.
I think HA and replication (ZFS) are two different things here. As far as I know, you have to use shared storage for this.
 
