VM Freezes DURING live Migration, Runs fine after completion

crashnightmare

New Member
Dec 15, 2023
We have an AlmaLinux release 9.3 VM with an 814 GB hard drive that is in fairly constant use (it runs at a server load of 2-3.0 all day).


We have a two-node cluster, both running local storage with identical motherboard/CPU. The hard drives are different.



Last night, we wanted to run the kernel updates on one of the hosts and wanted to live migrate the VM so there would be no downtime. So we fired off the migration, and once it started copying the blocks, the VM ran like molasses. It would ping, but not a single site would load. Additionally, we could not usefully log in via SSH. Yes, it would connect and ask for a username and password, but it would just stall after that. The behavior in the console was similar - we would enter a username and a password, and after a minute or two of waiting the login prompt would reappear as if we had done nothing.

We chose to wait the 35 minutes for the migration to finish rather than find out what bad things could happen if we stopped a migration task in the middle. Once the migration finished, the VM was completely accessible and fine. It did not require a restart or anything.

Code:
[root@myvm ~]# w
10:51:27 up 105 days, 3:20, 2 users, load average: 2.81, 2.33, 2.19

I have attached the log of the migration.

I have done extensive searching on the forums, Google, and Reddit to see if this issue has been discussed already, but everyone seems to be saying their VM fails AFTER the migration completes, and the cause is usually different CPU models.

The closest issue I have found is: https://bugzilla.kernel.org/show_bug.cgi?id=199727

In the last comment, someone recommends changing to aio=threads and ensuring iothread=1. I currently have aio=default (io_uring) and iothread=1. I am wondering if this would work, or if anyone has any additional suggestions.
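
In case it helps anyone reading later, a minimal sketch of what switching that setting might look like (the VMID, disk slot and volume name below are placeholders, not taken from this thread):

Code:
# Hypothetical example - assumes VMID 100 and a scsi0 disk named vm-100-disk-0 on local-zfs.
# qm set replaces the whole drive definition, so carry over any other disk options you already use.
qm set 100 --scsi0 local-zfs:vm-100-disk-0,iothread=1,aio=threads
# A full stop/start of the VM (not just a reboot inside the guest) is needed for the new aio setting to apply.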


Code:
root@pve:~# cat /etc/pve/storage.cfg
dir: local
        path /var/lib/vz
        content backup,vztmpl,iso

zfspool: local-zfs
        pool rpool/data
        content images,rootdir
        sparse 1

pbs: pbs
        datastore tanktruenas
        server backup.OMITTED.com
        content backup
        fingerprint OMITTED
        prune-backups keep-all=1
        username OMITTED@pbs

pbs: pbslocal
        datastore pbs_local_zfs
        server OMITTED
        content backup
        fingerprint OMITTED
        prune-backups keep-all=1
        username OMITTED@pbs

zfspool: backup-drive
        pool backup-drive
        content images,rootdir
        mountpoint /backup-drive
        nodes armory2




Code:
root@armory2:~# cat /etc/pve/storage.cfg
dir: local
        path /var/lib/vz
        content backup,vztmpl,iso

zfspool: local-zfs
        pool rpool/data
        content images,rootdir
        sparse 1

pbs: pbs
        datastore tanktruenas
        server backup.OMITTED.com
        content backup
        fingerprint OMITTED
        prune-backups keep-all=1
        username OMITTED@pbs

pbs: pbslocal
        datastore pbs_local_zfs
        server OMITTED
        content backup
        fingerprint OMITTED
        prune-backups keep-all=1
        username OMITTED@pbs

zfspool: backup-drive
        pool backup-drive
        content images,rootdir
        mountpoint /backup-drive
        nodes armory2
 

Attachments

  • task-pve-qmigrate-2024-04-10T23_10_39Z.log (152.8 KB)
Hi,
what kind of physical disks do you have? What did the load on the hosts look like during migration (especially IO wait)? Anything interesting in the system logs/journal?

You could try setting a migration speed limit on the storage (from man pvesm):
Code:
--bwlimit [clone=<LIMIT>] [,default=<LIMIT>] [,migration=<LIMIT>] [,move=<LIMIT>] [,restore=<LIMIT>]
           Set I/O bandwidth limit for various operations (in KiB/s).
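
For example, to cap only the disk-migration traffic on the local-zfs storage from the config above (the 100 MiB/s figure is purely illustrative):

Code:
# Limit storage migration for this storage to ~100 MiB/s (102400 KiB/s); other operations stay unlimited
pvesm set local-zfs --bwlimit migration=102400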
 
Hi Fiona,

Thanks for responding. Our storage is NVMe on both nodes.

The hypervisor host load looked low (2.0-3.0). The load inside the VM was super high (>100).

I did not check iowait.
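
(For next time, a quick way to watch IO wait and per-disk utilisation on the hosts during a migration, assuming the sysstat package is installed:)

Code:
# avg-cpu %iowait plus %util per device, refreshed every second
iostat -x 1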

We have another large VM (1 TB) that migrates just fine between the two hosts. That VM is not production and has near-zero load.

If I am understanding your comment correctly, I would run the command below to set it to ~1 Gbps:

pvesm set local-zfs -bwlimit migration=976563

A couple of questions:

1) Which hypervisor do I run this on? The source or the destination?
2) Can I run this while the migration is active, to change the speed on the fly and see the effects?
 
>> We have another large VM (1 TB) that migrates just fine between the two hosts. That VM is not production and has near-zero load.
Hmm, so it's rather unlikely that it's just the storage. Is the network connection for migration the same one that's used to communicate with the VM?

>> If I am understanding your comment correctly, I would run the command below to set it to ~1 Gbps:
>> pvesm set local-zfs -bwlimit migration=976563
This is 976563 * 1024 bytes per second. To get that many bits per second instead, you need to divide the value by 8.
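
Spelled out, the arithmetic looks like this:

Code:
976563 KiB/s * 1024 B/KiB * 8 bit/B ≈ 8.0 Gbit/s   (what the command above would set)
976563 / 8 ≈ 122070 KiB/s           ->  ≈ 1.0 Gbit/s (the intended ~1 Gbps)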

>> 1) Which hypervisor do I run this on? The source or the destination?
The storage configuration is shared between all nodes in the cluster, so it does not matter.

>> 2) Can I run this while the migration is active, to change the speed on the fly and see the effects?
No, the setting is applied when starting the disk migration.
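
(Side note: if you only want to limit a single migration rather than the storage as a whole, the limit can also be passed when starting the migration. VMID 100 and target node armory2 below are just placeholders:)

Code:
# One-off ~1 Gbit/s limit (122070 KiB/s) for this migration only
qm migrate 100 armory2 --online --with-local-disks --bwlimit 122070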
 
>> Is the network connection for migration the same one that's used to communicate with the VM?
Yes...it's all one Ethernet cable into a switch.

I will try the command below:
>> pvesm set local-zfs -bwlimit migration=7812504

And then I will let you know if it works. Do you think it's saturation of the link that's causing the issue?
 
