VM Freezes DURING live Migration, Runs fine after completion

crashnightmare

New Member
Dec 15, 2023
We have an AlmaLinux release 9.3 VM with an 814 GB hard drive that is in fairly constant use (it runs at a server load of 2-3.0 all day).


We have a two-node cluster, both running local storage with identical motherboard/CPU. The hard drives are different.



Last night, we wanted to run the kernel updates on one of the hosts and wanted to live migrate the VM so there would be no downtime. So we fired off the migration, and once it started copying the blocks, the VM ran like molasses. It would ping, but not a single site would load. Additionally, we could not usefully log in via SSH. Yes, it would connect and ask for a username and password, but it would just stall after that. The behavior in the console was similar - we would enter a username and a password, and after a minute or two of waiting the login prompt would reappear as if we had done nothing.

We chose to wait the 35 minutes for the migration to finish rather than find out what bad things could happen if we stopped a migration task in the middle. Once the migration finished, the VM was completely accessible and fine. It did not require a restart or anything.

Code:
[root@myvm ~]# w
10:51:27 up 105 days, 3:20, 2 users, load average: 2.81, 2.33, 2.19

I have attached the log of the migration.

I have done extensive searching on the forums, Google, and Reddit to see if this issue has been discussed already, but everyone seems to be saying their VM fails AFTER the migration completes, and the cause is usually different CPU models.

The closest issue I have found is: https://bugzilla.kernel.org/show_bug.cgi?id=199727

In the last comment, someone recommends changing to aio=threads and ensuring iothread=1. I currently have aio=default (io_uring) and iothread=1. I am wondering if this would work, or if anyone has any additional suggestions.
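
In case it helps anyone reading later, a minimal sketch of what switching that setting might look like (the VMID, disk slot and volume name below are placeholders, not taken from this thread):

Code:
# Hypothetical example - assumes VMID 100 and a scsi0 disk named vm-100-disk-0 on local-zfs.
# qm set replaces the whole drive definition, so carry over any other disk options you already use.
qm set 100 --scsi0 local-zfs:vm-100-disk-0,iothread=1,aio=threads
# A full stop/start of the VM (not just a reboot inside the guest) is needed for the new aio setting to apply.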


Code:
root@pve:~# cat /etc/pve/storage.cfg
dir: local
        path /var/lib/vz
        content backup,vztmpl,iso

zfspool: local-zfs
        pool rpool/data
        content images,rootdir
        sparse 1

pbs: pbs
        datastore tanktruenas
        server backup.OMITTED.com
        content backup
        fingerprint OMITTED
        prune-backups keep-all=1
        username OMITTED@pbs

pbs: pbslocal
        datastore pbs_local_zfs
        server OMITTED
        content backup
        fingerprint OMITTED
        prune-backups keep-all=1
        username OMITTED@pbs

zfspool: backup-drive
        pool backup-drive
        content images,rootdir
        mountpoint /backup-drive
        nodes armory2




Code:
root@armory2:~# cat /etc/pve/storage.cfg
dir: local
        path /var/lib/vz
        content backup,vztmpl,iso

zfspool: local-zfs
        pool rpool/data
        content images,rootdir
        sparse 1

pbs: pbs
        datastore tanktruenas
        server backup.OMITTED.com
        content backup
        fingerprint OMITTED
        prune-backups keep-all=1
        username OMITTED@pbs

pbs: pbslocal
        datastore pbs_local_zfs
        server OMITTED
        content backup
        fingerprint OMITTED
        prune-backups keep-all=1
        username OMITTED@pbs

zfspool: backup-drive
        pool backup-drive
        content images,rootdir
        mountpoint /backup-drive
        nodes armory2
 

Attachments

  • task-pve-qmigrate-2024-04-10T23_10_39Z.log (152.8 KB)
Hi,
what kind of physical disks do you have? What did the load on the hosts look like during migration (especially IO wait)? Anything interesting in the system logs/journal?

You could try setting a migration speed limit on the storage (from man pvesm):
Code:
--bwlimit [clone=<LIMIT>] [,default=<LIMIT>] [,migration=<LIMIT>] [,move=<LIMIT>] [,restore=<LIMIT>]
           Set I/O bandwidth limit for various operations (in KiB/s).
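
For example, to cap only the disk-migration traffic on the local-zfs storage from the config above (the 100 MiB/s figure is purely illustrative):

Code:
# Limit storage migration for this storage to ~100 MiB/s (102400 KiB/s); other operations stay unlimited
pvesm set local-zfs --bwlimit migration=102400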
 
Hi Fiona,

Thanks for responding. Our storage is NVMe on both nodes.

The hypervisor host load looked low (2.0-3.0). The load inside the VM was super high (>100).

I did not check iowait.
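
(For next time, a quick way to watch IO wait and per-disk utilisation on the hosts during a migration, assuming the sysstat package is installed:)

Code:
# avg-cpu %iowait plus %util per device, refreshed every second
iostat -x 1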

We have another large VM (1 TB) that migrates just fine between the two hosts. That VM is not production and has near-zero load.

If I am understanding your comment correctly, I would run the command below to set it to ~1 Gbps:

pvesm set local-zfs -bwlimit migration=976563

A couple of questions:

1) Which hypervisor do I run this on? The source or the destination?
2) Can I run this while the migration is active, to change the speed on the fly and see the effects?
 
>> We have another large VM (1 TB) that migrates just fine between the two hosts. That VM is not production and has near-zero load.
Hmm, so it's rather unlikely that it's just the storage. Is the network connection for migration the same one that's used to communicate with the VM?

>> If I am understanding your comment correctly, I would run the command below to set it to ~1 Gbps:
>> pvesm set local-zfs -bwlimit migration=976563
This is 976563 * 1024 bytes per second. To get that many bits per second instead, you need to divide the value by 8.
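
Spelled out, the arithmetic looks like this:

Code:
976563 KiB/s * 1024 B/KiB * 8 bit/B ≈ 8.0 Gbit/s   (what the command above would set)
976563 / 8 ≈ 122070 KiB/s           ->  ≈ 1.0 Gbit/s (the intended ~1 Gbps)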

>> 1) Which hypervisor do I run this on? The source or the destination?
The storage configuration is shared between all nodes in the cluster, so it does not matter.

>> 2) Can I run this while the migration is active, to change the speed on the fly and see the effects?
No, the setting is applied when starting the disk migration.
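
(Side note: if you only want to limit a single migration rather than the storage as a whole, the limit can also be passed when starting the migration. VMID 100 and target node armory2 below are just placeholders:)

Code:
# One-off ~1 Gbit/s limit (122070 KiB/s) for this migration only
qm migrate 100 armory2 --online --with-local-disks --bwlimit 122070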
 
>> Is the network connection for migration the same one that's used to communicate with the VM?
Yes...it's all one Ethernet cable into a switch.

I will try the command below:
>> pvesm set local-zfs -bwlimit migration=7812504

And then I will let you know if it works. Do you think it's saturation of the link that's causing the issue?
 
