We have an AlmaLinux 9.3 VM with an 814 GB hard drive that stays fairly busy (it runs at a server load of 2 to 3.0 all day).
We have a two-node cluster, both nodes running local storage with identical motherboards/CPUs. The hard drives are different.
Last night, we wanted to run kernel updates on one of the hosts and live migrate the VM so there would be no downtime. So we fired off the migration, and once it started copying the blocks, the VM ran like molasses. It would respond to ping, but not a single site would load. Additionally, we could not log in usefully via SSH. Yes, it would connect and ask for a username and password, but it would just stall after that. The behavior in the console was similar: we would enter a username and password, and after a minute or two of waiting the login prompt would reappear as if we had done nothing.
We chose to wait the 35 minutes for the migration to finish rather than find out what bad things could happen if we stopped a migration task in the middle. Once the migration finished the VM was completely accessible and fine. It did not require a restart or anything.
[root@myvm ~]# w
10:51:27 up 105 days, 3:20, 2 users, load average: 2.81, 2.33, 2.19
I have attached the log of the migration.
I have searched extensively on the forums, Google, and Reddit to see if this issue has been discussed already, but everyone seems to be saying their VM fails AFTER the migration completes, and the cause is usually different CPU models.
The closest issue I have found is: https://bugzilla.kernel.org/show_bug.cgi?id=199727
In the last comment, someone recommends changing to aio=threads and ensuring iothread=1. I currently have aio=default (io_uring) and iothread=1. I am wondering if this change would help, or if anyone has any additional suggestions.
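For reference, this is roughly the change I would try, sketched with a placeholder VM ID (100) and disk volume; the actual slot and volume name would come from `qm config <vmid>`, and as far as I know the aio change only takes effect after a full stop/start of the VM, not a live reboot:

```
# Inspect the current disk line first (placeholder VM ID):
qm config 100

# Re-set the disk with aio=threads, keeping iothread=1
# (qm set replaces the whole scsi0 line, so all options must be repeated):
qm set 100 --scsi0 local-zfs:vm-100-disk-0,iothread=1,aio=threads

# Resulting line in /etc/pve/qemu-server/100.conf:
# scsi0: local-zfs:vm-100-disk-0,iothread=1,aio=threads
```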
root@pve:~# cat /etc/pve/storage.cfg
dir: local
        path /var/lib/vz
        content backup,vztmpl,iso

zfspool: local-zfs
        pool rpool/data
        content images,rootdir
        sparse 1

pbs: pbs
        datastore tanktruenas
        server backup.OMITTED.com
        content backup
        fingerprint OMITTED
        prune-backups keep-all=1
        username OMITTED@pbs

pbs: pbslocal
        datastore pbs_local_zfs
        server OMITTED
        content backup
        fingerprint OMITTED
        prune-backups keep-all=1
        username OMITTED@pbs

zfspool: backup-drive
        pool backup-drive
        content images,rootdir
        mountpoint /backup-drive
        nodes armory2
root@armory2:~# cat /etc/pve/storage.cfg
dir: local
        path /var/lib/vz
        content backup,vztmpl,iso

zfspool: local-zfs
        pool rpool/data
        content images,rootdir
        sparse 1

pbs: pbs
        datastore tanktruenas
        server backup.OMITTED.com
        content backup
        fingerprint OMITTED
        prune-backups keep-all=1
        username OMITTED@pbs

pbs: pbslocal
        datastore pbs_local_zfs
        server OMITTED
        content backup
        fingerprint OMITTED
        prune-backups keep-all=1
        username OMITTED@pbs

zfspool: backup-drive
        pool backup-drive
        content images,rootdir
        mountpoint /backup-drive
        nodes armory2