Migration results in hung VM with CPU at 100%

aolujic

New Member
Aug 16, 2023
Hello all,
after the latest upgrade of a no-subscription 3-node cluster, which touched only a few pve packages (pve-ha-manager and the kernel among them), almost every migration of any virtual machine reports success (the task finishes fine), but in reality the virtual machine ends up hanging with its CPU at 100%.

The only way to recover is to reset the virtual machine (qm reset <vmid> on the command line of the affected node). Please note that this system ran for over a year without ever experiencing anything remotely similar. The direction of migration also does not matter; the issue appears regardless of which pve host is the source or the destination.
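For reference, this is roughly what I run on the affected node (using VM 114 as an example):

qm status 114   # still reports running even though the guest is hung
qm reset 114    # hard reset; the VM boots cleanly afterwards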

After the reset, the affected VM runs without any issues and stays stable for days.

The system is based on Ryzen 9 CPUs, and there have been no recent BIOS or microcode changes that could be tied to the problem. The system runs under very low utilization anyway (it is a lab).

Any pointers on where to look for relevant information, or on how to roll back to previous versions of the relevant packages, would be appreciated.

Kind regards.
Sasha


proxmox-ve: 7.4-1 (running kernel: 5.15.108-1-pve)
pve-manager: 7.4-16 (running version: 7.4-16/0f39f621)
pve-kernel-5.15: 7.4-4
pve-kernel-5.15.108-1-pve: 5.15.108-2
pve-kernel-5.15.107-2-pve: 5.15.107-2
ceph: 17.2.6-pve1
ceph-fuse: 17.2.6-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx4
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.3-1
proxmox-backup-file-restore: 2.4.3-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.2
proxmox-widget-toolkit: 3.7.3
pve-cluster: 7.3-3
pve-container: 4.4-6
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-5
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1
 
Hi,
please share the configuration of an affected VM (qm config <ID>). What CPU models do your hosts have exactly? Can you try booting the older kernel to see if it's a kernel-related regression? Any messages in the system logs (of both source and target) around the time of the migration?
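For example, you can either pick the older kernel from the boot menu or pin it (version string taken from your pveversion output) and unpin once you're done testing:

proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin 5.15.107-2-pve
reboot
proxmox-boot-tool kernel unpin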
 
Hi, thanks Fiona, an older kernel certainly sounds like a good idea. The CPUs involved in this three-node cluster are:

CPU: 12-Core AMD Ryzen 9 3900X (-MT MCP-) speed/min/max: 3799/2200/3800 MHz
CPU: 12-Core AMD Ryzen 9 5900X (-MT MCP-) speed/min/max: 3580/2200/3700 MHz
CPU: 12-Core AMD Ryzen 9 3900X (-MT MCP-) speed/min/max: 3799/2200/3800 MHz

Nothing in the system logs raises any suspicion at those times. Please note each migration is completely successful from the Proxmox perspective. However, there is a common denominator: moving VMs from the 5900X node to a 3900X node always fails, the other way around works, and migrating between the two 3900X nodes also works without issues. Here is the configuration of one of the VMs I was testing with.

boot: order=virtio1;ide2
cores: 4
cpu: host,flags=+aes
ide2: none,media=cdrom
memory: 8192
meta: creation-qemu=7.0.0,ctime=1667927204
name: sophos
net0: virtio=7A:90:42:E3:DE:4B,bridge=vmbr55,queues=4
net1: virtio=46:70:37:61:4B:71,bridge=vmbr10,queues=4
numa: 0
ostype: l26
rng0: source=/dev/urandom
smbios1: uuid=4d680b58-6369-42a5-94b4-6ca87172f454
sockets: 1
vga: virtio
virtio1: tier1:vm-114-disk-1,iothread=1,size=200G
vmgenid: ddf21a73-7b98-485f-83e1-f12dce88f195

Thanks for pointing me in the right direction (CPU differences). Anyway, I'll try rebooting all nodes with the older kernel to check whether the issue goes away.
Sasha
 
If you have different physical CPUs, using the host CPU model will very often not work with live-migration. This is because features of the host CPU will be enabled that might not be compatible with the migration target's CPU.

You can select an AMD model compatible with all your CPUs to restrict the feature set QEMU will use: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_amd_cpu_types

See also the CPU Type section in https://pve.proxmox.com/pve-docs/chapter-qm.html#qm_cpu
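For example, assuming VM 114 from your config (a stop+start is required for the new CPU model to take effect; a reboot from inside the guest is not enough):

qm set 114 --cpu EPYC
qm stop 114 && qm start 114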
 
Dear Fiona,

indeed, that is very logical and expected. On the other hand, this cluster has never had the same CPU (or even the same Zen architecture) on all nodes, including a period when it had three different CPUs, yet this issue never appeared until now. In any case, I will gladly follow your advice for increased compatibility.

I gather from the documentation that the common CPU type in my case would be "EPYC-Rome" (Zen 2 architecture), but only "EPYC" works on the 3900X.

Bringing all virtual machines to the same CPU type unfortunately didn't solve the issue. I will give the older kernel a go when possible.

Best regards.
Sasha
 
So the issue still persisted if you set the VM's CPU model to EPYC? Did you make sure to stop+start the VM to apply the change? If yes, and if it worked in the past, that might be an indication that there's an actual issue here and not just host CPU incompatibility.
 
Yes, the issue persists when all VMs use the "EPYC" CPU model. Unfortunately, the issue is also present with the previous kernel, 5.15.107-2.
 
Hmm, did the upgrade also contain the pve-qemu-kvm package? You can check in /var/log/apt/history.log. Can you try using the qemu64 CPU model instead of EPYC? It's unfortunately less performant, because it's more generic.
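For example, something along these lines (older entries may be in the rotated history.log.*.gz files):

grep -B 2 -A 4 pve-qemu-kvm /var/log/apt/history.log
zgrep -B 2 -A 4 pve-qemu-kvm /var/log/apt/history.log.*.gz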
 
I have several machines that are optimized for Zen CPUs; I'm not sure they would run correctly with qemu64, but I can experiment with it as a temporary workaround.

I've looked through the update history and the last time pve-qemu-kvm was upgraded was in April. It is theoretically possible that this upgrade was the problematic one, but I hadn't noticed the issue until now. Attached are the logs from April until now. The question is how far to downgrade, and which packages exactly (e.g. qemu-server was also updated in June)?
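I assume a downgrade would look roughly like this, provided the older version is still available in the repository (the version string below is just a placeholder):

apt list -a pve-qemu-kvm
apt install pve-qemu-kvm=7.2.0-7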

Best regards.
Sasha
 

You can always create a dummy VM just for testing.

As @aaron pointed out, you might also want to try with the custom model defined here: https://forum.proxmox.com/threads/cant-use-epyc-rome-cpu-after-update.125336/#post-547510

Maybe one of the CPUs has the hardware erratum regarding the xsaves feature, but the others don't.
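The custom model from that thread goes into /etc/pve/virtual-guest/cpu-models.conf; a rough sketch (the model name here is just an example, see the linked post for the exact definition):

cpu-model: EPYC-Rome-no-xsaves
    flags -xsaves
    reported-model EPYC-Rome

A VM then references it with the custom- prefix, e.g. qm set 114 --cpu custom-EPYC-Rome-no-xsaves.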
 
Hi Fiona,

it didn't help; migrating from the 3900X to the 5900X worked, but from the 5900X to the 3900X still resulted in 100% CPU, even for a machine set to qemu64.

Best regards.
Sasha
 
Kernel 6.2 solves the issue! All machines are migrating freely from any node to any node and back and forth :) Thanks for all the help and patience!
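For anyone else hitting this: on PVE 7.4 the 6.2 opt-in kernel can be installed roughly like this (package name from memory, please double-check):

apt update
apt install pve-kernel-6.2
reboot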
 
Happy to hear that :) Can you please edit the thread (right above the first post) and select the [SOLVED] prefix? That helps other users find solutions more quickly.
 
