[Proxmox 7.2-3 - CEPH 16.2.7] Migrating VMs hangs them (kernel panic on Linux, freeze on Windows)

Just tested kernel 5.15.53-1-pve on another 5-node cluster and all VM migrations went fine.

root@pve221:~# pveversion -v
proxmox-ve: 7.2-1 (running kernel: 5.15.53-1-pve)
pve-manager: 7.2-7 (running version: 7.2-7/d0dd0e85)
pve-kernel-helper: 7.2-12
pve-kernel-5.15: 7.2-10
pve-kernel-5.15.53-1-pve: 5.15.53-1
pve-kernel-5.15.39-3-pve-guest-fpu: 5.15.39-3
pve-kernel-5.15.39-3-pve: 5.15.39-3
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph: 16.2.9-pve1
ceph-fuse: 16.2.9-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-3
libpve-storage-perl: 7.2-8
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.5-1
proxmox-backup-file-restore: 2.2.5-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-2
pve-container: 4.2-2
pve-docs: 7.2-2
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-6
pve-firmware: 3.5-1
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 7.0.0-3
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.5-pve1
 
proxmox-ve: 7.2-1 (running kernel: 5.15.53-1-pve)
migration:
from Intel(R) Xeon(R) Silver 4310 CPU to Intel(R) Xeon(R) E-2234 CPU - VMs froze using 100% CPU
from Intel(R) Xeon(R) E-2234 CPU to Intel(R) Xeon(R) Silver 4310 CPU - OK
from Intel(R) Xeon(R) E-2234 CPU to Intel(R) Xeon(R) E-2234 CPU - OK
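For reference, a live migration like the ones above can also be triggered from the CLI; a minimal sketch, where VM ID 100 and the node names are placeholders:

root@pve221:~# qm migrate 100 pve222 --online                      # keep the VM running during the move
root@pve221:~# qm migrate 100 pve222 --online --with-local-disks   # variant if the VM has disks on local storage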
 
5.15.53-1-pve doesn't work for me. Migrating a VM from an i7-12700K to an i7-8700K did the typical 100% CPU thing.

Back to 5.15.39-3-pve-guest-fpu
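In case it helps anyone else rolling back: a sketch of how to make the older kernel stick across reboots, assuming the nodes boot via proxmox-boot-tool and pve-kernel-helper is recent enough to have the pin subcommand:

root@pve221:~# proxmox-boot-tool kernel list                          # show installed kernels
root@pve221:~# proxmox-boot-tool kernel pin 5.15.39-3-pve-guest-fpu   # boot this version by default
root@pve221:~# reboot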
 
Is it a safe solution for production?
Internal testing was fine, of course, and please see the thread for user feedback (one user had performance issues with GPU passthrough and one user had a host crash, but their setups do sound a bit exotic IMHO).

With kernels, it's always a good idea to test on similar hardware first (e.g. try it on a single cluster node). It's a huge moving piece of software after all ;)
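A minimal single-node test flow along those lines could look like this (node names and VM ID 100 are placeholders; the last command runs on the node currently hosting the test VM):

root@pve221:~# apt update && apt install pve-kernel-5.15.53-1-pve   # install only the new kernel
root@pve221:~# reboot
root@pve221:~# uname -r                                             # confirm the node came up on it
5.15.53-1-pve
root@pve222:~# qm migrate 100 pve221 --online                       # move a disposable test VM onto it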
 
Hi,
there is also a package for the 5.19 kernel available now. AFAIK, it includes a fix for the FPU issue.
proxmox-ve: 7.2-1 (running kernel: 5.19.7-1-pve)
live migration, storage Ceph:
from Intel(R) Xeon(R) Silver 4310 CPU to Intel(R) Xeon(R) E-2234 CPU - OK
from Intel(R) Xeon(R) E-2234 CPU to Intel(R) Xeon(R) Silver 4310 CPU - OK
from Intel(R) Xeon(R) E-2234 CPU to Intel(R) Xeon(R) E-2234 CPU - OK
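For anyone wanting to repeat this: the 5.19 kernel is opt-in, so it has to be installed explicitly on each node; assuming the pve-kernel-5.19 meta-package from the regular Proxmox repositories:

root@pve221:~# apt update && apt install pve-kernel-5.19
root@pve221:~# reboot
root@pve221:~# uname -r
5.19.7-1-pve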
 
I can also confirm that live migration is working between nodes running kernel 5.19.7-1.
Live migration is also working between nodes running kernel 5.15.39-4.
Live migration is working fine if we migrate a VM from a node running 5.19.7-1 to a node running 5.15.39-4.

However, live migration does not work (the VM freezes and eats CPU cycles) if we migrate a VM from a node running 5.15.39-4 to a node running 5.19.7-1.

Is there a safe way to upgrade a cluster from 5.15.39-4 to 5.19.x without VM freezes/downtime?

BTW: we have tested both local storage and Ceph storage.
 

Can confirm the issue with 5.15.60-1 to 5.19.7-2 as well:

Live migration, storage Ceph
Testing VM = Default installation of Ubuntu 20.04.4

Intel E5-2670v3 = proxmox-ve: 7.2-1 (running kernel: 5.15.60-1-pve)
Intel Xeon Gold 6134 = proxmox-ve: 7.2-1 (running kernel: 5.19.7-2-pve)

VM is started on Intel E5-2670v3:
From Intel E5-2670v3 To Intel Xeon Gold 6134 = FAIL - VM freezes and eats CPU cycles

VM is started on Intel Xeon Gold 6134:
From Intel Xeon Gold 6134 To Intel E5-2670v3 = OK



Also tried with:

Intel E5-2670v3 = proxmox-ve: 7.2-1 (running kernel: 5.15.60-1-pve)
Intel Xeon Gold 6134 = proxmox-ve: 7.2-1 (running kernel: 5.13.19-6-pve)

VM is started on Intel E5-2670v3:
From Intel E5-2670v3 To Intel Xeon Gold 6134 = OK
From Intel Xeon Gold 6134 To Intel E5-2670v3 = OK

VM is started on Intel Xeon Gold 6134:
From Intel Xeon Gold 6134 To Intel E5-2670v3 = OK
From Intel E5-2670v3 To Intel Xeon Gold 6134 = OK
 
Hi,
If the patch referenced by Thomas in
https://forum.proxmox.com/threads/p...on-linux-freeze-on-windows.109645/post-488479

is not included in pve-kernel-5.15.74-1, would it be possible to produce a 5.15.74-based patched kernel to test?

I think there were two bugs with live migration, and one is fixed by .74 on our end. Now we see PANIC: double fault, error_code: 0x0, which I think could be fixed by the patch reported in this thread.
I'd recommend upgrading to kernel 5.19 if you are affected by the FPU bug. Unfortunately, it's not straightforward to backport the fix.
 
Thanks for the recommendation. Is upgrading to kernel 5.19 better than staying with 5.13?

Last time I tried kernel 5.19, migrations from 5.13 were failing at a 50% rate (Sept 19th).

Cheers
 
Both source and target of the migration need to be on 5.19 to remedy the FPU bug IIRC.
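A quick way to verify that before migrating anything, assuming passwordless SSH between the nodes (node names and output here are illustrative):

root@pve221:~# for n in pve221 pve222 pve223; do echo -n "$n: "; ssh "$n" uname -r; done
pve221: 5.19.7-1-pve
pve222: 5.19.7-1-pve
pve223: 5.15.39-4-pve

Any node still reporting 5.15 would need to be upgraded and rebooted before VMs are migrated onto it.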
 
So this means one can't upgrade from 5.13 to 5.19 without stopping the VMs in case the cluster is affected by this bug? :)
 
Hello dear support forum,

my issue has been resolved by installing kernel 5.19.

Do I have to pay attention to anything when I update in the future, or should I install the older kernel again, if possible, once the problem in the older kernel is fixed?

Regards
 
So this means one can't upgrade from 5.13 to 5.19 without stopping the VMs in case the cluster is affected by this bug? :)
I haven't debugged the issue, but from what other users mention, I'm afraid that is the case.

my issue has been resolved by installing kernel 5.19. Do I have to pay attention to anything when I update in the future, or should I install the older kernel again, if possible, once the problem in the older kernel is fixed?
If you don't have any problems with the new kernel, I'd recommend staying on that one. While 5.15 is the default kernel right now, the opt-in 5.19 also gets a similar amount of attention.
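If you want to double-check which kernel a node will boot by default (and whether an old pin is still in effect), proxmox-boot-tool can show it; output here is abbreviated and illustrative:

root@pve221:~# proxmox-boot-tool kernel list
Manually selected kernels:
None.

Automatically selected kernels:
5.15.60-1-pve
5.19.7-2-pve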
 