[Proxmox 7.2-3 - CEPH 16.2.7] Migrating VMs hangs them (kernel panic on Linux, freeze on Windows)

Randell · Sep 7, 2022

Anyone try 5.15.53-1 yet?

udo · Sep 11, 2022

Randell said:
Anyone try 5.15.53-1 yet?

Hi,
I've tried this kernel for an big issue I have with two new host ( https://forum.proxmox.com/threads/i...ug-soft-lockup-inside-vms.113373/#post-496346 ).
Live migration to an host with this kernel looks better for the first tries (only some VMs), but just some minutes ago, an existing VM hang with 100% cpu usage again, after (or during) migrate an bigger VM to that node (120GB local storage).

Udo

Pakillo77 · Sep 12, 2022

Randell said:
Anyone try 5.15.53-1 yet?

I've tested it on one of our clusters (all Intel CPUs) and all migrations I've tried have worked fine.

Pakillo77 · Sep 13, 2022

Just tested kernel 5.15.53-1-pve on another 5 nodes cluster and all VM migrations where fine.

root@pve221:~# pveversion -v
proxmox-ve: 7.2-1 (running kernel: 5.15.53-1-pve)
pve-manager: 7.2-7 (running version: 7.2-7/d0dd0e85)
pve-kernel-helper: 7.2-12
pve-kernel-5.15: 7.2-10
pve-kernel-5.15.53-1-pve: 5.15.53-1
pve-kernel-5.15.39-3-pve-guest-fpu: 5.15.39-3
pve-kernel-5.15.39-3-pve: 5.15.39-3
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph: 16.2.9-pve1
ceph-fuse: 16.2.9-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-3
libpve-storage-perl: 7.2-8
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.5-1
proxmox-backup-file-restore: 2.2.5-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-2
pve-container: 4.2-2
pve-docs: 7.2-2
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-6
pve-firmware: 3.5-1
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 7.0.0-3
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.5-pve1

IT ProCare · Sep 15, 2022

proxmox-ve: 7.2-1 (running kernel: 5.15.53-1-pve)
migration:
from Intel(R) Xeon(R) Silver 4310 CPU to Intel(R) Xeon(R) E-2234 CPU - VMs froze using 100% CPU
from Intel(R) Xeon(R) E-2234 CPU to Intel(R) Xeon(R) Silver 4310 CPU - OK
from Intel(R) Xeon(R) E-2234 CPU to Intel(R) Xeon(R) E-2234 CPU - OK

Randell · Sep 19, 2022

5.15.53-1-pve doesn't work for me. Migrating a VM from an i7-12700K to an i7-8700K did the typical 100% cpu thing.

Back to 5.15.39-3-pve-guest-fpu

fiona · Sep 19, 2022

Hi,
there also is a package for the 5.19 kernel available now. AFAIK, it includes a fix for the FPU issue.

IT ProCare · Sep 19, 2022

fiona said:
Hi,
there also is a package for the 5.19 kernel available now. AFAIK, it includes a fix for the FPU issue.

Is it a safe solution for production?

fiona · Sep 19, 2022

IT ProCare said:
Is it a safe solution for production?

Internal testing was fine of course and please see the thread for user feedback (one user had performance issues with GPU passthrough and one user had host crashed, but their setup does sound a bit exotic IMHO).

With kernels, it's always a good idea to test on similar hardware first (e.g. try it on a single cluster node). It's a huge moving piece of software after all

Randell · Sep 19, 2022

fiona said:
Hi,
there also is a package for the 5.19 kernel available now. AFAIK, it includes a fix for the FPU issue.

So far, seems to have resolved my migration issue:
https://forum.proxmox.com/threads/o...r-proxmox-ve-7-x-available.115090/post-499008

IT ProCare · Sep 24, 2022

fiona said:
Hi,
there also is a package for the 5.19 kernel available now. AFAIK, it includes a fix for the FPU issue.

proxmox-ve: 7.2-1 (running kernel: 5.19.7-1-pve)
live migration, storage Ceph:
from Intel(R) Xeon(R) Silver 4310 CPU to Intel(R) Xeon(R) E-2234 CPU - OK
from Intel(R) Xeon(R) E-2234 CPU to Intel(R) Xeon(R) Silver 4310 CPU - OK
from Intel(R) Xeon(R) E-2234 CPU to Intel(R) Xeon(R) E-2234 CPU - OK

bodo · Oct 5, 2022

I can also confirm live migration is working between nodes running kernel 5.19.7-1
live migration is also working between nodes running kernel 5.15.39-4
live migration is working fine if we migrate a VM from a node running 5.19.7-1 to a node running 5.15.39-4

However, live migration does not work (VM freezes and eats CPU cycles) if we migrate a VM from a node running 5.15.39-4 to a node running 5.19.7-1

Is there a safe way to upgrade a cluster from 5.15.39-4 to 5.19.x without VM freezes/downtime?

BTW: we have tested both, local storage and ceph storage.

allentkw · Oct 11, 2022

bodo said:
I can also confirm live migration is working between nodes running kernel 5.19.7-1
live migration is also working between nodes running kernel 5.15.39-4
live migration is working fine if we migrate a VM from a node running 5.19.7-1 to a node running 5.15.39-4

However, live migration does not work (VM freezes and eats CPU cycles) if we migrate a VM from a node running 5.15.39-4 to a node running 5.19.7-1

Is there a safe way to upgrade a cluster from 5.15.39-4 to 5.19.x without VM freezes/downtime?

BTW: we have tested both, local storage and ceph storage.

Can confirm issue with 5.15.60-1 to 5.19.7-2 as well:

Live migration, storage Ceph
Testing VM = Default installation of Ubuntu 20.04.4

Intel E5-2670v3 = proxmox-ve: 7.2-1 (running kernel: 5.15.60-1-pve)
Intel Xeon Gold 6134 = proxmox-ve: 7.2-1 (running kernel: 5.19.7-2-pve)

VM is started in Intel E5-2670v3:
From Intel E5-2670v3 To Intel Xeon Gold 6134 = FAIL - VM freezes and eats CPU cycle

VM is started in Intel Xeon Gold 6134:
From Intel Xeon Gold 6134 To Intel E5-2670v3 = OK

Tried also with:

Intel E5-2670v3 = proxmox-ve: 7.2-1 (running kernel: 5.15.60-1-pve)
Intel Xeon Gold 6134 = proxmox-ve: 7.2-1 (running kernel: 5.13.19-6-pve)

VM is started in Intel E5-2670v3:
From Intel E5-2670v3 To Intel Xeon Gold 6134 = OK
From Intel Xeon Gold 6134 To Intel E5-2670v3 = OK

VM is started in Xeon Gold 6134:
From Intel Xeon Gold 6134 To Intel E5-2670v3 = OK
From Intel E5-2670v3 To Intel Xeon Gold 6134 = OK

enlar · Nov 16, 2022

If the patch referenced by Thomas in
https://forum.proxmox.com/threads/p...on-linux-freeze-on-windows.109645/post-488479

is not included in pve-kernel-5.15.74-1, could it be possible to produced a 5.15.74 based patched kernel to test?

I think there were two bugs with live-migration and one is fixed by .74 in our end. Now we see PANIC: double fault, error_code: 0x0, that I think could be fixed by the patch reported in this thread.

fiona · Nov 16, 2022

Hi,

enlar said:
If the patch referenced by Thomas in
https://forum.proxmox.com/threads/p...on-linux-freeze-on-windows.109645/post-488479

is not included in pve-kernel-5.15.74-1, could it be possible to produced a 5.15.74 based patched kernel to test?

I think there were two bugs with live-migration and one is fixed by .74 in our end. Now we see PANIC: double fault, error_code: 0x0, that I think could be fixed by the patch reported in this thread.

I'd recommend to upgrade to kernel 5.19 if you are affected by the FPU bug. It's not straight-forward to backport the fix unfortunately.

enlar · Nov 16, 2022

Thanks for the recommendation. Is upgrading to kernel 5.19 better than staying with 5.13?

Last time I tried kernel 5.19 migrations from 5.13 were failing with a 50% rate (Sept 19th).

Cheers

fiona · Nov 16, 2022

enlar said:
Thanks for the recommendation. Is upgrading to kernel 5.19 better than staying with 5.13?

Last time I tried kernel 5.19 migrations from 5.13 were failing with a 50% rate (Sept 19th).

Cheers

Both source and target of the migration need to be on 5.19 to remedy the FPU bug IIRC.

enlar · Nov 16, 2022

So this means one can't upgrade from 5.13 to 5.19 without stopping the VMs in case cluster is affected by this bug?

qXt69WEV2a7fgbET · Nov 16, 2022

Hello dear support forum,

my issues has been resolved by installing the kernel 5.19.

Do i have to pay attention to anything when i update in the future, or should i install the older kernel again if possible when the problem in the older kernel is fixed?

Regards

fiona · Nov 18, 2022

enlar said:
So this means one can't upgrade from 5.13 to 5.19 without stopping the VMs in case cluster is affected by this bug?

I haven't debugged the issue, but from what other users mention, I'm afraid that is the case.

qXt69WEV2a7fgbET said:
Hello dear support forum,

my issues has been resolved by installing the kernel 5.19.

Do i have to pay attention to anything when i update in the future, or should i install the older kernel again if possible when the problem in the older kernel is fixed?

Regards

If you don't have any problems with the new kernel, I'd recommend staying on that one. While 5.15 is the default kernel right now, the opt-in 5.19 also gets a similar amount of attention.

[Proxmox 7.2-3 - CEPH 16.2.7] Migrating VMs hangs them (kernel panic on Linux, freeze on Windows)

Well-Known Member

Distinguished Member

Active Member

Active Member

Active Member

Well-Known Member

Proxmox Staff Member

Active Member

Proxmox Staff Member

Well-Known Member

Active Member

New Member

Active Member

Renowned Member

Proxmox Staff Member

Renowned Member

Proxmox Staff Member

Renowned Member

Active Member

Proxmox Staff Member

We value your privacy