We're looking into it, and I had already tried to backport that finicky FPU masking patch that is also linked here. I went for another try today and managed to get something that builds, boots, and also seems to host VMs and allow migrating them, but it is definitely not battle-tested enough for production. In other words, as always we have the bootstrap problem: we need people to test it so we can (hopefully) declare it stable and release it over the normal channel, but most only have production setups and thus want to wait for feedback from others and for us to declare it stable.
All related packages are available at
http://download.proxmox.com/temp/pve-kernel-5.15.35-guest-fpu-mask/
The SHA256 sums are:
Code:
4f15a1e9fdcfae212d6279f26cc76f76300655c6dbfc2b53cf990deaedb4159d linux-tools-5.15_5.15.39-3_amd64.deb
85d029b6a27b541121a3a627c135caded4213fb97cb1c9a36ce1bf0eedf8da45 linux-tools-5.15-dbgsym_5.15.39-3_amd64.deb
bb2c7263e0699cb27cda64ac50d81f5956bf6a3de3e6ac228e9cb6f71f66b29b pve-headers-5.15.39-3-pve-guest-fpu_5.15.39-3_amd64.deb
5ba1c056d80f4cb7c05cbdf37524ee897302918a59e1dfdc2099e51e0e07efb1 pve-kernel-5.15.39-3-pve-guest-fpu_5.15.39-3_amd64.deb
cf2c24aea618a96fcff3a9a67a318feb04aa554bba0c40670cb51b2a117ae6fd pve-kernel-libc-dev_5.15.39-3_amd64.deb
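If you want all of them, here's a minimal sketch to fetch everything in one go and print the sums for comparison against the list above (URL and filenames taken from the listing, nothing else assumed):
Code:
base=http://download.proxmox.com/temp/pve-kernel-5.15.35-guest-fpu-mask
for deb in linux-tools-5.15_5.15.39-3_amd64.deb \
           linux-tools-5.15-dbgsym_5.15.39-3_amd64.deb \
           pve-headers-5.15.39-3-pve-guest-fpu_5.15.39-3_amd64.deb \
           pve-kernel-5.15.39-3-pve-guest-fpu_5.15.39-3_amd64.deb \
           pve-kernel-libc-dev_5.15.39-3_amd64.deb; do
    wget "$base/$deb"
done
# print the SHA256 sums of all downloaded packages, compare against the list above
sha256sum *.deb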
For most setups it's enough to install the
pve-kernel-5.15.39-3-pve-guest-fpu_5.15.39-3_amd64.deb
package:
Code:
wget http://download.proxmox.com/temp/pve-kernel-5.15.35-guest-fpu-mask/pve-kernel-5.15.39-3-pve-guest-fpu_5.15.39-3_amd64.deb
# verify the checksum
sha256sum pve-kernel-5.15.39-3-pve-guest-fpu_5.15.39-3_amd64.deb
# expected output:
# 5ba1c056d80f4cb7c05cbdf37524ee897302918a59e1dfdc2099e51e0e07efb1 pve-kernel-5.15.39-3-pve-guest-fpu_5.15.39-3_amd64.deb
apt install ./pve-kernel-5.15.39-3-pve-guest-fpu_5.15.39-3_amd64.deb
# optionally ensure that this specific kernel gets booted via a pin
proxmox-boot-tool kernel pin 5.15.39-3-pve-guest-fpu
systemctl reboot
As said, boot and basic tests went fine with the above, but it's not recommended for production setups.
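If you do test it, you can confirm after the reboot that the patched kernel actually got booted, and later, once a fixed kernel ships over the normal channels, drop the pin again (assuming your proxmox-boot-tool version has the unpin subcommand, which current ones should):
Code:
# should print 5.15.39-3-pve-guest-fpu when the test kernel is running
uname -r
# later, to return to the default kernel selection if you pinned above:
proxmox-boot-tool kernel unpin
systemctl reboot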
Note also that even if this test kernel fixes the specific regression, this is a problem that will reappear from time to time for heterogeneous clusters. Different CPU vendors definitely make it more likely, but bigger architectural leaps can sometimes be enough to trigger such issues between different generations from the same vendor, as a new kernel comes around and either:
- unlocks features from the newer generation that the older one doesn't have, trickling into the guest state; like what probably happened here with enabling Intel AMX (aka TMUL) between the 5.15.30 and 5.15.35 kernels (see the snippet after this list for a quick way to spot such asymmetries), or
- needs to be more restrictive, e.g., due to security issues.
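To get a rough first idea of such asymmetries, one can compare the CPU feature flags the kernel exposes on each node; a minimal sketch (node names are placeholders):
Code:
# run on every node: dump the sorted CPU feature flags of the first core
grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | sort > /tmp/flags-$(hostname)
# then collect the files on one node and diff them, e.g.:
scp node2:/tmp/flags-node2 /tmp/
diff /tmp/flags-node1 /tmp/flags-node2
Note that this only shows part of the picture; what the kernel actually lets through into the guest state (like the AMX case above) does not necessarily map 1:1 to these flags.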
And the only guarantee that no such state leakage has any effect is to use a homogeneous cluster. The steadily growing millions of HW combinations cannot realistically be tested on every kernel release, which often comes with the time pressure of getting a mitigation for a critical issue out.
Platforms that claim they can handle and guarantee this are simply not telling the truth, or support a very limited set of HW in the first place. FWIW, keeping in mind the complexity of these systems, the amount of change happening at the lower levels, and the vast abstractions required to hide all that away for VMs and migration, it works out quite well surprisingly often, IMO.
tl;dr: if you don't want this to happen, the easiest way is to:
- use enterprise HW
- use HW as identical as possible (CPU, mainboard, and system vendor wise)
- keep firmware and µcode updated and at the same level (a quick check is sketched at the end of this post)
Anything else may often work out, but simply cannot be guaranteed.
This cannot be "bought" away in software; get fitting HW if you want a production setup.
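For the µcode point above, a quick way to compare the currently loaded microcode revision across nodes (node names are placeholders again):
Code:
# the microcode revision currently loaded on the first CPU
grep -m1 microcode /proc/cpuinfo
# or query all nodes from one shell, e.g.:
for n in node1 node2 node3; do
    echo -n "$n: "
    ssh "$n" "grep -m1 microcode /proc/cpuinfo"
done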