[Proxmox 7.2-3 - CEPH 16.2.7] Migrating VMs hangs them (kernel panic on Linux, freeze on Windows)

Hi,
same issue here. From a host with kernel package
pve-kernel-5.15.35-3-pve
live migrating from a newer CPU to an older one freezes the VM, regardless of the kernel version on the older CPU host.

On the newer CPU hosts, I installed the older kernel package:
apt install pve-kernel-5.15.30-2-pve

On the older CPU host, I left the new kernel pve-kernel-5.15.35-3-pve

Now live migration works in both directions.
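
For reference, a minimal sketch of the steps on the newer-CPU hosts; the package name is the one above, and the pin is only an optional suggestion that assumes proxmox-boot-tool manages your boot entries:

Code:
# install the older kernel on the newer-CPU host
apt install pve-kernel-5.15.30-2-pve

# optionally pin it so it stays the default boot kernel until a fix lands
proxmox-boot-tool kernel pin 5.15.30-2-pve

systemctl reboot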

I'm waiting to hear when the problem is fixed.



 
I've just updated one of our PVE clusters to the latest version and the problem persists as described many times above: live migration from a newer CPU to an older one freezes the VM.

Code:
root@pve222:~# pveversion -v
proxmox-ve: 7.2-1 (running kernel: 5.15.39-1-pve)
pve-manager: 7.2-7 (running version: 7.2-7/d0dd0e85)
pve-kernel-5.15: 7.2-6
pve-kernel-helper: 7.2-6
pve-kernel-5.15.39-1-pve: 5.15.39-1
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph: 16.2.9-pve1
ceph-fuse: 16.2.9-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-3
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-3
libpve-storage-perl: 7.2-5
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.3-1
proxmox-backup-file-restore: 2.2.3-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-1
pve-container: 4.2-1
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.4-2
pve-ha-manager: 3.3-4
pve-i18n: 2.7-2
pve-qemu-kvm: 6.2.0-11
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1
 
Any ETA on when this will be fixed? This is very frustrating for users paying for the enterprise repository.

---
John
I am saying this as a happy long-time PVE user since V2 was released:
I haven't studied the updated terms and conditions in detail, but I'd expect Proxmox GmbH to offer some sort of refund to paying customers for the period in which the kernel causes these migration issues.
This breaks the product, and it doesn't seem that enterprise-grade testing is really happening. It feels like many years ago, when PVE had other issues that caused trouble and made recommending PVE as a consultant difficult because of stability issues in the eyes of the end customer.

I hope this gets solved soon.
Thanks!
 
I agree that this should be solved soon. My subscription ends at the end of this year, and I will not renew if it is not fixed by then.

PS: also testing whether it is visible that I'm a Proxmox subscriber.

--
John
 
I completely agree. If the Proxmox developers are reading this, which I assume they are, please solve this issue!
 
Just updated everything to the latest version and the problem is NOT fixed:
Code:
root@pveo03:~# pveversion -v
proxmox-ve: 7.2-1 (running kernel: 5.15.39-2-pve)
pve-manager: 7.2-7 (running version: 7.2-7/d0dd0e85)
pve-kernel-5.15: 7.2-7
pve-kernel-helper: 7.2-7
pve-kernel-5.15.39-2-pve: 5.15.39-2
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph: 16.2.9-pve1
ceph-fuse: 16.2.9-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-3
libpve-storage-perl: 7.2-7
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.5-1
proxmox-backup-file-restore: 2.2.5-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-2
pve-container: 4.2-2
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.5-1
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 6.2.0-11
pve-xtermjs: 4.16.0-1
pve-zsync: 2.2.3
qemu-server: 7.2-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.5-pve1

This is... incredible.
I don't understand how Proxmox can tolerate such a serious failure for so long.
 
We're looking into it, and I already tried to backport that finicky FPU-masking patch that is also linked here. I went for another try today and managed to get something that builds, boots, and also seems to host VMs and allow migrating them, but it is definitely not battle-tested enough for production. In other words, as always, we have the bootstrap problem: we need people to test it so we can (hopefully) declare it stable and release it over the normal channel, but most people only have production setups and thus want to wait for other feedback and for us to declare it stable.

All related packages are available at http://download.proxmox.com/temp/pve-kernel-5.15.35-guest-fpu-mask/

The SHA256 sums are:
Code:
4f15a1e9fdcfae212d6279f26cc76f76300655c6dbfc2b53cf990deaedb4159d  linux-tools-5.15_5.15.39-3_amd64.deb
85d029b6a27b541121a3a627c135caded4213fb97cb1c9a36ce1bf0eedf8da45  linux-tools-5.15-dbgsym_5.15.39-3_amd64.deb
bb2c7263e0699cb27cda64ac50d81f5956bf6a3de3e6ac228e9cb6f71f66b29b  pve-headers-5.15.39-3-pve-guest-fpu_5.15.39-3_amd64.deb
5ba1c056d80f4cb7c05cbdf37524ee897302918a59e1dfdc2099e51e0e07efb1  pve-kernel-5.15.39-3-pve-guest-fpu_5.15.39-3_amd64.deb
cf2c24aea618a96fcff3a9a67a318feb04aa554bba0c40670cb51b2a117ae6fd  pve-kernel-libc-dev_5.15.39-3_amd64.deb

For most setups it's enough to get the pve-kernel-5.15.39-3-pve-guest-fpu_5.15.39-3_amd64.deb one:

Code:
wget http://download.proxmox.com/temp/pve-kernel-5.15.35-guest-fpu-mask/pve-kernel-5.15.39-3-pve-guest-fpu_5.15.39-3_amd64.deb

# verify the checksum
sha256sum pve-kernel-5.15.39-3-pve-guest-fpu_5.15.39-3_amd64.deb
5ba1c056d80f4cb7c05cbdf37524ee897302918a59e1dfdc2099e51e0e07efb1  pve-kernel-5.15.39-3-pve-guest-fpu_5.15.39-3_amd64.deb

apt install ./pve-kernel-5.15.39-3-pve-guest-fpu_5.15.39-3_amd64.deb

# optionally ensure that this specific kernel gets booted via a pin
proxmox-boot-tool kernel pin 5.15.39-3-pve-guest-fpu

systemctl reboot

As said, boot and basic tests went fine with the above, but it is not recommended for production setups.
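
After the reboot, a quick sanity check that the test kernel is actually the one running could look like this; uname is standard, while the kernel list subcommand assumes a reasonably recent proxmox-boot-tool:

Code:
# should report 5.15.39-3-pve-guest-fpu
uname -r

# list the known kernels and the current pin, if proxmox-boot-tool is in use
proxmox-boot-tool kernel list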

Note also that even if this test kernel fixes the specific regression, this is a problem that will reappear from time to time on heterogeneous clusters. Mixing CPU vendors definitely makes it more likely, but bigger architectural leaps between generations from the same vendor can sometimes be enough to trigger such issues, as a new kernel comes around and either:
  • unlocks features of the newer generation that the older one doesn't have, which then trickle into the guest state; as probably happened here with the enabling of Intel AMX (aka TMUL) between the 5.15.30 and 5.15.35 kernels, or
  • needs to be more restrictive, e.g., due to security issues.
The only guarantee that no such state leakage has any effect is to use a homogeneous cluster. The steadily growing millions of HW combinations cannot realistically be tested on every kernel release, which often comes with the time pressure of getting a mitigation for a critical issue out the door.
Platforms that claim they can handle and guarantee this are simply not telling the truth, or support a very limited set of HW in the first place. FWIW, considering the complexity of these systems, the amount of change that happens at lower levels, and the vast abstractions required to hide all of that from VMs and migration, it works out well surprisingly often, IMO.

tl;dr: if you don't want this to happen, the easiest way is to:
  1. use enterprise HW
  2. use HW that is as identical as possible (CPU, mainboard, and system vendor wise); a quick way to compare CPU feature flags between nodes is sketched after this list
  3. keep firmware and µcode updated and at the same level
Anything else may often work out, but simply cannot be guaranteed.
This cannot be "bought" away in software; get fitting HW if you want a production setup.
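
For a rough idea of how far two nodes diverge CPU-feature-wise before migrating between them, a comparison along these lines can help; the node name pve-old is a placeholder, and this only shows the host-visible flags, not what the kernel ultimately exposes to guests:

Code:
# dump the local node's CPU feature flags, one per line, sorted
grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | sort -u > /tmp/flags-$(hostname)

# run the same command on the other node, copy its file over, then compare
scp root@pve-old:/tmp/flags-pve-old /tmp/
diff /tmp/flags-$(hostname) /tmp/flags-pve-old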
 

Ok, I'll try it right now.

And, understanding all of that, I have to say that live migration involving different hardware had always worked fine until kernels newer than 5.15.30-2-pve.
And I really loved that.
 
For most setups it's enough to get the pve-kernel-5.15.39-3-pve-guest-fpu_5.15.39-3_amd64.deb

I have the same problem with Dell PowerEdge R6525 servers. My cluster consists of 12 servers: half with AMD EPYC 7542 (Zen 2) and half with AMD EPYC 75F3 (Zen 3). Migrating a VM to a host with the older processor caused the virtual machine to hang. After installing this kernel (5.15.39-3-pve-guest-fpu), the problem was solved.
 
Just upgraded all Proxmox packages, including kernel 5.15.39-3-pve (the regular one, not the patched one), and the problem remains, but in reverse.
Now I can migrate VMs from the host with the newer CPU to the older ones, but not from older to newer.

Code:
root@pve222:~# pveversion -v
proxmox-ve: 7.2-1 (running kernel: 5.15.39-3-pve)
pve-manager: 7.2-7 (running version: 7.2-7/d0dd0e85)
pve-kernel-5.15: 7.2-8
pve-kernel-helper: 7.2-8
pve-kernel-5.13: 7.1-9
pve-kernel-5.15.39-3-pve: 5.15.39-3
pve-kernel-5.15.39-2-pve: 5.15.39-2
pve-kernel-5.15.39-1-pve: 5.15.39-1
pve-kernel-5.15.35-3-pve: 5.15.35-6
pve-kernel-5.15.35-2-pve: 5.15.35-5
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-4-pve: 5.13.19-9
pve-kernel-5.13.19-3-pve: 5.13.19-7
pve-kernel-5.13.19-2-pve: 5.13.19-4
pve-kernel-5.4.140-1-pve: 5.4.140-1
ceph: 16.2.9-pve1
ceph-fuse: 16.2.9-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-3
libpve-storage-perl: 7.2-7
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.5-1
proxmox-backup-file-restore: 2.2.5-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-2
pve-container: 4.2-2
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.5-1
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 6.2.0-11
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.5-pve1
 

Using this patched guest-fpu kernel, every migration I've tried has worked as expected.

Thanks, @t.lamprecht

 
I don't know the reason, but now, using the same patched kernel, every migration fails again. All of them: Windows, Linux...

Code:
root@pve226:~# pveversion -v
proxmox-ve: 7.2-1 (running kernel: 5.15.39-3-pve-guest-fpu)
pve-manager: 7.2-7 (running version: 7.2-7/d0dd0e85)
pve-kernel-5.15: 7.2-8
pve-kernel-helper: 7.2-8
pve-kernel-5.15.39-3-pve-guest-fpu: 5.15.39-3
pve-kernel-5.15.39-3-pve: 5.15.39-3
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph: 16.2.9-pve1
ceph-fuse: 16.2.9-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-3
libpve-storage-perl: 7.2-7
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.5-1
proxmox-backup-file-restore: 2.2.5-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-2
pve-container: 4.2-2
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.5-1
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 6.2.0-11
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.5-pve1

This is very annoying... it makes no sense.
 
Hello, I have:

root@pve-1:~# pveversion
pve-manager/7.2-7/d0dd0e85 (running kernel: 5.15.39-3-pve-guest-fpu)


with the pve-no-subscription repository, and it fixed the problem for me!

Thank you
 
Sadly, this still doesn't work for me with kernel 5.15.39-4-pve.

From:

Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz (latest bios available)

To:

Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz (latest bios available)

Causes the VM to hang at 100% CPU. The only fix is to reset it (or cold migrate).
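
For anyone else hitting this, a cold migration via the CLI is one way to move a guest between these hosts without the freeze; VMID 100 and the target node name pve-e5 are placeholders:

Code:
# on the source node: shut the guest down cleanly, then migrate it offline
qm shutdown 100
qm migrate 100 pve-e5

# on the target node: start the guest again
qm start 100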
 
