VM shutdown, KVM: entry failed, hardware error 0x80000021

Older Hardware and New 5.15 Kernel

KVM: entry failed, hardware error 0x80000021

Background

With the 5.15 kernel, the two-dimensional paging (TDP) memory management unit (MMU) implementation was activated by default. The new implementation reduces the complexity of mapping guest OS virtual memory addresses to the host's physical memory addresses and improves performance, especially during live migrations of VMs with a lot of memory and many CPU cores. However, the new TDP MMU feature has been shown to cause regressions on some (mostly older) hardware, likely because assumptions about when a fallback is required are not met by that HW.

The problem manifests as crashes of the VM, with a kernel (dmesg) or journalctl log entry containing, among others, a line like this:

KVM: entry failed, hardware error 0x80000021

Normally there's also an assert error message logged from the QEMU process around the same time. Windows VMs are the most commonly affected in the user reports.

The affected models could not be pinpointed exactly, but it seems CPUs launched over 8 years ago are the most likely to trigger the issue. Note that there are known cases where updating to the latest available firmware (BIOS/EFI) and CPU microcode fixed the regression. Thus, before trying the workaround below, we recommend ensuring that you have the latest firmware and CPU microcode installed.

Workaround: Disable tdp_mmu

The tdp_mmu kvm module option can be used to force disabling the usage of the two-dimensional paging (TDP) MMU.

  • You can either add that parameter to the PVE host's kernel command line as kvm.tdp_mmu=N, see this reference documentation section.
  • Alternatively, set the module option using a modprobe config, for example:
echo "options kvm tdp_mmu=N" >/etc/modprobe.d/kvm-disable-tdp-mmu.conf
To finish applying the workaround, always run update-initramfs -k all -u to update the initramfs for all kernels and then reboot the Proxmox VE host.

You can confirm that the change is active by checking that the output of cat /sys/module/kvm/parameters/tdp_mmu is N.
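Put together, the modprobe-config variant is roughly the following sequence (a sketch; run as root on the PVE host, and note that the reboot is required before the setting takes effect):

```shell
# Disable the TDP MMU via a modprobe option (effective from the next boot).
echo "options kvm tdp_mmu=N" > /etc/modprobe.d/kvm-disable-tdp-mmu.conf

# Rebuild the initramfs for all installed kernels, then reboot the host.
update-initramfs -k all -u
reboot

# After the reboot, confirm the workaround is active (should print: N):
cat /sys/module/kvm/parameters/tdp_mmu
```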
Good afternoon. How do I change this parameter to N on Debian?
When executing the commands from the manual, this is the output:

root@Line-host:~# cat /sys/module/kvm/parameters/tdp_mmu
Y
root@Line-host:~# echo "options kvm tdp_mmu=N" >/etc/modprobe.d/kvm-disable-tdp-mmu.conf
root@Line-host:~# update-initramfs -k all -u
update-initramfs: Generating /boot/initrd.img-5.15.39-1-pve
Running hook script 'zz-proxmox-boot'..
Re-executing '/etc/kernel/postinst.d/zz-proxmox-boot' in new private mount namespace..
No /etc/kernel/proxmox-boot-uuids found, skipping ESP sync.
update-initramfs: Generating /boot/initrd.img-5.15.35-2-pve
Running hook script 'zz-proxmox-boot'..
Re-executing '/etc/kernel/postinst.d/zz-proxmox-boot' in new private mount namespace..
No /etc/kernel/proxmox-boot-uuids found, skipping ESP sync.
update-initramfs: Generating /boot/initrd.img-5.13.19-6-pve
Running hook script 'zz-proxmox-boot'..
Re-executing '/etc/kernel/postinst.d/zz-proxmox-boot' in new private mount namespace..
No /etc/kernel/proxmox-boot-uuids found, skipping ESP sync.
update-initramfs: Generating /boot/initrd.img-5.13.19-2-pve
Running hook script 'zz-proxmox-boot'..
Re-executing '/etc/kernel/postinst.d/zz-proxmox-boot' in new private mount namespace..
No /etc/kernel/proxmox-boot-uuids found, skipping ESP sync.
 
Hi all,

we've got a new kernel available on pvetest including a patch series that may alleviate this issue without disabling tdp_mmu; it's in the pve-kernel-5.15.39-3-pve package with version 5.15.39-3.

At least the reproducer we got doesn't trigger a crash/kernel error on that kernel after running for more than 2 hours. For context, typical time required for triggering the issue was below half an hour, and even the rare occurrences when it took longer were below 2 hours.

The backport of the six-patch series wasn't trivial, but it also wasn't hard. Upstream may want to adapt the approach a bit, so while we did not notice any regression and our reproducer now runs stable, I wouldn't yet recommend this for production workloads. But if you have a test instance that can trigger this, it would be great to update to the newer pve-kernel-5.15.39-3-pve package (version 5.15.39-3), drop disabling the tdp_mmu, and reboot to see if that is also a valid workaround; we'd appreciate any feedback.

For the pvetest repo details see https://pve.proxmox.com/wiki/Package_Repositories#sysadmin_test_repo

FYI, the following command's output must contain the line below as a 1:1 match to ensure that your system actually runs the kernel with the alternative fix, with which one doesn't have to disable tdp_mmu.
Code:
dmesg | head -2
[    0.000000] Linux version 5.15.39-3-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #2 SMP PVE 5.15.39-3 (Wed, 27 Jul 2022 13:45:39 +0200) ()
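If you'd rather script this check than eyeball dmesg, a version comparison with coreutils' sort -V works. A minimal sketch, with a hard-coded sample value where a real host would use uname -r:

```shell
# Sketch: verify the running kernel is at least the version carrying the fix.
current="5.15.39-3-pve"          # on the host: current="$(uname -r)"
required="5.15.39-3-pve"

# sort -V orders version strings; if $required sorts first (or is equal),
# the running kernel is new enough.
if [ "$(printf '%s\n' "$required" "$current" | sort -V | head -n1)" = "$required" ]; then
  echo "kernel ok: $current"
else
  echo "kernel too old: $current (need >= $required)"
fi
```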
 
So I think that the two-dimensional paging (TDP) problem is linked to those Windows features ... it's probably all related to nested virtualization causing problems for TDP in the 5.15.x Linux kernel.
Nested virtualization also doesn't work anymore with kernel 5.15, not only on modern OSes like Windows 11 or Server 2022, but also on rather dated OSes like Windows 8.1 and XP running Microsoft Virtual PC inside them to run even older OSes like Windows 95, 98, 2000 and the like. This worked perfectly with 5.13 but no longer does on 5.15.
 
Nested virtualization also doesn't work anymore with kernel 5.15 [...] This worked perfectly with 5.13 but no longer does on 5.15.
It works here on an E5-2620 v3, a 12th-gen Intel (i7-12700K Alder Lake) workstation, and my old i9-9900K; a dual-socket EPYC 7351 also runs fine w.r.t. nested virtualization. Most of my colleagues also use nested virt., and I don't know of any problem there...
FWIW, with the aforementioned kernel I also tested a nested VM in Hyper-V (Win 2022) on Proxmox VE just today; it worked out fine.

Anyway, this is unrelated, as the issue is definitely not related directly to nesting, just like it wasn't to Spectre mitigations; both just make it trigger more easily. SMM (System Management Mode), required for Secure Boot in UEFI, is more likely to be the underlying issue.

Please open a new thread for your nested-virt issues and include the VM config, CPU details, and an issue description or other info possibly relevant for reproducing.
 

The new 5.15.39-3 test kernel finally fixes the problem for me without having to turn off tdp_mmu.

Bash:
# dmesg | head -2
[    0.000000] microcode: microcode updated early to revision 0xf0, date = 2021-11-15
[    0.000000] Linux version 5.15.39-3-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #2 SMP PVE 5.15.39-3 (Wed, 27 Jul 2022 13:45:39 +0200) ()

Bash:
# cat /sys/module/kvm/parameters/tdp_mmu
Y

Thank you for making this happen!
 
Installed and am testing. My knee-jerk reaction is this helps, but there are some performance issues. The impacted machines no longer crash under load, but their network performance now struggles to keep up. Local speedtests from the machine have swings in performance and throughput (kind of critical when running a file server in this case). I'm not sure if it is a CPU issue or a network issue.

But - concur - am not seeing the VMs crash under load anymore.

Will continue to keep an eye on this thread. Thank you for the work resolving this!
 
Installed and am testing. My knee-jerk reaction is this helps, but there are some performance issues. [...]
That's probably the new Retbleed mitigations, which came in with that kernel but aren't themselves part of the fix for this specific issue. Check lscpu to see if that theory holds up; it'll show something like
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection
if that HW issue was detected and the mitigation is active, or "Not affected" if it isn't.
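To check this quickly, you can grep the lscpu output for the Retbleed line. A small sketch; the sample line is hard-coded here so it stands alone, while on a real host you'd pipe lscpu directly:

```shell
# Sketch: detect whether the Retbleed mitigation is active from lscpu output.
lscpu_out="Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection"
# on the host: lscpu_out="$(lscpu)"

if printf '%s\n' "$lscpu_out" | grep -q 'Retbleed:.*Mitigation'; then
  echo "Retbleed mitigation is active"
elif printf '%s\n' "$lscpu_out" | grep -q 'Retbleed:.*Not affected'; then
  echo "not affected by Retbleed"
else
  echo "no Retbleed line found"
fi
```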
 
Sorry for dragging my feet on this. I've been testing to see what is working and what isn't.

1. Windows workloads have issues. I run several Windows workloads and find the odd network slowdown happens on both 2019 and 2022 servers. It looks like a bunch of interrupts stack up and overwhelm the VM, causing degradation in network performance. In fairness, I'm not sure if the issues existed before the patch. I'm also not sure if it is the patch or the network drivers (paravirtual, RedHat) causing the problem. The point of the patch (for me) was to get a 2022 _FILE_ server online which would crash under load. Other servers - like a domain controller - never experienced network loads like the file server, which was new and currently runs on Hyper-V boxes.

2. I don't believe Linux-y/Unix-y workloads to be impacted. I run a firewall (OPNsense/FreeBSD) on the servers as well and do not _THINK_ I see network performance issues (more below). I get near-line rates (1 Gbps) for the box running routing, Suricata (in IPS mode), and a firewall.

Note: While performance is near line rates, I do occasionally see dips that I can't attribute to anything. The dips COULD be the perf server on the other side, a network traffic blip, or networking issues similar to the Windows servers. I don't know.

3. While I don't see networking issues <edit> on Linux Workloads</edit>, I am seeing other issues I've not seen before. On the FreeBSD/OPNSense firewall, I am seeing the clock run backwards:

calcru: runtime went backwards from 1373029 usec to 763269 usec for pid 14053 (dpinger)
calcru: runtime went backwards from 96102 usec to 53228 usec for pid 14053 (dpinger)

The system is using KVMCLOCK as the time counter. I'm also seeing something weird with the disk:

(da0:vtscsi0:0:0:0): WRITE(10). CDB: 2a 00 01 c9 73 28 00 01 00 00
(da0:vtscsi0:0:0:0): CAM status: Command timeout
(da0:vtscsi0:0:0:0): Retrying command, 3 more tries remain
(da0:vtscsi0:0:0:0): WRITE(10). CDB: 2a 00 01 ba 96 a8 00 00 40 00
(da0:vtscsi0:0:0:0): CAM status: Command timeout
(da0:vtscsi0:0:0:0): Retrying command, 3 more tries remain

Neither of these issues happened before the update (I've run the FW for 5+ years without ever seeing this), so am leaning to the patch as the issue.

4. lscpu does not report a Retbleed vulnerability on the CPUs in the affected Proxmox hosts. Obviously I am not checking the guests.

Again, apologize for the delay, wanted to run tests as able.
 
Code:
calcru: runtime went backwards from 1373029 usec to 763269 usec for pid 14053 (dpinger)
calcru: runtime went backwards from 96102 usec to 53228 usec for pid 14053 (dpinger)

I also run OPNsense on Proxmox and have been seeing these for as long as I can remember. Just a friendly FYI; not sure if it affects anything or what causes it, though.
 
I don't get 5.15.39-3 on the enterprise repository. Will it be pushed soon?
 
Hello, I don't use Proxmox yet but I have the same issue here:
The server is on Ubuntu 22.04 with kernel 5.15.0-40-generic. The CPU is a Xeon(R) E5-2640 v4. Only one VM is affected, running win2k22 with SQL Server 2019. The error happened 2 times in 8 days. No workaround enabled yet. This message is for you, to help determine the affected CPUs. And I also use
<type arch='x86_64' machine='pc-q35-5.2'>hvm</type>
<loader readonly='yes' type='pflash'>/usr/share/OVMF/OVMF_CODE_4M.ms.fd</loader>
<nvram template='/usr/share/OVMF/OVMF_VARS_4M.ms.fd'>/var/lib/libvirt/qemu/nvram/win2k22-sqlserver_VARS.fd</nvram>
With pc-q35-6.? I can't use the network, neither with virtio nor e1000.
 
Server is on Ubuntu 22.04 with kernel 5.15.0-40-generic. CPU is a Xeon(R) E5-2640 v4. Only one VM is affected, running win2k22 with SQL Server 2019. [...]
I'd guess the error happens either inside the Ubuntu host, or at least due to it, as IIRC Ubuntu does not have any mitigations backported for this issue yet, unlike the Proxmox kernel, so its KVM stack is still susceptible. Either use PVE in the nested VM or report the issue to Ubuntu, recommending the following in-development patch series:
https://lore.kernel.org/kvm/20220803155011.43721-1-mlevitsk@redhat.com/
 
Hello

I have the same issue with a Windows 2k22 server. I've read that the 5.15.39-3 kernel seems to solve this.

Will it be released soon on the enterprise repository?

regards

Vincent
 
Hi,

experience with the package, after about two weeks on testing and one on no-subscription, has been good so far, so from that POV it would now be good to go for the enterprise repo.

But upstream sent out a slightly revised fix with some development feedback addressed, which we'd like to check out first, as the closer we are to the patch series that actually goes in upstream, the less friction potential there is in the longer run.

Testing that version with our reproducer will take a few hours; if that goes well, we'll move that package a bit faster to the open repos, and as the change is relatively small compared to the last kernel build, it should get into the enterprise repo no later than the start of next week, naturally only if nothing comes up. Depending on upstream feedback and also further feedback from the community here, we may be able to shortcut that, but no promises, and if we're not very sure about the impact potential we'll lean towards the slower and safer option. Anyhow, we'll keep you posted when the new kernel is available in the public repos.
 

Thanks for your answers... and for the caution you take ;-)
 
The updated version of the fix is now available on pvetest as package pve-kernel-5.15.39-4-pve in version 5.15.39-4. Our reproducer is still fixed, and in addition some slightly dubious kernel log messages, like:

Code:
QEMU[2775]: kvm: Could not update PFLASH: Stale file handle
kernel: kvm: vcpu 1: requested 191999 ns lapic timer period limited to 200000 ns

disappeared with the new version, so even a slight improvement.
 
Looks OK at first glance? You do need to reboot the PVE node for the new parameter setting to take effect.
Hi,
I installed the new kernel a week ago and everything works fine; I also re-enabled the tdp_mmu.
Several VMs with Win2022 and Debian, and there aren't any errors about KVM.
At the moment I have these pve-kernel versions installed on my hosts:
pve-kernel-5.13.19-6-pve/stable,now 5.13.19-15 amd64 [installed]
pve-kernel-5.15.30-2-pve/stable,now 5.15.30-3 amd64 [installed]
pve-kernel-5.15.39-1-pve/stable,now 5.15.39-1 amd64 [installed,automatic]
pve-kernel-5.15.39-3-pve/stable,now 5.15.39-3 amd64 [installed,automatic]

Is it safe to remove versions 5.13.19-6 and 5.15.30-2?
 
I installed the new kernel a week ago and everything works fine; I also renabled the tdp_mmu.
Several VMs with Win2022 and Debian and there aren't any error about KVM.
If you still have the time @t.lamprecht uploaded a new kernel 5.15.39-4 yesterday - which contains the same fix, but in a version that is more likely to get merged upstream - so testing would be much appreciated!

It is safe to remove the versions 5.13.19-6 and 5.15.30-2?
In general, I'd say that as long as there is a kernel on the machine that boots and works, you can of course remove old ones (you can always reinstall them later if you really want to) - but make sure that you have a working kernel left :)
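As a small safety net while cleaning up, you can compare each candidate package against the running kernel first. A sketch using the package names from the list above (the apt remove call is left as a comment, and a real host would use uname -r instead of the hard-coded value):

```shell
# Sketch: only offer old kernel packages for removal, never the running one.
running="5.15.39-3-pve"          # on the host: running="$(uname -r)"

for pkg in pve-kernel-5.13.19-6-pve pve-kernel-5.15.30-2-pve; do
  if [ "$pkg" = "pve-kernel-$running" ]; then
    echo "skipping currently running kernel: $pkg"
  else
    echo "would remove: $pkg"    # on the host: apt remove "$pkg"
  fi
done
```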
 