VM shutdown, KVM: entry failed, hardware error 0x80000021

We have a Huawei RH2288 v3 rack server with 2x Xeon E5-2680 v4 (Broadwell-EP) running PVE 7.1.

The BIOS microcode version is 0xb000038.

One Ubuntu Server VM has this problem, but our Windows Server 2019 and 2022 VMs have not run into it.

In detail: the VM shuts down because of that assertion failure after about 11-14 hours of uptime (or around midnight; we are not sure which condition is the critical one).

We have tried the following with no luck:

- Different pve-kernel releases: 5.13.19-2, 5.15.35 (latest), and 5.13.19-6 (mentioned above; it crashed after 28 hours, around midnight).
- Updated the microcode to 0xb000040 (Intel marks it as "with caveats" because of BDX90; see https://www.intel.com/content/dam/w...cification-updates/xeon-e7-v4-spec-update.pdf)
- The `mitigations=off` kernel argument
- Disabled nested KVM via the kvm module options
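
For anyone comparing notes, the settings in the list above can be read back on the host roughly like this. This is only a sketch: the function names are made up for illustration, and it assumes the usual /proc/cpuinfo and /proc/cmdline layout on an Intel host.

```shell
# Hypothetical helpers (names are illustrative, not part of PVE).

# Print the first microcode revision found in /proc/cpuinfo-style text on stdin.
microcode_rev() {
    awk -F': *' '/^microcode/ { print $2; exit }'
}

# Succeed if a kernel command line read from stdin contains mitigations=off.
has_mitigations_off() {
    tr ' ' '\n' | grep -qx 'mitigations=off'
}

# Typical use on the host:
#   microcode_rev < /proc/cpuinfo
#   has_mitigations_off < /proc/cmdline && echo "mitigations are off"
#   cat /sys/module/kvm_intel/parameters/nested   # N or 0 means nested KVM is disabled
```

Checking the running values rather than the config files helps rule out the case where a setting was edited but never took effect after a reboot.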

So, with the information in this thread, our guesses are:

- The problem only occurs on Intel Broadwell and Haswell CPUs.
- The problem is a kernel (KVM) bug.
- After about 12 hours of running, or at midnight under low load, the guest kernel tries to change some (CPU or power) state and triggers the bug.
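
To test the first guess against more hosts, it may help to classify the host CPU from /proc/cpuinfo. A rough sketch, assuming the family-6 model numbers commonly listed for Haswell and Broadwell parts; the function name is hypothetical.

```shell
# Classify a CPU by "cpu family" and "model" (decimal, as shown in /proc/cpuinfo).
# The model numbers are the family-6 IDs for Haswell and Broadwell parts.
cpu_generation() {
    family=$1
    model=$2
    if [ "$family" != 6 ]; then
        echo other
        return
    fi
    case "$model" in
        60|63|69|70) echo haswell ;;
        61|71|79|86) echo broadwell ;;
        *)           echo other ;;
    esac
}

# On a host:
#   cpu_generation "$(awk -F': *' '/^cpu family/ { print $2; exit }' /proc/cpuinfo)" \
#                  "$(awk -F': *' '/^model\t/ { print $2; exit }' /proc/cpuinfo)"
```

If reports start coming in from hosts that print "other", the Broadwell/Haswell-only guess would need revisiting.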

Should we just comment that assertion out to "work around" it?

Further reading:

- https://www.reddit.com/r/VFIO/comments/s1k5yg/win10_guest_crashes_after_a_few_minutes/
- https://access.redhat.com/solutions/5749121
- https://gitlab.com/qemu-project/qemu/-/issues/1047 (may not be related)
- https://bugzilla.kernel.org/show_bug.cgi?id=216003 (may not be related)
 
OK guys, it has been working for me for 2+ days with no crashes.

I want to share my fix steps with you.

Note: my host is PVE 7.2 (new kernel 5.15.35-1-pve); the guest that crashes randomly is Windows Server 2022.

Step 1: install intel-microcode. Install steps are here: https://wiki.debian.org/Microcode

Step 2: set mitigations=off
Code:
nano /etc/default/grub
# find this line and change it to:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet mitigations=off"

update-grub2
reboot

This solved my problem.

Please note that mitigations=off comes with some security risks for the host. See: https://unix.stackexchange.com/questions/554908/disable-spectre-and-meltdown-mitigations
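
The manual edit in step 2 could also be previewed non-interactively. A minimal sketch, assuming a GRUB-booted host with a standard /etc/default/grub; the function name is made up, and it only transforms text on stdin so nothing is changed until you apply the result yourself.

```shell
# Append mitigations=off to GRUB_CMDLINE_LINUX_DEFAULT in grub-defaults text
# read from stdin, unless a mitigations= setting is already present.
add_mitigations_off() {
    sed -e '/^GRUB_CMDLINE_LINUX_DEFAULT=/{' \
        -e '/mitigations=/!s/"[[:space:]]*$/ mitigations=off"/' \
        -e '}'
}

# Preview the change first:
#   add_mitigations_off < /etc/default/grub | diff /etc/default/grub -
# After update-grub2 and a reboot, the effective state is visible in:
#   cat /proc/cmdline
#   grep . /sys/devices/system/cpu/vulnerabilities/*
```

The vulnerabilities files show per-issue status, which makes it easy to see exactly what the host gives up with this setting.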

Another way is to downgrade the PVE kernel to 5.13, but that may become a problem later, because PVE may drop support for the old kernel in upcoming releases.


Maybe this can help you.

Edit: with mitigations=off set, the VM crashes during live backup. For now the old kernel is the best option for me.

I am running a Windows Server 2022 Standard and have tried the settings of @kyesil for my host. Additionally I have changed the machine version to 6.2 and I am running kernel 5.15.35-2-pve.

So far my Windows Server 2022 Standard has been running without any crashes for 5 days. Before these changes the VM crashed twice a day; for now it is running stable without any issues.

Just be aware: when I changed the machine version, my Intel E1000 network device changed after starting the VM, so I had to assign my IP settings in the network adapter settings again.
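
For reference, the machine version can also be set from the host shell. A small sketch, assuming the `pc-<chipset>-<version>` naming PVE uses for versioned machine types; the helper name and VMID 101 are placeholders, and the post above does not say whether the VM uses i440fx or q35.

```shell
# Build the machine-type string for a chipset ("i440fx" or "q35")
# and a QEMU machine version such as 6.2.
machine_type() {
    printf 'pc-%s-%s\n' "$1" "$2"
}

# Example (VMID 101 is a placeholder); the VM needs a full stop and start afterwards:
#   qm set 101 --machine "$(machine_type i440fx 6.2)"
```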
 
I am on 5.15.35-2, did an apt update, and it's all up to date.
Did you manually install 5.15.35-5?
It seems that, as @nick.kopas is a Proxmox subscriber, new kernel versions are already available for subscribers.
At the moment the most recent version I am getting (as a non-subscriber) is also 5.15.35-2.
 
5.15.35-5 (which is available in the no-subscription repository) also comes with this:
Code:
  * update to Ubuntu-5.15.0-36.37
which includes part of this: http://changelogs.ubuntu.com/changelogs/pool/main/l/linux/linux_5.15.0-36.37/changelog
which in turn includes a lot of KVM fixes:
Code:
    - KVM: s390: vsie/gmap: reduce gmap_rmap overhead
    - KVM: x86/mmu: Resolve nx_huge_pages when kvm.ko is loaded
    - KVM: x86/pmu: Use different raw event masks for AMD and Intel
    - KVM: SVM: Fix kvm_cache_regs.h inclusions for is_guest_mode()
    - KVM: x86/svm: Clear reserved bits written to PerfEvtSeln MSRs
    - KVM: x86/pmu: Fix and isolate TSX-specific performance event logic
    - KVM: x86/emulator: Emulate RDPID only if it is enabled in guest
    - KVM: SVM: Allow AVIC support on system w/ physical APIC ID > 255
    - KVM: avoid NULL pointer dereference in kvm_dirty_ring_push
    - powerpc/kvm: Fix kvm_use_magic_page
    - KVM: PPC: Fix vmx/vsx mixup in mmio emulation
    - KVM: PPC: Book3S HV: Check return value of kvmppc_radix_init
    - KVM: x86: Fix emulation in writing cr8
    - KVM: x86/emulator: Defer not-present segment check in
    - KVM: x86: Reinitialize context if host userspace toggles EFER.LME
    - KVM: x86/mmu: Move "invalid" check out of kvm_tdp_mmu_get_root()
    - KVM: x86/mmu: Zap _all_ roots when unmapping gfn range in TDP MMU
    - KVM: x86/mmu: Check for present SPTE when clearing dirty bit in TDP MMU
    - KVM: x86: hyper-v: Drop redundant 'ex' parameter from kvm_hv_send_ipi()
    - KVM: x86: hyper-v: Drop redundant 'ex' parameter from kvm_hv_flush_tlb()
    - KVM: x86: hyper-v: Fix the maximum number of sparse banks for XMM fast TLB
    - KVM: x86: hyper-v: HVCALL_SEND_IPI_EX is an XMM fast hypercall
    - KVM: x86: Check lapic_in_kernel() before attempting to set a SynIC irq
    - KVM: x86: Avoid theoretical NULL pointer dereference in
      kvm_irq_delivery_to_apic_fast()
    - KVM: x86: Forbid VMM to set SYNIC/STIMER MSRs when SynIC wasn't activated
    - KVM: Prevent module exit until all VMs are freed
    - KVM: x86: fix sending PV IPI
    - KVM: SVM: fix panic on out-of-bounds guest IRQ
    - KVM: x86/mmu: do compare-and-exchange of gPTE via the user address
    - hv_netvsc: Add check for kvmalloc_array
    - x86/fpu: Move KVMs FPU swapping to FPU core
    - x86/fpu: Replace KVMs home brewed FPU copy from user
    - x86/fpu: Replace KVMs home brewed FPU copy to user
    - x86/fpu: Replace KVMs xstate component clearing
    - x86/KVM: Convert to fpstate
    - x86/fpu: Use fpstate in fpu_copy_kvm_uabi_to_fpstate()
    - x86/fpu: Prepare for sanitizing KVM FPU code
    - x86/fpu: Provide infrastructure for KVM FPU cleanup
    - x86/kvm: Convert FPU handling to a single swap buffer
    - x86/fpu: Remove old KVM FPU interface
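
Many of the entries above target s390, PPC, or AMD SVM and cannot matter on an Intel host. A rough text filter to narrow the list down, offered only as a sketch with a made-up function name; it keys on the subject prefix and will miss generic entries such as the plain "KVM:" ones.

```shell
# Keep only changelog entries (stdin) whose subject starts with an x86 or
# VMX KVM prefix, and drop anything that mentions AMD SVM.
x86_kvm_entries() {
    grep -E '^[[:space:]]*- KVM: (x86|VMX|nVMX)' | grep -vi 'svm'
}

# Example: save the changelog section to a file, then
#   x86_kvm_entries < changelog-excerpt.txt
```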

So, I'm not sure if it fixes our issue, but it may help if any of those issues is actually what we are experiencing...
 

Just clearing up a little confusion regarding the version numbers...

Code:
root@pve:~# dpkg --list | grep pve-kernel
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                 Version                        Architecture Description
+++-====================================-==============================-==============-============================================>
ii  pve-firmware                         3.4-2                          all          Binary firmware code for the pve-kernel
ii  pve-kernel-5.13                      7.1-9                          all          Latest Proxmox VE Kernel Image
ii  pve-kernel-5.13.19-2-pve             5.13.19-4                      amd64        The Proxmox PVE Kernel Image
ii  pve-kernel-5.13.19-6-pve             5.13.19-15                     amd64        The Proxmox PVE Kernel Image
ii  pve-kernel-5.15                      7.2-4                          all          Latest Proxmox VE Kernel Image
ii  pve-kernel-5.15.30-2-pve             5.15.30-3                      amd64        The Proxmox PVE Kernel Image
ii  pve-kernel-5.15.35-1-pve             5.15.35-3                      amd64        The Proxmox PVE Kernel Image
ii  pve-kernel-5.15.35-2-pve             5.15.35-5                      amd64        The Proxmox PVE Kernel Image
ii  pve-kernel-helper                    7.2-4                          all          Function for various kernel maintenance tasks.
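
The confusion comes from the package name carrying the kernel ABI (for example 5.15.35-2-pve) while the Version column carries the package build (5.15.35-5). A small sketch that prints the two side by side; the function name is hypothetical.

```shell
# From `dpkg --list` output on stdin, print "<kernel ABI name> <package version>"
# for each installed pve-kernel image (meta packages like pve-kernel-5.15 are skipped).
kernel_versions() {
    awk '$1 == "ii" && $2 ~ /^pve-kernel-[0-9.]+-[0-9]+-pve$/ {
        sub(/^pve-kernel-/, "", $2)
        print $2, $3
    }'
}

# On a host:
#   dpkg --list | kernel_versions
# Compare the first column with `uname -r` to see which build is running.
```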
 
Ummm... OK. I take that back. My Windows 11 VM just crashed.

@proteus FTW!
I have the same Win11 VM crash issue and the log is almost the same, but I have already installed the latest kernel, so I am wondering whether the issue is unrelated to the kernel or the latest kernel simply does not fix the bug.

Code:
Jun 15 06:28:15 HPL-PVE-HOST QEMU[1472]: KVM: entry failed, hardware error 0x80000021
Jun 15 06:28:15 HPL-PVE-HOST QEMU[1472]: If you're running a guest on an Intel machine without unrestricted mode
Jun 15 06:28:15 HPL-PVE-HOST QEMU[1472]: support, the failure can be most likely due to the guest entering an invalid
Jun 15 06:28:15 HPL-PVE-HOST QEMU[1472]: state for Intel VT. For example, the guest maybe running in big real mode
Jun 15 06:28:15 HPL-PVE-HOST QEMU[1472]: which is not supported on less recent Intel processors.
Jun 15 06:28:15 HPL-PVE-HOST QEMU[1472]: EAX=00000000 EBX=62487250 ECX=00000000 EDX=51f2e920
Jun 15 06:28:15 HPL-PVE-HOST QEMU[1472]: ESI=51ea15a0 EDI=5a5547e0 EBP=c9da37a0 ESP=0e7e0fb0
Jun 15 06:28:15 HPL-PVE-HOST QEMU[1472]: EIP=00008000 EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=1 HLT=0
Jun 15 06:28:15 HPL-PVE-HOST QEMU[1472]: ES =0000 00000000 ffffffff 00809300
Jun 15 06:28:15 HPL-PVE-HOST QEMU[1472]: CS =c000 7ffc0000 ffffffff 00809300
Jun 15 06:28:15 HPL-PVE-HOST QEMU[1472]: SS =0000 00000000 ffffffff 00809300
Jun 15 06:28:15 HPL-PVE-HOST QEMU[1472]: DS =0000 00000000 ffffffff 00809300
Jun 15 06:28:15 HPL-PVE-HOST QEMU[1472]: FS =0000 00000000 ffffffff 00809300
Jun 15 06:28:15 HPL-PVE-HOST QEMU[1472]: GS =0000 00000000 ffffffff 00809300
Jun 15 06:28:15 HPL-PVE-HOST QEMU[1472]: LDT=0000 00000000 000fffff 00000000
Jun 15 06:28:15 HPL-PVE-HOST QEMU[1472]: TR =0040 0e7d5000 00000067 00008b00
Jun 15 06:28:15 HPL-PVE-HOST QEMU[1472]: GDT=     0e7d6fb0 00000057
Jun 15 06:28:15 HPL-PVE-HOST QEMU[1472]: IDT=     00000000 00000000
Jun 15 06:28:15 HPL-PVE-HOST QEMU[1472]: CR0=00050032 CR2=6e124b60 CR3=9c50c000 CR4=00000000
Jun 15 06:28:15 HPL-PVE-HOST QEMU[1472]: DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
Jun 15 06:28:15 HPL-PVE-HOST QEMU[1472]: DR6=00000000ffff4ff0 DR7=0000000000000400
Jun 15 06:28:15 HPL-PVE-HOST QEMU[1472]: EFER=0000000000000000
Jun 15 06:28:15 HPL-PVE-HOST QEMU[1472]: Code=kvm: ../hw/core/cpu-sysemu.c:77: cpu_asidx_from_attrs: Assertion `ret < cpu->num_ases && ret >= 0' failed.
Jun 15 06:28:15 HPL-PVE-HOST kernel: set kvm_intel.dump_invalid_vmcs=1 to dump internal KVM state.
Jun 15 06:28:15 HPL-PVE-HOST kernel: vmbr0: port 2(tap101i0) entered disabled state
Jun 15 06:28:15 HPL-PVE-HOST kernel: vmbr0: port 2(tap101i0) entered disabled state
Jun 15 06:28:15 HPL-PVE-HOST systemd[1]: 101.scope: Succeeded.
Jun 15 06:28:15 HPL-PVE-HOST systemd[1]: 101.scope: Consumed 1h 33min 50.809s CPU time.
Jun 15 06:28:16 HPL-PVE-HOST qmeventd[347140]: Starting cleanup for 101
Jun 15 06:28:16 HPL-PVE-HOST qmeventd[347140]: Finished cleanup for 101
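
To see how often and at what time of day the crash hits, the entry-failure lines can be pulled out of the journal. A minimal sketch, assuming the syslog-style layout shown in the log above; the function name is made up.

```shell
# List timestamp and QEMU process tag for every 0x80000021 entry failure
# in syslog-style text on stdin.
kvm_entry_failures() {
    awk '/KVM: entry failed, hardware error 0x80000021/ {
        proc = $5
        sub(/:$/, "", proc)
        print $1, $2, $3, proc
    }'
}

# On a host:
#   journalctl --since yesterday | kvm_entry_failures
# The kernel line in the log suggests kvm_intel.dump_invalid_vmcs=1 for more detail.
# Assumption: the parameter is writable at runtime; otherwise set it as a module option.
#   echo 1 > /sys/module/kvm_intel/parameters/dump_invalid_vmcs
```

Collecting these timestamps across several crashes would help confirm or rule out the "around midnight" pattern mentioned earlier in the thread.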

Here are the installed kernels:

Code:
root@HPL-PVE-HOST:~# dpkg --list | grep pve-kernel
ii  pve-firmware                         3.4-2                          all          Binary firmware code for the pve-kernel
ii  pve-kernel-5.15                      7.2-4                          all          Latest Proxmox VE Kernel Image
ii  pve-kernel-5.15.30-2-pve             5.15.30-3                      amd64        The Proxmox PVE Kernel Image
ii  pve-kernel-5.15.35-1-pve             5.15.35-3                      amd64        The Proxmox PVE Kernel Image
ii  pve-kernel-5.15.35-2-pve             5.15.35-5                      amd64        The Proxmox PVE Kernel Image
ii  pve-kernel-helper                    7.2-4                          all          Function for various kernel maintenance tasks.
root@HPL-PVE-HOST:~#
 
You can try installing and pinning the 5.13 kernel...

Code:
apt install pve-kernel-5.13.19-6-pve
proxmox-boot-tool kernel pin 5.13.19-6-pve
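
After the reboot it is worth confirming that the host really came up on the pinned kernel. A small sketch; the helper name is hypothetical, and the `proxmox-boot-tool` subcommands in the comments assume a version recent enough to support pinning.

```shell
# Extract the major.minor series from a kernel release string such as `uname -r`.
kernel_series() {
    echo "$1" | cut -d. -f1,2
}

# After rebooting:
#   [ "$(kernel_series "$(uname -r)")" = "5.13" ] && echo "running the pinned 5.13 kernel"
#   proxmox-boot-tool kernel list      # shows the installed kernels and the current pin
#   proxmox-boot-tool kernel unpin     # go back to booting the newest kernel
```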
 
Thanks for the reply, I will try it. Does this kernel fix something, or do we have evidence that the issue is related to the 5.15 kernel?
 
Sorry, a few days have passed now... We are waiting for an official fix; can you tell us something, please? It seems really serious to me that we have to wait so long for a fix to a problem like this. What problems are there? Why don't you tell us anything? Thank you.
As the thread shows, disabling mitigations does not fix the issue in all instances, so sadly that path does not seem to be the (only) one to a solution.

Currently we're working on finding the commit which introduced the issue.
The main issue with this for us is that we cannot reliably trigger the situation:
* We have one (older, with outdated BIOS) host where it occurs sporadically, as @tom pointed out earlier; not a single other machine shows the issue here.
* It triggers very rarely, about once every 10 Windows installs (sometimes far more), so each test takes quite a while.

One thing that would help us tremendously: if someone has this issue reproducibly (meaning it occurs deterministically on a VM of theirs when certain actions are performed), they could explain how they arrive there and/or share the VM with us (I'm hoping the issue then occurs on our host as well).


One further idea that came up in our discussions here:
* Could you try changing the CPU type from host to something else (even the same family as your hardware, e.g. on a SkyLake system don't set the CPU to 'host' but to 'SkyLake') and see if the issue persists?
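
For anyone wanting to try the CPU type suggestion, here is a rough sketch for checking what a VM currently uses. The function name and VMID 101 are placeholders, and the fallback to kvm64 when no cpu line is set is an assumption about the PVE 7 default.

```shell
# Read `qm config <vmid>` output on stdin and print the configured CPU type.
# Assumption: when no cpu line is set, PVE 7 falls back to kvm64.
vm_cpu_type() {
    awk -F': *' '$1 == "cpu" {
        split($2, parts, ",")
        sub(/^cputype=/, "", parts[1])
        print parts[1]
        found = 1
    }
    END { if (!found) print "kvm64" }'
}

# Example (VMID 101 is a placeholder; pick the named model matching your hardware):
#   qm config 101 | vm_cpu_type
#   qm set 101 --cpu Broadwell
```

The new CPU type only applies after the VM has been fully stopped and started again, not after a reboot from inside the guest.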

I understand that it's frustrating, but as explained we're taking the issue seriously and are putting quite an effort into finding the cause and, with that, hopefully the path to a fix. We're not waiting with a fix in our hands...
 
Small update... I just applied the latest kernel update (5.15.35-5) and it seems to have resolved my issues running backups. I'm leaving the 5.13.19-15 kernel unpinned for now and will see how it goes.
That's good; however, the changes between the 5.15.35 kernels are in areas not touching that code, so it seems a bit odd?
(The changes are a fix for mounting QNAP NFS shares, one for Aquantia NICs, a fix for ECC support on a particular AMD/Ryzen system, and the fix for CVE-2022-1966 (netfilter + namespace related).)

In any case keep us posted if the issue indeed is gone for you!
 
Just for the record, I hope @nick.kopas cleared most of it up by posting the dpkg output:
* packages in the public PVE repositories transition from pvetest -> pve-no-subscription -> pve-enterprise
Thus you will never have a newer version on enterprise than on no-subscription
(and as a result, better-tested software on enterprise).

I hope this explains it.
 
Yeah, my celebration was premature... It crashed again. :(

Previously, running an image backup from within Windows to a network share reliably triggered the crash. It would seem that wasn't as reliable as I thought.
 
After changing the power plan in Windows from balanced to performance, I reached 3 days of uptime. I will update if anything changes.
 
