VM shutdown, KVM: entry failed, hardware error 0x80000021

fransesco

New Member
Jun 16, 2022
4
1
3
I moved to 5.15.35-3-pve and the crashes seem to be occurring again. I briefly thought that changing the power mode on the Windows VM had solved this, but apparently not.

I am also wondering whether this is causing a memory leak. I was surprised by the host's memory use after my Windows VM had died a few times and been restarted manually from the console. A reboot of the host brought it back to the expected level. To be followed up...
 

gramels

New Member
May 9, 2022
4
2
3
I downgraded yesterday to
Code:
Linux pve2 5.13.19-6-pve #1 SMP PVE 5.13.19-15 (Tue, 29 Mar 2022 15:59:50 +0200) x86_64
and since then the Debian-based guest has stopped crashing (so far).
After my last upgrade this VM crashed reliably every 1-2 backups, which run every two hours.
 

piggie-mickie

New Member
Jan 22, 2022
3
0
1
55
Before I downgraded the kernel to 5.13.19-6, I got this issue regularly during backups of an Exchange 2016 server on Win2016, and in the last few days also during the day (with no backup task running).

My home lab is a Z390 chipset (Gigabyte) with an Intel i7-8700 (Coffee Lake), 128 GB RAM and 2x 2 TB NVMe. There are 24 VMs running on it, but only the Exchange VM and sometimes a Windows 11 Insider VM were affected by this issue. None of the other machines has crashed so far (Windows 2008R2, 2016/19 and 22, Win7, Win11, FreeBSD, Ubuntu, Debian, nested ESXi 6.7 with macOS and Android VMs).

Now, with the older kernel version, everything has been running stably for the last week.
 

gramels

New Member
May 9, 2022
4
2
3
I downgraded yesterday to
Code:
Linux pve2 5.13.19-6-pve #1 SMP PVE 5.13.19-15 (Tue, 29 Mar 2022 15:59:50 +0200) x86_64
and since then the Debian-based guest has stopped crashing (so far).
After my last upgrade this VM crashed reliably every 1-2 backups, which run every two hours.
Correction: it still crashes, just less often.
 

nikybiasion

Member
May 28, 2012
12
1
23
Same problem here with an Intel Xeon Silver 4210R; the guest runs Windows 2022 with q35 and TPM.
It crashes about once a week.

proxmox-ve: 7.2-1 (running kernel: 5.15.35-2-pve)
pve-manager: 7.2-4 (running version: 7.2-4/ca9d43cc)
pve-kernel-5.15: 7.2-4
pve-kernel-helper: 7.2-4
pve-kernel-5.15.35-2-pve: 5.15.35-5
pve-kernel-5.15.35-1-pve: 5.15.35-3
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph-fuse: 15.2.16-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-2
libpve-storage-perl: 7.2-4
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.12-1
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.3-1
proxmox-backup-file-restore: 2.2.3-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-1
pve-container: 4.2-1
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.4-2
pve-ha-manager: 3.3-4
pve-i18n: 2.7-2
pve-qemu-kvm: 6.2.0-10
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1
 

eazit86

New Member
May 1, 2021
7
0
1
36
We have had this problem since last year, and this is the workaround we used:

In Windows 2019+:
Control Panel > Exploit Protection
Control Flow Guard > Off
Mandatory ASLR > Off
Bottom-up ASLR > Off

We have not had a crash since we turned those off.

Processors we use:
E5-2690 v1/v2/v4 and Xeon Gold.
 

itNGO

Well-Known Member
Jun 12, 2020
568
123
48
44
Germany
it-ngo.com
We have had this problem since last year, and this is the workaround we used:

In Windows 2019+:
Control Panel > Exploit Protection
Control Flow Guard > Off
Mandatory ASLR > Off
Bottom-up ASLR > Off

We have not had a crash since we turned those off.

Processors we use:
E5-2690 v1/v2/v4 and Xeon Gold.
Or just uninstall that Windows Defender garbage. At least on a server OS you can still do that...
 

Stoiko Ivanov

Proxmox Staff Member
Staff member
May 2, 2018
7,383
1,186
164
We have had this problem since last year, and this is the workaround we used:
Could it be a different issue? The issue discussed here was introduced with pve-kernel-5.15, which was made public at the end of last year and was opt-in only until 7.2 was released in May 2022.
 

Stoiko Ivanov

Proxmox Staff Member
Staff member
May 2, 2018
7,383
1,186
164
Could someone who is affected by the `KVM: entry failed, hardware error 0x80000021` issue please try setting the tdp_mmu module parameter for kvm to 'N'? In my (quite long and cumbersome) tests it seems to resolve the issue on the one box where we can reproduce it.

To do so you can either:
* edit the kernel command-line (see https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysboot_edit_kernel_cmdline) and add:
Code:
kvm.tdp_mmu=N
(so that it looks something like: root=/dev/mapper/pve-root ro quiet kvm.tdp_mmu=N)
** reboot

* add it to /etc/modprobe.d/kvm.conf:
** add the following line to the file
Code:
options kvm tdp_mmu=N
** run update-initramfs -k all -u
** reboot

In both cases you can verify that the parameter is indeed set to 'N' by checking the current kvm module state:
Code:
cat /sys/module/kvm/parameters/tdp_mmu
N

I would assume the issue to be gone with this setting and the most recent (or actually any) pve-kernel from the 5.15 series
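
For anyone who wants to script the verification step above across several nodes, here is a minimal sketch. The function name check_tdp_mmu and its optional path argument are assumptions made for this example and are not part of any Proxmox tooling; only the sysfs path is taken from the post.

```shell
#!/bin/sh
# Report the state of the KVM TDP MMU from the module parameter file.
# The optional first argument (an alternative file path) is an assumption
# added here so the check can be tried against a test file.
check_tdp_mmu() {
    param_file="${1:-/sys/module/kvm/parameters/tdp_mmu}"
    if [ ! -r "$param_file" ]; then
        echo "unknown"      # kvm module not loaded, or parameter not present
        return 2
    fi
    case "$(cat "$param_file")" in
        N|0) echo "disabled"; return 0 ;;
        *)   echo "enabled";  return 1 ;;
    esac
}

# Check the running host; a non-zero status is tolerated here.
check_tdp_mmu || true
```

The exit status (0 when disabled, 1 when enabled, 2 when the state cannot be read) makes it easy to use the function in a loop over hosts.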
 

D0minik

New Member
Jun 28, 2022
4
3
3
Unfortunately it did not work. I still got KVM: entry failed, hardware error 0x80000021.

Code:
# uname -r
5.15.35-3-pve
# cat /sys/module/kvm/parameters/tdp_mmu
N

I set up one machine with Windows Server 2016 and cloned it ten times. This way I can quickly verify whether the error still occurs, since the VMs appear to fail randomly.
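
The reproducer described above could be scripted roughly like this. It is only a sketch: the VM IDs (9000 for the source VM, 9001 and up for the clones), the repro- name prefix and the DRY_RUN switch are hypothetical, and by default the script only prints the qm commands instead of running them.

```shell
#!/bin/sh
# Hypothetical IDs and names; adjust to the test setup.
TEMPLATE_ID="${TEMPLATE_ID:-9000}"        # source VM to clone (assumption)
FIRST_CLONE_ID="${FIRST_CLONE_ID:-9001}"  # first free VM ID (assumption)
CLONE_COUNT="${CLONE_COUNT:-10}"          # ten clones, as in the post
DRY_RUN="${DRY_RUN:-1}"                   # 1 = only print the commands

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Clone the source VM CLONE_COUNT times and start every clone.
clone_and_start() {
    i=0
    while [ "$i" -lt "$CLONE_COUNT" ]; do
        new_id=$((FIRST_CLONE_ID + i))
        run qm clone "$TEMPLATE_ID" "$new_id" --name "repro-$new_id"
        run qm start "$new_id"
        i=$((i + 1))
    done
}

clone_and_start
```

Set DRY_RUN=0 on a test node to actually run the commands; `qm status <vmid>` then shows which clones are still running.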

Hardware:
2x Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz
768 GB RAM

We tested Hyper-V on the same server and observed no problems with the VMs.
 

Stoiko Ivanov

Proxmox Staff Member
Staff member
May 2, 2018
7,383
1,186
164
Unfortunately it did not work. I still got KVM: entry failed, hardware error 0x80000021.
Hm, OK. Maybe this issue has multiple causes. Here, setting this module parameter seems to reliably fix the issue (over 72 Windows installs, where it otherwise took 3-10 installs on average to trigger it).

How did you set the parameter?

Thanks for testing in any case!

EDIT: If possible, could you also share the journal since boot for this run (journalctl -b > journal_since_boot.txt)? Thanks!
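
Before sharing the journal, it can help to count how often the error shows up in it. A minimal sketch, assuming the message text appears verbatim in the journal; the function name count_kvm_entry_failures is made up for this example.

```shell
#!/bin/sh
# Count lines on stdin that contain the error message from this thread.
# grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
count_kvm_entry_failures() {
    grep -c 'KVM: entry failed, hardware error 0x80000021' || true
}

# Count occurrences since the last boot, if journalctl is available.
if command -v journalctl >/dev/null 2>&1; then
    journalctl -b --no-pager | count_kvm_entry_failures
fi
```

The same function works on a saved file: `count_kvm_entry_failures < journal_since_boot.txt`.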
 

D0minik

New Member
Jun 28, 2022
4
3
3
I added the parameter to /etc/default/grub
Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet kvm.tdp_mmu=N"

Then update-grub and reboot.
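
To confirm after the reboot that the option really reached the running kernel, the kernel command line can be checked directly. A small sketch; the function name cmdline_has_param and its optional file argument are assumptions for illustration.

```shell
#!/bin/sh
# Succeed if the exact parameter ($1) is present on the kernel command line.
# The optional second argument (a file to read instead of /proc/cmdline)
# is an assumption added so the check can be tried on sample input.
cmdline_has_param() {
    tr -s ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}

if cmdline_has_param kvm.tdp_mmu=N; then
    echo "kvm.tdp_mmu=N is on the kernel command line"
else
    echo "kvm.tdp_mmu=N is NOT on the kernel command line"
fi
```

If the parameter is missing although /etc/default/grub was edited, the host may not be booting through GRUB at all; the Proxmox documentation linked earlier in the thread describes where the command line is edited for each bootloader.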

In my case I have one VM with Windows installed, which I cloned. Some of the clones randomly work for some time, while others fail almost instantly after start. At least one in ten fails almost instantly on start, usually more.
 

Stoiko Ivanov

Proxmox Staff Member
Staff member
May 2, 2018
7,383
1,186
164
In my case I have one VM with Windows installed, which I cloned. Some of the clones randomly work for some time, while others fail almost instantly after start. At least one in ten fails almost instantly on start, usually more.
If you have a somewhat working reproducer (meaning that upon starting 10 win2k16 clones at least one fails) - and a test-setup - could you maybe try whether the issue also happens for you with pve-kernel-5.15.5-1-pve?

(and, if possible, provide the journal since boot)
Big Thanks!
 

D0minik

New Member
Jun 28, 2022
4
3
3
Unfortunately the problem still occurs.

Code:
# uname -r
5.15.5-1-pve

I included the journal output in the attachment. I removed IPs/hostnames/SNs from it, they should not be relevant anyway.
 

Attachments

  • journal.txt
    267.1 KB · Views: 8

D0minik

New Member
Jun 28, 2022
4
3
3
The problem happens on older kernels as well: 5.10.6-1-pve, 5.13.14-1-pve, 5.13.19-6-pve

I have just found out that if I disable Hyper-Threading in the BIOS, the Windows VMs do not seem to crash anymore. There was no problem with VMs on Hyper-V while HT was on. Linux VMs seem fine as well, so I doubt there is something wrong with the hardware.
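
Whether Hyper-Threading (SMT) is actually active on the host can be read from sysfs after the BIOS change. A minimal sketch; the function name smt_state and the optional directory argument are assumptions for this example.

```shell
#!/bin/sh
# Print whether SMT is active, based on the kernel's sysfs entry.
# The optional first argument (an alternative directory) is an assumption
# added so the function can be tried against test files.
smt_state() {
    smt_dir="${1:-/sys/devices/system/cpu/smt}"
    if [ -r "$smt_dir/active" ]; then
        case "$(cat "$smt_dir/active")" in
            1) echo "SMT active" ;;
            0) echo "SMT inactive" ;;
            *) echo "SMT state unrecognized" ;;
        esac
    else
        echo "SMT state unavailable"
    fi
}

smt_state
```

SMT can also be switched off without a BIOS change through the nosmt kernel parameter; whether that has the same effect on this crash has not been tested here.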
 

Stoiko Ivanov

Proxmox Staff Member
Staff member
May 2, 2018
7,383
1,186
164
The problem happens on older kernels as well: 5.10.6-1-pve, 5.13.14-1-pve, 5.13.19-6-pve

I have just found out that if I disable Hyper-Threading in the BIOS, the Windows VMs do not seem to crash anymore. There was no problem with VMs on Hyper-V while HT was on. Linux VMs seem fine as well, so I doubt there is something wrong with the hardware.
Ah, OK. Then this is a different issue. Sadly the error message is not very specific (you can find reports from 8 years ago with the same symptom).

One thing that might help in such cases is to make sure that all components have the latest firmware installed, and also to install the intel-microcode package:
https://wiki.debian.org/Microcode
(in case you have not done this already)

I hope this helps!
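
One way to see whether the running kernel considers the loaded microcode insufficient is to look at the CPU vulnerability entries in sysfs, where some entries mention "no microcode" when a mitigation lacks microcode support. A small sketch; the function name and the optional directory argument are assumptions for illustration.

```shell
#!/bin/sh
# List the vulnerability entries whose status text mentions "no microcode".
# The optional first argument (an alternative directory) is an assumption
# added so the function can be tried against test files.
missing_microcode_mitigations() {
    vuln_dir="${1:-/sys/devices/system/cpu/vulnerabilities}"
    [ -d "$vuln_dir" ] || return 0
    grep -l 'no microcode' "$vuln_dir"/* 2>/dev/null |
        while read -r entry; do
            basename "$entry"
        done
}

missing_microcode_mitigations
```

An empty result only means that no entry reports missing microcode; it does not prove that the latest microcode is loaded.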
 

macpip

Member
Oct 9, 2019
13
8
8
30
Here too

vm 121
I am unable to complete the Windows Server 2022 (Std with desktop environment) installation, starting from 2 different DVD images containing the Microsoft evaluation release. In about 9 of 10 attempts the VM crashes before asking for the Administrator password.
It makes no difference whether:
  • I use the virtio network card or e1000 emulation
  • I install the virtio balloon drivers or not
  • I use the 0.1.215 or 0.1.217 virtio driver ISO
I've tried all combinations of the above.

Code:
root@pve02:~# uname -a
Linux pve02 5.15.35-3-pve #1 SMP PVE 5.15.35-6 (Fri, 17 Jun 2022 13:42:35 +0200) x86_64 GNU/Linux
Code:
root@pve02:~# cat /etc/pve/qemu-server/121.conf  
agent: 1
balloon: 8192
bios: ovmf
boot: order=ide2;scsi0;ide0
cores: 8
efidisk0: local-zfs:vm-121-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
ide0: local:iso/virtio-win-0.1.215-2.iso,media=cdrom,size=528322K
ide2: local:iso/SW_DVD9_Win_Server_STD_CORE_2022__64Bit_Italian_DC_STD_MLF_X22-74296.ISO,media=cdrom,size=5374654K
machine: pc-q35-6.2
memory: 16384
meta: creation-qemu=6.2.0,ctime=1656427858
name: XXX01
net0: e1000=XXX,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: win11
scsi0: local-zfs:vm-121-disk-1,cache=writeback,discard=on,size=100G,ssd=1
scsi1: local-zfs:vm-121-disk-2,cache=writeback,discard=on,size=110G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=XXX
sockets: 1
startup: order=40,up=60,down=260
tpmstate0: local-zfs:vm-121-disk-3,size=4M,version=v2.0
vga: qxl
vmgenid: XXX

vm 122
On the same node vm 122 is working with no apparent issues
Code:
root@pve02:~# cat /etc/pve/qemu-server/122.conf 
## Windows Server 2022 Standard 
#
...
#
agent: 1
bios: ovmf
boot: order=scsi0;ide0
cores: 6
efidisk0: local-zfs:vm-122-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
ide0: local:iso/virtio-win-0.1.215-2.iso,media=cdrom,size=528322K
machine: pc-q35-6.1
memory: 8192
meta: creation-qemu=6.1.1,ctime=1647964974
name: XXX02
net0: virtio=XXX,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: win11
parent: PreAgg
protection: 1
scsi0: local-zfs:vm-122-disk-1,cache=writeback,discard=on,size=100G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=XXX
sockets: 1
startup: order=50,up=60,down=360
tpmstate0: local-zfs:vm-122-disk-2,size=4M,version=v2.0
vga: qxl
vmgenid: XXX

[Snapshots]
...
runningcpu: kvm64,enforce,hv_ipi,hv_relaxed,hv_reset,hv_runtime,hv_spinlocks=0x1fff,hv_stimer,hv_synic,hv_time,hv_vapic,hv_vpindex,+kvm_pv_eoi,+kvm_pv_unhalt,+lahf_lm,+sep
runningmachine: pc-q35-6.1+pve0
...

vm 921
On the same node, this VM crashes suddenly (after a few minutes at the earliest, 1-2 days at the latest).
When it works, it shows some peculiar behavior:
  • if the VM is configured with min 8 GB RAM, max 16 GB RAM with balloon, Windows and PVE both show about 95% RAM utilization within a few minutes (not usable)
  • if the VM is configured with 16 GB RAM and no balloon, PVE shows about 14 GB RAM utilization for this VM, and Windows shows a reasonable amount of RAM utilization
  • Unable to install some Windows updates
  • Unable to activate Windows
  • No apparent relationship between activation/updates, workload and crashes
Code:
root@pve02:~# cat /etc/pve/qemu-server/921.conf   
## Windows Server 2022 Standard
...
#
agent: 1
bios: ovmf
boot: order=ide0;scsi0
cores: 8
efidisk0: local-zfs:vm-921-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
ide0: none,media=cdrom
machine: pc-q35-6.1
memory: 16384
meta: creation-qemu=6.1.1,ctime=1647964974
name: XXX03
net0: virtio=XXX,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: win11
parent: Aggiornamenti_OK
protection: 1
scsi0: local-zfs:vm-921-disk-1,cache=writeback,discard=on,size=100G,ssd=1
scsi1: local-zfs:vm-921-disk-2,cache=writeback,discard=on,size=110G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=XXX
sockets: 1
startup: order=40,up=60,down=360
tpmstate0: local-zfs:vm-921-disk-3,size=4M,version=v2.0
vga: qxl
vmgenid: XXX

[Snap1]
#
agent: 1
bios: ovmf
boot: order=ide0;scsi0
cores: 8
efidisk0: local-zfs:vm-921-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
ide0: none,media=cdrom
machine: pc-q35-6.1
memory: 16384
meta: creation-qemu=6.1.1,ctime=1647964974
name: XXX03
net0: virtio=XXX,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: win11
parent: PreLiveCare
protection: 1
runningcpu: kvm64,enforce,hv_ipi,hv_relaxed,hv_reset,hv_runtime,hv_spinlocks=0x1fff,hv_stimer,hv_synic,hv_time,hv_vapic,hv_vpindex,+kvm_pv_eoi,+kvm_pv_unhalt,+lahf_lm,+sep
runningmachine: pc-q35-6.1+pve0
scsi0: local-zfs:vm-921-disk-1,cache=writeback,discard=on,size=100G,ssd=1
scsi1: local-zfs:vm-921-disk-2,cache=writeback,discard=on,size=110G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=XXX
snaptime: 1656356550
sockets: 1
startup: order=40,up=60,down=360
tpmstate0: local-zfs:vm-921-disk-3,size=4M,version=v2.0
vga: qxl
vmgenid: c9b5b7b4-0c2b-4add-9ce4-6a39d00c5e85
vmstate: local-zfs:vm-921-state-Aggiornamenti_OK

[Snap2]
...

Node CPU

Code:
root@pve02:~# lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          24
On-line CPU(s) list:             0-23
Thread(s) per core:              2
Core(s) per socket:              12
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz
Stepping:                        7
CPU MHz:                         2200.000
CPU max MHz:                     3200.0000
CPU min MHz:                     1000.0000
BogoMIPS:                        4400.00
Virtualization:                  VT-x
L1d cache:                       384 KiB
L1i cache:                       384 KiB
L2 cache:                        12 MiB
L3 cache:                        16.5 MiB
NUMA node0 CPU(s):               0-23
Vulnerability Itlb multihit:     KVM: Mitigation: Split huge pages
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Vulnerable: Clear CPU buffers attempted, no microcode; SMT vulnerable
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Mitigation; TSX disabled
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl
                                 xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c
                                 rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2
                                 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local
                                 dtherm ida arat pln pts hwp hwp_act_window hwp_pkg_req pku ospke avx512_vnni md_clear flush_l1d arch_capabilities

On another PVE cluster on different hardware (ZFS), a FreeBSD VM has crashed 3-4 times in about 15 days with
KVM: entry failed, hardware error 0x80000021
generally just after the vzdump of other VMs running on another node. In the roughly two years before that, this VM had never crashed.

I'm going to pin kernel 5.13.19-6-pve as soon as possible on node pve02 in the first cluster, and will let you know whether I am able to install Windows Server 2022 Std.

Just a question: is it safe to revert to kernel 5.13.19-6-pve on a single node of the cluster?
 

t.lamprecht

Proxmox Staff Member
Staff member
Jul 28, 2015
5,501
1,751
164
South Tyrol/Italy
shop.proxmox.com
On another PVE cluster on different hardware (ZFS), a FreeBSD VM has crashed 3-4 times in about 15 days with
KVM: entry failed, hardware error 0x80000021
generally just after the vzdump of other VMs running on another node. In the roughly two years before that, this VM had never crashed.
Did you try the workaround that addresses our reproducer for this? Namely:
Could someone who is affected by the `KVM: entry failed, hardware error 0x80000021` issue please try setting the tdp_mmu module parameter for kvm to 'N'? In my (quite long and cumbersome) tests it seems to resolve the issue on the one box where we can reproduce it.

To do so you can either:
* edit the kernel command-line (see https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysboot_edit_kernel_cmdline) and add:
Code:
kvm.tdp_mmu=N
(so that it looks something like: root=/dev/mapper/pve-root ro quiet kvm.tdp_mmu=N)
** reboot

* add it to /etc/modprobe.d/kvm.conf:
** add the following line to the file
Code:
options kvm tdp_mmu=N
** run update-initramfs -k all -u
** reboot

In both cases you can verify that the parameter is indeed set to 'N' by checking the current kvm module state:
Code:
cat /sys/module/kvm/parameters/tdp_mmu
N
I would assume the issue to be gone with this setting and the most recent (or actually any) pve-kernel from the 5.15 series

Just a question: is it safe to revert to kernel 5.13.19-6-pve on a single node of the cluster?
In terms of cluster stability? Yes.
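
For reference, pinning the older kernel on a single node could look roughly like the sketch below. It assumes that the installed pve-kernel-helper version provides the `proxmox-boot-tool kernel pin` subcommand and that the kernel package follows the pve-kernel-<version> naming seen in the package lists in this thread; the DRY_RUN switch and the function name pin_kernel are made up for this example, and by default the commands are only printed.

```shell
#!/bin/sh
KERNEL_VERSION="${KERNEL_VERSION:-5.13.19-6-pve}"  # version named in the thread
DRY_RUN="${DRY_RUN:-1}"                            # 1 = only print the commands

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Install the given kernel, pin it as the boot default, and list the result.
pin_kernel() {
    run apt install "pve-kernel-$1"
    run proxmox-boot-tool kernel pin "$1"
    run proxmox-boot-tool kernel list
}

pin_kernel "$KERNEL_VERSION"
```

A reboot is needed for the pinned kernel to be used, and `uname -r` confirms the result afterwards. The pin can later be removed with `proxmox-boot-tool kernel unpin`, again assuming the subcommand is available.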
 
