VM shutdown, KVM: entry failed, hardware error 0x80000021

nikybiasion · Jun 27, 2022

Same problem here with Intel Xeon Silver 4210R, guest with Windows 2022, q35 and tpm
It crashes about once a week.

proxmox-ve: 7.2-1 (running kernel: 5.15.35-2-pve)
pve-manager: 7.2-4 (running version: 7.2-4/ca9d43cc)
pve-kernel-5.15: 7.2-4
pve-kernel-helper: 7.2-4
pve-kernel-5.15.35-2-pve: 5.15.35-5
pve-kernel-5.15.35-1-pve: 5.15.35-3
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph-fuse: 15.2.16-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-2
libpve-storage-perl: 7.2-4
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.12-1
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.3-1
proxmox-backup-file-restore: 2.2.3-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-1
pve-container: 4.2-1
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.4-2
pve-ha-manager: 3.3-4
pve-i18n: 2.7-2
pve-qemu-kvm: 6.2.0-10
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1

eazit86 · Jun 27, 2022

We have had this problem since last year, and this is the workaround we used:

In Windows 2019+:
Control Panel > Exploit Protection
Control Flow Guard > Off
Mandatory ASLR > Off
Bottom-up ASLR > Off

We have not had a crash since we turned those off.

Processors we use:
E5-2690 v1/v2/v4 and Xeon Gold.

itNGO · Jun 28, 2022

eazit86 said:
We have had this problem since last year, and this is the workaround we used:

In Windows 2019+:
Control Panel > Exploit Protection
Control Flow Guard > Off
Mandatory ASLR > Off
Bottom-up ASLR > Off

We have not had a crash since we turned those off.

Processors we use:
E5-2690 v1/v2/v4 and Xeon Gold.

Or just uninstall that Windows Defender garbage. At least on a server OS you can still do that...

Stoiko Ivanov · Jun 28, 2022

eazit86 said:
We have had this problem since last year, and this is the workaround we used:

Could it be a different issue? The issue discussed here was introduced with pve-kernel-5.15, which was made public at the end of last year and was opt-in only until 7.2 was released in May 2022.

gramels · Jun 28, 2022

Stoiko Ivanov said:
Could it be a different issue? The issue discussed here was introduced with pve-kernel-5.15, which was made public at the end of last year and was opt-in only until 7.2 was released in May 2022.

That timeline fits my observation.

Stoiko Ivanov · Jun 28, 2022

Could someone who is affected by the `KVM: entry failed, hardware error 0x80000021` issue please try setting the tdp_mmu module parameter for kvm to 'N'? In my (quite long and cumbersome) tests it seems to resolve the issue on the one box where we can reproduce it.

To do so you can either:
* edit the kernel command-line (see https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysboot_edit_kernel_cmdline) and add:

Code:

kvm.tdp_mmu=N

(so that it looks something like: root=/dev/mapper/pve-root ro quiet kvm.tdp_mmu=N)
** reboot

* add it to /etc/modprobe.d/kvm.conf:
** add the following line to the file

Code:

options kvm tdp_mmu=N

** run update-initramfs -k all -u
** reboot

In both cases you can verify that the parameter is indeed set to 'N' by checking the current kvm module state:

Code:

cat /sys/module/kvm/parameters/tdp_mmu
N

I would assume the issue to be gone with this setting and the most recent (or actually any) pve-kernel from the 5.15 series
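
The modprobe variant and the final check can be scripted. A minimal sketch, assuming a POSIX shell on the Proxmox host: the function names `tdp_mmu_state` and `ensure_kvm_option` are made up for illustration, and the optional path arguments exist only so the functions can be tried on scratch files first. It does not run update-initramfs or reboot.

```shell
#!/bin/sh
# Report whether the KVM TDP MMU is enabled, based on the sysfs parameter
# file named above. The path argument is optional (assumption: default path).
tdp_mmu_state() {
    param_file="${1:-/sys/module/kvm/parameters/tdp_mmu}"
    if [ ! -r "$param_file" ]; then
        echo "unknown"   # kvm module not loaded, or parameter not present
        return 0
    fi
    case "$(cat "$param_file")" in
        N|n|0) echo "disabled" ;;
        Y|y|1) echo "enabled" ;;
        *)     echo "unknown" ;;
    esac
}

# Append "options kvm tdp_mmu=N" to the modprobe config unless the exact
# line is already there, so repeated runs do not duplicate it.
ensure_kvm_option() {
    conf_file="${1:-/etc/modprobe.d/kvm.conf}"
    line="options kvm tdp_mmu=N"
    if [ -f "$conf_file" ] && grep -qxF "$line" "$conf_file"; then
        echo "already present in $conf_file"
    else
        printf '%s\n' "$line" >> "$conf_file"
        echo "added to $conf_file"
    fi
}
```

On the host, `ensure_kvm_option` would be run as root, followed by `update-initramfs -k all -u` and a reboot as described above; afterwards `tdp_mmu_state` should print `disabled`.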

D0minik · Jun 28, 2022

Unfortunately, it did not work. I still got `KVM: entry failed, hardware error 0x80000021`.

Code:

# uname -r
5.15.35-3-pve
# cat /sys/module/kvm/parameters/tdp_mmu
N

I set up one VM with Windows Server 2016 and cloned it ten times. This way I can quickly verify whether the error still occurs, since the VMs appear to fail randomly.

Hardware:
2x Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz
768 GB RAM

We tested Hyper-V on the same server and observed no problems with the VMs.
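
A reproducer like the one described could be scripted roughly as follows. This is a sketch, not the poster's actual procedure: the VM IDs (100 as the source, 9001 upwards for the clones) and the `repro-` name prefix are hypothetical, and by default the functions only print the `qm` commands instead of running them.

```shell
#!/bin/sh
# Print the command by default; execute it only when RUN=1 is set.
run_cmd() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$@"
    fi
}

# Clone a source VM <count> times, using consecutive IDs starting at
# <first_id>, and start every clone right after creating it.
clone_and_start() {
    src_id="$1"
    first_id="$2"
    count="$3"
    i=0
    while [ "$i" -lt "$count" ]; do
        new_id=$((first_id + i))
        run_cmd qm clone "$src_id" "$new_id" --name "repro-$new_id"
        run_cmd qm start "$new_id"
        i=$((i + 1))
    done
}
```

Calling `clone_and_start 100 9001 10` prints the twenty commands for ten clones; with `RUN=1` in the environment they are executed on the node instead.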

Stoiko Ivanov · Jun 28, 2022

D0minik said:
Unfortunately, it did not work. I still got `KVM: entry failed, hardware error 0x80000021`.

Hm, OK. Maybe this issue has multiple causes. Here, setting this module parameter seems to reliably fix the issue (over 72 Windows installs, where without it, it took 3-10 installs on average to trigger it).

How did you set the parameter?

Thanks for testing in any case!

EDIT: If possible, could you also share the journal since boot for this run (journalctl -b > journal_since_boot.txt)? Thanks!
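
Once the journal has been saved to a file as suggested, the failures in it can be counted with a small helper. A sketch, assuming the QEMU error text ends up in the journal verbatim; `count_entry_failures` is an illustrative name, not an existing tool.

```shell
#!/bin/sh
# Count the lines in a saved journal that contain the KVM entry failure.
count_entry_failures() {
    journal_file="$1"
    # grep -c prints 0 but exits non-zero when nothing matches;
    # "|| true" keeps the printed count and drops the status.
    grep -cF 'KVM: entry failed, hardware error 0x80000021' "$journal_file" || true
}
```

For example, `count_entry_failures journal_since_boot.txt` prints the number of matching lines in the file created by the journalctl command above.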

D0minik · Jun 28, 2022

I added the parameter to /etc/default/grub

Code:

GRUB_CMDLINE_LINUX_DEFAULT="quiet kvm.tdp_mmu=N"

Then update-grub and reboot.
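
Independent of the bootloader, /proc/cmdline shows the command line the running kernel was actually booted with, so it is a quick way to confirm that an edit like this took effect (hosts that boot via systemd-boot read /etc/kernel/cmdline rather than /etc/default/grub). A sketch; `cmdline_has_param` is an illustrative name and the optional file argument is only there for trying it on a scratch file.

```shell
#!/bin/sh
# Succeed (exit status 0) if the given parameter appears as a whole token
# on the kernel command line, fail otherwise.
cmdline_has_param() {
    param="$1"
    cmdline_file="${2:-/proc/cmdline}"
    # Put every space-separated token on its own line, then require an
    # exact full-line match so "tdp_mmu=N" does not match "kvm.tdp_mmu=N".
    tr ' ' '\n' < "$cmdline_file" | grep -qxF -- "$param"
}
```

Usage would be something like `cmdline_has_param kvm.tdp_mmu=N && echo "parameter is active"`.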

In my case I have one VM with Windows installed, which I cloned. Some of the clones randomly work for some time, others fail almost instantly after start. At least one in ten fails almost instantly on start, usually more.

Stoiko Ivanov · Jun 28, 2022

D0minik said:
In my case I have one VM with Windows installed, which I cloned. Some of the clones randomly work for some time, others fail almost instantly after start. At least one in ten fails almost instantly on start, usually more.

If you have a somewhat working reproducer (meaning that upon starting 10 win2k16 clones at least one fails) - and a test-setup - could you maybe try whether the issue also happens for you with pve-kernel-5.15.5-1-pve?

(and if possible to provide the journal since boot)
Big Thanks!

D0minik · Jun 28, 2022

Unfortunately the problem still occurs.

Code:

# uname -r
5.15.5-1-pve

I included the journal output in the attachment. I removed IPs/hostnames/SNs from it, they should not be relevant anyway.

Stoiko Ivanov · Jun 29, 2022

D0minik said:
Unfortunately the problem still occurs.

Hmpf, but I did expect it.

Just for confirmation: the issue does not occur with the 5.13 kernel series for you?

D0minik · Jun 29, 2022

The problem happens on older kernels as well: 5.10.6-1-pve, 5.13.14-1-pve, 5.13.19-6-pve

I have just found out that if I disable Hyper-Threading in the BIOS, the Windows VMs do not seem to crash anymore. There was no problem with VMs on Hyper-V while HT was on. Also, Linux VMs seem fine as well, so I doubt there is something wrong with the hardware.
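
Whether SMT (Hyper-Threading) is currently active can also be checked from the running host without entering the BIOS. A sketch, assuming a kernel that exposes the SMT control file in sysfs, as the 5.x kernels discussed here should; `smt_state` is an illustrative name.

```shell
#!/bin/sh
# Print the kernel's SMT control state. Typical values are "on", "off",
# "forceoff", "notsupported" and "notimplemented".
smt_state() {
    smt_file="${1:-/sys/devices/system/cpu/smt/control}"
    if [ -r "$smt_file" ]; then
        cat "$smt_file"
    else
        echo "unknown"   # file missing: older kernel or non-x86 platform
    fi
}
```

Writing `off` to the same file as root disables SMT until the next reboot, which may be a quicker way to test this than a BIOS change.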

Stoiko Ivanov · Jun 29, 2022

D0minik said:
The problem happens on older kernels as well: 5.10.6-1-pve, 5.13.14-1-pve, 5.13.19-6-pve

I have just found out that if I disable Hyper-Threading in the BIOS, the Windows VMs do not seem to crash anymore. There was no problem with VMs on Hyper-V while HT was on. Also, Linux VMs seem fine as well, so I doubt there is something wrong with the hardware.

Ahh, OK, then this is a different issue. Sadly the error message is not too specific (you can find reports from 8 years ago with the same symptom).
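
For reference, the number in the message can be decoded. On Intel hosts the value appears to be the raw VMX exit reason: bit 31 flags a failed VM entry and the low 16 bits hold the basic exit reason, where 33 (0x21) is listed in the Intel SDM as "VM-entry failure due to invalid guest state". A sketch of the arithmetic, with `decode_exit_reason` as an illustrative name:

```shell
#!/bin/sh
# Split a VMX exit reason into its basic exit reason (bits 15:0)
# and the VM-entry failure flag (bit 31).
decode_exit_reason() {
    code=$(( $1 ))
    basic=$(( code & 0xFFFF ))
    entry_fail=$(( (code >> 31) & 1 ))
    echo "basic_exit_reason=$basic vm_entry_failure=$entry_fail"
}

decode_exit_reason 0x80000021
# prints: basic_exit_reason=33 vm_entry_failure=1
```

This is why the message is so unspecific: it only says that the CPU rejected the guest state at VM entry, not which part of the state was invalid or why.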

One thing that might help in such cases is to make sure that all components have the latest firmware installed, and also to install the intel-microcode package:
https://wiki.debian.org/Microcode
(in case you have not done this already)

I hope this helps!
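
To see which microcode revision the CPUs are currently running, and whether the package is installed, something like the following could be used. A sketch, assuming an x86 host where /proc/cpuinfo carries a `microcode` line and a Debian-based system with dpkg; both function names are illustrative.

```shell
#!/bin/sh
# Print the microcode revision of the first logical CPU listed in cpuinfo.
microcode_revision() {
    cpuinfo_file="${1:-/proc/cpuinfo}"
    # Lines look like "microcode<TAB>: 0x..."; split on the colon and print
    # the value of the first match only.
    awk -F': *' '/^microcode/ { print $2; exit }' "$cpuinfo_file"
}

# Succeed if dpkg reports the intel-microcode package as fully installed.
microcode_pkg_installed() {
    dpkg-query -W -f='${Status}\n' intel-microcode 2>/dev/null \
        | grep -q 'install ok installed'
}
```

Comparing the output of `microcode_revision` before and after installing the package and rebooting shows whether a newer revision was actually loaded.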

macpip · Jun 29, 2022

Here too

vm 121
Unable to complete the Windows Server 2022 (Std with desktop environment) installation, starting from 2 different DVD images containing the M$ evaluation release. In about 9 of 10 attempts the VM crashes before asking for the Administrator password.
It makes no difference whether:

I use the virtio network card or e1000 emulation
I install the virtio balloon drivers or not
I use the 0.1.215 or the 0.1.217 virtio driver ISO

I've tried all combinations of the above variables.

Code:

root@pve02:~# uname -a
Linux pve02 5.15.35-3-pve #1 SMP PVE 5.15.35-6 (Fri, 17 Jun 2022 13:42:35 +0200) x86_64 GNU/Linux

Code:

root@pve02:~# cat /etc/pve/qemu-server/121.conf  
agent: 1
balloon: 8192
bios: ovmf
boot: order=ide2;scsi0;ide0
cores: 8
efidisk0: local-zfs:vm-121-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
ide0: local:iso/virtio-win-0.1.215-2.iso,media=cdrom,size=528322K
ide2: local:iso/SW_DVD9_Win_Server_STD_CORE_2022__64Bit_Italian_DC_STD_MLF_X22-74296.ISO,media=cdrom,size=5374654K
machine: pc-q35-6.2
memory: 16384
meta: creation-qemu=6.2.0,ctime=1656427858
name: XXX01
net0: e1000=XXX,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: win11
scsi0: local-zfs:vm-121-disk-1,cache=writeback,discard=on,size=100G,ssd=1
scsi1: local-zfs:vm-121-disk-2,cache=writeback,discard=on,size=110G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=XXX
sockets: 1
startup: order=40,up=60,down=260
tpmstate0: local-zfs:vm-121-disk-3,size=4M,version=v2.0
vga: qxl
vmgenid: XXX

vm 122
On the same node vm 122 is working with no apparent issues

Code:

root@pve02:~# cat /etc/pve/qemu-server/122.conf 
## Windows Server 2022 Standard 
#
...
#
agent: 1
bios: ovmf
boot: order=scsi0;ide0
cores: 6
efidisk0: local-zfs:vm-122-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
ide0: local:iso/virtio-win-0.1.215-2.iso,media=cdrom,size=528322K
machine: pc-q35-6.1
memory: 8192
meta: creation-qemu=6.1.1,ctime=1647964974
name: XXX02
net0: virtio=XXX,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: win11
parent: PreAgg
protection: 1
scsi0: local-zfs:vm-122-disk-1,cache=writeback,discard=on,size=100G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=XXX
sockets: 1
startup: order=50,up=60,down=360
tpmstate0: local-zfs:vm-122-disk-2,size=4M,version=v2.0
vga: qxl
vmgenid: XXX

[Snapshots]
...
runningcpu: kvm64,enforce,hv_ipi,hv_relaxed,hv_reset,hv_runtime,hv_spinlocks=0x1fff,hv_stimer,hv_synic,hv_time,hv_vapic,hv_vpindex,+kvm_pv_eoi,+kvm_pv_unhalt,+lahf_lm,+sep
runningmachine: pc-q35-6.1+pve0
...

vm 921
On the same node, this VM crashes suddenly (after a few minutes at the earliest, after 1-2 days at the latest).
When it works, it shows some peculiar behavior:

If the VM is configured with min 8 GB RAM and max 16 GB RAM with ballooning, both Windows and PVE show about 95% RAM utilization within a few minutes (not usable)
If the VM is configured with 16 GB RAM and no ballooning, PVE shows about 14 GB RAM utilization for this VM, and Windows shows a reasonable amount of RAM utilization
Unable to install some Windows updates
Unable to activate Windows
No apparent relationship between activation/updates, workload and crashes

Code:

root@pve02:~# cat /etc/pve/qemu-server/921.conf   
## Windows Server 2022 Standard
...
#
agent: 1
bios: ovmf
boot: order=ide0;scsi0
cores: 8
efidisk0: local-zfs:vm-921-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
ide0: none,media=cdrom
machine: pc-q35-6.1
memory: 16384
meta: creation-qemu=6.1.1,ctime=1647964974
name: XXX03
net0: virtio=XXX,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: win11
parent: Aggiornamenti_OK
protection: 1
scsi0: local-zfs:vm-921-disk-1,cache=writeback,discard=on,size=100G,ssd=1
scsi1: local-zfs:vm-921-disk-2,cache=writeback,discard=on,size=110G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=XXX
sockets: 1
startup: order=40,up=60,down=360
tpmstate0: local-zfs:vm-921-disk-3,size=4M,version=v2.0
vga: qxl
vmgenid: XXX

[Sbap1]
#
agent: 1
bios: ovmf
boot: order=ide0;scsi0
cores: 8
efidisk0: local-zfs:vm-921-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
ide0: none,media=cdrom
machine: pc-q35-6.1
memory: 16384
meta: creation-qemu=6.1.1,ctime=1647964974
name: XXX03
net0: virtio=XXX,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: win11
parent: PreLiveCare
protection: 1
runningcpu: kvm64,enforce,hv_ipi,hv_relaxed,hv_reset,hv_runtime,hv_spinlocks=0x1fff,hv_stimer,hv_synic,hv_time,hv_vapic,hv_vpindex,+kvm_pv_eoi,+kvm_pv_unhalt,+lahf_lm,+sep
runningmachine: pc-q35-6.1+pve0
scsi0: local-zfs:vm-921-disk-1,cache=writeback,discard=on,size=100G,ssd=1
scsi1: local-zfs:vm-921-disk-2,cache=writeback,discard=on,size=110G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=XXX
snaptime: 1656356550
sockets: 1
startup: order=40,up=60,down=360
tpmstate0: local-zfs:vm-921-disk-3,size=4M,version=v2.0
vga: qxl
vmgenid: c9b5b7b4-0c2b-4add-9ce4-6a39d00c5e85
vmstate: local-zfs:vm-921-state-Aggiornamenti_OK

[Snap2]
...

Node CPU

Code:

root@pve02:~# lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          24
On-line CPU(s) list:             0-23
Thread(s) per core:              2
Core(s) per socket:              12
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz
Stepping:                        7
CPU MHz:                         2200.000
CPU max MHz:                     3200.0000
CPU min MHz:                     1000.0000
BogoMIPS:                        4400.00
Virtualization:                  VT-x
L1d cache:                       384 KiB
L1i cache:                       384 KiB
L2 cache:                        12 MiB
L3 cache:                        16.5 MiB
NUMA node0 CPU(s):               0-23
Vulnerability Itlb multihit:     KVM: Mitigation: Split huge pages
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Vulnerable: Clear CPU buffers attempted, no microcode; SMT vulnerable
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Mitigation; TSX disabled
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_pkg_req pku ospke avx512_vnni md_clear flush_l1d arch_capabilities

On another PVE cluster on different hardware (ZFS), a FreeBSD VM has crashed 3-4 times in about 15 days with
KVM: entry failed, hardware error 0x80000021
generally just after the vzdump of another VM running on another node. In the couple of years before that, this VM never crashed.

I'm going to pin kernel 5.13.19-6-pve as soon as possible on node pve02 in the first cluster, and will let you know whether I am able to install Windows Server 2022 Std.

Just a question: is it safe to revert to kernel 5.13.19-6-pve on a single node of the cluster?
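
Pinning an older kernel as mentioned above could look like the sketch below. It assumes the installed proxmox-boot-tool supports the `kernel pin` subcommand and that kernel images live under /boot as `vmlinuz-<version>`; `pin_kernel` is an illustrative wrapper that only prints the command unless RUN=1 is set.

```shell
#!/bin/sh
# Check that the requested kernel image exists, then print (or, with RUN=1,
# execute) the command that pins it as the default boot entry.
pin_kernel() {
    version="$1"
    boot_dir="${2:-/boot}"
    if [ ! -e "$boot_dir/vmlinuz-$version" ]; then
        echo "kernel $version not found in $boot_dir" >&2
        return 1
    fi
    if [ "${RUN:-0}" = "1" ]; then
        proxmox-boot-tool kernel pin "$version"
    else
        echo "proxmox-boot-tool kernel pin $version"
    fi
}
```

For the version named in the post this would be `pin_kernel 5.13.19-6-pve`, followed by a reboot; `proxmox-boot-tool kernel unpin` reverts the choice later.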

t.lamprecht · Jun 29, 2022

macpip said:
On another PVE cluster on different hardware (ZFS), a FreeBSD VM has crashed 3-4 times in about 15 days with
KVM: entry failed, hardware error 0x80000021
generally just after the vzdump of another VM running on another node. In the couple of years before that, this VM never crashed.

Did you try the workaround that addresses our reproducer for this? Namely:

Stoiko Ivanov said:
Could someone who is affected by the `KVM: entry failed, hardware error 0x80000021` issue please try setting the tdp_mmu module parameter for kvm to 'N'? In my (quite long and cumbersome) tests it seems to resolve the issue on the one box where we can reproduce it.

To do so you can either:
* edit the kernel command-line (see https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysboot_edit_kernel_cmdline) and add:

Code:

kvm.tdp_mmu=N

(so that it looks something like: root=/dev/mapper/pve-root ro quiet kvm.tdp_mmu=N)
** reboot

* add it to /etc/modprobe.d/kvm.conf:
** add the following line to the file

Code:

options kvm tdp_mmu=N

** run update-initramfs -k all -u
** reboot

In both cases you can verify that the parameter is indeed set to 'N' by checking the current kvm module state:

Code:

cat /sys/module/kvm/parameters/tdp_mmu
N

I would assume the issue to be gone with this setting and the most recent (or actually any) pve-kernel from the 5.15 series

macpip said:
Just a question: is it safe to revert to kernel 5.13.19-6-pve on a single node of the cluster?

In terms of cluster stability? Yes.

macpip · Jun 29, 2022

t.lamprecht said:
Did you try the workaround that addresses our reproducer for this?

Thank you! Do I have to do this on all three nodes, or may I just test it on the node where the affected VM runs?

Stoiko Ivanov · Jun 29, 2022

macpip said:
Thank you! Do I have to do this on all three nodes, or may I just test it on the node where the affected VM runs?

If the cluster nodes have similar/identical hardware, I would recommend disabling tdp_mmu on all of them (by setting it on the kernel command line or in /etc/modprobe.d/ and rebooting afterwards).

macpip · Jun 29, 2022

t.lamprecht said:
Did you try the workaround that addresses our reproducer for this? Namely:

As you suggested, I've tried the workaround:

vi /etc/kernel/cmdline
  root=ZFS=rpool/ROOT/pve-1 boot=zfs kvm.tdp_mmu=N
proxmox-boot-tool refresh
reboot
cat /sys/module/kvm/parameters/tdp_mmu
  N

The installation completed on the first attempt with no issues!!!
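
The manual edit of /etc/kernel/cmdline shown above can be made repeatable. A minimal sketch, assuming the file holds the whole command line on a single line, as in the post; `add_cmdline_param` is an illustrative name, and the function neither runs proxmox-boot-tool nor reboots.

```shell
#!/bin/sh
# Append a parameter to the kernel command line file unless it is already
# present as a whole token. Prints "updated" or "unchanged".
add_cmdline_param() {
    param="$1"
    cmdline_file="${2:-/etc/kernel/cmdline}"
    current=$(cat "$cmdline_file")
    # Pad with spaces so only complete tokens match.
    case " $current " in
        *" $param "*)
            echo "unchanged"
            ;;
        *)
            printf '%s %s\n' "$current" "$param" > "$cmdline_file"
            echo "updated"
            ;;
    esac
}
```

After `add_cmdline_param kvm.tdp_mmu=N` reports `updated`, the remaining steps are the same as in the post: `proxmox-boot-tool refresh`, reboot, then check /sys/module/kvm/parameters/tdp_mmu.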

engelant · Jul 2, 2022

Could it be that this patch here is related?
Seems to be merged in v5.18.
