Thank you! Do I have to do this on all three nodes, or can I just test it on the node where the affected VM runs?
Did you try the workaround that addresses our reproducer for this?
If the cluster nodes have similar/identical hardware, I would recommend disabling tdp_mmu on all of them (by setting it on the kernel command line or in /etc/modprobe.d/ and rebooting afterwards).
Thank you! Do I have to do this on all three nodes, or can I just test it on the node where the affected VM runs?
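Regarding the /etc/modprobe.d/ variant of that recommendation, a minimal sketch (the file name kvm-options.conf is just an example; any .conf file in that directory will do):
echo "options kvm tdp_mmu=N" > /etc/modprobe.d/kvm-options.conf
# make sure the option is also picked up if kvm gets loaded from the initramfs
update-initramfs -u -k all
reboot
# verify after the reboot
cat /sys/module/kvm/parameters/tdp_mmu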
As you suggested, I've tried the workaround:
Did you try the workaround that addresses our reproducer for this? Namely:
vi /etc/kernel/cmdline
root=ZFS=rpool/ROOT/pve-1 boot=zfs kvm.tdp_mmu=N
proxmox-boot-tool refresh
reboot
cat /sys/module/kvm/parameters/tdp_mmu
N
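In case a node boots via GRUB rather than proxmox-boot-tool, the equivalent would presumably be to append the parameter to GRUB_CMDLINE_LINUX_DEFAULT instead; a sketch, where "quiet" just stands in for whatever is already in that variable:
vi /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet kvm.tdp_mmu=N"
update-grub
reboot
cat /sys/module/kvm/parameters/tdp_mmu
N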
Giving it a go... Unpinned kernel 5.13.19-6-pve and set kvm.tdp_mmu=N.
Could someone who is affected by the `KVM: entry failed, hardware error 0x80000021` issue please try setting the tdp_mmu module parameter for kvm to 'N'?
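For reference, a minimal sketch of how a kernel can be pinned or unpinned while testing (assuming a proxmox-boot-tool recent enough to provide the 'kernel pin'/'kernel unpin' subcommands):
proxmox-boot-tool kernel list
# pin the known-good kernel while testing
proxmox-boot-tool kernel pin 5.13.19-6-pve
# remove the pin again to boot the newest installed kernel by default
proxmox-boot-tool kernel unpin
reboot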
We tried internal builds of 5.18 a while ago to test for a possible fix in newer kernel versions, and booting with 5.18 didn't fix (or improve) triggering our reproducer, so I'd figure no.
cpuid_data is full, no space for cpuid(eax:0x8000001d,ecx:0x3e)
proxmox-ve: 7.2-1 (running kernel: 5.15.35-2-pve)
pve-manager: 7.2-4 (running version: 7.2-4/ca9d43cc)
pve-kernel-5.15: 7.2-4
pve-kernel-helper: 7.2-4
pve-kernel-5.15.35-2-pve: 5.15.35-5
pve-kernel-5.15.30-2-pve: 5.15.30-3
pve-kernel-5.13.19-6-pve: 5.13.19-15
ceph-fuse: 15.2.16-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-2
libpve-storage-perl: 7.2-4
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.12-1
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.3-1
proxmox-backup-file-restore: 2.2.3-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-1
pve-container: 4.2-1
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.4-2
pve-ha-manager: 3.3-4
pve-i18n: 2.7-2
pve-qemu-kvm: 6.2.0-10
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1
Jul 01 19:07:26 matrasprox QEMU[79174]: KVM: entry failed, hardware error 0x80000021
Jul 01 19:07:26 matrasprox kernel: set kvm_intel.dump_invalid_vmcs=1 to dump internal KVM state.
Jul 01 19:07:26 matrasprox QEMU[79174]: If you're running a guest on an Intel machine without unrestricted mode
Jul 01 19:07:26 matrasprox QEMU[79174]: support, the failure can be most likely due to the guest entering an invalid
Jul 01 19:07:26 matrasprox QEMU[79174]: state for Intel VT. For example, the guest maybe running in big real mode
Jul 01 19:07:26 matrasprox QEMU[79174]: which is not supported on less recent Intel processors.
Jul 01 19:07:26 matrasprox QEMU[79174]: EAX=000022e2 EBX=63d9e180 ECX=00000001 EDX=00000000
Jul 01 19:07:26 matrasprox QEMU[79174]: ESI=bc81b140 EDI=63daa340 EBP=00000000 ESP=65453d40
Jul 01 19:07:26 matrasprox QEMU[79174]: EIP=00008000 EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=1 HLT=0
Jul 01 19:07:26 matrasprox QEMU[79174]: ES =0000 00000000 ffffffff 00809300
Jul 01 19:07:26 matrasprox QEMU[79174]: CS =b600 7ffb6000 ffffffff 00809300
Jul 01 19:07:26 matrasprox QEMU[79174]: SS =0000 00000000 ffffffff 00809300
Jul 01 19:07:26 matrasprox QEMU[79174]: DS =0000 00000000 ffffffff 00809300
Jul 01 19:07:26 matrasprox QEMU[79174]: FS =0000 00000000 ffffffff 00809300
Jul 01 19:07:26 matrasprox QEMU[79174]: GS =0000 00000000 ffffffff 00809300
Jul 01 19:07:26 matrasprox QEMU[79174]: LDT=0000 00000000 000fffff 00000000
Jul 01 19:07:26 matrasprox QEMU[79174]: TR =0040 63dad000 00000067 00008b00
Jul 01 19:07:26 matrasprox QEMU[79174]: GDT= 63daefb0 00000057
Jul 01 19:07:26 matrasprox QEMU[79174]: IDT= 00000000 00000000
Jul 01 19:07:26 matrasprox QEMU[79174]: CR0=00050032 CR2=39f33dcc CR3=001ae000 CR4=00000000
Jul 01 19:07:26 matrasprox QEMU[79174]: DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
Jul 01 19:07:26 matrasprox QEMU[79174]: DR6=00000000ffff0ff0 DR7=0000000000000400
Jul 01 19:07:26 matrasprox QEMU[79174]: EFER=0000000000000000
Jul 01 19:07:26 matrasprox QEMU[79174]: Code=kvm: ../hw/core/cpu-sysemu.c:77: cpu_asidx_from_attrs: Assertion `ret < cpu->num_ases && ret >= 0' failed.
Jul 01 19:07:26 matrasprox kernel: fwbr1010i0: port 2(tap1010i0) entered disabled state
Jul 01 19:07:26 matrasprox kernel: fwbr1010i0: port 2(tap1010i0) entered disabled state
Jul 01 19:07:26 matrasprox systemd[1]: 1010.scope: Succeeded.
Jul 01 19:07:26 matrasprox systemd[1]: 1010.scope: Consumed 4min 45.838s CPU time.
Jul 01 19:07:26 matrasprox qmeventd[81331]: Starting cleanup for 1010
Jul 01 19:07:26 matrasprox kernel: fwbr1010i0: port 1(fwln1010i0) entered disabled state
Jul 01 19:07:26 matrasprox kernel: vmbr0: port 2(fwpr1010p0) entered disabled state
Jul 01 19:07:26 matrasprox kernel: device fwln1010i0 left promiscuous mode
Jul 01 19:07:26 matrasprox kernel: fwbr1010i0: port 1(fwln1010i0) entered disabled state
Jul 01 19:07:27 matrasprox kernel: device fwpr1010p0 left promiscuous mode
Jul 01 19:07:27 matrasprox kernel: vmbr0: port 2(fwpr1010p0) entered disabled state
Jul 01 19:07:27 matrasprox qmeventd[81331]: Finished cleanup for 1010
You can still revert to the older kernel, which is a valid workaround, and wait for the fix without having problems until it is released.
Hi,
Unfortunately, yesterday on a fresh Proxmox VE 7.2.4 install with only one VM (a new Windows Server 2022) we hit the same problem ... the VM crashed and was found turned off for no reason. The server is a new Dell T340 with a PERC 330 controller .... Now I am installing the old kernel and hoping it won't happen again, as these are production servers! Why can't this serious problem be solved?
The package versions and part of the syslog are attached; if you need anything else, let me know!
... Maybe I am repeating myself, but we have had this problem for more than a month now across all our customers who run an updated Proxmox VE with Windows Server 2022 VMs ... and these are 4 different installations on different servers ... How can we fix it, apart from using the old kernel? Thank you.
And attached here are the VM config and the error shown in Event Viewer on Windows Server 2022 after boot.
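If it helps anyone compare setups, the VM configuration can be dumped on the node with qm (VMID 1010 below is just the one visible in the log above; substitute your own):
qm config 1010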
Hi,
Hi all, we have many PVE servers. Recently I upgraded all of them to the latest PVE 7.2.4/7.2.5 along with the latest 5.15 kernel.
All of those servers have Xeon(R) Silver 41XX processors, and we have NO issue with the VM crashes mentioned here. The servers are mostly Supermicro or HP.
But we also have a PVE cluster with Intel(R) Xeon(R) Gold 5218 CPUs, and on this cluster we were forced to downgrade to the 5.13.19-6-pve kernel because of the same error mentioned here ... Those servers are Supermicro ...
And we have one Supermicro server with an Intel(R) Xeon(R) Gold 6226R CPU, which also seems to be rock stable ...
Can this help? Does anyone have an Intel Xeon(R) Silver 41XX CPU with this issue?
Thanks
OK. To be more specific, we use for example the 4210, which is the same "Cascade Lake" generation as your 4215 ... OK, thanks for the info ...
Hi,
We experience the same issue on a NODE with this CPU: Intel(R) Xeon(R) Silver 4215 CPU
Hi! (in reply to the post above about the Xeon(R) Silver 41XX servers)
Nodes from my previous posts where the problem happened frequently (in reply to the post above about the Xeon(R) Silver 41XX servers):
Model name: Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz
kvm.tdp_mmu=N
~# uname -a
Linux pve02 5.15.35-3-pve #1 SMP PVE 5.15.35-6 (Fri, 17 Jun 2022 13:42:35 +0200) x86_64 GNU/Linux
~# cat /sys/module/kvm/parameters/tdp_mmu
N
With kvm.tdp_mmu=N set, the VMs have never stopped again.
Model name: Intel(R) Xeon(R) Bronze 3106 CPU @ 1.70GHz
Small update: last week I unpinned kernel 5.13.19-6-pve and switched back to the latest kernel version, but together with the option kvm.tdp_mmu=N, and rebooted as suggested.
Before I downgraded the kernel to 5.13.19-6, I got this issue regularly during backups of an Exchange 2016 server on Windows Server 2016, and in the last days also during the day (with no backup task running).
My home lab is a Z390 chipset (Gigabyte) with an Intel i7-8700 (Coffee Lake), 128 GB RAM and 2x 2 TB NVMe. There are 24 VMs running on it, but only the Exchange VM and sometimes a Windows 11 Insider VM were affected by this issue. All the other machines have never crashed so far (Windows 2008 R2, 2016/19 and 22, Win7, Win11, FreeBSD, Ubuntu, Debian, nested ESXi 6.7 with macOS and Android VMs).
Now, with the older kernel version, everything has been running stably for the last week.
After reading this, and after confirming that our older Intel CPUs have the microcode add-on installed and "/etc/modprobe.d/intel-microcode-blacklist.conf" set to NOT blacklist, we found one node where we had missed configuring it correctly.
Herewith confirmation as well that running with TDP_MMU disabled resolves this problem for us. It appears to primarily affect Windows 2019 and Windows 2022 guests. It also took us a while to identify, as most VMs got hit by this after hours, when they are more idle than during office hours. The restarts were attributed to VMs restarting to install Windows updates, whereas searching for 'hardware error' revealed the intermittent and random pattern:
[attachment 38725: screenshot of the 'hardware error' occurrences found with the command below]
for f in /var/log/syslog*; do zgrep 'hardware error' $f; done | sort -k1M -k2n -k3
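A rough journal-based equivalent, in case the rotated syslog files have already been pruned (assuming the entries are still in the systemd journal; adjust the time window as needed):
journalctl --since "-30 days" | grep 'hardware error 0x80000021'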
I see the following note in the PVE 6 to 7 upgrade notes; I'm not sure how long it has been there:
https://pve.proxmox.com/wiki/Upgrade_from_6.x_to_7.0
[attachment 38724: screenshot of the relevant section of the upgrade notes]
I can only presume that this issue became more prevalent with the latest series of Intel microcode updates which were released in response to additional Spectre vulnerabilities and mitigations from May 2022...
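As a quick sanity check of that theory on a given node, one could compare the microcode revision actually in use and whether late loading is blacklisted; a minimal sketch using standard Debian/PVE tooling, nothing Proxmox-specific assumed:
# microcode revision the CPUs are currently running
grep -m1 microcode /proc/cpuinfo
# microcode messages from the current boot (early-load updates show up here)
dmesg | grep -i microcode
# check whether late loading of the microcode module is blacklisted
grep -r microcode /etc/modprobe.d/ 2>/dev/null
# installed intel-microcode package version, if any
dpkg -l intel-microcode 2>/dev/null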