VM shutdown, KVM: entry failed, hardware error 0x80000021

Stoiko Ivanov

Proxmox Staff Member
Thank you! Do I have to do this on all three nodes, or may I just test the node where the affected VM runs?
If the cluster nodes have similar/identical hardware, I would recommend disabling tdp_mmu on all of them (by setting it on the kernel command line or in /etc/modprobe.d/ and rebooting afterwards).
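
For the /etc/modprobe.d/ route mentioned above, a minimal sketch could look like the following. The file name kvm-disable-tdp-mmu.conf and the helper name write_tdp_mmu_conf are chosen here purely for illustration and are not prescribed by Proxmox; the target directory is a parameter so the helper can be tried on a scratch directory first.

```shell
#!/bin/sh
# Sketch only: write a modprobe option file that turns off the KVM TDP MMU.
# The file name and the function name are assumptions for illustration.
write_tdp_mmu_conf() {
    dir="${1:-/etc/modprobe.d}"
    # "options <module> <param>=<value>" is the standard modprobe.d syntax.
    printf 'options kvm tdp_mmu=N\n' > "$dir/kvm-disable-tdp-mmu.conf"
}
```

Run as root, `write_tdp_mmu_conf` with no argument writes to /etc/modprobe.d/. Refreshing the initramfs (`update-initramfs -u -k all`) before the reboot is commonly recommended so the option is picked up early; after the reboot, /sys/module/kvm/parameters/tdp_mmu should read N.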
 

macpip

Member
Did you try the workaround that addresses our reproducer for this? Namely:
As you suggested, I've tried the workaround:

Code:
vi /etc/kernel/cmdline
root=ZFS=rpool/ROOT/pve-1 boot=zfs kvm.tdp_mmu=N
proxmox-boot-tool refresh
reboot
cat /sys/module/kvm/parameters/tdp_mmu
N

The installation completed at the first attempt with no issues!!!
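
For anyone who wants to script the edit of /etc/kernel/cmdline instead of using vi, here is a rough sketch. add_tdp_mmu_flag is a hypothetical helper name, and the sketch assumes the file holds a single line, as on a proxmox-boot-tool managed system, and that GNU grep and sed are available. It does not handle a pre-existing kvm.tdp_mmu=Y entry.

```shell
#!/bin/sh
# Sketch: append kvm.tdp_mmu=N to a one-line kernel command line file,
# unless it is already there. add_tdp_mmu_flag is a hypothetical helper.
add_tdp_mmu_flag() {
    file="${1:-/etc/kernel/cmdline}"
    # -F: fixed string, -w: whole word, -q: quiet
    if grep -qwF 'kvm.tdp_mmu=N' "$file"; then
        return 0    # already set, nothing to do
    fi
    # Append the parameter to the end of the first line.
    sed -i '1 s/$/ kvm.tdp_mmu=N/' "$file"
}
```

After `add_tdp_mmu_flag`, the `proxmox-boot-tool refresh` and `reboot` steps are still needed. As far as I know, /etc/kernel/cmdline is only used on systems that boot via proxmox-boot-tool with systemd-boot; GRUB based installs set the parameter in GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and run `update-grub` instead.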
 
Could someone who is affected by the `KVM: entry failed, hardware error 0x80000021` issue please try setting the tdp_mmu module parameter for kvm to 'N'?
Giving it a go... Unpinned kernel 5.13.19-6-pve and set kvm.tdp_mmu=N.
I don't have a reliable way to recreate the issue, but my Windows 11 VM would usually crash on me within a few days' time. I'll report back!
 

t.lamprecht

Proxmox Staff Member
Could it be that this patch here is related?
Seems to be merged in v5.18.
We tried internal builds of 5.18 a while ago to test for a possible fix in newer kernel versions, and booting with 5.18 wouldn't fix (or improve) triggering our reproducer, so I'd figure no.

I also stumbled upon this patch when searching LKML for related things, but from the commit message it also reads like the issue fixed there would emit the message
cpuid_data is full, no space for cpuid(eax:0x8000001d,ecx:0x3e)
which I cannot remember having seen, neither in our tests nor in the logs posted here.
 
Hi,
Unfortunately, yesterday we hit the same problem on a new Proxmox VE 7.2.4 with only one VM, a new Windows Server 2022: the VM crashed and was found powered off for no reason. The server is a new Dell T340 with a PERC 330 controller. I am now installing the old kernel and praying it won't happen again, as these are production servers! Why can't this serious problem be solved?
I attach the package versions and part of the syslog below. If you need anything else, let me know!
Maybe I am repeating myself, but we have had this problem for more than a month at all our customers who run an updated Proxmox VE with Windows Server 2022, across 4 different installations on different servers. How can we fix it, apart from using the old kernel? Thank you.


Code:
proxmox-ve: 7.2-1 (running kernel: 5.15.35-2-pve)
pve-manager: 7.2-4 (running version: 7.2-4/ca9d43cc)
pve-kernel-5.15: 7.2-4
pve-kernel-helper: 7.2-4
pve-kernel-5.15.35-2-pve: 5.15.35-5
pve-kernel-5.15.30-2-pve: 5.15.30-3
pve-kernel-5.13.19-6-pve: 5.13.19-15
ceph-fuse: 15.2.16-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-2
libpve-storage-perl: 7.2-4
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.12-1
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.3-1
proxmox-backup-file-restore: 2.2.3-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-1
pve-container: 4.2-1
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.4-2
pve-ha-manager: 3.3-4
pve-i18n: 2.7-2
pve-qemu-kvm: 6.2.0-10
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1



Code:
Jul 01 19:07:26 matrasprox QEMU[79174]: KVM: entry failed, hardware error 0x80000021
Jul 01 19:07:26 matrasprox kernel: set kvm_intel.dump_invalid_vmcs=1 to dump internal KVM state.
Jul 01 19:07:26 matrasprox QEMU[79174]: If you're running a guest on an Intel machine without unrestricted mode
Jul 01 19:07:26 matrasprox QEMU[79174]: support, the failure can be most likely due to the guest entering an invalid
Jul 01 19:07:26 matrasprox QEMU[79174]: state for Intel VT. For example, the guest maybe running in big real mode
Jul 01 19:07:26 matrasprox QEMU[79174]: which is not supported on less recent Intel processors.
Jul 01 19:07:26 matrasprox QEMU[79174]: EAX=000022e2 EBX=63d9e180 ECX=00000001 EDX=00000000
Jul 01 19:07:26 matrasprox QEMU[79174]: ESI=bc81b140 EDI=63daa340 EBP=00000000 ESP=65453d40
Jul 01 19:07:26 matrasprox QEMU[79174]: EIP=00008000 EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=1 HLT=0
Jul 01 19:07:26 matrasprox QEMU[79174]: ES =0000 00000000 ffffffff 00809300
Jul 01 19:07:26 matrasprox QEMU[79174]: CS =b600 7ffb6000 ffffffff 00809300
Jul 01 19:07:26 matrasprox QEMU[79174]: SS =0000 00000000 ffffffff 00809300
Jul 01 19:07:26 matrasprox QEMU[79174]: DS =0000 00000000 ffffffff 00809300
Jul 01 19:07:26 matrasprox QEMU[79174]: FS =0000 00000000 ffffffff 00809300
Jul 01 19:07:26 matrasprox QEMU[79174]: GS =0000 00000000 ffffffff 00809300
Jul 01 19:07:26 matrasprox QEMU[79174]: LDT=0000 00000000 000fffff 00000000
Jul 01 19:07:26 matrasprox QEMU[79174]: TR =0040 63dad000 00000067 00008b00
Jul 01 19:07:26 matrasprox QEMU[79174]: GDT=     63daefb0 00000057
Jul 01 19:07:26 matrasprox QEMU[79174]: IDT=     00000000 00000000
Jul 01 19:07:26 matrasprox QEMU[79174]: CR0=00050032 CR2=39f33dcc CR3=001ae000 CR4=00000000
Jul 01 19:07:26 matrasprox QEMU[79174]: DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
Jul 01 19:07:26 matrasprox QEMU[79174]: DR6=00000000ffff0ff0 DR7=0000000000000400
Jul 01 19:07:26 matrasprox QEMU[79174]: EFER=0000000000000000
Jul 01 19:07:26 matrasprox QEMU[79174]: Code=kvm: ../hw/core/cpu-sysemu.c:77: cpu_asidx_from_attrs: Assertion `ret < cpu->num_ases && ret >= 0' failed.
Jul 01 19:07:26 matrasprox kernel: fwbr1010i0: port 2(tap1010i0) entered disabled state
Jul 01 19:07:26 matrasprox kernel: fwbr1010i0: port 2(tap1010i0) entered disabled state
Jul 01 19:07:26 matrasprox systemd[1]: 1010.scope: Succeeded.
Jul 01 19:07:26 matrasprox systemd[1]: 1010.scope: Consumed 4min 45.838s CPU time.
Jul 01 19:07:26 matrasprox qmeventd[81331]: Starting cleanup for 1010
Jul 01 19:07:26 matrasprox kernel: fwbr1010i0: port 1(fwln1010i0) entered disabled state
Jul 01 19:07:26 matrasprox kernel: vmbr0: port 2(fwpr1010p0) entered disabled state
Jul 01 19:07:26 matrasprox kernel: device fwln1010i0 left promiscuous mode
Jul 01 19:07:26 matrasprox kernel: fwbr1010i0: port 1(fwln1010i0) entered disabled state
Jul 01 19:07:27 matrasprox kernel: device fwpr1010p0 left promiscuous mode
Jul 01 19:07:27 matrasprox kernel: vmbr0: port 2(fwpr1010p0) entered disabled state
Jul 01 19:07:27 matrasprox qmeventd[81331]: Finished cleanup for 1010


Attached are the VM config and the error shown in Event Viewer on Windows Server 2022 after boot.
 

Attachments

  • Schermata 2022-07-02 alle 17.42.22.jpg (240.2 KB)
  • Schermata 2022-07-02 alle 17.42.38.jpg (234 KB)
  • Schermata 2022-07-02 alle 17.41.37.jpg (347.4 KB)

itNGO

Well-Known Member
You can still revert to the older kernel, which is a valid workaround, and run without problems until the fix is released.

5.13.19-6 helps and it does not hurt to use it for several more months...

No need to cry, this can take months to fix. Even the largest companies struggle to get some problems fixed within 6 months... Remember PrintNightmare?
 

Ad3t0

New Member
Hello,

I pinned the 5.13.19-6 kernel on a Dell R620, but my Windows Server 2022 VM still hits this error randomly, even after 3 days of uptime. What does that mean for me?

Thanks for anyone's help!
 

Petr Svacina

Member
Pinning kernel:

Code:
proxmox-boot-tool kernel pin 5.13.19-6-pve

That did not force kernel 5.13.19-6-pve to be booted first, so check whether you REALLY are running this kernel after the reboot.
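
A quick way to script that check is sketched below. check_running_kernel is a hypothetical helper name; the second argument exists only so the comparison can be tried without rebooting.

```shell
#!/bin/sh
# Sketch: compare the running kernel release with the one that was pinned.
# check_running_kernel is a hypothetical helper name.
check_running_kernel() {
    expected="$1"
    running="${2:-$(uname -r)}"   # defaults to the kernel actually booted
    if [ "$running" = "$expected" ]; then
        echo "OK: running $running"
    else
        echo "MISMATCH: running $running, expected $expected"
        return 1
    fi
}
```

On the node, `check_running_kernel 5.13.19-6-pve` should print an OK line after a successful pin and reboot. `proxmox-boot-tool kernel list` should also show which kernel is currently pinned.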
 

Ad3t0

New Member
Kernel Version

Linux 5.13.19-6-pve #1 SMP PVE 5.13.19-15 (Tue, 29 Mar 2022 15:59:50 +0200)
PVE Manager Version

pve-manager/7.2-5/12f1e639

This is what I see under the summary for my booted node
 

Petr Svacina

Member
Hi all, we have many PVE servers. I recently upgraded all of them to the latest PVE 7.2.4 or 7.2.5, along with the latest 5.15 kernel.
All these servers have Xeon(R) Silver 41XX processors, and we have NO issue with the VM crashes mentioned here. The servers are mostly Supermicro or HP.

But we also have a PVE cluster with Intel(R) Xeon(R) Gold 5218 CPUs, and on this cluster we were forced to downgrade to the 5.13.19-6-pve kernel because of the same error mentioned here. Those servers are Supermicro.

And we have one Supermicro server with an Intel(R) Xeon(R) Gold 6226R CPU, which also seems to be rock stable.

Does this help? Does anyone have an Intel Xeon(R) Silver 41XX CPU with this issue?

Thanks
 

alfred.johansen

New Member
Hi,

We experience the same issue on a NODE with this CPU: Intel(R) Xeon(R) Silver 4215 CPU
 
Hi!
I checked the last 3 Proxmox servers I have access to: two have an Intel Silver 4208 and the third has an E-2236. All three servers have the problem only on Windows Server 2022 VMs, with no problems on other VMs. I attach screenshots of the 3 CPUs.

have a nice day!
 

Attachments

  • Server_1.jpg (51.4 KB)
  • Server_3.jpg (57.1 KB)
  • Server_2.jpg (52.1 KB)

macpip

Member
The nodes from my previous posts, where the problem happened frequently, have this CPU:
Model name: Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz


Work around applied on 2022.09.29 20:30.
kvm.tdp_mmu=N

Code:
~# uname -a
Linux pve02 5.15.35-3-pve #1 SMP PVE 5.15.35-6 (Fri, 17 Jun 2022 13:42:35 +0200) x86_64 GNU/Linux
~# cat /sys/module/kvm/parameters/tdp_mmu
N

As you can see from the attached image, VM 921 stopped frequently (see all the starts without a matching stop). After setting kvm.tdp_mmu=N it never stopped again.
Furthermore, a new Windows Server 2022 Standard VM that I was not able to install before the workaround now installed and is working fine.
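
To run the same sysfs check from a script, for example on each node of a cluster, something like this sketch might help. tdp_mmu_disabled is a hypothetical helper name; the path defaults to the parameter file shown above and can be overridden for a dry run.

```shell
#!/bin/sh
# Sketch: succeed (exit 0) only if the KVM TDP MMU parameter reads "N".
# tdp_mmu_disabled is a hypothetical helper name.
tdp_mmu_disabled() {
    param="${1:-/sys/module/kvm/parameters/tdp_mmu}"
    # Exit status 2 means the parameter file could not be read at all.
    [ -r "$param" ] || { echo "cannot read $param" >&2; return 2; }
    [ "$(cat "$param")" = "N" ]
}
```

For example: `tdp_mmu_disabled && echo "workaround active" || echo "workaround NOT active"`.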

On another node in another cluster with CPU
Model name: Intel(R) Xeon(R) Bronze 3106 CPU @ 1.70GHz
and the workaround not applied, I have experienced the issue about 3 times in the last 20 days on a FreeBSD VM.

I hope this info helps. Please tell me if I can do something to help resolve the issue.
 

Attachments

  • Schermata_2022.07.04_18.15.28.png (64.2 KB)

piggie-mickie

New Member
Before I downgraded the kernel to 5.13.19-6, I got this issue regularly during backup on an Exchange 2016 server on Win2016, and in the last days also during the day (with no backup task running).

My home lab is a Z390 chipset board (Gigabyte) with an Intel i7-8700 (Coffee Lake), 128 GB RAM and 2x2 TB NVMe. There are 24 VMs running on it, but only the Exchange VM and sometimes a Windows 11 Insider VM were affected by this issue. All other machines have never crashed so far (Windows 2008R2, 2016/19 and 22, Win7, Win11, FreeBSD, Ubuntu, Debian, nested ESXi 6.7 with macOS and Android VMs).

Now with the older kernel version everything has been running stable for the last week.
Small update: last week I unpinned kernel 5.13.19-6-pve and switched back to the latest kernel version, together with the option kvm tdp_mmu=N, and rebooted as suggested.
So far no issues, no VM has crashed.
A small side effect: after enabling the option kvm tdp_mmu=N I had to switch the CPU type of my VMs from Skylake-Client to the older IvyBridge, since newer ones aren't supported with this setting.
 
Herewith confirmation as well that running with TDP_MMU disabled resolves this problem for us. It appears to primarily affect Windows 2019 and Windows 2022 guests. It also took us a while to identify, as most VMs got hit by this after hours, when they are more idle than during office hours. The restarts were attributed to VMs restarting to install Windows updates, whereas searching for 'hardware error' revealed the intermittent and random pattern:
[Image: 1657108892302.png]
for f in /var/log/syslog*; do zgrep 'hardware error' "$f"; done | sort -k1M -k2n -k3
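
To turn that search into a per-day count, which makes the after-hours pattern easier to spot, a small sketch along these lines could be used. count_hw_errors_per_day is a hypothetical helper name; it assumes classic syslog timestamps (month, then day) at the start of each line and GNU sort for the month ordering.

```shell
#!/bin/sh
# Sketch: count 'hardware error 0x80000021' lines per day from syslog-style
# input on stdin. count_hw_errors_per_day is a hypothetical helper name.
count_hw_errors_per_day() {
    # Key each match on "<month> <day>" and count the occurrences.
    awk '/hardware error 0x80000021/ { n[$1 " " $2]++ }
         END { for (d in n) print d, n[d] }' |
        LC_ALL=C sort -k1,1M -k2,2n   # order by month name, then day
}
```

Usage on a node: `zcat -f /var/log/syslog* | count_hw_errors_per_day` (zcat -f also passes through the uncompressed files).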


I see the following notes in the PVE 6 to 7 upgrade notes, not sure how long they have been there:
https://pve.proxmox.com/wiki/Upgrade_from_6.x_to_7.0

[Image: 1657108585332.png]


I can only presume that this issue became more prevalent with the latest series of Intel microcode updates which were released in response to additional Spectre vulnerabilities and mitigations from May 2022...
 

itNGO

Well-Known Member
After reading this, we checked that our older Intel CPUs have the microcode add-on installed and that "/etc/modprobe.d/intel-microcode-blacklist.conf" is set to NOT blacklist. We found one node where this had not been configured correctly.

We will unpin the kernel, correct the settings and give it another go tonight with the latest Enterprise repository kernel...
 
