Proxmox freeze on Dell Optiplex 3000 with or without VM

Raindeer

New Member
Sep 28, 2022
Hi Proxmox community,

I experience random freezes multiple times per day on a new Dell Optiplex 3000 (12th Gen Intel(R) Core(TM) i5-12500T). I have tried different kernels and BIOS settings, but nothing helps. I also updated to the newest BIOS (1.5.2, dated 09/02/2022) and disabled power states, WLAN, Bluetooth, etc., but Proxmox still crashes with or without VMs running on it.
It was stable in June and July. I was travelling in August, and when I came back I ran an update/upgrade and it started to crash.
When it freezes or crashes, it doesn't even respond to ping; I have to do a hard reset.

Version
pve-manager/7.2-11/b76d3178 (running kernel: 5.15.35-3-pve)

root@prox:~# dmesg | grep microcode
[ 1.066361] microcode: sig=0x90675, pf=0x1, revision=0x1e
[ 1.066628] microcode: Microcode Update Driver: v2.2.

Kernels (5.15 and 5.19 freeze the same way):
pve-kernel-5.15.30-2-pve/stable,now 5.15.30-3 amd64 [installed]
pve-kernel-5.15.35-3-pve/stable,now 5.15.35-6 amd64 [installed,auto-removable]
pve-kernel-5.15.39-1-pve/stable,now 5.15.39-1 amd64 [installed,auto-removable]
pve-kernel-5.15.53-1-pve/stable,now 5.15.53-1 amd64 [installed,auto-removable]
pve-kernel-5.15.60-1-pve/stable,now 5.15.60-1 amd64 [installed,automatic]
pve-kernel-5.15/stable,now 7.2-11 all [installed]
pve-kernel-5.19.7-1-pve/stable,now 5.19.7-1 amd64 [installed,automatic]
pve-kernel-5.19/stable,now 7.2-11 all [installed]
pve-kernel-helper/stable,now 7.2-12 all [installed]

lspci:
00:00.0 Host bridge: Intel Corporation Device 4650 (rev 05)
00:02.0 VGA compatible controller: Intel Corporation Device 4690 (rev 0c)
00:04.0 Signal processing controller: Intel Corporation Device 461d (rev 05)
00:08.0 System peripheral: Intel Corporation Device 464f (rev 05)
00:14.0 USB controller: Intel Corporation Device 7ae0 (rev 11)
00:14.2 RAM memory: Intel Corporation Device 7aa7 (rev 11)
00:16.0 Communication controller: Intel Corporation Device 7ae8 (rev 11)
00:17.0 SATA controller: Intel Corporation Device 7ae2 (rev 11)
00:1a.0 PCI bridge: Intel Corporation Device 7ac8 (rev 11)
00:1c.0 PCI bridge: Intel Corporation Device 7aba (rev 11)
00:1f.0 ISA bridge: Intel Corporation Device 7a86 (rev 11)
00:1f.3 Audio device: Intel Corporation Device 7ad0 (rev 11)
00:1f.4 SMBus: Intel Corporation Device 7aa3 (rev 11)
00:1f.5 Serial bus controller [0c80]: Intel Corporation Device 7aa4 (rev 11)
01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/980PRO
02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 1b)

I also get screen flickering with this error on the login console: [drm] *ERROR* CPU pipe A FIFO underrun: transcoder

Interesting things from logs (not sure if they are related)
pnp 00:04: disabling [mem 0xc0000000-0xcfffffff] because it overlaps 0000:00:02.0 BAR 9 [mem 0x00000000-0xdfffffff 64bit pref]
hpet_acpi_add: no address or irqs in _CRS
secureboot: Secure boot could not be determined (mode 0)
ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
pnp 00:04: disabling [mem 0xc0000000-0xcfffffff] because it overlaps 0000:00:02.0 BAR 9 [mem 0x00000000-0xdfffffff 64bit pref]
Sep 28 11:31:01 prox kernel: device-mapper: core: CONFIG_IMA_DISABLE_HTABLE is disabled. Duplicate IMA measurements will not be recorded in the IMA log.
Sep 28 11:31:01 prox kernel: device-mapper: uevent: version 1.0.3
Sep 28 11:31:01 prox kernel: device-mapper: ioctl: 4.47.0-ioctl (2022-07-28) initialised: dm-devel@redhat.com
Sep 28 11:31:01 prox kernel: platform eisa.0: Probing EISA bus 0
Sep 28 11:31:01 prox kernel: platform eisa.0: EISA: Cannot allocate resource for mainboard
Sep 28 11:31:01 prox kernel: platform eisa.0: Cannot allocate resource for EISA slot 1
Sep 28 11:31:01 prox kernel: platform eisa.0: Cannot allocate resource for EISA slot 2
Sep 28 11:31:01 prox kernel: platform eisa.0: Cannot allocate resource for EISA slot 3
Sep 28 11:31:01 prox kernel: platform eisa.0: Cannot allocate resource for EISA slot 4
Sep 28 11:31:01 prox kernel: platform eisa.0: Cannot allocate resource for EISA slot 5
Sep 28 11:31:01 prox kernel: platform eisa.0: Cannot allocate resource for EISA slot 6
Sep 28 11:31:01 prox kernel: platform eisa.0: Cannot allocate resource for EISA slot 7
Sep 28 11:31:01 prox kernel: platform eisa.0: Cannot allocate resource for EISA slot 8
Sep 28 11:31:01 prox kernel: acpi PNP0C14:01: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:00)
Sep 28 11:31:01 prox kernel: wmi_bus wmi_bus-PNP0C14:02: WQBC data block query control method not found
Sep 28 11:31:01 prox kernel: acpi PNP0C14:02: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:00)
Sep 28 11:31:01 prox kernel: ahci 0000:00:17.0: version 3.0
Sep 28 11:31:01 prox kernel: ahci 0000:00:17.0: AHCI 0001.0301 32 slots 4 ports 6 Gbps 0x50 impl SATA mode
Sep 28 11:31:01 prox kernel: ahci 0000:00:17.0: flags: 64bit ncq sntf pm clo only pio slum part ems deso sadm sds
Sep 28 11:31:01 prox kernel: r8169 0000:02:00.0: can't disable ASPM; OS doesn't have ASPM control
Sep 28 11:31:01 prox kernel: spl: loading out-of-tree module taints kernel.
Sep 28 11:31:01 prox kernel: znvpair: module license 'CDDL' taints kernel.
Sep 28 11:31:01 prox kernel: Disabling lock debugging due to kernel taint
Sep 28 11:31:02 prox kernel: cfg80211: Loading compiled-in X.509 certificates for regulatory database
Sep 28 11:31:02 prox kernel: cfg80211: Loaded X.509 cert 'sforshee: 00b28ddf47aef9cea7'
Sep 28 11:31:02 prox kernel: platform regulatory.0: Direct firmware load for regulatory.db failed with error -2
Sep 28 11:31:02 prox kernel: cfg80211: failed to load regulatory.db
Sep 28 11:31:02 prox kernel: Creating 1 MTD partitions on "0000:00:1f.5":
Sep 28 11:31:02 prox kernel: 0x000000000000-0x000003000000 : "BIOS"
Sep 28 11:31:02 prox kernel: mtd: partition "BIOS" extends beyond the end of device "0000:00:1f.5" -- size truncated to 0x1000000
Sep 28 11:31:02 prox kernel: bluetooth hci0: Direct firmware load for mediatek/BT_RAM_CODE_MT7961_1_2_hdr.bin failed with error -2
Sep 28 11:31:02 prox kernel: Bluetooth: hci0: Failed to load firmware file (-2)
Sep 28 11:31:02 prox kernel: i915 0000:00:02.0: GuC firmware i915/tgl_guc_70.1.1.bin: fetch failed with error -2
Sep 28 11:31:02 prox kernel: i915 0000:00:02.0: Please file a bug on drm/i915; see https://gitlab.freedesktop.org/drm/intel/-/wikis/How-to-file-i915-bugs for details.
Sep 28 11:31:02 prox kernel: i915 0000:00:02.0: GuC firmware i915/tgl_guc_70.1.1.bin: fetch failed with error -2
Sep 28 11:31:02 prox kernel: i915 0000:00:02.0: Please file a bug on drm/i915; see https://gitlab.freedesktop.org/drm/intel/-/wikis/How-to-file-i915-bugs for details.
Sep 28 11:31:02 prox kernel: i915 0000:00:02.0: [drm] GuC firmware(s) can be downloaded from https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/i915
Sep 28 11:31:02 prox kernel: i915 0000:00:02.0: [drm] GuC firmware i915/tgl_guc_70.1.1.bin version 0.0
Sep 28 11:31:02 prox kernel: i915 0000:00:02.0: [drm] GuC is uninitialized
Sep 28 11:31:02 prox kernel: mei_pxp 0000:00:16.0-fbf6fcf1-96cf-4e2e-a6a6-1ba
Sep 28 11:31:04 prox kernel: kauditd_printk_skb: 4 callbacks suppressed
Sep 28 11:31:04 prox kernel: audit: type=1400 audit(1664382664.161:15): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="/usr/bin/lxc-start" pid=978 comm="apparmor_parser"

Let me know if there is more information I could provide to help solve this problem. By the way, is it possible to try the older 5.13 kernel with Proxmox 7.2? If so, how?

thanks
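
For the 5.13 question above, a rough sketch of one way to do it. The helper name installed_kernel_versions is made up for illustration, the pve-kernel-5.13 package name is assumed from the pve-kernel-5.15 / pve-kernel-5.19 naming in the list above, and the pin subcommand is assumed to be available in the installed pve-kernel-helper version. The helper only turns package-list lines into the version strings that the boot tool works with.

```shell
# Extract kernel ABI versions (e.g. "5.15.30-2-pve") from `apt list --installed`
# style lines read on stdin. Meta-packages such as pve-kernel-5.15 and
# pve-kernel-helper are skipped because they carry no full version in the name.
installed_kernel_versions() {
    sed -n 's|^pve-kernel-\([0-9][0-9.]*-[0-9][0-9]*-pve\)/.*|\1|p'
}

# Possible follow-up steps, run as root (assumptions, not verified on this host):
#   apt install pve-kernel-5.13              # assumed meta-package name
#   proxmox-boot-tool kernel list            # show kernels the boot tool knows about
#   proxmox-boot-tool kernel pin <version>   # boot this version by default
#   proxmox-boot-tool kernel unpin           # return to the newest installed kernel
```

Usage would be something like `apt list --installed 2>/dev/null | installed_kernel_versions`, then picking the 5.13 entry from the output as the `<version>` to pin.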
 
I would suggest testing your hardware with Memtest86 and similar tools to rule out hardware issues.
I forgot to mention that I ran the full Dell BIOS hardware diagnostics and Memtest86 as the very first thing, and both passed.

I'm now running the 5.13 kernel to see whether that crashes too.
 
It has now been 1 day and 19 hours with no crash on the 5.13 kernel! Everything also runs fine with VMs.

So there seems to be some problem with the 5.15 and 5.19 kernels, because with those it used to crash multiple times per day.

Any ideas how to proceed? Should I just stay on this old version and wait to see whether future kernels work better?
 
Make sure the system journal is persistent and see if you can find any kind of log message just before the crashes. Then you have something to search for and maybe someone somewhere already found a work-around.
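
A minimal sketch of that check, assuming journald runs with its default Storage=auto setting, in which case logs survive a reboot only if /var/log/journal exists. The helper name journal_is_persistent is hypothetical.

```shell
# Succeeds if the directory journald uses for persistent storage exists.
# The optional argument lets you point the check at another path for testing.
journal_is_persistent() {
    dir="${1:-/var/log/journal}"
    [ -d "$dir" ]
}

# If the check fails, persistence can be enabled like this (run as root):
#   mkdir -p /var/log/journal
#   systemctl restart systemd-journald
#
# After the next freeze and hard reset, look at the end of the previous boot:
#   journalctl --list-boots
#   journalctl -k -b -1 -n 100    # last 100 kernel messages of the previous boot
```

Keep in mind that on a hard freeze the last messages may never reach the disk, so an empty tail does not prove that nothing was logged.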

Wild guess: try the intel_iommu=off kernel parameter (which also disables passthrough) or iommu=pt (which uses the same mapping for non-passed-through devices as when the IOMMU is off), as 5.15 started enabling the IOMMU by default on Intel (but maybe newer versions reverted that change).
 
With the 5.13 kernel it was up 2 days 22 hours without crashing. I added intel_iommu=off to this line in /etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=off"
Then I ran update-grub and rebooted, but it still crashed.
I have now changed it to iommu=pt.

More info

root@prox:~# efibootmgr -v
BootCurrent: 0000
Timeout: 2 seconds
BootOrder: 0000,0005
Boot0000* proxmox HD(2,GPT,039b21cb-93cd-4b60-83f9-8d8852bc8bee,0x800,0x100000)/File(\EFI\proxmox\grubx64.efi)
Boot0005* UEFI WDC WDS100T1R0A-68A4W0 213010802475 HD(2,GPT,039b21cb-93cd-4b60-83f9-8d8852bc8bee,0x800,0x100000)/File(\EFI\Boot\BootX64.efi)N.....YM....R,Y.

root@prox:~# dmesg | grep DMAR
[ 0.006037] ACPI: DMAR 0x0000000061761000 000088 (v02 INTEL Dell Inc 00000002 01000013)
[ 0.006067] ACPI: Reserving DMAR table memory at [mem 0x61761000-0x61761087]
[ 0.067569] DMAR: IOMMU disabled
[ 0.171377] DMAR: Host address width 39
[ 0.171377] DMAR: DRHD base: 0x000000fed90000 flags: 0x0
[ 0.171381] DMAR: dmar0: reg_base_addr fed90000 ver 4:0 cap 1c0000c40660462 ecap 29a00f0505e
[ 0.171382] DMAR: DRHD base: 0x000000fed91000 flags: 0x1
[ 0.171383] DMAR: dmar1: reg_base_addr fed91000 ver 5:0 cap d2008c40660462 ecap f050da
[ 0.171386] DMAR: RMRR base: 0x0000006c000000 end: 0x000000703fffff
[ 0.171388] DMAR-IR: IOAPIC id 2 under DRHD base 0xfed91000 IOMMU 1
[ 0.171389] DMAR-IR: HPET id 0 under DRHD base 0xfed91000
[ 0.171389] DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
[ 0.172900] DMAR-IR: Enabled IRQ remapping in x2apic mode
[ 0.352244] pci 0000:00:02.0: DMAR: Skip IOMMU disabling for graphics

I hope I did it correctly?
 
I can't tell from this, but the IOMMU does look disabled. Check whether the output of cat /proc/cmdline matches the kernel parameters you wanted to set.
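
A small sketch of that check, for illustration only. The helper name cmdline_has_param is made up, and it matches whole space-separated parameters so that, for example, iommu=pt is not confused with intel_iommu=off.

```shell
# Succeeds if the given parameter appears as a whole word on the kernel
# command line. The optional second argument is a file to read instead of
# /proc/cmdline, which is handy for checking a saved copy.
cmdline_has_param() {
    param="$1"
    file="${2:-/proc/cmdline}"
    # Pad both ends with spaces so only complete parameters match.
    case " $(tr '\n' ' ' < "$file") " in
        *" $param "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `cmdline_has_param intel_iommu=off && echo "parameter is active"` prints the message only if the running kernel was actually booted with it. If it prints nothing, the GRUB change did not take effect.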
As I said, it was just a guess and it appears that an enabled IOMMU was not the problem (as disabling did not resolve it).

EDIT: I have no clue where to look next.
 
It still crashes. I've tried the intel_iommu=off and iommu=pt parameters on the 5.15.60-1-pve and 5.19.7-1-pve kernels.

Example of params:
root@prox:~# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-5.19.7-1-pve root=/dev/mapper/pve-root ro quiet iommu=pt

What should I look at next?
 
I guess I'll just use the 5.13 kernel until I can try a 6.x kernel. I can't figure out what causes the crashes. The Dell hardware is new and works properly. Everything was fine before I updated Proxmox to 7.2.
I also reset all BIOS settings and tested disabling features, but nothing seems to work except running kernel 5.13.

Here are the error rows from the Proxmox journalctl logs, in case someone can tell what is wrong:

Oct 04 14:17:55 prox kernel: secureboot: Secure boot could not be determined (mode 0)
Oct 04 14:17:55 prox kernel: secureboot: Secure boot could not be determined (mode 0)
Oct 04 14:17:55 prox kernel: ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
Oct 04 14:17:55 prox kernel: pnp 00:04: disabling [mem 0xc0000000-0xcfffffff] because it overlaps 0000:00:02.0 BAR 9 [mem 0x00000000-0xdfffffff 64bit pref]
Oct 04 14:17:55 prox kernel: device-mapper: core: CONFIG_IMA_DISABLE_HTABLE is disabled. Duplicate IMA measurements will not be recorded in the IMA log.
Oct 04 14:17:55 prox kernel: platform eisa.0: EISA: Cannot allocate resource for mainboard
Oct 04 14:17:55 prox kernel: platform eisa.0: Cannot allocate resource for EISA slot 1
Oct 04 14:17:55 prox kernel: platform eisa.0: Cannot allocate resource for EISA slot 2
Oct 04 14:17:55 prox kernel: platform eisa.0: Cannot allocate resource for EISA slot 3
Oct 04 14:17:55 prox kernel: platform eisa.0: Cannot allocate resource for EISA slot 4
Oct 04 14:17:55 prox kernel: platform eisa.0: Cannot allocate resource for EISA slot 5
Oct 04 14:17:55 prox kernel: platform eisa.0: Cannot allocate resource for EISA slot 6
Oct 04 14:17:55 prox kernel: platform eisa.0: Cannot allocate resource for EISA slot 7
Oct 04 14:17:55 prox kernel: platform eisa.0: Cannot allocate resource for EISA slot 8
Oct 04 14:17:55 prox kernel: acpi PNP0C14:01: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:00)
Oct 04 14:17:55 prox kernel: wmi_bus wmi_bus-PNP0C14:02: WQBC data block query control method not found
Oct 04 14:17:55 prox kernel: acpi PNP0C14:02: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:00)
Oct 04 14:17:55 prox kernel: acpi PNP0C14:03: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:00)
Oct 04 14:17:55 prox kernel: r8169 0000:02:00.0: can't disable ASPM; OS doesn't have ASPM control
Oct 04 14:17:55 prox kernel: acpi PNP0C14:04: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:00)
Oct 04 14:17:55 prox kernel: acpi PNP0C14:05: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:00)
Oct 04 14:17:55 prox kernel: acpi PNP0C14:06: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:00)
Oct 04 14:17:55 prox kernel: acpi PNP0C14:07: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:00)
Oct 04 14:17:55 prox kernel: acpi PNP0C14:08: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:00)
Oct 04 14:17:55 prox kernel: usb: port power management may be unreliable
Oct 04 14:17:56 prox kernel: spl: loading out-of-tree module taints kernel.
Oct 04 14:17:56 prox kernel: znvpair: module license 'CDDL' taints kernel.
Oct 04 14:17:56 prox kernel: Disabling lock debugging due to kernel taint
Oct 04 14:17:56 prox kernel: mtd: partition "BIOS" extends beyond the end of device "0000:00:1f.5" -- size truncated to 0x1000000
Oct 04 14:17:56 prox kernel: i915 0000:00:02.0: GuC firmware i915/tgl_guc_70.1.1.bin: fetch failed with error -2
Oct 04 14:17:58 prox kernel: kauditd_printk_skb: 4 callbacks suppressed

All of these are yellow lines, except this one, which is red:
Oct 04 14:17:56 prox kernel: i915 0000:00:02.0: GuC firmware i915/tgl_guc_70.1.1.bin: fetch failed with error -2
 
I think I finally solved the problem, or at least found a workaround.

If I go into the BIOS settings and disable the CPU C-states (power states), it does not crash anymore. I tested this on the 5.19 kernel.

In the BIOS:
Performance / C-States:
The switch is called Enable C-State control (Feature enables the CPU's ability to enter and exit low power states).
Set this to Off (default is On).
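
To see from within Linux which idle states the kernel still uses after changing that switch, something like the following sketch could help. The helper name list_cstates is made up; the paths follow the kernel's standard cpuidle sysfs layout, and what exactly shows up with the BIOS switch off will depend on the idle driver in use.

```shell
# Print each idle state of one CPU as "<name> <usage count>", one per line.
# The optional argument is the cpuidle directory to read (defaults to cpu0).
list_cstates() {
    base="${1:-/sys/devices/system/cpu/cpu0/cpuidle}"
    for state in "$base"/state*; do
        # Skip silently if the directory is missing or has no state entries.
        [ -r "$state/name" ] || continue
        printf '%s %s\n' "$(cat "$state/name")" "$(cat "$state/usage")"
    done
}
```

Running `list_cstates` twice a few minutes apart shows which states are actually being entered: a usage counter that keeps growing means that state is still in use.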
 
