Hi Proxmox community,
I am experiencing random freezes several times per day on a new Dell OptiPlex 3000 (12th Gen Intel(R) Core(TM) i5-12500T). I have tried different kernels and BIOS settings, but nothing helps. I also updated to the newest BIOS (1.5.2, dated 09/02/2022) and disabled power states, WLAN, Bluetooth, etc., but Proxmox still crashes, with or without VMs running on it.
It was stable in June and July. I was travelling in August, and after I came back and ran an update/upgrade, it started to crash.
When it freezes or crashes, it does not even respond to ping, and I have to do a hard reset.
Version
pve-manager/7.2-11/b76d3178 (running kernel: 5.15.35-3-pve)
root@prox:~# dmesg | grep microcode
[ 1.066361] microcode: sig=0x90675, pf=0x1, revision=0x1e
[ 1.066628] microcode: Microcode Update Driver: v2.2.
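In case it helps anyone compare microcode revisions before and after an update, here is a minimal Python sketch that pulls the signature and revision out of the dmesg line above. The function name `parse_microcode` and the regular expression are my own assumptions based only on the line format shown here, not an official tool.

```python
import re

# Matches the line printed by the kernel microcode driver, e.g.
# "[ 1.066361] microcode: sig=0x90675, pf=0x1, revision=0x1e"
# (pattern is an assumption derived from the output shown in this post)
MICROCODE_RE = re.compile(
    r"microcode: sig=(0x[0-9a-fA-F]+), pf=(0x[0-9a-fA-F]+), "
    r"revision=(0x[0-9a-fA-F]+)"
)


def parse_microcode(dmesg_text):
    """Return (sig, pf, revision) as ints from the first matching line, or None."""
    for line in dmesg_text.splitlines():
        match = MICROCODE_RE.search(line)
        if match:
            return tuple(int(group, 16) for group in match.groups())
    return None


sample = (
    "[ 1.066361] microcode: sig=0x90675, pf=0x1, revision=0x1e\n"
    "[ 1.066628] microcode: Microcode Update Driver: v2.2.\n"
)

result = parse_microcode(sample)
if result is not None:
    sig, pf, revision = result
    print(f"sig={sig:#x} pf={pf:#x} revision={revision:#x}")
```

Running the same function on the dmesg output of a different boot makes it easy to see whether the revision changed.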
Kernels (5.15 and 5.19 freeze the same way):
pve-kernel-5.15.30-2-pve/stable,now 5.15.30-3 amd64 [installed]
pve-kernel-5.15.35-3-pve/stable,now 5.15.35-6 amd64 [installed,auto-removable]
pve-kernel-5.15.39-1-pve/stable,now 5.15.39-1 amd64 [installed,auto-removable]
pve-kernel-5.15.53-1-pve/stable,now 5.15.53-1 amd64 [installed,auto-removable]
pve-kernel-5.15.60-1-pve/stable,now 5.15.60-1 amd64 [installed,automatic]
pve-kernel-5.15/stable,now 7.2-11 all [installed]
pve-kernel-5.19.7-1-pve/stable,now 5.19.7-1 amd64 [installed,automatic]
pve-kernel-5.19/stable,now 7.2-11 all [installed]
pve-kernel-helper/stable,now 7.2-12 all [installed]
lspci:
00:00.0 Host bridge: Intel Corporation Device 4650 (rev 05)
00:02.0 VGA compatible controller: Intel Corporation Device 4690 (rev 0c)
00:04.0 Signal processing controller: Intel Corporation Device 461d (rev 05)
00:08.0 System peripheral: Intel Corporation Device 464f (rev 05)
00:14.0 USB controller: Intel Corporation Device 7ae0 (rev 11)
00:14.2 RAM memory: Intel Corporation Device 7aa7 (rev 11)
00:16.0 Communication controller: Intel Corporation Device 7ae8 (rev 11)
00:17.0 SATA controller: Intel Corporation Device 7ae2 (rev 11)
00:1a.0 PCI bridge: Intel Corporation Device 7ac8 (rev 11)
00:1c.0 PCI bridge: Intel Corporation Device 7aba (rev 11)
00:1f.0 ISA bridge: Intel Corporation Device 7a86 (rev 11)
00:1f.3 Audio device: Intel Corporation Device 7ad0 (rev 11)
00:1f.4 SMBus: Intel Corporation Device 7aa3 (rev 11)
00:1f.5 Serial bus controller [0c80]: Intel Corporation Device 7aa4 (rev 11)
01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/980PRO
02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 1b)
I also get screen flickering on the login console, with this error: [drm] *ERROR* CPU pipe A FIFO underrun: transcoder
Interesting lines from the logs (not sure whether they are related):
pnp 00:04: disabling [mem 0xc0000000-0xcfffffff] because it overlaps 0000:00:02.0 BAR 9 [mem 0x00000000-0xdfffffff 64bit pref]
hpet_acpi_add: no address or irqs in _CRS
secureboot: Secure boot could not be determined (mode 0)
ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
pnp 00:04: disabling [mem 0xc0000000-0xcfffffff] because it overlaps 0000:00:02.0 BAR 9 [mem 0x00000000-0xdfffffff 64bit pref]
Sep 28 11:31:01 prox kernel: device-mapper: core: CONFIG_IMA_DISABLE_HTABLE is disabled. Duplicate IMA measurements will not be recorded in the IMA log.
Sep 28 11:31:01 prox kernel: device-mapper: uevent: version 1.0.3
Sep 28 11:31:01 prox kernel: device-mapper: ioctl: 4.47.0-ioctl (2022-07-28) initialised: dm-devel@redhat.com
Sep 28 11:31:01 prox kernel: platform eisa.0: Probing EISA bus 0
Sep 28 11:31:01 prox kernel: platform eisa.0: EISA: Cannot allocate resource for mainboard
Sep 28 11:31:01 prox kernel: platform eisa.0: Cannot allocate resource for EISA slot 1
Sep 28 11:31:01 prox kernel: platform eisa.0: Cannot allocate resource for EISA slot 2
Sep 28 11:31:01 prox kernel: platform eisa.0: Cannot allocate resource for EISA slot 3
Sep 28 11:31:01 prox kernel: platform eisa.0: Cannot allocate resource for EISA slot 4
Sep 28 11:31:01 prox kernel: platform eisa.0: Cannot allocate resource for EISA slot 5
Sep 28 11:31:01 prox kernel: platform eisa.0: Cannot allocate resource for EISA slot 6
Sep 28 11:31:01 prox kernel: platform eisa.0: Cannot allocate resource for EISA slot 7
Sep 28 11:31:01 prox kernel: platform eisa.0: Cannot allocate resource for EISA slot 8
Sep 28 11:31:01 prox kernel: acpi PNP0C14:01: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:00)
Sep 28 11:31:01 prox kernel: wmi_bus wmi_bus-PNP0C14:02: WQBC data block query control method not found
Sep 28 11:31:01 prox kernel: acpi PNP0C14:02: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:00)
Sep 28 11:31:01 prox kernel: ahci 0000:00:17.0: version 3.0
Sep 28 11:31:01 prox kernel: ahci 0000:00:17.0: AHCI 0001.0301 32 slots 4 ports 6 Gbps 0x50 impl SATA mode
Sep 28 11:31:01 prox kernel: ahci 0000:00:17.0: flags: 64bit ncq sntf pm clo only pio slum part ems deso sadm sds
Sep 28 11:31:01 prox kernel: r8169 0000:02:00.0: can't disable ASPM; OS doesn't have ASPM control
Sep 28 11:31:01 prox kernel: spl: loading out-of-tree module taints kernel.
Sep 28 11:31:01 prox kernel: znvpair: module license 'CDDL' taints kernel.
Sep 28 11:31:01 prox kernel: Disabling lock debugging due to kernel taint
Sep 28 11:31:02 prox kernel: cfg80211: Loading compiled-in X.509 certificates for regulatory database
Sep 28 11:31:02 prox kernel: cfg80211: Loaded X.509 cert 'sforshee: 00b28ddf47aef9cea7'
Sep 28 11:31:02 prox kernel: platform regulatory.0: Direct firmware load for regulatory.db failed with error -2
Sep 28 11:31:02 prox kernel: cfg80211: failed to load regulatory.db
Sep 28 11:31:02 prox kernel: Creating 1 MTD partitions on "0000:00:1f.5":
Sep 28 11:31:02 prox kernel: 0x000000000000-0x000003000000 : "BIOS"
Sep 28 11:31:02 prox kernel: mtd: partition "BIOS" extends beyond the end of device "0000:00:1f.5" -- size truncated to 0x1000000
Sep 28 11:31:02 prox kernel: bluetooth hci0: Direct firmware load for mediatek/BT_RAM_CODE_MT7961_1_2_hdr.bin failed with error -2
Sep 28 11:31:02 prox kernel: Bluetooth: hci0: Failed to load firmware file (-2)
Sep 28 11:31:02 prox kernel: i915 0000:00:02.0: GuC firmware i915/tgl_guc_70.1.1.bin: fetch failed with error -2
Sep 28 11:31:02 prox kernel: i915 0000:00:02.0: Please file a bug on drm/i915; see https://gitlab.freedesktop.org/drm/intel/-/wikis/How-to-file-i915-bugs for details.
Sep 28 11:31:02 prox kernel: i915 0000:00:02.0: GuC firmware i915/tgl_guc_70.1.1.bin: fetch failed with error -2
Sep 28 11:31:02 prox kernel: i915 0000:00:02.0: Please file a bug on drm/i915; see https://gitlab.freedesktop.org/drm/intel/-/wikis/How-to-file-i915-bugs for details.
Sep 28 11:31:02 prox kernel: i915 0000:00:02.0: [drm] GuC firmware(s) can be downloaded from https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/i915
Sep 28 11:31:02 prox kernel: i915 0000:00:02.0: [drm] GuC firmware i915/tgl_guc_70.1.1.bin version 0.0
Sep 28 11:31:02 prox kernel: i915 0000:00:02.0: [drm] GuC is uninitialized
Sep 28 11:31:02 prox kernel: mei_pxp 0000:00:16.0-fbf6fcf1-96cf-4e2e-a6a6-1ba
Sep 28 11:31:04 prox kernel: kauditd_printk_skb: 4 callbacks suppressed
Sep 28 11:31:04 prox kernel: audit: type=1400 audit(1664382664.161:15): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="/usr/bin/lxc-start" pid=978 comm="apparmor_parser"
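Several of the lines above are firmware load failures with error -2 (ENOENT, i.e. file not found). To make them easier to see, here is a small Python sketch that lists which firmware files each device failed to load. The function name `missing_firmware` and the regular expression are assumptions of mine based only on the message formats in this excerpt; other drivers may word their messages differently.

```python
import errno
import re

# Covers the two message shapes seen in this log excerpt:
#   "<dev>: Direct firmware load for <file> failed with error -2"
#   "<dev>: GuC firmware <file>: fetch failed with error -2"
FIRMWARE_FAIL_RE = re.compile(
    r"kernel: (?P<dev>.+?): (?:Direct firmware load for|GuC firmware) "
    r"(?P<file>\S+?):? (?:fetch )?failed with error (?P<err>-?\d+)"
)


def missing_firmware(log_text):
    """Map each firmware file that failed to load to (device, errno name)."""
    found = {}
    for line in log_text.splitlines():
        match = FIRMWARE_FAIL_RE.search(line)
        if match and match["file"] not in found:
            code = abs(int(match["err"]))
            found[match["file"]] = (match["dev"], errno.errorcode.get(code, str(code)))
    return found


sample = """\
Sep 28 11:31:02 prox kernel: platform regulatory.0: Direct firmware load for regulatory.db failed with error -2
Sep 28 11:31:02 prox kernel: bluetooth hci0: Direct firmware load for mediatek/BT_RAM_CODE_MT7961_1_2_hdr.bin failed with error -2
Sep 28 11:31:02 prox kernel: i915 0000:00:02.0: GuC firmware i915/tgl_guc_70.1.1.bin: fetch failed with error -2
Sep 28 11:31:02 prox kernel: i915 0000:00:02.0: [drm] GuC firmware i915/tgl_guc_70.1.1.bin version 0.0
"""

for fw_file, (device, err_name) in missing_firmware(sample).items():
    print(f"{device}: {fw_file} ({err_name})")
```

This only summarizes what the log already says; it does not tell whether the missing firmware is related to the freezes.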
Let me know if there is more information I could provide to help solve this problem. By the way, is it possible to try the older 5.13 kernel with Proxmox 7.2? If so, how?
Thanks