Hi Proxmox community,
I am also experiencing these random freezes, multiple times per day, on a new Dell Optiplex 3000 (12th Gen Intel(R) Core(TM) i5-12500T). I have tried different kernels and BIOS settings, but nothing helps. I also updated to the newest BIOS (1.5.2) and disabled power states, WLAN, Bluetooth, etc., but Proxmox still crashes, with or without VMs running on it.
It was stable in June and July. I was travelling in August, and when I came back I ran an update/upgrade and it started to crash.
When it freezes or crashes, it doesn't even respond to ping; I have to do a hard reset.
Version:
pve-manager/7.2-11/b76d3178 (running kernel: 5.15.35-3-pve)
root@prox:~# dmesg | grep microcode
[ 1.066361] microcode: sig=0x90675, pf=0x1, revision=0x1e
[ 1.066628] microcode: Microcode Update Driver: v2.2.
Kernels (5.15 and 5.19 freeze the same way):
pve-kernel-5.15.30-2-pve/stable,now 5.15.30-3 amd64 [installed]
pve-kernel-5.15.35-3-pve/stable,now 5.15.35-6 amd64 [installed,auto-removable]
pve-kernel-5.15.39-1-pve/stable,now 5.15.39-1 amd64 [installed,auto-removable]
pve-kernel-5.15.53-1-pve/stable,now 5.15.53-1 amd64 [installed,auto-removable]
pve-kernel-5.15.60-1-pve/stable,now 5.15.60-1 amd64 [installed,automatic]
pve-kernel-5.15/stable,now 7.2-11 all [installed]
pve-kernel-5.19.7-1-pve/stable,now 5.19.7-1 amd64 [installed,automatic]
pve-kernel-5.19/stable,now 7.2-11 all [installed]
pve-kernel-helper/stable,now 7.2-12 all [installed]
lspci:
00:00.0 Host bridge: Intel Corporation Device 4650 (rev 05)
00:02.0 VGA compatible controller: Intel Corporation Device 4690 (rev 0c)
00:04.0 Signal processing controller: Intel Corporation Device 461d (rev 05)
00:08.0 System peripheral: Intel Corporation Device 464f (rev 05)
00:14.0 USB controller: Intel Corporation Device 7ae0 (rev 11)
00:14.2 RAM memory: Intel Corporation Device 7aa7 (rev 11)
00:16.0 Communication controller: Intel Corporation Device 7ae8 (rev 11)
00:17.0 SATA controller: Intel Corporation Device 7ae2 (rev 11)
00:1a.0 PCI bridge: Intel Corporation Device 7ac8 (rev 11)
00:1c.0 PCI bridge: Intel Corporation Device 7aba (rev 11)
00:1f.0 ISA bridge: Intel Corporation Device 7a86 (rev 11)
00:1f.3 Audio device: Intel Corporation Device 7ad0 (rev 11)
00:1f.4 SMBus: Intel Corporation Device 7aa3 (rev 11)
00:1f.5 Serial bus controller [0c80]: Intel Corporation Device 7aa4 (rev 11)
01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/980PRO
02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 1b)
I also get screen flickering, with this error on the login console: [drm] *ERROR* CPU pipe A FIFO underrun: transcoder
Interesting things from the logs (not sure if they are related):
pnp 00:04: disabling [mem 0xc0000000-0xcfffffff] because it overlaps 0000:00:02.0 BAR 9 [mem 0x00000000-0xdfffffff 64bit pref]
hpet_acpi_add: no address or irqs in _CRS
secureboot: Secure boot could not be determined (mode 0)
ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
Sep 28 11:31:01 prox kernel: device-mapper: core: CONFIG_IMA_DISABLE_HTABLE is disabled. Duplicate IMA measurements will not be recorded in the IMA log.
Sep 28 11:31:01 prox kernel: device-mapper: uevent: version 1.0.3
Sep 28 11:31:01 prox kernel: device-mapper: ioctl: 4.47.0-ioctl (2022-07-28) initialised: dm-devel@redhat.com
Sep 28 11:31:01 prox kernel: platform eisa.0: Probing EISA bus 0
Sep 28 11:31:01 prox kernel: platform eisa.0: EISA: Cannot allocate resource for mainboard
Sep 28 11:31:01 prox kernel: platform eisa.0: Cannot allocate resource for EISA slot 1
Sep 28 11:31:01 prox kernel: platform eisa.0: Cannot allocate resource for EISA slot 2
Sep 28 11:31:01 prox kernel: platform eisa.0: Cannot allocate resource for EISA slot 3
Sep 28 11:31:01 prox kernel: platform eisa.0: Cannot allocate resource for EISA slot 4
Sep 28 11:31:01 prox kernel: platform eisa.0: Cannot allocate resource for EISA slot 5
Sep 28 11:31:01 prox kernel: platform eisa.0: Cannot allocate resource for EISA slot 6
Sep 28 11:31:01 prox kernel: platform eisa.0: Cannot allocate resource for EISA slot 7
Sep 28 11:31:01 prox kernel: platform eisa.0: Cannot allocate resource for EISA slot 8
Sep 28 11:31:01 prox kernel: acpi PNP0C14:01: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:00)
Sep 28 11:31:01 prox kernel: wmi_bus wmi_bus-PNP0C14:02: WQBC data block query control method not found
Sep 28 11:31:01 prox kernel: acpi PNP0C14:02: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:00)
Sep 28 11:31:01 prox kernel: ahci 0000:00:17.0: version 3.0
Sep 28 11:31:01 prox kernel: ahci 0000:00:17.0: AHCI 0001.0301 32 slots 4 ports 6 Gbps 0x50 impl SATA mode
Sep 28 11:31:01 prox kernel: ahci 0000:00:17.0: flags: 64bit ncq sntf pm clo only pio slum part ems deso sadm sds
Sep 28 11:31:01 prox kernel: r8169 0000:02:00.0: can't disable ASPM; OS doesn't have ASPM control
Sep 28 11:31:01 prox kernel: spl: loading out-of-tree module taints kernel.
Sep 28 11:31:01 prox kernel: znvpair: module license 'CDDL' taints kernel.
Sep 28 11:31:01 prox kernel: Disabling lock debugging due to kernel taint
Sep 28 11:31:02 prox kernel: cfg80211: Loading compiled-in X.509 certificates for regulatory database
Sep 28 11:31:02 prox kernel: cfg80211: Loaded X.509 cert 'sforshee: 00b28ddf47aef9cea7'
Sep 28 11:31:02 prox kernel: platform regulatory.0: Direct firmware load for regulatory.db failed with error -2
Sep 28 11:31:02 prox kernel: cfg80211: failed to load regulatory.db
Sep 28 11:31:02 prox kernel: Creating 1 MTD partitions on "0000:00:1f.5":
Sep 28 11:31:02 prox kernel: 0x000000000000-0x000003000000 : "BIOS"
Sep 28 11:31:02 prox kernel: mtd: partition "BIOS" extends beyond the end of device "0000:00:1f.5" -- size truncated to 0x1000000
Sep 28 11:31:02 prox kernel: bluetooth hci0: Direct firmware load for mediatek/BT_RAM_CODE_MT7961_1_2_hdr.bin failed with error -2
Sep 28 11:31:02 prox kernel: Bluetooth: hci0: Failed to load firmware file (-2)
Sep 28 11:31:02 prox kernel: i915 0000:00:02.0: GuC firmware i915/tgl_guc_70.1.1.bin: fetch failed with error -2
Sep 28 11:31:02 prox kernel: i915 0000:00:02.0: Please file a bug on drm/i915; see https://gitlab.freedesktop.org/drm/intel/-/wikis/How-to-file-i915-bugs for details.
Sep 28 11:31:02 prox kernel: i915 0000:00:02.0: [drm] GuC firmware(s) can be downloaded from https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/i915
Sep 28 11:31:02 prox kernel: i915 0000:00:02.0: [drm] GuC firmware i915/tgl_guc_70.1.1.bin version 0.0
Sep 28 11:31:02 prox kernel: i915 0000:00:02.0: [drm] GuC is uninitialized
Sep 28 11:31:02 prox kernel: mei_pxp 0000:00:16.0-fbf6fcf1-96cf-4e2e-a6a6-1ba
Sep 28 11:31:04 prox kernel: kauditd_printk_skb: 4 callbacks suppressed
Sep 28 11:31:04 prox kernel: audit: type=1400 audit(1664382664.161:15): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="/usr/bin/lxc-start" pid=978 comm="apparmor_parser"
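For anyone who wants to compare against their own machine, lines like the firmware and i915 errors above can be filtered out of a kernel log with something along these lines. This is only a rough sketch: the function name `filter_suspect_lines` is made up for illustration, the patterns are simply the message texts seen in this post, and a match does not mean the line is the cause of the freezes.

```shell
#!/bin/sh
# Hypothetical helper (not part of Proxmox or any package): prints kernel log
# lines that report a failed firmware load or a DRM "*ERROR*" message.
# It reads log text on stdin, so it works with dmesg, journalctl or a saved file.
filter_suspect_lines() {
    grep -E 'Direct firmware load for .* failed|fetch failed with error|\*ERROR\*'
}

# Example usage on a live system (commented out, needs the real tools):
#   dmesg | filter_suspect_lines
#   journalctl -k -b -1 | filter_suspect_lines   # kernel log of the previous boot
```

The `journalctl -k -b -1` variant only returns something if the journal is stored persistently, which is an assumption about the setup. After a hard reset, the previous boot's log is the one that would hold the last messages written before the freeze.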
Let me know if there is more information I could provide to help solve this problem. BTW, is it possible to try the older 5.13 kernel with Proxmox 7.2? If so, how?
Thanks