Proxmox server mystery reboot - CPU80: Core temperature is above threshold, CPU clock throttled

linuxteam

New Member
Nov 29, 2024
Hi There,

I have a Proxmox node (Intel(R) Xeon(R) Gold 6258R CPU @ 2.70GHz, running Proxmox VE 8.2.2) whose syslog often shows the following errors throughout the day. The node has also rebooted unexpectedly twice. We checked the logs but could not find any reboot/reset event, so we raised a request with our OEM, and they say there are no hardware-related issues.

2025-11-15T00:59:04.489054+05:30 proxmox-soc1 pvestatd[2748900]: Use of uninitialized value $size in int at /usr/share/perl5/PVE/Storage/LVMPlugin.pm line 133.
2025-11-15T00:59:04.489134+05:30 proxmox-soc1 pvestatd[2748900]: Use of uninitialized value $free in int at /usr/share/perl5/PVE/Storage/LVMPlugin.pm line 133.
2025-11-15T00:59:04.489159+05:30 proxmox-soc1 pvestatd[2748900]: Use of uninitialized value $lvcount in int at /usr/share/perl5/PVE/Storage/LVMPlugin.pm line 133.

[Sat Nov 15 01:07:17 2025] i40e 0000:5e:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on
[Sat Nov 15 01:07:17 2025] i40e 0000:5e:00.2: Error I40E_AQ_RC_ENOSPC, forcing overflow promiscuous on PF
[Sat Nov 15 01:07:17 2025] i40e 0000:5e:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on
[Sat Nov 15 01:07:17 2025] i40e 0000:5e:00.0: Error I40E_AQ_RC_ENOSPC, forcing overflow promiscuous on PF
[Sat Nov 15 01:07:17 2025] i40e 0000:5e:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on
[Sat Nov 15 01:07:17 2025] i40e 0000:5e:00.2: Error I40E_AQ_RC_ENOSPC, forcing overflow promiscuous on PF

[Sat Nov 15 01:32:36 2025] CPU69: Core temperature is above threshold, cpu clock is throttled (total events = 4)
[Sat Nov 15 01:32:36 2025] CPU13: Core temperature is above threshold, cpu clock is throttled (total events = 4)
[Sat Nov 15 01:32:36 2025] CPU107: Core temperature is above threshold, cpu clock is throttled (total events = 1)
[Sat Nov 15 01:32:36 2025] CPU51: Core temperature is above threshold, cpu clock is throttled (total events = 1)
[Sat Nov 15 01:32:38 2025] CPU31: Core temperature is above threshold, cpu clock is throttled (total events = 16)
[Sat Nov 15 01:32:38 2025] CPU87: Core temperature is above threshold, cpu clock is throttled (total events = 16)
[Sat Nov 15 01:32:38 2025] CPU100: Core temperature is above threshold, cpu clock is throttled (total events = 1)
[Sat Nov 15 01:32:38 2025] CPU44: Core temperature is above threshold, cpu clock is throttled (total events = 1)
[Sat Nov 15 01:32:39 2025] CPU32: Core temperature is above threshold, cpu clock is throttled (total events = 8)
[Sat Nov 15 01:32:39 2025] CPU88: Core temperature is above threshold, cpu clock is throttled (total events = 8)
[Sat Nov 15 01:32:41 2025] CPU71: Core temperature is above threshold, cpu clock is throttled (total events = 6)
[Sat Nov 15 01:32:41 2025] CPU15: Core temperature is above threshold, cpu clock is throttled (total events = 6)
[Sat Nov 15 01:32:42 2025] CPU4: Core temperature is above threshold, cpu clock is throttled (total events = 8)
[Sat Nov 15 01:32:42 2025] CPU60: Core temperature is above threshold, cpu clock is throttled (total events = 8)
[Sat Nov 15 01:32:43 2025] CPU78: Core temperature is above threshold, cpu clock is throttled (total events = 15)
[Sat Nov 15 01:32:43 2025] CPU22: Core temperature is above threshold, cpu clock is throttled (total events = 15)
[Sat Nov 15 01:32:43 2025] CPU61: Core temperature is above threshold, cpu clock is throttled (total events = 25)
[Sat Nov 15 01:32:43 2025] CPU5: Core temperature is above threshold, cpu clock is throttled (total events = 25)
[Sat Nov 15 01:32:44 2025] CPU16: Core temperature is above threshold, cpu clock is throttled (total events = 2)
[Sat Nov 15 01:32:44 2025] CPU72: Core temperature is above threshold, cpu clock is throttled (total events = 2)
[Sat Nov 15 01:32:45 2025] CPU19: Core temperature is above threshold, cpu clock is throttled (total events = 16)
[Sat Nov 15 01:32:45 2025] CPU75: Core temperature is above threshold, cpu clock is throttled (total events = 16)
[Sat Nov 15 01:32:45 2025] CPU64: Core temperature is above threshold, cpu clock is throttled (total events = 29)
[Sat Nov 15 01:32:45 2025] CPU8: Core temperature is above threshold, cpu clock is throttled (total events = 29)
[Sat Nov 15 01:32:45 2025] CPU73: Core temperature is above threshold, cpu clock is throttled (total events = 21)
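With this many throttle lines it can be easier to see which cores are affected by tallying them. The following is a minimal Python sketch, assuming the kernel message format is exactly as in the excerpt above; the function name `summarize_throttling` and the `SAMPLE_LINES` list are illustrative only.

```python
import re
from collections import defaultdict

# Matches kernel lines such as:
# [Sat Nov 15 01:32:36 2025] CPU69: Core temperature is above threshold,
#   cpu clock is throttled (total events = 4)
THROTTLE_RE = re.compile(
    r"CPU(?P<cpu>\d+): Core temperature is above threshold, "
    r"cpu clock is throttled \(total events = (?P<events>\d+)\)"
)


def summarize_throttling(lines):
    """Return {cpu_number: highest 'total events' value seen}."""
    summary = defaultdict(int)
    for line in lines:
        match = THROTTLE_RE.search(line)
        if match is None:
            continue  # not a throttle message, skip it
        cpu = int(match.group("cpu"))
        events = int(match.group("events"))
        summary[cpu] = max(summary[cpu], events)
    return dict(summary)


# Three lines copied from the excerpt above, plus one unrelated line.
SAMPLE_LINES = [
    "[Sat Nov 15 01:32:36 2025] CPU69: Core temperature is above threshold, cpu clock is throttled (total events = 4)",
    "[Sat Nov 15 01:32:36 2025] CPU13: Core temperature is above threshold, cpu clock is throttled (total events = 4)",
    "[Sat Nov 15 01:32:43 2025] CPU61: Core temperature is above threshold, cpu clock is throttled (total events = 25)",
    "[Sat Nov 15 01:07:17 2025] i40e 0000:5e:00.0: Error I40E_AQ_RC_ENOSPC, forcing overflow promiscuous on PF",
]

if __name__ == "__main__":
    for cpu, events in sorted(summarize_throttling(SAMPLE_LINES).items()):
        print(f"CPU{cpu}: {events} events")
```

Feeding it the full output of `dmesg -T` or `journalctl -k` shows whether the events cluster on particular cores or are spread across the whole package.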

I have 3 other nodes in the same cluster running the same hardware, and they are not showing this issue.

I’m trying to determine what exactly triggered the reset. Below is the relevant section of the BMC/kernel logs:

2025 Nov 14 19:26:49 UTC:(4.1(3c)):kernel:-:[platform_reset_cb_handler]:77:Platform Reset ISR -> ResetState: 1
2025 Nov 14 19:26:49 UTC:(4.1(3c)):cipmi: Intel ME Operating State:
2025 Nov 14 19:26:49 UTC:(4.1(3c)):cipmi: Intel ME is initializing.
2025 Nov 14 19:26:49 UTC:(4.1(3c)):DOCTOR-BMC: Stopping Application [tsa_server]
2025 Nov 14 19:26:49 UTC:(4.1(3c)):doctor-bmc: Application tsa_server is stopped.
2025 Nov 14 19:26:49 UTC:(4.1(3c)):cipmi: IPMI Request Message --> Chan:11, Netfn:0x00, Cmd:0x06, Data: 0x30 0x01, CC:0x00
2025 Nov 14 19:26:49 UTC:(4.1(3c)):kernel:-:[platform_reset_cb_handler]:77:Platform Reset ISR -> ResetState: 0
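The BMC entries are stamped in UTC while the host syslog uses +05:30, so the two need to be put on the same clock before they can be compared. A small Python sketch of that conversion, assuming the BMC timestamp layout shown above and an English locale for the month abbreviation (the helper name `bmc_utc_to_local` is made up for illustration):

```python
from datetime import datetime, timedelta, timezone

# Offset used by the host syslog lines in this thread (+05:30).
HOST_TZ = timezone(timedelta(hours=5, minutes=30))


def bmc_utc_to_local(stamp, local_tz=HOST_TZ):
    """Convert a BMC timestamp like '2025 Nov 14 19:26:49' (UTC) to host time."""
    parsed = datetime.strptime(stamp, "%Y %b %d %H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc).astimezone(local_tz)


if __name__ == "__main__":
    reset_local = bmc_utc_to_local("2025 Nov 14 19:26:49")
    print(reset_local.isoformat())  # 2025-11-15T00:56:49+05:30
```

The converted time can then be used as a window for `journalctl --since ... --until ...` on the host. If this BMC entry does correspond to the reboot in question, the pvestatd lines quoted earlier (00:59:04+05:30) fall roughly two minutes after it.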

What I need help with
  • Where in Proxmox (or on the host) should I look to find the exact reset reason?
  • Which logs are most reliable for detecting watchdog resets or kernel panics?
  • Is there a recommended way to correlate BMC reset triggers with Proxmox journal logs?
  • Do Intel ME state changes like “M0 without UMA” indicate any deeper firmware problem?
 
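Well, I think it's pretty clear: your CPU is overheating. The second log shows an IPMI request in the Chassis group (NetFn 0x00). I read Cmd 0x06 as the system restart cause query, i.e. the system asking itself why it restarted, but note that the standard IPMI command table lists Get System Restart Cause as Cmd 0x07 and Cmd 0x06 as Set Power Restore Policy, so check that against your BMC's documentation. The data field is likely proprietary, and you'd need the IPMI itself to tell you what it means.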

Presuming this is SuperMicro (because they are shit at helping you find these kinds of issues), I would suggest logging the temperature graphs from IPMI on an external system. On better systems (iLO/iDRAC types) you can retrieve at least several weeks' worth of temperature data.

Use something like Prometheus collectors if you need to. Your CPU is getting hot. Why? Most likely a broken fan or a cooler that came loose from the CPU, or less likely dust or something else clogging the intake. External factors such as datacenter cooling issues are possible too, but that calls for a proper investigation of the intake and exhaust temperatures, the core/CPU/motherboard temperatures, and the various chassis temperatures, provided your system exposes them (again, SuperMicro is bad at this).
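As a rough idea of what such external logging could look like, here is a Python sketch that parses the pipe-separated output of `ipmitool sdr type Temperature`. It assumes `ipmitool` is installed and can reach the BMC. The sensor names in `SAMPLE_OUTPUT` are invented for illustration and will differ per vendor, and the function names are hypothetical.

```python
import subprocess


def parse_temperature_sdr(text):
    """Parse 'name | id | status | entity | N degrees C' lines into {name: float}."""
    readings = {}
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 5:
            continue  # not a sensor row
        parts = fields[4].split()
        # Rows without a value show text such as 'No Reading' instead.
        if len(parts) >= 2 and parts[1] == "degrees":
            try:
                readings[fields[0]] = float(parts[0])
            except ValueError:
                continue
    return readings


def read_bmc_temperatures():
    """Run ipmitool locally and return the parsed temperature readings."""
    result = subprocess.run(
        ["ipmitool", "sdr", "type", "Temperature"],
        capture_output=True, text=True, check=True,
    )
    return parse_temperature_sdr(result.stdout)


# Illustrative sample only; real sensor names depend on the board and BMC.
SAMPLE_OUTPUT = (
    "P1_TEMP_SENS     | 01h | ok  |  3.1 | 45 degrees C\n"
    "FRONT_INLET      | 02h | ns  |  7.1 | No Reading\n"
)

if __name__ == "__main__":
    print(parse_temperature_sdr(SAMPLE_OUTPUT))
```

Calling `read_bmc_temperatures()` from a cron job or timer and appending the result to a file on another machine would give a temperature history that survives a reset of the node.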
 