[SOLVED] Lots of "CPUx: Package temperature above threshold, cpu clock throttled" syslog entries?

n1nj4888

Hi There,

I have a NUC8i5BEH (Intel Core i5-8259U CPU) node and noticed that my Proxmox syslog (running PVE 6, but this also occurred on PVE 5.4) often outputs the following set of critical errors throughout the day:

Code:
23/07/2019    23:26:56    crit    pve-host1.local    kern    kernel    [292528.057261] mce: CPU2: Package temperature above threshold, cpu clock throttled (total events = 153468)
23/07/2019    23:26:56    crit    pve-host1.local    kern    kernel    [292528.057260] mce: CPU6: Package temperature above threshold, cpu clock throttled (total events = 153468)
23/07/2019    23:26:56    crit    pve-host1.local    kern    kernel    [292528.057253] mce: CPU5: Package temperature above threshold, cpu clock throttled (total events = 153468)
23/07/2019    23:26:56    crit    pve-host1.local    kern    kernel    [292528.057232] mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 153468)
23/07/2019    23:26:56    crit    pve-host1.local    kern    kernel    [292528.057231] mce: CPU0: Package temperature above threshold, cpu clock throttled (total events = 153468)
23/07/2019    23:26:56    crit    pve-host1.local    kern    kernel    [292528.057227] mce: CPU4: Package temperature above threshold, cpu clock throttled (total events = 153468)
23/07/2019    23:26:56    crit    pve-host1.local    kern    kernel    [292528.057226] mce: CPU3: Package temperature above threshold, cpu clock throttled (total events = 153468)
23/07/2019    23:26:56    crit    pve-host1.local    kern    kernel    [292528.057206] mce: CPU7: Package temperature above threshold, cpu clock throttled (total events = 153468)
23/07/2019    23:26:56    crit    pve-host1.local    kern    kernel    [292528.057205] mce: CPU4: Core temperature above threshold, cpu clock throttled (total events = 139201)
23/07/2019    23:26:56    crit    pve-host1.local    kern    kernel    [292528.057204] mce: CPU0: Core temperature above threshold, cpu clock throttled (total events = 139201)

... followed moments later (mostly within the same second) by the following set, which seems to indicate that the temperature has returned to normal:

Code:
23/07/2019    23:26:56    Information    pve-host1.local    kern    kernel    [292528.058260] mce: CPU6: Package temperature/speed normal
23/07/2019    23:26:56    Information    pve-host1.local    kern    kernel    [292528.058260] mce: CPU2: Package temperature/speed normal
23/07/2019    23:26:56    Information    pve-host1.local    kern    kernel    [292528.058258] mce: CPU5: Package temperature/speed normal
23/07/2019    23:26:56    Information    pve-host1.local    kern    kernel    [292528.058258] mce: CPU1: Package temperature/speed normal
23/07/2019    23:26:56    Information    pve-host1.local    kern    kernel    [292528.058229] mce: CPU0: Package temperature/speed normal
23/07/2019    23:26:56    Information    pve-host1.local    kern    kernel    [292528.058227] mce: CPU4: Package temperature/speed normal
23/07/2019    23:26:56    Information    pve-host1.local    kern    kernel    [292528.058226] mce: CPU3: Package temperature/speed normal
23/07/2019    23:26:56    Information    pve-host1.local    kern    kernel    [292528.058226] mce: CPU7: Package temperature/speed normal
23/07/2019    23:26:56    Information    pve-host1.local    kern    kernel    [292528.058225] mce: CPU4: Core temperature/speed normal
23/07/2019    23:26:56    Information    pve-host1.local    kern    kernel    [292528.058224] mce: CPU0: Core temperature/speed normal

This set of threshold-exceeded/back-to-normal messages was logged 40 times yesterday. However, I monitor the server's CPU temperature every minute via SNMP, and yesterday's minimum, maximum, and average readings suggest that the server is not getting hot, at least not at the moment each SNMP reading is taken:

Min: 42C
Max: 57C
Average: 45C
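
To put a number on how often this happens, the throttle messages can be counted straight from the log. The following is only a rough sketch in Python, assuming the lines look exactly like the excerpts above; the function name `summarise_throttle_events` is made up for illustration, and grouping messages by whole uptime second is just an approximation of "one set of messages".

```python
import re
from collections import Counter

# Matches the kernel "mce" throttle lines in the format shown in the excerpts above.
THROTTLE_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+mce: CPU(?P<cpu>\d+): "
    r"(?P<scope>Package|Core) temperature above threshold, "
    r"cpu clock throttled \(total events = (?P<total>\d+)\)"
)


def summarise_throttle_events(lines):
    """Return (messages per scope, highest event counter per scope, burst count).

    A "burst" is approximated as one distinct whole second of kernel uptime
    in which at least one throttle message was logged.
    """
    counts = Counter()
    max_total = {}
    burst_seconds = set()
    for line in lines:
        match = THROTTLE_RE.search(line)
        if match is None:
            continue  # e.g. the "temperature/speed normal" lines
        scope = match.group("scope")
        counts[scope] += 1
        max_total[scope] = max(max_total.get(scope, 0), int(match.group("total")))
        burst_seconds.add(int(float(match.group("uptime"))))
    return counts, max_total, len(burst_seconds)


if __name__ == "__main__":
    # Two lines copied from the log excerpt above, used as sample input.
    sample = [
        "23/07/2019    23:26:56    crit    pve-host1.local    kern    kernel    "
        "[292528.057261] mce: CPU2: Package temperature above threshold, "
        "cpu clock throttled (total events = 153468)",
        "23/07/2019    23:26:56    crit    pve-host1.local    kern    kernel    "
        "[292528.057205] mce: CPU4: Core temperature above threshold, "
        "cpu clock throttled (total events = 139201)",
    ]
    counts, totals, bursts = summarise_throttle_events(sample)
    print(dict(counts), totals, bursts)
```

Comparing the burst count per day against the per-minute SNMP readings would at least show whether the throttling lines up with any visible temperature rise.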

I have a second node (a lower-powered NUC7PJYH) with a similar temperature profile that does not show these errors, so I suspect the errors are either false positives or some configuration somewhere is setting the CPU temperature threshold too low.

Any ideas on (A) what controls these messages being written to the syslog, and (B) where I would configure the "thresholds" that it seems to be using?

Thanks!
 
It may be that the values received by the kernel are wrong, or that turbo is in effect, triggering the messages. The BIOS maximum temperature limit should halt the PC in any case.
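
One way to cross-check what the kernel itself has counted, independent of syslog, is to read its per-CPU throttle counters. This is a minimal Python sketch, assuming an x86 Linux host where the kernel exposes `core_throttle_count` and `package_throttle_count` under `/sys/devices/system/cpu/cpuN/thermal_throttle/`; verify that these files exist on your system before relying on it. The function name `read_throttle_counts` is hypothetical.

```python
from pathlib import Path


def read_throttle_counts(cpu_root="/sys/devices/system/cpu"):
    """Return {cpu_name: {counter_name: value}} for CPUs exposing thermal_throttle.

    Assumes the sysfs layout described in the lead-in; CPUs without the
    directory are simply skipped.
    """
    result = {}
    for counter_dir in sorted(Path(cpu_root).glob("cpu[0-9]*/thermal_throttle")):
        counters = {}
        for name in ("core_throttle_count", "package_throttle_count"):
            path = counter_dir / name
            if path.is_file():
                counters[name] = int(path.read_text().strip())
        result[counter_dir.parent.name] = counters
    return result


if __name__ == "__main__":
    for cpu, counters in read_throttle_counts().items():
        print(cpu, counters)
```

If these counters keep climbing while the reported temperatures stay moderate, that would point towards wrong sensor values or very short spikes rather than sustained overheating.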
 
I ended up monitoring the package0 temperatures with Grafana/SNMP, and they seemed to be above what I would expect as “normal”. I therefore raised a support case with Intel, who identified a hardware fault with my particular NUC and replaced the unit. The replacement NUC (same model) does not exhibit the same behaviour, so this can be marked as SOLVED - Hardware Issue.

Thanks!
 
I ended up monitoring the package0 temperatures with grafana /snmp
Sounds like a pretty cool solution. Do you mind sharing how exactly you are doing this? Is Grafana running as a VM on the host itself, or on some external server?
 
I have Docker running inside a VM on the PVE cluster, and the TIG stack (Telegraf, InfluxDB and Grafana) runs as containers inside Docker. I then installed an SNMP server on all PVE physical hosts and use Telegraf (in the Docker VM) to poll the PVE hosts for generic CPU/memory/disk/network SNMP metrics, write the relevant metric responses to InfluxDB, and visualise them using Grafana.

In addition, I configured the PVE external metrics server on the PVE hosts to write their own “PVE Metrics” to the same InfluxDB instance (albeit to a different measurement/table in InfluxDB).

All important/useful metrics are then visualised in Grafana using a custom dashboard, which I built mainly from templates (including PVE VM/LXC ones) on the Grafana dashboards website.
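
For anyone who wants to see the kind of data point that ends up in InfluxDB, here is a small Python sketch. To be clear, this is not the SNMP/Telegraf setup described above: it reads the package temperature locally from the `coretemp` hwmon sensor (assuming the usual `/sys/class/hwmon` layout with `tempN_label` / `tempN_input` files) and formats it as InfluxDB line protocol. The measurement name `cpu_package_temp` and both function names are hypothetical.

```python
from pathlib import Path


def read_package_temps(hwmon_root="/sys/class/hwmon"):
    """Return {label: degrees Celsius} for coretemp sensors whose label starts with "Package"."""
    temps = {}
    for hwmon in Path(hwmon_root).glob("hwmon*"):
        name_file = hwmon / "name"
        if not name_file.is_file() or name_file.read_text().strip() != "coretemp":
            continue
        for label_file in hwmon.glob("temp*_label"):
            label = label_file.read_text().strip()
            if not label.startswith("Package"):
                continue
            input_file = label_file.with_name(label_file.name.replace("_label", "_input"))
            # hwmon reports temperatures in millidegrees Celsius.
            temps[label] = int(input_file.read_text().strip()) / 1000.0
    return temps


def to_line_protocol(host, temps, measurement="cpu_package_temp"):
    """Format readings as InfluxDB line protocol (no timestamp, so the server assigns one)."""
    lines = []
    for label, value in sorted(temps.items()):
        sensor = label.replace(" ", "\\ ")  # spaces in tag values must be escaped
        lines.append(f"{measurement},host={host},sensor={sensor} celsius={value}")
    return lines


if __name__ == "__main__":
    for line in to_line_protocol("pve-host1", read_package_temps()):
        print(line)
```

The SNMP route has the advantage that nothing extra needs to run on the PVE hosts beyond the SNMP server, which is why polling from Telegraf is a reasonable default.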
 
