[SOLVED] Lots of "CPUx: Package temperature above threshold, cpu clock throttled" syslog entries?

n1nj4888 · Jul 24, 2019

Hi There,

I have a NUC8i5BEH (Intel Core i5-8259U CPU) node and noticed that my Proxmox syslog (running Proxmox PVE 6, but this also occurred on PVE 5.4) often outputs the following set of critical errors throughout the day:

Code:

23/07/2019    23:26:56    crit    pve-host1.local    kern    kernel    [292528.057261] mce: CPU2: Package temperature above threshold, cpu clock throttled (total events = 153468)
23/07/2019    23:26:56    crit    pve-host1.local    kern    kernel    [292528.057260] mce: CPU6: Package temperature above threshold, cpu clock throttled (total events = 153468)
23/07/2019    23:26:56    crit    pve-host1.local    kern    kernel    [292528.057253] mce: CPU5: Package temperature above threshold, cpu clock throttled (total events = 153468)
23/07/2019    23:26:56    crit    pve-host1.local    kern    kernel    [292528.057232] mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 153468)
23/07/2019    23:26:56    crit    pve-host1.local    kern    kernel    [292528.057231] mce: CPU0: Package temperature above threshold, cpu clock throttled (total events = 153468)
23/07/2019    23:26:56    crit    pve-host1.local    kern    kernel    [292528.057227] mce: CPU4: Package temperature above threshold, cpu clock throttled (total events = 153468)
23/07/2019    23:26:56    crit    pve-host1.local    kern    kernel    [292528.057226] mce: CPU3: Package temperature above threshold, cpu clock throttled (total events = 153468)
23/07/2019    23:26:56    crit    pve-host1.local    kern    kernel    [292528.057206] mce: CPU7: Package temperature above threshold, cpu clock throttled (total events = 153468)
23/07/2019    23:26:56    crit    pve-host1.local    kern    kernel    [292528.057205] mce: CPU4: Core temperature above threshold, cpu clock throttled (total events = 139201)
23/07/2019    23:26:56    crit    pve-host1.local    kern    kernel    [292528.057204] mce: CPU0: Core temperature above threshold, cpu clock throttled (total events = 139201)

... followed moments later (mostly within the same second) by the following set, which indicates that the temperature has returned to normal:

Code:

23/07/2019    23:26:56    Information    pve-host1.local    kern    kernel    [292528.058260] mce: CPU6: Package temperature/speed normal
23/07/2019    23:26:56    Information    pve-host1.local    kern    kernel    [292528.058260] mce: CPU2: Package temperature/speed normal
23/07/2019    23:26:56    Information    pve-host1.local    kern    kernel    [292528.058258] mce: CPU5: Package temperature/speed normal
23/07/2019    23:26:56    Information    pve-host1.local    kern    kernel    [292528.058258] mce: CPU1: Package temperature/speed normal
23/07/2019    23:26:56    Information    pve-host1.local    kern    kernel    [292528.058229] mce: CPU0: Package temperature/speed normal
23/07/2019    23:26:56    Information    pve-host1.local    kern    kernel    [292528.058227] mce: CPU4: Package temperature/speed normal
23/07/2019    23:26:56    Information    pve-host1.local    kern    kernel    [292528.058226] mce: CPU3: Package temperature/speed normal
23/07/2019    23:26:56    Information    pve-host1.local    kern    kernel    [292528.058226] mce: CPU7: Package temperature/speed normal
23/07/2019    23:26:56    Information    pve-host1.local    kern    kernel    [292528.058225] mce: CPU4: Core temperature/speed normal
23/07/2019    23:26:56    Information    pve-host1.local    kern    kernel    [292528.058224] mce: CPU0: Core temperature/speed normal

This set of threshold-exceeded/back-to-normal messages was written to the log 40 times yesterday. However, I monitor the server CPU temperature every minute via SNMP, and yesterday's minimum, maximum and average readings suggest that the server is not getting hot, at least not at the moment each SNMP reading is taken:

Min: 42C
Max: 57C
Average: 45C
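
A once-a-minute SNMP sample can easily miss events this short. As a rough illustration, the sketch below parses the kernel timestamps from a few of the log lines quoted above (trimmed to the kernel part of each line) and computes how long each throttle event lasted. The function name `throttle_durations` and the regular expression are illustrative assumptions, not part of any Proxmox or kernel tooling.

```python
import re
from decimal import Decimal

# Captures the kernel timestamp, CPU number, scope (Package/Core) and the rest of the message.
LINE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+mce: CPU(?P<cpu>\d+): "
    r"(?P<scope>Package|Core) temperature(?P<rest>.*)"
)


def throttle_durations(lines):
    """Return {(cpu, scope): seconds} between 'above threshold' and 'normal' messages."""
    started = {}
    durations = {}
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        key = (int(m["cpu"]), m["scope"])
        ts = Decimal(m["ts"])  # Decimal avoids float rounding on microsecond timestamps
        if "above threshold" in m["rest"]:
            started[key] = ts
        elif "normal" in m["rest"] and key in started:
            durations[key] = ts - started.pop(key)
    return durations


# Lines copied from the log excerpts above, without the syslog prefix columns.
sample = [
    "[292528.057261] mce: CPU2: Package temperature above threshold, cpu clock throttled (total events = 153468)",
    "[292528.057205] mce: CPU4: Core temperature above threshold, cpu clock throttled (total events = 139201)",
    "[292528.058260] mce: CPU2: Package temperature/speed normal",
    "[292528.058225] mce: CPU4: Core temperature/speed normal",
]

for (cpu, scope), seconds in sorted(throttle_durations(sample).items()):
    print(f"CPU{cpu} {scope}: throttled for {seconds * 1000} ms")
```

For these lines the gap between the "above threshold" and "normal" messages is about one millisecond, so a reading taken once per minute would only catch such a spike by chance.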

I have a second node (a lower-powered NUC7PJYH) which has a similar temperature profile, and I don't see the same errors on that node, so I'm thinking the errors are either false positives or there is some configuration somewhere that sets the CPU threshold too low.

Any ideas on (A) what controls these messages being output to the syslog and (B) where I'd configure the "thresholds" that it seems to be using?

Thanks!

Alwin · Sep 5, 2019

It may be that the values received by the kernel are wrong, or that turbo is in effect and triggering the messages. The BIOS max temperature limit should halt the PC in any case.
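
One way to cross-check what the kernel itself has recorded is to read its per-CPU throttle counters. On Intel systems Linux typically exposes these under `/sys/devices/system/cpu/cpuN/thermal_throttle/` as `core_throttle_count` and `package_throttle_count`, which should correspond to the "total events" figure in the log messages. The following is a minimal sketch under that assumption; the function name `read_throttle_counts` is made up for illustration, and the directory may be absent on other hardware.

```python
from pathlib import Path


def read_throttle_counts(cpu_root="/sys/devices/system/cpu"):
    """Return {cpu_name: {"core": int, "package": int}} for CPUs that expose the counters."""
    counts = {}
    # "cpu[0-9]*" matches cpu0, cpu1, ... but skips entries such as cpufreq or cpuidle.
    for cpu_dir in sorted(Path(cpu_root).glob("cpu[0-9]*")):
        throttle_dir = cpu_dir / "thermal_throttle"
        if not throttle_dir.is_dir():
            continue  # counters not available for this CPU
        entry = {}
        for scope in ("core", "package"):
            counter = throttle_dir / f"{scope}_throttle_count"
            if counter.is_file():
                entry[scope] = int(counter.read_text().strip())
        counts[cpu_dir.name] = entry
    return counts


if __name__ == "__main__":
    for cpu, entry in read_throttle_counts().items():
        print(cpu, entry)
```

Running this twice a few minutes apart and comparing the numbers shows whether throttle events are still accumulating, independently of what the syslog viewer displays.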

n1nj4888 · Sep 7, 2019

Alwin said:
It may be that the values received by the kernel are wrong, or that turbo is in effect and triggering the messages. The BIOS max temperature limit should halt the PC in any case.

I ended up monitoring the package0 temperatures with Grafana/SNMP and they seemed to be above what I would expect as “normal”. I therefore raised a support case with Intel, who identified a hardware fault with my particular NUC and replaced the unit. The replacement NUC (same model) does not exhibit the same behaviour, so this can be marked as SOLVED - Hardware Issue.

Thanks!

luckman212 · Feb 5, 2020

n1nj4888 said:
I ended up monitoring the package0 temperatures with Grafana/SNMP

Sounds like a pretty cool solution. Do you mind sharing how you are doing this exactly? Is Grafana running as a VM on the host itself, or is this from some external server?

n1nj4888 · Feb 5, 2020

I have Docker running inside a VM on the PVE cluster. Inside Docker run the TIG (Telegraf, InfluxDB and Grafana) containers. I then installed an SNMP server on all PVE physical hosts and use Telegraf (in the Docker VM) to poll the PVE hosts for generic CPU/Mem/Disk/Network SNMP metrics, write the relevant metric responses to InfluxDB and visualise them using Grafana.

In addition, I configured the PVE External Metrics server on the PVE hosts to write their own “PVE Metrics” to the same InfluxDB instance (albeit to a different measurement/table in Influx).

All important/useful metrics are then visualised in Grafana using a custom dashboard I built mainly from templates (including PVE VM/LXC ones) on the Grafana dashboards website.
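
For readers who want the package temperature without setting up SNMP first, here is a rough alternative sketch. It is not the setup described above. It assumes the `coretemp` driver is loaded and exposes its sensors under `/sys/class/hwmon`, reads each temperature (reported in millidegrees Celsius), and formats it as InfluxDB line protocol. The measurement name `cpu_temp`, the tag names and the function names are all hypothetical.

```python
from pathlib import Path


def read_coretemp(hwmon_root="/sys/class/hwmon"):
    """Return {sensor_label: degrees_celsius} for every coretemp sensor found."""
    temps = {}
    for hwmon in sorted(Path(hwmon_root).glob("hwmon*")):
        name_file = hwmon / "name"
        if not name_file.is_file() or name_file.read_text().strip() != "coretemp":
            continue  # skip other sensor chips
        for input_file in sorted(hwmon.glob("temp*_input")):
            label_file = input_file.with_name(input_file.name.replace("_input", "_label"))
            # Fall back to the file name when the driver provides no label.
            label = label_file.read_text().strip() if label_file.is_file() else input_file.name
            temps[label] = int(input_file.read_text().strip()) / 1000.0  # millidegrees -> degrees
    return temps


def escape_tag(value):
    """Escape commas, equals signs and spaces, as line protocol requires for tag values."""
    for ch in (",", "=", " "):
        value = value.replace(ch, "\\" + ch)
    return value


def to_line_protocol(host, temps, measurement="cpu_temp"):
    """Build one line-protocol string per sensor; the timestamp is left to the server."""
    return [
        f"{measurement},host={escape_tag(host)},sensor={escape_tag(label)} celsius={celsius}"
        for label, celsius in temps.items()
    ]


if __name__ == "__main__":
    for line in to_line_protocol("pve-host1", read_coretemp()):
        print(line)
```

The printed lines could be fed into whatever collector is already writing to InfluxDB; how they get there is left open, since that depends on the monitoring stack in use.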

luckman212 · Feb 5, 2020

Thanks @n1nj4888 - I'm going to take a crack at it...
