Hi There,
I have a NUC8i5BEH (Intel Core i5-8259U CPU) node and noticed that my Proxmox syslog (running PVE 6, but this also occurred on PVE 5.4) often outputs the following set of critical errors throughout the day:
Code:
23/07/2019 23:26:56 crit pve-host1.local kern kernel [292528.057261] mce: CPU2: Package temperature above threshold, cpu clock throttled (total events = 153468)
23/07/2019 23:26:56 crit pve-host1.local kern kernel [292528.057260] mce: CPU6: Package temperature above threshold, cpu clock throttled (total events = 153468)
23/07/2019 23:26:56 crit pve-host1.local kern kernel [292528.057253] mce: CPU5: Package temperature above threshold, cpu clock throttled (total events = 153468)
23/07/2019 23:26:56 crit pve-host1.local kern kernel [292528.057232] mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 153468)
23/07/2019 23:26:56 crit pve-host1.local kern kernel [292528.057231] mce: CPU0: Package temperature above threshold, cpu clock throttled (total events = 153468)
23/07/2019 23:26:56 crit pve-host1.local kern kernel [292528.057227] mce: CPU4: Package temperature above threshold, cpu clock throttled (total events = 153468)
23/07/2019 23:26:56 crit pve-host1.local kern kernel [292528.057226] mce: CPU3: Package temperature above threshold, cpu clock throttled (total events = 153468)
23/07/2019 23:26:56 crit pve-host1.local kern kernel [292528.057206] mce: CPU7: Package temperature above threshold, cpu clock throttled (total events = 153468)
23/07/2019 23:26:56 crit pve-host1.local kern kernel [292528.057205] mce: CPU4: Core temperature above threshold, cpu clock throttled (total events = 139201)
23/07/2019 23:26:56 crit pve-host1.local kern kernel [292528.057204] mce: CPU0: Core temperature above threshold, cpu clock throttled (total events = 139201)
... followed moments later (mostly within the same second) by the following set, which seems to indicate that the temperature has returned to normal:
Code:
23/07/2019 23:26:56 Information pve-host1.local kern kernel [292528.058260] mce: CPU6: Package temperature/speed normal
23/07/2019 23:26:56 Information pve-host1.local kern kernel [292528.058260] mce: CPU2: Package temperature/speed normal
23/07/2019 23:26:56 Information pve-host1.local kern kernel [292528.058258] mce: CPU5: Package temperature/speed normal
23/07/2019 23:26:56 Information pve-host1.local kern kernel [292528.058258] mce: CPU1: Package temperature/speed normal
23/07/2019 23:26:56 Information pve-host1.local kern kernel [292528.058229] mce: CPU0: Package temperature/speed normal
23/07/2019 23:26:56 Information pve-host1.local kern kernel [292528.058227] mce: CPU4: Package temperature/speed normal
23/07/2019 23:26:56 Information pve-host1.local kern kernel [292528.058226] mce: CPU3: Package temperature/speed normal
23/07/2019 23:26:56 Information pve-host1.local kern kernel [292528.058226] mce: CPU7: Package temperature/speed normal
23/07/2019 23:26:56 Information pve-host1.local kern kernel [292528.058225] mce: CPU4: Core temperature/speed normal
23/07/2019 23:26:56 Information pve-host1.local kern kernel [292528.058224] mce: CPU0: Core temperature/speed normal
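To see how long each throttle episode actually lasts, the kernel timestamps in square brackets can be paired up per CPU. The following is only a rough sketch: the function name `throttle_durations` and the regular expression are my own, and it assumes the log lines keep exactly the format shown above.

```python
import re

# Matches the kernel timestamp, CPU number, scope (Package/Core) and state
# in lines formatted like the syslog excerpts above (assumed format).
LINE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+mce:\s+CPU(?P<cpu>\d+):\s+"
    r"(?P<scope>Package|Core) temperature"
    r"(?P<state> above threshold|/speed normal)"
)


def throttle_durations(lines):
    """Return (cpu, scope, seconds) for each above-threshold -> normal pair."""
    started = {}    # (cpu, scope) -> timestamp of the "above threshold" line
    durations = []
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # not an mce thermal line
        key = (int(m.group("cpu")), m.group("scope"))
        ts = float(m.group("ts"))
        if m.group("state").strip() == "above threshold":
            started[key] = ts
        elif key in started:
            durations.append((key[0], key[1], ts - started.pop(key)))
    return durations
```

Fed the two CPU2 package lines above (292528.057261 and 292528.058260), this gives a gap of roughly one millisecond, which a once-a-minute reading would be unlikely to catch.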
This set of threshold-exceeded / back-to-normal messages was written to the log 40 times yesterday. However, I monitor the server's CPU temperature every minute via SNMP, and yesterday's min / max / average temperatures (below) suggest that the server is not getting hot, or at least not at the moment each SNMP reading is taken:
Min: 42C
Max: 57C
Average: 45C
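Since a once-a-minute SNMP poll can miss very short spikes, one cross-check would be to read the kernel's own throttle counters and compare them between polls. This is a minimal sketch, assuming the kernel exposes `core_throttle_count` and `package_throttle_count` under each CPU's `thermal_throttle` directory in sysfs; the function name `read_throttle_counts` and the `cpu_root` parameter are just illustrative.

```python
import glob
import os


def read_throttle_counts(cpu_root="/sys/devices/system/cpu"):
    """Return {cpu_name: {"core": int, "package": int}} read from sysfs."""
    counts = {}
    pattern = os.path.join(cpu_root, "cpu[0-9]*", "thermal_throttle")
    for path in sorted(glob.glob(pattern)):
        cpu = os.path.basename(os.path.dirname(path))  # e.g. "cpu0"
        entry = {}
        for scope in ("core", "package"):
            fname = os.path.join(path, f"{scope}_throttle_count")
            try:
                with open(fname) as fh:
                    entry[scope] = int(fh.read().strip())
            except (OSError, ValueError):
                continue  # counter missing or unreadable on this system
        if entry:
            counts[cpu] = entry
    return counts


if __name__ == "__main__":
    for cpu, entry in read_throttle_counts().items():
        print(cpu, entry)
```

If the counters climb between two SNMP readings that both look cool, that would point to brief spikes rather than false positives.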
I have a second node (a lower-powered NUC7PJYH) with a similar temperature profile, and I don't see these errors on that node, so I suspect the errors are either false positives or some configuration is setting the CPU threshold too low.
Any ideas on (A) what controls these messages being written to the syslog and (B) where I'd configure the "thresholds" that it seems to be using?
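For (B), one place to at least see limits the kernel reports is the `coretemp` hwmon device. The sketch below assumes the `coretemp` driver is loaded and exposes `temp*_input`, `temp*_max` and `temp*_crit` files (in millidegrees Celsius) under `/sys/class/hwmon`; the helper names are my own, and I am not claiming these are the same thresholds that trigger the mce messages.

```python
import glob
import os


def _read(path):
    """Return the stripped contents of a sysfs file, or None if unreadable."""
    try:
        with open(path) as fh:
            return fh.read().strip()
    except OSError:
        return None


def _to_celsius(raw):
    """Convert a millidegree string to degrees C, or None if not a number."""
    try:
        return int(raw) / 1000.0
    except (TypeError, ValueError):
        return None


def coretemp_limits(hwmon_root="/sys/class/hwmon"):
    """Return one dict per coretemp sensor: label, input, max, crit (deg C)."""
    sensors = []
    for hw in sorted(glob.glob(os.path.join(hwmon_root, "hwmon*"))):
        if _read(os.path.join(hw, "name")) != "coretemp":
            continue  # skip other hwmon devices
        for inp in sorted(glob.glob(os.path.join(hw, "temp*_input"))):
            prefix = inp[: -len("_input")]
            row = {"label": _read(prefix + "_label")}
            for field in ("input", "max", "crit"):
                row[field] = _to_celsius(_read(f"{prefix}_{field}"))
            sensors.append(row)
    return sensors


if __name__ == "__main__":
    for row in coretemp_limits():
        print(row)
```

Comparing the `max` and `crit` values on both NUCs would show whether the two nodes report different limits.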
Thanks!