New Proxmox Server MCE Hardware errors

Nox71

Apr 27, 2023
Hi all, last week I built a new Proxmox server from a Lenovo ThinkStation P710 with two Intel Xeon E5-2620 v4 CPUs and 192 GB of DDR4 ECC memory. I have created all my VMs on it and everything is working fine except for one thing: I am getting these errors in the syslog every minute:

May 01 08:27:53 srv-pve kernel: mce_notify_irq: 59 callbacks suppressed
May 01 08:27:53 srv-pve kernel: mce: [Hardware Error]: Machine check events logged
May 01 08:27:54 srv-pve kernel: mce: [Hardware Error]: Machine check events logged
May 01 08:28:54 srv-pve kernel: mce_notify_irq: 59 callbacks suppressed
May 01 08:28:54 srv-pve kernel: mce: [Hardware Error]: Machine check events logged
May 01 08:28:55 srv-pve kernel: mce: [Hardware Error]: Machine check events logged
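
As far as I understand the kernel's rate limiting, mce_notify_irq prints only a couple of "Machine check events logged" lines per minute and reports the rest as "callbacks suppressed", so the real notification rate is higher than the visible lines suggest. A minimal Python sketch that adds both up for the excerpt above; the function name `count_mce_notifications` and the `sample` variable are made up for illustration:

```python
import re

# Patterns for the two kinds of kernel lines shown in the excerpt.
LOGGED = re.compile(r"mce: \[Hardware Error\]: Machine check events logged")
SUPPRESSED = re.compile(r"mce_notify_irq: (\d+) callbacks suppressed")


def count_mce_notifications(lines):
    """Return (printed, suppressed) notification counts for syslog lines."""
    printed = 0
    suppressed = 0
    for line in lines:
        match = SUPPRESSED.search(line)
        if match:
            # The kernel reports how many notifications it dropped.
            suppressed += int(match.group(1))
        elif LOGGED.search(line):
            printed += 1
    return printed, suppressed


# The six lines from the post, copied verbatim.
sample = """\
May 01 08:27:53 srv-pve kernel: mce_notify_irq: 59 callbacks suppressed
May 01 08:27:53 srv-pve kernel: mce: [Hardware Error]: Machine check events logged
May 01 08:27:54 srv-pve kernel: mce: [Hardware Error]: Machine check events logged
May 01 08:28:54 srv-pve kernel: mce_notify_irq: 59 callbacks suppressed
May 01 08:28:54 srv-pve kernel: mce: [Hardware Error]: Machine check events logged
May 01 08:28:55 srv-pve kernel: mce: [Hardware Error]: Machine check events logged
"""

printed, suppressed = count_mce_notifications(sample.splitlines())
print(f"{printed} printed, {suppressed} suppressed")
```

For these six lines it reports 4 printed and 118 suppressed notifications, which would mean roughly one notification per second rather than two per minute.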

My server is not crashing and all the VMs are working fine, but these errors are worrying me.

Any advice on that?

Thanks in advance for your help!
 
Hi,

Is there any update on this topic?
I am struggling with a similar issue, but at a different interval.

As far as I know, mcelog is no longer installed. I installed rasdaemon but got the same result: no information about what is wrong.
Some people say that rasdaemon has bugs; others say that MCE handling in the UEFI/BIOS is somehow enabled. Nobody knows.
https://github.com/mchehab/rasdaemon/issues/95

Other Proxmox topics, also without a result:
https://forum.proxmox.com/threads/ubuntu-vm-keeps-crashing.132384/#post-584316
https://forum.proxmox.com/threads/hardware-errors-at-regular-intervals.127228/
 
If you are asking me: I have two different hardware platforms, both running without any visible 'issue'.
The first one:

Quanta motherboard from OCP Leopard

Code:
ipmitool sel list
   1 | 07/15/2000 | 03:22:12 | Unknown #0x70 |  | Asserted
   2 | 08/26/2000 | 18:24:15 | Unknown #0x70 |  | Asserted
   3 | 09/06/2000 | 02:49:38 | Unknown #0x70 |  | Asserted
   4 | 12/17/2020 | 11:03:28 | Unknown #0x70 |  | Asserted
   5 | 11/24/2022 | 10:14:16 | Unknown #0x70 |  | Asserted
   6 | 08/10/2023 | 13:21:39 | Unknown #0x70 |  | Asserted
   7 | 08/29/2023 | 08:51:15 | Unknown #0x70 |  | Asserted
   8 | 08/31/2023 | 23:00:10 | Unknown #0x70 |  | Asserted
   9 | 09/13/2023 | 18:16:17 | Unknown #0x70 |  | Asserted
   a | 09/13/2023 | 18:49:09 | Unknown #0x70 |  | Asserted
ipmitool sdr
P0 Temp          | 78 degrees C      | ok
P1 Temp          | 83 degrees C      | ok
P0 DTSmax        | 95 degrees C      | ok
P1 DTSmax        | 95 degrees C      | ok
P0 Therm Margin  | -17 degrees C     | ok
P1 Therm Margin  | -12 degrees C     | ok
P3V3             | 3.32 Volts        | ok
P5V              | 5.06 Volts        | ok
P12V             | 12.48 Volts       | ok
P1V05_STBY       | 1.07 Volts        | ok
P1V8_AUX         | 1.80 Volts        | ok
P3V3_AUX         | 3.34 Volts        | ok
P5V_AUX          | 5.08 Volts        | ok
P3V_BAT          | 3.10 Volts        | ok
Inlet Temp       | 27 degrees C      | ok
Outlet Temp      | 58 degrees C      | ok
PCH Temp         | 42 degrees C      | ok
HSC Input Volt   | 12.30 Volts       | ok
HSC Input Power  | 348 Watts         | ok
HSC Temp         | 61 degrees C      | ok
HSC Sts Low      | 0x00              | ok
HSC Sts High     | 0x00              | ok
HSC Output Curr  | 28.60 Amps        | ok
SYS FAN0         | 2300 RPM          | ok
SYS FAN1         | 2300 RPM          | ok
P0 VR Temp       | 52 degrees C      | ok
P0 core VR Vol   | 1.76 Volts        | ok
P0 core VR Curr  | 62 Amps           | ok
P0 core VR POUT  | 109 Watts         | ok
P0 core VR PIN   | 127 Watts         | ok
P1 VR Temp       | 67 degrees C      | ok
P1 core VR Vol   | 1.77 Volts        | ok
P1 core VR Curr  | 51.50 Amps        | ok
P1 core VR POUT  | 96 Watts          | ok
P1 core VR PIN   | 101 Watts         | ok
P0 DIMM VR0 Temp | 42 degrees C      | ok
P0 DIMM VR0 Vol  | 1.51 Volts        | ok
P0 DIMM VR0 Curr | 4.50 Amps         | ok
P0 DIMM VR0 POUT | 8 Watts           | ok
P0 DIMM VR0 PIN  | 12 Watts          | ok
P0 DIMM VR1 Temp | 43 degrees C      | ok
P0 DIMM VR1 Vol  | 1.51 Volts        | ok
P0 DIMM VR1 Curr | 6.50 Amps         | ok
P0 DIMM VR1 POUT | 8 Watts           | ok
P0 DIMM VR1 PIN  | 10 Watts          | ok
P1 DIMM VR0 Temp | 51 degrees C      | ok
P1 DIMM VR0 Vol  | 1.51 Volts        | ok
P1 DIMM VR0 Curr | 5.50 Amps         | ok
P1 DIMM VR0 POUT | 11 Watts          | ok
P1 DIMM VR0 PIN  | 10 Watts          | ok
P1 DIMM VR1 Temp | 52 degrees C      | ok
P1 DIMM VR1 Vol  | 1.51 Volts        | ok
P1 DIMM VR1 Curr | 7.50 Amps         | ok
P1 DIMM VR1 POUT | 11 Watts          | ok
P1 DIMM VR1 PIN  | 10 Watts          | ok
P0 Package Power | 113 Watts         | ok
P1 Package Power | 113 Watts         | ok
P0 DIMM01 Temp   | 49 degrees C      | ok
P0 DIMM23 Temp   | 45 degrees C      | ok
P1 DIMM01 Temp   | 49 degrees C      | ok
P1 DIMM23 Temp   | 50 degrees C      | ok
C1 Local Temp    | no reading        | ns
C1 Remote Temp   | no reading        | ns
C2 Local Temp    | no reading        | ns
C2 Remote Temp   | no reading        | ns
C3 Local Temp    | no reading        | ns
C3 Remote Temp   | no reading        | ns
C4 Local Temp    | no reading        | ns
C4 Remote Temp   | no reading        | ns
CPU0 Error       | 0x00              | ok
CPU1 Error       | 0x00              | ok
P0_CH0DIMM0_Sts  | 0x00              | ok
P0_CH0DIMM1_Sts  | 0x00              | ok
P0_CH1DIMM0_Sts  | 0x00              | ok
P0_CH1DIMM1_Sts  | 0x00              | ok
P0_CH2DIMM0_Sts  | 0x00              | ok
P0_CH2DIMM1_Sts  | 0x00              | ok
P0_CH3DIMM0_Sts  | 0x00              | ok
P0_CH3DIMM1_Sts  | 0x00              | ok
P1_CH0DIMM0_Sts  | 0x00              | ok
P1_CH0DIMM1_Sts  | 0x00              | ok
P1_CH1DIMM0_Sts  | 0x00              | ok
P1_CH1DIMM1_Sts  | 0x00              | ok
P1_CH2DIMM0_Sts  | 0x00              | ok
P1_CH2DIMM1_Sts  | 0x00              | ok
P1_CH3DIMM0_Sts  | 0x00              | ok
P1_CH3DIMM1_Sts  | 0x00              | ok
SEL Status       | 0x00              | ok
DCMI Watchdog    | 0x00              | ok
NTP Status       | 0x00              | ok
Chassis Pwr Sts  | 0x00              | ok
VR HOT           | 0x00              | ok
CPU_DIMM HOT     | 0x00              | ok
Airflow          | no reading        | ns
Sys booting sts  | 0x00              | ok
System Status    | 0x00              | ok
Processor Fail   | 0x00              | ok
C2 NVMe Status   | Not Readable      | ns
C2 NVMe Warn     | Not Readable      | ns
C2 NVMe CTemp    | no reading        | ns
C2 NVMe T_Main   | no reading        | ns
C2 NVMe T_Inlet  | no reading        | ns
C2 NVMe T_DB1    | no reading        | ns
C2 NVMe T_DB2    | no reading        | ns
C2 NVMe PDLU     | no reading        | ns
C2 NVMe Power    | no reading        | ns
C3 NVMe Status   | Not Readable      | ns
C3 NVMe Warn     | Not Readable      | ns
C3 NVMe CTemp    | no reading        | ns
C3 NVMe T_Main   | no reading        | ns
C3 NVMe T_Inlet  | no reading        | ns
C3 NVMe T_DB1    | no reading        | ns
C3 NVMe T_DB2    | no reading        | ns
C3 NVMe PDLU     | no reading        | ns
C3 NVMe Power    | no reading        | ns
C4 NVMe Status   | Not Readable      | ns
C4 NVMe Warn     | Not Readable      | ns
C4 NVMe CTemp    | no reading        | ns
C4 NVMe T_Main   | no reading        | ns
C4 NVMe T_Inlet  | no reading        | ns
C4 NVMe T_DB1    | no reading        | ns
C4 NVMe T_DB2    | no reading        | ns
C4 NVMe PDLU     | no reading        | ns
C4 NVMe Power    | no reading        | ns
CablePresent1    | Not Readable      | ns
CablePresent2    | Not Readable      | ns
CablePresent3    | Not Readable      | ns
CablePresent4    | Not Readable      | ns
C2-0 NVMe CTemp  | no reading        | ns
C2-1 NVMe CTemp  | no reading        | ns
C3-0 NVMe CTemp  | no reading        | ns
C3-1 NVMe CTemp  | no reading        | ns
C3-2 NVMe CTemp  | no reading        | ns
C3-3 NVMe CTemp  | no reading        | ns
QAVABFTemp       | no reading        | ns
QAVABRTemp       | no reading        | ns
QAVATFTemp       | no reading        | ns
QAVATRTemp       | no reading        | ns
QAVABCurrent     | no reading        | ns
QAVABVoltage     | no reading        | ns
QAVABpower       | no reading        | ns
QAVATCurrent     | no reading        | ns
QAVATVoltage     | no reading        | ns
QAVATPower       | no reading        | ns

dmesg -T   # timestamps use a Polish locale: czw = Thu, wrz = Sep
[czw wrz 14 08:13:40 2023] mce: [Hardware Error]: Machine check events logged
[czw wrz 14 08:13:40 2023] mce: [Hardware Error]: Machine check events logged
[czw wrz 14 08:15:11 2023] mce: [Hardware Error]: Machine check events logged
[czw wrz 14 08:15:11 2023] mce: [Hardware Error]: Machine check events logged
[czw wrz 14 08:23:52 2023] mce_notify_irq: 8 callbacks suppressed
[czw wrz 14 08:23:52 2023] mce: [Hardware Error]: Machine check events logged
[czw wrz 14 08:23:52 2023] mce: [Hardware Error]: Machine check events logged
[czw wrz 14 08:37:51 2023] mce: [Hardware Error]: Machine check events logged
[czw wrz 14 08:37:51 2023] mce: [Hardware Error]: Machine check events logged
[czw wrz 14 08:39:10 2023] mce_notify_irq: 10 callbacks suppressed
[czw wrz 14 08:39:10 2023] mce: [Hardware Error]: Machine check events logged
[czw wrz 14 08:39:10 2023] mce: [Hardware Error]: Machine check events logged
[czw wrz 14 08:42:04 2023] mce_notify_irq: 8 callbacks suppressed
[czw wrz 14 08:42:04 2023] mce: [Hardware Error]: Machine check events logged
[czw wrz 14 08:42:04 2023] mce: [Hardware Error]: Machine check events logged
[czw wrz 14 08:43:50 2023] mce: [Hardware Error]: Machine check events logged
[czw wrz 14 08:43:50 2023] mce: [Hardware Error]: Machine check events logged
[czw wrz 14 08:47:40 2023] mce: [Hardware Error]: Machine check events logged
[czw wrz 14 08:47:40 2023] mce: [Hardware Error]: Machine check events logged
[czw wrz 14 08:49:36 2023] mce_notify_irq: 6 callbacks suppressed
[czw wrz 14 08:49:36 2023] mce: [Hardware Error]: Machine check events logged
[czw wrz 14 08:49:36 2023] mce: [Hardware Error]: Machine check events logged
[czw wrz 14 08:55:27 2023] mce_notify_irq: 2 callbacks suppressed
[czw wrz 14 08:55:27 2023] mce: [Hardware Error]: Machine check events logged
[czw wrz 14 08:55:27 2023] mce: [Hardware Error]: Machine check events logged
[czw wrz 14 08:57:34 2023] mce: [Hardware Error]: Machine check events logged
[czw wrz 14 08:57:34 2023] mce: [Hardware Error]: Machine check events logged
[czw wrz 14 09:02:20 2023] mce_notify_irq: 6 callbacks suppressed
[czw wrz 14 09:02:20 2023] mce: [Hardware Error]: Machine check events logged
[czw wrz 14 09:02:20 2023] mce: [Hardware Error]: Machine check events logged
[czw wrz 14 09:10:53 2023] mce: [Hardware Error]: Machine check events logged
[czw wrz 14 09:10:53 2023] mce: [Hardware Error]: Machine check events logged
[czw wrz 14 09:17:55 2023] mce_notify_irq: 4 callbacks suppressed
[czw wrz 14 09:17:55 2023] mce: [Hardware Error]: Machine check events logged
[czw wrz 14 09:17:55 2023] mce: [Hardware Error]: Machine check events logged
[czw wrz 14 09:19:56 2023] mce: [Hardware Error]: Machine check events logged
[czw wrz 14 09:19:56 2023] mce: [Hardware Error]: Machine check events logged
[czw wrz 14 09:21:12 2023] mce: [Hardware Error]: Machine check events logged
[czw wrz 14 09:21:12 2023] mce: [Hardware Error]: Machine check events logged
[czw wrz 14 09:25:52 2023] mce_notify_irq: 2 callbacks suppressed
[czw wrz 14 09:25:52 2023] mce: [Hardware Error]: Machine check events logged
[czw wrz 14 09:25:52 2023] mce: [Hardware Error]: Machine check events logged
[czw wrz 14 09:31:58 2023] mce_notify_irq: 4 callbacks suppressed
[czw wrz 14 09:31:58 2023] mce: [Hardware Error]: Machine check events logged
[czw wrz 14 09:31:58 2023] mce: [Hardware Error]: Machine check events logged
[czw wrz 14 09:34:36 2023] mce: [Hardware Error]: Machine check events logged
[czw wrz 14 09:34:36 2023] mce: [Hardware Error]: Machine check events logged
[czw wrz 14 09:36:10 2023] mce_notify_irq: 2 callbacks suppressed
[czw wrz 14 09:36:10 2023] mce: [Hardware Error]: Machine check events logged
[czw wrz 14 09:36:10 2023] mce: [Hardware Error]: Machine check events logged

The second one:
Supermicro
Product Name: X11DDW-L

Code:
ipmitool sdr
CPU1 Temp        | 30 degrees C      | ok
CPU2 Temp        | 31 degrees C      | ok
Inlet Temp       | no reading        | ns
PCH Temp         | 37 degrees C      | ok
System Temp      | 24 degrees C      | ok
Peripheral Temp  | 27 degrees C      | ok
VRMCpu1 Temp     | 31 degrees C      | ok
VRMCpu2 Temp     | 31 degrees C      | ok
VRMP1ABC Temp    | 33 degrees C      | ok
VRMP1DEF Temp    | 32 degrees C      | ok
VRMP2ABC Temp    | 31 degrees C      | ok
VRMP2DEF Temp    | 29 degrees C      | ok
P1-DIMMA1 Temp   | 29 degrees C      | ok
P1-DIMMB1 Temp   | 30 degrees C      | ok
P1-DIMMC1 Temp   | no reading        | ns
P1-DIMMD1 Temp   | 27 degrees C      | ok
P1-DIMME1 Temp   | 26 degrees C      | ok
P1-DIMMF1 Temp   | no reading        | ns
P2-DIMMA1 Temp   | 27 degrees C      | ok
P2-DIMMB1 Temp   | 27 degrees C      | ok
P2-DIMMC1 Temp   | no reading        | ns
P2-DIMMD1 Temp   | 27 degrees C      | ok
P2-DIMME1 Temp   | 27 degrees C      | ok
P2-DIMMF1 Temp   | no reading        | ns
FAN1             | no reading        | ns
FAN2             | no reading        | ns
FAN3             | 6900 RPM          | ok
FAN4             | no reading        | ns
FAN5             | 7000 RPM          | ok
FAN6             | 6900 RPM          | ok
12V              | 11.82 Volts       | ok
5VCC             | 5.10 Volts        | ok
3.3VCC           | 3.45 Volts        | ok
VBAT             | 0x04              | ok
Vcpu1            | 1.86 Volts        | ok
Vcpu2            | 1.87 Volts        | ok
VDimmP1ABC       | 1.20 Volts        | ok
VDimmP1DEF       | 1.20 Volts        | ok
VDimmP2ABC       | 1.20 Volts        | ok
VDimmP2DEF       | 1.20 Volts        | ok
5VSB             | 5.07 Volts        | ok
3.3VSB           | 3.38 Volts        | ok
1.8V PCH         | 1.84 Volts        | ok
PVNN PCH         | 1.03 Volts        | ok
1.05V PCH        | 1.07 Volts        | ok
Chassis Intru    | 0x00              | ok
M2NVMeSSD Temp   | no reading        | ns
PS1 Status       | 0x01              | ok
PS2 Status       | 0x01              | ok
AOC_NIC Temp     | 39 degrees C      | ok
AOCM2NVMe_Temp   | no reading        | ns

ipmitool sel list
   1 | 09/03/2022 | 16:42:00 | Unknown #0xff |  | Asserted
   2 | 09/03/2022 | 17:12:05 | Power Supply #0xc8 | Failure detected () | Asserted
   3 | 09/03/2022 | 17:31:47 | Power Supply #0xc8 | Failure detected () | Deasserted
   4 | 09/03/2022 | 18:17:37 | OS Boot | C: boot completed () | Asserted
   5 | 09/03/2022 | 18:25:41 | OS Boot | C: boot completed () | Asserted
   6 | 09/06/2022 | 12:56:11 | Unknown #0xff |  | Asserted

dmesg -T
[Sun Sep  3 16:55:43 2023] mce: [Hardware Error]: Machine check events logged
[Mon Sep  4 15:57:41 2023] mce: [Hardware Error]: Machine check events logged
[Mon Sep  4 15:57:41 2023] mce: [Hardware Error]: Machine check events logged
[Wed Sep  6 00:07:34 2023] mce: [Hardware Error]: Machine check events logged
[Thu Sep  7 23:27:54 2023] mce: [Hardware Error]: Machine check events logged
[Fri Sep  8 00:06:23 2023] mce: [Hardware Error]: Machine check events logged
[Fri Sep  8 07:45:36 2023] mce: [Hardware Error]: Machine check events logged
[Fri Sep  8 22:57:23 2023] mce: [Hardware Error]: Machine check events logged
[Sat Sep  9 01:17:09 2023] mce: [Hardware Error]: Machine check events logged
[Sat Sep  9 05:11:12 2023] mce: [Hardware Error]: Machine check events logged
[Sat Sep  9 07:06:20 2023] mce: [Hardware Error]: Machine check events logged
[Sat Sep  9 10:37:01 2023] mce: [Hardware Error]: Machine check events logged
[Sat Sep  9 15:39:48 2023] mce: [Hardware Error]: Machine check events logged
[Sun Sep 10 02:41:08 2023] mce: [Hardware Error]: Machine check events logged
[Mon Sep 11 04:35:54 2023] mce: [Hardware Error]: Machine check events logged
[Wed Sep 13 06:09:35 2023] mce: [Hardware Error]: Machine check events logged
 
I have two different HW and both without 'issue'
Good for you, but bad for the thread.

Have you updated the BIOS? Sometimes it is just like this. I also have a machine with machine check exceptions; the vendor analysed it and found no problem. Besides the MCEs there is no obvious problem, so we now just ignore them.
 
On the Quanta server, the MCE log entries disappeared after replacing the Fibre Channel card.
Code:
Fibre Channel: QLogic Corp. ISP2532-based 8Gb Fibre Channel to PCI Express HBA (rev 02)
But still, Proxmox team: why is there no information about what this 'hardware error' actually is? mcelog no longer exists, and rasdaemon is not working well.

On the Supermicro, the problem persists.
 
Also seeing this issue. I can increase the frequency of these errors by stress testing the VMs, but I have had zero stability issues on any VM. I first found this issue in my Gigabyte BMC (see attached screenshot), and this led me down the rabbit hole of installing rasdaemon and checking for ECC errors via ras-mc-ctl --error-count, but that shows 0 errors. I've been trying to find a solution to this issue for weeks now but have come up with nothing, and I'm about to just say "I hope this doesn't become an issue down the line", as I've seen no performance impact yet. Any help would be greatly appreciated :)
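
To cross-check the zero from ras-mc-ctl --error-count, the EDAC counters can also be read straight from sysfs. As far as I know that tool reports the per-memory-controller counters, so a zero there only says that EDAC counted no memory errors; a machine check from another source (a PCIe device or a CPU cache, for example) would not show up in it. A minimal sketch, assuming the usual EDAC layout under /sys/devices/system/edac/mc with ce_count and ue_count files per controller; the function name `read_edac_counts` is hypothetical:

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": n, "ue_count": n}} from EDAC sysfs.

    ce_count = corrected errors, ue_count = uncorrected errors.
    Returns an empty dict when the directory does not exist
    (for example when no EDAC driver is loaded).
    """
    base = Path(root)
    if not base.is_dir():
        return {}
    counts = {}
    for controller in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            counter = controller / name
            if counter.is_file():
                entry[name] = int(counter.read_text().strip())
        counts[controller.name] = entry
    return counts


if __name__ == "__main__":
    for controller, entry in read_edac_counts().items():
        print(controller, entry)
```

If this also prints zeros everywhere (or nothing at all), the memory controllers are probably not what is raising the events.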

Gigabyte BMC errors:
1708636466032.png

ras-mc-ctl output:
1708636523748.png

Proxmox syslog output:

Code:
Feb 21 22:40:17 mojojojo pveproxy[1271448]: worker exit
Feb 21 22:43:20 mojojojo kernel: mce: [Hardware Error]: Machine check events logged
Feb 21 22:48:31 mojojojo kernel: mce: [Hardware Error]: Machine check events logged
Feb 21 22:48:31 mojojojo kernel: mce: [Hardware Error]: Machine check events logged
Feb 21 23:04:05 mojojojo kernel: mce: [Hardware Error]: Machine check events logged
Feb 21 23:04:05 mojojojo kernel: mce: [Hardware Error]: Machine check events logged
Feb 21 23:09:16 mojojojo kernel: mce: [Hardware Error]: Machine check events logged
Feb 21 23:09:16 mojojojo kernel: mce: [Hardware Error]: Machine check events logged
Feb 21 23:17:01 mojojojo CRON[1293278]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Feb 21 23:17:01 mojojojo CRON[1293279]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Feb 21 23:17:01 mojojojo CRON[1293278]: pam_unix(cron:session): session closed for user root
Feb 21 23:19:39 mojojojo kernel: mce: [Hardware Error]: Machine check events logged
Feb 21 23:19:39 mojojojo kernel: mce: [Hardware Error]: Machine check events logged
Feb 21 23:24:50 mojojojo kernel: mce_notify_irq: 2 callbacks suppressed
Feb 21 23:24:50 mojojojo kernel: mce: [Hardware Error]: Machine check events logged
Feb 21 23:30:01 mojojojo kernel: mce: [Hardware Error]: Machine check events logged
Feb 21 23:35:13 mojojojo kernel: mce: [Hardware Error]: Machine check events logged
Feb 21 23:35:13 mojojojo kernel: mce: [Hardware Error]: Machine check events logged
Feb 21 23:45:35 mojojojo kernel: mce: [Hardware Error]: Machine check events logged
Feb 21 23:45:35 mojojojo kernel: mce: [Hardware Error]: Machine check events logged
Feb 21 23:55:58 mojojojo kernel: mce_notify_irq: 2 callbacks suppressed
Feb 21 23:55:58 mojojojo kernel: mce: [Hardware Error]: Machine check events logged
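
The spacing of these entries looks regular, so it may help to measure it. Below is a minimal Python sketch that computes the gaps in seconds between distinct "Hardware Error" timestamps. The function name `event_gaps` is made up, and because syslog timestamps carry no year, the `year` parameter is an assumption that only matters for parsing:

```python
import re
from datetime import datetime

# Matches kernel MCE lines and captures the syslog timestamp.
EVENT = re.compile(
    r"^(\w{3} +\d+ \d\d:\d\d:\d\d) \S+ kernel: mce: \[Hardware Error\]"
)


def event_gaps(lines, year=2024):
    """Return gaps in seconds between distinct MCE log timestamps."""
    times = []
    for line in lines:
        match = EVENT.match(line)
        if not match:
            continue
        stamp = datetime.strptime(
            f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S"
        )
        # Lines repeated within the same second count as one event.
        if not times or stamp != times[-1]:
            times.append(stamp)
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


# The first few MCE lines from the log above, copied verbatim.
sample = """\
Feb 21 22:43:20 mojojojo kernel: mce: [Hardware Error]: Machine check events logged
Feb 21 22:48:31 mojojojo kernel: mce: [Hardware Error]: Machine check events logged
Feb 21 22:48:31 mojojojo kernel: mce: [Hardware Error]: Machine check events logged
Feb 21 23:04:05 mojojojo kernel: mce: [Hardware Error]: Machine check events logged
Feb 21 23:04:05 mojojojo kernel: mce: [Hardware Error]: Machine check events logged
Feb 21 23:09:16 mojojojo kernel: mce: [Hardware Error]: Machine check events logged
Feb 21 23:09:16 mojojojo kernel: mce: [Hardware Error]: Machine check events logged
"""

print(event_gaps(sample.splitlines()))  # [311, 934, 311]
```

The gaps come out close to multiples of about 311 seconds. That would be consistent with the events being picked up by the kernel's periodic machine check poll (its default interval is 5 minutes, as far as I know) rather than raised as exceptions, but that is only a guess from the timing.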
 
