mce: [Hardware Error]: Machine check events logged

From time to time I get these messages in my syslog:


Code:
mce: [Hardware Error]: Machine check events logged

I installed mcelog, and it logs these messages:

Code:
TIME 1459479366 Fri Apr  1 04:56:06 2016
MCG status:
MCi status:
Corrected error
Error enabled
MCA: Internal parity error
STATUS 90000040000f0005 MCGSTATUS 0
MCGCAP c09 APICID 6 SOCKETID 0
CPUID Vendor Intel Family 6 Model 60
Hardware event. This is not a software error.
MCE 0
CPU 2 BANK 0
TIME 1459480198 Fri Apr  1 05:09:58 2016
MCG status:
MCi status:
Corrected error
Error enabled
MCA: Internal parity error
STATUS 90000040000f0005 MCGSTATUS 0
MCGCAP c09 APICID 4 SOCKETID 0
CPUID Vendor Intel Family 6 Model 60
Hardware event. This is not a software error.
MCE 0
CPU 1 BANK 0
TIME 1459480814 Fri Apr  1 05:20:14 2016
MCG status:
MCi status:
Corrected error
Error enabled
MCA: Internal parity error
STATUS 90000040000f0005 MCGSTATUS 0
MCGCAP c09 APICID 2 SOCKETID 0
CPUID Vendor Intel Family 6 Model 60
Hardware event. This is not a software error.
MCE 0
CPU 1 BANK 0
TIME 1459487750 Fri Apr  1 07:15:50 2016
MCG status:
MCi status:
Corrected error
Error enabled
MCA: Internal parity error
STATUS 90000040000f0005 MCGSTATUS 0
MCGCAP c09 APICID 2 SOCKETID 0
CPUID Vendor Intel Family 6 Model 60
Hardware event. This is not a software error.
MCE 0
CPU 3 BANK 0
TIME 1459492990 Fri Apr  1 08:43:10 2016
MCG status:
MCi status:
Corrected error
Error enabled
MCA: Internal parity error
STATUS 90000040000f0005 MCGSTATUS 0
MCGCAP c09 APICID 6 SOCKETID 0
CPUID Vendor Intel Family 6 Model 60

What does that mean?

For info: the CPU is a "Xeon E3-1246V3".
 

Hmm okay, there was a similar but more serious problem with their boards here a few weeks ago, but that user really had no luck with Supermicro support :/

Do you run 32 bit VMs?
Looks like HSW131 from http://www.intel.com/content/dam/ww...cation-updates/xeon-e3-1200v3-spec-update.pdf (just search for it)

HSW131. Spurious Corrected Errors May be Reported

Problem: Due to this erratum, spurious corrected errors may be logged in the IA32_MC0_STATUS register with the valid field (bit 63) set, the uncorrected error field (bit 61) not set, a Model Specific Error Code (bits [31:16]) of 0x000F, and an MCA Error Code (bits [15:0]) of 0x0005. If CMCI is enabled, these spurious corrected errors also signal interrupts.

Implication: When this erratum occurs, software may see corrected errors that are benign. These corrected errors may be safely ignored.

Workaround: None identified.

Status: For the steppings affected, see the Summary Table

Your status value is "STATUS 90000040000f0005", which matches this signature when written out in binary:
Code:
1 0 0 1 | 0 0 0 0 | 0 0 0 0 | 0 0 0 0 | 0 0 0 0 | 0 0 0 0 | 0 1 0 0 | 0 0 0 0 | 0 0 0 0 | 0 0 0 0 | 0 0 0 0 | 1 1 1 1 | 0 0 0 0 | 0 0 0 0 | 0 0 0 0 | 0 1 0 1
↑   ↑                                                                         |-         bit[31:16] = 0x000F         -|-        bit[15:0] = 0x0005          -|
|   61- bit not set
63 bit set

Thus you may ignore this.
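
The same check can be done in a few lines of Python instead of by hand. This is only a minimal sketch: the function names `decode_mci_status` and `matches_hsw131` are made up for illustration, and the only fields decoded are the four that the HSW131 text quoted above mentions. The bank number is checked because the erratum refers to IA32_MC0_STATUS, i.e. bank 0.

```python
def decode_mci_status(status: int) -> dict:
    """Extract the fields named in the HSW131 erratum text from a STATUS value."""
    return {
        "valid": bool((status >> 63) & 1),                # bit 63
        "uncorrected": bool((status >> 61) & 1),          # bit 61
        "model_specific_code": (status >> 16) & 0xFFFF,   # bits 31:16
        "mca_error_code": status & 0xFFFF,                # bits 15:0
    }


def matches_hsw131(status: int, bank: int) -> bool:
    """True if the value has the signature described for HSW131 (bank 0 only)."""
    fields = decode_mci_status(status)
    return (
        bank == 0
        and fields["valid"]
        and not fields["uncorrected"]
        and fields["model_specific_code"] == 0x000F
        and fields["mca_error_code"] == 0x0005
    )


# Value taken from the "STATUS" line in the mcelog output above.
status = int("90000040000f0005", 16)
print(decode_mci_status(status))
print(matches_hsw131(status, bank=0))
```

A value with bit 61 set, or with a different error code, would not match and should not be dismissed on the basis of this erratum.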
 
Yes, I also run 32-bit systems: 1x Win-2000, 1x Win-XP.
The other systems are 64-bit.

Sorry for the question, but what does that mean now? I do not understand what the actual problem is; I am confused :)
 
Uh, I could have been a little more clear, sorry.

It means that Intel's CPUs also have some bugs and you had the bad luck of running into one. Fortunately, this one is as harmless as it gets, as Intel's own conclusion says.

They publish so-called "errata": documents that describe the known problems of a CPU model, whether hardware or software related, and explain how to fix or work around them where possible.

Your specific issue (HSW131) needs no intervention: it is reported as an internal parity error that has already been corrected. See your log, which contains several "Corrected error" entries.
The "error" in the log is really just information for the user here, and for this specific problem it can safely be ignored.

I understand that it feels a bit strange to simply ignore this (or any "error message"). But your CPU model is clearly affected, the status value matches HSW131, and Intel has a good reputation regarding such errata, so in my opinion it is safe to do so.
 
Many thanks for this detailed answer.

Can this bug be fixed someday?
And if so, by a mainboard BIOS update or by the kernel?


My problem now is this: I have a cron job that runs every hour. It checks dmesg and the syslog for errors, and if it finds one, I get an email.
But those only contain "mce: [Hardware Error]: Machine check events logged", so I cannot filter on that, because the detailed error is in the mcelog output :(

Do you have any ideas?
 
I do not know the internal details of this bug, but I suspect that either it cannot be fixed by a CPU microcode update or Intel sees no point in fixing it (it affects nothing); otherwise Intel would have done so already and documented the solution in the erratum.

A fix in the kernel or the BIOS would simply suppress this specific error; it would not fix the underlying cause.

One way to solve this would be to filter out that message anyway. So that you do not miss a different and possibly dangerous MCE, also scan your mcelog output and send an email if any error other than this specific one is logged.
Not the nicest solution but it should work :)
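
A rough sketch of that idea in Python follows. It rests on several assumptions: the record layout is taken from the mcelog output pasted earlier in this thread (each record is assumed to start with a "Hardware event." line and to contain a "CPU n BANK n" line and a "STATUS ..." line), and the function names are invented for illustration. Anything that does not match the HSW131 signature on bank 0, including a record whose bank cannot be determined, is treated as worth reporting.

```python
import re

# Line formats assumed from the mcelog output shown earlier in this thread.
STATUS_RE = re.compile(r"^STATUS\s+([0-9a-fA-F]+)", re.MULTILINE)
BANK_RE = re.compile(r"^CPU\s+\d+\s+BANK\s+(\d+)", re.MULTILINE)


def is_hsw131(status: int, bank: int) -> bool:
    """Signature from the erratum: bank 0, bit 63 set, bit 61 clear,
    bits 31:16 == 0x000F, bits 15:0 == 0x0005."""
    return (
        bank == 0
        and (status >> 63) & 1 == 1
        and (status >> 61) & 1 == 0
        and (status >> 16) & 0xFFFF == 0x000F
        and status & 0xFFFF == 0x0005
    )


def split_records(text: str) -> list:
    """Split mcelog text into records, starting a new one at each 'Hardware event.' line."""
    records, current = [], []
    for line in text.splitlines():
        if line.startswith("Hardware event.") and current:
            records.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        records.append("\n".join(current))
    return records


def unexpected_records(text: str) -> list:
    """Return every record with a STATUS line that is not the known HSW131 pattern."""
    result = []
    for record in split_records(text):
        status = STATUS_RE.search(record)
        bank = BANK_RE.search(record)
        if status is None:
            continue  # no STATUS line, so nothing to judge in this chunk
        if bank is None or not is_hsw131(int(status.group(1), 16), int(bank.group(1))):
            result.append(record)
    return result


SAMPLE = """Hardware event. This is not a software error.
MCE 0
CPU 2 BANK 0
TIME 1459479366 Fri Apr  1 04:56:06 2016
MCG status:
MCi status:
Corrected error
Error enabled
MCA: Internal parity error
STATUS 90000040000f0005 MCGSTATUS 0
MCGCAP c09 APICID 6 SOCKETID 0
CPUID Vendor Intel Family 6 Model 60
"""

print(len(unexpected_records(SAMPLE)))
```

In the hourly cron job, the text would come from wherever mcelog writes its records on your system (the location depends on how mcelog is configured), and the mail would only be sent when `unexpected_records` returns a non-empty list. The generic "Machine check events logged" line in the syslog could then be excluded from the existing check.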
 
