mce: [Hardware Error]: Machine check events logged

trendco · Apr 1, 2016

From time to time I get this message in my syslog:

Code:

mce: [Hardware Error]: Machine check events logged

I installed mcelog, and it reports these messages:

Code:

TIME 1459479366 Fri Apr  1 04:56:06 2016
MCG status:
MCi status:
Corrected error
Error enabled
MCA: Internal parity error
STATUS 90000040000f0005 MCGSTATUS 0
MCGCAP c09 APICID 6 SOCKETID 0
CPUID Vendor Intel Family 6 Model 60
Hardware event. This is not a software error.
MCE 0
CPU 2 BANK 0
TIME 1459480198 Fri Apr  1 05:09:58 2016
MCG status:
MCi status:
Corrected error
Error enabled
MCA: Internal parity error
STATUS 90000040000f0005 MCGSTATUS 0
MCGCAP c09 APICID 4 SOCKETID 0
CPUID Vendor Intel Family 6 Model 60
Hardware event. This is not a software error.
MCE 0
CPU 1 BANK 0
TIME 1459480814 Fri Apr  1 05:20:14 2016
MCG status:
MCi status:
Corrected error
Error enabled
MCA: Internal parity error
STATUS 90000040000f0005 MCGSTATUS 0
MCGCAP c09 APICID 2 SOCKETID 0
CPUID Vendor Intel Family 6 Model 60
Hardware event. This is not a software error.
MCE 0
CPU 1 BANK 0
TIME 1459487750 Fri Apr  1 07:15:50 2016
MCG status:
MCi status:
Corrected error
Error enabled
MCA: Internal parity error
STATUS 90000040000f0005 MCGSTATUS 0
MCGCAP c09 APICID 2 SOCKETID 0
CPUID Vendor Intel Family 6 Model 60
Hardware event. This is not a software error.
MCE 0
CPU 3 BANK 0
TIME 1459492990 Fri Apr  1 08:43:10 2016
MCG status:
MCi status:
Corrected error
Error enabled
MCA: Internal parity error
STATUS 90000040000f0005 MCGSTATUS 0
MCGCAP c09 APICID 6 SOCKETID 0
CPUID Vendor Intel Family 6 Model 60

What does that mean?

For info: the CPU is a "Xeon E3-1246V3".

trendco · Apr 20, 2016

Any ideas?
I have this problem nearly every day.

t.lamprecht · Apr 20, 2016

trendco said:
Hardware event. This is not a software error.

What motherboard do you use? Do you have the newest BIOS/UEFI updates installed for it?

trendco · Apr 20, 2016

http://www.supermicro.com/products/motherboard/Xeon/C220/X10SAE.cfm
Newest BIOS

t.lamprecht · Apr 20, 2016

trendco said:
supermicro

Hmm okay, there was a similar but more serious problem with their boards here a few weeks ago, but the user had no luck with Supermicro support :/

Do you run 32 bit VMs?
Looks like HSW131 from http://www.intel.com/content/dam/ww...cation-updates/xeon-e3-1200v3-spec-update.pdf (just search for it)

HSW131. Spurious Corrected Errors May be Reported

Problem: Due to this erratum, spurious corrected errors may be logged in the IA32_MC0_STATUS register with the valid field (bit 63) set, the uncorrected error field (bit 61) not set, a Model Specific Error Code (bits [31:16]) of 0x000F, and an MCA Error Code (bits [15:0]) of 0x0005. If CMCI is enabled, these spurious corrected errors also signal interrupts.

Implication: When this erratum occurs, software may see corrected errors that are benign. These corrected errors may be safely ignored.

Workaround: None identified.

Status: For the steppings affected, see the Summary Table

Your status value is "STATUS 90000040000f0005", which translates to binary:

Code:

1 0 0 1 | 0 0 0 0 | 0 0 0 0 | 0 0 0 0 | 0 0 0 0 | 0 0 0 0 | 0 1 0 0 | 0 0 0 0 | 0 0 0 0 | 0 0 0 0 | 0 0 0 0 | 1 1 1 1 | 0 0 0 0 | 0 0 0 0 | 0 0 0 0 | 0 1 0 1
↑   ↑                                                                         |-         bit[31:16] = 0x000F         -|-        bit[15:0] = 0x0005          -|
|   bit 61 not set
bit 63 set

Thus you may ignore this.
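
The same check can be done programmatically instead of by reading the bits by hand. The following is only a minimal sketch based on the bit positions quoted from the erratum above; the function names `decode_mci_status` and `matches_hsw131` are made up for illustration and are not part of mcelog or any other tool.

```python
def decode_mci_status(status: int) -> dict:
    """Extract the IA32_MCi_STATUS fields that the HSW131 text refers to."""
    return {
        "valid": bool((status >> 63) & 1),        # bit 63
        "uncorrected": bool((status >> 61) & 1),  # bit 61
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits [31:16]
        "mca_error_code": status & 0xFFFF,               # bits [15:0]
    }


def matches_hsw131(status: int) -> bool:
    """True if the status value has the signature described in HSW131."""
    fields = decode_mci_status(status)
    return (
        fields["valid"]
        and not fields["uncorrected"]
        and fields["model_specific_code"] == 0x000F
        and fields["mca_error_code"] == 0x0005
    )


if __name__ == "__main__":
    # Value taken from the "STATUS" line in the mcelog output above.
    status = int("90000040000f0005", 16)
    print(decode_mci_status(status))
    print(matches_hsw131(status))
```

Note that the erratum only talks about IA32_MC0_STATUS, so the comparison is only meaningful for events that mcelog reports with "BANK 0".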

trendco · Apr 20, 2016

Yes, I also run 32-bit systems: 1x Win-2000, 1x Win-XP.
The other systems are 64-bit.

Sorry for the question, but what does that mean now? I do not understand what exactly the problem is.

t.lamprecht · Apr 20, 2016

trendco said:
Sorry for the question, but what does that mean now? I do not understand what exactly the problem is.

Uh, I could have been a little clearer, sorry.

That means Intel CPUs also have bugs, and you had the bad luck of running into one. Fortunately, this one is as harmless as it gets, as Intel's own conclusion says.

They publish so-called "errata": documents that describe the known problems of a CPU model, hardware or software related, and explain how to fix or work around them where possible.

Your specific issue (HSW131) needs no intervention: it is reported as an internal parity error that was corrected. See your log, where every event is marked "Corrected error".
The "error" in the log is more of a notice for the user here, and for this specific problem it can be safely ignored.

I understand that it's a bit strange to simply ignore this (or any "error message"), but your model is clearly affected, the status value matches HSW131, and Intel has a good reputation regarding such errata, so it is safe to do so, imo.

trendco · Apr 20, 2016

Many thanks for this detailed answer.

Can this bug be fixed someday?
And if so, by the mainboard BIOS or by the kernel?

My problem now is this: I run an hourly cron job that checks dmesg and syslog for errors, and if it finds one, I get an email.
But the log only contains "mce: [Hardware Error]: Machine check events logged", so I cannot filter on it, because the detailed error is in mcelog.

Do you have any ideas?

t.lamprecht · Apr 20, 2016

trendco said:
Can this bug be fixed someday?
And if so, by the mainboard BIOS or by the kernel?

I do not know the internals of this bug, but I suspect that either it cannot be fixed by CPU microcode updates or Intel sees no purpose in fixing it (it affects nothing); otherwise Intel would have done so already and documented the solution in the erratum.

A fix in the kernel or the BIOS would only suppress this specific error message; it would not fix the underlying cause.

One way to solve this would be to filter that message anyway. To avoid missing a different and possibly dangerous MCE, also scan your mcelog output and send an email if any error other than this specific one is logged.
Not the nicest solution, but it should work.
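
As a rough illustration of that idea, here is a minimal Python sketch that splits mcelog text into records and returns only those that do not match the HSW131 signature. It assumes the text format shown in the first post (each record starts with a "TIME" line and contains "STATUS ..." and "BANK ..." lines). The helper names are made up for this example, and a record with a missing STATUS or BANK line is deliberately treated as unexpected so that nothing is silently dropped.

```python
import re


def split_records(log_text):
    """Split mcelog text into records; a new record starts at each 'TIME' line."""
    records, current = [], []
    for line in log_text.splitlines():
        if line.startswith("TIME ") and current:
            records.append(current)
            current = []
        if line.strip():
            current.append(line)
    if current:
        records.append(current)
    return records


def is_hsw131(record):
    """True only if the record is a bank 0 event with the HSW131 status signature."""
    status = bank = None
    for line in record:
        m = re.match(r"STATUS ([0-9a-fA-F]+)", line)
        if m:
            status = int(m.group(1), 16)
        m = re.search(r"\bBANK (\d+)", line)
        if m:
            bank = int(m.group(1))
    if status is None or bank != 0:
        return False  # incomplete or different bank: do not treat as benign
    valid = (status >> 63) & 1
    uncorrected = (status >> 61) & 1
    model_specific_code = (status >> 16) & 0xFFFF
    mca_error_code = status & 0xFFFF
    return (
        bool(valid)
        and not uncorrected
        and model_specific_code == 0x000F
        and mca_error_code == 0x0005
    )


def unexpected_records(log_text):
    """Return all records that are NOT the known benign HSW131 event."""
    return [r for r in split_records(log_text) if not is_hsw131(r)]
```

In a cron job you would read the mcelog output (which file that is depends on your setup, so the path is left out here), pass its content to `unexpected_records`, and send the email only if the returned list is not empty.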

trendco · Apr 20, 2016

OK, many thanks for the info.
