"mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ..."

chudak

Found my NUC with Proxmox installed in an unresponsive state today (first time ever after 2 weeks of use).

On reboot I see these errors:
mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: xxx
mce: [Hardware Error]: TSC 0 ADDR fef1ce80 MISC xxx
mce: [Hardware Error]: PROCESSOR 0:a0660 TIME xxx SOCKET 0
APIC 0 microcode ca
(see attached pic - https://i.imgur.com/LYsQyyN.png)
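
For anyone who wants to pull these lines out of the kernel log instead of reading them off the boot screen, here is a minimal sketch. The function names `mce_lines` and `mce_banks` are made up for illustration, and it assumes the log lines have the same shape as the ones quoted above.

```shell
#!/bin/sh
# Hypothetical helpers (names are not from the thread); both read kernel log text from stdin.

# Keep only machine-check lines (fixed-string match, no regex).
mce_lines() {
    grep -F 'mce: [Hardware Error]' || true
}

# Print the bank number from each "Machine Check: N Bank M:" line.
mce_banks() {
    sed -n 's/.*Machine Check: [0-9]* Bank \([0-9]*\):.*/\1/p'
}

# Usage on the host (not executed here):
#   journalctl -k -b | mce_lines
#   journalctl -k -b | mce_banks | sort -n | uniq -c
```

If the same bank shows up on every boot, that is at least a consistent data point to include in a report.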


The box booted and seems normal so far, but I see those errors on boot.
A quick memory test did not show any problems so far.

rasdaemon -f and journalctl -f show no obvious problems.

==========================
root@pve:~# numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11
node 0 size: 64036 MB
node 0 free: 42377 MB
node distances:
node 0
0: 10
root@pve:~# ras-mc-ctl --errors
No Memory errors.

No PCIe AER errors.

No Extlog errors.

No MCE errors.

===========================
root@pve:~# ras-mc-ctl --errors
No Memory errors.

No PCIe AER errors.

No Extlog errors.

No MCE errors.

root@pve:~# numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11
node 0 size: 64036 MB
node 0 free: 42345 MB
node distances:
node 0
0: 10
================


I run an Intel NUC 10 i7 (BXNUC10i7FNH)
Here is my CPU info https://pastebin.com/MpXedi1h

Anybody had experience with such errors ? Bad RAM, motherboard ?
Can it be benign?

Thx in advance!
 

Attachment: CPU_ERRORS.png
Other than the fact that I noticed this after pve was unresponsive (which could be coincidental and unrelated to h/w errors), I see no issues running pve.
 
Hi,

I guess the NUC gen 10 is too new and has some problems.
But if you would like to prove that the memory and CPU are OK, run stress-ng for 30 minutes:

Code:
stress-ng --cpu 6 --vm 6 --verify 1 --vm-bytes 80%

If this test does not crash, the likelihood is high that the NUC will work without problems.
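
A possible way to time-box the run and check the kernel log afterwards, as a rough sketch only. The helper names `stress_cmd` and `count_hw_errors` are hypothetical. Adding `--timeout` and writing `--verify` as a plain flag are my assumptions about a current stress-ng, not part of the command quoted above.

```shell
#!/bin/sh
# Hypothetical helper: print a stress-ng command line limited to N minutes (default 30).
# Assumption: --timeout accepts an "m" suffix and --verify takes no argument.
stress_cmd() {
    minutes="${1:-30}"
    printf 'stress-ng --cpu 6 --vm 6 --vm-bytes 80%% --verify --timeout %sm\n' "$minutes"
}

# Hypothetical helper: count "[Hardware Error]" lines in kernel log text read from stdin.
count_hw_errors() {
    grep -c -F '[Hardware Error]' || true
}

# Usage on the host (not executed here):
#   eval "$(stress_cmd 30)"
#   journalctl -k -b | count_hw_errors
```

A count of 0 after the run only says that nothing was logged during this boot; it does not prove the hardware is fine.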
 
@wolfgang

Thank you for the good practical advice!

I ran that for ~40 min with 100% CPU
This is what I saw in the log:

Sep 09 08:04:30 pve kernel: e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
TDH <7f>
TDT <93>
next_to_use <93>
next_to_clean <7e>
buffer_info[next_to_clean]:
time_stamp <102670de1>
next_to_watch <7f>
jiffies <1026716a8>
next_to_watch.status <0>
MAC Status <40080083>
PHY Status <796d>
PHY 1000BASE-T Status <3800>
PHY Extended Status <3000>
PCI Status <10>

This NUC is still replaceable. Would you suggest replacing it, or do you suspect it's more of a generic issue?
 
@evg32

Thank you !

What is interesting is that after running stress-ng I did not see "mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6:" during boot.

Have you seen an error like mine above too? In other words, I want to understand why your solution is good for me. (I wish I were an expert in this area :) )

(
I did try it, and it seems much more output was generated (without GRUB_CMDLINE_LINUX_DEFAULT="quiet"). I did not see my error, but saw
pve kernel: [ 5.659369] Bluetooth: hci0: Failed to load Intel firmware file (-2)
...
/var/log/syslog:Sep 9 09:18:34 pve kernel: [ 6.616670] Bluetooth: hci0: Failed to load Intel firmware file (-2)
/var/log/syslog:Sep 9 09:18:34 pve systemd[1]: apparmor.service: Failed with result 'exit-code'.

Those may be unrelated to this at all, just guessing ...
)

The main problem I am trying to assess now is whether my NUC h/w is bad and needs to be replaced. Based on your post, it sounds like it is not h/w related, correct?

Thanks again !
 
I noticed that the mce errors occurred randomly; I couldn't correlate them with anything.
Yep, I saw the same errors except that my CPU was i9-9900K.
That's a CPU bug, as described here https://bugzilla.kernel.org/show_bug.cgi?id=109051

You are lucky with the i9-9900K. I got an i9-9880H from Hystou and could not even set it up, so I returned it and then got an i7.

OK I will not replace the NUC then.

Thank you !
 
Sep 09 08:04:30 pve kernel: e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
Ignore this.
This stress test is not comparable with normal load, and it is normal that other, untested parts get no resources and run into errors.

Have you tried installing "intel-microcode" (apt install intel-microcode) ?
As long as you keep the BIOS of your NUC up to date, the intel-microcode package will bring no benefit.
The microcode in Debian's package also comes from Intel, and Intel does a good job of keeping the NUC firmware up to date.
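
To see whether a BIOS update or the package actually changes anything, one option is to compare the loaded microcode revision before and after. This is a minimal sketch; the function name `microcode_rev` is made up, and it assumes the usual `microcode : 0x..` lines in /proc/cpuinfo (the MCE line in the first post shows "microcode ca").

```shell
#!/bin/sh
# Hypothetical helper: print the first microcode revision found in
# /proc/cpuinfo-style text read from stdin.
microcode_rev() {
    awk -F': ' '/^microcode/ { print $2; exit }'
}

# Usage on the host (not executed here):
#   microcode_rev < /proc/cpuinfo      # note the value
#   apt install intel-microcode        # or update the BIOS, then reboot
#   microcode_rev < /proc/cpuinfo      # compare with the earlier value
```

If the value is the same before and after, the package did not load anything newer than what the BIOS already provides.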
 
Nope, I never touched the sound configs. I just needed stable VMs and a stable host server.
 
In my case, the issue came up when trying to install Proxmox, and it was an easy fix.
It turns out that the firmware of a Kingston NV2 M.2 drive is not compatible.
When I swapped it for a Samsung 980, all worked fine.
Hope this helps someone.
Kind regards.
 
