ECC memory issue?

sjevtic

New Member
Nov 24, 2024
I've been getting a set of messages like this every 2 days or so on my PVE host:

Code:
{192 pve-beast ~} # dmesg -T | grep -Pi 'edac|mce'
# ...
[Sat Feb  8 00:21:53 2025] mce: [Hardware Error]: Machine check events logged
[Sat Feb  8 23:54:29 2025] mce: [Hardware Error]: Machine check events logged
[Sat Feb  8 23:54:29 2025] EDAC skx MC3: HANDLING MCE MEMORY ERROR
[Sat Feb  8 23:54:29 2025] EDAC skx MC3: CPU 28: Machine Check Event: 0x0 Bank 16: 0x8c000040000800c1
[Sat Feb  8 23:54:29 2025] EDAC skx MC3: TSC 0x8d672e97ad5a2
[Sat Feb  8 23:54:29 2025] EDAC skx MC3: ADDR 0x83e2aa95c0
[Sat Feb  8 23:54:29 2025] EDAC skx MC3: MISC 0x910800080008086
[Sat Feb  8 23:54:29 2025] EDAC skx MC3: PROCESSOR 0:0x50657 TIME 1739080479 SOCKET 1 APIC 0x40
[Sat Feb  8 23:54:29 2025] EDAC MC3: 1 CE memory scrubbing error on CPU_SrcID#1_MC#1_Chan#1_DIMM#0 (channel:1 slot:0 page:0x83e2aa9 offset:0x5c0 grain:32 syndrome:0x0 -  err_code:0x0008:0x00c1 ProcessorSocketId:0x1 MemoryControllerId:0x1 PhysicalRankId:0x2 Row:0xbc19 Column:0x238 Bank:0x3 BankGroup:0x1 retry_rd_err_log[0001b20d 00000000 00000002 048ed080 0000bc19] correrrcnt[0000 0000 000e 0000 0000 0000 0000 0000])

This looks like an ECC error. In each case the ADDR value is the same, as is the plaintext location description (CPU_SrcID#1_MC#1_Chan#1_DIMM#0).
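As a sanity check, the status word on the "Bank 16" line can be decoded by hand. The following is a minimal sketch, assuming the architectural IA32_MCi_STATUS bit layout described in the Intel SDM (including the corrected-error count in bits 52:38) and 4 KiB pages; the function name decode_status is made up for illustration.

```python
# Decode an IA32_MCi_STATUS value (assumed architectural layout, Intel SDM).
STATUS = 0x8C000040000800C1  # value from the "Bank 16" line
ADDR = 0x83E2AA95C0          # value from the "ADDR" line

# Memory-controller MCA codes follow the pattern 0000 0000 1MMM CCCC,
# where MMM is the transaction type and CCCC is the channel number.
MMM_NAMES = {
    0b000: "generic undefined request",
    0b001: "memory read error",
    0b010: "memory write error",
    0b011: "address/command error",
    0b100: "memory scrubbing error",
}


def decode_status(status):
    """Split a 64-bit machine-check status word into named fields."""
    mca_code = status & 0xFFFF
    is_memctrl = (mca_code & 0xFF80) == 0x0080
    return {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": mca_code,
        "type": MMM_NAMES.get((mca_code >> 4) & 0x7, "reserved")
        if is_memctrl else "not a memory-controller code",
        "channel": (mca_code & 0xF) if is_memctrl else None,
    }


fields = decode_status(STATUS)
page, offset = ADDR >> 12, ADDR & 0xFFF  # assumes 4 KiB pages
print(fields)
print(f"page:{page:#x} offset:{offset:#x}")
```

Under those assumptions the word decodes to a valid, corrected (not uncorrected) error with type "memory scrubbing error" on channel 1, which agrees with the err_code:0x0008:0x00c1 and page/offset fields in the EDAC line.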

The results of my Google searches on this topic have been clear as mud. There appear to be a couple of relevant Red Hat support articles (it's not really a PVE-specific matter), but I don't have access to them. I have two questions:

1. What's a "memory scrubbing error"? Is this somehow different than other types of ECC errors?
2. How do I map the information provided onto a physical DIMM slot? From what I've read, this seems to be very hardware dependent, and hardware vendors don't seem to use uniform nomenclature for DIMM slots. The CPUs are Intel Skylake; the box is an HP Z8 G4, and HP's DIMM slot nomenclature is below.

[Image: 1739123749870.png, HP Z8 G4 DIMM slot nomenclature]
I've also installed the edac-utils package, only to get no useful information out of edac-util. Even after setting EDAC_DRIVER=skx_edac in /etc/default/edac and restarting the service, I still don't get anything useful:

Code:
{194 pve-beast ~} # edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc2: 0 Uncorrected Errors with no DIMM info
mc2: 0 Corrected Errors with no DIMM info
mc3: 0 Uncorrected Errors with no DIMM info
mc3: 0 Corrected Errors with no DIMM info
edac-util: No errors to report.

I'd like to figure out what DIMM to replace without trial and error.
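Before pulling modules, it may help to confirm from the kernel log that every corrected error names the same location. This is a rough sketch, assuming the "EDAC MCn: ... error on ..." line format shown in the dmesg output above; the regular expression and the function name tally are my own and may need adjusting for other kernels or drivers.

```python
import re
from collections import Counter

# Matches the summary line, e.g.
# "EDAC MC3: 1 CE memory scrubbing error on CPU_SrcID#1_... (channel:1 ... page:0x... offset:0x..."
EDAC_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>CE|UE) (?P<msg>.+?) "
    r"on (?P<label>\S+) \(.*?page:(?P<page>0x[0-9a-fA-F]+) "
    r"offset:(?P<offset>0x[0-9a-fA-F]+)"
)


def tally(lines):
    """Sum reported error counts per (kind, label, page, offset)."""
    totals = Counter()
    for line in lines:
        m = EDAC_LINE.search(line)
        if m:
            key = (m["kind"], m["label"], m["page"], m["offset"])
            totals[key] += int(m["count"])
    return totals
```

One way to use it is to pipe `dmesg -T` into a script that calls `tally(sys.stdin)` and prints the result. A single key in the output would suggest one failing cell or row on one module rather than a wider problem.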

Thanks.
 
DIMM "16" in the picture.
There was actually no DIMM in this slot.

It looks like it was CPU1-DIMM3: I haven't seen the error for almost a week since replacing that module. That said, it still isn't clear how to map identifiers like CPU_SrcID#1_MC#1_Chan#1_DIMM#0 onto physical DIMM slots, short of booting the system with 2 DIMMs at a time and checking the output of a ras-mc-ctl modified to print the default labels even when no label mapping is set up. If anyone has a DIMM label file for the HP Z8 G4 that ras-mc-ctl can parse, I would appreciate it.
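For anyone trying the populate-a-few-DIMMs approach, the per-DIMM attributes the kernel exposes under the EDAC sysfs tree can be read directly. This is a minimal sketch, assuming the standard layout /sys/devices/system/edac/mc/mcN/dimmM/ with the attribute files dimm_label, dimm_location, size, dimm_ce_count and dimm_ue_count; the function name list_dimms is hypothetical, and any file that is missing is reported as None.

```python
from pathlib import Path


def _read(directory, name):
    """Return the stripped contents of an attribute file, or None if absent."""
    f = directory / name
    return f.read_text().strip() if f.is_file() else None


def list_dimms(edac_root="/sys/devices/system/edac/mc"):
    """Collect label, location, size and error counts for each EDAC DIMM entry."""
    rows = []
    for dimm in sorted(Path(edac_root).glob("mc*/dimm*")):
        rows.append({
            "mc": dimm.parent.name,
            "dimm": dimm.name,
            "label": _read(dimm, "dimm_label"),
            "location": _read(dimm, "dimm_location"),
            "size_mb": _read(dimm, "size"),
            "ce_count": _read(dimm, "dimm_ce_count"),
            "ue_count": _read(dimm, "dimm_ue_count"),
        })
    return rows
```

Printing `list_dimms()` with only a known pair of slots populated shows which mcN/dimmM entries appear, and that can be noted against HP's slot numbers one pair at a time. It does not remove the trial and error, but it avoids having to patch ras-mc-ctl.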
 
We see this problem a couple of times per year (across more than 100 machines). Hardware support always suggests updating the BIOS and management controller firmware and then cold-booting the machine. After that, either the memory errors go away or the memory module is diagnosed as faulty and has to be replaced.
 
I did multiple cold boots as well as a firmware update in this case with no change in behavior.