RAS: Soft-offlining pfn: 0x3bb1cc [Hardware Error]? Proxmox 8.0.4

Inlakesh

Proxmox version 8.0.4

Anyone know what can cause this?

I sometimes get this in the PVE GUI syslog:

Code:
Oct 17 08:13:05 pv2 pvedaemon[2802776]: <root@pam> successful auth for user 'root@pam'
Oct 17 08:13:10 pv2 kernel: RAS: Soft-offlining pfn: 0x3bb1cc
Oct 17 08:13:10 pv2 kernel: mce: [Hardware Error]: Machine check events logged
Oct 17 08:13:10 pv2 kernel: EDAC skx MC0: HANDLING MCE MEMORY ERROR
Oct 17 08:13:10 pv2 kernel: EDAC skx MC0: CPU 0: Machine Check Event: 0x0 Bank 7: 0x9c00004001010090
Oct 17 08:13:10 pv2 kernel: EDAC skx MC0: TSC 0x163d0a0d04c9a92
Oct 17 08:13:10 pv2 kernel: EDAC skx MC0: ADDR 0x3bb1cc480
Oct 17 08:13:10 pv2 kernel: EDAC skx MC0: MISC 0x200005c280001086
Oct 17 08:13:10 pv2 kernel: EDAC skx MC0: PROCESSOR 0:0x50657 TIME 1697505190 SOCKET 0 APIC 0x0
Oct 17 08:13:10 pv2 kernel: EDAC MC0: 1 CE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x3bb1cc offset:0x480 grain:32 syndrome:0x0 -  err_code:0x0101:0x0090 ProcessorSocketId:0x0 MemoryControllerId:0x0 PhysicalRankId:0x1 Row:0x3766 Column:0x118 Bank:0x0 BankGroup:0x2 retry_rd_err_log[0001a20d 00000000 00020000 0446204e 00003766] correrrcnt[0000 0002 0000 0000 0000 0000 0000 0000])

And this shows in the server's iKVM:
[Attached screenshot: 1697508121506.png]
 
EDAC is detecting memory errors, which can probably be detected and corrected because your system uses ECC memory. Find out which DIMM is having problems and replace it?
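
The EDAC line in the log already names a location, so a rough way to pull it out without rebooting is to parse that line. Below is a minimal Python sketch. The function name `parse_edac_line` and the regular expression are my own, written only against the log format shown in this thread, so they may need adjusting for other EDAC drivers. It also assumes 4 KiB pages (page shift 12), which is how the reported page 0x3bb1cc and offset 0x480 combine into ADDR 0x3bb1cc480.

```python
import re

# Log line copied from the syslog excerpt above (trimmed to the fields used here).
LINE = ("EDAC MC0: 1 CE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#0 "
        "(channel:0 slot:0 page:0x3bb1cc offset:0x480 grain:32 syndrome:0x0)")

# Hypothetical pattern, written against the message format seen in this thread only.
PATTERN = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>CE|UE) .*? on (?P<label>\S+) "
    r"\(channel:(?P<channel>\d+) slot:(?P<slot>\d+) "
    r"page:(?P<page>0x[0-9a-fA-F]+) offset:(?P<offset>0x[0-9a-fA-F]+)"
)

PAGE_SHIFT = 12  # assumption: 4 KiB pages


def parse_edac_line(line):
    """Return the error location fields from one EDAC message, or None."""
    m = PATTERN.search(line)
    if m is None:
        return None
    page = int(m["page"], 16)
    offset = int(m["offset"], 16)
    return {
        "mc": int(m["mc"]),
        "kind": m["kind"],          # CE = corrected, UE = uncorrected
        "label": m["label"],
        "channel": int(m["channel"]),
        "slot": int(m["slot"]),
        "page": page,
        "address": (page << PAGE_SHIFT) | offset,
    }


info = parse_edac_line(LINE)
print(info["kind"], info["label"], hex(info["address"]))
```

For the line above this reports a corrected error (CE) on memory controller 0, channel 0, slot 0. Which physical socket that corresponds to depends on the motherboard, so the board manual is the place to check before pulling a DIMM.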
 
Hi,

As leesteken said, it seems some memory module has (a few) bad bits.
You can run memtest86+ from the bootloader menu and let it run through, although that can take anywhere from a few hours to more than a day, depending on the amount of installed memory.

This will tell you exactly where the bad bits are, which you can then mask with either GRUB's BadRAM feature or directly on the kernel command line using the memmap parameter.
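
To illustrate the memmap variant, here is a small Python sketch that builds a `memmap=nn[KMG]$ss[KMG]` argument (the kernel syntax for marking a region as reserved) from a faulty physical address. The helper name `memmap_reserve_arg` is made up for this example, and it assumes 4 KiB pages. The address used is the one from the log in the first post, purely as an example; a real mask list should come from the memtest86+ results.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def memmap_reserve_arg(address, pages=1):
    """Build a kernel 'memmap=<size>$<start>' argument covering the page(s)
    that contain the given physical address."""
    start = address & ~(PAGE_SIZE - 1)      # round down to the page boundary
    size_kib = pages * PAGE_SIZE // 1024
    return f"memmap={size_kib}K${start:#x}"


# ADDR 0x3bb1cc480 from the log falls in the page starting at 0x3bb1cc000.
print(memmap_reserve_arg(0x3bb1cc480))  # memmap=4K$0x3bb1cc000
```

Note that the `$` may need escaping when the argument is placed in a GRUB configuration file, so check how it actually ends up in /proc/cmdline after rebooting.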
 
I wish I could do that, though the server is in production use, so minimal downtime is required. Is there any other way to find out which module is having this error?
 
Since it's a production system: Please don't assume that I know anything about this. I spent less than a minute searching the web about your error message.
 
No worries, I found some other sites that also touch on this problem, but I wanted to start a thread here on the Proxmox forum since the server is running Proxmox. I won't do anything drastic; I need to read up on this before turning the server off.
 
