CPU errors AMD EPYC 7502P

achirkov

Hello, I have a server (on hetzner) with AMD EPYC 7502P CPU.
Motherboard:
Code:
    Product Name: KRPA-U16 Series
    Version: Rev 1.xx
Code:
     # pveversion

    pve-manager/8.2.7/3e0176e6bb2ade3b (running kernel: 6.8.12-4-pve)

I am currently running 3 LXC containers (Ubuntu 20.04), and errors have started to appear in dmesg:
Code:
[153996.660148] [Hardware Error]: CPU:0 (17:31:0) MC17_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
[153996.660683] [Hardware Error]: Error Addr: 0x00000007b9493840
[153996.661203] [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x0b000b000a801202
[153996.661730] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0
[153996.661835] EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x1f3524e offset:0x40 grain:64 syndrome:0xb00)
[153996.662875] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
[154324.297911] mce: [Hardware Error]: Machine check events logged
[154324.298839] [Hardware Error]: Corrected error, no action required.
I have installed the latest versions of all packages via apt, and amd64-microcode is installed. Did I miss something, or does the server have a hardware problem?
 
Hi,

Or the server has a hardware problem?
Most likely, at least the messages seem to indicate some faulty RAM.
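To illustrate how the MC17_STATUS value in the log can be read, here is a minimal Python sketch that decodes the flag bits. The bit positions are an assumption taken from the Linux kernel's MCA status definitions for AMD, and `decode_status` is a hypothetical helper name, so treat it as a rough aid rather than an authoritative decoder.

```python
# Flag bit positions in an AMD MCA status register.
# Assumption: these mirror the Linux kernel's MCI_STATUS_* definitions;
# verify against your kernel source or the AMD PPR before relying on them.
STATUS_FLAGS = {
    63: "Val", 62: "Over", 61: "UC", 60: "En", 59: "MiscV", 58: "AddrV",
    57: "PCC", 55: "TCC", 53: "SyndV", 46: "CECC", 45: "UECC",
    44: "Deferred", 43: "Poison",
}


def decode_status(status: int) -> list[str]:
    """Return the names of the flag bits set in `status`, highest bit first."""
    return [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]


if __name__ == "__main__":
    # Status value copied from the dmesg output in this thread.
    flags = decode_status(0xDC2040000000011B)
    print(flags)
    # UC clear means the error was corrected; CECC marks a corrected ECC error.
    print("corrected ECC error:", "UC" not in flags and "CECC" in flags)
```

For the value from the log, the UC (uncorrected) bit is clear and CECC is set, which fits the "Corrected error, no action required" line: ECC fixed the error, but repeated corrected errors on the same DIMM are still worth reporting.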

I'd suggest running memtest86+ for a while to test the RAM thoroughly; that is never a bad idea. You can run it either directly from the boot menu or from the ISO. You'll need to disable Secure Boot for it, though, if you have that enabled on your server.
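Until a memtest run can be scheduled, the corrected/uncorrected error counters that the EDAC driver exposes in sysfs can be watched from the running system. A minimal Python sketch follows, assuming the standard `/sys/devices/system/edac/mc` layout with `ce_count` and `ue_count` files per memory controller; `read_edac_counts` is a hypothetical helper name.

```python
from pathlib import Path


def read_edac_counts(root: str = "/sys/devices/system/edac/mc") -> dict[str, dict[str, int]]:
    """Collect corrected (ce) and uncorrected (ue) error counts per memory controller.

    Assumption: the EDAC driver is loaded and exposes mc0, mc1, ... under `root`.
    Returns an empty dict if no controllers are found.
    """
    counts: dict[str, dict[str, int]] = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        entry: dict[str, int] = {}
        for name in ("ce_count", "ue_count"):
            counter = mc / name
            if counter.is_file():
                entry[name] = int(counter.read_text().strip())
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    for mc, entry in read_edac_counts().items():
        print(mc, entry)
```

A `ce_count` that keeps climbing on one controller supports the faulty-RAM suspicion; it does not replace a full memtest86+ pass.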
 
It looks HW related. Have you checked for the latest BIOS version?
If RAM is in fact good, you are probably looking at MB/CPU.
How long have you been running this server?
 
The server was rented a week ago from Hetzner. I thought the problem was in the system packages, but since it looks like a hardware problem, I'll write to their technical support. Thanks for your help.