mce: [Hardware Error] - Proxmox Random Crashes When Idle

beeschurger

New Member
Nov 5, 2024
My PVE host sometimes crashes at random and has to be rebooted physically.

Every 10-15 minutes my journalctl output shows this "mce: [Hardware Error]":

Dec 10 12:30:30 pve rasdaemon[9885]: rasdaemon: mce_record store: 0x785480021d58
Dec 10 12:30:30 pve kernel: mce: [Hardware Error]: Machine check events logged
Dec 10 12:30:30 pve kernel: [Hardware Error]: Corrected error, no action required.
Dec 10 12:30:30 pve kernel: [Hardware Error]: CPU:0 (19:21:2) MC27_STATUS[-|CE|MiscV|-|-|-|SyndV|-|-|-]: 0x982000000002080b
Dec 10 12:30:30 pve kernel: [Hardware Error]: IPID: 0x0001002e00000500, Syndrome: 0x000000005a020005
Dec 10 12:30:30 pve kernel: [Hardware Error]: Power, Interrupts, etc. Ext. Error Code: 2
Dec 10 12:30:30 pve kernel: [Hardware Error]: cache level: L3/GEN, mem/io: IO, mem-tx: GEN, part-proc: SRC (no timeout)
Dec 10 12:30:30 pve kernel: rasdaemon[9887]: segfault at 0 ip 000078552b40ef05 sp 000078552affbbb0 error 4 in libsqlite3.so.0.8.6[78552b348000+f4000] likely on CPU 6 (core 8, socket 0)
Dec 10 12:30:30 pve kernel: Code: f7 47 14 00 90 0f 85 42 33 00 00 41 bf 04 00 00 00 66 44 89 7f 14 48 8b 44 24 08 41 bc 08 00 00 00 66 44 89 67 14 48 8b 40 10 <f2> 0f 10 00 f2 0f 11 07 e9 f6 d9 ff ff 48 8b 44 24 08 4c 89 ef 48
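
For anyone reading along, the raw MC27_STATUS and IPID values in the log above can be split into their fields by hand. The sketch below is only an illustration: the bit positions are taken from the Linux kernel's MCI_STATUS_* definitions and the AMD scalable-MCA register layout as I understand them, and the function names `decode_status` and `decode_ipid` are made up for this example, so please verify against the kernel source or AMD's documentation before relying on it.

```python
# Bit positions assumed from the Linux kernel's MCI_STATUS_* definitions
# (arch/x86/include/asm/mce.h). Treat them as assumptions to verify.
STATUS_FLAGS = {
    63: "Val",       # register contents are valid
    62: "Over",      # overflow
    61: "UC",        # uncorrected error
    60: "En",        # error reporting enabled
    59: "MiscV",     # MISC register valid
    58: "AddrV",     # ADDR register valid
    57: "PCC",       # processor context corrupt
    55: "TCC",       # task context corrupt (AMD)
    53: "SyndV",     # syndrome register valid (AMD)
    44: "Deferred",  # deferred error (AMD)
}


def decode_status(status: int) -> dict:
    """Split an MCA status value into flag names and error codes."""
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        "ext_error_code": (status >> 16) & 0x3F,  # bits 21:16
        "error_code": status & 0xFFFF,            # bits 15:0
    }


def decode_ipid(ipid: int) -> dict:
    """Split an MCA IPID value into its three fields (assumed layout)."""
    return {
        "mca_type": (ipid >> 48) & 0xFFFF,    # bits 63:48
        "hardware_id": (ipid >> 32) & 0xFFF,  # bits 43:32
        "instance_id": ipid & 0xFFFFFFFF,     # bits 31:0
    }


if __name__ == "__main__":
    print(decode_status(0x982000000002080B))
    print(decode_ipid(0x0001002E00000500))
```

With the values from the log, the UC bit is clear (which fits the kernel's "Corrected error" line), MiscV and SyndV are set as in the bracketed summary, and the extended error code comes out as 2, matching "Ext. Error Code: 2".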

The crashes have never happened while I was using the machine; I only ever notice them when I'm away from home. The PVE installation is fairly new, and this is my first time trying Proxmox.
I pass through an NVIDIA GTX 1060 GPU to a Windows 10 VM for cloud gaming, and an HDD to an OpenMediaVault VM that serves as an SMB/NFS share for my VMs and CTs.

Most of my VMs and CTs are deployed using Proxmox Helper Scripts, if that helps.

ras-mc-ctl --summary shows no errors:

root@pve:~# ras-mc-ctl --summary
No Memory errors.

No PCIe AER errors.

No Extlog errors.

No MCE errors.

No dimm labels for Micro-Star International Co., Ltd. model B550-A PRO (MS-7C56)
root@pve:~# systemctl status ras-mc-ctl.service
● ras-mc-ctl.service - Initialize EDAC v3.0.0 Drivers For Machine Hardware
Loaded: loaded (/lib/systemd/system/ras-mc-ctl.service; enabled; preset: enabled)
Active: active (exited) since Tue 2024-12-10 12:14:08 CET; 3h 9min ago
Process: 932 ExecStart=/usr/sbin/ras-mc-ctl --register-labels (code=exited, status=0/SUCCESS)
Main PID: 932 (code=exited, status=0/SUCCESS)
CPU: 23ms

Dec 10 12:14:08 pve systemd[1]: Starting ras-mc-ctl.service - Initialize EDAC v3.0.0 Drivers For Machine Hardware...
Dec 10 12:14:08 pve ras-mc-ctl[932]: ras-mc-ctl: Error: No dimm labels for Micro-Star International Co., Ltd. model B550-A PRO (MS-7C56)
Dec 10 12:14:08 pve systemd[1]: Finished ras-mc-ctl.service - Initialize EDAC v3.0.0 Drivers For Machine Hardware.

edac-util: Error: No memory controller data found.
root@pve:~# ras-mc-ctl --status
ras-mc-ctl: drivers not loaded.
root@pve:~# ras-mc-ctl --register
ras-mc-ctl: Error: No dimm labels for Micro-Star International Co., Ltd. model B550-A PRO (MS-7C56)
root@pve:~# dmesg | grep -i edac
[ 0.713674] EDAC MC: Ver: 3.0.0
root@pve:~# edac-util
edac-util: Error: No memory controller data found.

The dmesg output shows the same error over and over. I don't know whether I misconfigured Proxmox or my hardware is faulty, so I'm asking for your help.

Here is my current configuration:

CPU: Ryzen 9 5900x
MOBO: MSI B550-A PRO (BIOS 7C56vAI)
RAM: Crucial Pro 4x16GB DDR4 3200 MHz CL22 (had the issue even with 2x16GB) - XMP 2.0 activated in BIOS
GPU: NVIDIA GTX 1060 6GB
PSU: EVGA 750w 80+ bronze
STORAGE:
- 1x500GB Samsung SSD 870 EVO
- 1x1000GB Samsung SSD 860 EVO
- 1x500GB Western Digital Blue HDD

BIOS CONFIG:

- XMP 2.0
- to be added when I get home

BIOS monitoring shows:

CPU CORE: 1.420V
CPU NB/SOC: 1.004V
CPU VDDP: N/A
CPU 1P8: 1.822V
DRAM: 1.184V
CHIPSET CORE: 1.048V
SYSTEM 12V: 12.216V
SYSTEM 5V: 5.020V

------

What I did so far:

- Ran MemTest86 for 8+ hours; all tests passed with no errors
- Plugged the server directly into mains power instead of the UPS
- Installed amd64-microcode
- Shut down all CTs and VMs

I'm not 100% sure this is the cause of the random crashes, but the logs show that the last message just before a crash was the infamous "mce: [Hardware Error]".
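
One way to check how the MCE messages are spaced, and whether one really is the last thing logged before a crash, is to pull their timestamps out of saved journal text (for example the output of `journalctl -k -b -1`, assuming persistent journaling is enabled). The following is a minimal sketch, not a finished tool: the helper names `mce_timestamps` and `gaps_in_seconds` are invented for this example, and the year has to be passed in because syslog-style timestamps do not include one.

```python
from datetime import datetime

# The kernel message that marks a logged machine check event.
MARKER = "mce: [Hardware Error]: Machine check events logged"


def mce_timestamps(journal_lines, year):
    """Return the timestamps of all lines that contain MARKER.

    Syslog-style stamps carry no year, so the caller supplies one.
    """
    stamps = []
    for line in journal_lines:
        if MARKER in line:
            stamp = " ".join(line.split()[:3])  # e.g. "Dec 10 12:30:30"
            stamps.append(
                datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
            )
    return stamps


def gaps_in_seconds(stamps):
    """Return the gap, in seconds, between consecutive timestamps."""
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    # Sample lines copied from the logs quoted in this thread.
    sample = [
        "Dec 10 12:30:30 pve kernel: mce: [Hardware Error]: Machine check events logged",
        "Dec 10 12:30:30 pve kernel: [Hardware Error]: Corrected error, no action required.",
        "Dec 10 19:10:05 pve kernel: mce: [Hardware Error]: Machine check events logged",
    ]
    print(gaps_in_seconds(mce_timestamps(sample, 2024)))
```

Fed a full journal export instead of the three sample lines, the list of gaps should show whether the events really arrive every 10-15 minutes.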
 
UPDATE:

After disabling XMP 2.0 in the BIOS, the error spam seems to be gone. Since the last reboot I have only seen this one:

Dec 10 19:10:05 pve kernel: microcode: Current revision: 0x0a20120e
Dec 10 19:10:05 pve kernel: mce: [Hardware Error]: Machine check events logged
Dec 10 19:10:05 pve kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 27: d82000000002080b
Dec 10 19:10:05 pve kernel: fbcon: Taking over console
Dec 10 19:10:05 pve kernel: mce: [Hardware Error]: TSC 0 MISC d01200b500000000 SYND 5a020005 IPID 1002e00000500
Dec 10 19:10:05 pve kernel: mce: [Hardware Error]: PROCESSOR 2:a20f12 TIME 1733854200 SOCKET 0 APIC 0 microcode a20120e
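
To see how this boot-time record relates to the earlier ones, the two Bank 27 status values can be compared bit by bit, and the TIME field converted to a readable date. This is a rough sketch: the helper names are made up for the example, and the reading of bit 62 as the overflow flag is taken from the kernel's MCI_STATUS_OVER definition, so treat the interpretation as an assumption.

```python
from datetime import datetime, timezone

EARLIER_STATUS = 0x982000000002080B  # from the 12:30:30 MC27_STATUS line
BOOT_STATUS = 0xD82000000002080B     # from the 19:10:05 "Bank 27" line
RECORD_TIME = 1733854200             # TIME field of the boot-time record


def differing_bits(a, b):
    """Return the bit positions (0 to 63) at which two values differ."""
    diff = a ^ b
    return [bit for bit in range(64) if (diff >> bit) & 1]


def record_time_utc(epoch_seconds):
    """Convert the record's TIME field (Unix seconds) to a UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


if __name__ == "__main__":
    print(differing_bits(EARLIER_STATUS, BOOT_STATUS))  # [62]
    print(record_time_utc(RECORD_TIME).isoformat())     # 2024-12-10T18:10:00+00:00
```

The only difference is bit 62, which the kernel treats as the overflow flag, so this appears to be the same kind of error as before with at least one more occurrence recorded on top of it. The TIME value is 18:10:00 UTC, i.e. 19:10:00 CET, which lines up with the boot messages around it. Together with "TSC 0", that would be consistent with a record the kernel found in the bank while booting rather than a fresh error after boot, though I can't confirm that from the log alone.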

Is it safe to ignore this if it doesn't cause any major issues?
 
