mce: [Hardware Error] - Proxmox Random Crashes When Idle

beeschurger

New Member
Nov 5, 2024
My PVE host occasionally crashes at random and needs a physical reboot.

Every 10-15 minutes my journalctl log shows this "mce: [Hardware Error]":

Dec 10 12:30:30 pve rasdaemon[9885]: rasdaemon: mce_record store: 0x785480021d58
Dec 10 12:30:30 pve kernel: mce: [Hardware Error]: Machine check events logged
Dec 10 12:30:30 pve kernel: [Hardware Error]: Corrected error, no action required.
Dec 10 12:30:30 pve kernel: [Hardware Error]: CPU:0 (19:21:2) MC27_STATUS[-|CE|MiscV|-|-|-|SyndV|-|-|-]: 0x982000000002080b
Dec 10 12:30:30 pve kernel: [Hardware Error]: IPID: 0x0001002e00000500, Syndrome: 0x000000005a020005
Dec 10 12:30:30 pve kernel: [Hardware Error]: Power, Interrupts, etc. Ext. Error Code: 2
Dec 10 12:30:30 pve kernel: [Hardware Error]: cache level: L3/GEN, mem/io: IO, mem-tx: GEN, part-proc: SRC (no timeout)
Dec 10 12:30:30 pve kernel: rasdaemon[9887]: segfault at 0 ip 000078552b40ef05 sp 000078552affbbb0 error 4 in libsqlite3.so.0.8.6[78552b348000+f4000] likely on CPU 6 (core 8, socket 0)
Dec 10 12:30:30 pve kernel: Code: f7 47 14 00 90 0f 85 42 33 00 00 41 bf 04 00 00 00 66 44 89 7f 14 48 8b 44 24 08 41 bc 08 00 00 00 66 44 89 67 14 48 8b 40 10 <f2> 0f 10 00 f2 0f 11 07 e9 f6 d9 ff ff 48 8b 44 24 08 4c 89 ef 48
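
For anyone who wants to read the raw values rather than the kernel's summary, here is a minimal Python sketch that unpacks the MC27_STATUS and IPID numbers from the log above. The bit positions are assumptions based on the standard x86 MCA status layout (bits 63 to 57) and on how the Linux AMD MCE decoder appears to treat the AMD-specific fields (SyndV, Deferred, the extended error code and the IPID fields); the function names are made up for this example.

```python
# Flag bits in an MCA status register.
# Bits 63..57 follow the architectural x86 layout; SyndV (53) and
# Deferred (44) are AMD-specific and are assumptions in this sketch.
STATUS_FLAGS = {
    63: "Val",
    62: "Over",
    61: "UC",
    60: "En",
    59: "MiscV",
    58: "AddrV",
    57: "PCC",
    53: "SyndV",
    44: "Deferred",
}


def decode_status(status):
    """Split a 64-bit MCA status value into flags and error codes."""
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        "ext_error_code": (status >> 16) & 0x3F,  # assumed bits 21:16
        "error_code": status & 0xFFFF,            # low 16 bits
    }


def decode_ipid(ipid):
    """Extract the bank type fields from an IPID value (assumed layout)."""
    return {
        "mca_type": (ipid >> 48) & 0xFFFF,
        "hardware_id": (ipid >> 32) & 0xFFF,
    }


status = decode_status(0x982000000002080B)
print(status["flags"])                # ['Val', 'En', 'MiscV', 'SyndV']
print(status["ext_error_code"])       # 2
print(hex(status["error_code"]))      # 0x80b
print(decode_ipid(0x0001002E00000500))  # {'mca_type': 1, 'hardware_id': 46}
```

The result lines up with the kernel's own decoding: UC is clear (so it prints "CE"), MiscV and SyndV are set, and the extended error code is 2, the same value shown in the "Power, Interrupts, etc." line.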

The crashes never happen while I'm using the host; I only notice them when I'm away from home. The PVE installation is fairly new, and this is my first time trying Proxmox.
I pass through an NVIDIA GTX 1060 GPU to a Windows 10 VM for cloud gaming, and an HDD to an OpenMediaVault VM that serves as an SMB/NFS share for my VMs and CTs.

Most of my VMs and CTs are deployed using Proxmox Helper Scripts, if that helps.

ras-mc-ctl --summary shows no errors:

root@pve:~# ras-mc-ctl --summary
No Memory errors.

No PCIe AER errors.

No Extlog errors.

No MCE errors.

No dimm labels for Micro-Star International Co., Ltd. model B550-A PRO (MS-7C56)
root@pve:~# systemctl status ras-mc-ctl.service
● ras-mc-ctl.service - Initialize EDAC v3.0.0 Drivers For Machine Hardware
Loaded: loaded (/lib/systemd/system/ras-mc-ctl.service; enabled; preset: enabled)
Active: active (exited) since Tue 2024-12-10 12:14:08 CET; 3h 9min ago
Process: 932 ExecStart=/usr/sbin/ras-mc-ctl --register-labels (code=exited, status=0/SUCCESS)
Main PID: 932 (code=exited, status=0/SUCCESS)
CPU: 23ms

Dec 10 12:14:08 pve systemd[1]: Starting ras-mc-ctl.service - Initialize EDAC v3.0.0 Drivers For Machine Hardware...
Dec 10 12:14:08 pve ras-mc-ctl[932]: ras-mc-ctl: Error: No dimm labels for Micro-Star International Co., Ltd. model B550-A PRO (MS-7C56)
Dec 10 12:14:08 pve systemd[1]: Finished ras-mc-ctl.service - Initialize EDAC v3.0.0 Drivers For Machine Hardware.

edac-util: Error: No memory controller data found.
root@pve:~# ras-mc-ctl --status
ras-mc-ctl: drivers not loaded.
root@pve:~# ras-mc-ctl --register
ras-mc-ctl: Error: No dimm labels for Micro-Star International Co., Ltd. model B550-A PRO (MS-7C56)
root@pve:~# dmesg | grep -i edac
[ 0.713674] EDAC MC: Ver: 3.0.0
root@pve:~# edac-util
edac-util: Error: No memory controller data found.

The dmesg output shows the same error over and over. I don't know whether I did something wrong while configuring Proxmox or whether my hardware is faulty, so here I am asking for your help.

Here is my current configuration:

CPU: Ryzen 9 5900x
MOBO: MSI B550-A PRO (BIOS 7C56vAI)
RAM: Crucial Pro 4x16GB DDR4 3200 MHz CL22 (had the issue even with 2x16GB) - XMP 2.0 activated in BIOS
GPU: NVIDIA GTX 1060 6GB
PSU: EVGA 750W 80+ Bronze
STORAGE:
- 1x500GB Samsung SSD 870 EVO
- 1x1000GB Samsung SSD 860 EVO
- 1x500GB Western Digital Blue HDD

BIOS CONFIG:

- XMP 2.0
- to be added when I get home

BIOS monitoring shows:

CPU CORE: 1.420V
CPU NB/SOC: 1.004V
CPU VDDP: N/A
CPU 1P8: 1.822V
DRAM: 1.184V
CHIPSET CORE: 1.048V
SYSTEM 12V: 12.216V
SYSTEM 5V: 5.020V

------

What I did so far:

- Ran Memtest86 for 8+ hours; all tests passed with no errors.
- Plugged the server directly into the wall outlet instead of the UPS.
- Installed amd64-microcode.
- Shut down all CTs and VMs.

I'm not 100% sure this is the cause of the random crashes, but the logs show that the last message just before a crash was the infamous "mce: [Hardware Error]".
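
One way to check how regularly the errors appear, and whether they bunch up before a crash, is to extract the timestamps of the "Machine check events logged" lines and look at the gaps between them. This is only a sketch: it assumes the default short journalctl format shown above, the helper names are hypothetical, and the year has to be supplied because that log format does not include one.

```python
import re
from datetime import datetime

# Matches lines such as:
# "Dec 10 12:30:30 pve kernel: mce: [Hardware Error]: Machine check events logged"
MCE_LINE = re.compile(
    r"^(\w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) \S+ kernel: "
    r"mce: \[Hardware Error\]: Machine check events logged"
)


def mce_timestamps(lines, year=2024):
    """Return the timestamp of every 'Machine check events logged' line."""
    stamps = []
    for line in lines:
        match = MCE_LINE.match(line)
        if match:
            stamps.append(
                datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
            )
    return stamps


def gaps_in_minutes(stamps):
    """Return the time between consecutive events, in minutes."""
    return [(b - a).total_seconds() / 60 for a, b in zip(stamps, stamps[1:])]
```

The input could be the lines of `journalctl -k -b -1 --no-pager` (kernel messages from the previous boot), which only works if the journal is stored persistently. If the gaps before the final entry look the same as at any other time, the MCE messages may be unrelated to the crash itself.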
 
UPDATE:

After disabling XMP 2.0 in the BIOS, the error spam seems to be gone. Since the last reboot I have only seen this one occurrence:

Dec 10 19:10:05 pve kernel: microcode: Current revision: 0x0a20120e
Dec 10 19:10:05 pve kernel: mce: [Hardware Error]: Machine check events logged
Dec 10 19:10:05 pve kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 27: d82000000002080b
Dec 10 19:10:05 pve kernel: fbcon: Taking over console
Dec 10 19:10:05 pve kernel: mce: [Hardware Error]: TSC 0 MISC d01200b500000000 SYND 5a020005 IPID 1002e00000500
Dec 10 19:10:05 pve kernel: mce: [Hardware Error]: PROCESSOR 2:a20f12 TIME 1733854200 SOCKET 0 APIC 0 microcode a20120e
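
The Bank 27 status in this boot-time record is almost identical to the MC27_STATUS value from the earlier log, and the SYND and IPID values are the same. A small Python sketch to see exactly which bits differ (the function name is hypothetical):

```python
BEFORE = 0x982000000002080B  # MC27_STATUS logged while XMP was enabled
AFTER = 0xD82000000002080B   # Bank 27 status reported at boot with XMP disabled


def differing_bits(a, b):
    """Return the positions (0 = least significant) where a and b differ."""
    diff = a ^ b
    return [bit for bit in range(64) if (diff >> bit) & 1]


print(differing_bits(BEFORE, AFTER))  # [62]
```

Bit 62 is the overflow flag in the standard x86 MCA status layout, which would mean more than one error hit that bank before the register was read. Since the record shows up in the same second as the microcode line at boot, it may be left over from before the reboot rather than a new error, but that is a guess.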

Is this safe to ignore as long as it doesn't cause any major issues?