My current issue is that sometimes my PVE Host randomly crashes and needs a physical reboot.
Every 10-15 minutes my journalctl logs show this "mce: [Hardware Error]":
Dec 10 12:30:30 pve rasdaemon[9885]: rasdaemon: mce_record store: 0x785480021d58
Dec 10 12:30:30 pve kernel: mce: [Hardware Error]: Machine check events logged
Dec 10 12:30:30 pve kernel: [Hardware Error]: Corrected error, no action required.
Dec 10 12:30:30 pve kernel: [Hardware Error]: CPU:0 (19:21:2) MC27_STATUS[-|CE|MiscV|-|-|-|SyndV|-|-|-]: 0x982000000002080b
Dec 10 12:30:30 pve kernel: [Hardware Error]: IPID: 0x0001002e00000500, Syndrome: 0x000000005a020005
Dec 10 12:30:30 pve kernel: [Hardware Error]: Power, Interrupts, etc. Ext. Error Code: 2
Dec 10 12:30:30 pve kernel: [Hardware Error]: cache level: L3/GEN, mem/io: IO, mem-tx: GEN, part-proc: SRC (no timeout)
Dec 10 12:30:30 pve kernel: rasdaemon[9887]: segfault at 0 ip 000078552b40ef05 sp 000078552affbbb0 error 4 in libsqlite3.so.0.8.6[78552b348000+f4000] likely on CPU 6 (core 8, socket 0)
Dec 10 12:30:30 pve kernel: Code: f7 47 14 00 90 0f 85 42 33 00 00 41 bf 04 00 00 00 66 44 89 7f 14 48 8b 44 24 08 41 bc 08 00 00 00 66 44 89 67 14 48 8b 40 10 <f2> 0f 10 00 f2 0f 11 07 e9 f6 d9 ff ff 48 8b 44 24 08 4c 89 ef 48
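For anyone who wants to double-check the kernel's decoding of the MC27_STATUS value above, here is a rough Python sketch. The bit positions are my assumptions, taken from the generic x86 MCA status layout and the AMD bus-error code format (0000 1PPT RRRR IILL); the function name decode_status and the lookup tables are made up for illustration, so treat it as a sanity check and not an authoritative decoder.

```python
# Sketch: decode a few fields of an MCA status value as printed by the kernel.
# ASSUMPTION: bit positions follow the generic x86 MCA layout plus the AMD
# "SyndV" bit (53) and the AMD bus-error code format 0000 1PPT RRRR IILL.
STATUS_FLAGS = {
    63: "Val",    # register holds a valid error
    62: "Over",   # an earlier error was overwritten
    61: "UC",     # uncorrected error (clear means corrected)
    60: "En",     # error reporting enabled
    59: "MiscV",  # MISC register valid
    58: "AddrV",  # ADDR register valid
    57: "PCC",    # processor context corrupt
    53: "SyndV",  # syndrome register valid
}
CACHE_LEVEL = ["RESV", "L1", "L2", "L3/GEN"]   # LL field
MEM_IO = ["MEM", "RESV", "IO", "GEN"]          # II field
PARTICIPATION = ["SRC", "RES", "OBS", "GEN"]   # PP field


def decode_status(status: int) -> dict:
    """Return the flag names and error-code fields found in `status`."""
    flags = [name for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    code = status & 0xFFFF
    result = {
        "flags": flags,
        "corrected": "UC" not in flags,
        "ext_error_code": (status >> 16) & 0x3F,
        "is_bus_error": (code & 0xF800) == 0x0800,
    }
    if result["is_bus_error"]:
        result["cache_level"] = CACHE_LEVEL[code & 0x3]
        result["mem_io"] = MEM_IO[(code >> 2) & 0x3]
        result["participation"] = PARTICIPATION[(code >> 9) & 0x3]
        result["timeout"] = bool((code >> 8) & 1)
    return result


if __name__ == "__main__":
    # Value copied from the MC27_STATUS log line above.
    for key, value in decode_status(0x982000000002080B).items():
        print(f"{key}: {value}")
```

Run against the value from my log, it reports a corrected bus error with extended error code 2, cache level L3/GEN, IO, SRC and no timeout, which is consistent with the lines the kernel printed.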
The crashes have never happened while I was using the host; I only ever notice them when I'm away from home. The PVE installation is pretty new, and it's also my first time trying Proxmox.
I passed through an NVIDIA GTX 1060 GPU to a Win10 VM for cloud gaming, and an HDD to an OpenMediaVault VM to use as an SMB/NFS share between my VMs and CTs.
Most of my VMs and CTs were deployed using Proxmox Helper Scripts, if that helps.
ras-mc-ctl --summary shows no errors:
root@pve:~# ras-mc-ctl --summary
No Memory errors.
No PCIe AER errors.
No Extlog errors.
No MCE errors.
No dimm labels for Micro-Star International Co., Ltd. model B550-A PRO (MS-7C56)
root@pve:~# systemctl status ras-mc-ctl.service
● ras-mc-ctl.service - Initialize EDAC v3.0.0 Drivers For Machine Hardware
Loaded: loaded (/lib/systemd/system/ras-mc-ctl.service; enabled; preset: enabled)
Active: active (exited) since Tue 2024-12-10 12:14:08 CET; 3h 9min ago
Process: 932 ExecStart=/usr/sbin/ras-mc-ctl --register-labels (code=exited, status=0/SUCCESS)
Main PID: 932 (code=exited, status=0/SUCCESS)
CPU: 23ms
Dec 10 12:14:08 pve systemd[1]: Starting ras-mc-ctl.service - Initialize EDAC v3.0.0 Drivers For Machine Hardware...
Dec 10 12:14:08 pve ras-mc-ctl[932]: ras-mc-ctl: Error: No dimm labels for Micro-Star International Co., Ltd. model B550-A PRO (MS-7C56)
Dec 10 12:14:08 pve systemd[1]: Finished ras-mc-ctl.service - Initialize EDAC v3.0.0 Drivers For Machine Hardware.
edac-util: Error: No memory controller data found.
root@pve:~# ras-mc-ctl --status
ras-mc-ctl: drivers not loaded.
root@pve:~# ras-mc-ctl --register
ras-mc-ctl: Error: No dimm labels for Micro-Star International Co., Ltd. model B550-A PRO (MS-7C56)
root@pve:~# dmesg | grep -i edac
[ 0.713674] EDAC MC: Ver: 3.0.0
root@pve:~# edac-util
edac-util: Error: No memory controller data found.
The dmesg output shows that same error over and over again. I don't know if I did something wrong while configuring Proxmox or if my hardware is faulty, so here I am asking for your help.
Here is my current configuration:
CPU: Ryzen 9 5900x
MOBO: MSI B550-A PRO (BIOS 7C56vAI)
RAM: Crucial Pro 4x16GB DDR4 3200 MHz CL22 (had the issue even with 2x16GB) - XMP 2.0 activated in BIOS
GPU: NVIDIA GTX 1060 6GB
PSU: EVGA 750w 80+ bronze
STORAGE:
- 1x500GB Samsung SSD 870 EVO
- 1x1000GB Samsung Memorie SSD 860 EVO
- 1x500GB Western Digital Blue HDD
BIOS CONFIG:
- XMP 2.0
- to be added when I get home
BIOS monitoring shows:
CPU CORE: 1.420V
CPU NB/SOC: 1.004V
CPU VDDP: N/A
CPU 1P8: 1.822V
DRAM: 1.184V
CHIPSET CORE: 1.048V
SYSTEM 12V: 12.216V
SYSTEM 5V: 5.020V
------
What I did so far:
- Ran Memtest86 for 8+ hours; all tests passed with no errors
- Plugged the server directly into AC power instead of the UPS
- Installed amd64-microcode
- Turned off all CTs and VMs
I'm not 100% sure this is the cause of the random crashes, but the logs show that the last message just before each crash was the infamous "mce: [Hardware Error]".