Hardware Errors at Regular Intervals

PastramiKing

New Member
Apr 29, 2023
I am having a problem with the following hardware and software configuration: a TrueNAS SCALE VM running on Proxmox 7.4 (kernel 6.2.11-1-pve) with an ASRock X570D4U-2L2T and an AMD Ryzen 5950X (for qm config output, see below).

I receive errors (mce: [Hardware Error]: Machine check events logged) only while the VM is running, and they arrive at alternating intervals. Going by the dmesg timestamps below, one error follows the previous one after 31 minutes, 8 seconds; the next after 36 minutes, 19 seconds; then 31 minutes, 8 seconds again; then 36 minutes, 19 seconds; and so on until I shut down the VM and/or the host (for targeted dmesg output, see below). I have tried to use rasdaemon v0.6.6 (from Debian, deb http://ftp.us.debian.org/debian bullseye) to diagnose the problem; however, not only does it provide no details, it does not capture any error at all (for ras-mc-ctl --summary output, see below).

I am a novice, so if you need more information to assist, do not hesitate to ask.

Best regards.

Code:
root@pve:~# qm config 100
agent: 1
balloon: 0
bios: ovmf
boot: order=scsi0;net0
cores: 24
cpu: host
efidisk0: local-zfs:vm-100-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
hostpci0: 0000:3b:00,pcie=1
hostpci1: 0000:31:00,pcie=1
hostpci2: 0000:32:00,pcie=1
hostpci3: 0000:33:00,pcie=1
hostpci4: 0000:34:00,pcie=1
hostpci5: 0000:37:00,pcie=1
hostpci6: 0000:38:00,pcie=1
hostpci7: 0000:39:00,pcie=1
hostpci8: 0000:3a:00,pcie=1
machine: q35
memory: 96537
meta: creation-qemu=7.2.0,ctime=1683155962
name: Giattino-TrueNas
net0: virtio=6A:6D:DB:D4:5D:70,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: local-zfs:vm-100-disk-1,iothread=1,size=64G,ssd=1
scsi1: local-zfs:vm-100-disk-2,iothread=1,size=468750173K,ssd=1
scsi2: /dev/disk/by-id/ata-WDC_WUH721816ALE6L1_2BJBWESN,size=14902G
scsi3: /dev/disk/by-id/ata-WDC_WUH721816ALE6L1_2CJWEYTN,size=14902G
scsi4: /dev/disk/by-id/ata-WDC_WUH721816ALE6L1_2CJV2SZN,size=14902G
scsi5: /dev/disk/by-id/ata-WDC_WUH721816ALE6L1_2CKN18JJ,size=14902G
scsi6: /dev/disk/by-id/ata-WDC_WUH721816ALE6L1_2BJD5YHN,size=14902G
scsi7: /dev/disk/by-id/ata-WDC_WUH721816ALE6L1_2CKXKNHJ,size=14902G
scsihw: virtio-scsi-single
smbios1: uuid=cccdfe84-f983-430d-9866-c033a10725e1
sockets: 1
vmgenid: 0d709e9e-a0f7-43bf-b404-c8708aab5fdd

Code:
root@pve:~# dmesg -T | grep -i mce
[Wed May 10 07:29:55 2023] MCE: In-kernel MCE decoding enabled.
[Wed May 10 08:06:13 2023] mce: [Hardware Error]: Machine check events logged
[Wed May 10 08:37:21 2023] mce: [Hardware Error]: Machine check events logged
[Wed May 10 09:13:40 2023] mce: [Hardware Error]: Machine check events logged
[Wed May 10 09:44:48 2023] mce: [Hardware Error]: Machine check events logged
[Wed May 10 10:21:07 2023] mce: [Hardware Error]: Machine check events logged
[Wed May 10 10:52:15 2023] mce: [Hardware Error]: Machine check events logged

Code:
root@pve:~# ras-mc-ctl --summary
No Memory errors.

No PCIe AER errors.

No Extlog errors.

No devlink errors.

No disk errors.

No MCE errors.

Code:
root@pve:~# ras-mc-ctl --status
ras-mc-ctl: drivers are loaded.
root@pve:~# systemctl status rasdaemon.service
● rasdaemon.service - RAS daemon to log the RAS events
     Loaded: loaded (/lib/systemd/system/rasdaemon.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2023-05-10 07:29:57 EDT; 3h 36min ago
    Process: 2587 ExecStartPost=/usr/sbin/rasdaemon --enable (code=exited, status=0/SUCCESS)
   Main PID: 2582 (rasdaemon)
      Tasks: 1 (limit: 154393)
     Memory: 15.2M
        CPU: 33ms
     CGroup: /system.slice/rasdaemon.service
             └─2582 /usr/sbin/rasdaemon -f -r


May 10 07:29:57 pve rasdaemon[2582]: Enabled event ras:arm_event
May 10 07:29:57 pve rasdaemon[2582]: mce:mce_record event enabled
May 10 07:29:57 pve rasdaemon[2582]: Enabled event mce:mce_record
May 10 07:29:57 pve rasdaemon[2582]: ras:extlog_mem_event event enabled
May 10 07:29:57 pve rasdaemon[2582]: Enabled event ras:extlog_mem_event
May 10 07:29:57 pve rasdaemon[2582]: rasdaemon: Recording mc_event events
May 10 07:29:57 pve rasdaemon[2582]: rasdaemon: Recording aer_event events
May 10 07:29:57 pve rasdaemon[2582]: rasdaemon: Recording extlog_event events
May 10 07:29:57 pve rasdaemon[2582]: rasdaemon: Recording mce_record events
May 10 07:29:57 pve rasdaemon[2582]: rasdaemon: Recording arm_event events

Code:
root@pve:~# systemctl status ras-mc-ctl
● ras-mc-ctl.service - Initialize EDAC v3.0.0 Drivers For Machine Hardware
     Loaded: loaded (/lib/systemd/system/ras-mc-ctl.service; enabled; vendor preset: enabled)
     Active: active (exited) since Wed 2023-05-10 07:29:57 EDT; 3h 37min ago
    Process: 2574 ExecStart=/usr/sbin/ras-mc-ctl --register-labels (code=exited, status=0/SUCCESS)
   Main PID: 2574 (code=exited, status=0/SUCCESS)
        CPU: 20ms


May 10 07:29:57 pve systemd[1]: Starting Initialize EDAC v3.0.0 Drivers For Machine Hardware...
May 10 07:29:57 pve ras-mc-ctl[2574]: ras-mc-ctl: Error: No dimm labels for ASRockRack model X570D4U-2L2T
May 10 07:29:57 pve systemd[1]: Finished Initialize EDAC v3.0.0 Drivers For Machine Hardware.
 
From the mcelog website [1]: "All errors are logged to /var/log/mcelog or syslog or the journal." Can you attach the output of these logs at the timestamps when you get the mce errors?

[1]: http://mcelog.org/
 
From Debian 10 onwards the mcelog package was replaced by rasdaemon, so there is no mcelog package for Debian 11. I have the same problem. Now I have to upgrade to Proxmox 8, which is based on Debian 12, to see whether the problem is fixed and whether the logs are written correctly, because at the moment the kernel reports MCE errors but rasdaemon does not report anything abnormal.
 
Hello,

Code:
mce: [Hardware Error]: Machine check events logged

I am facing the same issue as above.
Proxmox version:
Code:
proxmox-ve: 7.3-1 (running kernel: 5.15.83-1-pve)
pve-manager: 7.3-4 (running version: 7.3-4/d69b70d4)
pve-kernel-helper: 7.3-3
pve-kernel-5.15: 7.3-1
pve-kernel-5.4: 6.4-20
pve-kernel-5.15.83-1-pve: 5.15.83-1
pve-kernel-5.4.203-1-pve: 5.4.203-1
ceph-fuse: 14.2.21-1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.3
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.3-1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-2
libpve-guest-common-perl: 4.2-3
libpve-http-server-perl: 4.1-5
libpve-storage-perl: 7.3-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-1
lxcfs: 5.0.3-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.3.2-1
proxmox-backup-file-restore: 2.3.2-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.0-1
proxmox-widget-toolkit: 3.5.3
pve-cluster: 7.3-2
pve-container: 4.4-2
pve-docs: 7.3-1
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-7
pve-firmware: 3.6-3
pve-ha-manager: 3.5.1
pve-i18n: 2.8-2
pve-qemu-kvm: 7.1.0-4
pve-xtermjs: 4.16.0-1
qemu-server: 7.3-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+2
vncterm: 1.7-1
zfsutils-linux: 2.1.9-pve1

And regarding this reply:
"I have the same problem, now I have to upgrade to proxmox 8 which contains Debian 12 the problem is fixed and if the logs are written correctly, because at the moment the kernel reports mce errors but rasdaemon does not report anything abnormal"
Could you please explain more?
"I have the same problem"
So it still exists?
"now I have to upgrade to proxmox 8 which contains Debian 12"
Did you update or not?
"the problem is fixed and if the logs are written correctly,"
Can you confirm it is fixed?
 
Hi,

I have the same errors in my Proxmox journalctl output. Sometimes they appear every 2 hours, sometimes it is quiet for 24 hours:

Example:
Sep 18 08:47:08 pve2 kernel: mce: [Hardware Error]: Machine check events logged
Sep 18 08:47:08 pve2 kernel: [Hardware Error]: Corrected error, no action required.
Sep 18 08:47:08 pve2 kernel: [Hardware Error]: CPU:15 (17:60:1) MC3_STATUS[-|CE|MiscV|-|-|-|SyndV|-|-|-]: 0x9820000000010150
Sep 18 08:47:08 pve2 kernel: [Hardware Error]: IPID: 0x000300b000000000, Syndrome: 0x000000001a000601
Sep 18 08:47:08 pve2 kernel: [Hardware Error]: Decode Unit Ext. Error Code: 1, Micro-op cache data parity error.
Sep 18 08:47:08 pve2 kernel: [Hardware Error]: cache level: RESV, tx: INSN, mem-tx: IRD

Sep 18 10:46:28 pve2 kernel: mce: [Hardware Error]: Machine check events logged
Sep 18 10:46:28 pve2 kernel: [Hardware Error]: Corrected error, no action required.
Sep 18 10:46:28 pve2 kernel: [Hardware Error]: CPU:14 (17:60:1) MC3_STATUS[-|CE|MiscV|-|-|-|SyndV|-|-|-]: 0x9820000000010150
Sep 18 10:46:28 pve2 kernel: [Hardware Error]: IPID: 0x000300b000000000, Syndrome: 0x000000001a000405
Sep 18 10:46:28 pve2 kernel: [Hardware Error]: Decode Unit Ext. Error Code: 1, Micro-op cache data parity error.
Sep 18 10:46:28 pve2 kernel: [Hardware Error]: cache level: RESV, tx: INSN, mem-tx: IRD

pveversion:
pve-manager/8.0.4/d258a813cfa6b390 (running kernel: 6.2.16-10-pve)

I have three VMs running: Home Assistant, Windows, and Ubuntu Server. The last one crashes from time to time. I haven't checked its logs so far.

Is it a hardware problem? VM problem? Any advice/hint is appreciated.
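
For anyone reading the raw value in a log like this, here is a rough Python sketch that unpacks the `MC3_STATUS` word `0x9820000000010150` from the example above. It is only an illustration: the bit positions are an assumption taken from the Linux kernel's `MCI_STATUS_*` definitions and the 6-bit extended error code field used for AMD scalable MCA banks, and the names `STATUS_FLAGS` and `decode_status` are made up here. AMD's processor documentation is the authoritative source for the register layout.

```python
# Assumed bit positions, following the Linux kernel's MCI_STATUS_* definitions.
STATUS_FLAGS = {
    "Val": 63,       # register holds a valid error
    "Over": 62,      # an earlier error was overwritten
    "UC": 61,        # uncorrected error
    "En": 60,        # error reporting enabled for this bank
    "MiscV": 59,     # MISC register is valid
    "AddrV": 58,     # ADDR register is valid
    "PCC": 57,       # processor context corrupt
    "TCC": 55,       # task context corrupt
    "SyndV": 53,     # syndrome register is valid
    "Deferred": 44,  # deferred error
    "Poison": 43,    # poisoned data consumed
    "Scrub": 40,     # error found by the scrubber
}


def decode_status(status):
    """Split a 64-bit MCA status value into flags and error-code fields."""
    flags = [name for name, bit in STATUS_FLAGS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        # Corrected means neither the UC nor the Deferred bit is set.
        "corrected": "UC" not in flags and "Deferred" not in flags,
        # Extended error code: bits 21:16 (assumed 6-bit field).
        "ext_error_code": (status >> 16) & 0x3F,
        # Architectural error code: low 16 bits.
        "error_code": status & 0xFFFF,
    }


if __name__ == "__main__":
    result = decode_status(0x9820000000010150)
    print(result["flags"])
    print("corrected:", result["corrected"])
    print("ext_error_code:", result["ext_error_code"])
    print("error_code:", hex(result["error_code"]))
```

Under these assumptions the value decodes to the flags Val, En, MiscV and SyndV with UC clear, and an extended error code of 1, which agrees with the "Corrected error" and "Ext. Error Code: 1" lines the kernel printed. It says nothing by itself about whether the CPU, the firmware, or the workload is at fault.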
 
