I am having a problem with the following combination of software and hardware: a TrueNAS SCALE VM running on Proxmox 7.4 (kernel 6.2.11-1-pve), on an ASRock X570D4U-2L2T with an AMD Ryzen 5950X (for qm config output, see below).
I receive errors (mce: [Hardware Error]: Machine check events logged) only while the VM is running. They recur at alternating intervals: one error arrives about 31 minutes, 8 seconds after the previous one, the next about 36 minutes, 19 seconds later, then 31 minutes, 8 seconds again, and so on until I shut down the VM and/or the host (for targeted dmesg output, see below). I have tried to use rasdaemon v0.6.6 (from Debian, deb http://ftp.us.debian.org/debian bullseye) to diagnose the problem; however, it neither provides any details nor captures any error at all (for ras-mc-ctl --summary output, see below).
I am a novice, so if you need more information to assist, do not hesitate to ask.
Best regards.
Code:
root@pve:~# qm config 100
agent: 1
balloon: 0
bios: ovmf
boot: order=scsi0;net0
cores: 24
cpu: host
efidisk0: local-zfs:vm-100-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
hostpci0: 0000:3b:00,pcie=1
hostpci1: 0000:31:00,pcie=1
hostpci2: 0000:32:00,pcie=1
hostpci3: 0000:33:00,pcie=1
hostpci4: 0000:34:00,pcie=1
hostpci5: 0000:37:00,pcie=1
hostpci6: 0000:38:00,pcie=1
hostpci7: 0000:39:00,pcie=1
hostpci8: 0000:3a:00,pcie=1
machine: q35
memory: 96537
meta: creation-qemu=7.2.0,ctime=1683155962
name: Giattino-TrueNas
net0: virtio=6A:6D:DB:D4:5D:70,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: local-zfs:vm-100-disk-1,iothread=1,size=64G,ssd=1
scsi1: local-zfs:vm-100-disk-2,iothread=1,size=468750173K,ssd=1
scsi2: /dev/disk/by-id/ata-WDC_WUH721816ALE6L1_2BJBWESN,size=14902G
scsi3: /dev/disk/by-id/ata-WDC_WUH721816ALE6L1_2CJWEYTN,size=14902G
scsi4: /dev/disk/by-id/ata-WDC_WUH721816ALE6L1_2CJV2SZN,size=14902G
scsi5: /dev/disk/by-id/ata-WDC_WUH721816ALE6L1_2CKN18JJ,size=14902G
scsi6: /dev/disk/by-id/ata-WDC_WUH721816ALE6L1_2BJD5YHN,size=14902G
scsi7: /dev/disk/by-id/ata-WDC_WUH721816ALE6L1_2CKXKNHJ,size=14902G
scsihw: virtio-scsi-single
smbios1: uuid=cccdfe84-f983-430d-9866-c033a10725e1
sockets: 1
vmgenid: 0d709e9e-a0f7-43bf-b404-c8708aab5fdd
Code:
root@pve:~# dmesg -T | grep -i mce
[Wed May 10 07:29:55 2023] MCE: In-kernel MCE decoding enabled.
[Wed May 10 08:06:13 2023] mce: [Hardware Error]: Machine check events logged
[Wed May 10 08:37:21 2023] mce: [Hardware Error]: Machine check events logged
[Wed May 10 09:13:40 2023] mce: [Hardware Error]: Machine check events logged
[Wed May 10 09:44:48 2023] mce: [Hardware Error]: Machine check events logged
[Wed May 10 10:21:07 2023] mce: [Hardware Error]: Machine check events logged
[Wed May 10 10:52:15 2023] mce: [Hardware Error]: Machine check events logged
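To check the spacing of the errors, the timestamps in the dmesg output above can be subtracted pairwise. The following is a minimal Python sketch of that arithmetic; the helper names (parse_timestamp, intervals, fmt) are made up for illustration, and it assumes the wall-clock times printed by dmesg -T are accurate, which is not guaranteed on every system.

```python
from datetime import datetime

# The six error lines copied verbatim from the dmesg output above.
DMESG_LINES = """\
[Wed May 10 08:06:13 2023] mce: [Hardware Error]: Machine check events logged
[Wed May 10 08:37:21 2023] mce: [Hardware Error]: Machine check events logged
[Wed May 10 09:13:40 2023] mce: [Hardware Error]: Machine check events logged
[Wed May 10 09:44:48 2023] mce: [Hardware Error]: Machine check events logged
[Wed May 10 10:21:07 2023] mce: [Hardware Error]: Machine check events logged
[Wed May 10 10:52:15 2023] mce: [Hardware Error]: Machine check events logged
""".splitlines()


def parse_timestamp(line):
    """Extract the bracketed 'dmesg -T' timestamp from one log line."""
    stamp = line[1:line.index("]")]
    return datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")


def intervals(lines):
    """Return the time gaps between consecutive machine check messages."""
    times = [parse_timestamp(l) for l in lines
             if "Machine check events logged" in l]
    return [later - earlier for earlier, later in zip(times, times[1:])]


def fmt(delta):
    """Format a timedelta as minutes and seconds."""
    minutes, seconds = divmod(int(delta.total_seconds()), 60)
    return f"{minutes} min {seconds:02d} s"


gaps = [fmt(d) for d in intervals(DMESG_LINES)]
print(gaps)
# ['31 min 08 s', '36 min 19 s', '31 min 08 s', '36 min 19 s', '31 min 08 s']
```

For this log the gaps alternate strictly between the two values, and each pair adds up to 67 minutes, 27 seconds.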
Code:
root@pve:~# ras-mc-ctl --summary
No Memory errors.
No PCIe AER errors.
No Extlog errors.
No devlink errors.
No disk errors.
No MCE errors.
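Since the summary reports nothing, one way to double-check is to look at the SQLite database that rasdaemon records into and count the rows in each table. The sketch below is only an illustration: the path /var/lib/rasdaemon/ras-mc_event.db is my assumption about the default location on Debian and may differ on other builds, and count_rows is a made-up helper name. It does not assume any table names; it lists whatever tables the database contains.

```python
import os
import sqlite3

# Assumed default location of the rasdaemon event database (may differ).
DEFAULT_DB = "/var/lib/rasdaemon/ras-mc_event.db"


def count_rows(db_path=DEFAULT_DB):
    """Return {table_name: row_count} for every table in the database."""
    conn = sqlite3.connect(db_path)
    try:
        names = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        conn.close()


if __name__ == "__main__":
    # Only open the file if it exists, so no empty database gets created.
    if os.path.exists(DEFAULT_DB):
        for table, rows in count_rows().items():
            print(f"{table}: {rows} row(s)")
    else:
        print(f"{DEFAULT_DB} not found; adjust DEFAULT_DB for this system")
```

If every table shows zero rows while dmesg keeps logging machine check events, that would be consistent with rasdaemon not receiving the events at all, rather than ras-mc-ctl failing to display them.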
Code:
root@pve:~# ras-mc-ctl --status
ras-mc-ctl: drivers are loaded.
Code:
root@pve:~# systemctl status rasdaemon.service
● rasdaemon.service - RAS daemon to log the RAS events
Loaded: loaded (/lib/systemd/system/rasdaemon.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2023-05-10 07:29:57 EDT; 3h 36min ago
Process: 2587 ExecStartPost=/usr/sbin/rasdaemon --enable (code=exited, status=0/SUCCESS)
Main PID: 2582 (rasdaemon)
Tasks: 1 (limit: 154393)
Memory: 15.2M
CPU: 33ms
CGroup: /system.slice/rasdaemon.service
└─2582 /usr/sbin/rasdaemon -f -r
May 10 07:29:57 pve rasdaemon[2582]: Enabled event ras:arm_event
May 10 07:29:57 pve rasdaemon[2582]: mce:mce_record event enabled
May 10 07:29:57 pve rasdaemon[2582]: Enabled event mce:mce_record
May 10 07:29:57 pve rasdaemon[2582]: ras:extlog_mem_event event enabled
May 10 07:29:57 pve rasdaemon[2582]: Enabled event ras:extlog_mem_event
May 10 07:29:57 pve rasdaemon[2582]: rasdaemon: Recording mc_event events
May 10 07:29:57 pve rasdaemon[2582]: rasdaemon: Recording aer_event events
May 10 07:29:57 pve rasdaemon[2582]: rasdaemon: Recording extlog_event events
May 10 07:29:57 pve rasdaemon[2582]: rasdaemon: Recording mce_record events
May 10 07:29:57 pve rasdaemon[2582]: rasdaemon: Recording arm_event events
Code:
root@pve:~# systemctl status ras-mc-ctl
● ras-mc-ctl.service - Initialize EDAC v3.0.0 Drivers For Machine Hardware
Loaded: loaded (/lib/systemd/system/ras-mc-ctl.service; enabled; vendor preset: enabled)
Active: active (exited) since Wed 2023-05-10 07:29:57 EDT; 3h 37min ago
Process: 2574 ExecStart=/usr/sbin/ras-mc-ctl --register-labels (code=exited, status=0/SUCCESS)
Main PID: 2574 (code=exited, status=0/SUCCESS)
CPU: 20ms
May 10 07:29:57 pve systemd[1]: Starting Initialize EDAC v3.0.0 Drivers For Machine Hardware...
May 10 07:29:57 pve ras-mc-ctl[2574]: ras-mc-ctl: Error: No dimm labels for ASRockRack model X570D4U-2L2T
May 10 07:29:57 pve systemd[1]: Finished Initialize EDAC v3.0.0 Drivers For Machine Hardware.