[SOLVED] Hardware errors, system reboots on its own

phdelodder · Dec 18, 2019

I'm a new Proxmox user. I installed version 6.0 and upgraded to 6.1. I regularly see these errors in the syslog:

Code:

[Wed Dec 18 07:57:11 2019] mce: [Hardware Error]: Machine check events logged
[Wed Dec 18 07:57:11 2019] [Hardware Error]: Corrected error, no action required.
[Wed Dec 18 07:57:11 2019] [Hardware Error]: CPU:10 (17:71:0) MC0_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc204000000c0135
[Wed Dec 18 07:57:11 2019] [Hardware Error]: Error Addr: 0x00000006b743969c
[Wed Dec 18 07:57:11 2019] [Hardware Error]: IPID: 0x000000b000000000, Syndrome: 0x000000201a1b1a04
[Wed Dec 18 07:57:11 2019] [Hardware Error]: Load Store Unit Ext. Error Code: 12, DC Data error type 1 and poison consumption.
[Wed Dec 18 07:57:11 2019] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
[Wed Dec 18 07:57:11 2019] mce: [Hardware Error]: Machine check events logged
[Wed Dec 18 07:57:11 2019] [Hardware Error]: Corrected error, no action required.
[Wed Dec 18 07:57:11 2019] [Hardware Error]: CPU:4 (17:71:0) MC0_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc204000000d0175
[Wed Dec 18 07:57:11 2019] [Hardware Error]: Error Addr: 0x000000081e8d5e5c
[Wed Dec 18 07:57:11 2019] [Hardware Error]: IPID: 0x000000b000000000, Syndrome: 0x000000201a1b3904
[Wed Dec 18 07:57:11 2019] [Hardware Error]: Load Store Unit Ext. Error Code: 13, DC Data error type 2.
[Wed Dec 18 07:57:11 2019] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: EV

The system also reboots on its own. I don't know what triggers it, but I believe it is related to the hardware errors. The errors are always logged on CPU:4 or CPU:10.
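
For anyone who wants to read the MC0_STATUS values by hand, here is a minimal Python sketch that splits the 64-bit status word into the same fields the kernel prints. The bit positions and the lookup tables are my reading of how the kernel's AMD MCE decoder treats this register; they are an assumption and not something stated in this thread. `decode_status` is a made-up helper name.

```python
# Assumed bit positions in the MCA status word (not taken from this thread).
FLAG_BITS = {
    "Val": 63, "Over": 62, "UC": 61, "En": 60, "MiscV": 59, "AddrV": 58,
    "PCC": 57, "SyndV": 53, "CECC": 46, "UECC": 45, "Deferred": 44, "Poison": 43,
}
# Assumed lookup tables for the low 16-bit error code.
LEVELS = ["RESV", "L1", "L2", "L3/GEN"]
TX_TYPES = ["INSN", "DATA", "GEN", "RESV"]
MEM_TX = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PRF", "EV", "SNP"]


def decode_status(status):
    """Split a 64-bit MCA status value into flags and error-code fields."""
    flags = [name for name, bit in FLAG_BITS.items() if (status >> bit) & 1]
    result = {
        "flags": flags,
        "ext_error_code": (status >> 16) & 0x3F,  # bits 21:16
    }
    err = status & 0xFFFF
    # Memory-hierarchy errors use the layout 0000 0001 RRRR TTLL.
    if (err & 0xFF00) == 0x0100:
        rrrr = (err >> 4) & 0xF
        result["cache_level"] = LEVELS[err & 0x3]
        result["tx"] = TX_TYPES[(err >> 2) & 0x3]
        result["mem_tx"] = MEM_TX[rrrr] if rrrr < len(MEM_TX) else "unknown"
    return result


if __name__ == "__main__":
    # The two status values from the log above.
    for value in (0xDC204000000C0135, 0xDC204000000D0175):
        print(hex(value), decode_status(value))
```

Run against the two values in the log, it reports extended error codes 12 and 13, cache level L1 and transaction type DATA, which matches the kernel's own decoded lines.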

My system has the following configuration:

AMD Ryzen 5 3600 processor
Cooler Master MWE Gold 650 Full Modular PSU
G.Skill DDR4 Ripjaws-V 2x8GB 3200MHz - [F4-3200C16D-16GVKB]
G.Skill DDR4 Ripjaws-V 2x8GB 3200MHz - [F4-3200C16D-16GVGB]
Noctua NH-L9x65 SE-AM4
Sharkoon Case SKILLER SGC1
MSI B450M PRO-VDH MAX
Intel Consumer SSD 660p 512 GB PCI Express 3.0 M.2, SSDPEKNW512G8X1
Radeon HD5450 PCI-E R81KLC DDR3 512MB DVI Video Card AX5450 512MK3-SH.

I have done the following tests:

stress --vm 32 --vm-bytes 1024M -> resulted in a restart and errors in the log
stress --vm 16 --vm-bytes 1024M with the F4-3200C16D-16GVKB kit -> resulted in a restart and errors in the log
stress --vm 16 --vm-bytes 1024M with the F4-3200C16D-16GVGB kit -> resulted in a restart and errors in the log
memtest86 from a bootable USB stick -> no errors found after 4 passes

Note: power consumption was 52-55 W at idle and at most 118 W while running stress.

I also found "https://forum.proxmox.com/threads/proxmox-freezing-on-amd-ryzen-machines.56806/", which suggests adding a few kernel parameters in GRUB. My GRUB options are now:

Code:

quiet rcu_nocbs=0-11 processor.max_cstate=1 iommu=pt amd_iommu=on video=efifb:off
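
After editing the GRUB options it is worth confirming that the running kernel actually booted with them. A small Python sketch, assuming the usual `/proc/cmdline` file is available; `missing_tokens` and `WANTED` are made-up names for illustration:

```python
# The options listed above, copied verbatim.
WANTED = "quiet rcu_nocbs=0-11 processor.max_cstate=1 iommu=pt amd_iommu=on video=efifb:off"


def missing_tokens(cmdline, wanted=WANTED):
    """Return the wanted kernel parameters that are absent from cmdline."""
    active = set(cmdline.split())
    return [token for token in wanted.split() if token not in active]


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as handle:
            running = handle.read()
    except OSError as exc:
        print(f"could not read /proc/cmdline: {exc}")
    else:
        missing = missing_tokens(running)
        print("all options active" if not missing else f"missing: {missing}")
```

An empty result means every option is on the running kernel's command line. It says nothing about whether the options have any effect on the errors.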

Is this something kernel related or do I need to start an RMA procedure with AMD? Need some help on this one!

Stoiko Ivanov · Dec 18, 2019

First thing I would try to do is to upgrade the firmware/bios on the mainboard - This has become more and more important recently and it does sometimes fix errors like these.

I hope this helps!

phdelodder · Dec 18, 2019

Stoiko Ivanov said:
First thing I would try to do is to upgrade the firmware/bios on the mainboard - This has become more and more important recently and it does sometimes fix errors like these.

I hope this helps!

The BIOS is already updated to the latest version I could find: 7A38vB4

It was one of the first things I tried, I just forgot to mention it. So that doesn't solve the problem.

I have already tried kernel 5.0 and kernel 5.3.

phdelodder · Dec 18, 2019

I would like to add that the hardware error messages appear like clockwork, every 5 minutes and 11 seconds. Exactly the same interval each time.

Stoiko Ivanov · Dec 18, 2019

hmm - the periodic thing seems odd - any correlation with some task running at that time ? (anything happening in the journal before the messages show up)?

Not too much experience with mce - but maybe this blog-post might be helpful to get some more information out of the error messages:
https://www.cnx-software.com/2019/07/17/machine-check-exception-mce-errors-linux/

phdelodder · Dec 18, 2019

Stoiko Ivanov said:
hmm - the periodic thing seems odd - any correlation with some task running at that time ? (anything happening in the journal before the messages show up)?

Not too much experience with mce - but maybe this blog-post might be helpful to get some more information out of the error messages:
https://www.cnx-software.com/2019/07/17/machine-check-exception-mce-errors-linux/

Nothing in the syslog correlates with it. I did stop and disable pvesr.timer, but that had no impact on the logging.

rasdaemon errors (last 3):

Code:

220 2019-12-18 12:20:30 +0100 error: Corrected error, no action required., CPU 2, bank Load Store Unit (bank=0), mcg mcgstatus=0, mci Error_overflow CECC, mcgcap=0x0000011c, status=0xdc204000000c0135, addr=0x569e9cb1c, misc=0xd01a025400000000, walltime=0x5dfa0b7e, cpu=0x0000000a, cpuid=0x00870f10, apicid=0x0000000b
221 2019-12-18 12:25:41 +0100 error: Corrected error, no action required., CPU 2, bank Load Store Unit (bank=0), mcg mcgstatus=0, mci Error_overflow CECC, mcgcap=0x0000011c, status=0xdc204000000c0135, addr=0x7354d039c, misc=0xd01a026100000000, walltime=0x5dfa0cb5, cpu=0x0000000a, cpuid=0x00870f10, apicid=0x0000000b
222 2019-12-18 12:30:52 +0100 error: Corrected error, no action required., CPU 2, bank Load Store Unit (bank=0), mcg mcgstatus=0, mci CECC, mcgcap=0x0000011c, status=0x9c204000000c0135, addr=0x7271dab5c, misc=0xd01a026200000000, walltime=0x5dfa0dec, cpu=0x0000000a, cpuid=0x00870f10, apicid=0x0000000b

Snippet from syslog:

Code:

Dec 18 15:30:22 bulldog kernel: [ 5608.491511] mce: [Hardware Error]: Machine check events logged
Dec 18 15:30:22 bulldog kernel: [ 5608.491513] [Hardware Error]: Corrected error, no action required.
Dec 18 15:30:22 bulldog kernel: [ 5608.491518] [Hardware Error]: CPU:4 (17:71:0) MC0_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc204000000c0135
Dec 18 15:30:22 bulldog kernel: [ 5608.491521] [Hardware Error]: Error Addr: 0x00000005d0342adc
Dec 18 15:30:22 bulldog kernel: [ 5608.491522] [Hardware Error]: IPID: 0x000000b000000000, Syndrome: 0x000000201a1b2b01
Dec 18 15:30:22 bulldog kernel: [ 5608.491524] [Hardware Error]: Load Store Unit Ext. Error Code: 12, DC Data error type 1 and poison consumption.
Dec 18 15:30:22 bulldog kernel: [ 5608.491526] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
Dec 18 15:35:33 bulldog kernel: [ 5919.789739] mce: [Hardware Error]: Machine check events logged
Dec 18 15:35:33 bulldog kernel: [ 5919.789740] [Hardware Error]: Corrected error, no action required.
Dec 18 15:35:33 bulldog kernel: [ 5919.789746] [Hardware Error]: CPU:10 (17:71:0) MC0_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0x9c204000000c0135
Dec 18 15:35:33 bulldog kernel: [ 5919.789749] [Hardware Error]: Error Addr: 0x0000000717805d5c
Dec 18 15:35:33 bulldog kernel: [ 5919.789750] [Hardware Error]: IPID: 0x000000b000000000, Syndrome: 0x000000201a1b3504
Dec 18 15:35:33 bulldog kernel: [ 5919.789753] [Hardware Error]: Load Store Unit Ext. Error Code: 12, DC Data error type 1 and poison consumption.
Dec 18 15:35:33 bulldog kernel: [ 5919.789755] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
Dec 18 15:35:33 bulldog kernel: [ 5919.789757] mce: [Hardware Error]: Machine check events logged
Dec 18 15:35:33 bulldog kernel: [ 5919.789758] [Hardware Error]: Corrected error, no action required.
Dec 18 15:35:33 bulldog kernel: [ 5919.789759] [Hardware Error]: CPU:4 (17:71:0) MC0_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc204000000d0175
Dec 18 15:35:33 bulldog kernel: [ 5919.789762] [Hardware Error]: Error Addr: 0x0000000788fa041c
Dec 18 15:35:33 bulldog kernel: [ 5919.789763] [Hardware Error]: IPID: 0x000000b000000000, Syndrome: 0x000000201a1b1000
Dec 18 15:35:33 bulldog kernel: [ 5919.789765] [Hardware Error]: Load Store Unit Ext. Error Code: 13, DC Data error type 2.
Dec 18 15:35:33 bulldog kernel: [ 5919.789767] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: EV
Dec 18 15:40:44 bulldog kernel: [ 6231.087907] mce: [Hardware Error]: Machine check events logged
Dec 18 15:40:44 bulldog kernel: [ 6231.087910] [Hardware Error]: Corrected error, no action required.
Dec 18 15:40:44 bulldog kernel: [ 6231.087914] [Hardware Error]: CPU:10 (17:71:0) MC0_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0x9c204000000c0135
Dec 18 15:40:44 bulldog kernel: [ 6231.087918] [Hardware Error]: Error Addr: 0x00000007eb1b1a1c
Dec 18 15:40:44 bulldog kernel: [ 6231.087919] [Hardware Error]: IPID: 0x000000b000000000, Syndrome: 0x000000201a1b2804
Dec 18 15:40:44 bulldog kernel: [ 6231.087921] [Hardware Error]: Load Store Unit Ext. Error Code: 12, DC Data error type 1 and poison consumption.
Dec 18 15:40:44 bulldog kernel: [ 6231.087924] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
Dec 18 15:40:44 bulldog kernel: [ 6231.087926] mce: [Hardware Error]: Machine check events logged
Dec 18 15:40:44 bulldog kernel: [ 6231.087926] [Hardware Error]: Corrected error, no action required.
Dec 18 15:40:44 bulldog kernel: [ 6231.087928] [Hardware Error]: CPU:4 (17:71:0) MC0_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc204000000d0175
Dec 18 15:40:44 bulldog kernel: [ 6231.087931] [Hardware Error]: Error Addr: 0x00000004f635801c
Dec 18 15:40:44 bulldog kernel: [ 6231.087932] [Hardware Error]: IPID: 0x000000b000000000, Syndrome: 0x000000201a1b0001
Dec 18 15:40:44 bulldog kernel: [ 6231.087934] [Hardware Error]: Load Store Unit Ext. Error Code: 13, DC Data error type 2.
Dec 18 15:40:44 bulldog kernel: [ 6231.087935] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: EV
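
The spacing between bursts can be checked from the kernel uptime stamps in the snippet above. A minimal Python sketch; the three sample lines are the first "Machine check events logged" line of each burst, and the helper names are made up for illustration:

```python
import re

# First line of each burst, copied from the syslog snippet above.
LINES = [
    "Dec 18 15:30:22 bulldog kernel: [ 5608.491511] mce: [Hardware Error]: Machine check events logged",
    "Dec 18 15:35:33 bulldog kernel: [ 5919.789739] mce: [Hardware Error]: Machine check events logged",
    "Dec 18 15:40:44 bulldog kernel: [ 6231.087907] mce: [Hardware Error]: Machine check events logged",
]

# Matches the bracketed uptime stamp, e.g. "[ 5608.491511]".
STAMP = re.compile(r"\[\s*(\d+\.\d+)\]")


def uptime_stamps(lines):
    """Extract the kernel uptime (seconds since boot) from each log line."""
    return [float(STAMP.search(line).group(1)) for line in lines]


def intervals(stamps):
    """Return the gaps between consecutive stamps, in seconds."""
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    for gap in intervals(uptime_stamps(LINES)):
        print(f"{gap:.3f} s")
```

Both gaps come out at roughly 311.3 seconds, which is the 5 minutes and 11 seconds mentioned earlier.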

phdelodder · Dec 19, 2019

The system is now completely unstable with Proxmox. With a Debian live USB, however, it doesn't reboot.

With an Ubuntu live USB, the system reboots when it goes to sleep.

Disabling C-states in the BIOS didn't help.

phdelodder · Dec 19, 2019

I received feedback from the supplier: the errors indicate instruction errors and a defect in the CPU's cache memory. AMD approved the RMA, so hopefully I'm back in business next year.
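
A defect in one core's L1 cache would fit the errors showing up only on CPU:4 and CPU:10, provided those two logical CPUs are SMT siblings on the same physical core. That was not verified in this thread, so treat it as an assumption. A short Python sketch that reads the sibling list from sysfs to check it; the helper names are made up:

```python
from pathlib import Path


def parse_cpu_list(text):
    """Expand a sysfs CPU list such as '4,10' or '0-2,8' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def thread_siblings(cpu, sysfs_root="/sys/devices/system/cpu"):
    """Return the logical CPUs that share a physical core with `cpu`."""
    path = Path(sysfs_root) / f"cpu{cpu}" / "topology" / "thread_siblings_list"
    return parse_cpu_list(path.read_text())


if __name__ == "__main__":
    try:
        print("siblings of CPU 4:", sorted(thread_siblings(4)))
    except OSError as exc:
        print(f"could not read topology from sysfs: {exc}")
```

If the output for CPU 4 includes 10, both error sources map to a single physical core.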

Stoiko Ivanov · Dec 19, 2019

Hm - sad for the downtime - but glad that the error-messages indeed hit the spot and that AMD quickly provides RMA.

Please mark the thread as 'SOLVED'
Thanks!

phdelodder · Jan 9, 2020

The new CPU has solved it: uptime of more than a day!
