[SOLVED] Hardware Errors, system reboots on its own

phdelodder

Member
Dec 18, 2019
I'm a new Proxmox user. I installed version 6.0 and upgraded to 6.1. I regularly receive these errors in the syslog:

Code:
[Wed Dec 18 07:57:11 2019] mce: [Hardware Error]: Machine check events logged
[Wed Dec 18 07:57:11 2019] [Hardware Error]: Corrected error, no action required.
[Wed Dec 18 07:57:11 2019] [Hardware Error]: CPU:10 (17:71:0) MC0_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc204000000c0135
[Wed Dec 18 07:57:11 2019] [Hardware Error]: Error Addr: 0x00000006b743969c
[Wed Dec 18 07:57:11 2019] [Hardware Error]: IPID: 0x000000b000000000, Syndrome: 0x000000201a1b1a04
[Wed Dec 18 07:57:11 2019] [Hardware Error]: Load Store Unit Ext. Error Code: 12, DC Data error type 1 and poison consumption.
[Wed Dec 18 07:57:11 2019] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
[Wed Dec 18 07:57:11 2019] mce: [Hardware Error]: Machine check events logged
[Wed Dec 18 07:57:11 2019] [Hardware Error]: Corrected error, no action required.
[Wed Dec 18 07:57:11 2019] [Hardware Error]: CPU:4 (17:71:0) MC0_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc204000000d0175
[Wed Dec 18 07:57:11 2019] [Hardware Error]: Error Addr: 0x000000081e8d5e5c
[Wed Dec 18 07:57:11 2019] [Hardware Error]: IPID: 0x000000b000000000, Syndrome: 0x000000201a1b3904
[Wed Dec 18 07:57:11 2019] [Hardware Error]: Load Store Unit Ext. Error Code: 13, DC Data error type 2.
[Wed Dec 18 07:57:11 2019] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: EV

The system also reboots on its own. I don't know what triggers it, but I believe it's related to the hardware errors. The errors are always logged on CPU:4 and CPU:10.
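
For anyone reading along, the MC0_STATUS words in the log above can be decoded by hand. Below is a minimal Python sketch. The bit positions are my understanding of the Linux kernel's MCI_STATUS_* definitions for AMD SMCA systems, and the 6-bit extended error code field at bits 21:16 is likewise an assumption; check both against the kernel source or AMD's documentation before relying on them. `decode_mc_status` is a hypothetical helper name.

```python
# Assumed bit positions (Linux MCI_STATUS_* for AMD SMCA); verify before relying on them.
FLAG_BITS = {
    "Val": 63, "Over": 62, "UC": 61, "En": 60, "MiscV": 59, "AddrV": 58,
    "PCC": 57, "TCC": 55, "SyndV": 53, "CECC": 46, "UECC": 45,
    "Deferred": 44, "Poison": 43, "Scrub": 40,
}

def decode_mc_status(status: int) -> dict:
    """Split a 64-bit MCA status word into set flags and error-code fields."""
    flags = [name for name, bit in FLAG_BITS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        "ext_error_code": (status >> 16) & 0x3F,  # assumed field: bits 21:16
        "error_code": status & 0xFFFF,            # low 16 bits
    }

# The three distinct status values that appear in this thread.
for raw in (0xdc204000000c0135, 0xdc204000000d0175, 0x9c204000000c0135):
    d = decode_mc_status(raw)
    print(f"{raw:#018x} xec={d['ext_error_code']} flags={'|'.join(d['flags'])}")
```

Under these assumptions the extended error codes come out as 12 and 13, matching the "Ext. Error Code" lines in the log, and the UC (uncorrected) bit is clear in all three values, which is consistent with "Corrected error, no action required."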

My system has the following configuration:
  • AMD Ryzen 5 3600 processor
  • Cooler Master MWE Gold 650 Full Modular PSU
  • G.Skill DDR4 Ripjaws-V 2x8GB 3200MHz - [F4-3200C16D-16GVKB]
  • G.Skill DDR4 Ripjaws-V 2x8GB 3200MHz - [F4-3200C16D-16GVGB]
  • Noctua NH-L9x65 SE-AM4
  • Sharkoon Case SKILLER SGC1
  • MSI B450M PRO-VDH MAX (B450)
  • Intel Consumer SSD 660p 512 GB PCI Express 3.0 M.2, SSDPEKNW512G8X1
  • Radeon HD5450 PCI-E R81KLC DDR3 512MB DVI Video Card AX5450 512MK3-SH.
I have done the following tests:
  • stress --vm 32 --vm-bytes 1024M -> resulted in a restart and errors in the log
  • stress --vm 16 --vm-bytes 1024M with the F4-3200C16D-16GVKB kit -> resulted in a restart and errors in the log
  • stress --vm 16 --vm-bytes 1024M with the F4-3200C16D-16GVGB kit -> resulted in a restart and errors in the log
  • memtest86 from a bootable USB stick -> no errors found after 4 passes
Note: power consumption at idle was 52-55 W; while running stress it peaked at 118 W.

I also found "https://forum.proxmox.com/threads/proxmox-freezing-on-amd-ryzen-machines.56806/", which suggests adding a few kernel parameters in GRUB. My GRUB options are now:

Code:
quiet rcu_nocbs=0-11 processor.max_cstate=1 iommu=pt amd_iommu=on video=efifb:off
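
As a quick sanity check on these options, here is a small Python sketch that splits the command line into parameters and expands the rcu_nocbs CPU list. `parse_cmdline` and `expand_cpu_list` are hypothetical helper names, and the list expansion handles only the simple "a-b,c" form, not every syntax the kernel accepts.

```python
def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params

def expand_cpu_list(spec: str) -> list:
    """Expand a simple CPU list such as '0-3,8' into [0, 1, 2, 3, 8]."""
    cpus = []
    for part in spec.split(","):
        first, _, last = part.partition("-")
        cpus.extend(range(int(first), int(last or first) + 1))
    return cpus

OPTIONS = "quiet rcu_nocbs=0-11 processor.max_cstate=1 iommu=pt amd_iommu=on video=efifb:off"
params = parse_cmdline(OPTIONS)
print(len(expand_cpu_list(params["rcu_nocbs"])))  # number of CPUs covered by rcu_nocbs
```

The range 0-11 covers 12 logical CPUs, which matches the 6 cores / 12 threads of a Ryzen 5 3600. On a running system the active command line can be read from /proc/cmdline and passed to the same function to confirm the options were actually applied after the GRUB update.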

Is this something kernel related or do I need to start an RMA procedure with AMD? Need some help on this one!
 
The first thing I would try is upgrading the firmware/BIOS on the mainboard. This has become more and more important recently, and it sometimes fixes errors like these.

I hope this helps!
 
BIOS is already updated to the latest version I found: 7A38vB4

It was one of the first things I tried; I just forgot to mention it. So it doesn't solve the problem.

I have also already tried kernel 5.0 and kernel 5.3.
 
I would like to add that the hardware error messages appear like clockwork, every 5 minutes and 11 seconds, exactly the same each time.
 
Hmm, the periodicity seems odd. Is there any correlation with some task running at that time (anything happening in the journal before the messages show up)?

I don't have much experience with MCE, but maybe this blog post helps to get some more information out of the error messages:
https://www.cnx-software.com/2019/07/17/machine-check-exception-mce-errors-linux/
Nothing in the syslog correlates with it. I have, however, stopped and disabled pvesr.timer; that had no impact on the logging.

rasdaemon errors (last 3):
Code:
220 2019-12-18 12:20:30 +0100 error: Corrected error, no action required., CPU 2, bank Load Store Unit (bank=0), mcg mcgstatus=0, mci Error_overflow CECC, mcgcap=0x0000011c, status=0xdc204000000c0135, addr=0x569e9cb1c, misc=0xd01a025400000000, walltime=0x5dfa0b7e, cpu=0x0000000a, cpuid=0x00870f10, apicid=0x0000000b
221 2019-12-18 12:25:41 +0100 error: Corrected error, no action required., CPU 2, bank Load Store Unit (bank=0), mcg mcgstatus=0, mci Error_overflow CECC, mcgcap=0x0000011c, status=0xdc204000000c0135, addr=0x7354d039c, misc=0xd01a026100000000, walltime=0x5dfa0cb5, cpu=0x0000000a, cpuid=0x00870f10, apicid=0x0000000b
222 2019-12-18 12:30:52 +0100 error: Corrected error, no action required., CPU 2, bank Load Store Unit (bank=0), mcg mcgstatus=0, mci CECC, mcgcap=0x0000011c, status=0x9c204000000c0135, addr=0x7271dab5c, misc=0xd01a026200000000, walltime=0x5dfa0dec, cpu=0x0000000a, cpuid=0x00870f10, apicid=0x0000000b
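
The hex fields in these records can be pulled apart with a few lines of Python. This is only a sketch: `parse_hex_fields` is a hypothetical helper, and reading walltime as a Unix timestamp is an assumption on my part, although it is consistent with the printed time of record 220.

```python
import re
from datetime import datetime, timezone

# Record 220 from the output above, copied verbatim.
RECORD = (
    "220 2019-12-18 12:20:30 +0100 error: Corrected error, no action required., "
    "CPU 2, bank Load Store Unit (bank=0), mcg mcgstatus=0, mci Error_overflow CECC, "
    "mcgcap=0x0000011c, status=0xdc204000000c0135, addr=0x569e9cb1c, "
    "misc=0xd01a025400000000, walltime=0x5dfa0b7e, cpu=0x0000000a, "
    "cpuid=0x00870f10, apicid=0x0000000b"
)

def parse_hex_fields(record: str) -> dict:
    """Collect every key=0x... pair in the record as an integer."""
    pairs = re.findall(r"(\w+)=(0x[0-9a-fA-F]+)", record)
    return {key: int(value, 16) for key, value in pairs}

fields = parse_hex_fields(RECORD)
# Assumption: walltime is seconds since the Unix epoch.
logged_at = datetime.fromtimestamp(fields["walltime"], tz=timezone.utc)
print("logical CPU:", fields["cpu"])
print("logged at (UTC):", logged_at.isoformat())
```

The cpu field decodes to 10, which lines up with CPU:10 in the kernel log even though the record text says "CPU 2", and the walltime decodes to 11:20:30 UTC, i.e. 12:20:30 +0100 as printed at the start of the record.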

Snippet from syslog:

Code:
Dec 18 15:30:22 bulldog kernel: [ 5608.491511] mce: [Hardware Error]: Machine check events logged
Dec 18 15:30:22 bulldog kernel: [ 5608.491513] [Hardware Error]: Corrected error, no action required.
Dec 18 15:30:22 bulldog kernel: [ 5608.491518] [Hardware Error]: CPU:4 (17:71:0) MC0_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc204000000c0135
Dec 18 15:30:22 bulldog kernel: [ 5608.491521] [Hardware Error]: Error Addr: 0x00000005d0342adc
Dec 18 15:30:22 bulldog kernel: [ 5608.491522] [Hardware Error]: IPID: 0x000000b000000000, Syndrome: 0x000000201a1b2b01
Dec 18 15:30:22 bulldog kernel: [ 5608.491524] [Hardware Error]: Load Store Unit Ext. Error Code: 12, DC Data error type 1 and poison consumption.
Dec 18 15:30:22 bulldog kernel: [ 5608.491526] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
Dec 18 15:35:33 bulldog kernel: [ 5919.789739] mce: [Hardware Error]: Machine check events logged
Dec 18 15:35:33 bulldog kernel: [ 5919.789740] [Hardware Error]: Corrected error, no action required.
Dec 18 15:35:33 bulldog kernel: [ 5919.789746] [Hardware Error]: CPU:10 (17:71:0) MC0_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0x9c204000000c0135
Dec 18 15:35:33 bulldog kernel: [ 5919.789749] [Hardware Error]: Error Addr: 0x0000000717805d5c
Dec 18 15:35:33 bulldog kernel: [ 5919.789750] [Hardware Error]: IPID: 0x000000b000000000, Syndrome: 0x000000201a1b3504
Dec 18 15:35:33 bulldog kernel: [ 5919.789753] [Hardware Error]: Load Store Unit Ext. Error Code: 12, DC Data error type 1 and poison consumption.
Dec 18 15:35:33 bulldog kernel: [ 5919.789755] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
Dec 18 15:35:33 bulldog kernel: [ 5919.789757] mce: [Hardware Error]: Machine check events logged
Dec 18 15:35:33 bulldog kernel: [ 5919.789758] [Hardware Error]: Corrected error, no action required.
Dec 18 15:35:33 bulldog kernel: [ 5919.789759] [Hardware Error]: CPU:4 (17:71:0) MC0_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc204000000d0175
Dec 18 15:35:33 bulldog kernel: [ 5919.789762] [Hardware Error]: Error Addr: 0x0000000788fa041c
Dec 18 15:35:33 bulldog kernel: [ 5919.789763] [Hardware Error]: IPID: 0x000000b000000000, Syndrome: 0x000000201a1b1000
Dec 18 15:35:33 bulldog kernel: [ 5919.789765] [Hardware Error]: Load Store Unit Ext. Error Code: 13, DC Data error type 2.
Dec 18 15:35:33 bulldog kernel: [ 5919.789767] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: EV
Dec 18 15:40:44 bulldog kernel: [ 6231.087907] mce: [Hardware Error]: Machine check events logged
Dec 18 15:40:44 bulldog kernel: [ 6231.087910] [Hardware Error]: Corrected error, no action required.
Dec 18 15:40:44 bulldog kernel: [ 6231.087914] [Hardware Error]: CPU:10 (17:71:0) MC0_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0x9c204000000c0135
Dec 18 15:40:44 bulldog kernel: [ 6231.087918] [Hardware Error]: Error Addr: 0x00000007eb1b1a1c
Dec 18 15:40:44 bulldog kernel: [ 6231.087919] [Hardware Error]: IPID: 0x000000b000000000, Syndrome: 0x000000201a1b2804
Dec 18 15:40:44 bulldog kernel: [ 6231.087921] [Hardware Error]: Load Store Unit Ext. Error Code: 12, DC Data error type 1 and poison consumption.
Dec 18 15:40:44 bulldog kernel: [ 6231.087924] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
Dec 18 15:40:44 bulldog kernel: [ 6231.087926] mce: [Hardware Error]: Machine check events logged
Dec 18 15:40:44 bulldog kernel: [ 6231.087926] [Hardware Error]: Corrected error, no action required.
Dec 18 15:40:44 bulldog kernel: [ 6231.087928] [Hardware Error]: CPU:4 (17:71:0) MC0_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc204000000d0175
Dec 18 15:40:44 bulldog kernel: [ 6231.087931] [Hardware Error]: Error Addr: 0x00000004f635801c
Dec 18 15:40:44 bulldog kernel: [ 6231.087932] [Hardware Error]: IPID: 0x000000b000000000, Syndrome: 0x000000201a1b0001
Dec 18 15:40:44 bulldog kernel: [ 6231.087934] [Hardware Error]: Load Store Unit Ext. Error Code: 13, DC Data error type 2.
Dec 18 15:40:44 bulldog kernel: [ 6231.087935] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: EV
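
To double-check the interval, the gaps between bursts can be computed from the syslog timestamps. A minimal Python sketch follows; the sample lines are the "Machine check events logged" lines from the snippet above, `burst_times` is a hypothetical helper, and the year is supplied by hand because syslog timestamps do not carry one.

```python
import re
from datetime import datetime

LOG = """\
Dec 18 15:30:22 bulldog kernel: [ 5608.491511] mce: [Hardware Error]: Machine check events logged
Dec 18 15:35:33 bulldog kernel: [ 5919.789739] mce: [Hardware Error]: Machine check events logged
Dec 18 15:35:33 bulldog kernel: [ 5919.789757] mce: [Hardware Error]: Machine check events logged
Dec 18 15:40:44 bulldog kernel: [ 6231.087907] mce: [Hardware Error]: Machine check events logged
Dec 18 15:40:44 bulldog kernel: [ 6231.087926] mce: [Hardware Error]: Machine check events logged
"""

PATTERN = re.compile(r"^(\w{3} +\d+ \d\d:\d\d:\d\d) .*Machine check events logged")

def burst_times(log: str, year: int = 2019) -> list:
    """Return the distinct wall-clock times at which MCE bursts were logged."""
    times = []
    for line in log.splitlines():
        match = PATTERN.match(line)
        if match:
            stamp = datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
            # Several events in the same second count as one burst.
            if not times or times[-1] != stamp:
                times.append(stamp)
    return times

times = burst_times(LOG)
gaps = [int((later - earlier).total_seconds()) for earlier, later in zip(times, times[1:])]
print(gaps)  # [311, 311]
```

Both gaps are 311 seconds, i.e. 5 minutes and 11 seconds, the same spacing as the rasdaemon records.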
 
The system is now completely unstable with Proxmox. If, however, I boot a Debian live USB, it doesn't reboot.

When using an Ubuntu live USB, the system reboots when it goes to sleep.

Disabling C-states in the BIOS didn't help.
 
I received feedback from the supplier: the errors indicate instruction errors and a defect in the CPU’s cache memory. AMD approved the RMA, so hopefully I'm back in business next year.
 
Hm, sorry about the downtime, but glad that the error messages indeed pointed to the cause and that AMD quickly approved the RMA.

Please mark the thread as 'SOLVED'
Thanks!