[SOLVED] Hardware Errors, system reboots on its own

phdelodder

Member
Dec 18, 2019
9
1
8
38
I'm a new Proxmox user. I did an installation based on 6.0 and upgraded to 6.1. I regularly receive these errors in the syslog:

Code:
[Wed Dec 18 07:57:11 2019] mce: [Hardware Error]: Machine check events logged
[Wed Dec 18 07:57:11 2019] [Hardware Error]: Corrected error, no action required.
[Wed Dec 18 07:57:11 2019] [Hardware Error]: CPU:10 (17:71:0) MC0_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc204000000c0135
[Wed Dec 18 07:57:11 2019] [Hardware Error]: Error Addr: 0x00000006b743969c
[Wed Dec 18 07:57:11 2019] [Hardware Error]: IPID: 0x000000b000000000, Syndrome: 0x000000201a1b1a04
[Wed Dec 18 07:57:11 2019] [Hardware Error]: Load Store Unit Ext. Error Code: 12, DC Data error type 1 and poison consumption.
[Wed Dec 18 07:57:11 2019] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
[Wed Dec 18 07:57:11 2019] mce: [Hardware Error]: Machine check events logged
[Wed Dec 18 07:57:11 2019] [Hardware Error]: Corrected error, no action required.
[Wed Dec 18 07:57:11 2019] [Hardware Error]: CPU:4 (17:71:0) MC0_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc204000000d0175
[Wed Dec 18 07:57:11 2019] [Hardware Error]: Error Addr: 0x000000081e8d5e5c
[Wed Dec 18 07:57:11 2019] [Hardware Error]: IPID: 0x000000b000000000, Syndrome: 0x000000201a1b3904
[Wed Dec 18 07:57:11 2019] [Hardware Error]: Load Store Unit Ext. Error Code: 13, DC Data error type 2.
[Wed Dec 18 07:57:11 2019] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: EV

The system also reboots on its own. I don't know what triggers it, but I believe it is related to the hardware errors. The errors are always logged on CPU:4 or CPU:10.
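To check the "always CPU:4/CPU:10" observation over a longer log, the status lines can be tallied per logical CPU. This is only a minimal sketch: the regular expression is written against the line format shown above, and the function name `tally_mce_by_cpu` is made up for illustration.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   [Hardware Error]: CPU:10 (17:71:0) MC0_STATUS[...]: 0xdc204000000c0135
# The pattern is an assumption based on the log excerpt in this thread.
MCE_CPU_RE = re.compile(
    r"\[Hardware Error\]: CPU:(\d+) .*?MC(\d+)_STATUS\[[^\]]*\]: (0x[0-9a-fA-F]+)"
)

def tally_mce_by_cpu(log_text):
    """Return a Counter mapping logical CPU number -> number of MCE status lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = MCE_CPU_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts

sample = """\
[Wed Dec 18 07:57:11 2019] [Hardware Error]: CPU:10 (17:71:0) MC0_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc204000000c0135
[Wed Dec 18 07:57:11 2019] [Hardware Error]: Error Addr: 0x00000006b743969c
[Wed Dec 18 07:57:11 2019] [Hardware Error]: CPU:4 (17:71:0) MC0_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc204000000d0175
"""

print(sorted(tally_mce_by_cpu(sample).items()))
```

Feeding it the full `dmesg -T` or syslog output instead of `sample` should show whether any other CPU ever appears.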

My system has the following configuration:
  • AMD Ryzen 5 3600 processor
  • Cooler Master MWE Gold 650 Full Modular PSU
  • G.Skill DDR4 Ripjaws-V 2x8GB 3200MHz - [F4-3200C16D-16GVKB]
  • G.Skill DDR4 Ripjaws-V 2x8GB 3200MHz - [F4-3200C16D-16GVGB]
  • Noctua NH-L9x65 SE-AM4
  • Sharkoon Case SKILLER SGC1
  • MSI B450M PRO-VDH MAX
  • Intel Consumer SSD 660p 512 GB PCI Express 3.0 M.2, SSDPEKNW512G8X1
  • Radeon HD5450 PCI-E R81KLC DDR3 512MB DVI Video Card AX5450 512MK3-SH.
I have done the following tests:
  • stress --vm 32 --vm-bytes 1024M -> resulted in a restart and errors in the log
  • stress --vm 16 --vm-bytes 1024M for the F4-3200C16D-16GVKB kit -> resulted in a restart and errors in the log
  • stress --vm 16 --vm-bytes 1024M for the F4-3200C16D-16GVGB kit -> resulted in a restart and errors in the log
  • memtest86 from a bootable USB stick -> no errors found after 4 passes
Note: power consumption at idle was 52-55 W; while running stress it peaked at 118 W.

I also found https://forum.proxmox.com/threads/proxmox-freezing-on-amd-ryzen-machines.56806/, which suggests adding a few kernel parameters in GRUB. My GRUB options are now:

Code:
quiet rcu_nocbs=0-11 processor.max_cstate=1 iommu=pt amd_iommu=on video=efifb:off
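After editing the GRUB options it is worth confirming that the running kernel actually picked them up. The snippet below is a rough sketch of that check: it splits a command line into name/value pairs and reports which expected options are missing or different. The helper names `parse_cmdline` and `missing_params` are hypothetical, and quoted values containing spaces are not handled.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params

def missing_params(expected_line, running_line):
    """Return the expected options that are absent or differ in the running line."""
    want = parse_cmdline(expected_line)
    have = parse_cmdline(running_line)
    return {k: v for k, v in want.items() if k not in have or have[k] != v}

# The options quoted in this post.
expected = "quiet rcu_nocbs=0-11 processor.max_cstate=1 iommu=pt amd_iommu=on video=efifb:off"

# Example with a made-up running command line that lacks most of them.
print(missing_params(expected, "BOOT_IMAGE=/boot/vmlinuz quiet iommu=pt"))
```

On a live system the second argument would come from `open("/proc/cmdline").read()`; an empty result means all options are active.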

Is this something kernel related or do I need to start an RMA procedure with AMD? Need some help on this one!
 

Stoiko Ivanov

Proxmox Staff Member
Staff member
May 2, 2018
7,760
1,281
169
First thing I would try to do is to upgrade the firmware/bios on the mainboard - This has become more and more important recently and it does sometimes fix errors like these.

I hope this helps!
 

phdelodder

Member
Dec 18, 2019
9
1
8
38
The BIOS is already updated to the latest version I could find: 7A38vB4

That was one of the first things I tried; I just forgot to mention it. So it doesn't solve the problem.

I have also already tried kernel 5.0 and kernel 5.3.
 

phdelodder

Member
Dec 18, 2019
9
1
8
38
I would like to add that the hardware error messages appear like clockwork, every 5 minutes and 11 seconds. Exactly the same interval each time.
 

Stoiko Ivanov

Proxmox Staff Member
Staff member
May 2, 2018
7,760
1,281
169

Hmm, the periodic pattern seems odd. Is there any correlation with some task running at that time (anything happening in the journal before the messages show up)?

I don't have much experience with mce, but this blog post might help to get some more information out of the error messages:
https://www.cnx-software.com/2019/07/17/machine-check-exception-mce-errors-linux/
 

phdelodder

Member
Dec 18, 2019
9
1
8
38
Nothing in the syslog correlates with it. I have, however, stopped and disabled pvesr.timer; that had no impact on the logging.

rasdaemon errors (last 3):
Code:
220 2019-12-18 12:20:30 +0100 error: Corrected error, no action required., CPU 2, bank Load Store Unit (bank=0), mcg mcgstatus=0, mci Error_overflow CECC, mcgcap=0x0000011c, status=0xdc204000000c0135, addr=0x569e9cb1c, misc=0xd01a025400000000, walltime=0x5dfa0b7e, cpu=0x0000000a, cpuid=0x00870f10, apicid=0x0000000b
221 2019-12-18 12:25:41 +0100 error: Corrected error, no action required., CPU 2, bank Load Store Unit (bank=0), mcg mcgstatus=0, mci Error_overflow CECC, mcgcap=0x0000011c, status=0xdc204000000c0135, addr=0x7354d039c, misc=0xd01a026100000000, walltime=0x5dfa0cb5, cpu=0x0000000a, cpuid=0x00870f10, apicid=0x0000000b
222 2019-12-18 12:30:52 +0100 error: Corrected error, no action required., CPU 2, bank Load Store Unit (bank=0), mcg mcgstatus=0, mci CECC, mcgcap=0x0000011c, status=0x9c204000000c0135, addr=0x7271dab5c, misc=0xd01a026200000000, walltime=0x5dfa0dec, cpu=0x0000000a, cpuid=0x00870f10, apicid=0x0000000b
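
The two status values above (0xdc... and 0x9c...) differ only in one flag, which is easier to see when the register is decoded. The following is a minimal sketch, assuming the bit positions the Linux kernel uses for MCA status registers (Val=63, Over=62, UC=61, En=60, MiscV=59, AddrV=58, PCC=57, plus the AMD-specific SyndV=53, CECC=46, UECC=45, Deferred=44, Poison=43) and a 6-bit extended error code in bits 16-21 as used on Scalable MCA systems. The function name `decode_status` is made up for this example.

```python
# Bit positions are an assumption taken from the Linux kernel's MCA status
# definitions; the AMD-specific ones are meant for Zen-era (Scalable MCA) CPUs.
STATUS_BITS = {
    63: "Val", 62: "Over", 61: "UC", 60: "En", 59: "MiscV", 58: "AddrV", 57: "PCC",
    53: "SyndV", 46: "CECC", 45: "UECC", 44: "Deferred", 43: "Poison",
}

def decode_status(status):
    """Return the set flags, the extended error code and the base error code."""
    flags = [
        name
        for bit, name in sorted(STATUS_BITS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {
        "flags": flags,
        "ext_error_code": (status >> 16) & 0x3F,  # bits 16-21
        "error_code": status & 0xFFFF,            # bits 0-15
    }

for value in (0xDC204000000C0135, 0x9C204000000C0135, 0xDC204000000D0175):
    print(hex(value), decode_status(value))
```

With UC clear the kernel prints "CE" (corrected error), and the extended error codes 12 and 13 line up with the "Ext. Error Code" values in the kernel messages. The only difference between 0xdc... and 0x9c... is the Over (overflow) flag.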

Snippet from syslog:

Code:
Dec 18 15:30:22 bulldog kernel: [ 5608.491511] mce: [Hardware Error]: Machine check events logged
Dec 18 15:30:22 bulldog kernel: [ 5608.491513] [Hardware Error]: Corrected error, no action required.
Dec 18 15:30:22 bulldog kernel: [ 5608.491518] [Hardware Error]: CPU:4 (17:71:0) MC0_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc204000000c0135
Dec 18 15:30:22 bulldog kernel: [ 5608.491521] [Hardware Error]: Error Addr: 0x00000005d0342adc
Dec 18 15:30:22 bulldog kernel: [ 5608.491522] [Hardware Error]: IPID: 0x000000b000000000, Syndrome: 0x000000201a1b2b01
Dec 18 15:30:22 bulldog kernel: [ 5608.491524] [Hardware Error]: Load Store Unit Ext. Error Code: 12, DC Data error type 1 and poison consumption.
Dec 18 15:30:22 bulldog kernel: [ 5608.491526] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
Dec 18 15:35:33 bulldog kernel: [ 5919.789739] mce: [Hardware Error]: Machine check events logged
Dec 18 15:35:33 bulldog kernel: [ 5919.789740] [Hardware Error]: Corrected error, no action required.
Dec 18 15:35:33 bulldog kernel: [ 5919.789746] [Hardware Error]: CPU:10 (17:71:0) MC0_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0x9c204000000c0135
Dec 18 15:35:33 bulldog kernel: [ 5919.789749] [Hardware Error]: Error Addr: 0x0000000717805d5c
Dec 18 15:35:33 bulldog kernel: [ 5919.789750] [Hardware Error]: IPID: 0x000000b000000000, Syndrome: 0x000000201a1b3504
Dec 18 15:35:33 bulldog kernel: [ 5919.789753] [Hardware Error]: Load Store Unit Ext. Error Code: 12, DC Data error type 1 and poison consumption.
Dec 18 15:35:33 bulldog kernel: [ 5919.789755] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
Dec 18 15:35:33 bulldog kernel: [ 5919.789757] mce: [Hardware Error]: Machine check events logged
Dec 18 15:35:33 bulldog kernel: [ 5919.789758] [Hardware Error]: Corrected error, no action required.
Dec 18 15:35:33 bulldog kernel: [ 5919.789759] [Hardware Error]: CPU:4 (17:71:0) MC0_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc204000000d0175
Dec 18 15:35:33 bulldog kernel: [ 5919.789762] [Hardware Error]: Error Addr: 0x0000000788fa041c
Dec 18 15:35:33 bulldog kernel: [ 5919.789763] [Hardware Error]: IPID: 0x000000b000000000, Syndrome: 0x000000201a1b1000
Dec 18 15:35:33 bulldog kernel: [ 5919.789765] [Hardware Error]: Load Store Unit Ext. Error Code: 13, DC Data error type 2.
Dec 18 15:35:33 bulldog kernel: [ 5919.789767] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: EV
Dec 18 15:40:44 bulldog kernel: [ 6231.087907] mce: [Hardware Error]: Machine check events logged
Dec 18 15:40:44 bulldog kernel: [ 6231.087910] [Hardware Error]: Corrected error, no action required.
Dec 18 15:40:44 bulldog kernel: [ 6231.087914] [Hardware Error]: CPU:10 (17:71:0) MC0_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0x9c204000000c0135
Dec 18 15:40:44 bulldog kernel: [ 6231.087918] [Hardware Error]: Error Addr: 0x00000007eb1b1a1c
Dec 18 15:40:44 bulldog kernel: [ 6231.087919] [Hardware Error]: IPID: 0x000000b000000000, Syndrome: 0x000000201a1b2804
Dec 18 15:40:44 bulldog kernel: [ 6231.087921] [Hardware Error]: Load Store Unit Ext. Error Code: 12, DC Data error type 1 and poison consumption.
Dec 18 15:40:44 bulldog kernel: [ 6231.087924] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
Dec 18 15:40:44 bulldog kernel: [ 6231.087926] mce: [Hardware Error]: Machine check events logged
Dec 18 15:40:44 bulldog kernel: [ 6231.087926] [Hardware Error]: Corrected error, no action required.
Dec 18 15:40:44 bulldog kernel: [ 6231.087928] [Hardware Error]: CPU:4 (17:71:0) MC0_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc204000000d0175
Dec 18 15:40:44 bulldog kernel: [ 6231.087931] [Hardware Error]: Error Addr: 0x00000004f635801c
Dec 18 15:40:44 bulldog kernel: [ 6231.087932] [Hardware Error]: IPID: 0x000000b000000000, Syndrome: 0x000000201a1b0001
Dec 18 15:40:44 bulldog kernel: [ 6231.087934] [Hardware Error]: Load Store Unit Ext. Error Code: 13, DC Data error type 2.
Dec 18 15:40:44 bulldog kernel: [ 6231.087935] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: EV
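
The "every 5 minutes and 11 seconds" pattern can be confirmed from the timestamps in this snippet. A small sketch, assuming syslog-style timestamps without a year (the year is passed in separately) and an English locale for the month abbreviation; `intervals_seconds` is a hypothetical helper name.

```python
from datetime import datetime

def intervals_seconds(stamps, year=2019):
    """Seconds between consecutive syslog timestamps like 'Dec 18 15:30:22'."""
    times = [datetime.strptime(f"{year} {s}", "%Y %b %d %H:%M:%S") for s in stamps]
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(times, times[1:])]

# Timestamps of the three error bursts in the syslog snippet above.
stamps = ["Dec 18 15:30:22", "Dec 18 15:35:33", "Dec 18 15:40:44"]
print(intervals_seconds(stamps))  # [311, 311]
```

One possible explanation, offered as an assumption rather than a confirmed diagnosis: the kernel polls the machine-check banks for corrected errors on a timer whose default interval is about 5 minutes, so a fixed spacing like this may reflect when errors are collected rather than when they occur.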
 

phdelodder

Member
Dec 18, 2019
9
1
8
38
The system is now completely unstable with Proxmox. However, if I use a Debian live USB it doesn't reboot.

When using an Ubuntu live USB, the system reboots when it goes to sleep.

Disabling C-states in the BIOS didn't help.
 

phdelodder

Member
Dec 18, 2019
9
1
8
38
I received feedback from the supplier: the errors indicate instruction errors and a defect in the CPU's cache memory. AMD approved the RMA, so hopefully I'll be back in business next year.
 

Stoiko Ivanov

Proxmox Staff Member
Staff member
May 2, 2018
7,760
1,281
169
Hm - sad for the downtime - but glad that the error-messages indeed hit the spot and that AMD quickly provides RMA.

Please mark the thread as 'SOLVED'
Thanks!
 
