Hardware Error - Sudden Restart of Proxmox

aoiblue775

May 6, 2024
Our Proxmox 8.2.2 server rebooted suddenly, and the only thing of interest I can find in the dmesg log is 'Hardware Error'. I would like to understand what likely caused this and how to prevent it. We build software on this system, so random shutdowns or reboots are a real problem. I suspected temperature at first, but the temps seem reasonable, so I don't think that is the cause.

Hardware:
AMD Ryzen Threadripper 7980X
256 GB of DDR5 (4 sticks)

I'm not sure what logs to attach besides dmesg. Let me know what else would help and I'll add it!

Thanks in advance!

https://pastecry.pt/4H1hWM
(key: 'proxmox123!')
 
[[
BERT: Error records from previous boot:
[ 1.354520] [Hardware Error]: It has been corrected by h/w and requires no further action
[ 1.354687] [Hardware Error]: event severity: corrected
[ 1.354774] [Hardware Error]: Error 0, type: corrected
[ 1.354857] [Hardware Error]: fru_text: ProcessorError
[ 1.354939] [Hardware Error]: section_type: IA32/X64 processor error

[ 1.355022] [Hardware Error]: Local APIC_ID: 0xe
[ 1.355104] [Hardware Error]: CPUID Info:
[ 1.355183] [Hardware Error]: 00000000: 00a10f81 00000000 0e800800 00000000
[ 1.355263] [Hardware Error]: 00000010: 76fa320b 00000000 178bfbff 00000000
[ 1.355343] [Hardware Error]: 00000020: 00000000 00000000 00000000 00000000
[ 1.355422] [Hardware Error]: Error Information Structure 0:
[ 1.355501] [Hardware Error]: Error Structure Type: cache error
[ 1.355579] [Hardware Error]: Check Information: 0x0000000020140087
[ 1.355657] [Hardware Error]: Transaction Type: 0, Instruction
[ 1.355734] [Hardware Error]: Operation: 5, instruction fetch
[ 1.355807] [Hardware Error]: Level: 0
[ 1.355879] [Hardware Error]: Overflow: true
[ 1.355950] [Hardware Error]: Context Information Structure 0:
[ 1.356022] [Hardware Error]: Register Context Type: MSR Registers (Machine Check and other MSRs)
[ 1.356170] [Hardware Error]: Register Array Size: 0x0080
[ 1.356248] [Hardware Error]: MSR Address: 0xc0002111
[ 1.356330] BERT: Total records found: 1
[ 1.356484] PM: Magic number: 8:495:848
[ 1.356632] mce: [Hardware Error]: Machine check events logged
[ 1.356730] mce: [Hardware Error]: CPU 7: Machine Check: 0 Bank 17: d820000000060150
[ 1.356759] acpi device:c9: hash matches
[ 1.356907] mce: [Hardware Error]: TSC 0 MISC d0150fff00000000 PPIN 2b0ba718454c06e SYND 812d4a000000 IPID 24117c09200
[ 1.357120] GHES GHES.23690: hash matches
[ 1.357183] mce: [Hardware Error]: PROCESSOR 2:a10f81 TIME 1716576628 SOCKET 0 APIC e microcode a108105

[ 1.357361] GHES GHES.22854: hash matches
[ 1.357787] GHES GHES.21763: hash matches
[ 1.357972] GHES GHES.20927: hash matches
[ 1.358247] GHES GHES.6942: hash matches
[ 1.358385] GHES GHES.6496: hash matches
[ 1.358682] GHES GHES.4569: hash matches
[ 1.358828] processor cpu12: hash matches
[ 1.366816] RAS: Correctable Errors collector initialized.
]]
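For anyone trying to read the `mce` line in the log above, here is a minimal Python sketch that decodes the flag bits of the status value `d820000000060150`. The bit positions are assumed from the Linux kernel's `MCI_STATUS_*` masks (the syndrome-valid bit and the 6-bit extended error code are AMD-specific). The name `decode_mca_status` and the field names are my own, and the sketch does not try to interpret what Bank 17 or the error codes mean on this particular CPU.

```python
# Hypothetical helper; bit positions assumed from the Linux kernel's
# MCI_STATUS_* masks. Field names are illustrative only.
FLAGS = {
    "valid": 63,
    "overflow": 62,
    "uncorrected": 61,
    "enabled": 60,
    "misc_valid": 59,
    "addr_valid": 58,
    "context_corrupt": 57,
    "syndrome_valid": 53,  # AMD-specific (assumption)
}


def decode_mca_status(status: int) -> dict:
    """Split a 64-bit MCA status value into flag bits and error codes."""
    fields = {name: bool((status >> bit) & 1) for name, bit in FLAGS.items()}
    fields["error_code"] = status & 0xFFFF  # bits 15:0
    fields["extended_error_code"] = (status >> 16) & 0x3F  # bits 21:16 (AMD)
    return fields


decoded = decode_mca_status(0xD820000000060150)
for name, value in decoded.items():
    # Flags print as True/False; the two code fields print in hex.
    print(f"{name}: {value if isinstance(value, bool) else hex(value)}")
```

For the value in this thread, the sketch reports `valid`, `overflow`, and `enabled` set and `uncorrected` clear, which agrees with the "corrected" severity and "Overflow: true" in the BERT record. The overflow flag suggests that at least one more error arrived before the first was read out, so this was probably not a single isolated event.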

Look at replacing the CPU, or the entire server. If this is for business, nobody has time to mess around with deep troubleshooting and experiments while production is being delayed. Get Proxmox on a firm footing on different hardware, decommission this server and then troubleshoot it.

Also, if this is for business, you should have a support contract. I would recommend opening a ticket with both Proxmox support and the hardware vendor.
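When opening a ticket, it may help to show whether the errors keep hitting the same core. Below is a rough Python sketch that counts machine check events per CPU and bank. It assumes kernel log text in the format seen in this thread (saved, for example, from `dmesg` or `journalctl -k`). The name `count_mce_events` and the regular expression are illustrative and not part of any Proxmox tool.

```python
import re
from collections import Counter

# Matches lines such as:
# mce: [Hardware Error]: CPU 7: Machine Check: 0 Bank 17: d820000000060150
MCE_LINE = re.compile(
    r"mce: \[Hardware Error\]: CPU (\d+): Machine Check: \d+ Bank (\d+): ([0-9a-f]+)"
)


def count_mce_events(log_text: str) -> Counter:
    """Count machine check events per (cpu, bank) pair in kernel log text."""
    counts = Counter()
    for match in MCE_LINE.finditer(log_text):
        cpu, bank, _status = match.groups()
        counts[(int(cpu), int(bank))] += 1
    return counts


# Two lines copied from the log in this thread; only the second names a CPU and bank.
sample = """\
[ 1.356632] mce: [Hardware Error]: Machine check events logged
[ 1.356730] mce: [Hardware Error]: CPU 7: Machine Check: 0 Bank 17: d820000000060150
"""

for (cpu, bank), n in count_mce_events(sample).items():
    print(f"CPU {cpu}, bank {bank}: {n} event(s)")
```

If the same CPU and bank show up across several boots, that is a stronger argument to the vendor for a CPU replacement than a single record.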
 
