random reboots (amd epyc)

rafaelmoreira

Member
Jan 11, 2022
Hi Everyone,

Recently I've been having random reboots in my environment.

Environment hardware details (2 x Dell R6525):
AMD EPYC 74F3
BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet
PERC H745 (RAID 5, 5 x 3.84 TB SSD, 12G)
Fully updated firmware (BIOS/network/iDRAC/SAS)

Proxmox:
Proxmox 7.1.8
Kernel 5.13.19-2-pve

I contacted Dell support. We reviewed the logs and they assured me that it is not a hardware problem (I believe that only because it also happens on another server exactly like this one).

Looking through the logs, I found this (kern.log):

[ 2.989605] BERT: Error records from previous boot:
[ 2.989606] [Hardware Error]: event severity: fatal
[ 2.989607] [Hardware Error]: Error 0, type: fatal
[ 2.989608] [Hardware Error]: section_type: IA32/X64 processor error
[ 2.989609] [Hardware Error]: Local APIC_ID: 0x4a
[ 2.989611] [Hardware Error]: CPUID Info:
[ 2.989613] [Hardware Error]: 00000000: 00a00f11 00000000 4a300800 00000000
[ 2.989614] [Hardware Error]: 00000010: 76fa320b 00000000 178bfbff 00000000
[ 2.989614] [Hardware Error]: 00000020: 00000000 00000000 00000000 00000000
[ 2.989615] [Hardware Error]: Error Information Structure 0:
[ 2.989616] [Hardware Error]: Error Structure Type: bus error
[ 2.989617] [Hardware Error]: Check Information: 0x00000000164267ff
[ 2.989618] [Hardware Error]: Transaction Type: 2, Generic
[ 2.989619] [Hardware Error]: Operation: 0, generic error
[ 2.989620] [Hardware Error]: Level: 1
[ 2.989620] [Hardware Error]: Processor Context Corrupt: true
[ 2.989621] [Hardware Error]: Uncorrected: true
[ 2.989621] [Hardware Error]: Precise IP: false
[ 2.989621] [Hardware Error]: Restartable IP: true
[ 2.989622] [Hardware Error]: Overflow: false
[ 2.989622] [Hardware Error]: Participation Type: 0, Local Processor originated request
[ 2.989623] [Hardware Error]: Time Out: false
[ 2.989624] [Hardware Error]: Address Space: 0, Memory Access
[ 2.989624] [Hardware Error]: Context Information Structure 0:
[ 2.989625] [Hardware Error]: Register Context Type: MSR Registers (Machine Check and other MSRs)
[ 2.989626] [Hardware Error]: Register Array Size: 0x0058
[ 2.989626] [Hardware Error]: MSR Address: 0xc0002010
[ 2.989627] [Hardware Error]: Register Array:
[ 2.989627] [Hardware Error]: 00000000: 0000000000000000 b2a00000060e0809
[ 2.989628] [Hardware Error]: 00000010: 0000000000000000 d010000000000000
[ 2.989629] [Hardware Error]: 00000020: 00000003000001f9 000100b00000004a
[ 2.989629] [Hardware Error]: 00000030: 000000005d000030 0000000000000000
[ 2.989630] [Hardware Error]: 00000040: 0000000000000000 0000000000000000
[ 2.989630] [Hardware Error]: 00000050: 0000000000000000
[ 2.989652] PM: Magic number: 6:347:782
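
For anyone reading the record above: the second 64-bit word of the Register Array (b2a00000060e0809) looks like the machine-check status value, assuming the array is laid out in the usual control, status, address, misc order starting at the listed MSR address. Below is a minimal Python sketch that decodes only the architectural x86 flag bits (63 down to 57) and the low 16-bit error code. The function name is made up for illustration, and vendor-specific bits are deliberately left out.

```python
# Architectural x86 machine-check status flags (bit positions 63..57).
MCA_STATUS_FLAGS = {
    "valid": 63,
    "overflow": 62,
    "uncorrected": 61,
    "enabled": 60,
    "misc_valid": 59,
    "addr_valid": 58,
    "context_corrupt": 57,
}


def decode_mca_status(status: int) -> dict:
    """Return the architectural flags and the low 16-bit MCA error code."""
    decoded = {
        name: bool((status >> bit) & 1)
        for name, bit in MCA_STATUS_FLAGS.items()
    }
    decoded["error_code"] = status & 0xFFFF
    return decoded


if __name__ == "__main__":
    # Second word of the Register Array above (assumed to be the status register).
    status = 0xB2A00000060E0809
    for name, value in decode_mca_status(status).items():
        if name == "error_code":
            print(f"{name}: {value:#06x}")
        else:
            print(f"{name}: {value}")
```

The decoded flags agree with the "Uncorrected: true", "Processor Context Corrupt: true" and "Overflow: false" lines in the log, which suggests the layout assumption is reasonable.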

I would like to know whether anyone else has seen cases like this, since it also happens on my other server, and whether there is any suggested fix or workaround.

Regards
 
Is your BIOS up to date?

Set your BIOS options to the following:

Advanced -> NB Configuration -> IOMMU (change to Enabled)

Advanced -> PCIe/PCI/PnP Configuration -> SR-IOV Support (change to Enabled)
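
To confirm from the running host that the IOMMU setting took effect, one option is to look for IOMMU groups in sysfs. This is a minimal Python sketch, assuming the standard Linux path /sys/kernel/iommu_groups. The function name is hypothetical, and an empty result only hints that the IOMMU is not active (kernel command-line options can also matter).

```python
from pathlib import Path


def count_iommu_groups(root: str = "/sys/kernel/iommu_groups") -> int:
    """Count IOMMU group directories; 0 means the path is missing or empty."""
    base = Path(root)
    if not base.is_dir():
        return 0
    return sum(1 for entry in base.iterdir() if entry.is_dir())


if __name__ == "__main__":
    groups = count_iommu_groups()
    if groups:
        print(f"IOMMU appears active: {groups} group(s) found")
    else:
        print("No IOMMU groups found; the IOMMU may be disabled")
```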
 
@Huch,

I already set these options, following another related post, but it had no effect.

Today I updated my system to the 5.15.17-1 kernel and hope this solves the issue.
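
After rebooting, it is worth checking that the host is actually running the new kernel. A minimal Python sketch, assuming release strings of the form shown in this thread (for example "5.13.19-2-pve"); the helper name is made up for illustration:

```python
import platform
import re


def kernel_version_tuple(release: str) -> tuple:
    """Extract the leading major.minor.patch numbers from a release string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


if __name__ == "__main__":
    running = platform.release()
    wanted = (5, 15, 17)  # kernel version mentioned in this post
    if kernel_version_tuple(running) >= wanted:
        print(f"Running {running}: at or above 5.15.17")
    else:
        print(f"Running {running}: still older than 5.15.17")
```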

Thanks
 
