AMD EPYC based systems rebooting

Alinor

New Member
Jun 16, 2017
5
0
1
37
I have multiple AMD EPYC based systems on different motherboards, all brand new and all having the same issue.
They keep crashing randomly: the system just reboots.

Cooling and temps are good, power is clean (sine wave, UPS), and the RAM is brand new ECC.
They all run the main OS off NVMe.

The BMC does not report any faults or error events.

The systems all run different workloads, mostly fairly light, but crashes do happen under load.
I am unable to see any issues in the error logs, but maybe I am not looking in the right place.

A few weeks ago I saw errors along the lines of "core XX not responding" in the CLI.

I am going to reach out to all the motherboard manufacturers. This happens on boards from ASRock, Gigabyte, and Supermicro, using EPYC 7301 and 7351P CPUs.

I will keep this thread updated with what I find. If you have any information about this issue, or can point to a place that does, it will be appreciated.
 

Alinor

This may be related to the C-states setting in the BIOS. I will update if this resolves the issue.
 

stark

Member
Feb 23, 2019
29
18
8
49
While we don't use Proxmox, we do disable c-states on all of our AMD servers (Opterons and EPYC)

Hopefully it's what's causing your grief. Good luck!
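For anyone who wants to confirm from the running system which idle states the kernel can still use after changing the BIOS setting, here is a minimal sketch. It assumes a Linux host that exposes the standard cpuidle sysfs tree under /sys/devices/system/cpu/cpu0/cpuidle; the function name list_cstates is made up for illustration, and what the tree shows depends on the idle driver in use.

```python
from pathlib import Path


def list_cstates(cpuidle_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return (state_dir, name, disabled) tuples for one CPU's idle states.

    Returns an empty list if the cpuidle directory does not exist,
    e.g. when no cpuidle driver is loaded.
    """
    base = Path(cpuidle_dir)
    if not base.is_dir():
        return []
    states = []
    # Sort numerically so that state10 comes after state2.
    for state in sorted(base.glob("state[0-9]*"), key=lambda p: int(p.name[5:])):
        name = (state / "name").read_text().strip()
        disable_file = state / "disable"
        # A value of "1" in "disable" means the state is disabled.
        disabled = disable_file.exists() and disable_file.read_text().strip() == "1"
        states.append((state.name, name, disabled))
    return states


if __name__ == "__main__":
    for state, name, disabled in list_cstates():
        print(f"{state}: {name} ({'disabled' if disabled else 'enabled'})")
```

If only a polling or C1 state is listed after the BIOS change, the deeper states are most likely off. This only reports what the kernel sees; it does not prove that C-states were the cause of the reboots.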
 

goseph

Active Member
Dec 4, 2014
35
1
28
I've been having the same issue with my nodes
Code:
SuperMicro M11SDV-8C-LN4F
AMD EPYC 3251

I did what they describe in this wiki entry (https://www.thomas-krenn.com/de/wiki/Random_Reboots_AMD_EPYC_Server) and disabled C-states.

I also updated the firmware of my Intel X710-DA2 (2 x SFP) from 6.01 to 8.40.

Any other suggestions what I could do?
Anyone having the same issues?
Which version of Proxmox are you using?
Which Linux Kernel?
Did you check RAM yet?
 
Oct 11, 2021
5
0
1
52
pve-manager/7.0-11/63d82f4e (running kernel: 5.11.22-5-pve)

I did check RAM 4 weeks ago running full memtest. No errors.
I also did a 48h cpu stress test with no issues.

I have two identical nodes.
 

goseph

Latest BIOS?
 

floerke

New Member
Oct 19, 2021
2
0
1
49
Same here:
* Latest proxmox
Code:
proxmox-ve: 7.0-2 (running kernel: 5.11.22-4-pve)
pve-manager: 7.0-11 (running version: 7.0-11/63d82f4e)
pve-kernel-5.11: 7.0-7
...
* Running on AMD EPYC
* "stress" and "memtest" ok
* no cluster or ha
* Randomly rebooting without any notices in the log files


Special here:
* Running as a kvm machine

I am out of ideas where to look and what to change. I tried many ways to get more information about what's happening and changed many things. No luck yet.

I am able to provide any information you are interested in and I would be happy for any idea.
 
Oct 11, 2021
5
0
1
52
Do you have some VM in HA?
No.

Updates:

Some changes now: this time it did not reboot. It froze instead!

Shortly before the reboot I can find these lines in the logs:
Oct 19 08:46:18 xyz kernel: [671031.235982] clocksource: timekeeping watchdog on CPU4: hpet retried 2 times before success
Oct 19 08:46:18 xyz kernel: [671031.384656] perf: interrupt took too long (6416 > 5580), lowering kernel.perf_event_max_sample_rate to 31000
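To look for earlier occurrences of these two messages in saved kernel logs, something like the following could be used. It is only a sketch: the pattern labels and the function name scan_kernel_log are invented here, and the regular expressions are written against the exact wording of the two lines quoted above. Both messages point at timing trouble (a slow clocksource read, a slow perf interrupt) and are hints rather than a diagnosis.

```python
import re

# Patterns matched against the wording of the two log lines quoted above.
PATTERNS = {
    "clocksource_watchdog": re.compile(
        r"clocksource: timekeeping watchdog on CPU(\d+)"
    ),
    "perf_interrupt_slow": re.compile(
        r"perf: interrupt took too long \((\d+) > (\d+)\)"
    ),
}


def scan_kernel_log(lines):
    """Return (label, numbers) for every line matching a known pattern."""
    hits = []
    for line in lines:
        for label, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                hits.append((label, tuple(int(g) for g in match.groups())))
    return hits


SAMPLE = [
    "Oct 19 08:46:18 xyz kernel: [671031.235982] clocksource: timekeeping "
    "watchdog on CPU4: hpet retried 2 times before success",
    "Oct 19 08:46:18 xyz kernel: [671031.384656] perf: interrupt took too long "
    "(6416 > 5580), lowering kernel.perf_event_max_sample_rate to 31000",
]

if __name__ == "__main__":
    for label, numbers in scan_kernel_log(SAMPLE):
        print(label, numbers)
```

On a real system the lines could come from a saved log file or from piping the kernel log into the script.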
 
Oct 11, 2021
5
0
1
52
I changed the PCI-E SFP+ Card to a Supermicro AOC-STGN-I2S

No change. But now I can see the error in dmesg:

BERT: Error records from previous boot:
[Hardware Error]: event severity: info
[Hardware Error]: Error 0, type: fatal
[Hardware Error]: fru_text: DIMM# Sourced
[Hardware Error]: section_type: memory error
[Firmware Warn]: valid bits set for fields beyond structure
fbcon: Taking over console
[Hardware Error]: Error 1, type: fatal
[Hardware Error]: fru_text: ProcessorError
[Hardware Error]: section_type: IA32/X64 processor error
[Hardware Error]: Local APIC_ID: 0xc
[Hardware Error]: CPUID Info:
[Hardware Error]: 00000000: 00800f12 00000000 0c100800 00000000
[Hardware Error]: 00000010: 76d8320b 00000000 178bfbff 00000000
[Hardware Error]: 00000020: 48ab7f57 4f6cdc34 b5b0d3a7 1443a7b0
[Hardware Error]: Error Information Structure 0:
[Hardware Error]: Error Structure Type: unknown
[Hardware Error]: Error Structure Type: 00000001-0000-0000-2700-980000000000
[Hardware Error]: Error 2, type: fatal
[Hardware Error]: fru_text: ProcessorError
[Hardware Error]: section_type: IA32/X64 processor error
[Hardware Error]: Local APIC_ID: 0xf
[Hardware Error]: CPUID Info:
[Hardware Error]: 00000000: 00800f12 00000000 0f100800 00000000
[Hardware Error]: 00000010: 76d8320b 00000000 178bfbff 00000000
[Hardware Error]: 00000020: a55701f5 43dee3ef 9b2472ac 2cad3f57
[Hardware Error]: Error Information Structure 0:
[Hardware Error]: Error Structure Type: unknown
[Hardware Error]: Error Structure Type: 00000001-0000-0000-9f00-4d2600000000
Console: switching to colour frame buffer device 128x48
PM: Magic number: 9:574:614
acpi device:180: hash matches
acpi_cpufreq: overriding BIOS provided _PSD data
RAS: Correctable Errors collector initialized.
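The BERT dump above can be summarised mechanically. The sketch below pulls out each error record's section type and APIC ID and decodes the first CPUID row. It assumes the "CPUID Info" row holds the CPUID leaf 1 registers as 64-bit fields (so the first word is EAX and the third is EBX), which fits this dump because the top byte of the third word equals the reported Local APIC_ID. The names parse_bert and decode_cpuid_eax are made up for illustration.

```python
import re


def decode_cpuid_eax(eax):
    """Split a CPUID leaf 1 EAX value into (family, model, stepping)."""
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF
    # The extended family is only added when the base family is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family
    # The extended model is used for base family 6 and above.
    model = (ext_model << 4) | base_model if base_family >= 0x6 else base_model
    return family, model, stepping


def parse_bert(lines):
    """Collect one dict per 'Error N, type: ...' record in a BERT dump."""
    records = []
    current = None
    for line in lines:
        text = line.replace("[Hardware Error]:", "").strip()
        match = re.match(r"Error (\d+), type: (\w+)", text)
        if match:
            current = {"index": int(match.group(1)), "severity": match.group(2)}
            records.append(current)
        elif current is None:
            continue
        elif text.startswith("section_type:"):
            current["section_type"] = text.split(":", 1)[1].strip()
        elif text.startswith("Local APIC_ID:"):
            current["apic_id"] = int(text.split(":", 1)[1], 16)
        elif text.startswith("00000000:") and "cpuid" not in current:
            # First row of the CPUID dump: EAX, pad, EBX, pad.
            current["cpuid"] = [int(word, 16) for word in text.split()[1:]]
    return records


SAMPLE = [
    "[Hardware Error]: Error 0, type: fatal",
    "[Hardware Error]: fru_text: DIMM# Sourced",
    "[Hardware Error]: section_type: memory error",
    "[Hardware Error]: Error 1, type: fatal",
    "[Hardware Error]: section_type: IA32/X64 processor error",
    "[Hardware Error]: Local APIC_ID: 0xc",
    "[Hardware Error]: CPUID Info:",
    "[Hardware Error]: 00000000: 00800f12 00000000 0c100800 00000000",
    "[Hardware Error]: Error Structure Type: unknown",
]

if __name__ == "__main__":
    for record in parse_bert(SAMPLE):
        print(record)
        if "cpuid" in record:
            family, model, stepping = decode_cpuid_eax(record["cpuid"][0])
            print(f"  family {family:#x}, model {model:#x}, stepping {stepping}")
```

Under that assumption, 00800f12 decodes to family 17h, model 01h, stepping 2, which would fit the first-generation EPYC parts discussed in this thread. The dump contains one memory error record and two processor error records (APIC IDs 0xc and 0xf), so neither RAM nor CPU can be ruled out from this output alone.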
 

spirit

Famous Member
Apr 2, 2010
5,873
704
133
www.odiso.com
I found a proposed workaround here: https://www.thomas-krenn.com/de/wiki/Random_Reboots_AMD_EPYC_Server

Proposed solution

In a posting in the Fedora forum, users write that adjusting the following BIOS parameters solved the problem: [2]

Advanced -> NB Configuration -> IOMMU (change to Enabled)
Advanced -> PCIe / PCI / PnP Configuration -> SR-IOV Support (change to Enabled)

In general, we recommend an update to the latest BIOS version. These contain newer AMD AGESA versions or microcodes.
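After enabling IOMMU in the BIOS, it can be worth checking that Linux actually picked it up. A minimal sketch, assuming a Linux host where an active IOMMU populates /sys/kernel/iommu_groups with one numbered directory per group; the function name count_iommu_groups is hypothetical.

```python
from pathlib import Path


def count_iommu_groups(groups_dir="/sys/kernel/iommu_groups"):
    """Return the number of numbered IOMMU group directories.

    Returns 0 if the directory is missing or empty, which usually means
    the IOMMU is disabled in firmware or not enabled by the kernel.
    """
    base = Path(groups_dir)
    if not base.is_dir():
        return 0
    return sum(1 for entry in base.iterdir() if entry.is_dir() and entry.name.isdigit())


if __name__ == "__main__":
    groups = count_iommu_groups()
    if groups:
        print(f"IOMMU appears active: {groups} groups found")
    else:
        print("No IOMMU groups found")
```

A non-zero count only shows that the setting took effect, not that it fixes the reboots.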
 
Oct 11, 2021
5
0
1
52

Tried that. No success.
 

itsu

Member
Feb 22, 2018
3
0
21
58
I have the same Mobo without CPU fan: https://www.supermicro.com/en/products/motherboard/M11SDV-8C-LN4F
Same issues as discussed here: runs stable, but random crashes on higher load. Tried many things, no help.

Then removed Proxmox and installed Ubuntu Server 22.04 LTS on bare metal -> still same issue -> in my case, Proxmox is not the culprit.
Ran Passmark memtest86 Free version off USB disk https://www.memtest86.com/download.htm -> all tests fine except the last one, the "Hammer" test. Findings:
Number of inserted Kingston KSM26RD4/32HDI ECC RAM modules (population sequence as per Supermicro mobo manual) -> Passmark test result:
4 (all slots used) -> Hammer crashes every time early in the test. All other tests finish OK.
2 (tried all 4 modules in different combinations) -> all tests OK, including Hammer
3 (tried only 1 combination, as all modules tested OK above) -> all tests OK, including Hammer

The system now has been running stable for some days with 3 RAM modules. Never got so far with 4 modules before.

Although my RAMs are not in the Supermicro recommended list, I think the crashes are not primarily memory module related. There might be a mobo HW design flaw, or a memory timing BIOS default value that is slightly off.

It is possible that with specific RAM modules, the Hammer test runs ok even with 4 modules. Hopefully at least all those from the Supermicro list, but I won't spend my money on new RAM modules to find out.

Next steps in my case would be starting to tamper with the BIOS "Advanced > NB Config > Memory Config" params (all in "Auto" mode right now), but I keep my hands off that and try to live with 3 memory modules.
 

itsu

Member
Feb 22, 2018
3
0
21
58
Update: I got my hands on 2 old cheap used 64GB DDR4-2400 LRDIMMs, Micron OEM, HP part number HP 809085-091 https://www.google.com/search?q=HP+PN+809085-091
Hammer test runs fine, server has been running with 128GB for 6 weeks now without problems, at periodical high load (bare metal Ubuntu 22.04, no Proxmox). In addition, BIOS shows these LRDIMM modules run at full 2400 speed, while for the 2666-rated 32GB RDIMM Kingston modules the BIOS indicated only 2133 speed. Passmark memtest86+ also measures slightly higher memory bandwidth for the LRDIMMs.
Could not test with all 4 slots populated with this module type, but my takeaway is: use only LRDIMMs (not RDIMMs) in this mobo.
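To put the 2400 vs. 2133 difference into numbers, here is a rough worked example. It assumes the usual rule of thumb for DDR4 that theoretical peak bandwidth per channel is the transfer rate times 8 bytes (64-bit data bus); measured figures from a memory test will be lower, and the function name is made up for illustration.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s, bus_bytes=8):
    """Theoretical peak bandwidth of one memory channel in GB/s.

    Assumes a 64-bit (8-byte) data bus per channel.
    """
    return mega_transfers_per_s * bus_bytes / 1000


if __name__ == "__main__":
    fast = peak_bandwidth_gb_s(2400)  # LRDIMMs running at 2400
    slow = peak_bandwidth_gb_s(2133)  # RDIMMs running at 2133
    print(f"2400 MT/s: {fast:.3f} GB/s per channel")
    print(f"2133 MT/s: {slow:.3f} GB/s per channel")
    print(f"difference: {(fast / slow - 1) * 100:.1f} %")
```

That is about 19.2 GB/s versus about 17.1 GB/s per channel, a theoretical gap of roughly 12.5 %, which is in line with a slightly higher measured bandwidth for the modules running at 2400.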
 