AMD EPYC based systems rebooting

Alinor · May 16, 2019

I have multiple AMD EPYC based systems on different motherboards, all brand new, and all having the same issue.
They keep crashing randomly: the system just reboots.

Cooling and temps are good, power is a proper sine wave from a UPS, and the RAM is brand new ECC.
They all run off NVMe for the main OS.

The BMC does not report any faults or error events.

The systems all run different workloads, most of them fairly light, but crashes do happen under load.
I am unable to see any issues in the error logs, but maybe I am not looking in the right place.

A few weeks ago I saw errors along the lines of "core XX not responding" in the CLI.

I am going to reach out to all the motherboard manufacturers. This happens on boards from ASRock, Gigabyte, and Supermicro, using EPYC 7301 and 7351P CPUs.

I will keep this thread updated with what I find. If you have any information about this issue, or can point me to a place that does, it will be appreciated.

Alinor · May 16, 2019

This may be related to the C-states setting in the BIOS. I will update if this resolves the issue.

stark · May 17, 2019

While we don't use Proxmox, we do disable c-states on all of our AMD servers (Opterons and EPYC)

Hopefully it's what's causing your grief. Good luck!
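
For anyone who wants to check from the running OS which idle states the kernel still exposes, here is a minimal sketch. It assumes the standard Linux cpuidle sysfs layout (/sys/devices/system/cpu/cpuN/cpuidle/stateM/ with "name" and "disable" files). The function names are made up for illustration, and treating everything other than POLL and C1 as a "deep" state is my own simplification. Note that the "disable" file only reflects runtime disabling; states switched off in the BIOS usually just do not show up at all.

```python
from pathlib import Path


def read_cstates(cpu_dir):
    """Return [(state_name, disabled), ...] for one CPU directory.

    cpu_dir is e.g. /sys/devices/system/cpu/cpu0. An empty list means
    the kernel exposes no cpuidle states for this CPU.
    """
    cpuidle = Path(cpu_dir) / "cpuidle"
    if not cpuidle.is_dir():
        return []
    states = []
    # Sort numerically so state10 does not come before state2.
    for state_dir in sorted(cpuidle.glob("state[0-9]*"),
                            key=lambda p: int(p.name[5:])):
        name = (state_dir / "name").read_text().strip()
        disabled = (state_dir / "disable").read_text().strip() == "1"
        states.append((name, disabled))
    return states


def enabled_deep_states(states):
    """Names of still-enabled states other than POLL and C1 (assumption:
    those two are the 'shallow' ones on the machines discussed here)."""
    return [name for name, disabled in states
            if not disabled and name not in ("POLL", "C1")]


if __name__ == "__main__":
    found = read_cstates("/sys/devices/system/cpu/cpu0")
    print("cpu0 idle states:", found)
    print("deep states still enabled:", enabled_deep_states(found))
```

If the second line prints an empty list after the BIOS change, the deeper C-states are at least not visible to the kernel anymore.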

goseph · Apr 26, 2021

Alinor said:
This may be related to the C-states setting in the BIOS. I will update if this resolves the issue.

Hi, did this fix your issues?
If not, what fixed it?

Thanks!

p0rt0up · Oct 11, 2021

I've been having the same issue with my nodes

Code:

SuperMicro M11SDV-8C-LN4F
AMD EPYC 3251

I did what is described in this wiki entry (https://www.thomas-krenn.com/de/wiki/Random_Reboots_AMD_EPYC_Server) and disabled C-states.

I also updated the firmware of my Intel X710-DA2 (2 x SFP) from 6.01 to 8.40.

Any other suggestions for what I could do?
Is anyone else having the same issues?

goseph · Oct 11, 2021

p0rt0up said:
I've been having the same issue with my nodes

Code:

SuperMicro M11SDV-8C-LN4F AMD EPYC 3251

I did what is described in this wiki entry (https://www.thomas-krenn.com/de/wiki/Random_Reboots_AMD_EPYC_Server) and disabled C-states.

I also updated the firmware of my Intel X710-DA2 (2 x SFP) from 6.01 to 8.40.

Any other suggestions for what I could do?
Is anyone else having the same issues?

Which version of Proxmox are you using?
Which Linux Kernel?
Did you check RAM yet?

p0rt0up · Oct 11, 2021

pve-manager/7.0-11/63d82f4e (running kernel: 5.11.22-5-pve)

I did check RAM 4 weeks ago running full memtest. No errors.
I also did a 48h cpu stress test with no issues.

I have two identical nodes.

goseph · Oct 13, 2021

p0rt0up said:
pve-manager/7.0-11/63d82f4e (running kernel: 5.11.22-5-pve)

I did check RAM 4 weeks ago running full memtest. No errors.
I also did a 48h cpu stress test with no issues.

I have two identical nodes.

Are you on the latest BIOS?

ales · Oct 13, 2021

Do you have any VMs in HA?

floerke · Oct 19, 2021

Same here:
* Latest proxmox

Code:

proxmox-ve: 7.0-2 (running kernel: 5.11.22-4-pve)
pve-manager: 7.0-11 (running version: 7.0-11/63d82f4e)
pve-kernel-5.11: 7.0-7
...

* Running on AMD EPYC
* "stress" and "memtest" ok
* no cluster or ha
* Randomly rebooting without any notices in the log files

Special here:
* Running as a kvm machine

I am out of ideas about where to look and what to change. I tried many ways to get more information about what is happening and changed many things. No luck yet.

I can provide any information you are interested in, and I would be happy about any ideas.

p0rt0up · Oct 19, 2021

ales said:
Do you have any VMs in HA?

No.

Updates:

Something changed: this time it did not reboot. It froze!

Shortly before the reboot I can find these lines in the logs:
Oct 19 08:46:18 xyz kernel: [671031.235982] clocksource: timekeeping watchdog on CPU4: hpet retried 2 times before success
Oct 19 08:46:18 xyz kernel: [671031.384656] perf: interrupt took too long (6416 > 5580), lowering kernel.perf_event_max_sample_rate to 31000
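
In case it helps others compare their logs, here is a small sketch that scans kernel log lines for the two message types above. The regular expressions are written against exactly the two lines quoted in this post; the labels and the function name are made up, and other kernel versions may word these messages differently.

```python
import re

# Patterns derived from the two log lines quoted above (labels are my own).
PATTERNS = {
    "clocksource_watchdog": re.compile(
        r"clocksource: timekeeping watchdog on CPU(\d+)"),
    "perf_irq_slow": re.compile(
        r"perf: interrupt took too long \((\d+) > (\d+)\)"),
}


def scan(lines):
    """Return [(label, numbers), ...] for every line matching a pattern.

    numbers is a tuple of the integers captured by the pattern:
    the CPU number, or (measured, threshold) for the perf message.
    """
    hits = []
    for line in lines:
        for label, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                hits.append((label, tuple(int(g) for g in match.groups())))
    return hits


if __name__ == "__main__":
    sample = [
        "Oct 19 08:46:18 xyz kernel: [671031.235982] clocksource: "
        "timekeeping watchdog on CPU4: hpet retried 2 times before success",
        "Oct 19 08:46:18 xyz kernel: [671031.384656] perf: interrupt took "
        "too long (6416 > 5580), lowering "
        "kernel.perf_event_max_sample_rate to 31000",
    ]
    for hit in scan(sample):
        print(hit)
```

Feeding it the saved output of journalctl -k or a copy of /var/log/kern.log, line by line, would show whether these warnings cluster right before each crash.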

p0rt0up · Oct 27, 2021

I changed the PCI-E SFP+ Card to a Supermicro AOC-STGN-I2S

No change. But now I can see the error in dmesg:

BERT: Error records from previous boot:
[Hardware Error]: event severity: info
[Hardware Error]: Error 0, type: fatal
[Hardware Error]: fru_text: DIMM# Sourced
[Hardware Error]: section_type: memory error
[Firmware Warn]: valid bits set for fields beyond structure
fbcon: Taking over console
[Hardware Error]: Error 1, type: fatal
[Hardware Error]: fru_text: ProcessorError
[Hardware Error]: section_type: IA32/X64 processor error
[Hardware Error]: Local APIC_ID: 0xc
[Hardware Error]: CPUID Info:
[Hardware Error]: 00000000: 00800f12 00000000 0c100800 00000000
[Hardware Error]: 00000010: 76d8320b 00000000 178bfbff 00000000
[Hardware Error]: 00000020: 48ab7f57 4f6cdc34 b5b0d3a7 1443a7b0
[Hardware Error]: Error Information Structure 0:
[Hardware Error]: Error Structure Type: unknown
[Hardware Error]: Error Structure Type: 00000001-0000-0000-2700-980000000000
[Hardware Error]: Error 2, type: fatal
[Hardware Error]: fru_text: ProcessorError
[Hardware Error]: section_type: IA32/X64 processor error
[Hardware Error]: Local APIC_ID: 0xf
[Hardware Error]: CPUID Info:
[Hardware Error]: 00000000: 00800f12 00000000 0f100800 00000000
[Hardware Error]: 00000010: 76d8320b 00000000 178bfbff 00000000
[Hardware Error]: 00000020: a55701f5 43dee3ef 9b2472ac 2cad3f57
[Hardware Error]: Error Information Structure 0:
[Hardware Error]: Error Structure Type: unknown
[Hardware Error]: Error Structure Type: 00000001-0000-0000-9f00-4d2600000000
Console: switching to colour frame buffer device 128x48
PM: Magic number: 9:574:614
acpi device:180: hash matches
acpi_cpufreq: overriding BIOS provided _PSD data
RAS: Correctable Errors collector initialized.
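
To make the dump above a bit easier to read, here is a rough sketch that pulls the key fields out of each BERT record and decodes the first CPUID word. It assumes that the first 32-bit word in the "CPUID Info" dump is EAX of CPUID leaf 1, and it uses the usual x86 family/model/stepping bit layout. The function names and the dictionary keys are my own.

```python
import re

FIELD = re.compile(r"^(fru_text|section_type|Local APIC_ID): (.+)$")


def decode_cpuid_eax(eax):
    """Decode CPUID leaf 1 EAX into (family, model, stepping)."""
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF
    # The extended fields only apply for certain base family values.
    family = base_family + ext_family if base_family == 0xF else base_family
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return family, model, stepping


def summarize_bert(lines):
    """Collect one dict per 'Error N, type: ...' record in a BERT dump."""
    records = []
    for line in lines:
        # Drop the '[Hardware Error]:' prefix if the line has one.
        text = line.split("[Hardware Error]:", 1)[-1].strip()
        match = re.match(r"Error (\d+), type: (\w+)$", text)
        if match:
            records.append({"index": int(match.group(1)),
                            "severity": match.group(2)})
            continue
        if not records:
            continue
        match = FIELD.match(text)
        if match:
            records[-1][match.group(1)] = match.group(2)
        match = re.match(r"00000000: ([0-9a-f]{8})", text)
        if match:
            records[-1]["cpuid"] = decode_cpuid_eax(int(match.group(1), 16))
    return records


if __name__ == "__main__":
    sample = [
        "[Hardware Error]: Error 1, type: fatal",
        "[Hardware Error]: fru_text: ProcessorError",
        "[Hardware Error]: section_type: IA32/X64 processor error",
        "[Hardware Error]: Local APIC_ID: 0xc",
        "[Hardware Error]: CPUID Info:",
        "[Hardware Error]: 00000000: 00800f12 00000000 0c100800 00000000",
    ]
    for record in summarize_bert(sample):
        print(record)
```

Under those assumptions the value 00800f12 decodes to family 0x17, model 0x01, stepping 2, so the processor records only identify the CPU type and the cores involved (APIC IDs 0xc and 0xf), not the root cause.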

spirit · Oct 27, 2021

p0rt0up said:
I changed the PCI-E SFP+ Card to a Supermicro AOC-STGN-I2S

No change. But now I can see the error in dmesg:

BERT: Error records from previous boot:
[Hardware Error]: event severity: info
[Hardware Error]: Error 0, type: fatal
[Hardware Error]: fru_text: DIMM# Sourced
[Hardware Error]: section_type: memory error
[Firmware Warn]: valid bits set for fields beyond structure
fbcon: Taking over console
[Hardware Error]: Error 1, type: fatal
[Hardware Error]: fru_text: ProcessorError
[Hardware Error]: section_type: IA32/X64 processor error
[Hardware Error]: Local APIC_ID: 0xc
[Hardware Error]: CPUID Info:
[Hardware Error]: 00000000: 00800f12 00000000 0c100800 00000000
[Hardware Error]: 00000010: 76d8320b 00000000 178bfbff 00000000
[Hardware Error]: 00000020: 48ab7f57 4f6cdc34 b5b0d3a7 1443a7b0
[Hardware Error]: Error Information Structure 0:
[Hardware Error]: Error Structure Type: unknown
[Hardware Error]: Error Structure Type: 00000001-0000-0000-2700-980000000000
[Hardware Error]: Error 2, type: fatal
[Hardware Error]: fru_text: ProcessorError
[Hardware Error]: section_type: IA32/X64 processor error
[Hardware Error]: Local APIC_ID: 0xf
[Hardware Error]: CPUID Info:
[Hardware Error]: 00000000: 00800f12 00000000 0f100800 00000000
[Hardware Error]: 00000010: 76d8320b 00000000 178bfbff 00000000
[Hardware Error]: 00000020: a55701f5 43dee3ef 9b2472ac 2cad3f57
[Hardware Error]: Error Information Structure 0:
[Hardware Error]: Error Structure Type: unknown
[Hardware Error]: Error Structure Type: 00000001-0000-0000-9f00-4d2600000000
Console: switching to colour frame buffer device 128x48
PM: Magic number: 9:574:614
acpi device:180: hash matches
acpi_cpufreq: overriding BIOS provided _PSD data
RAS: Correctable Errors collector initialized.

I found a proposed workaround here: https://www.thomas-krenn.com/de/wiki/Random_Reboots_AMD_EPYC_Server

Proposed solution

In a posting in the Fedora forum, users write that adjusting the following BIOS parameters solved the problem: [2]

Advanced -> NB Configuration -> IOMMU (change to Enabled)
Advanced -> PCIe / PCI / PnP Configuration -> SR-IOV Support (change to Enabled)

In general, we recommend an update to the latest BIOS version. These contain newer AMD AGESA versions or microcodes.
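
After changing the IOMMU option, one quick way to see whether the kernel actually picked it up is to count the IOMMU groups in sysfs. This is only a sketch: it assumes the usual /sys/kernel/iommu_groups directory, the function name is made up, and a count of zero can also mean the IOMMU was disabled on the kernel command line rather than in the BIOS.

```python
from pathlib import Path


def count_iommu_groups(root="/sys/kernel/iommu_groups"):
    """Return the number of IOMMU groups the kernel has created.

    Returns 0 if the directory does not exist or is empty, which
    suggests that no IOMMU is active on this system.
    """
    root = Path(root)
    if not root.is_dir():
        return 0
    return sum(1 for entry in root.iterdir() if entry.is_dir())


if __name__ == "__main__":
    groups = count_iommu_groups()
    if groups:
        print(f"IOMMU looks active: {groups} groups")
    else:
        print("No IOMMU groups found")
```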

p0rt0up · Oct 27, 2021

spirit said:
I found a proposed workaround here: https://www.thomas-krenn.com/de/wiki/Random_Reboots_AMD_EPYC_Server

Proposed solution
In a posting in the Fedora forum, users write that adjusting the following BIOS parameters solved the problem: [2]

Advanced -> NB Configuration -> IOMMU (change to Enabled)
Advanced -> PCIe / PCI / PnP Configuration -> SR-IOV Support (change to Enabled)

In general, we recommend an update to the latest BIOS version. These contain newer AMD AGESA versions or microcodes.

Tried that. No success.

hardwareadictos · Apr 5, 2022

Same issue on my MB: https://www.supermicro.com/en/products/motherboard/M11SDV-8C+-LN4F

itsu · Aug 7, 2022

hardwareadictos said:
Same issue on my MB: https://www.supermicro.com/en/products/motherboard/M11SDV-8C+-LN4F

I have the same Mobo without CPU fan: https://www.supermicro.com/en/products/motherboard/M11SDV-8C-LN4F
Same issues as discussed here: runs stable, but random crashes on higher load. Tried many things, no help.

Then removed Proxmox and installed Ubuntu Server 22.04 LTS on bare metal -> still same issue -> in my case, Proxmox is not the culprit.
Ran Passmark memtest86 Free version off USB disk https://www.memtest86.com/download.htm -> all tests fine except the last one, the "Hammer" test. Findings:

Passmark results by number of inserted Kingston KSM26RD4/32HDI ECC RAM modules (population sequence as per the Supermicro mobo manual):
* 4 modules (all slots used): Hammer crashes every time early in the test. All other tests finish OK.
* 2 modules (tried all 4 modules in different combinations): all tests OK, including Hammer.
* 3 modules (tried only 1 combination, as all modules tested OK above): all tests OK, including Hammer.

The system has now been running stable for some days with 3 RAM modules. It never got this far with 4 modules before.

Although my RAM modules are not on the Supermicro recommended list, I think the crashes are not primarily memory module related. There might be a mobo HW design flaw, or a memory timing BIOS default value that is slightly off.

It is possible that with specific RAM modules, the Hammer test runs ok even with 4 modules. Hopefully at least all those from the Supermicro list, but I won't spend my money on new RAM modules to find out.

Next steps in my case would be starting to tamper with the BIOS "Advanced > NB Config > Memory Config" params (all in "Auto" mode right now), but I keep my hands off that and try to live with 3 memory modules.

itsu · Sep 21, 2022

Update: I got my hands on 2 old, cheap, used 64GB DDR4-2400 LRDIMMs (Micron OEM, HP part number 809085-091): https://www.google.com/search?q=HP+PN+809085-091
The Hammer test runs fine, and the server has been running with 128GB for 6 weeks now without problems, under periodic high load (bare metal Ubuntu 22.04, no Proxmox). In addition, the BIOS shows these LRDIMM modules running at the full 2400 speed, while for the 2666-rated 32GB Kingston RDIMM modules the BIOS indicated only 2133. Passmark memtest86 also measures slightly higher memory bandwidth for the LRDIMMs.
I could not test with all 4 slots populated with this module type, but my takeaway is: use only LRDIMMs (not RDIMMs) in this mobo.
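
The rated versus configured speed can also be read from the OS instead of the BIOS screen. Below is a sketch that parses the text output of "dmidecode -t memory" (run as root and saved to a file or a string). It assumes the field names "Speed" and "Configured Memory Speed" (older dmidecode versions print "Configured Clock Speed"). The function names and the sample text are made up for illustration and are not output from my board.

```python
import re


def _to_mts(value):
    """Extract the number from strings like '2666 MT/s' or '2133 MHz'."""
    match = re.match(r"(\d+)\s*(MT/s|MHz)", value or "")
    return int(match.group(1)) if match else None


def parse_memory_speeds(dmidecode_text):
    """Return [(locator, rated, configured), ...] for populated DIMM slots."""
    devices = []
    for block in dmidecode_text.split("\n\n"):
        if "Memory Device" not in block:
            continue
        fields = {}
        for line in block.splitlines():
            if ":" in line:
                key, value = line.split(":", 1)
                fields[key.strip()] = value.strip()
        rated = _to_mts(fields.get("Speed"))
        if rated is None:
            continue  # empty slot, e.g. "Speed: Unknown"
        configured = _to_mts(fields.get("Configured Memory Speed")
                             or fields.get("Configured Clock Speed"))
        devices.append((fields.get("Locator", "?"), rated, configured))
    return devices


def downclocked(devices):
    """Modules whose configured speed is below their rated speed."""
    return [d for d in devices if d[2] is not None and d[2] < d[1]]


if __name__ == "__main__":
    # Hypothetical sample, shaped like dmidecode output.
    sample = ("Memory Device\n"
              "\tLocator: DIMMA1\n"
              "\tSpeed: 2666 MT/s\n"
              "\tConfigured Memory Speed: 2133 MT/s\n")
    print(downclocked(parse_memory_speeds(sample)))
```

A module showing up in the downclocked list would match what I saw in the BIOS with the Kingston RDIMMs.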
