I changed the PCIe SFP+ card to a Supermicro AOC-STGN-I2S.
No change. But now I can see the error in dmesg:
BERT: Error records from previous boot:
[Hardware Error]: event severity: info
[Hardware Error]: Error 0, type: fatal
[Hardware Error]: fru_text: DIMM# Sourced
[Hardware Error]...
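To compare these BERT records across boots, the `[Hardware Error]:` lines can be pulled out of saved `dmesg` output and split into fields. The following is only a rough sketch: the function name `parse_hardware_errors` and the key/value split on the first `": "` are my own assumptions, not a kernel-defined format, and the sample contains only the lines quoted above.

```python
import re

# Matches complete lines such as "[Hardware Error]: event severity: info".
# Truncated lines without the "]: " separator are skipped on purpose.
HW_ERROR = re.compile(r"\[Hardware Error\]:\s*(?P<body>.+)")


def parse_hardware_errors(lines):
    """Return a list of (key, value) pairs from '[Hardware Error]:' lines."""
    fields = []
    for line in lines:
        match = HW_ERROR.search(line)
        if match is None:
            continue  # not a complete hardware error line
        body = match.group("body").strip()
        # Assumption: everything before the first ": " is the field name.
        key, _, value = body.partition(": ")
        fields.append((key, value))
    return fields


sample = [
    "BERT: Error records from previous boot:",
    "[Hardware Error]: event severity: info",
    "[Hardware Error]: Error 0, type: fatal",
    "[Hardware Error]: fru_text: DIMM# Sourced",
]

for key, value in parse_hardware_errors(sample):
    print(f"{key!r} -> {value!r}")
```

Saving the output of each boot this way makes it easier to see whether the same `fru_text` shows up every time.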
No.
Updates:
Some changes now: this time it did not reboot. Instead, it froze!
Shortly before the reboot I can find these lines in the logs:
Oct 19 08:46:18 xyz kernel: [671031.235982] clocksource: timekeeping watchdog on CPU4: hpet retried 2 times before success
Oct 19 08:46:18 xyz kernel...
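To see how often the watchdog message appears before a freeze, the syslog can be filtered for it. This is a minimal sketch that only assumes the message layout shown in the line above; the function name `parse_watchdog_line` and the returned dictionary keys are made up for illustration.

```python
import re

# Pattern built from the log line quoted above; other kernel versions
# may word this message differently.
WATCHDOG = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+clocksource: timekeeping watchdog on "
    r"CPU(?P<cpu>\d+): (?P<source>\w+) retried (?P<retries>\d+) times "
    r"before success"
)


def parse_watchdog_line(line):
    """Return the fields of a clocksource watchdog retry line, or None."""
    match = WATCHDOG.search(line)
    if match is None:
        return None
    return {
        "uptime": float(match.group("uptime")),  # seconds since boot
        "cpu": int(match.group("cpu")),
        "source": match.group("source"),
        "retries": int(match.group("retries")),
    }


line = (
    "Oct 19 08:46:18 xyz kernel: [671031.235982] clocksource: timekeeping "
    "watchdog on CPU4: hpet retried 2 times before success"
)
print(parse_watchdog_line(line))
```

Running this over the whole log and counting hits per hour might show whether the retries get more frequent before the machine hangs.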
pve-manager/7.0-11/63d82f4e (running kernel: 5.11.22-5-pve)
I checked the RAM 4 weeks ago by running a full memtest. No errors.
I also ran a 48h CPU stress test with no issues.
I have two identical nodes.
I've been having the same issue with my nodes:
SuperMicro M11SDV-8C-LN4F
AMD EPYC 3251
I did what is described in this wiki entry (https://www.thomas-krenn.com/de/wiki/Random_Reboots_AMD_EPYC_Server) and disabled C-states.
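To double-check from the running system which idle states the kernel still exposes after such a change, the cpuidle entries in sysfs can be listed. This is a sketch under the assumption that the node uses the usual Linux cpuidle layout (`/sys/devices/system/cpu/cpu0/cpuidle/stateN/name` and `.../disable`); the helper name `read_cpuidle_states` is hypothetical, and depending on the BIOS setting the directory may be missing or contain fewer states.

```python
from pathlib import Path


def read_cpuidle_states(cpuidle_dir):
    """Return [(dir_name, state_name, disabled)] for each stateN directory."""
    base = Path(cpuidle_dir)
    if not base.is_dir():
        return []  # cpuidle not exposed for this CPU
    state_dirs = [
        p for p in base.iterdir()
        if p.is_dir() and p.name.startswith("state") and p.name[5:].isdigit()
    ]
    # Sort numerically so that state10 comes after state2.
    state_dirs.sort(key=lambda p: int(p.name[5:]))
    states = []
    for state in state_dirs:
        name = (state / "name").read_text().strip()
        disabled = (state / "disable").read_text().strip() == "1"
        states.append((state.name, name, disabled))
    return states
```

Calling `read_cpuidle_states("/sys/devices/system/cpu/cpu0/cpuidle")` on a node returns one tuple per idle state, so both nodes can be compared quickly.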
I also updated the firmware of my Intel X710-DA2 (2 x SFP) from...