Spontaneous reboots on Minisforum MS-A2 with 6.17 (and later 6.14)

VivienM

Hi,

This is a weird one. I have a Minisforum MS-A2 with a Ryzen 9 9955HX, 128GB of RAM, and a Samsung SSD. On kernels up to 6.14.8-2, it is rock solid, so I don't think it's a hardware issue...

Newer kernels cause spontaneous reboots within 24 hours. That certainly includes every 6.17 I've tried, now including 6.17.9-1, and I believe some newer 6.14s as well.

I had "solved" this before Christmas by just going back to 6.14.8-2, but had a little power mishap yesterday, it booted back up to 6.17.9-1, and... less than 24 hours later, spontaneous reboot.

In the dmesg output, I note the following:
[ 0.892726] x86/amd: Previous system reset reason [0x00300800]: software wrote 0xE to reset control register 0xCF9
[ 0.892728] x86/amd: Previous system reset reason [0x00300800]: ACPI power state transition occurred
I poked around journalctl, I'm not seeing any log entries that are particularly pertinent...

Happy to provide any further logs, etc.
 
Googling "software wrote 0xE to reset control register 0xC" leads to some interesting info.
I didn't find that much, but I did discover that that message is cut off. Should be "software wrote 0xE to reset control register 0xCF9"

When you google that, yes, it starts to get more interesting, but most of what I'm finding so far is about instability issues with older Zen chips back in 2017-18...
 
The stuff about setting a slightly higher voltage and/or lower frequency in BIOS seems relevant though. It might also pay to look at which C-states are enabled and whether you have the AMD microcode installed.

Other than that I got nuthin'.
 
AMD microcode is installed.

I guess I can find a keyboard/monitor to go poke at the BIOS, but if those things are set wrong, why doesn't 6.14.8-2 have a problem with it?

Found something else while googling: someone having similar issues on Arch Linux that seemed to be related to kernels compiled with GCC 15.2. I wonder which GCC version is used to compile which Proxmox kernels...
 
I got cautiously excited when I discovered my Samsung SSD firmware was behind, but... updated that, same issue.

For now, I've just pinned 6.14.8-2. Unless someone has some ideas, I think I'll revisit it when Proxmox releases 7.0 kernels...
 
Well, I tried 7.0.0-2... and... same issue. Same dmesg entry:
Code:
[    0.828655] x86/amd: Previous system reset reason [0x00300800]: software wrote 0xE to reset control register 0xCF9
[    0.828656] x86/amd: Previous system reset reason [0x00300800]: ACPI power state transition occurred
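
For anyone comparing boots, here is a rough Python sketch that pulls these reset-reason lines out of saved dmesg text and groups them by hex code. The regex is only inferred from the lines quoted in this thread, and `reset_reasons` is a made-up helper name, so treat it as a starting point rather than anything official.

```python
import re

# Pattern inferred from the dmesg lines quoted in this thread (an assumption,
# not taken from kernel documentation).
RESET_RE = re.compile(
    r"x86/amd: Previous system reset reason \[(0x[0-9a-fA-F]+)\]: (.+)$"
)


def reset_reasons(dmesg_text):
    """Return a list of (code, [reason, ...]) pairs in first-seen order."""
    found = {}
    for line in dmesg_text.splitlines():
        match = RESET_RE.search(line)
        if match:
            code = int(match.group(1), 16)
            found.setdefault(code, []).append(match.group(2).strip())
    return list(found.items())


# Sample input copied from the post above.
SAMPLE = """\
[    0.828655] x86/amd: Previous system reset reason [0x00300800]: software wrote 0xE to reset control register 0xCF9
[    0.828656] x86/amd: Previous system reset reason [0x00300800]: ACPI power state transition occurred
"""

for code, reasons in reset_reasons(SAMPLE):
    print(f"{code:#010x}: {'; '.join(reasons)}")
```

Feeding it the saved output of `dmesg` from each boot makes it easier to see whether the spontaneous reboots always report the same code.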

I need to get to the bottom of this, I can't be stuck at 6.14.8-2 forever...
 
I've got 3 of the same units running on 6.17.13-2 on BIOS 1.02 with no issues. On a controlled restart I get the following log entry, note the hex is different:

[    0.790135] x86/amd: Previous system reset reason [0x00080800]: software wrote 0x6 to reset control register 0xCF9
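
To make the difference between the two hex values concrete, here is a small Python sketch that lists which bits are set in each. The bit-to-message mapping is inferred only from the log lines in this thread (0x00080800 prints one message, 0x00300800 prints two), and it assumes the kernel prints reasons in ascending bit order; `KNOWN_BITS`, `set_bits` and `describe` are names made up for the example.

```python
# Bit meanings inferred only from the dmesg lines quoted in this thread.
# Assumption: messages are printed in ascending bit order, so bit 20 is the
# 0xE write and bit 21 is the ACPI transition. Not checked against AMD docs.
KNOWN_BITS = {
    19: "software wrote 0x6 to reset control register 0xCF9",
    20: "software wrote 0xE to reset control register 0xCF9",
    21: "ACPI power state transition occurred",
}


def set_bits(value):
    """Return the positions of all set bits in a 32-bit value."""
    return [bit for bit in range(32) if (value >> bit) & 1]


def describe(value):
    """Pair each set bit with the message seen in the logs, if any."""
    return [
        (bit, KNOWN_BITS.get(bit, "no message seen in this thread"))
        for bit in set_bits(value)
    ]


for label, value in (("spontaneous", 0x00300800), ("controlled", 0x00080800)):
    print(f"{label} reboot, {value:#010x}:")
    for bit, text in describe(value):
        print(f"  bit {bit}: {text}")
```

Both values share bit 11, which has no message in either log. The controlled restart sets bit 19 only, while the spontaneous one sets bits 20 and 21.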

I've got the following customizations:

GRUB, to suppress PCIe bus warning spam:
GRUB_CMDLINE_LINUX_DEFAULT="quiet pci=noaer"

For the X710 i40e NIC, I changed the default VLAN range of 2-4094 to 2-50; otherwise these errors spammed the log (although they did not affect NIC behaviour). Forum post: https://forum.proxmox.com/threads/e...rcing-overflow-promiscuous-on-pf.62875/page-3

Error LIBIE_AQ_RC_ENOSPC, forcing overflow promiscuous on PF
Error LIBIE_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on

I made these BIOS changes as per https://etcwiki.org/wiki/Minisforum_MS-A2_9955HX_temperature_fix

Advanced->AMD Overclocking-> Accept->Precision Boost Overdrive
CPU Boost Clock Override: Enabled(Negative)
Max CPU Boost Clock Override(-): 500
TJMAX 78

All my NVMe slots are set to PCIe 3.0 x4.
 
This is getting weirder. Before your reply, I suspected a kernel panic that wasn't being logged, so I hooked up a monitor and booted the kernel with panic=0. And... 23 hours later, no reboot so far, which is the longest I've ever had 6.17/7.0 running...

A watched server never crashes, I guess. If/when it does crash I will try some of your BIOS settings...