Spontaneous reboots on Minisforum MS-A2 with 6.17 (and later 6.14)

VivienM

Hi,

This is a weird one. I have a Minisforum MS-A2 with a Ryzen 9 9955HX, 128GB of RAM, and a Samsung SSD. On kernels up to 6.14.8-2 it is rock solid, so I don't think it's a hardware issue...

Newer kernels cause spontaneous reboots within 24 hours. That certainly includes every 6.17 I've tried, now including 6.17.9-1, and I believe some newer 6.14s as well.

I had "solved" this before Christmas by just going back to 6.14.8-2, but had a little power mishap yesterday, it booted back up to 6.17.9-1, and... less than 24 hours later, spontaneous reboot.

In the dmesg output, I note the following:
[ 0.892726] x86/amd: Previous system reset reason [0x00300800]: software wrote 0xE to reset control register 0xCF9
[ 0.892728] x86/amd: Previous system reset reason [0x00300800]: ACPI power state transition occurred
I poked around journalctl, I'm not seeing any log entries that are particularly pertinent...

Happy to provide any further logs, etc.
 
Googling "software wrote 0xE to reset control register 0xC" leads to some interesting info.
I didn't find that much, but I did discover that that message is cut off. Should be "software wrote 0xE to reset control register 0xCF9"

When you google that, yes, it starts to get more interesting, but most of what I'm finding so far is about instability issues with older Zen chips back in 2017-18...
 
The stuff about setting a slightly higher voltage and/or lower frequency in BIOS seems relevant though. It might also pay to look at what c-states are enabled and whether you have the AMD microcode installed.

Other than that I got nuthin'.
 
AMD microcode is installed.

I guess I can find a keyboard/monitor to go poke at the BIOS, but if those things are set wrong, why doesn't 6.14.8-2 have a problem with it?

Found something else while googling: someone having similar issues on Arch Linux that seemed to be related to kernels compiled with GCC 15.2. I wonder which GCC is used to compile which Proxmox kernels...
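
On the compiler question, the running kernel reports the toolchain it was built with in /proc/version, so booting each kernel and reading that file should answer it. A minimal sketch, assuming the banner contains a fragment like `gcc (...) x.y.z,`; the function name `kernel_gcc_version` is made up for illustration:

```shell
# Print the compiler fragment from a kernel version banner read on stdin.
# Assumption: the banner contains something like "gcc (Distro x.y.z-n) x.y.z,".
kernel_gcc_version() {
    # Take everything from "gcc" up to (not including) the next comma.
    grep -o 'gcc[^,]*' | head -n 1
}
```

Usage would be `kernel_gcc_version < /proc/version` once on 6.14.8-2 and once on a newer kernel, then comparing the two results.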
 
I got cautiously excited when I discovered my Samsung SSD firmware was behind, but... updated that, same issue.

For now, I've just pinned 6.14.8-2. Unless someone has some ideas, I think I'll revisit it when Proxmox releases 7.0 kernels...
 
Well, I tried 7.0.0-2... and... same issue. Same dmesg entry:
Code:
[    0.828655] x86/amd: Previous system reset reason [0x00300800]: software wrote 0xE to reset control register 0xCF9
[    0.828656] x86/amd: Previous system reset reason [0x00300800]: ACPI power state transition occurred

I need to get to the bottom of this, I can't be stuck at 6.14.8-2 forever...
 
I've got 3 of the same units running on 6.17.13-2 on BIOS 1.02 with no issues. On a controlled restart I get the following log entry, note the hex is different:

[    0.790135] x86/amd: Previous system reset reason [0x00080800]: software wrote 0x6 to reset control register 0xCF9
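
For comparing the two values, the bracketed hex number appears to be a bitmask with one message per set bit. A rough sketch that lists the set bits follows. The three descriptions are inferred from the log lines quoted in this thread (bit 19 for the 0x6 message, bits 20 and 21 for the 0xE and ACPI messages), so treat them as an assumption rather than a complete table. `decode_reset_reason` is just an illustrative name, not a kernel or Proxmox tool:

```shell
# List the bits set in a "Previous system reset reason [0x...]" value.
# Assumption: only bits 19-21 are labelled, based on the logs in this thread.
decode_reset_reason() {
    value=$(( $1 ))
    bit=0
    while [ "$bit" -lt 32 ]; do
        if [ $(( (value >> bit) & 1 )) -eq 1 ]; then
            case "$bit" in
                19) desc="software wrote 0x6 to reset control register 0xCF9" ;;
                20) desc="software wrote 0xE to reset control register 0xCF9" ;;
                21) desc="ACPI power state transition occurred" ;;
                *)  desc="(no description in this sketch)" ;;
            esac
            echo "bit $bit: $desc"
        fi
        bit=$(( bit + 1 ))
    done
}
```

Running `decode_reset_reason 0x00300800` and `decode_reset_reason 0x00080800` shows that both values share bit 11 and differ only in bits 19 to 21.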

I've got the following customizations:

Grub - to suppress PCIE bus warning spam
GRUB_CMDLINE_LINUX_DEFAULT="quiet pci=noaer"
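
After editing that line and regenerating the boot configuration (how to do that depends on the bootloader in use), it is worth confirming the parameter actually reached the running kernel. A small sketch that checks a command line read from stdin for an exact parameter; `has_cmdline_param` is a hypothetical helper name:

```shell
# Succeed if the kernel command line on stdin contains the exact parameter $1.
has_cmdline_param() {
    # One parameter per line, then match the whole line literally.
    tr ' ' '\n' | grep -qxF -- "$1"
}
```

For example, `has_cmdline_param pci=noaer < /proc/cmdline && echo active` prints "active" only when the parameter is present on the current boot.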

For the X710 i40e nic, I changed the VLAN default range of 2-4094 to 2-50 otherwise I got these errors spamming the log (although it did not affect nic behaviour). Forum post https://forum.proxmox.com/threads/e...rcing-overflow-promiscuous-on-pf.62875/page-3

Error LIBIE_AQ_RC_ENOSPC, forcing overflow promiscuous on PF
Error LIBIE_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on

I made these BIOS changes as per https://etcwiki.org/wiki/Minisforum_MS-A2_9955HX_temperature_fix

Advanced->AMD Overclocking-> Accept->Precision Boost Overdrive
CPU Boost Clock Override: Enabled(Negative)
Max CPU Boost Clock Override(-): 500
TJMAX 78

All my NVMe slots are set to PCIe 3.0 x4.
 
This is getting weirder. Before your reply, I figured there was a chance there was a kernel panic-type situation that wasn't being logged, so I figured I would hook up a monitor and set the kernel to panic=0. And... 23 hours later, no reboot so far. Which is the longest I've ever had 6.17/7.0 running for...

A watched server never crashes, I guess. If/when it does crash I will try some of your BIOS settings...
 
I have similar systems, some of which were unstable after upgrading PVE 8 to 9. The key difference between them was that the AMD microcode and other firmware were considerably older on the unstable ones. Downgrading to proxmox-kernel-6.14.8-2-pve-signed yielded fewer crashes (3-4 a day, as opposed to every 10-40 minutes on proxmox-kernel-6.17.13-3-pve-signed). We fixed the issue by updating the BIOS.

Systems are Lenovo ThinkCentre M715q systems with `cat /proc/cpuinfo` reporting:
AMD Ryzen 5 PRO 2400GE w/ Radeon Vega Graphics

My systems were simply locking up and not restarting with the default softdog kernel module. However, I got them to use the hardware TCO watchdog by updating /etc/default/pve-ha-manager to contain:
Code:
WATCHDOG_MODULE=sp5100_tco

Validation:
Code:
[admin@kvm2c ~]# wdctl
Device:        /dev/watchdog0
Identity:      SP5100 TCO timer [version 0]
Timeout:       10 seconds
Timeleft:      10 seconds
FLAG           DESCRIPTION               STATUS BOOT-STATUS
KEEPALIVEPING  Keep alive ping reply          1           0
MAGICCLOSE     Supports magic close char      0           0
SETTIMEOUT     Set timeout (in seconds)       0           0


Before upgrade:
Code:
  lshw | less
    version: M1XKT34A
    date: 09/04/2018
  AMD microcode in BIOS:
    [root@kvm2d ~]# journalctl -n 10000 | grep microcode
    Apr 25 11:22:49 kvm2d kernel: microcode: Current revision: 0x08101007


After upgrade:
Code:
  lshw | less
    version: M1XKT63A
    date: 04/11/2024
  AMD microcode in BIOS:
    [root@kvm2c ~]# journalctl -n 10000 | grep microcode
    Apr 25 12:42:15 kvm1 kernel: microcode: Current revision: 0x0810100b

PS: Installing the amd64-microcode package didn't help, it apparently can't help with certain parts which initialise after the kernel boots.
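
To compare the loaded microcode revision across several hosts, the revision can be pulled out of the kernel log rather than read by eye. A minimal sketch, assuming the log line format shown above ("microcode: Current revision: 0x..."); `microcode_revisions` is an illustrative name:

```shell
# Print the revision from each "microcode: Current revision: 0x..." line on stdin.
microcode_revisions() {
    sed -n 's/.*microcode: Current revision: \(0x[0-9a-fA-F]*\).*/\1/p'
}
```

Something like `journalctl -k -b | microcode_revisions` on each host then gives one value per machine to compare.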


Google Gemini summarised the difference in the BIOS updates to be:

M1XKT63A replaces the experimental early-Zen power management logic (M1XKT34A) with the industry-standard stable AGESA 1.2.x, resolving critical CPU-idle hangs and providing hardware-level mitigations for Zen-architecture vulnerabilities.

1. AGESA & CPU Stability

The most critical change is the AGESA (AMD Generic Encapsulated Software Architecture).
  • The 2018 version was written when the Raven Ridge architecture was brand new. It had aggressive power-saving bugs that caused the CPU to drop voltage too low during idle transitions, which is what caused your Proxmox "Hard Freezes."
  • The 2024 version is the "refined" logic. It ensures that even during deep idle (C-states), the CPU maintains a stable floor voltage.

2. Microcode & "Zenbleed"

The microcode revision 0x0810100b in the new BIOS is the official fix for several silicon-level bugs.
  • Speculative Execution: The 2018 version was vulnerable to several side-channel attacks that could crash a kernel under specific branch-prediction loads.
  • The fix: The 2024 microcode fundamentally changes how the CPU handles certain "Move" instructions, making it significantly more robust for virtualization (KVM/Proxmox) environments.

3. ACPI Tables (The Kernel Handshake)

When Linux boots, it reads the ACPI Tables from the BIOS to learn how to manage the hardware.
  • Old BIOS: Included messy tables that didn't strictly follow UEFI standards, often leading to "Spurious Interrupt" or "IOMMU" errors in the Linux dmesg.
  • New BIOS: Features cleaned-up tables that match modern Linux kernel expectations. This is why you no longer need complex "boot parameters" (like idle=nomwait) to keep the system stable.

4. Security (LogoFAIL)

The 2024 update specifically addresses LogoFAIL.
  • The 2018 BIOS had a vulnerability where a malicious image file used as a boot logo could execute code at the firmware level.
  • The 2024 version (the one you are using to flash your custom logos) has a hardened image parser to prevent this.
 