Finding cause of instability

RealPjotr · Nov 19, 2024

I have an instability in my new system. As I start using it more, the reboots become more frequent. The system logs show nothing at all before a reboot.

I have built a new server based on an Asrock Rack SIENAD8-2L2T, an Epyc 8434P and 6x64 GB registered ECC RAM. In the PCIE7 slot I've installed a SATA breakout board for 4x 4xSATA ports; I have 1 SSD and 4 HDDs connected. In slots PCIE1 and PCIE3 I've installed two 4xM.2 boards and populated these with various SSDs, 8 in total. Nothing is installed yet in the 2 M.2 slots on the motherboard. The IPMI is connected at 1G and one 10G NIC port is connected.

pve-manager/8.2.9/98c7f34632fee424
Linux 6.8.12-4-pve (2024-11-06T15:04Z)

Today I caught a reboot while using "dmesg -w" on another system, the log shows:

Code:

[ 4427.215725] watchdog: Watchdog detected hard LOCKUP on cpu 40
[ 4427.215734] Modules linked in: vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace netfs veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter sctp ip6_udp_tunnel udp_tunnel nf_tables softdog sunrpc
[ 4427.215784] softdog: Initiating system reboot
[ 4427.215783] clocksource: Long readout interval, skipping watchdog check: cs_nsec: 24777258129 wd_nsec: 24777251435
[ 4427.215787]  binfmt_misc
[ 4427.215791]  bonding tls nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common ipmi_ssif amd64_edac edac_mce_amd kvm_amd kvm irqbypass
[ 4427.215815] {11}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 514
[ 4427.215819]  crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd dax_hmem cxl_acpi acpi_ipmi rapl cxl_core ast pcspkr ipmi_si i2c_algo_bit k10temp ipmi_devintf ccp ipmi_msghandler joydev input_leds mac_hid vhost_net vhost vhost_iotlb tap efi_pstore dmi_sysfs ip_tables x_tables autofs4 zfs(PO) spl(O) btrfs blake2b_generic xor raid6_pq libcrc32c hid_generic usbmouse usbkbd usbhid cdc_ether usbnet hid mii xhci_pci nvme xhci_pci_renesas crc32_pclmul nvme_core ahci i40e xhci_hcd nvme_auth libahci i2c_piix4
[ 4427.215927] CPU: 40 PID: 0 Comm: swapper/40 Tainted: P           O       6.8.12-4-pve #1
[ 4427.215935] Hardware name:  SIENAD8-2L2T/SIENAD8-2L2T, BIOS 1.13 04/08/2024

Before that, I could observe this type of error at times:

Code:

[ 3034.032234] {8}[Hardware Error]:  Error 9, type: corrected
[ 3034.032679] {8}[Hardware Error]:  fru_text: PcieError
[ 3034.033124] {8}[Hardware Error]:   section_type: PCIe error
[ 3034.033569] {8}[Hardware Error]:   port_type: 4, root port
[ 3034.034014] {8}[Hardware Error]:   version: 0.2
[ 3034.034456] {8}[Hardware Error]:   command: 0x0407, status: 0x0010
[ 3034.034902] {8}[Hardware Error]:   device_id: 0000:00:05.1
[ 3034.035349] {8}[Hardware Error]:   slot: 0
[ 3034.035794] {8}[Hardware Error]:   secondary_bus: 0x09
[ 3034.036247] {8}[Hardware Error]:   vendor_id: 0x1022, device_id: 0x14aa
[ 3034.036693] {8}[Hardware Error]:   class_code: 060400
[ 3034.037134] {8}[Hardware Error]:   bridge: secondary_status: 0x0000, control: 0x0002
[ 3034.037591] {8}[Hardware Error]:  Error 10, type: corrected
[ 3034.038036] {8}[Hardware Error]:  fru_text: PcieError
[ 3034.038481] {8}[Hardware Error]:   section_type: PCIe error
[ 3034.038926] {8}[Hardware Error]:   port_type: 4, root port
[ 3034.039372] {8}[Hardware Error]:   version: 0.2
[ 3034.039815] {8}[Hardware Error]:   command: 0x0407, status: 0x0010
[ 3034.040262] {8}[Hardware Error]:   device_id: 0000:00:05.1
[ 3034.040709] {8}[Hardware Error]:   slot: 0
[ 3034.041153] {8}[Hardware Error]:   secondary_bus: 0x09
[ 3034.041595] {8}[Hardware Error]:   vendor_id: 0x1022, device_id: 0x14aa
[ 3034.042044] {8}[Hardware Error]:   class_code: 060400
[ 3034.042486] {8}[Hardware Error]:   bridge: secondary_status: 0x0000, control: 0x0002
[ 3034.043264] pcieport 0000:00:05.1: AER: aer_status: 0x00000040, aer_mask: 0x00001000
[ 3034.043746] pcieport 0000:00:05.1:    [ 6] BadTLP
[ 3034.044219] pcieport 0000:00:05.1: AER: aer_layer=Data Link Layer, aer_agent=Receiver ID
[ 3034.044704] pcieport 0000:00:05.1: AER: aer_status: 0x00000040, aer_mask: 0x00001000
[ 3034.045182] pcieport 0000:00:05.1:    [ 6] BadTLP
[ 3034.045649] pcieport 0000:00:05.1: AER: aer_layer=Data Link Layer, aer_agent=Receiver ID

00:05.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 14aa (rev 01)

Code:

[ 3764.613538] {9}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 514
[ 3764.614159] {9}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 3764.614738] {9}[Hardware Error]: event severity: corrected
[ 3764.615303] {9}[Hardware Error]:  Error 0, type: corrected
[ 3764.615858] {9}[Hardware Error]:  fru_text: PcieError
[ 3764.616413] {9}[Hardware Error]:   section_type: PCIe error
[ 3764.616969] {9}[Hardware Error]:   port_type: 0, PCIe end point
[ 3764.617526] {9}[Hardware Error]:   version: 0.2
[ 3764.618078] {9}[Hardware Error]:   command: 0x0406, status: 0x0010
[ 3764.618639] {9}[Hardware Error]:   device_id: 0000:09:00.0
[ 3764.619195] {9}[Hardware Error]:   slot: 0
[ 3764.619807] {9}[Hardware Error]:   secondary_bus: 0x00
[ 3764.620366] {9}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x15ff
[ 3764.620926] {9}[Hardware Error]:   class_code: 020000
[ 3764.621478] {9}[Hardware Error]:   bridge: secondary_status: 0x2400, control: 0x0000
[ 3764.622032] {9}[Hardware Error]:  Error 1, type: corrected
[ 3764.622582] {9}[Hardware Error]:  fru_text: PcieError
[ 3764.623126] {9}[Hardware Error]:   section_type: PCIe error
[ 3764.623665] {9}[Hardware Error]:   port_type: 0, PCIe end point
[ 3764.624202] {9}[Hardware Error]:   version: 0.2
[ 3764.624734] {9}[Hardware Error]:   command: 0x0406, status: 0x0010
[ 3764.625267] {9}[Hardware Error]:   device_id: 0000:09:00.1
[ 3764.625795] {9}[Hardware Error]:   slot: 0
[ 3764.626312] {9}[Hardware Error]:   secondary_bus: 0x00
[ 3764.626829] {9}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x15ff
[ 3764.627346] {9}[Hardware Error]:   class_code: 020000
[ 3764.627859] {9}[Hardware Error]:   bridge: secondary_status: 0x2400, control: 0x0000
[ 3764.637049] i40e 0000:09:00.0: AER: aer_status: 0x00003100, aer_mask: 0x00001000
[ 3764.637690] i40e 0000:09:00.0:    [ 8] Rollover
[ 3764.638275] i40e 0000:09:00.0:    [13] NonFatalErr
[ 3764.638866] i40e 0000:09:00.0: AER: aer_layer=Data Link Layer, aer_agent=Transmitter ID
[ 3764.639477] i40e 0000:09:00.1: AER: aer_status: 0x00003100, aer_mask: 0x00001000
[ 3764.640069] i40e 0000:09:00.1:    [ 8] Rollover
[ 3764.640658] i40e 0000:09:00.1:    [13] NonFatalErr
[ 3764.641229] i40e 0000:09:00.1: AER: aer_layer=Data Link Layer, aer_agent=Transmitter ID

09:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GBASE-T (rev 02)
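
For anyone reading the AER lines above, here is a minimal Python sketch that decodes the aer_status values into the short names the kernel prints. The bit table follows the PCIe AER Correctable Error Status layout as I understand it; the function name decode_aer_correctable is made up for this example.

```python
# Bit positions of the PCIe AER Correctable Error Status register,
# paired with the short names the Linux kernel prints in dmesg.
AER_CORRECTABLE_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
    14: "CorrIntErr",
    15: "HeaderOF",
}


def decode_aer_correctable(status, mask=0):
    """Return (bit, name) for every bit set in status that is not masked."""
    visible = status & ~mask
    return [
        (bit, AER_CORRECTABLE_BITS.get(bit, "Unknown"))
        for bit in range(32)
        if visible & (1 << bit)
    ]


# The two status/mask pairs from the logs above.
for label, status, mask in [
    ("pcieport 0000:00:05.1", 0x00000040, 0x00001000),
    ("i40e 0000:09:00.0", 0x00003100, 0x00001000),
]:
    print(label, decode_aer_correctable(status, mask))
```

Both values decode to exactly the names shown in the log. In the i40e case bit 12 is also set in the raw status, but it is covered by aer_mask 0x00001000, so it is not reported.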

Any hints about what to do? Is this a hardware issue, and if so, what then?

I've tried to look around and found some hints at setting kernel option pci=nommconf which disables Memory-Mapped PCI Configuration Space? Other things suggest issues with M.2 SSDs, specifically WD (I've got 2x SN770 as mirrored boot disks), should I take them out to check and upgrade firmware?

somebodyoverthere · Nov 20, 2024

Have you tried to monitor the temps of the 2x SN770? Or maybe run a stress test of the CPU/RAM/storage, first one by one and then all of them at the same time, to eliminate the possibility of a bad PSU?

RealPjotr · Nov 27, 2024

Sorry for the delay. I got some problems setting up my SSDs between host and a VM, but got that sorted now.

I made a temporary Windows install and tested all SSDs. I upgraded the firmware on the two WD SN770 boot drives. I ran performance tests and monitored the temps on all of them, nothing excessive at all (around 50-60 C at most), even though they don't have coolers. I have decent air flow in the case.

I added "pci=nommconf" as a kernel option. The system ran stable for 3 days until last night, when I got another sudden reboot with no trace in the logs. This time I didn't have any active "dmesg -w" session, as it happened in the middle of the night. journalctl and the system log in Proxmox show nothing, just a sudden reboot half a minute after some ZFS replication of VM images (which is a 15 minute sync, so never big).

I have a TrueNAS Scale VM using some SATA HDDs via the PCIE7 slot SATA extender. After a reboot, I sometimes get an increase like "Device: /dev/sdb [SAT], ATA error count increased from 88 to 92". Since it happens on boot, I'm not too concerned, but it could be bad cables. My intention now is to replace this extender card and its cables with an old 8-port SATA HBA card with separate SATA cables, to try to isolate the issue.

PS. When I first got the system I did run some Phoronix benchmarks and some AI software runs to stress test the system. All went perfectly fine, no issues at all, no high temps etc.

RealPjotr · Nov 29, 2024

I'm pretty sure (at least hoping very much!) this is either the WD SN770 SSDs used as boot drives not handling power states correctly OR it's the two 4xSSD PCIe 4.0 x16 bifurcation cards purchased from AliExpress. The SN770 problem is all over the internet, but I've also seen people online replace generic PCIe cards with Asus Hyper V2 to fix problems.

So what I did now was swap the SSDs around. I moved the two SN770 to the two M.2 slots on the motherboard, I kept the two VM storage Samsungs on the PCIe boards. I then added a VM on the SN770 boot drives ("spool") and installed and ran LM Studio to stress things a bit. It all runs fine. But overnight it has crashed and this time I got a second "dmesg -w" capture of what is going on:

[25693.784353] VFIO - User Level meta-driver version: 0.3
[25693.805614] xhci_hcd 0000:c7:00.4: remove, state 4
[25693.806199] usb usb2: USB disconnect, device number 1
[25693.808150] xhci_hcd 0000:c7:00.4: USB bus 2 deregistered
[25693.808674] xhci_hcd 0000:c7:00.4: remove, state 1
[25693.809067] usb usb1: USB disconnect, device number 1
[25693.809603] usb 1-1: USB disconnect, device number 2
[25693.810143] usb 1-1.4: USB disconnect, device number 4
[25693.855619] usb 1-1.5: USB disconnect, device number 5
[25693.856308] cdc_ether 1-1.5:2.0 enx32754cc1f0f4: unregister 'cdc_ether' usb-0000:c7:00.4-1.5, CDC Ethernet Device
[25693.887608] usb 1-1.6: USB disconnect, device number 6
[25694.047641] usb 1-2: USB disconnect, device number 3
[25694.080836] xhci_hcd 0000:c7:00.4: USB bus 1 deregistered
[25694.128963] sd 3:0:0:0: [sda] Synchronizing SCSI cache
[25694.133431] ata4.00: Entering standby power mode
[25696.640992] sd 4:0:0:0: [sdb] Synchronizing SCSI cache
[25696.644433] ata5.00: Entering standby power mode
[25697.384018] sd 5:0:0:0: [sdc] Synchronizing SCSI cache
[25697.388757] ata6.00: Entering standby power mode
[25698.125050] sd 6:0:0:0: [sdd] Synchronizing SCSI cache
[25698.127718] ata7.00: Entering standby power mode
[25698.843057] sd 7:0:0:0: [sde] Synchronizing SCSI cache
[25698.845716] ata8.00: Entering standby power mode
[25699.943898] zio pool=spool vdev=/dev/disk/by-id/nvme-eui.e8238fa6bf530001001b448b47f495c4-part3 error=5 type=1 offset=270336 size=8192 flags=721601
[25700.042310] zio pool=spool vdev=/dev/disk/by-id/nvme-eui.e8238fa6bf530001001b448b47f4993a-part3 error=5 type=1 offset=270336 size=8192 flags=721601
[25700.043488] zio pool=spool vdev=/dev/disk/by-id/nvme-eui.e8238fa6bf530001001b448b47f4993a-part3 error=5 type=1 offset=999129358336 size=8192 flags=721601
[25700.046801] WARNING: Pool 'spool' has encountered an uncorrectable I/O failure and has been suspended.

[25700.055778] WARNING: Pool 'spool' has encountered an uncorrectable I/O failure and has been suspended.

The 4+1 SATA drives were still connected, but not in use. What I see here is that the system starts putting drives to sleep and it is at this time the "spool" fails, indicating a problem with WD SN770 handling power states? Does anyone know what these errors are? (I tried to Google but couldn't find out)
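
As a partial answer to what the zio lines say, here is a small Python sketch that pulls the fields out of such a line and translates the numbers. error= is a plain errno value, so 5 is EIO (a generic I/O error). The type= table is my reading of the OpenZFS zio type numbering and should be treated as an assumption; describe_zio is a hypothetical helper name.

```python
import errno
import os
import re

# zio type numbers as I understand the OpenZFS enum (assumption, not verified
# against the exact module version running on this host).
ZIO_TYPES = {0: "null", 1: "read", 2: "write", 3: "free", 4: "claim"}

ZIO_RE = re.compile(
    r"zio pool=(?P<pool>\S+) vdev=(?P<vdev>\S+) error=(?P<error>\d+) "
    r"type=(?P<type>\d+) offset=(?P<offset>\d+) size=(?P<size>\d+)"
)


def describe_zio(line):
    """Parse one 'zio pool=...' kernel log line into readable fields."""
    m = ZIO_RE.search(line)
    if m is None:
        return None
    err = int(m["error"])
    return {
        "pool": m["pool"],
        "vdev": m["vdev"].rsplit("/", 1)[-1],       # drop /dev/disk/by-id/
        "errno": errno.errorcode.get(err, str(err)),
        "meaning": os.strerror(err),
        "io_type": ZIO_TYPES.get(int(m["type"]), "other"),
        "offset": int(m["offset"]),
        "size": int(m["size"]),
    }


# First zio line from the capture above.
LINE = (
    "[25699.943898] zio pool=spool vdev=/dev/disk/by-id/"
    "nvme-eui.e8238fa6bf530001001b448b47f495c4-part3 error=5 type=1 "
    "offset=270336 size=8192 flags=721601"
)
print(describe_zio(LINE))
```

Read that way, the lines report failed 8 KiB reads on both mirror members of "spool", which fits the pool being suspended right after.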

My next idea is to try to replace the boot drives in the spool one by one. I have two more data storage Samsung SSDs I can temporarily test with. So I intend to pull one SN770 out, replace it with a Samsung and resilver. Then, when it runs fine, pull the second SN770 and replace that with a second Samsung. Then I will run and stress test the system again to see if the problem goes away. Does that sound like a good idea?

RealPjotr · Dec 12, 2024

As I started stressing the system more, I found that the WD SN770 SSDs definitely were a source of instability. I have now returned these and replaced them with Samsung 990 Pro SSDs. It has been running stable for 2-3 weeks with various loads until a sudden restart today again!

Now it rebooted without any trace in the logs. I did not see this coming, so I didn't have any "dmesg -w" running externally. There is nothing in the Proxmox system log or from "journalctl -o short-precise -k -b -1". I also looked in the IPMI, but don't find anything relevant.

Where should I look? What can I do to get more info next time?
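
One way to avoid missing the next crash is to keep the external "dmesg -w" capture running permanently and write it to a file on the other machine. Below is a minimal Python sketch of that idea; it assumes key-based ssh access to the server, and the names stamp and follow are made up for this example.

```python
import subprocess
from datetime import datetime


def stamp(line, now):
    """Prefix a kernel log line with the wall-clock time it was received."""
    return f"{now.isoformat(timespec='seconds')} {line.rstrip()}\n"


def follow(cmd, out_path):
    """Run cmd and append every output line to out_path, flushing each line.

    Flushing per line matters here: the last lines before a hard reset are
    the interesting ones, so they should not sit in a buffer.
    """
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc, \
            open(out_path, "a", encoding="utf-8") as out:
        for line in proc.stdout:
            out.write(stamp(line, datetime.now()))
            out.flush()
```

On the monitoring machine this could be started with something like follow(["ssh", "root@kosmos", "dmesg", "-w"], "kosmos-dmesg.log") inside a loop that reconnects after the server comes back. It only captures what the kernel manages to print before the reset, so a silent power-level reset will still leave nothing.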

Edit: I found this interesting setting described for a consumer Asrock motherboard, so I made the same change in the hope it will make my system stable:

After an embarrassing amount of bisection and testing, it turned out that for this particular motherboard (ASRock X670E Taichi Carrarra), there exists a setting Advanced\AMD CBS\CPU Common Options\Core Watchdog\Core Watchdog Timer Enable in the BIOS, whose default setting (Auto) seems to be to ENABLE the Core Watchdog Timer, hence causing sudden reboots to occur at unpredictable intervals on Debian, and hence Proxmox as well.

The workaround is to set the Core Watchdog Timer Enable setting to Disable. In my case, that caused the system to become stable under load.

- https://forum.asrock.com/forum_posts.asp?TID=85183&title=workaround-for-asrock-random-reboots

marcio79 · Dec 12, 2024

RealPjotr said:
As I started stressing the system more, I found that the WD SN770 SSDs definitely were a source of instability. I have now returned these and replaced them with Samsung 990 Pro SSDs. It has been running stable for 2-3 weeks with various loads until a sudden restart today again!

Now it rebooted without any trace in the logs. I did not see this coming, so I didn't have any "dmesg -w" running externally. There is nothing in the Proxmox system log or from "journalctl -o short-precise -k -b -1". I also looked in the IPMI, but don't find anything relevant.

Where should I look? What can I do to get more info next time?

Give kernel Linux 6.11.0-1-pve a try.

RealPjotr · Dec 12, 2024

Ok, that was a quick one! Another magic reboot, nothing in any logs. This is after I set Core Watchdog Timer = Disable in BIOS.

So, I installed kernel 6.11.0-2-pve, thanks for the tip, let's see where it goes...

marcio79 · Dec 14, 2024

RealPjotr said:
Ok, that was a quick one! Another magic reboot, nothing in any logs. This is after I set Core Watchdog Timer = Disable in BIOS.

So, I installed kernel 6.11.0-2-pve, thanks for the tip, let's see where it goes...

Tell me, friend, is it stable now?

RealPjotr · Dec 14, 2024

I am monitoring. Before Thursday, it had run around two weeks (kernel 6.8) before it crashed twice that day.

RealPjotr · Dec 22, 2024

And after 9 days a new crash with no trace in the logs (I only had "dmesg -w" running in a terminal on a VM on another system, it didn't see anything either). This happened at ~22:30 last night, I rebooted and got things running again and went to bed. At ~02:30 it happened again.

Since that boot 9 days ago, I have moved my docker NFS storage to the TrueNAS Scale VM, thereby increasing the I/O load. I also do a nightly backup of this ~440 GB storage to a separate (physical) TrueNAS server, and all VMs are backed up to two separate PBS instances on two different mini-PCs. I've also been testing some LLMs etc. without issues. Basically I'm using it as a full home server with some 15 VMs, a docker environment, etc. At the time of the crash at 22:30 I had just started a 10 Mbit/s download from the internet, nothing unusual. A few minutes into it, the system crashed. The download was done via a docker service to a 2.5" SSD which is shared via SMB on the TrueNAS VM (it has this 2.5" SSD, 4xHDD as two ZFS mirrors and 2xNVMe SSDs as a ZFS mirror with the docker NFS storage). The Proxmox host has 4xNVMe SSDs in two ZFS mirrors.

So the next step is that I have now added "pcie_aspm=off" to the kernel cmdline, turning off all ASPM:

Code:

root@kosmos:~# dmesg | grep ASPM
[    0.264504] PCIe ASPM is disabled
[    1.109899] acpi PNP0A08:00: _OSC: not requesting OS control; OS requires [ExtendedConfig ASPM ClockPM MSI]
[    1.118571] acpi PNP0A08:02: _OSC: not requesting OS control; OS requires [ExtendedConfig ASPM ClockPM MSI]
[    1.135435] acpi PNP0A08:03: _OSC: not requesting OS control; OS requires [ExtendedConfig ASPM ClockPM MSI]

I take it this means no ASPM enabled? Any other suggestions?
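
The "PCIe ASPM is disabled" line only says that the kernel will not manage ASPM. As far as I know, with pcie_aspm=off the kernel leaves whatever the firmware configured untouched, so the per-device link state is worth checking as well. Here is a minimal Python sketch that reads the ASPM part of the LnkCtl lines from "lspci -vv" output (run as root so the capabilities are shown). The sample text is made up for illustration and is not from this server, and the function names are hypothetical.

```python
import re

# Made-up excerpt in the shape of `lspci -vv` output; NOT taken from this server.
SAMPLE = """\
aa:00.0 PCI bridge: Example root port
\tCapabilities: [58] Express (v2) Root Port (Slot+), MSI 00
\t\tLnkCtl:\tASPM Disabled; RCB 64 bytes, LnkDisable- CommClk+
bb:00.0 Non-Volatile memory controller: Example NVMe SSD
\t\tLnkCtl:\tASPM L1 Enabled; RCB 64 bytes, LnkDisable- CommClk+
"""

# Device header lines start in column 0 with the PCI address (domain optional).
HEADER_RE = re.compile(r"^((?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) ")
LNKCTL_RE = re.compile(r"LnkCtl:\s+ASPM ([^;]+);")


def aspm_states(lspci_text):
    """Map PCI address -> ASPM state string from the device's LnkCtl line."""
    states = {}
    current = None
    for line in lspci_text.splitlines():
        header = HEADER_RE.match(line)
        if header:
            current = header.group(1)
            continue
        lnkctl = LNKCTL_RE.search(line)
        if lnkctl and current is not None:
            states[current] = lnkctl.group(1).strip()
    return states


def still_enabled(lspci_text):
    """Addresses whose link control does not report 'Disabled'."""
    return [addr for addr, state in aspm_states(lspci_text).items()
            if state != "Disabled"]


print(still_enabled(SAMPLE))
```

Feeding it the real output (for example via subprocess.run(["lspci", "-vv"], capture_output=True, text=True).stdout) would list any device where the firmware left ASPM on.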

RealPjotr · Dec 29, 2024

And yesterday it happened again, out of the blue.

I'm running out of ideas. The next step I can try is to remove the SATA extender and cables that use the motherboard's PCIE7 SATA capabilities and replace them with an old trustworthy 9207 HBA card.

wuwu · Jan 13, 2025

Hi @RealPjotr. Have you been able to find the source of the problem?
For your information, I have a very similar configuration and similar errors in the log.
Asrock Rack SIENAD8-2L2T, Epyc 8534P and 6x96 GB registered ECC RAM. I also have WD Red SN700 2 x 256 GB for boot in mirror mode (in the M.2 slots) and WD Red SN700 2 x 4TB in mirror mode for VM/LXC containers (in the MCIO slot).

These errors appeared for the first time when I was heavily using the tdarr app, transcoding thousands of small files, with the cache folder placed on the WD Red SN700 2 x 4TB mirror. So far the situation has happened only once and I am watching it closely.

I have a GPU card on PCIE1, HBA card on PCIE5 (moved from PCIE6) and Radian Memory on PCIE3.

IamTheForth · Jan 30, 2025

I had that same issue with Proxmox: it just freezes randomly (within hours sometimes). I moved back to kernel 6.5 and have had no reboot since last week.

vineet · Mar 3, 2025

Did anyone find any solutions to this? I am also seeing similar issues, but for GPU passthrough. Here is a screenshot of the error.

RealPjotr · Mar 19, 2025

Sorry for the absence, this is due to having 70+ days of uptime since December, so I was hoping some magic had fixed things. But NOPE, suddenly I've got three more random reboots in two weeks.

Please note that these are without any logs in dmesg or the system log. What I did now was e-mail Asrock Rack support and explain. I immediately got a reply asking me to try BIOS 2.01! This is not available on their web site; I got the BIOS ROM file in the e-mail and it's dated August 4, 2024! (I asked for the README. EDIT: They replied "Sorry, no changelog"!) I have now upgraded the BIOS on the motherboard and all is well so far; it only changed one of the two motherboard M.2 device IDs I share to a VM. I guess we'll see in a few months if it's stable or not.

I've noticed in my smart home system that I occasionally have power spikes in usage on other equipment where my server is connected. I thought it might cause this instability, but after adding some scripts to notify me, it's not synced with when the server reboots, so kind of a cold theory at the moment.

The other problem I think might be the cause is that the SATA controller slots share IOMMU groups with a few other peripherals. Since I share PCIE7 SATA slot to my TrueNAS VM, that's not good if the other devices are active on the host. But I would think this should be seen in the logs. Does anyone know if these matter:

IOMMU Group 13:
00:07.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Genoa/Bergamo Dummy Host Bridge [1022:149f] (rev 01)
00:07.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Genoa/Bergamo Internal PCIe GPP Bridge to Bus [D:B] [1022:14a7] (rev 01)
00:07.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Genoa/Bergamo Internal PCIe GPP Bridge to Bus [D:B] [1022:14a7] (rev 01)
0a:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Genoa/Bergamo Dummy Function [1022:14ac] (rev 01)
0a:00.1 System peripheral [0880]: Advanced Micro Devices, Inc. [AMD] SDXI [1022:14dc]
0a:00.4 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Device [1022:14c9] (rev da)
0a:00.5 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Genoa CCP/PSP 4.0 Device [1022:14ca]
0b:00.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 91)
0b:00.1 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 91)

Again, if I install an HBA card for SATA, I would bypass this potential issue.
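
To make the IOMMU concern concrete, here is a small Python sketch that parses a listing in the format above and reports which non-bridge devices share a group with the devices passed to a VM. The device descriptions in the sample are abbreviated from the listing above; parse_groups and companions are hypothetical helper names, and treating 0b:00.0 and 0b:00.1 as the passed-through devices is my assumption about which SATA functions go to the TrueNAS VM.

```python
import re
from collections import defaultdict

# Group 13 from the listing above, with the vendor text abbreviated.
LISTING = """\
IOMMU Group 13:
00:07.0 Host bridge [0600]: AMD Genoa/Bergamo Dummy Host Bridge [1022:149f]
00:07.1 PCI bridge [0604]: AMD Internal PCIe GPP Bridge [1022:14a7]
00:07.2 PCI bridge [0604]: AMD Internal PCIe GPP Bridge [1022:14a7]
0a:00.0 Non-Essential Instrumentation [1300]: AMD Dummy Function [1022:14ac]
0a:00.1 System peripheral [0880]: AMD SDXI [1022:14dc]
0a:00.4 USB controller [0c03]: AMD Device [1022:14c9]
0a:00.5 Encryption controller [1080]: AMD Genoa CCP/PSP 4.0 Device [1022:14ca]
0b:00.0 SATA controller [0106]: AMD FCH SATA Controller [1022:7901]
0b:00.1 SATA controller [0106]: AMD FCH SATA Controller [1022:7901]
"""

GROUP_RE = re.compile(r"^IOMMU Group (\d+):")
DEVICE_RE = re.compile(r"^([0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) (.+?) \[[0-9a-f]{4}\]:")


def parse_groups(text):
    """Return {group number: [(pci address, device class), ...]}."""
    groups = defaultdict(list)
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        group = GROUP_RE.match(line)
        if group:
            current = int(group.group(1))
            continue
        device = DEVICE_RE.match(line)
        if device and current is not None:
            groups[current].append((device.group(1), device.group(2)))
    return dict(groups)


def companions(groups, passthrough):
    """Non-bridge devices that share a group with any passthrough address."""
    shared = []
    for devices in groups.values():
        addresses = {addr for addr, _ in devices}
        if not addresses & set(passthrough):
            continue
        for addr, dev_class in devices:
            if addr not in passthrough and "bridge" not in dev_class.lower():
                shared.append((addr, dev_class))
    return shared


for addr, dev_class in companions(parse_groups(LISTING), {"0b:00.0", "0b:00.1"}):
    print(addr, dev_class)
```

For this group it lists the four 0a:00.x functions, including the USB controller and the CCP/PSP device, as sharing the group with the SATA controllers.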

IamTheForth said:
I had that same issue with Proxmox: it just freezes randomly (within hours sometimes). I moved back to kernel 6.5 and have had no reboot since last week.

I haven't had any difference from different kernels. Currently on Proxmox opt-in 6.11 kernel.

vineet said:
Did anyone find any solutions to this? I am also seeing similar issues, but for GPU passthrough. Here is a screenshot of the error.

My reboots are without anything in the logs. Your issue is likely IOMMU group related or some other issue. Sharing GPUs to VMs is a lot harder to get right. I'm not the expert on it, never done it.

RealPjotr · Apr 11, 2025

About two weeks later, suddenly at ~05:35 in the morning, a reboot with nothing in the logs again. So BIOS 2.01 did not help!

I will upgrade to Proxmox 8.4, install the optional kernel 6.14 and hope for the best...

EDIT: I upgraded to 8.4, ok. I installed opt-in kernel 6.14, my TrueNAS VM would not boot unless I removed PCIE7 SATA slot sharing to it. So I reverted to running kernel 6.11 for now.

kn2mUc · Apr 12, 2025

Just put a similar system into test. Will let you know if I hit similar issues.

kn2mUc · Apr 13, 2025

My system just killed all the VMs. Will try 8.4 as well and cross my fingers.
- Asrock Rack SIENAD8-2L2T
- AMD EPYC 8024P 8-Core Processor
- 6x32 MTC20F2085S1RC48BA1
- LSI 9400-16i HBA
- 2 M2 drives
Memory and CPU both checked out fine for 4 passes of memtest and CPU burn.

RealPjotr · May 13, 2025

I still have random reboots with nothing in the logs. What I've done since I last posted:

* Installed a 2308 HBA in PCIE6 to replace using the PCIE7 SATA functionality. I basically got reboots more often, every few days.
* Yesterday I ran a full pass of memtest, which took about 10 hours. No errors or warnings.
* Last night I removed the HBA, going back to using the PCIE7 SATA ports, still with 1 SSD and 4 HDDs.
* I also moved the two 4xM.2 PCIe 4.0 boards, now running 2 + 2 M.2 SSDs. Previously I had them in PCIE1 and PCIE3; I moved them to PCIE3 and PCIE5.

Since I got this up and running again at 21:00 last night, I got THREE reboots overnight:

-2 67badac6329f409a9100f1a0f1e5c651 Mon 2025-05-12 23:27:58 CEST Tue 2025-05-13 04:43:23 CEST
-1 fa1461eee470428582f3455fc9afe06d Tue 2025-05-13 04:49:31 CEST Tue 2025-05-13 07:32:59 CEST
0 c779ccdc195b4289bfaaad2ea8b19a5e Tue 2025-05-13 07:46:18 CEST Tue 2025-05-13 08:31:48 CEST

None of them have anything logged in the system log. I'm still on opt-in kernel 6.11, Proxmox 8.4. Should I try to get 6.14 working again? Should I revert to 6.8? Older?
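
The boot list above can be turned into per-boot uptimes and downtime gaps with a few lines of Python, which makes it easier to compare crash intervals over time. This is a minimal sketch that assumes every line has the same ten whitespace-separated fields and that all timestamps share one time zone; parse_boots is a made-up name.

```python
from datetime import datetime

BOOTS = """\
-2 67badac6329f409a9100f1a0f1e5c651 Mon 2025-05-12 23:27:58 CEST Tue 2025-05-13 04:43:23 CEST
-1 fa1461eee470428582f3455fc9afe06d Tue 2025-05-13 04:49:31 CEST Tue 2025-05-13 07:32:59 CEST
0 c779ccdc195b4289bfaaad2ea8b19a5e Tue 2025-05-13 07:46:18 CEST Tue 2025-05-13 08:31:48 CEST
"""

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_boots(text):
    """Return [(boot index, first log entry, last log entry), ...]."""
    boots = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 10:
            continue  # skip headers or lines in another layout
        first = datetime.strptime(f"{fields[3]} {fields[4]}", TIME_FORMAT)
        last = datetime.strptime(f"{fields[7]} {fields[8]}", TIME_FORMAT)
        boots.append((int(fields[0]), first, last))
    return boots


boots = parse_boots(BOOTS)

# Time between the first and last log entry of each boot.
uptimes = [(idx, last - first) for idx, first, last in boots]

# Time between the last entry of one boot and the first entry of the next.
gaps = [nxt[1] - cur[2] for cur, nxt in zip(boots, boots[1:])]

for idx, uptime in uptimes:
    print(f"boot {idx}: logged for {uptime}")
for gap in gaps:
    print(f"gap before next boot: {gap}")
```

Note that for boot 0 the "last entry" is simply the most recent log line at the time of listing, not a crash.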

I'm at a loss of what to do. I'm thinking of asking the site I bought the PSU (Seasonic 850W ATX 3.0) from to buy another for testing and return one after. Or try to get a replacement motherboard, meaning a lot of work and down time?

RealPjotr · May 13, 2025

Ok, I have now received BIOS 2.03 (February 12) from Asrock Rack support, another version with no release notes. I've upgraded and will monitor.
