Finding cause of instability

RealPjotr

New Member
May 31, 2023
I have an instability in my new system. As I start using it more, the reboots become more frequent. The system logs show nothing at all before a reboot.

I have built a new server based on an ASRock Rack SIENAD8-2L2T, an Epyc 8434P and 6x 64 GB registered ECC RAM. In slot PCIE7 I've installed a SATA breakout board for 4x 4xSATA ports, with 1 SSD and 4 HDDs connected. In slots PCIE1 and PCIE3 I've installed two 4xM.2 boards and populated them with various SSDs, 8 in total. Nothing is installed yet in the 2 M.2 slots on the motherboard. I have a 1G link connected to the IPMI and one 10G NIC connected.

pve-manager/8.2.9/98c7f34632fee424
Linux 6.8.12-4-pve (2024-11-06T15:04Z)

Today I caught a reboot while watching "dmesg -w" from another system. The log shows:

Code:
[ 4427.215725] watchdog: Watchdog detected hard LOCKUP on cpu 40
[ 4427.215734] Modules linked in: vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace netfs veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter sctp ip6_udp_tunnel udp_tunnel nf_tables softdog sunrpc
[ 4427.215784] softdog: Initiating system reboot
[ 4427.215783] clocksource: Long readout interval, skipping watchdog check: cs_nsec: 24777258129 wd_nsec: 24777251435
[ 4427.215787]  binfmt_misc
[ 4427.215791]  bonding tls nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common ipmi_ssif amd64_edac edac_mce_amd kvm_amd kvm irqbypass
[ 4427.215815] {11}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 514
[ 4427.215819]  crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd dax_hmem cxl_acpi acpi_ipmi rapl cxl_core ast pcspkr ipmi_si i2c_algo_bit k10temp ipmi_devintf ccp ipmi_msghandler joydev input_leds mac_hid vhost_net vhost vhost_iotlb tap efi_pstore dmi_sysfs ip_tables x_tables autofs4 zfs(PO) spl(O) btrfs blake2b_generic xor raid6_pq libcrc32c hid_generic usbmouse usbkbd usbhid cdc_ether usbnet hid mii xhci_pci nvme xhci_pci_renesas crc32_pclmul nvme_core ahci i40e xhci_hcd nvme_auth libahci i2c_piix4
[ 4427.215927] CPU: 40 PID: 0 Comm: swapper/40 Tainted: P           O       6.8.12-4-pve #1
[ 4427.215935] Hardware name:  SIENAD8-2L2T/SIENAD8-2L2T, BIOS 1.13 04/08/2024

Before that, I could observe this type of error at times:

Code:
[ 3034.032234] {8}[Hardware Error]:  Error 9, type: corrected
[ 3034.032679] {8}[Hardware Error]:  fru_text: PcieError
[ 3034.033124] {8}[Hardware Error]:   section_type: PCIe error
[ 3034.033569] {8}[Hardware Error]:   port_type: 4, root port
[ 3034.034014] {8}[Hardware Error]:   version: 0.2
[ 3034.034456] {8}[Hardware Error]:   command: 0x0407, status: 0x0010
[ 3034.034902] {8}[Hardware Error]:   device_id: 0000:00:05.1
[ 3034.035349] {8}[Hardware Error]:   slot: 0
[ 3034.035794] {8}[Hardware Error]:   secondary_bus: 0x09
[ 3034.036247] {8}[Hardware Error]:   vendor_id: 0x1022, device_id: 0x14aa
[ 3034.036693] {8}[Hardware Error]:   class_code: 060400
[ 3034.037134] {8}[Hardware Error]:   bridge: secondary_status: 0x0000, control: 0x0002
[ 3034.037591] {8}[Hardware Error]:  Error 10, type: corrected
[ 3034.038036] {8}[Hardware Error]:  fru_text: PcieError
[ 3034.038481] {8}[Hardware Error]:   section_type: PCIe error
[ 3034.038926] {8}[Hardware Error]:   port_type: 4, root port
[ 3034.039372] {8}[Hardware Error]:   version: 0.2
[ 3034.039815] {8}[Hardware Error]:   command: 0x0407, status: 0x0010
[ 3034.040262] {8}[Hardware Error]:   device_id: 0000:00:05.1
[ 3034.040709] {8}[Hardware Error]:   slot: 0
[ 3034.041153] {8}[Hardware Error]:   secondary_bus: 0x09
[ 3034.041595] {8}[Hardware Error]:   vendor_id: 0x1022, device_id: 0x14aa
[ 3034.042044] {8}[Hardware Error]:   class_code: 060400
[ 3034.042486] {8}[Hardware Error]:   bridge: secondary_status: 0x0000, control: 0x0002
[ 3034.043264] pcieport 0000:00:05.1: AER: aer_status: 0x00000040, aer_mask: 0x00001000
[ 3034.043746] pcieport 0000:00:05.1:    [ 6] BadTLP
[ 3034.044219] pcieport 0000:00:05.1: AER: aer_layer=Data Link Layer, aer_agent=Receiver ID
[ 3034.044704] pcieport 0000:00:05.1: AER: aer_status: 0x00000040, aer_mask: 0x00001000
[ 3034.045182] pcieport 0000:00:05.1:    [ 6] BadTLP
[ 3034.045649] pcieport 0000:00:05.1: AER: aer_layer=Data Link Layer, aer_agent=Receiver ID

00:05.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 14aa (rev 01)
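
As a reading aid for the aer_status value above (0x00000040 with aer_mask 0x00001000), here is a minimal Python sketch that decodes the PCIe AER Correctable Error Status register into the short names the kernel prints. The bit positions are the ones I understand the PCIe specification to define for correctable errors. The function name decode_correctable and the lookup table are purely illustrative and not part of any existing tool.

```python
# Bit positions of the PCIe AER Correctable Error Status register,
# labelled with the short names the Linux kernel prints in dmesg.
CORRECTABLE_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
    14: "CorrIntErr",
    15: "HeaderOF",
}


def decode_correctable(status, mask=0):
    """Return (reported, masked) lists of error names set in `status`.

    Hypothetical helper: a bit that is also set in `mask` goes into the
    second list, because masked errors are not reported individually.
    """
    reported, masked = [], []
    for bit, name in sorted(CORRECTABLE_BITS.items()):
        if status & (1 << bit):
            (masked if mask & (1 << bit) else reported).append(name)
    return reported, masked


if __name__ == "__main__":
    # Value taken from the pcieport 0000:00:05.1 lines above.
    print(decode_correctable(0x00000040, mask=0x00001000))
```

Applied to the other status value in this thread, 0x00003100 with the same mask, the sketch reports Rollover and NonFatalErr and treats Timeout (bit 12) as masked, which is consistent with the i40e lines listing only bits 8 and 13. All of these are link-level (Data Link Layer) correctable errors, so they point at the PCIe link rather than at a particular driver.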

Code:
[ 3764.613538] {9}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 514
[ 3764.614159] {9}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 3764.614738] {9}[Hardware Error]: event severity: corrected
[ 3764.615303] {9}[Hardware Error]:  Error 0, type: corrected
[ 3764.615858] {9}[Hardware Error]:  fru_text: PcieError
[ 3764.616413] {9}[Hardware Error]:   section_type: PCIe error
[ 3764.616969] {9}[Hardware Error]:   port_type: 0, PCIe end point
[ 3764.617526] {9}[Hardware Error]:   version: 0.2
[ 3764.618078] {9}[Hardware Error]:   command: 0x0406, status: 0x0010
[ 3764.618639] {9}[Hardware Error]:   device_id: 0000:09:00.0
[ 3764.619195] {9}[Hardware Error]:   slot: 0
[ 3764.619807] {9}[Hardware Error]:   secondary_bus: 0x00
[ 3764.620366] {9}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x15ff
[ 3764.620926] {9}[Hardware Error]:   class_code: 020000
[ 3764.621478] {9}[Hardware Error]:   bridge: secondary_status: 0x2400, control: 0x0000
[ 3764.622032] {9}[Hardware Error]:  Error 1, type: corrected
[ 3764.622582] {9}[Hardware Error]:  fru_text: PcieError
[ 3764.623126] {9}[Hardware Error]:   section_type: PCIe error
[ 3764.623665] {9}[Hardware Error]:   port_type: 0, PCIe end point
[ 3764.624202] {9}[Hardware Error]:   version: 0.2
[ 3764.624734] {9}[Hardware Error]:   command: 0x0406, status: 0x0010
[ 3764.625267] {9}[Hardware Error]:   device_id: 0000:09:00.1
[ 3764.625795] {9}[Hardware Error]:   slot: 0
[ 3764.626312] {9}[Hardware Error]:   secondary_bus: 0x00
[ 3764.626829] {9}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x15ff
[ 3764.627346] {9}[Hardware Error]:   class_code: 020000
[ 3764.627859] {9}[Hardware Error]:   bridge: secondary_status: 0x2400, control: 0x0000
[ 3764.637049] i40e 0000:09:00.0: AER: aer_status: 0x00003100, aer_mask: 0x00001000
[ 3764.637690] i40e 0000:09:00.0:    [ 8] Rollover
[ 3764.638275] i40e 0000:09:00.0:    [13] NonFatalErr
[ 3764.638866] i40e 0000:09:00.0: AER: aer_layer=Data Link Layer, aer_agent=Transmitter ID
[ 3764.639477] i40e 0000:09:00.1: AER: aer_status: 0x00003100, aer_mask: 0x00001000
[ 3764.640069] i40e 0000:09:00.1:    [ 8] Rollover
[ 3764.640658] i40e 0000:09:00.1:    [13] NonFatalErr
[ 3764.641229] i40e 0000:09:00.1: AER: aer_layer=Data Link Layer, aer_agent=Transmitter ID

09:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GBASE-T (rev 02)
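
Note that the root port 00:05.1 in the first log reports secondary_bus 0x09, and the X710 sits at 09:00.0 and 09:00.1, so both sets of errors appear to come from the same physical link. To see whether one link dominates over a longer capture, something like the following Python sketch could tally the APEI records per PCI address. The function name count_errors_by_device is hypothetical, and the regular expression assumes the "device_id: 0000:09:00.1" line format shown in the logs above.

```python
import re
from collections import Counter

# Matches a PCI address such as 0000:09:00.1 after "device_id: ".
# The "device_id: 0x15ff" form on vendor_id lines does not match.
BDF_RE = re.compile(
    r"device_id: ([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])"
)


def count_errors_by_device(log_text):
    """Count '[Hardware Error]' device_id records per PCI address."""
    counts = Counter()
    for line in log_text.splitlines():
        if "[Hardware Error]" not in line:
            continue
        match = BDF_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Three lines copied from the logs in this post.
    sample = """\
[ 3034.034902] {8}[Hardware Error]:   device_id: 0000:00:05.1
[ 3034.036247] {8}[Hardware Error]:   vendor_id: 0x1022, device_id: 0x14aa
[ 3764.618639] {9}[Hardware Error]:   device_id: 0000:09:00.0
"""
    for device, count in count_errors_by_device(sample).most_common():
        print(device, count)
```

In practice the input would be saved dmesg output rather than the inline sample. A count that keeps growing for one address would suggest reseating or moving that card before suspecting anything else.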

Any hints about what to do? Is this a hardware issue, and if so, what should I do next?

I've looked around and found some hints about setting the kernel option pci=nommconf, which disables Memory-Mapped PCI Configuration Space. Other reports suggest issues with M.2 SSDs, specifically WD (I've got 2x SN770 as mirrored boot disks). Should I take them out to check and upgrade their firmware?
 
Have you tried monitoring the temperatures of the 2x SN770? Or maybe run a stress test of the CPU, RAM and storage, first one by one and then all at the same time, to rule out a bad PSU?
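
For the temperature part, a rough Python sketch like the one below could be left running while the stress tests execute. It assumes the kernel exposes NVMe temperatures through hwmon (a device whose name file reads "nvme" and a temp1_input file in millidegrees Celsius), which may not be the case on every system. The function name read_nvme_temps is made up for illustration; nvme smart-log or smartctl -a should give the same reading if hwmon is not available.

```python
from pathlib import Path


def read_nvme_temps(hwmon_root="/sys/class/hwmon"):
    """Return {hwmon directory name: temperature in deg C} for NVMe drives.

    Assumption: each NVMe drive appears as a hwmon device whose `name`
    file contains "nvme" and whose `temp1_input` is in millidegrees C.
    """
    temps = {}
    for dev in sorted(Path(hwmon_root).glob("hwmon*")):
        name_file = dev / "name"
        temp_file = dev / "temp1_input"
        if not (name_file.is_file() and temp_file.is_file()):
            continue
        if name_file.read_text().strip() != "nvme":
            continue  # skip CPU, NIC and other sensors
        temps[dev.name] = int(temp_file.read_text().strip()) / 1000.0
    return temps


if __name__ == "__main__":
    for device, celsius in read_nvme_temps().items():
        print(f"{device}: {celsius:.1f} C")
```

Calling it in a loop every few seconds and logging the values to another machine would show whether a drive was running hot just before a lockup.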
 
