My new system is unstable, and the more I use it, the more frequent the spontaneous reboots become. The system logs show nothing at all before a reboot.
I have built a new server based on an ASRock Rack SIENAD8-2L2T, an EPYC 8434P and 6x64 GB registered ECC RAM. In the PCIE7 slot I've installed a SATA breakout board for 4x 4xSATA ports, with 1 SSD and 4 HDDs connected. In slots PCIE1 and PCIE3 I've installed two 4xM.2 boards and populated them with various SSDs, 8 in total. Nothing is installed yet in the 2 M.2 slots on the motherboard. I have a 1G link connected to the IPMI port and one 10G NIC port connected.
pve-manager/8.2.9/98c7f34632fee424
Linux 6.8.12-4-pve (2024-11-06T15:04Z)
Today I caught a reboot while watching "dmesg -w" from another system. The log shows:
Code:
[ 4427.215725] watchdog: Watchdog detected hard LOCKUP on cpu 40
[ 4427.215734] Modules linked in: vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace netfs veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter sctp ip6_udp_tunnel udp_tunnel nf_tables softdog sunrpc
[ 4427.215784] softdog: Initiating system reboot
[ 4427.215783] clocksource: Long readout interval, skipping watchdog check: cs_nsec: 24777258129 wd_nsec: 24777251435
[ 4427.215787] binfmt_misc
[ 4427.215791] bonding tls nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common ipmi_ssif amd64_edac edac_mce_amd kvm_amd kvm irqbypass
[ 4427.215815] {11}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 514
[ 4427.215819] crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd dax_hmem cxl_acpi acpi_ipmi rapl cxl_core ast pcspkr ipmi_si i2c_algo_bit k10temp ipmi_devintf ccp ipmi_msghandler joydev input_leds mac_hid vhost_net vhost vhost_iotlb tap efi_pstore dmi_sysfs ip_tables x_tables autofs4 zfs(PO) spl(O) btrfs blake2b_generic xor raid6_pq libcrc32c hid_generic usbmouse usbkbd usbhid cdc_ether usbnet hid mii xhci_pci nvme xhci_pci_renesas crc32_pclmul nvme_core ahci i40e xhci_hcd nvme_auth libahci i2c_piix4
[ 4427.215927] CPU: 40 PID: 0 Comm: swapper/40 Tainted: P O 6.8.12-4-pve #1
[ 4427.215935] Hardware name: SIENAD8-2L2T/SIENAD8-2L2T, BIOS 1.13 04/08/2024
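The clocksource line in the log above gives a rough idea of how long the stall lasted. A quick sketch of the arithmetic, assuming cs_nsec and wd_nsec are the nanoseconds elapsed between two watchdog checks as measured by the current clocksource and by the watchdog clocksource respectively (an interpretation of the message, not something the log itself states):

```python
# Values copied from the "Long readout interval" line in the log above.
cs_nsec = 24_777_258_129
wd_nsec = 24_777_251_435

NSEC_PER_SEC = 1_000_000_000

# Elapsed time between the two checks, in seconds.
cs_seconds = cs_nsec / NSEC_PER_SEC
# Disagreement between the two clocks over that interval, in nanoseconds.
skew_ns = cs_nsec - wd_nsec

print(f"interval between checks: {cs_seconds:.3f} s")
print(f"difference between the two clocks: {skew_ns} ns")
```

Under that assumption, almost 25 seconds passed between two checks while the two clocks disagree by only a few microseconds, which would be consistent with a stalled CPU rather than a drifting clock.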
Before that, I occasionally saw this type of error:
Code:
[ 3034.032234] {8}[Hardware Error]: Error 9, type: corrected
[ 3034.032679] {8}[Hardware Error]: fru_text: PcieError
[ 3034.033124] {8}[Hardware Error]: section_type: PCIe error
[ 3034.033569] {8}[Hardware Error]: port_type: 4, root port
[ 3034.034014] {8}[Hardware Error]: version: 0.2
[ 3034.034456] {8}[Hardware Error]: command: 0x0407, status: 0x0010
[ 3034.034902] {8}[Hardware Error]: device_id: 0000:00:05.1
[ 3034.035349] {8}[Hardware Error]: slot: 0
[ 3034.035794] {8}[Hardware Error]: secondary_bus: 0x09
[ 3034.036247] {8}[Hardware Error]: vendor_id: 0x1022, device_id: 0x14aa
[ 3034.036693] {8}[Hardware Error]: class_code: 060400
[ 3034.037134] {8}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0002
[ 3034.037591] {8}[Hardware Error]: Error 10, type: corrected
[ 3034.038036] {8}[Hardware Error]: fru_text: PcieError
[ 3034.038481] {8}[Hardware Error]: section_type: PCIe error
[ 3034.038926] {8}[Hardware Error]: port_type: 4, root port
[ 3034.039372] {8}[Hardware Error]: version: 0.2
[ 3034.039815] {8}[Hardware Error]: command: 0x0407, status: 0x0010
[ 3034.040262] {8}[Hardware Error]: device_id: 0000:00:05.1
[ 3034.040709] {8}[Hardware Error]: slot: 0
[ 3034.041153] {8}[Hardware Error]: secondary_bus: 0x09
[ 3034.041595] {8}[Hardware Error]: vendor_id: 0x1022, device_id: 0x14aa
[ 3034.042044] {8}[Hardware Error]: class_code: 060400
[ 3034.042486] {8}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0002
[ 3034.043264] pcieport 0000:00:05.1: AER: aer_status: 0x00000040, aer_mask: 0x00001000
[ 3034.043746] pcieport 0000:00:05.1: [ 6] BadTLP
[ 3034.044219] pcieport 0000:00:05.1: AER: aer_layer=Data Link Layer, aer_agent=Receiver ID
[ 3034.044704] pcieport 0000:00:05.1: AER: aer_status: 0x00000040, aer_mask: 0x00001000
[ 3034.045182] pcieport 0000:00:05.1: [ 6] BadTLP
[ 3034.045649] pcieport 0000:00:05.1: AER: aer_layer=Data Link Layer, aer_agent=Receiver ID
00:05.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 14aa (rev 01)
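The aer_status value in the log above is a bitmask, and the bracketed number on the following line is the index of the bit that is set. A minimal sketch of the decoding, assuming the bit layout of the PCIe AER Correctable Error Status register and the short names the Linux kernel prints for it. `decode_correctable` is a hypothetical helper, and hiding masked bits is an assumption about how the kernel reports them:

```python
# Bit names for the PCIe AER Correctable Error Status register,
# using the short labels the Linux kernel prints (e.g. "BadTLP").
CORRECTABLE_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
    14: "CorrIntErr",
    15: "HeaderOF",
}


def decode_correctable(status: int, mask: int = 0) -> list[tuple[int, str]]:
    """Return (bit, name) for every status bit that is set and not masked."""
    visible = status & ~mask  # assumption: masked bits are not reported
    return [
        (bit, CORRECTABLE_BITS.get(bit, "unknown"))
        for bit in range(32)
        if visible & (1 << bit)
    ]


# Values copied from the pcieport 0000:00:05.1 lines above.
print(decode_correctable(0x00000040, 0x00001000))
```

For the values above this yields bit 6, BadTLP, matching what the kernel printed: the root port received a malformed or corrupted packet at the data link layer and the hardware corrected it.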
Code:
[ 3764.613538] {9}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 514
[ 3764.614159] {9}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 3764.614738] {9}[Hardware Error]: event severity: corrected
[ 3764.615303] {9}[Hardware Error]: Error 0, type: corrected
[ 3764.615858] {9}[Hardware Error]: fru_text: PcieError
[ 3764.616413] {9}[Hardware Error]: section_type: PCIe error
[ 3764.616969] {9}[Hardware Error]: port_type: 0, PCIe end point
[ 3764.617526] {9}[Hardware Error]: version: 0.2
[ 3764.618078] {9}[Hardware Error]: command: 0x0406, status: 0x0010
[ 3764.618639] {9}[Hardware Error]: device_id: 0000:09:00.0
[ 3764.619195] {9}[Hardware Error]: slot: 0
[ 3764.619807] {9}[Hardware Error]: secondary_bus: 0x00
[ 3764.620366] {9}[Hardware Error]: vendor_id: 0x8086, device_id: 0x15ff
[ 3764.620926] {9}[Hardware Error]: class_code: 020000
[ 3764.621478] {9}[Hardware Error]: bridge: secondary_status: 0x2400, control: 0x0000
[ 3764.622032] {9}[Hardware Error]: Error 1, type: corrected
[ 3764.622582] {9}[Hardware Error]: fru_text: PcieError
[ 3764.623126] {9}[Hardware Error]: section_type: PCIe error
[ 3764.623665] {9}[Hardware Error]: port_type: 0, PCIe end point
[ 3764.624202] {9}[Hardware Error]: version: 0.2
[ 3764.624734] {9}[Hardware Error]: command: 0x0406, status: 0x0010
[ 3764.625267] {9}[Hardware Error]: device_id: 0000:09:00.1
[ 3764.625795] {9}[Hardware Error]: slot: 0
[ 3764.626312] {9}[Hardware Error]: secondary_bus: 0x00
[ 3764.626829] {9}[Hardware Error]: vendor_id: 0x8086, device_id: 0x15ff
[ 3764.627346] {9}[Hardware Error]: class_code: 020000
[ 3764.627859] {9}[Hardware Error]: bridge: secondary_status: 0x2400, control: 0x0000
[ 3764.637049] i40e 0000:09:00.0: AER: aer_status: 0x00003100, aer_mask: 0x00001000
[ 3764.637690] i40e 0000:09:00.0: [ 8] Rollover
[ 3764.638275] i40e 0000:09:00.0: [13] NonFatalErr
[ 3764.638866] i40e 0000:09:00.0: AER: aer_layer=Data Link Layer, aer_agent=Transmitter ID
[ 3764.639477] i40e 0000:09:00.1: AER: aer_status: 0x00003100, aer_mask: 0x00001000
[ 3764.640069] i40e 0000:09:00.1: [ 8] Rollover
[ 3764.640658] i40e 0000:09:00.1: [13] NonFatalErr
[ 3764.641229] i40e 0000:09:00.1: AER: aer_layer=Data Link Layer, aer_agent=Transmitter ID
09:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GBASE-T (rev 02)
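Both sets of errors sit on the same link: root port 00:05.1 reports secondary_bus 0x09, and the X710 functions are 09:00.0 and 09:00.1. To see how often each device reports errors over time, a saved kernel log can be tallied per device. A rough sketch, assuming the log is available as plain text (for example saved from dmesg); `count_aer_events` is a hypothetical helper name:

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   [ 3034.043264] pcieport 0000:00:05.1: AER: aer_status: 0x00000040, aer_mask: 0x00001000
AER_LINE = re.compile(
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"AER: aer_status: (?P<status>0x[0-9a-f]+)"
)


def count_aer_events(log_text: str) -> Counter:
    """Count AER status reports per (PCI address, status value)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = AER_LINE.search(line)
        if match:
            counts[(match["bdf"], int(match["status"], 16))] += 1
    return counts


# Sample lines copied from the logs above.
sample = """\
[ 3034.043264] pcieport 0000:00:05.1: AER: aer_status: 0x00000040, aer_mask: 0x00001000
[ 3034.044704] pcieport 0000:00:05.1: AER: aer_status: 0x00000040, aer_mask: 0x00001000
[ 3764.637049] i40e 0000:09:00.0: AER: aer_status: 0x00003100, aer_mask: 0x00001000
[ 3764.639477] i40e 0000:09:00.1: AER: aer_status: 0x00003100, aer_mask: 0x00001000
"""

for (bdf, status), n in sorted(count_aer_events(sample).items()):
    print(f"{bdf} status={status:#010x} count={n}")
```

A count that climbs with network load, or that changes after reseating or moving the card, would help narrow down whether the link itself is marginal.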
Any hints about what to do? Is this a hardware issue, and if so, what should I do next?
I've looked around and found some hints about setting the kernel option pci=nommconf, which apparently disables memory-mapped PCI configuration space. Other reports suggest issues with M.2 SSDs, specifically WD ones (I've got 2x SN770 as mirrored boot disks). Should I take them out to check and upgrade their firmware?