Force PCIe port downgrade at boot (or solve AER errors)

logan893

Member
Mar 23, 2022
Is there a way to force downgrade of a specific PCIe port or device on boot? (I cannot seem to change it for this port in BIOS.)

Or some other way to solve these PCIe device errors?

I have one machine with an M.2 2242 slot, which supports PCIe 3.0 x4. I have an adapter riser with a 25cm cable to convert the M.2 to PCIe x4. I have used this in another PC (based on Xeon D-1518) with great results, with an Intel 10Gb adapter (X540-AT2, using only one port), running Proxmox 8.4.

With the new system, Asrock Rack workstation motherboard based on the C246 chipset, I am having some trouble. This is a new installation of Proxmox 9.

AER reports correctable errors (most of them as "multiple correctable errors") from the NIC when it sits in the M.2 riser adapter. When the X540 card is idle, there are only a few errors. But when I run a network test (iperf3) from a VM to another machine over a bridge (vmbr1) linked to this NIC, the console / dmesg log fills with these errors, even while throughput is good. Short tests seem to run fine, but a longer run breaks the interface completely: after about 40 seconds of iperf3 (TCP sending at 9+ Gbps), it enters a reset loop.

ASPM shows as disabled for this port (even without kernel command line modification), and I have also tried with kernel command line additions "pci=nommconf pcie_aspm=off" without any improvement.

Errors during the iperf3 transfer:
code_language.shell:
[ 1600.726157] pcieport 0000:00:1d.0: AER: Multiple Correctable error message received from 0000:05:00.1
[ 1600.726832] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
[ 1600.727491] pcieport 0000:00:1d.0:   device [8086:a330] error status/mask=00001000/00002000
[ 1600.728160] pcieport 0000:00:1d.0:    [12] Timeout
[ 1600.728819] ixgbe 0000:05:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Transmitter ID)
[ 1600.729476] ixgbe 0000:05:00.0:   device [8086:1528] error status/mask=000010c1/00002000
[ 1600.730134] ixgbe 0000:05:00.0:    [ 0] RxErr                  (First)
[ 1600.730788] ixgbe 0000:05:00.0:    [ 6] BadTLP
[ 1600.731432] ixgbe 0000:05:00.0:    [ 7] BadDLLP
[ 1600.732074] ixgbe 0000:05:00.0:    [12] Timeout
[ 1600.732717] ixgbe 0000:05:00.1: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Transmitter ID)
[ 1600.733370] ixgbe 0000:05:00.1:   device [8086:1528] error status/mask=000010c1/00002000
[ 1600.734012] ixgbe 0000:05:00.1:    [ 0] RxErr                  (First)
[ 1600.734652] ixgbe 0000:05:00.1:    [ 6] BadTLP
[ 1600.735295] ixgbe 0000:05:00.1:    [ 7] BadDLLP
[ 1600.735933] ixgbe 0000:05:00.1:    [12] Timeout
[ 1600.736577] ixgbe 0000:05:00.1: AER:   Error of this Agent is reported first
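
For anyone reading along, the `error status/mask=` values in these lines are bitmasks of the AER Correctable Error Status register. Below is a minimal sketch of a decoder, using the bit names that lspci prints on its CESta line; the function name `decode_cor_status` is made up for illustration.

```shell
#!/bin/sh
# Decode a PCIe AER Correctable Error Status value (hex, as printed in dmesg
# after "error status/mask=") into the names lspci uses on its CESta line.
# decode_cor_status is a hypothetical helper name, not an existing tool.
decode_cor_status() {
    status=$((0x$1))
    names=""
    # bit:name pairs for the bits defined in the Correctable Error Status register
    for pair in 0:RxErr 6:BadTLP 7:BadDLLP 8:Rollover 12:Timeout \
                13:AdvNonFatalErr 14:CorrIntErr 15:HeaderOF; do
        bit=${pair%%:*}
        name=${pair#*:}
        if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
            names="${names:+$names }$name"
        fi
    done
    printf '%s\n' "$names"
}
```

`decode_cor_status 000010c1` gives `RxErr BadTLP BadDLLP Timeout`, matching the four lines the kernel lists. The mask value `00002000` decodes to `AdvNonFatalErr`, which agrees with the `AdvNonFatalErr+` entry on the CEMsk line of the lspci output.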


After a 40-second continuous run of iperf3, the NIC resets and enters a loop of resets. No further network activity is possible; my VM can no longer run iperf or ping my remote server.

After the first reset, it resets again every 5-20 seconds. Connectivity is not possible even when the device says it is back up.
code_language.shell:
[ 1604.914955] pcieport 0000:00:1d.0: AER: Multiple Correctable error message received from 0000:05:00.1
[ 1604.915623] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
[ 1604.916267] pcieport 0000:00:1d.0:   device [8086:a330] error status/mask=00001000/00002000
[ 1604.916914] pcieport 0000:00:1d.0:    [12] Timeout
[ 1604.917554] ixgbe 0000:05:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Transmitter ID)
[ 1604.918196] ixgbe 0000:05:00.0:   device [8086:1528] error status/mask=000030c1/00002000
[ 1604.918838] ixgbe 0000:05:00.0:    [ 0] RxErr                  (First)
[ 1604.919477] ixgbe 0000:05:00.0:    [ 6] BadTLP
[ 1604.920109] ixgbe 0000:05:00.0:    [ 7] BadDLLP
[ 1604.920733] ixgbe 0000:05:00.0:    [12] Timeout
[ 1604.921375] ixgbe 0000:05:00.1: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Transmitter ID)
[ 1604.922008] ixgbe 0000:05:00.1:   device [8086:1528] error status/mask=000010c1/00002000
[ 1604.922643] ixgbe 0000:05:00.1:    [ 0] RxErr                  (First)
[ 1604.923275] ixgbe 0000:05:00.1:    [ 6] BadTLP
[ 1604.923899] ixgbe 0000:05:00.1:    [ 7] BadDLLP
[ 1604.924523] ixgbe 0000:05:00.1:    [12] Timeout
[ 1604.925145] ixgbe 0000:05:00.1: AER:   Error of this Agent is reported first
[ 1610.702615] ixgbe 0000:05:00.0 enp5s0f0: NETDEV WATCHDOG: CPU: 1: transmit queue 11 timed out 5788 ms
[ 1610.703255] ixgbe 0000:05:00.0 enp5s0f0: initiating reset due to tx timeout
[ 1610.703922] ixgbe 0000:05:00.0 enp5s0f0: Reset adapter
[ 1610.738230] ixgbe 0000:05:00.0 enp5s0f0: TXDCTL.ENABLE for one or more queues not cleared within the polling period
[ 1610.915903] ixgbe 0000:05:00.0: primary disable timed out
[ 1611.117242] vmbr1: port 1(enp5s0f0) entered disabled state
[ 1615.727657] ixgbe 0000:05:00.0 enp5s0f0: NIC Link is Up 10 Gbps, Flow Control: RX/TX
[ 1615.728353] vmbr1: port 1(enp5s0f0) entered blocking state
[ 1615.728994] vmbr1: port 1(enp5s0f0) entered forwarding state
[ 1621.390419] ixgbe 0000:05:00.0 enp5s0f0: Detected Tx Unit Hang
                 Tx Queue             <11>
                 TDH, TDT             <0>, <2>
                 next_to_use          <2>
                 next_to_clean        <0>
               tx_buffer_info[next_to_clean]
                 time_stamp           <100141bac>
                 jiffies              <1001429c0>
[ 1621.395268] ixgbe 0000:05:00.0 enp5s0f0: tx hang 2 detected on queue 11, resetting adapter
[ 1621.395876] ixgbe 0000:05:00.0 enp5s0f0: initiating reset due to tx timeout
[ 1621.396531] ixgbe 0000:05:00.0 enp5s0f0: Reset adapter
[ 1621.430844] ixgbe 0000:05:00.0 enp5s0f0: RXDCTL.ENABLE for one or more queues not cleared within the polling period
[ 1621.464912] ixgbe 0000:05:00.0 enp5s0f0: TXDCTL.ENABLE for one or more queues not cleared within the polling period
[ 1621.642822] ixgbe 0000:05:00.0: primary disable timed out
[ 1621.848719] vmbr1: port 1(enp5s0f0) entered disabled state
[ 1626.457105] ixgbe 0000:05:00.0 enp5s0f0: NIC Link is Up 10 Gbps, Flow Control: RX/TX
[ 1626.457772] vmbr1: port 1(enp5s0f0) entered blocking state
[ 1626.458396] vmbr1: port 1(enp5s0f0) entered forwarding state
[ 1630.094255] ixgbe 0000:05:00.0 enp5s0f0: Detected Tx Unit Hang
                 Tx Queue             <5>
                 TDH, TDT             <0>, <2>
                 next_to_use          <2>
                 next_to_clean        <0>
               tx_buffer_info[next_to_clean]
                 time_stamp           <1001440a3>
                 jiffies              <100144bc0>
[ 1630.094255] ixgbe 0000:05:00.0 enp5s0f0: Detected Tx Unit Hang
                 Tx Queue             <11>
                 TDH, TDT             <0>, <2>
                 next_to_use          <2>
                 next_to_clean        <0>
               tx_buffer_info[next_to_clean]
                 time_stamp           <100143e3d>
                 jiffies              <100144bc0>
[ 1630.094265] ixgbe 0000:05:00.0 enp5s0f0: tx hang 3 detected on queue 11, resetting adapter
[ 1630.099224] ixgbe 0000:05:00.0 enp5s0f0: tx hang 3 detected on queue 5, resetting adapter
[ 1630.103892] ixgbe 0000:05:00.0 enp5s0f0: initiating reset due to tx timeout
[ 1630.104472] ixgbe 0000:05:00.0 enp5s0f0: initiating reset due to tx timeout
[ 1630.105056] ixgbe 0000:05:00.0 enp5s0f0: Reset adapter


As a comparison, I have tried to put other PCIe cards I have available into this riser on the new system. Some cards seem to work without issue, and other cards are behaving even worse (especially the card that runs at PCIe 3.0 speeds.)

Intel PRO/1000 PT (PCIe 1.0 x1): No issues, no AER messages. Tested with continuous iperf3 data, bidirectionally.

Mellanox MNPA19-XTR ConnectX-2 EN (PCIe 2.0 x4): No AER messages during idle. I currently cannot test it in connected mode.

LSI 2008 (RAID FW), no disks attached (PCIe 2.0 x4): Steady stream of AER messages during idle (6-8 "correctable" errors per second, very few "multiple correctable" errors).

LSI 3008 (RAID FW), no disks attached (PCIe 3.0 x4): Flood of AER messages during idle (20 "correctable" and 400 "multiple correctable" errors per second). System cannot even reboot or shut down successfully, and needs to be forcefully powered off.
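
To put numbers on the comparison, the AER summary lines can be counted straight from the kernel log. This is only a sketch: `count_aer` is a made-up helper, the "Multiple Correctable" pattern is taken from the log lines above, and the wording for single reports is assumed by analogy and may differ between kernel versions.

```shell
#!/bin/sh
# Count AER summary lines read from stdin, split into single and "Multiple"
# reports. count_aer is a hypothetical helper name.
# The "Multiple Correctable" pattern follows the dmesg lines shown earlier;
# the single-report pattern is an assumption based on that wording.
count_aer() {
    awk '
        /AER: Multiple Correctable error message received/ { multi++; next }
        /AER: Correctable error message received/          { single++ }
        END { printf "correctable=%d multiple=%d\n", single, multi }
    '
}
```

Running `dmesg | count_aer` twice, a fixed number of seconds apart, and dividing the difference by the interval gives a rough errors-per-second figure for each card.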
 
lspci -vv output for X540-AT2:
code_language.shell:
05:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)
        Subsystem: Intel Corporation Ethernet Converged Network Adapter X540-T2
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin B routed to IRQ 17
        IOMMU group: 16
        Region 0: Memory at a0200000 (64-bit, prefetchable) [size=2M]
        Region 4: Memory at a0404000 (64-bit, prefetchable) [size=16K]
        Expansion ROM at 8f900000 [disabled] [size=512K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
                Address: 0000000000000000  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [70] MSI-X: Enable+ Count=64 Masked-
                Vector table: BAR=4 offset=00000000
                PBA: BAR=4 offset=00002000
        Capabilities: [a0] Express (v2) Endpoint, IntMsgNum 0
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0W TEE-IO-
                DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend+
                LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s L1, Exit Latency L0s <1us, L1 <8us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, LnkDisable- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s, Width x4 (downgraded)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR-
                         10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
                         AtomicOpsCtl: ReqEn-
                         IDOReq- IDOCompl- LTR- EmergencyPowerReductionReq-
                         10BitTagReq- OBFF Disabled, EETLPPrefixBlk-
                LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
                         EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
                        ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
                        PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
                        ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
                        PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+
                        ECRC- UnsupReq- ACSViol- UncorrIntErr+ BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
                        PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr- CorrIntErr- HeaderOF-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ CorrIntErr- HeaderOF-
                AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [140 v1] Device Serial Number b4-96-91-ff-ff-0f-aa-14
        Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 1
                ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)
                IOVCap: Migration- 10BitTagReq- IntMsgNum 0
                IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+ 10BitTagReq-
                IOVSta: Migration-
                Initial VFs: 64, Total VFs: 64, Number of VFs: 0, Function Dependency Link: 00
                VF offset: 128, stride: 2, Device ID: 1515
                Supported Page Size: 00000553, System Page Size: 00000001
                Region 0: Memory at 000000008fa00000 (64-bit, non-prefetchable)
                Region 3: Memory at 000000008fb00000 (64-bit, non-prefetchable)
                VF Migration: offset: 00000000, BIR: 0
        Capabilities: [1d0 v1] Access Control Services
                ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        Kernel driver in use: ixgbe
        Kernel modules: ixgbe

05:00.1 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)
        Subsystem: Intel Corporation Ethernet Converged Network Adapter X540-T2
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 16
        IOMMU group: 17
        Region 0: Memory at a0000000 (64-bit, prefetchable) [size=2M]
        Region 4: Memory at a0400000 (64-bit, prefetchable) [size=16K]
        Expansion ROM at 8f980000 [disabled] [size=512K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
                Address: 0000000000000000  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [70] MSI-X: Enable+ Count=64 Masked-
                Vector table: BAR=4 offset=00000000
                PBA: BAR=4 offset=00002000
        Capabilities: [a0] Express (v2) Endpoint, IntMsgNum 0
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0W TEE-IO-
                DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s L1, Exit Latency L0s <1us, L1 <8us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, LnkDisable- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s, Width x4 (downgraded)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR-
                         10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
                         AtomicOpsCtl: ReqEn-
                         IDOReq- IDOCompl- LTR- EmergencyPowerReductionReq-
                         10BitTagReq- OBFF Disabled, EETLPPrefixBlk-
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
                         EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
                        ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
                        PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
                        ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
                        PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+
                        ECRC- UnsupReq- ACSViol- UncorrIntErr+ BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
                        PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr- CorrIntErr- HeaderOF-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ CorrIntErr- HeaderOF-
                AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [140 v1] Device Serial Number b4-96-91-ff-ff-0f-aa-14
        Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 0
                ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)
                IOVCap: Migration- 10BitTagReq- IntMsgNum 0
                IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy- 10BitTagReq-
                IOVSta: Migration-
                Initial VFs: 64, Total VFs: 64, Number of VFs: 0, Function Dependency Link: 01
                VF offset: 128, stride: 2, Device ID: 1515
                Supported Page Size: 00000553, System Page Size: 00000001
                Region 0: Memory at 000000008fc00000 (64-bit, non-prefetchable)
                Region 3: Memory at 000000008fd00000 (64-bit, non-prefetchable)
                VF Migration: offset: 00000000, BIR: 0
        Capabilities: [1d0 v1] Access Control Services
                ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        Kernel driver in use: ixgbe
        Kernel modules: ixgbe
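
The lines that matter most for the downgrade question are LnkCap and LnkSta. A small sketch that pulls just those out of `lspci -vv` output (the helper name `link_summary` is hypothetical):

```shell
#!/bin/sh
# Read `lspci -vv` output on stdin and print, per function, the link speed and
# width it is capable of next to what was actually negotiated.
# link_summary is a hypothetical helper name.
link_summary() {
    awk '
        # Return the part of the first match of re after the first skip characters.
        function grab(line, re, skip) {
            if (match(line, re)) return substr(line, RSTART + skip, RLENGTH - skip)
            return "?"
        }
        /^[0-9a-f:]+\.[0-7] / { dev = $1 }    # device header line, e.g. 05:00.0
        /LnkCap:/ { cap = grab($0, "Speed [0-9.]+GT/s", 6) " " grab($0, "Width x[0-9]+", 6) }
        /LnkSta:/ {
            sta = grab($0, "Speed [0-9.]+GT/s", 6) " " grab($0, "Width x[0-9]+", 6)
            print dev, "capable:", cap, "negotiated:", sta
        }
    '
}
```

For example, `lspci -vv -s 05:00 | link_summary`. For the output above this prints `capable: 5GT/s x8 negotiated: 5GT/s x4` for both functions, so the X540 link is already running at Gen2 speed and only the width is reduced by the x4 slot.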
 
First of all, the LSI cards are (nearly all) known for having severe issues with PCIe ASPM. Once that kicks in, you'll be flooded with AER error messages. So they are not a good basis for comparison.
Now, concerning your Intel X540: I have not tested those so far, so I cannot say whether they have the same issue(s) as the LSI HBAs or not.
BUT, one step after the other:
First you should double-check whether it could actually be a signal integrity issue. Especially since you mentioned that a cable adapter is involved! That means extra connectors and cable length, which add loss and impedance discontinuities for the signals, and that nearly always makes things worse. Whether it is bad enough to cause actual failures is the big question.

If you can, set the PCIe speed for the corresponding port in BIOS to Gen2 or even Gen1 and retest. I do not know of a quick way to achieve the same from within Linux. You might be able to do so by editing the PCI config space and running a PCIe reset, but that might be going a bit too far. So check the BIOS first!
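
For reference, the config-space route would look roughly like this. Treat it as an untested sketch, not a recipe: it assumes pciutils' `setpci` with its `CAP_EXP` capability name and `value:mask` write syntax, and it relies on the standard PCIe capability layout (Link Control at offset 0x10 with Retrain Link in bit 5, Link Control 2 at offset 0x30 with Target Link Speed in bits 3:0). The function name `pcie_limit_cmds` is made up, and it only prints the commands instead of running them.

```shell
#!/bin/sh
# Print (not run) the setpci commands that would cap a downstream port at a
# given PCIe generation and then retrain the link.
# pcie_limit_cmds is a hypothetical helper name. Whether a particular root
# port honours the new target speed is an assumption to verify on the hardware.
pcie_limit_cmds() {
    port=$1   # root/downstream port above the card, e.g. 00:1d.0 in the logs
    gen=$2    # 1 = 2.5GT/s, 2 = 5GT/s, 3 = 8GT/s
    case $gen in
        1|2|3) ;;
        *) echo "gen must be 1, 2 or 3" >&2; return 1 ;;
    esac
    # Link Control 2: write Target Link Speed, leaving the other bits untouched
    printf 'setpci -s %s CAP_EXP+30.w=%04x:000f\n' "$port" "$gen"
    # Link Control: set Retrain Link so the new target speed takes effect
    printf 'setpci -s %s CAP_EXP+10.w=0020:0020\n' "$port"
}
```

`pcie_limit_cmds 00:1d.0 1` prints the two commands for Gen1. They need root to run, and the LnkSta line of `lspci -vv` for the card should show afterwards whether the speed actually changed. To apply this at every boot, the commands would have to run from a startup script early enough, which is not covered here.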

If it runs stable at the lower speeds, then chances are high that you have a signal integrity issue. That means a hardware problem, which you will probably have to tackle in hardware.
If it still behaves the same, then it might really be a logic issue: ASPM problems, as mentioned, or any of a number of other causes.
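
One thing to keep in mind when testing at lower speeds: the link bandwidth drops too. A quick back-of-the-envelope helper, counting only the line encoding (8b/10b for Gen1/Gen2, 128b/130b for Gen3) and ignoring TLP/DLLP protocol overhead, so real throughput will be somewhat lower. The name `pcie_gbps` is made up for this sketch.

```shell
#!/bin/sh
# Approximate PCIe link bandwidth per direction in Gbit/s, after line encoding
# only (8b/10b for Gen1/Gen2, 128b/130b for Gen3). Protocol overhead is ignored.
# pcie_gbps is a hypothetical helper name.
pcie_gbps() {
    gen=$1; lanes=$2
    awk -v gen="$gen" -v lanes="$lanes" 'BEGIN {
        if      (gen == 1) rate = 2.5 * 8 / 10      # 2.5 GT/s per lane
        else if (gen == 2) rate = 5.0 * 8 / 10      # 5 GT/s per lane
        else if (gen == 3) rate = 8.0 * 128 / 130   # 8 GT/s per lane
        else { print "unsupported gen" > "/dev/stderr"; exit 1 }
        printf "%.2f\n", rate * lanes
    }'
}
```

`pcie_gbps 2 4` gives 16.00 and `pcie_gbps 1 4` gives 8.00, so a Gen1 x4 link can no longer carry a saturated 10Gb port. A Gen1 test is still useful for checking stability, just not for judging throughput.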

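In case of strange issues, having a look at the errata sheet from the silicon vendor (Intel calls it a specification update) might also be a good idea.
Here is the one for the Intel 82599, which uses the same ixgbe driver; note that the X540 is a different controller with its own specification update: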
https://cdrdv2-public.intel.com/331521/331521_82599_SpecUpdate_Rev4.3.pdf

Good luck in the meantime!
 
Thanks for your reply.

ASPM is disabled and should not interfere here. It shows as disabled in all lspci outputs.

Yes, the cable (or rather the combination of this particular motherboard's M.2 port and this cable) is the main concern, especially since not all cards produce the errors, or the same amount of errors. It would be a bummer to not be able to use it for PCIe expansion.

The port itself seems to work well with an M.2 NVMe SSD at PCIe 3.0 x4 speeds. Read performance is consistent at over 2 GB/s; the full 256 GB drive is read sequentially with dd in less than 2 minutes. No alerts or errors in the console/dmesg.

I have again verified that I have no problems with this same adapter cable in my Xeon D-1518 system, running Proxmox 8.4.9. Its M.2 slot can reach PCIe 3.0 x4 speeds, and no errors are reported for any card, including the LSI 3008 HBA, even when running iperf3 to stress the 10Gb NIC.

There is no option to lower the PCIe version for the M.2 slot in BIOS, from what I have been able to find. Only the main PCIe x16 slot exposes this option.