PCIe Bus Errors with Sonnet Twin 10G SFP+ Thunderbolt 3 and Intel NUC 12 Extreme

alanjohnwilliams

Hi Proxmox community,

I have the Sonnet Twin 10G SFP+ Thunderbolt 3 to Dual 10 Gigabit Ethernet Adapter connected to an Intel NUC 12 Extreme (NUC12DCMi9). I’m running Proxmox (Linux) on this server and am having PCIe issues on the Thunderbolt / PCI bus. The only Thunderbolt device connected is the Sonnet 10G SFP+ adapter.

I’m seeing sporadic PCIe bus errors that are corrected:

Code:
Oct 13 12:47:05 pve2 kernel: pcieport 0000:00:1b.0: AER: Corrected error received: 0000:03:00.0
Oct 13 12:47:05 pve2 kernel: pcieport 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Oct 13 12:47:05 pve2 kernel: pcieport 0000:03:00.0: device [8086:1136] error status/mask=00000080/00002000
Oct 13 12:47:05 pve2 kernel: pcieport 0000:03:00.0: [ 7] BadDLLP
Oct 13 12:52:13 pve2 kernel: pcieport 0000:00:1b.0: AER: Corrected error received: 0000:03:00.0
Oct 13 12:52:13 pve2 kernel: pcieport 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Oct 13 12:52:13 pve2 kernel: pcieport 0000:03:00.0: device [8086:1136] error status/mask=00000080/00002000
Oct 13 12:52:13 pve2 kernel: pcieport 0000:03:00.0: [ 7] BadDLLP
Oct 13 13:17:45 pve2 kernel: pcieport 0000:00:1b.0: AER: Corrected error received: 0000:03:00.0
Oct 13 13:17:45 pve2 kernel: pcieport 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Oct 13 13:17:45 pve2 kernel: pcieport 0000:03:00.0: device [8086:1136] error status/mask=00000080/00002000
Oct 13 13:17:45 pve2 kernel: pcieport 0000:03:00.0: [ 7] BadDLLP
Oct 13 13:17:54 pve2 kernel: pcieport 0000:00:1b.0: AER: Corrected error received: 0000:03:00.0
Oct 13 13:17:54 pve2 kernel: pcieport 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Oct 13 13:17:54 pve2 kernel: pcieport 0000:03:00.0: device [8086:1136] error status/mask=00000080/00002000
Oct 13 13:17:54 pve2 kernel: pcieport 0000:03:00.0: [ 7] BadDLLP
Oct 13 13:33:03 pve2 kernel: pcieport 0000:00:1b.0: AER: Corrected error received: 0000:03:00.0
Oct 13 13:33:03 pve2 kernel: pcieport 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Oct 13 13:33:03 pve2 kernel: pcieport 0000:03:00.0: device [8086:1136] error status/mask=00000080/00002000
Oct 13 13:33:03 pve2 kernel: pcieport 0000:03:00.0: [ 7] BadDLLP
Oct 13 13:58:25 pve2 kernel: pcieport 0000:00:1b.0: AER: Corrected error received: 0000:03:00.0
Oct 13 13:58:25 pve2 kernel: pcieport 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Oct 13 13:58:25 pve2 kernel: pcieport 0000:03:00.0: device [8086:1136] error status/mask=00000080/00002000
Oct 13 13:58:25 pve2 kernel: pcieport 0000:03:00.0: [ 7] BadDLLP
Oct 13 13:58:29 pve2 kernel: pcieport 0000:00:1b.0: AER: Corrected error received: 0000:03:00.0
Oct 13 13:58:29 pve2 kernel: pcieport 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Oct 13 13:58:29 pve2 kernel: pcieport 0000:03:00.0: device [8086:1136] error status/mask=00000080/00002000
Oct 13 13:58:29 pve2 kernel: pcieport 0000:03:00.0: [ 7] BadDLLP
Oct 13 14:29:19 pve2 kernel: pcieport 0000:00:1b.0: AER: Corrected error received: 0000:03:00.0
Oct 13 14:29:19 pve2 kernel: pcieport 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Oct 13 14:29:19 pve2 kernel: pcieport 0000:03:00.0: device [8086:1136] error status/mask=00000080/00002000
Oct 13 14:29:19 pve2 kernel: pcieport 0000:03:00.0: [ 7] BadDLLP
Oct 13 14:35:49 pve2 kernel: pcieport 0000:00:1b.0: AER: Corrected error received: 0000:03:00.0
Oct 13 14:35:49 pve2 kernel: pcieport 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Oct 13 14:35:49 pve2 kernel: pcieport 0000:03:00.0: device [8086:1136] error status/mask=00001000/00002000
Oct 13 14:35:49 pve2 kernel: pcieport 0000:03:00.0: [12] Timeout
Oct 13 15:06:03 pve2 kernel: pcieport 0000:00:1b.0: AER: Corrected error received: 0000:03:00.0
Oct 13 15:06:03 pve2 kernel: pcieport 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Oct 13 15:06:03 pve2 kernel: pcieport 0000:03:00.0: device [8086:1136] error status/mask=00000080/00002000
Oct 13 15:06:03 pve2 kernel: pcieport 0000:03:00.0: [ 7] BadDLLP
Oct 13 15:15:32 pve2 kernel: pcieport 0000:00:1b.0: AER: Corrected error received: 0000:03:00.0
Oct 13 15:15:32 pve2 kernel: pcieport 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Oct 13 15:15:32 pve2 kernel: pcieport 0000:03:00.0: device [8086:1136] error status/mask=00001000/00002000
Oct 13 15:15:32 pve2 kernel: pcieport 0000:03:00.0: [12] Timeout
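
The status values in these entries can be decoded by hand: 00000080 is bit 7 (BadDLLP) and 00001000 is bit 12 (Timeout), which matches the labels the kernel prints, and the mask 00002000 is bit 13. For tallying them over a longer log, here is a rough Python sketch. It assumes the input holds only corrected-error entries in the format shown above; the names `decode_status` and `tally_corrected_errors` are made up for illustration, and the bit labels are the short names the Linux kernel uses for the AER correctable error status register.

```python
import re
from collections import Counter

# Short labels the Linux kernel prints for bits of the AER
# Correctable Error Status register (unlisted bits are reserved).
AER_CORRECTABLE_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
    14: "CorrIntErr",
    15: "HeaderOF",
}

# Matches the "error status/mask=XXXXXXXX/XXXXXXXX" part of a log line.
STATUS_RE = re.compile(r"error status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})")


def decode_status(status):
    """Return the labels of all known bits set in a status value."""
    return [
        name
        for bit, name in sorted(AER_CORRECTABLE_BITS.items())
        if status & (1 << bit)
    ]


def tally_corrected_errors(log_text):
    """Count how often each corrected-error bit appears in the log text.

    Assumption: log_text contains only corrected-error entries; the
    uncorrected status register uses a different bit layout.
    """
    counts = Counter()
    for match in STATUS_RE.finditer(log_text):
        status = int(match.group(1), 16)
        counts.update(decode_status(status))
    return counts
```

Feeding it the output of something like `journalctl -k` filtered to the corrected entries would give a per-type count, which makes it easier to see whether a change shifts the BadDLLP rate or only the logging.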

And then the occasional error that takes the system out:

Code:
Oct 13 16:44:47 pve2 kernel: pcieport 0000:00:1b.0: AER: Uncorrected (Fatal) error received: 0000:03:00.0
Oct 13 16:44:47 pve2 kernel: pcieport 0000:03:00.0: AER: PCIe Bus Error: severity=Uncorrected (Fatal), type=Inaccessible, (Unregistered Agent ID)
Oct 13 16:44:47 pve2 kernel: pcieport 0000:00:1b.0: AER: Root Port link has been reset (0)
Oct 13 16:44:47 pve2 kernel: pcieport 0000:00:1b.0: AER: device recovery failed

And

Code:
Oct 13 22:02:00 pve2 kernel: pcieport 0000:00:1b.0: AER: Uncorrected (Fatal) error received: 0000:03:00.0
Oct 13 22:02:00 pve2 kernel: pcieport 0000:03:00.0: AER: PCIe Bus Error: severity=Uncorrected (Fatal), type=Inaccessible, (Unregistered Agent ID)
Oct 13 22:02:00 pve2 kernel: thunderbolt 0000:05:00.0: AER: can't recover (no error_detected callback)
Oct 13 22:02:00 pve2 kernel: xhci_hcd 0000:39:00.0: AER: can't recover (no error_detected callback)
Oct 13 22:02:00 pve2 kernel: pcieport 0000:00:1b.0: AER: Root Port link has been reset (0)

The failure is coming from the Intel Thunderbolt 4 related PCI devices:

Code:
00:00.0 Host bridge: Intel Corporation 12th Gen Core Processor Host Bridge/DRAM Registers (rev 02)
00:01.0 PCI bridge: Intel Corporation 12th Gen Core Processor PCI Express x16 Controller #1 (rev 02)
00:02.0 VGA compatible controller: Intel Corporation AlderLake-S GT1 (rev 0c)
00:06.0 PCI bridge: Intel Corporation 12th Gen Core Processor PCI Express x4 Controller #0 (rev 02)
00:08.0 System peripheral: Intel Corporation 12th Gen Core Processor Gaussian & Neural Accelerator (rev 02)
00:14.0 USB controller: Intel Corporation Alder Lake-S PCH USB 3.2 Gen 2x2 XHCI Controller (rev 11)
00:14.2 RAM memory: Intel Corporation Alder Lake-S PCH Shared SRAM (rev 11)
00:14.3 Network controller: Intel Corporation Alder Lake-S PCH CNVi WiFi (rev 11)
00:16.0 Communication controller: Intel Corporation Alder Lake-S PCH HECI Controller #1 (rev 11)
00:17.0 SATA controller: Intel Corporation Alder Lake-S PCH SATA Controller [AHCI Mode] (rev 11)
00:1b.0 PCI bridge: Intel Corporation Device 7ac4 (rev 11)
00:1c.0 PCI bridge: Intel Corporation Device 7abe (rev 11)
00:1d.0 PCI bridge: Intel Corporation Alder Lake-S PCH PCI Express Root Port #9 (rev 11)
00:1d.4 PCI bridge: Intel Corporation Alder Lake-S PCH PCI Express Root Port #13 (rev 11)
00:1f.0 ISA bridge: Intel Corporation Z690 Chipset LPC/eSPI Controller (rev 11)
00:1f.3 Audio device: Intel Corporation Alder Lake-S HD Audio Controller (rev 11)
00:1f.4 SMBus: Intel Corporation Alder Lake-S PCH SMBus Controller (rev 11)
00:1f.5 Serial bus controller: Intel Corporation Alder Lake-S PCH SPI Controller (rev 11)
01:00.0 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
02:00.0 Non-Volatile memory controller: Sandisk Corp Western Digital WD Black SN850X NVMe SSD (rev 01)
03:00.0 PCI bridge: Intel Corporation Thunderbolt 4 Bridge [Maple Ridge 4C 2020] (rev 02)
04:00.0 PCI bridge: Intel Corporation Thunderbolt 4 Bridge [Maple Ridge 4C 2020] (rev 02)
04:01.0 PCI bridge: Intel Corporation Thunderbolt 4 Bridge [Maple Ridge 4C 2020] (rev 02)
04:02.0 PCI bridge: Intel Corporation Thunderbolt 4 Bridge [Maple Ridge 4C 2020] (rev 02)
04:03.0 PCI bridge: Intel Corporation Thunderbolt 4 Bridge [Maple Ridge 4C 2020] (rev 02)
05:00.0 USB controller: Intel Corporation Thunderbolt 4 NHI [Maple Ridge 4C 2020]
06:00.0 PCI bridge: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02)
07:01.0 PCI bridge: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02)
07:04.0 PCI bridge: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02)
08:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
08:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
39:00.0 USB controller: Intel Corporation Thunderbolt 4 USB Controller [Maple Ridge 4C 2020]
6d:00.0 Ethernet controller: Intel Corporation Ethernet Controller I225-LM (rev 03)
6e:00.0 Ethernet controller: Aquantia Corp. AQC113C NBase-T/IEEE 802.3bz Ethernet Controller [AQtion] (rev 03)
6f:00.0 Non-Volatile memory controller: Sandisk Corp Western Digital WD Black SN850X NVMe SSD (rev 01)

I’ve tried a number of different kernel boot options, but none of them has resolved the PCI bus errors:

Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt i915.enable_guc=3 i915.max_vfs=7"
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt i915.enable_guc=3 i915.max_vfs=7 nvidia-drm.modeset=0 pcie_aspm=off"
GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=off pci=noaer intel_iommu=on iommu=pt i915.enable_guc=3 i915.max_vfs=7 nvidia-drm.modeset=0 intel_idle.max_cstate=1 pcie_ports=native pci=assign-busses,hpbussize=0x40,realloc,hpmmiosize=128M,hpmmioprefsize=4G"
GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=off pci=noaer intel_iommu=on iommu=pt i915.enable_guc=3 i915.max_vfs=7 nvidia-drm.modeset=0 intel_idle.max_cstate=1 pcie_ports=native pci=nommconf,assign-busses,hpbussize=0x40,realloc,hpmmiosize=128M,hpmmioprefsize=4G"
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt i915.enable_guc=3 i915.max_vfs=7 nvidia-drm.modeset=0 pcie_aspm=off"
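
When cycling through this many variants, it can help to confirm which options actually reached the running kernel. The following is a minimal Python sketch for that, assuming the command line is read from /proc/cmdline; `parse_cmdline` and `pci_options` are hypothetical helper names, not part of any Proxmox tooling. Note that `pci=noaer` turns off AER itself, so with it set the log may go quiet without the link being any healthier.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {key: [values]}.

    Flags without "=" (e.g. "quiet") get a value of None. Values are
    kept in a list because a key such as "pci" may appear more than once.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params.setdefault(key, []).append(value if sep else None)
    return params


def pci_options(params):
    """Flatten every comma-separated "pci=" value into one list."""
    options = []
    for value in params.get("pci", []):
        if value:
            options.extend(value.split(","))
    return options


# Example use on a live system (not run here):
#   with open("/proc/cmdline") as f:
#       params = parse_cmdline(f.read())
#   print(params.get("pcie_aspm"), pci_options(params))
```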

I’m running the 6.2.16 Linux kernel:

Code:
Linux pve2 6.2.16-15-pve #1 SMP PREEMPT_DYNAMIC PMX 6.2.16-15 (2023-09-28T13:53Z) x86_64 GNU/Linux

I have ASPM turned off in the BIOS (as well as the kernel options).

Have you seen this issue before, and do you have any suggestions for resolving it?



Thanks!

-Alan
 
0000:3a:01.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge DD 2018] (rev 06)
0000:3a:02.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge DD 2018] (rev 06)
0000:3a:04.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge DD 2018] (rev 06)

Same error with the JHL7540.
 
FWIW, same on NUC7s with a USB3 Ethernet adapter plugged into the USB/Thunderbolt port.

Code:
Mar 03 23:48:56 pve11 kernel: pcieport 0000:00:1c.4: AER: Corrected error received: 0000:03:00.0
Mar 03 23:48:56 pve11 kernel: pcieport 0000:03:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Mar 03 23:48:56 pve11 kernel: pcieport 0000:03:00.0:   device [8086:1576] error status/mask=00000080/00002000
Mar 03 23:48:56 pve11 kernel: pcieport 0000:03:00.0:    [ 7] BadDLLP


Thunderbolt 3 bus
Code:
03:00.0 PCI bridge [0604]: Intel Corporation DSL6340 Thunderbolt 3 Bridge [Alpine Ridge 2C 2015] [8086:1576]
04:00.0 PCI bridge [0604]: Intel Corporation DSL6340 Thunderbolt 3 Bridge [Alpine Ridge 2C 2015] [8086:1576]
04:01.0 PCI bridge [0604]: Intel Corporation DSL6340 Thunderbolt 3 Bridge [Alpine Ridge 2C 2015] [8086:1576]
04:02.0 PCI bridge [0604]: Intel Corporation DSL6340 Thunderbolt 3 Bridge [Alpine Ridge 2C 2015] [8086:1576]
 
Install Windows on the PC, change the Thunderbolt cable (!), update the Thunderbolt firmware, set the BIOS to legacy, disconnect the modem, then install the Proxmox ISO and remove iommu=pt.
Overall, Thunderbolt is a bad choice for networking; many Thunderbolt units do not work well or offer only half of their functions.
 
(From random internet searches.) Adding

Code:
pci=nommconf

to /etc/default/grub, then running update-grub and rebooting, stops the errors but breaks IOMMU remapping.

It does NOT fix the problem.

However, THIS:

Code:
pcie_aspm=off

DOES appear to solve the problem.
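
To check whether ASPM really ended up disabled on each link, the LnkCtl lines of `lspci -vv` can be inspected. Below is a minimal Python sketch that pulls the ASPM state per device out of that output. It assumes the usual lspci layout (an unindented device header line followed by indented capability lines containing "LnkCtl: ASPM ...;"), and `aspm_states` is a made-up helper name.

```python
import re

# Unindented header line such as "03:00.0 PCI bridge: ..." with an
# optional "0000:" domain prefix.
DEVICE_RE = re.compile(
    r"^((?:[0-9a-fA-F]{4}:)?[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])\s"
)
# Matches e.g. "LnkCtl: ASPM Disabled;" but not "LnkCtl2:" lines.
ASPM_RE = re.compile(r"LnkCtl:\s*ASPM ([^;]+);")


def aspm_states(lspci_vv_output):
    """Map each PCI address to the ASPM state text from its LnkCtl line.

    Devices without a LnkCtl line (or whose capabilities were not
    readable) are left out of the result.
    """
    states = {}
    current = None
    for line in lspci_vv_output.splitlines():
        header = DEVICE_RE.match(line)
        if header:
            current = header.group(1)
            continue
        found = ASPM_RE.search(line)
        if found and current is not None:
            states[current] = found.group(1).strip()
    return states
```

Run `lspci -vv` as root so the capability lines are readable, then pass its output to `aspm_states`. Any entry other than "Disabled" on the Thunderbolt bridges would suggest the setting did not take effect on that link.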