ThunderBolt GPU keeps disconnecting every 16 seconds!

rudydevolder

New Member
Nov 28, 2023
Philippines
I have an AMD RX 550 GPU in a Thunderbolt dock that is tested to work perfectly with my MinisForum MS-01 (Intel 12900H) under W11 on bare metal.
I have an Erying 13900H D5 motherboard with 64 GB RAM that has been working flawlessly with Proxmox for the past 3 months.
I have upgraded my Proxmox VE to the latest version, 8.2.4.

When I try to pass through my AMD GPU in the Thunderbolt dock, it is recognized and I can start Windows 11 and use it for about 5 to 15 seconds, but then the screen and the cursor freeze. I observed the following on the console screen and in the journal:

Jul 18 12:40:58 pve kernel: thunderbolt 0-1: device disconnected
Jul 18 12:40:58 pve kernel: pcieport 0000:0a:00.0: ready 1023ms after resume
Jul 18 12:40:58 pve kernel: pcieport 0000:00:07.0: PME: Spurious native interrupt!
Jul 18 12:41:00 pve kernel: pci_bus 0000:20: Allocating resources
Jul 18 12:41:04 pve kernel: thunderbolt 0-1: new device found, vendor=0x8086 device=0x2
Jul 18 12:41:04 pve kernel: thunderbolt 0-1: Intel Tamales Module 2
Jul 18 12:41:24 pve kernel: thunderbolt 0-1: device disconnected
Jul 18 12:41:24 pve kernel: pcieport 0000:00:07.0: PME: Spurious native interrupt!
Jul 18 12:41:25 pve kernel: pci_bus 0000:20: Allocating resources
Jul 18 12:41:30 pve kernel: thunderbolt 0-1: new device found, vendor=0x8086 device=0x2
Jul 18 12:41:30 pve kernel: thunderbolt 0-1: Intel Tamales Module 2
Jul 18 12:41:50 pve kernel: thunderbolt 0-1: device disconnected
Jul 18 12:41:50 pve kernel: pcieport 0000:00:07.0: PME: Spurious native interrupt!
Jul 18 12:41:51 pve kernel: pci_bus 0000:20: Allocating resources
Jul 18 12:41:56 pve kernel: thunderbolt 0-1: new device found, vendor=0x8086 device=0x2
Jul 18 12:41:56 pve kernel: thunderbolt 0-1: Intel Tamales Module 2
Jul 18 12:42:16 pve kernel: thunderbolt 0-1: device disconnected
Jul 18 12:42:16 pve kernel: pcieport 0000:0a:00.0: ready 1023ms after resume
Jul 18 12:42:16 pve kernel: pcieport 0000:00:07.0: PME: Spurious native interrupt!
Jul 18 12:42:17 pve kernel: pci_bus 0000:20: Allocating resources
Jul 18 12:42:22 pve kernel: thunderbolt 0-1: new device found, vendor=0x8086 device=0x2
Jul 18 12:42:22 pve kernel: thunderbolt 0-1: Intel Tamales Module 2


I don't even need to start my W11 virtual machine; it keeps doing this until I disconnect my Thunderbolt GPU.
I followed all the instructions for blacklisting and have some more outputs below. If someone can make something of these or give some tips to try out, I would be very thankful.
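To put a number on the disconnect cycle, the gaps between the "device disconnected" lines can be computed from the journal. This is a minimal sketch: the function name disconnect_gaps is made up for illustration, and it assumes the "Mon DD HH:MM:SS host ..." layout shown in the excerpt above, with all events on the same day.

```shell
# Print the gap in seconds between consecutive Thunderbolt
# "device disconnected" lines read from stdin.
# Assumes field 3 is the HH:MM:SS timestamp and no midnight rollover.
disconnect_gaps() {
    awk '/thunderbolt .*device disconnected/ {
        split($3, t, ":")
        now = t[1] * 3600 + t[2] * 60 + t[3]
        if (seen) print now - prev
        prev = now
        seen = 1
    }'
}
```

Usage would be something like `journalctl -k -b | disconnect_gaps`. Fed the four disconnect lines from the excerpt above, it prints 26 for each gap.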

#cat /etc/modules =>
Code:
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd
thunderbolt
#cat pve-blacklist.conf =>
Code:
blacklist amdgpu
blacklist radeon
blacklist nouveau
blacklist nvidia
blacklist nvidiafb
blacklist nvidia_drm
#cat /etc/modprobe.d/vfio.conf =>
Code:
options vfio-pci ids=1002:699f,1002:aae0,8086:15ef,8086:15f0 disable_vga=0
#0c:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Lexa PRO [Radeon 540/540X/550/550X / RX 540X/550/550X] [1002:699f] (rev c7)
#0c:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X] [1002:aae0]
#0a:00.0 PCI bridge [0604]: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge DD 2018] [8086:15ef] (rev 06)
#20:00.0 USB controller [0c03]: Intel Corporation JHL7540 Thunderbolt 3 USB Controller [Titan Ridge DD 2018] [8086:15f0] (rev 06)
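
Whether the ids= line above actually took effect can be checked per device by looking at the "Kernel driver in use" line that lspci -k prints. A minimal sketch follows; the helper name driver_in_use is hypothetical, and it only filters lspci output passed on stdin.

```shell
# Extract the driver name from `lspci -k` output read on stdin.
# Prints nothing if no driver is bound to the device.
driver_in_use() {
    awk -F': ' '/Kernel driver in use:/ {print $2}'
}
```

For the GPU listed above this would be called as `lspci -k -s 0c:00.0 | driver_in_use`, and the expectation (an assumption, not something shown in my outputs) is that it prints vfio-pci when the binding worked.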


Outputs:
pvesh get /nodes/pve/hardware/pci --pci-class-blacklist ""
=> output in attachment
dmesg | grep 'remapping' =>
Code:
[    0.166898] DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
[    0.167778] DMAR-IR: Enabled IRQ remapping in x2apic mode
dmesg | grep -e DMAR -e IOMMU =>
Code:
[    0.012916] ACPI: DMAR 0x0000000030E64000 000088 (v02 INTEL  EDK2     00000002      01000013)
[    0.012937] ACPI: Reserving DMAR table memory at [mem 0x30e64000-0x30e64087]
[    0.166879] DMAR: Host address width 39
[    0.166880] DMAR: DRHD base: 0x000000fed90000 flags: 0x0
[    0.166885] DMAR: dmar0: reg_base_addr fed90000 ver 4:0 cap 1c0000c40660462 ecap 29a00f0505e
[    0.166888] DMAR: DRHD base: 0x000000fed91000 flags: 0x1
[    0.166891] DMAR: dmar1: reg_base_addr fed91000 ver 5:0 cap d2008c40660462 ecap f050da
[    0.166893] DMAR: RMRR base: 0x0000003b000000 end: 0x0000003f7fffff
[    0.166896] DMAR-IR: IOAPIC id 2 under DRHD base  0xfed91000 IOMMU 1
[    0.166897] DMAR-IR: HPET id 0 under DRHD base 0xfed91000
[    0.166898] DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
[    0.167778] DMAR-IR: Enabled IRQ remapping in x2apic mode
[    0.474430] pci 0000:00:02.0: DMAR: Skip IOMMU disabling for graphics
[    1.676788] DMAR: No ATSR found
[    1.676789] DMAR: No SATC found
[    1.676789] DMAR: IOMMU feature fl1gp_support inconsistent
[    1.676790] DMAR: IOMMU feature pgsel_inv inconsistent
[    1.676791] DMAR: IOMMU feature nwfs inconsistent
[    1.676792] DMAR: IOMMU feature dit inconsistent
[    1.676792] DMAR: IOMMU feature sc_support inconsistent
[    1.676793] DMAR: IOMMU feature dev_iotlb_support inconsistent
[    1.676794] DMAR: dmar0: Using Queued invalidation
[    1.676796] DMAR: dmar1: Using Queued invalidation
[    1.679016] DMAR: Intel(R) Virtualization Technology for Directed I/O

What bothers me in this output:
pci 0000:00:02.0: DMAR: Skip IOMMU disabling for graphics
I cannot find iommu=on in the output, but according to this article:

https://vfio.blogspot.com/2016/09/intel-iommu-enabled-it-doesnt-mean-what.html

I should not look for that, but for the last line in my output, which says: DMAR: Intel(R) Virtualization Technology for Directed I/O
Question:
Can someone confirm that my IOMMU is actually active?
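
A commonly suggested cross-check for this question is to see whether the kernel has created any IOMMU groups in sysfs. Below is a minimal sketch; the function name iommu_group_count is made up for illustration, and treating a non-zero count as "active" is a rule of thumb, not a guarantee that passthrough will work.

```shell
# Count IOMMU groups under a sysfs root (default: /sys/kernel/iommu_groups).
# A count above zero suggests the IOMMU is active and devices were grouped.
iommu_group_count() {
    root="${1:-/sys/kernel/iommu_groups}"
    if [ -d "$root" ]; then
        find "$root" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' '
    else
        echo 0
    fi
}
```

Running `iommu_group_count` without arguments on the host reads the default path; the optional argument exists only so the function can be tried against a test directory.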

cat /proc/modules | grep pci =>
Code:
snd_sof_pci_intel_tgl 12288 0 - Live 0xffffffffc1466000
snd_sof_intel_hda_common 208896 1 snd_sof_pci_intel_tgl, Live 0xffffffffc1a55000
snd_sof_pci 24576 2 snd_sof_pci_intel_tgl,snd_sof_intel_hda_common, Live 0xffffffffc1a76000
snd_sof 360448 3 snd_sof_intel_hda_common,snd_sof_intel_hda,snd_sof_pci, Live 0xffffffffc19f9000
snd_soc_acpi_intel_match 102400 2 snd_sof_pci_intel_tgl,snd_sof_intel_hda_common, Live 0xffffffffc128d000
vfio_pci 16384 1 - Live 0xffffffffc0d7d000
vfio_pci_core 86016 1 vfio_pci, Live 0xffffffffc0dfa000
irqbypass 12288 3 kvm,vfio_pci_core, Live 0xffffffffc0df2000
vfio 69632 7 vfio_pci,vfio_pci_core,vfio_iommu_type1, Live 0xffffffffc0dc0000
xhci_pci 24576 0 - Live 0xffffffffc02ec000
xhci_pci_renesas 16384 1 xhci_pci, Live 0xffffffffc0279000
xhci_hcd 364544 1 xhci_pci, Live 0xffffffffc041c000
intel_lpss_pci 24576 0 - Live 0xffffffffc04ed000
spi_intel_pci 12288 0 - Live 0xffffffffc03bd000
intel_lpss 12288 1 intel_lpss_pci, Live 0xffffffffc02e6000
spi_intel 32768 1 spi_intel_pci, Live 0xffffffffc02d7000



Some things I already tried:


1. Also blacklisted the Thunderbolt controller:


2. I tried a tip that said to update /etc/kernel/cmdline and add pcie_aspm=off to disable active state power management. => To no avail


3. Still on my to-do list: I am waiting for the OnexGPU I ordered. I am going to test it first on my Minisforum MS-01 while it still runs W11 on bare metal, before converting that machine to a Proxmox server as well. Maybe it's a motherboard or BIOS problem.

4, 5, 6... Any thoughts, anyone?
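
Regarding tip 2 above: it is worth confirming that pcie_aspm=off really reached the running kernel. As far as I understand the Proxmox documentation, /etc/kernel/cmdline is only used on hosts that boot via systemd-boot and needs `proxmox-boot-tool refresh` plus a reboot, while GRUB hosts use /etc/default/grub and `update-grub`. The check below is a minimal sketch; the function name has_kernel_param is hypothetical.

```shell
# Succeed if the exact parameter $1 appears on the kernel command line.
# $2 defaults to /proc/cmdline, the command line of the running kernel.
has_kernel_param() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}
```

Usage would be `has_kernel_param pcie_aspm=off && echo "ASPM is disabled on the command line"`.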
 

Attachments

  • pvesh --pci-class-blacklist.png

One step closer: no more Thunderbolt disconnects every 16 seconds, but my problem with the W11 VM freezing is still present.

I am not sure what did the trick; I think it was setting disable_vga=1 in the vfio.conf file.
But my problem is not solved yet. I don't see any error messages in journalctl, so I wonder why my VM works for a while and then freezes, after which I can only force-stop it.
 
