my experience with proxmox + thunderbolt eGPU

drmatt · Wednesday at 03:54

Hi, I wanted to post information here about how I got my eGPU passthrough setup (nearly) working, for the possible benefit of others who may have a similar situation. Also hoping to tap into the community's experience and get ideas on how to improve stability.

Hardware:
Thinkpad P16 gen 1 laptop host
Thinkpad Thunderbolt 4 workstation dock (attached to laptop TB4 port)
GMKtec AD-GP1 eGPU dock with AMD Radeon RX 7600M XT (attached to dock TB4 port)
Dual monitors (attached to DisplayPort outputs of eGPU)

Software:
PVE 8.4.1
Kernel 6.8.12-10
Windows 11 VM

The Thinkpad dock exposes more ports than the laptop itself, including wired Ethernet. It also has a single Thunderbolt 4 connector, to which I'm attaching the eGPU. The host BIOS has only one setting related to Thunderbolt 4, "enable PCIe tunneling", which is turned on by default. If it is disabled, some of the dock devices, such as the NIC, do not show up on the host at all, so I leave it on. Interestingly, with either setting, running boltctl list --all shows only the dock, not the other attached devices like the eGPU.

With PCIe tunneling enabled, the eGPU video and audio devices, as well as a few other dock functions, are listed in lspci. However, with my initial GRUB settings, the NIC no longer worked: the host did not see the device and there was no external network access. I suspect a conflict or timing issue at startup; adding pci=realloc,assign-busses got both the eGPU and the NIC recognized. It also renumbered a few of the PCI device addresses.

Code:

GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt pci=realloc,assign-busses"
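
The new command line only takes effect after the boot configuration is regenerated and the host is rebooted. Below is a minimal sketch for confirming that the parameters actually reached the running kernel. It assumes a GRUB-booted host (PVE installs that boot through systemd-boot use /etc/kernel/cmdline and proxmox-boot-tool refresh instead), and missing_params is a hypothetical helper name, not a PVE tool.

```shell
#!/bin/sh
# Hypothetical helper: print each expected kernel parameter that is NOT
# present in the given command line. Prints nothing if all are present.
missing_params() {
    cmdline=$1
    shift
    for want in "$@"; do
        case " $cmdline " in
            *" $want "*) ;;                 # found as a whole word
            *) printf '%s\n' "$want" ;;     # not found
        esac
    done
}

# Typical use on the host (as root), after editing /etc/default/grub:
#   update-grub && reboot
#   missing_params "$(cat /proc/cmdline)" intel_iommu=on iommu=pt pci=realloc,assign-busses
```

Empty output means every listed parameter is active.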

Now the eGPU was visible and I could successfully pass it through to the Windows 11 VM. The AMD device was listed in Windows guest device manager under Display Adapters, and I was able to update to the latest AMD driver, but the device showed the dreaded Error 43 and there was no monitor output.

I needed to blacklist the GPU drivers on the host so that the host didn't claim the card first and cause multiple resets when it was passed through to the guest. I added the following to /etc/modprobe.d/pci-blacklist.conf:

Code:

blacklist amdgpu
blacklist radeon
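
On Debian-based hosts a blacklist like this normally also has to be written into the initramfs before rebooting, and afterwards it is worth confirming which driver actually holds the card. Here is a small sketch, where bound_driver is a hypothetical helper that reads lspci -k output. The address 11:00.0 is the one the passed-through device appears under in the logs further down; substitute your own.

```shell
#!/bin/sh
# Hypothetical helper: read `lspci -k` output on stdin and print the
# kernel driver currently bound to the device, if any.
bound_driver() {
    sed -n 's/^[[:space:]]*Kernel driver in use: //p'
}

# Typical use on the host (as root):
#   update-initramfs -u -k all && reboot
#   lspci -nnk -s 11:00.0 | bound_driver    # expect vfio-pci (or nothing), not amdgpu
```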

After this change and host reboot, the Windows guest loaded the driver successfully - no more Error 43, but still no monitor outputs, only the AMD vDisplay adapter which allows RDP connections. I found this post very helpful - I needed to hide virtualization from the AMD driver for it to output to any physical displays, so I added hidden=1 and kvm=0 flags to the VM config. After that, Windows started displaying on the dual monitors.
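
For reference, a sketch of how the hiding flag could be set and checked from the host shell. It assumes the PVE cpu option hidden=1 (which, as far as I understand, is what makes QEMU pass kvm=off for the guest CPU), a CPU type of host, and VM ID 100. The CPU type is an assumption, and cpu_is_hidden is a hypothetical helper that reads qm config output.

```shell
#!/bin/sh
# Hypothetical helper: read `qm config <vmid>` output on stdin and succeed
# only if the cpu line carries the hidden=1 option.
cpu_is_hidden() {
    grep -Eq '^cpu:[[:space:]]*([^,]+,)*hidden=1(,|$)'
}

# Typical use on the host (as root); "host" as CPU type is an assumption:
#   qm set 100 --cpu host,hidden=1
#   qm config 100 | cpu_is_hidden && echo "KVM signature hidden"
```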

The next issue was stability of the display. Many windows were "twitchy" - sporadically warping their contents for an instant and then going back to normal. This seemed to mostly affect individual windows rather than the entire desktop, and was more apparent when moving windows around, but it also occurred on its own. The displayed image was mostly stable, but these spasms happened frequently enough to make it nearly impossible to work on code or analysis.

To check whether the hardware was faulty, I plugged the eGPU into a Surface laptop running native Windows 11. It was detected right away and the monitor display was rock solid, so no obvious hardware issues. I did notice that the AMD driver recognized that it was a Thunderbolt-connected GPU, which did not happen in the Windows VM. That made me wonder whether I should pass through the entire Thunderbolt port / bridge instead of the GPU itself.

I tried this: I removed the GPU from the VM's PCI devices and added the Thunderbolt NHI and the Thunderbolt USB controller. The GPU then didn't show up in the VM at all, but when I added it back in alongside the Thunderbolt devices, the AMD driver recognized the Thunderbolt connection as it did on the native Windows laptop, and the display was now rock solid in the VM too.
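
Since VFIO hands devices to a guest by IOMMU group, it seems worth checking what else shares a group with the NHI or the GPU before passing them through. A minimal sketch, assuming the standard sysfs layout under /sys/kernel/iommu_groups; list_iommu_groups is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print "group <n>: <pci address>" for every device,
# so it is visible what else shares a group with the NHI or the GPU.
# The optional argument overrides the sysfs root (useful for testing).
list_iommu_groups() {
    root=${1:-/sys/kernel/iommu_groups}
    for dev in "$root"/*/devices/*; do
        [ -e "$dev" ] || continue          # no groups found: IOMMU not active
        group=${dev%/devices/*}            # strip "/devices/<addr>"
        printf 'group %s: %s\n' "${group##*/}" "${dev##*/}"
    done
}

# Typical use on the host:
#   list_iommu_groups | sort -V
```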

Great! All is well, everything runs very smoothly. Except now, after some amount of normal VM usage (usually tens of minutes), the screen goes black, the VM crashes, and the entire host system locks up due to PCI-related errors. Some of the dmesg / journalctl messages I see:

At startup

Code:

QEMU[7588]: kvm: VFIO_MAP_DMA failed: Invalid argument
QEMU[7588]: kvm: vfio_container_dma_map(0x61e87a82f800, 0x380a00000000, 0x10000000, 0x78fd20000000) = -22 (Invalid argument)
QEMU[7588]: kvm: VFIO_MAP_DMA failed: Invalid argument
QEMU[7588]: kvm: vfio_container_dma_map(0x61e87a82f800, 0x382000000000, 0x40000, 0x79078ba00000) = -22 (Invalid argument)

Periodically

Code:

kernel: pcieport 0000:0e:01.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Receiver ID)
kernel: pcieport 0000:0e:01.0:   device [8086:15ef] error status/mask=000000c0/00002000
kernel: pcieport 0000:0e:01.0:    [ 6] BadTLP
kernel: pcieport 0000:0e:01.0:    [ 7] BadDLLP

When the crash happens, I start to see these:

Code:

kernel: pcieport 0000:00:1d.0: AER: Uncorrectable (Fatal) error message received from 0000:0e:01.0
kernel: pcieport 0000:0e:01.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Data Link Layer, (Receiver ID)
kernel: pcieport 0000:0e:01.0:   device [8086:15ef] error status/mask=00000010/00000000
kernel: pcieport 0000:0e:01.0:    [ 4] DLP                    (First)
kernel: pcieport 0000:0e:02.0: Unable to change power state from D3hot to D0, device inaccessible
kernel: igc 0000:16:00.0 eth0: PCIe link lost, device now detached
kernel: pcieport 0000:0f:00.0: not ready 1023ms after bus reset; waiting
kernel: pcieport 0000:0f:00.0: not ready 2047ms after bus reset; waiting
kernel: pcieport 0000:0f:00.0: not ready 4095ms after bus reset; waiting
kernel: pcieport 0000:0f:00.0: not ready 8191ms after bus reset; waiting
kernel: pcieport 0000:0f:00.0: not ready 16383ms after bus reset; waiting
kernel: watchdog: BUG: soft lockup - CPU#20 stuck for 26s! [kworker/20:0:188898]
kernel: Modules linked in: tcp_diag inet_diag dm_snapshot rpcsec_gss_krb5 nfsv4 nfs netfs vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd nfsd auth_rpcgss nfs_acl lockd grace veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter nf_tables bonding tls qrtr sunrpc nfnetlink_log nfnetlink binfmt_misc hid_logitech_hidpp hid_logitech_dj usbkbd snd_usb_audio snd_usbmidi_lib wacom snd_ump snd_rawmidi snd_seq_device usbmouse xe drm_gpuvm drm_exec gpu_sched drm_suballoc_helper drm_ttm_helper intel_uncore_frequency intel_uncore_frequency_common intel_tcc_cooling x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm iwlmvm irqbypass snd_hda_codec_hdmi crct10dif_pclmul polyval_clmulni polyval_generic mac80211 ghash_clmulni_intel sha256_ssse3 btusb sha1_ssse3 btrtl aesni_intel btintel crypto_simd btbcm snd_hda_intel cryptd btmtk thinkpad_acpi libarc4 i915 uvcvideo snd_intel_dspcfg snd_intel_sdw_acpi videobuf2_vmalloc uvc snd_hda_codec uas videobuf2_memops
kernel:  videobuf2_v4l2 usb_storage videodev snd_hda_core snd_hwdep drm_buddy videobuf2_common processor_thermal_device_pci intel_rapl_msr snd_pcm nvram bluetooth mc processor_thermal_device ttm snd_timer rapl processor_thermal_wt_hint ecdh_generic nxp_nci_i2c iwlwifi ecc drm_display_helper nxp_nci processor_thermal_rfim spi_nor snd nci processor_thermal_rapl think_lmi ledtrig_audio intel_cstate cec firmware_attributes_class mtd intel_rapl_common pcspkr wmi_bmof soundcore platform_profile nfc int3403_thermal processor_thermal_wt_req ucsi_acpi cfg80211 mei_me intel_pmc_core rc_core processor_thermal_power_floor typec_ucsi processor_thermal_mbox i2c_algo_bit intel_hid mei typec int340x_thermal_zone int3400_thermal intel_vsec pmt_telemetry sparse_keymap input_leds acpi_thermal_rel pmt_class acpi_pad acpi_tad joydev serio_raw mac_hid zfs(PO) spl(O) vhost_net vhost vhost_iotlb tap msr efi_pstore dmi_sysfs ip_tables x_tables autofs4 usbhid btrfs blake2b_generic xor raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison
kernel:  dm_bufio libcrc32c hid_multitouch hid_generic rtsx_pci_sdmmc xhci_pci nvme spi_intel_pci i2c_i801 xhci_pci_renesas crc32_pclmul psmouse video thunderbolt i2c_smbus igc spi_intel nvme_core xhci_hcd intel_lpss_pci rtsx_pci i2c_hid_acpi intel_lpss i2c_hid nvme_auth idma64 hid wmi pinctrl_alderlake
kernel: CPU: 20 PID: 188898 Comm: kworker/20:0 Tainted: P           O       6.8.12-10-pve #1
kernel: Hardware name: LENOVO 21D6CTO1WW/21D6CTO1WW, BIOS N3FET43W (1.28 ) 07/02/2024
kernel: Workqueue: pm pm_runtime_work
kernel: RIP: 0010:pci_mmcfg_read+0xcb/0x110
kernel: Code: 45 31 c9 c3 cc cc cc cc 4c 01 e8 66 8b 00 0f b7 c0 41 89 04 24 eb c9 4c 01 e8 8a 00 0f b6 c0 41 89 04 24 eb bb 4c 01 e8 8b 00 <41> 89 04 24 eb b0 e8 da 58 11 ff 41 c7 04 24 ff ff ff ff 48 83 c4
kernel: RSP: 0018:ffffb95fc603fbb8 EFLAGS: 00000286
kernel: RAX: 00000000ffffffff RBX: 0000000000e10000 RCX: 0000000000000ffc
kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
kernel: RBP: ffffb95fc603fbe8 R08: 0000000000000004 R09: ffffb95fc603fc0c
kernel: R10: 000000000000000e R11: ffffffff890a2600 R12: ffffb95fc603fc0c
kernel: R13: 0000000000000ffc R14: 0000000000010000 R15: 0000000000000004
kernel: FS:  0000000000000000(0000) GS:ffff92e96f600000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00003ac800f2c080 CR3: 0000001b50436000 CR4: 0000000000f52ef0
kernel: PKRU: 55555554
kernel: Call Trace:
kernel:  <IRQ>
kernel:  ? show_regs+0x6d/0x80
kernel:  ? watchdog_timer_fn+0x206/0x290
kernel:  ? __pfx_watchdog_timer_fn+0x10/0x10
kernel:  ? __hrtimer_run_queues+0x105/0x280
kernel:  ? clockevents_program_event+0xb3/0x140
kernel:  ? hrtimer_interrupt+0xf6/0x250
kernel:  ? __sysvec_apic_timer_interrupt+0x4e/0x120
kernel:  ? sysvec_apic_timer_interrupt+0x8d/0xd0
kernel:  </IRQ>
kernel:  <TASK>
kernel:  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
kernel:  ? __pfx_pci_mmcfg_read+0x10/0x10
kernel:  ? pci_mmcfg_read+0xcb/0x110
kernel:  ? pci_mmcfg_read+0x52/0x110
kernel:  pci_read+0x52/0x90
kernel:  pci_bus_read_config_dword+0x47/0x90
kernel:  pci_read_config_dword+0x27/0x50
kernel:  pci_find_next_ext_capability+0x83/0xe0
kernel:  ? __pfx_pci_pm_runtime_resume+0x10/0x10
kernel:  pci_find_ext_capability+0x12/0x20
kernel:  pci_restore_vc_state+0x29/0x80
kernel:  pci_restore_state.part.0+0x11f/0x3a0
kernel:  ? __pfx_pci_pm_runtime_resume+0x10/0x10
kernel:  pci_restore_state+0x1e/0x30
kernel:  pci_pm_runtime_resume+0x46/0x100
kernel:  __rpm_callback+0x4d/0x170
kernel:  rpm_callback+0x6d/0x80
kernel:  ? __pfx_pci_pm_runtime_resume+0x10/0x10
kernel:  rpm_resume+0x594/0x7e0
kernel:  pm_runtime_work+0x80/0xe0
kernel:  process_one_work+0x17f/0x3a0
kernel:  worker_thread+0x306/0x440
kernel:  ? __pfx_worker_thread+0x10/0x10
kernel:  kthread+0xef/0x120
kernel:  ? __pfx_kthread+0x10/0x10
kernel:  ret_from_fork+0x44/0x70
kernel:  ? __pfx_kthread+0x10/0x10
kernel:  ret_from_fork_asm+0x1b/0x30
kernel:  </TASK>
kernel: pcieport 0000:0f:00.0: not ready 32767ms after bus reset; waiting
kernel: pcieport 0000:0e:04.0: Unable to change power state from D3hot to D0, device inaccessible
kernel: pcieport 0000:0f:00.0: not ready 65535ms after bus reset; giving up
kernel: pcieport 0000:0e:01.0: AER: Downstream Port link has been reset (-25)
kernel: pcieport 0000:0e:01.0: AER: subordinate device reset failed
kernel: pcieport 0000:0e:01.0: AER: device recovery failed
QEMU[7723]: kvm: vfio_err_notifier_handler(0000:11:00.0) Unrecoverable error detected. Please collect any data possible and then kill the guest
systemd-udevd[689]: 0000:11:00.0: Worker [190114] processing SEQNUM=9635 is taking a long time
QEMU[174026]: kvm: warning: Spice: main:0 (0x6274f5b31d50): rcc 0x6274f75e2d90 has been unresponsive for more than 30000 ms, disconnecting
pvestatd[1571]: VM 100 qmp command failed - VM 100 qmp command 'query-proxmox-support' failed - got timeout
systemd-udevd[689]: 0000:11:00.1: Worker [190115] processing SEQNUM=9636 is taking a long time
dhclient[1282]: DHCPREQUEST for 10.10.10.100 on vmbr0 to 10.10.10.1 port 67
QEMU[7723]: kvm: vfio_err_notifier_handler(0000:11:00.1) Unrecoverable error detected. Please collect any data possible and then kill the guest
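
To track how often each port reports errors in the run-up to a crash, the AER lines can be tallied per device and severity. A minimal sketch that relies only on the message format shown above; aer_summary is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log lines on stdin and print
# "<pci address> <severity> <count>" for every AER "PCIe Bus Error" line.
aer_summary() {
    sed -n 's/.* \([0-9a-f:.]*\): PCIe Bus Error: severity=\([A-Za-z]*\).*/\1 \2/p' |
        sort | uniq -c | awk '{ print $2, $3, $1 }'
}

# Typical use on the host:
#   journalctl -k -b | aer_summary
```

A rising Correctable count on one port, like 0000:0e:01.0 here, gives a quick way to compare kernel settings without waiting for a full lockup.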

So that's my saga thus far. I've managed to work through many roadblocks along the way, but at this point I'm out of ideas - not sure how to proceed or even whether this setup ever had a chance of fully working. Any thoughts, advice, things to check, etc are welcome!

drmatt · Wednesday at 16:27

Quick follow-up: I did not need to pass through the Thunderbolt USB controller, only the NHI, for a stable display. This lets the host keep access to many USB devices that would otherwise be passed through to the VM inadvertently. The system has been running cleanly for hours now without a lockup. I'm not sure whether it will eventually crash or whether this latest change has at least partially addressed the instability.

drmatt · 2025-05-22T05:20:52+0200

Update: the VM eventually did crash again with the same PCI bus errors. Since then I've blacklisted the thunderbolt driver and, more importantly, added the pcie_aspm=off kernel parameter, and that seems to have made the system even more stable. Where before I was periodically seeing those Correctable PCI errors until the Uncorrectable ones happened, there have now been no errors at all in the several hours since reboot. So that's a good sign.

Turning off ASPM entirely is not an ideal solution though, so I'll investigate whether it's possible to just disable it for the eGPU itself and/or the Thunderbolt bridge.
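
One possible route is the per-link ASPM attributes that recent kernels expose in sysfs under each device's link/ directory. The sketch below is untested on this hardware and rests on some assumptions. As far as I understand, the attributes only appear on the device at the downstream end of a link and only while the kernel is managing ASPM, so this would replace pcie_aspm=off rather than sit alongside it. disable_aspm is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: write 0 to every per-link ASPM state attribute
# (l0s_aspm, l1_aspm, ...) that the kernel exposes for one PCI device.
# The optional second argument overrides the sysfs root (for testing).
disable_aspm() {
    bdf=$1
    root=${2:-/sys/bus/pci/devices}
    for attr in "$root/$bdf"/link/*_aspm; do
        [ -e "$attr" ] || continue      # device exposes no ASPM controls
        echo 0 > "$attr" && printf '%s: %s off\n' "$bdf" "${attr##*/}"
    done
}

# Typical use on the host (as root), with addresses taken from the logs above:
#   disable_aspm 0000:11:00.0
#   disable_aspm 0000:0f:00.0
```

These writes do not survive a reboot, so they would have to be reapplied at boot, for example from a systemd unit, before the VM starts.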
