[SOLVED] Regression in Thunderbolt connected eGPU functionality between 6.8.4-2-pve and 6.8.12-4-pve

scyto

In Proxmox installed from the ISO with kernel 6.8.4-2-pve, the nouveau NVIDIA driver loads fine.

If the kernel is upgraded to 6.8.12-4-pve, the NVIDIA driver will not load, and a D3cold error appears in dmesg.

This is an upstream kernel regression of some sort, as I hit the same issue in Ubuntu on bare metal https://forums.developer.nvidia.com...-from-d3cold-to-d0-device-inaccessible/304459 and have experienced the same regression on ZimaOS and TrueNAS 24.10.

It is useful to be able to pass through the eGPU to VMs when needed; I know this is probably a niche scenario.

For now I have worked around this by pinning the older version of the kernel.
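For anyone wanting to do the same, kernel pinning on Proxmox is handled by proxmox-boot-tool. A minimal sketch (the version string is the known-good kernel from this thread; substitute your own):

```shell
# List installed kernels and show which one, if any, is pinned
proxmox-boot-tool kernel list

# Pin the last known-good kernel so it is selected on every boot
proxmox-boot-tool kernel pin 6.8.4-2-pve

# Later, once a fixed kernel is available, remove the pin again
proxmox-boot-tool kernel unpin
```

Note that a pinned kernel also stops receiving security updates, so the pin should be removed as soon as a fixed kernel ships.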
 
In Proxmox installed from the ISO with kernel 6.8.4-2-pve, the nouveau NVIDIA driver loads fine.

If the kernel is upgraded to 6.8.12-4-pve, the NVIDIA driver will not load, and a D3cold error appears in dmesg.
Please test with more Proxmox kernel versions, like 6.8.8-4-pve and 6.8.12-1-pve (and maybe 6.8.4-3-pve, 6.8.8-1-pve, 6.8.8-2-pve, 6.8.8-3-pve, 6.8.12-2-pve, 6.8.12-3-pve) to narrow down when it broke.
 
Please test with more Proxmox kernel versions, like 6.8.8-4-pve and 6.8.12-1-pve (and maybe 6.8.4-3-pve, 6.8.8-1-pve, 6.8.8-2-pve, 6.8.8-3-pve, 6.8.12-2-pve, 6.8.12-3-pve) to narrow down when it broke.
will do
 
The regression was introduced between 6.8.4-4-pve and 6.8.8-1-pve:

Code:
6.8.4-2-pve = working
6.8.4-3-pve = working
6.8.4-4-pve = working
6.8.8-1-pve = broken
6.8.8-4-pve = broken

Hope that helps.

more info
---------

Nouveau working looks like this:

Code:
[   13.296521] nouveau 0000:0f:00.0: fb: 11264 MiB GDDR6
[   13.314694] nouveau 0000:0f:00.0: DRM: VRAM: 11264 MiB
[   13.314697] nouveau 0000:0f:00.0: DRM: GART: 536870912 MiB
[   13.314698] nouveau 0000:0f:00.0: DRM: BIT table 'A' not found
[   13.314699] nouveau 0000:0f:00.0: DRM: BIT table 'L' not found
[   13.314700] nouveau 0000:0f:00.0: DRM: TMDS table version 2.0
[   13.316334] nouveau 0000:0f:00.0: DRM: MM: using COPY for buffer copies

Nouveau broken looks like this:

Code:
[   12.943373] nouveau 0000:0f:00.0: Unable to change power state from D3cold to D0, device inaccessible
[   12.943522] nouveau 0000:0f:00.0: unknown chipset (ffffffff)

Given I have seen this on later kernels on other distros with the official NVIDIA drivers, I don't think it is a driver versioning issue (but maybe?).
Also, I saw this on TrueNAS with the 6.6 LTS series, where an older 6.6 LTS kernel worked but a newer one didn't. Just FYI, I know those are not your kernels.

I assume this is either something to do with the churn in the USB4/Thunderbolt code, or in the PCIe / ACPI power management code, given there are a bunch of weird recent D3cold entries on lore.kernel.org.

This is me making utterly wild and uneducated guesses based on my reading of lore.kernel.org, not knowing what the F I am talking about :)

Also, pinning the older kernel is a bad workaround, as it kills network speeds by a couple of orders of magnitude... which is understandable, so this is just an FYI for anyone else who ends up here.

What's amusing to me is I can tell whether it works or is broken long before SSH or the web UI is up: when it is working my GPU is nice and quiet; when not, it blows a gale on the fans, rofl.
 
I was looking at your source tree changes over time; I see the rebasing of the Ubuntu kernel. I will download and install that on the same platform and see if I can narrow it to a regression in a specific Ubuntu kernel (if that would help).
 
Previous versions of the 6.8.12-x-pve series required this boot option to work: `thunderbolt.host_reset=false` (I'm using GRUB).

But that option didn't appear to work with 6.8.12-4-pve. Never mind: I rebooted the node and it's back online. So the above option still works (unsure why it didn't work immediately after the update).
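For reference, on a GRUB-booted host the workaround is set roughly like this (a sketch; it assumes the stock Proxmox /etc/default/grub and requires a reboot to take effect):

```shell
# Edit /etc/default/grub so the option is on the default command line, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet thunderbolt.host_reset=false"
nano /etc/default/grub

# Regenerate the GRUB config, then reboot for the new command line to apply
update-grub
reboot
```

Hosts that boot via systemd-boot (common with a ZFS root) use /etc/kernel/cmdline and `proxmox-boot-tool refresh` instead.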
 
@leesteken
I found the regression occurs in the 6.8.8 Ubuntu kernel. 6.8.4 through 6.8.7 are OK; 6.8.8 onwards have the regression.

@snakeoilos
I will give that a go on Ubuntu and Proxmox.
 
Previous versions of the 6.8.12-x-pve series required this boot option `thunderbolt.host_reset=false` to work
Thanks, this fixed the issue on the latest Ubuntu 24.04 kernel and the latest Proxmox kernel.

May I ask where you came across this nugget of gold? My google-fu obviously wasn't good enough :-( been banging my head against this one for months, on various distros...

And to hijack my own thread: what's the best repo to add to install the latest NVIDIA 560 non-open drivers and tools on the host?
 
Thanks, this fixed the issue on the latest Ubuntu 24.04 kernel and the latest Proxmox kernel.

May I ask where you came across this nugget of gold? My google-fu obviously wasn't good enough :-( been banging my head against this one for months, on various distros...
From this very forum. Credit goes to @gfngfn256. He's a google-fu master. :cool:

And to hijack my own thread: what's the best repo to add to install the latest NVIDIA 560 non-open drivers and tools on the host?
I always install via the shell installer direct from NVIDIA. Bad idea?
 
How did you fix this? Mine is still broken. I'm running Proxmox kernel 6.8.12-5-pve and my kernel boot options are:

root@proxmox:~# cat /etc/kernel/cmdline
root=ZFS=rpool/ROOT/pve-1 boot=zfs
thunderbolt.host_reset=false

I also ran `proxmox-boot-tool refresh` after editing /etc/kernel/cmdline, and there is no change with the Thunderbolt eGPU connected. lspci doesn't show it, but the other devices in the eGPU enclosure now show up. Also, running lspci makes the eGPU fan speed up inside the enclosure. I want to pass the NVIDIA 3060 through to a VM, but it's not showing up in Proxmox.
 
Are you sure you are having the same issue as this thread?
If you are not seeing the device and not seeing D3cold errors, it sounds like you have a different issue.

I did find that some enclosures (specifically the Sonnet 750EX) have horrible TB4/USB4 compatibility, and Sonnet has ZERO interest in solving it.
Workarounds:
  • plug a TB hub into your machine and then the Sonnet into the hub. I accidentally found this when I plugged the eGPU enclosure into an OWC PCIe enclosure that was connected to the machine: the OWC PCIe enclosure (which can't be used for GPUs) acted as a hub and solved the weird issue of the Sonnet enclosure not being visible.
  • buy something like this generic one, which I found worked great: https://www.amazon.com/gp/product/B0D2V5YFMH - they can be found cheaper on AliExpress, but I know this one works 100%.
Also, I didn't use the cmdline file; I made my edit directly to the GRUB file, so I can only confirm that method of setting the kernel command line works. Either way, if you see the kernel command line reported at the top of dmesg, you should be good.
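A few read-only checks to confirm the option actually reached the kernel, regardless of boot method (a sketch; the sysfs path assumes the thunderbolt module exposes the parameter, which may vary by kernel version):

```shell
# The live command line the kernel actually booted with
cat /proc/cmdline

# The same information as reported near the top of dmesg
dmesg | grep -i "command line"

# If exposed, the value the thunderbolt module actually picked up
cat /sys/module/thunderbolt/parameters/host_reset
```

If `thunderbolt.host_reset=false` is missing from /proc/cmdline, the bootloader config was not regenerated or the wrong config file was edited.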
 
I added thunderbolt.host_reset=false in the /etc/default/grub file, then updated the kernel again, and it finally showed up. I am not sure if it was the new kernel or the reset option.

I have a Legion eGPU enclosure; it seems fine, but it only supports the older Thunderbolt 3.

Sadly, after all this, I realized Llama 3.3 won't run on 12 GB of VRAM lol. Oh well, at least one issue solved.
 
Are you using Ollama? The 3.2 model will absolutely run in less than 10 GB of VRAM; I have had that running on my 3080.
 
It's broken again in Proxmox 6.8.12-9-pve; the Thunderbolt GPU is not being detected now.
Can you check if this kernel option was removed when you upgraded? `thunderbolt.host_reset=false`

Also, you didn't post much detail about what you see in dmesg, lspci, etc., so it's a little hard to help.

I don't have a system to test this on at the moment.
 
I can confirm `thunderbolt.host_reset=false` is in /etc/default/grub

It just doesn't show under lspci or in the Proxmox add-hardware menu for VMs. Looks like the kernel issue is back in 6.8.12-9, but keen for someone else to test this too.
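When the device is absent from lspci, it may be worth checking whether the Thunderbolt link itself comes up and then forcing a PCI rescan. A hedged sketch (`boltctl` comes from the `bolt` package and may not be installed on a stock Proxmox host):

```shell
# Check whether the kernel sees the Thunderbolt controller and attached devices
dmesg | grep -i thunderbolt

# If the 'bolt' package is installed, list attached Thunderbolt devices
boltctl list

# Ask the PCI bus to rescan for hot-plugged devices, then look for the GPU
echo 1 > /sys/bus/pci/rescan
lspci | grep -i nvidia
```

If the controller never enumerates the device at all, the output of the first command is usually the most useful thing to post.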
 
Some more diagnostics. dmesg output attached as a text file at the bottom to avoid cluttering.

Code:
root@proxmox2:~# lspci
00:00.0 Host bridge: Intel Corporation Raptor Lake-P 6p+8e cores Host Bridge/DRAM Controller
00:02.0 VGA compatible controller: Intel Corporation Raptor Lake-P [Iris Xe Graphics] (rev 04)
00:06.0 PCI bridge: Intel Corporation Raptor Lake PCIe 4.0 Graphics Port
00:06.2 PCI bridge: Intel Corporation Device a73d
00:07.0 PCI bridge: Intel Corporation Raptor Lake-P Thunderbolt 4 PCI Express Root Port
00:07.2 PCI bridge: Intel Corporation Raptor Lake-P Thunderbolt 4 PCI Express Root Port
00:0d.0 USB controller: Intel Corporation Raptor Lake-P Thunderbolt 4 USB Controller
00:0d.2 USB controller: Intel Corporation Raptor Lake-P Thunderbolt 4 NHI
00:0d.3 USB controller: Intel Corporation Raptor Lake-P Thunderbolt 4 NHI
00:14.0 USB controller: Intel Corporation Alder Lake PCH USB 3.2 xHCI Host Controller (rev 01)
00:14.2 RAM memory: Intel Corporation Alder Lake PCH Shared SRAM (rev 01)
00:16.0 Communication controller: Intel Corporation Alder Lake PCH HECI Controller (rev 01)
00:16.3 Serial controller: Intel Corporation Alder Lake AMT SOL Redirection (rev 01)
00:1c.0 PCI bridge: Intel Corporation Alder Lake-P PCH PCIe Root Port (rev 01)
00:1c.4 PCI bridge: Intel Corporation Alder Lake PCI Express x4 Root Port (rev 01)
00:1d.0 PCI bridge: Intel Corporation Alder Lake PCI Express x1 Root Port (rev 01)
00:1d.3 PCI bridge: Intel Corporation Alder Lake PCI Express Root Port (rev 01)
00:1f.0 ISA bridge: Intel Corporation Raptor Lake LPC/eSPI Controller (rev 01)
00:1f.3 Audio device: Intel Corporation Raptor Lake-P/U/H cAVS (rev 01)
00:1f.4 SMBus: Intel Corporation Alder Lake PCH-P SMBus Host Controller (rev 01)
00:1f.5 Serial bus controller: Intel Corporation Alder Lake-P PCH SPI Controller (rev 01)
01:00.0 Non-Volatile memory controller: Kingston Technology Company, Inc. OM8PGP4 NVMe PCIe SSD (DRAM-less)
02:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
02:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
57:00.0 Ethernet controller: Intel Corporation Ethernet Controller I226-V (rev 04)
58:00.0 Non-Volatile memory controller: Micron/Crucial Technology P310 NVMe PCIe SSD (DRAM-less) (rev 01)
59:00.0 Ethernet controller: Intel Corporation Ethernet Controller I226-LM (rev 04)
5a:00.0 Network controller: MEDIATEK Corp. MT7922 802.11ax PCI Express Wireless Network Adapter
root@proxmox2:~#



Code:
root@proxmox2:~# cat /etc/default/grub
# If you change this file, run 'update-grub' afterwards to update
# /boot/grub/grub.cfg.
# For full documentation of the options in this file, see:
#   info -f grub -n 'Simple configuration'

GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="quiet"
GRUB_CMDLINE_LINUX="thunderbolt.host_reset=false"


# If your computer has multiple operating systems installed, then you
# probably want to run os-prober. However, if your computer is a host
# for guest OSes installed via LVM or raw disk devices, running
# os-prober can cause damage to those guest OSes as it mounts
# filesystems to look for things.
#GRUB_DISABLE_OS_PROBER=false

# Uncomment to enable BadRAM filtering, modify to suit your needs
# This works with Linux (no patch required) and with any kernel that obtains
# the memory map information from GRUB (GNU Mach, kernel of FreeBSD ...)
#GRUB_BADRAM="0x01234567,0xfefefefe,0x89abcdef,0xefefefef"

# Uncomment to disable graphical terminal
#GRUB_TERMINAL=console

# The resolution used on graphical terminal
# note that you can use only modes which your graphic card supports via VBE
# you can see them in real GRUB with the command `vbeinfo'
#GRUB_GFXMODE=640x480

# Uncomment if you don't want GRUB to pass "root=UUID=xxx" parameter to Linux
#GRUB_DISABLE_LINUX_UUID=true

# Uncomment to disable generation of recovery mode menu entries
#GRUB_DISABLE_RECOVERY="true"

# Uncomment to get a beep at grub start
#GRUB_INIT_TUNE="480 440 1"
 

Attachments

It looks like the boot option thunderbolt.host_reset=false isn't being passed to the kernel on boot. I just can't see why.

What file should this appear in?
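Which file applies depends on the bootloader in use: on a ZFS root, Proxmox may boot via systemd-boot (which takes options from /etc/kernel/cmdline via proxmox-boot-tool) rather than GRUB (which takes them from /etc/default/grub). A sketch to check, assuming a standard Proxmox install:

```shell
# Show whether this host boots via GRUB or systemd-boot (uEFI)
proxmox-boot-tool status

# systemd-boot case: all options must be on ONE line in this file,
# then run: proxmox-boot-tool refresh
cat /etc/kernel/cmdline

# GRUB case: options go in GRUB_CMDLINE_LINUX* in this file,
# then run: update-grub
grep CMDLINE /etc/default/grub
```

After either refresh, `cat /proc/cmdline` following a reboot confirms whether the option made it through.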