[SOLVED] Regression in Thunderbolt connected eGPU functionality between 6.8.4-2-pve and 6.8.12-4-pve

scyto · Monday at 23:20

On Proxmox installed from the ISO with kernel 6.8.4-2-pve, the nouveau NVIDIA driver loads fine.

If the kernel is upgraded to 6.8.12-4-pve, the NVIDIA driver no longer loads, and dmesg shows a D3cold error just before the failure.

This is an upstream kernel regression of some sort, as I hit the same thing on Ubuntu on bare metal https://forums.developer.nvidia.com...-from-d3cold-to-d0-device-inaccessible/304459 and have experienced the same regression on ZimaOS and TrueNAS 24.10.

It is useful to be able to pass through an eGPU to VMs when needed; I know this is probably a niche scenario.

For now I have worked around this by pinning the older version of the kernel.

leesteken · Monday at 23:27

scyto said:
On Proxmox installed from the ISO with kernel 6.8.4-2-pve, the nouveau NVIDIA driver loads fine.

If the kernel is upgraded to 6.8.12-4-pve, the NVIDIA driver no longer loads, and dmesg shows a D3cold error just before the failure.

Please test with more Proxmox kernel versions, like 6.8.8-4-pve and 6.8.12-1-pve (and maybe 6.8.4-3-pve, 6.8.8-1-pve, 6.8.8-2-pve, 6.8.8-3-pve, 6.8.12-2-pve, 6.8.12-3-pve) to narrow down when it broke.

scyto · 2024-11-12T00:00:01+0100

leesteken said:
Please test with more Proxmox kernel versions, like 6.8.8-4-pve and 6.8.12-1-pve (and maybe 6.8.4-3-pve, 6.8.8-1-pve, 6.8.8-2-pve, 6.8.8-3-pve, 6.8.12-2-pve, 6.8.12-3-pve) to narrow down when it broke.

will do

scyto · 2024-11-12T01:00:04+0100

The regression was introduced between 6.8.4-4-pve and 6.8.8-1-pve.

Code:

6.8.4-2-pve = working
6.8.4-3-pve = working
6.8.4-4-pve = working
6.8.8-1-pve = broken
6.8.8-4-pve = broken

Hope that helps.
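
For anyone repeating this kind of test on more kernels, here is a minimal sketch for finding the boundary in a results list. It assumes the `version = status` format used above and GNU `sort -V` for version ordering (so 6.8.12 sorts after 6.8.8). The function name `first_broken` is made up for illustration.

```shell
#!/bin/sh
# Report the last working and first broken kernel from "version = status" lines.
# Hypothetical helper; input format mirrors the results list in this post.
first_broken() {
    sort -V | awk -F' = ' '
        # Remember the newest working kernel seen before any broken one.
        $2 == "working" && first_bad == "" { last_ok = $1 }
        # Record only the first broken kernel in version order.
        $2 == "broken"  && first_bad == "" { first_bad = $1 }
        END { printf "last working: %s\nfirst broken: %s\n", last_ok, first_bad }
    '
}

# Results from this thread, deliberately out of order.
first_broken <<'EOF'
6.8.8-4-pve = broken
6.8.4-2-pve = working
6.8.8-1-pve = broken
6.8.4-4-pve = working
6.8.4-3-pve = working
EOF
```

The order of the input lines does not matter, since they are sorted by version before the boundary is looked up.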

more info
---------

Nouveau working looks like this:

Code:

[   13.296521] nouveau 0000:0f:00.0: fb: 11264 MiB GDDR6
[   13.314694] nouveau 0000:0f:00.0: DRM: VRAM: 11264 MiB
[   13.314697] nouveau 0000:0f:00.0: DRM: GART: 536870912 MiB
[   13.314698] nouveau 0000:0f:00.0: DRM: BIT table 'A' not found
[   13.314699] nouveau 0000:0f:00.0: DRM: BIT table 'L' not found
[   13.314700] nouveau 0000:0f:00.0: DRM: TMDS table version 2.0
[   13.316334] nouveau 0000:0f:00.0: DRM: MM: using COPY for buffer copies

Nouveau broken looks like this:

Code:

[   12.943373] nouveau 0000:0f:00.0: Unable to change power state from D3cold to D0, device inaccessible
[   12.943522] nouveau 0000:0f:00.0: unknown chipset (ffffffff)
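
A quick way to check for this signature without reading the whole log is to grep for the error text. This is only a sketch: the function name `has_d3cold_failure` is hypothetical, and the sample input is the two broken-case lines above.

```shell
#!/bin/sh
# Succeeds (exit 0) if the kernel log on stdin contains the D3cold wake failure.
has_d3cold_failure() {
    grep -q 'Unable to change power state from D3cold to D0'
}

# Demo against the two log lines from the broken boot.
if has_d3cold_failure <<'EOF'
[   12.943373] nouveau 0000:0f:00.0: Unable to change power state from D3cold to D0, device inaccessible
[   12.943522] nouveau 0000:0f:00.0: unknown chipset (ffffffff)
EOF
then
    echo "D3cold wake failure found"
fi

# On a live host (may need root): dmesg | has_d3cold_failure && echo "eGPU did not wake up"
```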

Given I have seen this on later kernels on other distros with the official NVIDIA drivers, I don't think it is a driver versioning issue (but maybe?).
Also, I saw this on TrueNAS with the 6.6 LTS kernel, where an older 6.6 LTS build worked but a newer one didn't. Just FYI; I know those are not your kernels.

I assume this is something to do with the churn in either the USB4/Thunderbolt code or the PCIe or PCIe/ACPI power management code, given there are a bunch of weird recent D3cold entries on lore.kernel.org.

This is me making utterly wild and uneducated guesses from my reading of lore.kernel.org, not knowing what the F I am talking about.

Also, pinning the older kernel is a bad workaround, as it kills network speeds by a couple of orders of magnitude... which is understandable, so this is just an FYI for anyone else who ends up here.

What's amusing to me is that I can tell whether it's working or broken long before SSH or the web UI is up: when it is working my GPU is nice and quiet, when not it blows a gale on the fans, rofl.

scyto · 2024-11-12T01:26:49+0100

I was looking at your source tree changes over time and I see the rebasing onto the Ubuntu kernel. I will download and install that on the same platform and see if I can narrow this down to a regression in a specific Ubuntu kernel (if that will help).

snakeoilos · 2024-11-12T02:16:51+0100

Previous versions of the 6.8.12-x-pve series require this boot option `thunderbolt.host_reset=false` to work (I'm using grub).

~~But that option doesn't appear to work with 6.8.12-4-pve.~~ Never mind. Rebooted the node and it's back online. So the above option still works (unsure why it didn't work immediately after the update).
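
For reference, a minimal sketch of setting the option on a GRUB-booted host. It assumes `quiet` is the only existing entry on the line; keep whatever options you already have. Proxmox hosts that boot via systemd-boot instead keep the kernel command line in `/etc/kernel/cmdline` and apply it with `proxmox-boot-tool refresh`. The helper `host_reset_disabled` is a made-up name for checking the result after a reboot.

```shell
# /etc/default/grub: append the option to the existing default command line.
# "quiet" is an assumption here; keep your own existing options.
GRUB_CMDLINE_LINUX_DEFAULT="quiet thunderbolt.host_reset=false"

# Then regenerate the GRUB config and reboot (as root):
#   update-grub

# Hypothetical helper: confirm the running kernel actually received the option.
# Takes an optional file path so it can be tried against a saved command line.
host_reset_disabled() {
    grep -q 'thunderbolt\.host_reset=false' "${1:-/proc/cmdline}"
}

# After the reboot: host_reset_disabled && echo "option is active"
```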

scyto · 2024-11-12T03:28:53+0100

@leesteken
I found that the regression occurs in the 6.8.8 Ubuntu kernel. 6.8.4 through 6.8.7 are OK; 6.8.8 onwards have the regression.

@snakeoilos
I will give that a go on Ubuntu and Proxmox.

scyto · 2024-11-12T04:46:19+0100

snakeoilos said:
Previous versions of the 6.8.12-x-pve series require this boot option `thunderbolt.host_reset=false` to work

Thanks, this fixed the issue on the latest Ubuntu 24.04 kernel and the latest Proxmox kernel.

May I ask where you came across this nugget of gold? My google-fu obviously wasn't good enough :-( I've been banging my head against this one for months, on various distros...

And to hijack my own thread, what's the best repo to add to install the latest nvidia 560 non-open drivers and tools on the host?

snakeoilos · 2024-11-12T05:17:50+0100

scyto said:
Thanks, this fixed the issue on the latest Ubuntu 24.04 kernel and the latest Proxmox kernel.

May I ask where you came across this nugget of gold? My google-fu obviously wasn't good enough :-( I've been banging my head against this one for months, on various distros...

From this very forum. Credit goes to @gfngfn256. He's a google-fu master.

scyto said:
And to hijack my own thread, what's the best repo to add to install the latest nvidia 560 non-open drivers and tools on the host?

I always install via shell installer direct from Nvidia. bad idea?

scyto · 2024-11-12T05:53:35+0100

snakeoilos said:
I always install via shell installer direct from Nvidia. bad idea?

I think I made the mistake of following this; next time I will just download the package from NVIDIA and run their install script. I did that before on an Ubuntu system and it worked great there.
