[SOLVED] Regression in Thunderbolt connected eGPU functionality between 6.8.4-2-pve and 6.8.12-4-pve

scyto

On Proxmox installed from the ISO with kernel 6.8.4-2-pve, the nouveau NVIDIA driver loads fine.

If the kernel is upgraded to 6.8.12-4-pve, the NVIDIA driver will not load, and the failure is preceded by a D3cold error in dmesg.

This looks like an upstream kernel regression of some sort, as I hit the same thing in Ubuntu on bare metal (https://forums.developer.nvidia.com...-from-d3cold-to-d0-device-inaccessible/304459) and have experienced the same regression on ZimaOS and TrueNAS 24.10.

It is useful to be able to pass through an eGPU to VMs when needed. I know this is probably a niche scenario.

For now I have worked around this by pinning the older version of the kernel.
 
Please test with more Proxmox kernel versions, like 6.8.8-4-pve and 6.8.12-1-pve (and maybe 6.8.4-3-pve, 6.8.8-1-pve, 6.8.8-2-pve, 6.8.8-3-pve, 6.8.12-2-pve, 6.8.12-3-pve) to narrow down when it broke.
 
The regression was introduced between 6.8.4-4-pve and 6.8.8-1-pve.

Code:
6.8.4-2-pve = working
6.8.4-3-pve = working
6.8.4-4-pve = working
6.8.8-1-pve = broken
6.8.8-4-pve = broken

Hope that helps.
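
For anyone repeating this kind of test on other kernels, the narrowing step can be sketched in Python roughly as follows. This is only an illustration: the function names are made up, and it assumes the breakage is monotonic (once a kernel is broken, every later one is too), which matches the results listed above but is not guaranteed in general.

```python
import re

def version_key(kernel):
    """Turn '6.8.4-2-pve' into (6, 8, 4, 2) so versions sort numerically."""
    return tuple(int(n) for n in re.findall(r"\d+", kernel))

def regression_window(results):
    """Return (last working, first broken) kernel from a {version: bool} map.

    Assumes the breakage is monotonic: once broken, later kernels stay broken.
    """
    ordered = sorted(results, key=version_key)
    working = [k for k in ordered if results[k]]
    broken = [k for k in ordered if not results[k]]
    if not working or not broken:
        raise ValueError("need at least one working and one broken kernel")
    return working[-1], broken[0]

# Results copied from the list above (True = working, False = broken).
results = {
    "6.8.4-2-pve": True,
    "6.8.4-3-pve": True,
    "6.8.4-4-pve": True,
    "6.8.8-1-pve": False,
    "6.8.8-4-pve": False,
}
print(regression_window(results))
```

Sorting on the parsed numbers rather than on the raw strings matters here, because a plain string sort would place 6.8.12 before 6.8.4.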

more info
---------

Nouveau working looks like this:

Code:
[   13.296521] nouveau 0000:0f:00.0: fb: 11264 MiB GDDR6
[   13.314694] nouveau 0000:0f:00.0: DRM: VRAM: 11264 MiB
[   13.314697] nouveau 0000:0f:00.0: DRM: GART: 536870912 MiB
[   13.314698] nouveau 0000:0f:00.0: DRM: BIT table 'A' not found
[   13.314699] nouveau 0000:0f:00.0: DRM: BIT table 'L' not found
[   13.314700] nouveau 0000:0f:00.0: DRM: TMDS table version 2.0
[   13.316334] nouveau 0000:0f:00.0: DRM: MM: using COPY for buffer copies

Nouveau broken looks like this:

Code:
[   12.943373] nouveau 0000:0f:00.0: Unable to change power state from D3cold to D0, device inaccessible
[   12.943522] nouveau 0000:0f:00.0: unknown chipset (ffffffff)
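
To check for this failure from a script rather than by eye, a minimal Python sketch along these lines could scan dmesg output for the message shown above. The function name and the regular expression are illustrative assumptions based only on the log excerpt in this post; other drivers or kernel versions may word the message differently.

```python
import re

# Pattern derived from the log excerpt in this thread (an assumption, not an
# exhaustive match for every kernel's wording).
D3COLD_RE = re.compile(
    r"(?P<driver>\S+) (?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"Unable to change power state from D3cold to D0"
)

def find_d3cold_failures(dmesg_text):
    """Return (driver, PCI address) pairs for each D3cold wake-up failure."""
    return [(m.group("driver"), m.group("bdf"))
            for m in D3COLD_RE.finditer(dmesg_text)]

sample = (
    "[   12.943373] nouveau 0000:0f:00.0: Unable to change power state "
    "from D3cold to D0, device inaccessible\n"
    "[   12.943522] nouveau 0000:0f:00.0: unknown chipset (ffffffff)\n"
)
print(find_d3cold_failures(sample))
```

In practice the text would come from the output of `dmesg` (or `journalctl -k`) saved to a string; an empty list means the message was not seen.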

Given that I have seen this on later kernels on other distros with the official NVIDIA drivers, I don't think it is a driver versioning issue (but maybe?).
I also saw this on TrueNAS with 6.6 LTS, where an older 6.6 LTS kernel worked but a newer 6.6 LTS kernel didn't. Just FYI, I know those are not your kernels.

I assume this has something to do with the churn in the USB4/Thunderbolt code or in the PCIe or PCIe/ACPI power management code, given there are a bunch of weird recent D3cold entries on lore.kernel.org.

This is me making utterly wild and uneducated guesses based on my reading of lore.kernel.org and not knowing what the F I am talking about :)

Also, pinning the older kernel is a bad workaround, as it cuts network speeds by a couple of orders of magnitude, which is understandable, so this is just an FYI for anyone else who ends up here.

What's amusing to me is that I can tell whether it is working or broken long before SSH or the web UI is up: when it is working my GPU is nice and quiet, when not it blows a gale on the fans, rofl
 
I was looking at your source tree changes over time and I see the rebasing on the Ubuntu kernel. I will download and install that on the same platform and see if I can narrow this down to a regression in a specific Ubuntu kernel (if that will help).
 
Previous versions of the 6.8.12-x-pve series require the boot option `thunderbolt.host_reset=false` to work (I'm using GRUB).

At first that option didn't appear to work with 6.8.12-4-pve, but never mind: after rebooting the node it's back online. So the above option still works (unsure why it didn't work immediately after the update).
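
For reference, on a GRUB-booted Proxmox host a kernel parameter like this normally goes into the `GRUB_CMDLINE_LINUX_DEFAULT` line of `/etc/default/grub`, followed by `update-grub` and a reboot (hosts booting via systemd-boot use `/etc/kernel/cmdline` and `proxmox-boot-tool refresh` instead). Below is a rough Python sketch of that edit as a pure text transformation. The function name is hypothetical, it assumes the value is double-quoted on a single line, and it does not touch any files itself.

```python
import re

def add_kernel_param(grub_default_text, param):
    """Append `param` to GRUB_CMDLINE_LINUX_DEFAULT unless already present.

    Assumes the value is double-quoted and sits on a single line.
    """
    pattern = re.compile(
        r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"]*)(")', re.MULTILINE
    )

    def _patch(match):
        params = match.group(2).split()
        if param not in params:
            params.append(param)
        return match.group(1) + " ".join(params) + match.group(3)

    new_text, count = pattern.subn(_patch, grub_default_text, count=1)
    if count == 0:
        raise ValueError("GRUB_CMDLINE_LINUX_DEFAULT line not found")
    return new_text

print(add_kernel_param('GRUB_CMDLINE_LINUX_DEFAULT="quiet"\n',
                       "thunderbolt.host_reset=false"))
```

After a reboot, `cat /proc/cmdline` shows whether the parameter actually reached the running kernel.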
 
Thanks, this fixed the issue on the latest Ubuntu 24.04 kernel and the latest Proxmox kernel.

May I ask where you came across this nugget of gold? My google-fu obviously wasn't good enough :-( I've been banging my head against this one for months, on various distros...

And to hijack my own thread, what's the best repo to add to install the latest NVIDIA 560 non-open drivers and tools on the host?
 
From this very forum. Credit goes to @gfngfn256. He's a google-fu master. :cool:

I always install via the shell installer directly from NVIDIA. Bad idea?
 