[SOLVED] "Lost" GPU, probably after an upgrade

garzzz

New Member
Oct 8, 2024
Hi,

I have (had... sob) a wonderful Proxmox server, with some containers with working GPU passthrough.

This weekend I updated Proxmox through the web interface (apt update, apt upgrade and such). Then I rebooted it and, as far as I remember, there was no issue (but I may be remembering wrong).

Then yesterday, probably due to bad weather, I had a power outage and possibly some lightning issues. I had other PCs in the same room, plugged into the same outlet, and everything seems fine so far.

I figured out that something was wrong because the Jellyfin LXC won't start, failing with:
Code:
TASK ERROR: Device /dev/dri/renderD128 does not exist

Now, if I run nvtop on the host, I see "No GPU to monitor". So I fear that something is wrong with the GPU, maybe even hardware damage.

Luckily, I also ran lspci and I see:

Code:
26:00.0 VGA compatible controller: NVIDIA Corporation GA106 [GeForce RTX 3060 Lite Hash Rate] (rev a1)
26:00.1 Audio device: NVIDIA Corporation GA106 High Definition Audio Controller (rev a1)

So apparently the GPU is detected and therefore alive.
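
Seeing the card in lspci only shows that it enumerates on the PCI bus; it does not say whether a kernel driver is bound to it. Below is a minimal sketch of host-side checks that could narrow this down, assuming the proprietary NVIDIA driver (kernel module `nvidia`) is the one expected to drive the card. The helper name `module_loaded` is made up for illustration; the slot `26:00.0` and the render node path are taken from the outputs in this post.

```shell
#!/bin/sh
# Succeeds if the named kernel module appears in a /proc/modules-style listing.
module_loaded() {
    # $1 = module name, $2 = file to read (defaults to /proc/modules)
    grep -q "^$1 " "${2:-/proc/modules}" 2>/dev/null
}

if module_loaded nvidia; then
    echo "nvidia module: loaded"
else
    echo "nvidia module: NOT loaded"
fi

# The render node that the container start task complained about
if [ -e /dev/dri/renderD128 ]; then
    echo "render node: present"
else
    echo "render node: missing"
fi

# Show which kernel driver, if any, is bound to the card
lspci -nnk -s 26:00.0 2>/dev/null || true
```

If the last command prints no "Kernel driver in use" line for the card, the hardware is probably fine and the driver module is simply not loaded for the running kernel.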

I don't even know where to start debugging this issue. I saw the Jellyfin error in a number of posts, but the usual reply is to fix or reinstall the container, and then it is fixed. I fear that my case is worse, since the GPU is not "available" to the host (see the nvtop output). What should I do? Thanks in advance...

Edit: I tried to upgrade the NVIDIA drivers, since lspci shows the GPU, but I get this error:

Code:
  ERROR: Unable to find the kernel source tree for the currently running kernel.  Please make sure you have
         installed the kernel source files for your kernel and that they are properly configured; on Red Hat
         Linux systems, for example, be sure you have the 'kernel-source' or 'kernel-devel' RPM installed.  If
         you know the correct kernel source files are installed, you may specify the kernel source path with the
         '--kernel-source-path' command line option.
 
Try running this first:
Bash:
apt install -y pve-headers gcc make dkms
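
The "kernel source tree" error usually means the installer could not find headers matching the running kernel, which is plausible after an upgrade that pulled in a new kernel. Here is a small sketch to check that, assuming the conventional `/lib/modules/<release>/build` location. `headers_dir` is a hypothetical helper, and the versioned package names in the hint are assumptions that depend on the Proxmox release.

```shell
#!/bin/sh
# Print the directory where build headers for a kernel release are expected.
headers_dir() {
    # $1 = kernel release, e.g. the output of `uname -r`
    printf '/lib/modules/%s/build\n' "$1"
}

running=$(uname -r)
dir=$(headers_dir "$running")

if [ -d "$dir" ]; then
    echo "headers found for $running: $dir"
else
    echo "no headers for $running at $dir"
    # Assumption: a versioned headers package exists for this exact kernel
    echo "try: apt install pve-headers-$running (or proxmox-headers-$running on newer releases)"
fi
```

If the directory is missing even after installing the meta-package, the installed headers may belong to a different kernel version than the one currently booted.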
Done, but apparently nothing changed.

Edit: actually, the error message when installing the new driver changed to this:

Code:
WARNING: nvidia-installer was forced to guess the X library path '/usr/lib' and X
           module path '/usr/lib/xorg/modules'; these paths were not queryable from
           the system.  If X fails to find the NVIDIA X driver module, please
           install the `pkg-config` utility and the X.Org SDK/development package
           for your distribution and reinstall the driver.

I don't get why it is complaining about X, since it's a server without a GUI.
Maybe that's relevant.


Edit again: after that message, which is actually a warning rather than an error, I was able to reinstall the drivers and the GPU is managed again.
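
Since the breakage showed up after a kernel upgrade, it may be worth checking whether the driver is registered with DKMS, so that the module is rebuilt automatically next time. This is only a sketch: `dkms_lists_module` is a made-up helper that scans `dkms status` output, and whether the driver ends up registered depends on how the installer was run.

```shell
#!/bin/sh
# Succeeds if `dkms status`-style text on stdin lists the given module.
dkms_lists_module() {
    # Accepts both "name/version, ..." and the older "name, version, ..." layouts
    grep -q "^$1[/,]"
}

if command -v dkms >/dev/null 2>&1; then
    if dkms status 2>/dev/null | dkms_lists_module nvidia; then
        echo "nvidia is registered with DKMS"
    else
        echo "nvidia is not registered with DKMS; a manual reinstall may be needed after the next kernel upgrade"
    fi
else
    echo "dkms is not installed"
fi
```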

THANK YOU SO MUCH! :)
 