[SOLVED] Fix: NVIDIA Drivers Failing after upgrade to Proxmox 9.2.2 (Kernel 7.0.2-6-pve) / NovaCore Conflict

Z1nG

I recently upgraded to Proxmox 9.2.2 and my NVIDIA drivers completely stopped working.

While fixing this after the upgrade, I found two issues:
  • DKMS failed to rebuild the driver for the new kernel.
    • I think this was my fault. I had previously installed headers for one specific kernel version using apt install pve-headers-$(uname -r), so the new kernel had no headers to build against. I should probably have used apt install pve-headers.
  • My GPU was hijacked by NovaCore.
    • For context, Nova is the new open-source, Rust-based kernel driver meant to replace Nouveau for modern (Turing/RTX 2000+) GPUs that use GSP firmware. It seems to essentially be the new nouveau.


Here is the fix that worked for me!

1. Validate that your NVIDIA DKMS module was built for your new kernel version
Check your installed driver and kernel version:
Bash:
dkms status
uname -r

**If your DKMS status output shows the driver is already built for your active kernel, skip to Step 3.**

Code:
# example 1: DKMS module built for the current active kernel
root@box:~# dkms status && uname -r
nvidia/595.71.05, 7.0.2-6-pve, x86_64: installed
7.0.2-6-pve

# example 2: DKMS module NOT built for the current active kernel
root@box:~# dkms status && uname -r
nvidia/595.71.05, 6.17.13-7-pve, x86_64: installed
7.0.2-6-pve
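
The comparison in the two examples above can also be scripted. This is only a minimal sketch, assuming the dkms status line format shown in the examples (module/version, kernel, arch: state); the function name nvidia_built_for is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given `dkms status` text lists an
# installed nvidia module for the given kernel release.
# Assumes lines of the form "nvidia/<version>, <kernel>, <arch>: installed".
nvidia_built_for() {
    status_text=$1
    kernel=$2
    printf '%s\n' "$status_text" |
        grep -F "nvidia/" |
        grep -F ", $kernel, " |
        grep -q ": installed"
}

# Usage on a real host (skipped when dkms is not installed):
if command -v dkms >/dev/null 2>&1; then
    if nvidia_built_for "$(dkms status)" "$(uname -r)"; then
        echo "nvidia module is built for $(uname -r); skip to step 3"
    else
        echo "nvidia module is NOT built for $(uname -r); continue with step 2"
    fi
fi
```

Other DKMS versions may print the status line in a different layout, so check the raw dkms status output if the result looks wrong.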



2. Rebuild the NVIDIA Module (DKMS)
Install the headers for the new kernel so DKMS can compile the driver.

Bash:
apt update
apt install pve-headers-$(uname -r)

Force DKMS to rebuild and install the module for the new kernel:
Bash:
# example for a card with nvidia driver 595.71.05 ('dkms status' output) on kernel 7.0.2-6-pve ('uname -r' output)
dkms install -m nvidia -v 595.71.05 -k 7.0.2-6-pve
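
If you would rather not type the version by hand, it can be read from the dkms status output. A rough sketch, again assuming the "nvidia/<version>, ..." line format shown in step 1; nvidia_dkms_version is a hypothetical helper name, and the script only prints the resulting command so you can review it before running it.

```shell
#!/bin/sh
# Hypothetical helper: print the nvidia driver version found in `dkms status`
# text. Assumes lines of the form "nvidia/<version>, <kernel>, <arch>: <state>".
# If several versions are registered, only the first one is printed.
nvidia_dkms_version() {
    printf '%s\n' "$1" | sed -n 's|^nvidia/\([^,]*\),.*|\1|p' | head -n 1
}

# Usage on a real host (skipped when dkms is not installed):
if command -v dkms >/dev/null 2>&1; then
    ver=$(nvidia_dkms_version "$(dkms status)")
    if [ -n "$ver" ]; then
        # Print the command instead of running it, so it can be checked first.
        echo "dkms install -m nvidia -v $ver -k $(uname -r)"
    fi
fi
```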

3. Blacklist the NovaCore Driver
To prevent NovaCore from taking control of the GPU during early boot, you need to blacklist it.

If you already have a blacklist entry in one of your /etc/modprobe.d/ files for nouveau, just add the lines specific to nova/nova_core.

Bash:
echo "blacklist nova" >> /etc/modprobe.d/nvidia.conf
echo "blacklist nova_core" >> /etc/modprobe.d/nvidia.conf
echo "options nova modeset=0" >> /etc/modprobe.d/nvidia.conf
echo "blacklist nouveau" >> /etc/modprobe.d/nvidia.conf
echo "options nouveau modeset=0" >> /etc/modprobe.d/nvidia.conf
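
One caveat with the echo lines above: >> appends again every time you run them, so repeating the step leaves duplicate entries. A possible way around that, as a sketch only (ensure_line and the CONF variable are names I made up), is to append a line only when that exact line is missing:

```shell
#!/bin/sh
# Hypothetical helper: append a line to a file only if that exact line is
# not already present, so the step can be repeated safely.
ensure_line() {
    file=$1
    line=$2
    touch "$file"
    # -x matches whole lines only, so "blacklist nova" is not mistaken
    # for an existing "blacklist nova_core" entry.
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Same entries as above. CONF defaults to a scratch file so a trial run does
# not touch /etc/modprobe.d; set CONF=/etc/modprobe.d/nvidia.conf to apply.
CONF=${CONF:-/tmp/nvidia-blacklist-preview.conf}
for entry in \
    "blacklist nova" \
    "blacklist nova_core" \
    "options nova modeset=0" \
    "blacklist nouveau" \
    "options nouveau modeset=0"
do
    ensure_line "$CONF" "$entry"
done
```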

4. Update Initramfs and Reboot
Finally, rebuild your initial ramdisk so the blacklist applies during the boot sequence, and restart the host.

Bash:
update-initramfs -u
reboot

5. Validate fix
Once the host comes back online, run nvidia-smi. It should be back to normal!
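
If nvidia-smi still fails, it may help to check whether one of the blacklisted modules got loaded anyway. A small sketch that filters lsmod output for the module names used in step 3; conflicting_modules is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: given `lsmod` output, print any loaded module that
# this guide blacklists (nova, nova_core, nouveau). Prints nothing if none.
conflicting_modules() {
    printf '%s\n' "$1" |
        awk '$1 == "nova" || $1 == "nova_core" || $1 == "nouveau" { print $1 }'
}

# Usage on a real host (skipped when lsmod is not available):
if command -v lsmod >/dev/null 2>&1; then
    found=$(conflicting_modules "$(lsmod)")
    if [ -n "$found" ]; then
        echo "still loaded: $found"
    else
        echo "no nova/nouveau modules loaded"
    fi
fi
```

If a module is still listed, the blacklist probably did not make it into the initramfs for the kernel you booted.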
 
I generally recommend these commands:
Bash:
apt install -y proxmox-default-headers proxmox-headers-$(uname -r) gcc make dkms
The reason is that proxmox-headers-$(uname -r) covers the currently running kernel (think pinned or not yet rebooted), while proxmox-default-headers is a meta package that tracks the newest kernel, so there are no future headaches.
For the same reason I'd recommend update-initramfs -u -k all to update the initramfs for all installed kernels. dpkg-reconfigure nvidia-kernel-dkms should force a rebuild as well.
I haven't heard of nova_core before, but the NVIDIA driver packages usually ship with a nouveau blacklist already.
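
To see which installed kernels would trip up DKMS for lack of headers, something like the following might help. It is a sketch under the assumption that the headers package provides a build entry under /lib/modules/<version>/, which is where DKMS normally looks; kernels_missing_headers is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: list kernel versions under a modules directory
# (default /lib/modules) that have no "build" entry. A dangling "build"
# symlink (headers removed) is reported as missing too.
kernels_missing_headers() {
    moddir=${1:-/lib/modules}
    for dir in "$moddir"/*/; do
        [ -d "$dir" ] || continue
        [ -e "${dir}build" ] || basename "$dir"
    done
}

# Prints one kernel version per line; no output means nothing is missing.
kernels_missing_headers
```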