I recently upgraded to Proxmox 9.2.2 and my NVIDIA drivers completely stopped working.
While fixing this after the upgrade, I found two issues:
- DKMS failed to rebuild the driver for the new kernel.
  - I think this was my fault. I had previously installed headers for a specific kernel version using apt install pve-headers-$(uname -r). I should probably have used apt install pve-headers.
- The NovaCore module bound itself to the card before the nvidia driver/module could.
Code:
## journalctl (UI: Host > System > System Log)
May 23 17:20:30 box kernel: NVRM: GPU 0000:65:00.0 is already bound to NovaCore.
May 23 17:20:30 box kernel: NVRM: The NVIDIA probe routine was not called for 1 device(s).
May 23 17:20:30 box kernel: NVRM: This can occur when another driver was loaded and
NVRM: obtained ownership of the NVIDIA device(s).
May 23 17:20:30 box kernel: NVRM: Try unloading the conflicting kernel module (and/or
NVRM: reconfigure your kernel without the conflicting
NVRM: driver(s)), then try loading the NVIDIA kernel module
NVRM: again.
- For context, Nova is the new open-source, Rust-based kernel driver meant to replace Nouveau for modern GPUs (Turing / RTX 20-series and newer) that use GSP firmware. It essentially seems to be the new nouveau.
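To check whether your host is hitting the same conflict, you can filter the kernel log for the NVRM "already bound" message shown above. This is only a sketch: the function name nvrm_conflicts is made up for illustration, and it assumes the message wording matches the snippet from my log.

```shell
# Print "PCI-address driver" for every NVRM line reporting that a GPU is
# already bound to another driver. Reads kernel log text on stdin.
# Assumes the message format shown in the journalctl snippet above.
nvrm_conflicts() {
    sed -n 's/.*NVRM: GPU \([0-9a-fA-F:.]*\) is already bound to \([A-Za-z0-9_-]*\).*/\1 \2/p'
}

# Typical use on the host (kernel messages from the current boot):
#   journalctl -k -b | nvrm_conflicts
```

No output means no such message was logged in the text you fed in.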
Here is the fix that worked for me!
1. Validate that your NVIDIA DKMS module was built for your new kernel version
Check your installed driver and kernel version:
Bash:
dkms status
uname -r
**If your DKMS status output indicates the driver is already built for your active kernel, skip to step 3.**
Code:
# example 1: DKMS Module built for current active kernel
root@box:~# dkms status && uname -r
nvidia/595.71.05, 7.0.2-6-pve, x86_64: installed
7.0.2-6-pve
# example 2: DKMS Module NOT built for current active kernel
root@box:~# dkms status && uname -r
nvidia/595.71.05, 6.17.13-7-pve, x86_64: installed
7.0.2-6-pve
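If you would rather not compare the two outputs by eye, the check can be scripted. A minimal sketch, assuming dkms status prints lines in the "nvidia/<version>, <kernel>, <arch>: installed" format shown in the examples above (other DKMS versions may format this differently); nvidia_built_for is a hypothetical helper name.

```shell
# Succeeds (exit 0) if the given `dkms status` text lists the nvidia module
# as installed for the given kernel release, fails (exit 1) otherwise.
# $1 = output of `dkms status`, $2 = kernel release (e.g. from `uname -r`)
nvidia_built_for() {
    printf '%s\n' "$1" | grep -F "nvidia/" | grep -F ", $2," | grep -q "installed"
}

# Typical use on the host:
#   if nvidia_built_for "$(dkms status)" "$(uname -r)"; then
#       echo "module already built, skip to step 3"
#   else
#       echo "module missing for $(uname -r), continue with step 2"
#   fi
```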
2. Rebuild the NVIDIA Module (DKMS)
Install the headers for the new kernel so DKMS can compile the driver.
Bash:
apt update
apt install pve-headers-$(uname -r)
Force DKMS to rebuild and install the module for the new kernel:
Bash:
# example for a card with nvidia driver 595.71.05 ('dkms status' output) on kernel 7.0.2-6-pve ('uname -r' output)
dkms install -m nvidia -v 595.71.05 -k 7.0.2-6-pve
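To avoid typing the driver version by hand, it can be pulled out of the dkms status output. Again a sketch under the assumption that the output uses the "nvidia/<version>, ..." format from step 1; nvidia_dkms_versions is a name I made up.

```shell
# Print the distinct nvidia driver version(s) found in `dkms status` text
# read from stdin. Assumes lines that start with "nvidia/<version>,".
nvidia_dkms_versions() {
    sed -n 's|^nvidia/\([^,]*\),.*|\1|p' | sort -u
}

# Typical use on the host:
#   ver=$(dkms status | nvidia_dkms_versions)
#   dkms install -m nvidia -v "$ver" -k "$(uname -r)"
```

If more than one driver version is registered with DKMS, the function prints several lines, and you should pick the version yourself instead of passing the result straight to dkms install.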
3. Blacklist the NovaCore Driver
To prevent NovaCore from taking control of the GPU during early boot, you need to blacklist it.
If you already have a blacklist entry in one of your /etc/modprobe.d/ files for nouveau, just add the lines specific to nova/nova_core.
Bash:
echo "blacklist nova" >> /etc/modprobe.d/blacklist.conf
echo "blacklist nova_core" >> /etc/modprobe.d/blacklist.conf
echo "options nova modeset=0" >> /etc/modprobe.d/blacklist.conf
echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.conf
echo "options nouveau modeset=0" >> /etc/modprobe.d/blacklist.conf
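Note that echo with >> appends the line every time it runs, so repeating the commands above leaves duplicate entries in the file. A small sketch of a variant that only appends a line when it is missing; add_line_once is a hypothetical helper, not something Proxmox or NVIDIA ship.

```shell
# Append a line to a file only if that exact line is not already present.
# $1 = file, $2 = line
add_line_once() {
    touch "$1"    # make sure the file exists so grep does not error out
    grep -qxF -- "$2" "$1" || printf '%s\n' "$2" >> "$1"
}

# Typical use on the host (same file and entries as above):
#   conf=/etc/modprobe.d/blacklist.conf
#   add_line_once "$conf" "blacklist nova"
#   add_line_once "$conf" "blacklist nova_core"
#   add_line_once "$conf" "blacklist nouveau"
```

grep -x matches whole lines and -F treats the text literally, so "blacklist nova" is not mistaken for "blacklist nova_core".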
4. Update Initramfs and Reboot
Finally, rebuild your initial ramdisk so the blacklist applies during the boot sequence, and restart the host.
Bash:
update-initramfs -u -k all
reboot
5. Validate fix
Once the host comes back online, run nvidia-smi; it should be back to normal!

EDIT: initramfs command
EDIT2: Updated the blacklist location from nvidia.conf to a generic blacklist.conf file. This aligns with the official Proxmox blacklist instructions and prevents the blacklist from being reverted during an NVIDIA driver upgrade.
EDIT3: Added a snippet from my host's journalctl that pointed to NovaCore being the issue.
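As an extra check for step 5, you can also confirm which kernel driver actually owns the card. The sketch below parses lspci -k style output; nvidia_driver_in_use is a hypothetical helper, and it assumes the usual layout where each device starts on an unindented line and its "Kernel driver in use:" line is indented below it.

```shell
# Print the "Kernel driver in use" value for each NVIDIA device found in
# `lspci -k` style text read from stdin.
nvidia_driver_in_use() {
    awk '
        # An unindented line starts a new device; remember if it is NVIDIA.
        /^[0-9a-fA-F]/ { is_nvidia = ($0 ~ /NVIDIA/) }
        is_nvidia && /Kernel driver in use:/ {
            sub(/.*Kernel driver in use: */, "")
            print
        }
    '
}

# Typical use on the host:
#   lspci -k | nvidia_driver_in_use    # should print "nvidia" after the fix
```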