PVE 9 in-place upgrade failed due to Mellanox/NVIDIA networking

wickleighter

Hi all,

Egg on my face: I blindly followed the upgrade instructions to PVE 9. It turns out that for those of us running Mellanox/NVIDIA networking drivers, there is no support for Debian 13 yet:
Currently, Debian 13 (Trixie) is not yet officially supported in the released versions of MLNX_OFED or DOCA-OFED. As Debian 13 was only recently released (July 2025), it has not yet been added to the list of supported operating systems.

Other than not rebooting (and praying that drivers come out in the next few weeks), any advice for uninstalling the spider that is MLNX_OFED/DOCA?


Code:
Setting up proxmox-kernel-6.14.11-3-pve-signed (6.14.11-3) ...
Examining /etc/kernel/postinst.d.
run-parts: executing /etc/kernel/postinst.d/dkms 6.14.11-3-pve /boot/vmlinuz-6.14.11-3-pve
Sign command: /lib/modules/6.14.11-3-pve/build/scripts/sign-file
Signing key: /var/lib/dkms/mok.key
Public certificate (MOK): /var/lib/dkms/mok.pub

Autoinstall of module mlnx-ofed-kernel/25.01.OFED.25.01.0.6.0.1 for kernel 6.14.11-3-pve (x86_64)
Building module(s).................(bad exit status: 1)
Failed command:
./ofed_scripts/pre_build.sh 6.14.11-3-pve /lib/modules/6.14.11-3-pve/build mlnx-ofed-kernel 25.01.OFED.25.01.0.6.0.1

Error! Bad return status for module build on kernel: 6.14.11-3-pve (x86_64)
Consult /var/lib/dkms/mlnx-ofed-kernel/25.01.OFED.25.01.0.6.0.1/build/make.log for more information.

Autoinstall on 6.14.11-3-pve succeeded for module(s) kernel-mft-dkms knem.
Autoinstall on 6.14.11-3-pve failed for module(s) mlnx-ofed-kernel(10).

Error! One or more modules failed to install during autoinstall.
Refer to previous errors for more information.
run-parts: /etc/kernel/postinst.d/dkms exited with return code 1
Failed to process /etc/kernel/postinst.d at /var/lib/dpkg/info/proxmox-kernel-6.14.11-3-pve-signed.postinst line 20.
dpkg: error processing package proxmox-kernel-6.14.11-3-pve-signed (--configure):
 installed proxmox-kernel-6.14.11-3-pve-signed package post-installation script subprocess returned error exit status 2
dpkg: dependency problems prevent configuration of proxmox-kernel-6.14:
 proxmox-kernel-6.14 depends on proxmox-kernel-6.14.11-3-pve-signed | proxmox-kernel-6.14.11-3-pve; however:
  Package proxmox-kernel-6.14.11-3-pve-signed is not configured yet.
  Package proxmox-kernel-6.14.11-3-pve is not installed.
  Package proxmox-kernel-6.14.11-3-pve-signed which provides proxmox-kernel-6.14.11-3-pve is not configured yet.

dpkg: error processing package proxmox-kernel-6.14 (--configure):
 dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of proxmox-default-kernel:
 proxmox-default-kernel depends on proxmox-kernel-6.14; however:
  Package proxmox-kernel-6.14 is not configured yet.

dpkg: error processing package proxmox-default-kernel (--configure):
 dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of proxmox-ve:
 proxmox-ve depends on proxmox-default-kernel; however:
  Package proxmox-default-kernel is not configured yet.

dpkg: error processing package proxmox-ve (--configure):
 dependency problems - leaving unconfigured
Errors were encountered while processing:
 proxmox-kernel-6.14.11-3-pve-signed
 proxmox-kernel-6.14
 proxmox-default-kernel
 proxmox-ve
Removing subscription nag from UI...
Error: Sub-process /usr/bin/dpkg returned an error code (1)


root@snap:~# /usr/sbin/ofed_uninstall.sh
Log: /tmp/ofed.uninstall.log

This program will uninstall all OFED-internal-25.01-0.6.0 packages on your machine.

Do you want to continue?[y/N]:y

Removing MLNX_OFED packages

Error: One or more packages depends on MLNX_OFED.
Those packages should be removed before uninstalling MLNX_OFED:

collectx-clxapi ceph-fuse pve-container qemu-server librados2-perl librados2 libradosstriper1 python3-rgw librbd1 ceph-common libpve-guest-common-perl pve-qemu-kvm libcephfs2 libpve-storage-perl librgw2 proxmox-ve spiceterm libiscsi7 python3-rados mft-autocomplete python3-cephfs pve-manager pve-ha-manager python3-rbd

To force uninstallation use '--force' flag
 
This is one of the things I try my best to avoid on a Proxmox host: installing third-party software or driver packages that have a large number of very specific dependencies. They can break a hell of a lot of things come upgrade time, i.e. nail-biting cleanup, all kinds of contradictory dependency conflicts, GCC versions going out of sync, and so on. The list goes on.

Best of luck.
 
Hi,

Just for your information: the NVIDIA driver does not have to support Debian 13 directly; the kernel version is the more likely reason here. Since we base our kernel on Ubuntu's and not on Debian's, it differs from the stock Debian 13 kernel.

This can always happen with out-of-tree modules and DKMS, but NVIDIA is usually relatively fast at providing updates for newer kernels (at least on the GPU side; I can't say much about the network side).
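To see which kernel is running and which DKMS modules failed during the upgrade, something like the following might help. It is only a sketch: `failed_dkms_modules` is a helper name made up here, and it assumes the apt/dpkg output was saved to a file (for example a hypothetical `upgrade.log`) in the same format as the log in the first post.

```shell
#!/bin/sh
# Hypothetical helper: reads apt/dpkg output on stdin and prints the names of
# modules listed on "Autoinstall on <kernel> failed for module(s) ..." lines.
failed_dkms_modules() {
    sed -n 's/^Autoinstall on .* failed for module(s) \(.*\)\.$/\1/p' \
        | tr ' ' '\n' \
        | sed 's/([0-9]*)$//'   # drop the trailing exit code, e.g. "(10)"
}

# Show the running kernel and what DKMS currently has registered.
echo "running kernel: $(uname -r)"
if command -v dkms >/dev/null 2>&1; then
    dkms status || true
else
    echo "dkms is not installed on this machine"
fi
```

Usage would be along the lines of `failed_dkms_modules < upgrade.log`; with the log from the first post it should print `mlnx-ofed-kernel`.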

In any case, could you post the log file so we can take a look?

Error! Bad return status for module build on kernel: 6.14.11-3-pve (x86_64)
Consult /var/lib/dkms/mlnx-ofed-kernel/25.01.OFED.25.01.0.6.0.1/build/make.log for more information.
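A quick way to grab the relevant part of that log for posting could look like this. It is a rough sketch: the path is copied from the error message above, `show_build_log` is just an illustrative function name, and the 50-line tail is an arbitrary choice.

```shell
#!/bin/sh
# Path copied from the DKMS error message in the first post.
LOG=/var/lib/dkms/mlnx-ofed-kernel/25.01.OFED.25.01.0.6.0.1/build/make.log

# Illustrative helper: print the last 50 lines of a log, followed by every
# line that mentions "error" (case-insensitive, with line numbers).
show_build_log() {
    log="$1"
    if [ ! -r "$log" ]; then
        echo "log not found or not readable: $log"
        return 1
    fi
    tail -n 50 "$log"
    echo "--- lines mentioning 'error' ---"
    grep -n -i 'error' "$log" || echo "(none)"
}

show_build_log "$LOG" || true
```

The full file is still more useful if it is not too large, since the first failing step is not always near the end.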

Also, I'm not very familiar with the Mellanox network stack: is it absolutely necessary that you use the out-of-tree modules instead of the upstream kernel ones?
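To check which variant a given kernel would actually load, the module path is a reasonable hint. The sketch below assumes a ConnectX-4 or newer card (driver `mlx5_core`; older cards use `mlx4_core`), and relies on the usual layout where in-tree modules live under `kernel/` and DKMS-built ones under `updates/` (or `extra/` on some distributions). `classify_module_path` is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: guess from a module's file path whether it is the
# in-tree driver or an out-of-tree (e.g. DKMS-built) replacement.
classify_module_path() {
    case "$1" in
        */updates/*|*/extra/*) echo "out-of-tree" ;;
        */kernel/*)            echo "in-tree" ;;
        *)                     echo "unknown" ;;
    esac
}

# modinfo -n prints the file name of the module for the running kernel.
if command -v modinfo >/dev/null 2>&1 && path=$(modinfo -n mlx5_core 2>/dev/null); then
    echo "$path -> $(classify_module_path "$path")"
else
    echo "mlx5_core not found for the running kernel (or modinfo is unavailable)"
fi
```

If the result is "in-tree" after the OFED packages are gone, the card is being driven by the upstream module shipped with the Proxmox kernel.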

EDIT:

Also, according to this:
https://docs.nvidia.com/networking/display/mlnxofedv24100700

the MLNX_OFED drivers are being replaced by the DOCA-OFED drivers:

https://docs.nvidia.com/doca/sdk/mlnx_ofed+to+doca-ofed+transition+guide/index.html

Did you try those too?
 