PVE 9 in-place upgrade failed due to Mellanox/NVIDIA networking

wickleighter

New Member
Jan 19, 2024
Hi all,

Egg on my face: I blindly followed the upgrade instructions for PVE 9. It turns out that for those of us running Mellanox/NVIDIA networking, there is no Debian 13 support yet:
Currently, Debian 13 (Trixie) is not yet officially supported in the released versions of MLNX_OFED or DOCA-OFED. As Debian 13 was only recently released (July 2025), it has not yet been added to the list of supported operating systems.

Other than not rebooting (and praying that drivers come out in the next few weeks), any advice on uninstalling the spider that is MLNX/DOCA?


Code:
Setting up proxmox-kernel-6.14.11-3-pve-signed (6.14.11-3) ...
Examining /etc/kernel/postinst.d.
run-parts: executing /etc/kernel/postinst.d/dkms 6.14.11-3-pve /boot/vmlinuz-6.14.11-3-pve
Sign command: /lib/modules/6.14.11-3-pve/build/scripts/sign-file
Signing key: /var/lib/dkms/mok.key
Public certificate (MOK): /var/lib/dkms/mok.pub

Autoinstall of module mlnx-ofed-kernel/25.01.OFED.25.01.0.6.0.1 for kernel 6.14.11-3-pve (x86_64)
Building module(s).................(bad exit status: 1)
Failed command:
./ofed_scripts/pre_build.sh 6.14.11-3-pve /lib/modules/6.14.11-3-pve/build mlnx-ofed-kernel 25.01.OFED.25.01.0.6.0.1

Error! Bad return status for module build on kernel: 6.14.11-3-pve (x86_64)
Consult /var/lib/dkms/mlnx-ofed-kernel/25.01.OFED.25.01.0.6.0.1/build/make.log for more information.

Autoinstall on 6.14.11-3-pve succeeded for module(s) kernel-mft-dkms knem.
Autoinstall on 6.14.11-3-pve failed for module(s) mlnx-ofed-kernel(10).

Error! One or more modules failed to install during autoinstall.
Refer to previous errors for more information.
run-parts: /etc/kernel/postinst.d/dkms exited with return code 1
Failed to process /etc/kernel/postinst.d at /var/lib/dpkg/info/proxmox-kernel-6.14.11-3-pve-signed.postinst line 20.
dpkg: error processing package proxmox-kernel-6.14.11-3-pve-signed (--configure):
 installed proxmox-kernel-6.14.11-3-pve-signed package post-installation script subprocess returned error exit status 2
dpkg: dependency problems prevent configuration of proxmox-kernel-6.14:
 proxmox-kernel-6.14 depends on proxmox-kernel-6.14.11-3-pve-signed | proxmox-kernel-6.14.11-3-pve; however:
  Package proxmox-kernel-6.14.11-3-pve-signed is not configured yet.
  Package proxmox-kernel-6.14.11-3-pve is not installed.
  Package proxmox-kernel-6.14.11-3-pve-signed which provides proxmox-kernel-6.14.11-3-pve is not configured yet.

dpkg: error processing package proxmox-kernel-6.14 (--configure):
 dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of proxmox-default-kernel:
 proxmox-default-kernel depends on proxmox-kernel-6.14; however:
  Package proxmox-kernel-6.14 is not configured yet.

dpkg: error processing package proxmox-default-kernel (--configure):
 dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of proxmox-ve:
 proxmox-ve depends on proxmox-default-kernel; however:
  Package proxmox-default-kernel is not configured yet.

dpkg: error processing package proxmox-ve (--configure):
 dependency problems - leaving unconfigured
Errors were encountered while processing:
 proxmox-kernel-6.14.11-3-pve-signed
 proxmox-kernel-6.14
 proxmox-default-kernel
 proxmox-ve
Removing subscription nag from UI...
Error: Sub-process /usr/bin/dpkg returned an error code (1)


root@snap:~# /usr/sbin/ofed_uninstall.sh
Log: /tmp/ofed.uninstall.log

This program will uninstall all OFED-internal-25.01-0.6.0 packages on your machine.

Do you want to continue?[y/N]:y

Removing MLNX_OFED packages

Error: One or more packages depends on MLNX_OFED.
Those packages should be removed before uninstalling MLNX_OFED:

collectx-clxapi ceph-fuse pve-container qemu-server librados2-perl librados2 libradosstriper1 python3-rgw librbd1 ceph-common libpve-guest-common-perl pve-qemu-kvm libcephfs2 libpve-storage-perl librgw2 proxmox-ve spiceterm libiscsi7 python3-rados mft-autocomplete python3-cephfs pve-manager pve-ha-manager python3-rbd

To force uninstallation use '--force' flag
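
In case it helps anyone in the same spot, here is one possible cleanup path (a rough sketch, not tested against this exact install; the module version is taken from the log above). The idea is to drop the failing DKMS module first so dpkg can finish configuring the new kernel, and keep booting the known-good kernel until NVIDIA ships support. Forcing ofed_uninstall.sh may pull out libraries the PVE/Ceph packages listed above depend on, so I'd try the gentler route first:

Code:
# see which OFED/DOCA bits are actually installed
dpkg -l | grep -i -E 'mlnx|ofed|doca'
dkms status

# remove the module that fails to build (version taken from the log above), then let dpkg finish
dkms remove mlnx-ofed-kernel/25.01.OFED.25.01.0.6.0.1 --all
dpkg --configure -a
apt --fix-broken install

# optionally pin the currently working kernel so a reboot doesn't surprise you
proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin <known-good-version>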
 
This is one of the things I try my best to avoid on a Proxmox host: installing 3rd-party software/driver packages with a large number of very specific dependencies. It can screw up a hell of a lot of things come upgrade time, i.e. nail-biting cleanup, all kinds of contradictory dependency conflicts, GCC versions going out of sync, and so on. The list goes on.

Best of luck.
 
hi,

Just for your information, the NVIDIA driver does not have to support Debian 13 directly; the kernel version is more the reason here. Since we base our kernel on Ubuntu's and not on Debian's, it differs from the one Debian 13 ships.

This can always happen with out-of-tree modules and DKMS, but NVIDIA is usually relatively fast at providing updates for newer kernels (at least on the GPU side; I can't speak much to the network side, though).

In any case, do you want to post the log file so we can take a look?

Error! Bad return status for module build on kernel: 6.14.11-3-pve (x86_64) Consult /var/lib/dkms/mlnx-ofed-kernel/25.01.OFED.25.01.0.6.0.1/build/make.log for more information.
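
Something like this should grab the relevant details (paths taken from the error above; adjust the version string if yours differs):

Code:
uname -r
pveversion -v | head -n 5
dkms status
tail -n 100 /var/lib/dkms/mlnx-ofed-kernel/25.01.OFED.25.01.0.6.0.1/build/make.log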

Also, I'm not super familiar with the Mellanox network stack: is it absolutely necessary that you use the out-of-tree modules instead of the upstream in-tree ones?
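For reference, a quick way to check which mlx5 driver is actually in use (the in-tree one or the OFED/DKMS build) is to look at where the loaded module file lives; the interface name below is just a placeholder:

Code:
# a path under .../kernel/drivers/net/ethernet/mellanox/ means the in-tree module,
# a path under .../updates/dkms/ means the OFED/DKMS build
modinfo -n mlx5_core

# driver and firmware version reported for a given NIC (enp1s0f0 is a placeholder)
ethtool -i enp1s0f0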

EDIT:

Also, according to this:

https://docs.nvidia.com/networking/display/mlnxofedv24100700

the MLNX_OFED drivers have been replaced by the DOCA-OFED drivers:

https://docs.nvidia.com/doca/sdk/mlnx_ofed+to+doca-ofed+transition+guide/index.html

Did you try those as well?
 
I am using DOCA kernel hardware acceleration for Open vSwitch on Proxmox 8.4 via the doca-networking package. I upgraded to Proxmox 9 and then 9.1, and tried the new DOCA 3.2 drivers with my ConnectX-4 Lx on both the 6.14 and 6.17 kernels. The moment I enable hardware acceleration, I get the following error:

Code:
Nov 27 09:34:08 server ovs-vsctl[12442]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
Nov 27 09:34:08 server ovs-vsctl[12443]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl set Open_vSwitch . other_config:lacp-fallback-ab=true
Nov 27 09:34:09 server kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
Nov 27 09:34:09 server kernel: #PF: supervisor read access in kernel mode
Nov 27 09:34:09 server kernel: #PF: error_code(0x0000) - not-present page
Nov 27 09:34:09 server kernel: PGD 0 P4D 0
Nov 27 09:34:09 server kernel: Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
Nov 27 09:34:09 server kernel: CPU: 28 UID: 0 PID: 12347 Comm: handler1 Tainted: P           OE      6.14.11-4-pve #1
Nov 27 09:34:09 server kernel: Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Nov 27 09:34:09 server kernel: Hardware name: Supermicro Super Server/H12SSL-C, BIOS 3.3 03/28/2025
Nov 27 09:34:09 server kernel: RIP: 0010:mlx5_fs_pool_release_index+0x2e/0x140 [mlx5_core]
Nov 27 09:34:09 server kernel: Code: 00 55 48 89 e5 41 56 4c 8d 77 18 41 55 49 89 f5 41 54 53 4c 8b 26 48 89 fb 4c 89 f7 e8 2b 4e ce d7 49 63 45 08 49 8b 54 24 18 <48> 0f a3 02 0f 82 ed 00 00 00 49 8b 54 24 18 f0 48 0f ab 02 83 43
Nov 27 09:34:09 server kernel: RSP: 0018:ffffd2b6113b7150 EFLAGS: 00010246
Nov 27 09:34:09 server kernel: RAX: 0000000000000000 RBX: ffff8f3e01392290 RCX: 0000000000000000
Nov 27 09:34:09 server kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Nov 27 09:34:09 server kernel: RBP: ffffd2b6113b7170 R08: 0000000000000000 R09: 0000000000000000
Nov 27 09:34:09 server kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8f3e08366ae0
Nov 27 09:34:09 server kernel: R13: ffffd2b6113b7180 R14: ffff8f3e013922a8 R15: 0000000000000000
Nov 27 09:34:09 server kernel: FS:  00007176a5f6c6c0(0000) GS:ffff8f5c87c00000(0000) knlGS:0000000000000000
Nov 27 09:34:09 server kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 27 09:34:09 server kernel: CR2: 0000000000000000 CR3: 0000000169fd3000 CR4: 0000000000350ef0
Nov 27 09:34:09 server kernel: Call Trace:
Nov 27 09:34:09 server kernel:  <TASK>
Nov 27 09:34:09 server kernel:  mlx5_fc_release+0xc3/0xe0 [mlx5_core]
Nov 27 09:34:09 server kernel:  mlx5_fc_destroy+0x42/0x70 [mlx5_core]
Nov 27 09:34:09 server kernel:  mlx5_free_flow_attr_actions+0x24e/0x2b0 [mlx5_core]
Nov 27 09:34:09 server kernel:  mlx5e_tc_del_fdb_flow+0x1d1/0x310 [mlx5_core]
Nov 27 09:34:09 server kernel:  ? down_read+0x12/0xc0
Nov 27 09:34:09 server kernel:  mlx5e_tc_del_flow+0x60/0x2d0 [mlx5_core]
Nov 27 09:34:09 server kernel:  mlx5e_flow_put+0x3a/0x90 [mlx5_core]
Nov 27 09:34:09 server kernel:  __mlx5e_add_fdb_flow+0x116/0x4b0 [mlx5_core]
Nov 27 09:34:09 server kernel:  mlx5e_configure_flower+0x6ec/0x1290 [mlx5_core]
Nov 27 09:34:09 server kernel:  ? srso_return_thunk+0x5/0x5f
Nov 27 09:34:09 server kernel:  ? __kmalloc_noprof+0x24f/0x510
Nov 27 09:34:09 server kernel:  mlx5e_setup_tc_block_cb+0x79/0x100 [mlx5_core]
Nov 27 09:34:09 server kernel:  tc_setup_cb_add+0xed/0x210
Nov 27 09:34:09 server kernel:  fl_hw_replace_filter+0x163/0x220 [cls_flower]
Nov 27 09:34:09 server kernel:  fl_change+0xda3/0x1540 [cls_flower]
Nov 27 09:34:09 server kernel:  ? srso_return_thunk+0x5/0x5f
Nov 27 09:34:09 server kernel:  ? start_poll_synchronize_rcu_common+0x66/0xa0
Nov 27 09:34:09 server kernel:  tc_new_tfilter+0x436/0xe50
Nov 27 09:34:09 server kernel:  ? __pfx_tc_new_tfilter+0x10/0x10
Nov 27 09:34:09 server kernel:  rtnetlink_rcv_msg+0x357/0x3f0
Nov 27 09:34:09 server kernel:  ? kmem_cache_free+0x408/0x470
Nov 27 09:34:09 server kernel:  ? __pfx_rtnetlink_rcv_msg+0x10/0x10
Nov 27 09:34:09 server kernel:  netlink_rcv_skb+0x55/0x100
Nov 27 09:34:09 server kernel:  rtnetlink_rcv+0x15/0x30
Nov 27 09:34:09 server kernel:  netlink_unicast+0x248/0x390
Nov 27 09:34:09 server kernel:  netlink_sendmsg+0x214/0x460
Nov 27 09:34:09 server kernel:  ____sys_sendmsg+0x3b4/0x3f0
Nov 27 09:34:09 server kernel:  ___sys_sendmsg+0x9a/0xf0
Nov 27 09:34:09 server kernel:  ? srso_return_thunk+0x5/0x5f
Nov 27 09:34:09 server kernel:  __sys_sendmsg+0x8d/0xf0
Nov 27 09:34:09 server kernel:  __x64_sys_sendmsg+0x1d/0x30
Nov 27 09:34:09 server kernel:  x64_sys_call+0x6f9/0x2310
Nov 27 09:34:09 server kernel:  do_syscall_64+0x7e/0x170
Nov 27 09:34:09 server kernel:  ? ___sys_recvmsg+0xd4/0xf0
Nov 27 09:34:09 server kernel:  ? srso_return_thunk+0x5/0x5f
Nov 27 09:34:09 server kernel:  ? srso_return_thunk+0x5/0x5f
Nov 27 09:34:09 server kernel:  ? srso_return_thunk+0x5/0x5f
Nov 27 09:34:09 server kernel:  ? __sys_recvmsg+0x9a/0xf0
Nov 27 09:34:09 server kernel:  ? srso_return_thunk+0x5/0x5f
Nov 27 09:34:09 server kernel:  ? arch_exit_to_user_mode_prepare.isra.0+0xcb/0x120
Nov 27 09:34:09 server kernel:  ? srso_return_thunk+0x5/0x5f
Nov 27 09:34:09 server kernel:  ? syscall_exit_to_user_mode+0x38/0x1d0
Nov 27 09:34:09 server kernel:  ? srso_return_thunk+0x5/0x5f
Nov 27 09:34:09 server kernel:  ? do_syscall_64+0x8a/0x170
Nov 27 09:34:09 server kernel:  ? srso_return_thunk+0x5/0x5f
Nov 27 09:34:09 server kernel:  ? arch_exit_to_user_mode_prepare.isra.0+0xcb/0x120
Nov 27 09:34:09 server kernel:  ? srso_return_thunk+0x5/0x5f
Nov 27 09:34:09 server kernel:  ? syscall_exit_to_user_mode+0x38/0x1d0
Nov 27 09:34:09 server kernel:  ? srso_return_thunk+0x5/0x5f
Nov 27 09:34:09 server kernel:  ? do_syscall_64+0x8a/0x170
Nov 27 09:34:09 server kernel:  ? srso_return_thunk+0x5/0x5f
Nov 27 09:34:09 server kernel:  ? syscall_exit_to_user_mode+0x38/0x1d0
Nov 27 09:34:09 server kernel:  ? srso_return_thunk+0x5/0x5f
Nov 27 09:34:09 server kernel:  ? do_syscall_64+0x8a/0x170
Nov 27 09:34:09 server kernel:  ? srso_return_thunk+0x5/0x5f
Nov 27 09:34:09 server kernel:  ? do_syscall_64+0x8a/0x170
Nov 27 09:34:09 server kernel:  ? srso_return_thunk+0x5/0x5f
Nov 27 09:34:09 server kernel:  ? do_syscall_64+0x8a/0x170
Nov 27 09:34:09 server kernel:  ? syscall_exit_to_user_mode+0x38/0x1d0
Nov 27 09:34:09 server kernel:  ? srso_return_thunk+0x5/0x5f
Nov 27 09:34:09 server kernel:  ? do_syscall_64+0x8a/0x170
Nov 27 09:34:09 server kernel:  ? srso_return_thunk+0x5/0x5f
Nov 27 09:34:09 server kernel:  ? do_syscall_64+0x8a/0x170
Nov 27 09:34:09 server kernel:  ? ret_from_fork+0x29/0x70
Nov 27 09:34:09 server kernel:  entry_SYSCALL_64_after_hwframe+0x76/0x7e
Nov 27 09:34:09 server kernel: RIP: 0033:0x7176a638e9ee
Nov 27 09:34:09 server kernel: Code: 08 0f 85 f5 4b ff ff 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 89 5c 24 08 0f 05 <c3> 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 80 00 00 00 00 48 83 ec 08
Nov 27 09:34:09 server kernel: RSP: 002b:00007176a5f0e948 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
Nov 27 09:34:09 server kernel: RAX: ffffffffffffffda RBX: 00007176a5f6c6c0 RCX: 00007176a638e9ee
Nov 27 09:34:09 server kernel: RDX: 0000000000000000 RSI: 00007176a5f0ea10 RDI: 0000000000000014
Nov 27 09:34:09 server kernel: RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
Nov 27 09:34:09 server kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00005b20f81f3de0
Nov 27 09:34:09 server kernel: R13: 0000000000000012 R14: 00007176a5f0ea10 R15: 0000000000000001
Nov 27 09:34:09 server kernel:  </TASK>
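
In case it's useful as a stop-gap (and assuming the crash really only triggers with offload enabled, as the log suggests), turning hardware offload back off and restarting Open vSwitch should at least keep the box on its feet, at the cost of losing the acceleration:

Code:
# revert the setting from the log above and restart OVS
ovs-vsctl set Open_vSwitch . other_config:hw-offload=false
systemctl restart openvswitch-switch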