Upgrade 7 to 8, ConnectX-4 dkms module installed

I'm trying to pave the way for migrating Proxmox 7 to 8. One warning thrown by pve7to8 is about an installed dkms module - version 4.18.0 of the Mellanox drivers, which I had installed because my ConnectX-4 previously was not working reliably. I googled and found that this module will probably cause issues during the upgrade, as it won't build against the 6.x kernel: https://forum.proxmox.com/threads/update-von-7-auf-8-fehlgeschlagen.129327/

Question is: Does anyone have experience running a ConnectX-4 under PVE 8 without additional drivers? Is it stable, also under high load and with long uptimes? If so, any best practice for when/how to remove the extra module?

Code:
c1:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
        Subsystem: Mellanox Technologies MCX4421A-ACQN ConnectX-4 Lx EN OCP,2x25G
        Flags: bus master, fast devsel, latency 0, IRQ 591, IOMMU group 12
        Memory at 1801c000000 (64-bit, prefetchable) [size=32M]
        Expansion ROM at c0100000 [disabled] [size=1M]
        Capabilities: [60] Express Endpoint, MSI 00
        Capabilities: [48] Vital Product Data
        Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
        Capabilities: [c0] Vendor Specific Information: Len=18 <?>
        Capabilities: [40] Power Management version 3
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [180] Single Root I/O Virtualization (SR-IOV)
        Capabilities: [1c0] Secondary PCI Express
        Capabilities: [230] Access Control Services
        Kernel driver in use: mlx5_core
        Kernel modules: mlx5_core

Code:
dkms status
kernel-mft-dkms, 4.18.0, 5.10.0-19-amd64, x86_64: installed
kernel-mft-dkms, 4.18.0, 5.15.116-1-pve, x86_64: installed
kernel-mft-dkms, 4.18.0, 5.15.126-1-pve, x86_64: installed
kernel-mft-dkms, 4.18.0, 5.15.131-2-pve, x86_64: installed
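
For reference, this is roughly what I'd plan to run before the upgrade (assuming kernel-mft-dkms was installed as a Debian package from the Mellanox/NVIDIA MFT bundle - corrections welcome, and worth double-checking the package name with dpkg -l | grep dkms first):

Code:
# list dkms modules and the kernels they are built for
dkms status

# remove the kernel-mft-dkms module from dkms for all kernels
dkms remove kernel-mft-dkms/4.18.0 --all

# if it came in as a .deb, purge the package so it is not rebuilt on future kernel updates
apt purge kernel-mft-dkms

# re-run the upgrade checker afterwards
pve7to8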

Thanks.
 
Any update on this? I installed a CX4 MT27710 and it's having issues with the PVE 8 6.5 kernel.
 
The card never properly loads... well, it loads and then unloads itself. I live-booted this computer with Ubuntu and the card works as expected, so I don't think it's a hardware issue. I'm also using the 6.5.11-7 default kernel from a fresh 8.1.3 install.

Code:
[    1.455990] mlx5_core 0000:01:00.0: firmware version: 14.32.1010
[    1.456052] mlx5_core 0000:01:00.0: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[    1.766570] mlx5_core 0000:01:00.0: E-Switch: Total vports 10, per vport: max uc(128) max mc(2048)
[    1.772283] mlx5_core 0000:01:00.0: Port module event: module 0, Cable unplugged
[    2.162247] mlx5_core 0000:01:00.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0 basic)
[    2.164089] mlx5_core 0000:01:00.1: firmware version: 14.32.1010
[    2.164120] mlx5_core 0000:01:00.1: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[    2.514691] mlx5_core 0000:01:00.1: E-Switch: Total vports 10, per vport: max uc(128) max mc(2048)
[    2.522851] mlx5_core 0000:01:00.1: Port module event: module 1, Cable unplugged
[    2.931840] mlx5_core 0000:01:00.1: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0 basic)
[    3.000990] mlx5_core 0000:01:00.0 eno1np0: renamed from eth0
[    3.017144] mlx5_core 0000:01:00.1 enp1s0f1np1: renamed from eth1
[   17.745789] mlx5_core 0000:01:00.0: E-Switch: Unload vfs: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
[   17.781133] mlx5_core 0000:01:00.0: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
[   22.182987] mlx5_core 0000:01:00.0: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
[   22.924290] mlx5_core 0000:01:00.0: E-Switch: cleanup
[   24.009196] mlx5_core 0000:01:00.1: E-Switch: Unload vfs: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
[   24.055793] mlx5_core 0000:01:00.1: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
[   28.435662] mlx5_core 0000:01:00.1: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
[   29.264535] mlx5_core 0000:01:00.1: E-Switch: cleanup
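
For anyone debugging the same thing, something like this should show whether the in-tree mlx5_core driver is actually in use and whether any leftover dkms build is still around (PCI address taken from the log above; adjust to yours):

Code:
# show which kernel driver is bound to the NIC
lspci -k -s 01:00.0

# check for leftover dkms-built modules
dkms status

# confirm which mlx5_core module file the running kernel loads
modinfo mlx5_core | grep -E '^(filename|version)'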
 
Looks better here:

Code:
[   10.320111] mlx5_core 0000:c1:00.0: firmware version: 14.32.1010
[   10.320160] mlx5_core 0000:c1:00.0: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[   10.625522] mlx5_core 0000:c1:00.0: E-Switch: Total vports 10, per vport: max uc(128) max mc(2048)
[   10.629737] mlx5_core 0000:c1:00.0: Port module event: module 0, Cable plugged
[   10.962324] mlx5_core 0000:c1:00.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0 basic)
[   10.963760] mlx5_core 0000:c1:00.1: firmware version: 14.32.1010
[   10.963860] mlx5_core 0000:c1:00.1: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[   11.299516] mlx5_core 0000:c1:00.1: E-Switch: Total vports 10, per vport: max uc(128) max mc(2048)
[   11.305171] mlx5_core 0000:c1:00.1: Port module event: module 1, Cable plugged
[   11.676014] mlx5_core 0000:c1:00.1: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0 basic)
[   11.709223] mlx5_core 0000:c1:00.0 enp193s0f0np0: renamed from eth0
[   11.757139] mlx5_core 0000:c1:00.1 enp193s0f1np1: renamed from eth1
[   19.942214] mlx5_core 0000:c1:00.0 enp193s0f0np0: Link up
[   20.674890] mlx5_core 0000:c1:00.1 enp193s0f1np1: Link up
[   20.719454] mlx5_core 0000:c1:00.0 enp193s0f0np0: entered allmulticast mode
[   20.719967] mlx5_core 0000:c1:00.1 enp193s0f1np1: entered allmulticast mode
[   26.907098] mlx5_core 0000:c1:00.0: lag map: port 1:1 port 2:2
[   26.907154] mlx5_core 0000:c1:00.0: shared_fdb:0 mode:queue_affinity
[   27.013065] mlx5_core 0000:c1:00.0 enp193s0f0np0: entered promiscuous mode
[   27.013111] mlx5_core 0000:c1:00.1 enp193s0f1np1: entered promiscuous mode
[   27.053619] mlx5_core 0000:c1:00.1: mlx5e_fs_set_rx_mode_work:842:(pid 733): S-tagged traffic will be dropped while C-tag vlan stripping is enabled
 
So I was able to fix this by completely rebuilding my Proxmox node, even though the original was only weeks old and I hadn't done anything to it besides regular updates.

Funnily enough, after several hours the node's interfaces dropped offline and now aren't detected. I just saw a message about both interfaces getting cleaned up and removed.
 
Did you update the firmware? That can make a huge difference with Mellanox NICs: https://network.nvidia.com/support/firmware/mlxup-mft/

IIRC we have some ConnectX-4 NICs in our test lab hardware, and so far they usually work fine without extra drivers. Updating the firmware on a somewhat regular basis helps to avoid issues, though.
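
Roughly, checking and updating would look something like this (the interface name is just an example taken from the logs earlier in the thread; mlxup is the tool from the link above):

Code:
# show the firmware version the NIC is currently running
ethtool -i enp193s0f0np0 | grep firmware

# run mlxup; without arguments it scans for Mellanox/NVIDIA NICs
# and interactively offers any available firmware updates
chmod +x mlxup
./mlxup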
 
Thanks, I just checked and the firmware is up to date. I have not seen any thermal messages, and it's in a cool room without a case fan, but judging by feel these cards run hot.
 
On 6, I needed the Mellanox drivers for stability. Otherwise, under higher load the card was doing strange things, like taking a coffee break. It hasn't done that yet with PVE 8 and the in-kernel driver. <knockonwood>
I can't get anything past 800 Mbit/s on a ConnectX-3 or -4 on PVE 8. With multiple streams I top out around 3 Gbit/s, a far cry from the potential 10 or 25.
 
The card never properly loads... well, it loads and then unloads itself. I live-booted this computer with Ubuntu and the card works as expected, so I don't think it's a hardware issue. I'm also using the 6.5.11-7 default kernel from a fresh 8.1.3 install.

Hi, have you solved this problem? I've encountered the same problem you described.
 
I can't get anything past 800 Mbit/s on a ConnectX-3 or -4 on PVE 8. With multiple streams I top out around 3 Gbit/s, a far cry from the potential 10 or 25.
That's not related to an update from PVE7 to PVE8, is it?

I did the upgrade here, and performance is the same as before (but now without a separate kernel module installed in PVE 8):

Code:
iperf -c 192.168.1.6 -P 8
------------------------------------------------------------
Client connecting to 192.168.1.6, TCP port 5001
TCP window size:  208 KByte (default)
------------------------------------------------------------
[ 10] local 192.168.1.2 port 63702 connected with 192.168.1.6 port 5001
[  5] local 192.168.1.2 port 63697 connected with 192.168.1.6 port 5001
[  3] local 192.168.1.2 port 63695 connected with 192.168.1.6 port 5001
[  7] local 192.168.1.2 port 63699 connected with 192.168.1.6 port 5001
[  6] local 192.168.1.2 port 63698 connected with 192.168.1.6 port 5001
[  4] local 192.168.1.2 port 63696 connected with 192.168.1.6 port 5001
[  9] local 192.168.1.2 port 63701 connected with 192.168.1.6 port 5001
[  8] local 192.168.1.2 port 63700 connected with 192.168.1.6 port 5001
[ ID] Interval       Transfer     Bandwidth
[ 10]  0.0-10.0 sec  1.12 GBytes   963 Mbits/sec
[  5]  0.0-10.0 sec  1.19 GBytes  1.02 Gbits/sec
[  3]  0.0-10.0 sec  1.25 GBytes  1.08 Gbits/sec
[  7]  0.0-10.0 sec  1.06 GBytes   906 Mbits/sec
[  6]  0.0-10.0 sec  1.24 GBytes  1.06 Gbits/sec
[  4]  0.0-10.0 sec  1.24 GBytes  1.06 Gbits/sec
[  9]  0.0-10.0 sec  1.17 GBytes  1.00 Gbits/sec
[  8]  0.0-10.0 sec  1.18 GBytes  1.01 Gbits/sec
[SUM]  0.0-10.0 sec  9.45 GBytes  8.10 Gbits/sec

(I only have a 10 Gbit connection from the client.)
 
That's not related to an update from PVE7 to PVE8, is it?

Turns out it had nothing to do with PVE 8... I had added my PVE box to a different VLAN after upgrading, which created two problems: crossing VLANs on UniFi gear introduces a huge penalty, and the link was randomly negotiating down to 1 Gbit, so I forced 10 Gbit on the switch. It's not the best, but it's somewhat serviceable at 9.3 Gbit/s, though with many retries.
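
For anyone hitting the same negotiation issue, a rough sketch of how one could check and, if needed, pin the link speed on the Proxmox side as well (interface name taken from the logs earlier in the thread; adjust to yours, and note that disabling autonegotiation has to match the switch port config):

Code:
# check the currently negotiated speed and duplex
ethtool enp193s0f0np0 | grep -E 'Speed|Duplex|Auto-negotiation'

# force 10 Gbit full duplex with autonegotiation off (must match the switch port)
ethtool -s enp193s0f0np0 speed 10000 duplex full autoneg off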

Testing on same VLAN, 18.1 to 18.1:

Code:
Connecting to host 172.18.1.8, port 5201
[  5] local 172.18.1.102 port 60706 connected to 172.18.1.8 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  1.08 GBytes  9.24 Gbits/sec  251    970 KBytes
[  5]   1.00-2.00   sec  1.09 GBytes  9.33 Gbits/sec  699    942 KBytes
[  5]   2.00-3.00   sec  1.09 GBytes  9.36 Gbits/sec  176    967 KBytes
[  5]   3.00-4.00   sec  1.09 GBytes  9.35 Gbits/sec  200    942 KBytes
[  5]   4.00-5.00   sec  1.08 GBytes  9.31 Gbits/sec  218    963 KBytes
[  5]   5.00-6.00   sec  1.09 GBytes  9.34 Gbits/sec  341    679 KBytes
[  5]   6.00-7.00   sec  1.09 GBytes  9.34 Gbits/sec  773    509 KBytes
[  5]   7.00-8.00   sec  1.09 GBytes  9.36 Gbits/sec  496    936 KBytes
[  5]   8.00-9.00   sec  1.09 GBytes  9.37 Gbits/sec  299   1.03 MBytes
[  5]   9.00-10.00  sec  1.09 GBytes  9.36 Gbits/sec  381    987 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  10.9 GBytes  9.34 Gbits/sec  3834          sender
[  5]   0.00-10.00  sec  10.9 GBytes  9.34 Gbits/sec                receiver




Testing on separate VLANs, 18.1 to 18.0:

Code:
Connecting to host 172.18.0.216, port 5201
[  5] local 172.18.0.214 port 60244 connected to 172.18.0.216 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   333 MBytes  2.78 Gbits/sec   29    165 KBytes
[  5]   1.00-2.00   sec   325 MBytes  2.73 Gbits/sec    0    168 KBytes
[  5]   2.00-3.00   sec   295 MBytes  2.48 Gbits/sec    0    171 KBytes
[  5]   3.00-4.00   sec   292 MBytes  2.45 Gbits/sec    0    171 KBytes
[  5]   4.00-5.00   sec   312 MBytes  2.62 Gbits/sec    0    187 KBytes
[  5]   5.00-6.00   sec   300 MBytes  2.52 Gbits/sec    0    201 KBytes
[  5]   6.00-7.00   sec   300 MBytes  2.52 Gbits/sec    0    201 KBytes
[  5]   7.00-8.00   sec   302 MBytes  2.53 Gbits/sec    0    201 KBytes
[  5]   8.00-9.00   sec   301 MBytes  2.54 Gbits/sec    0    201 KBytes
[  5]   9.00-10.00  sec   298 MBytes  2.49 Gbits/sec    0    201 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  2.99 GBytes  2.57 Gbits/sec   29             sender
[  5]   0.00-10.00  sec  2.99 GBytes  2.57 Gbits/sec                  receiver
 