PVE 6 and Mellanox 4.x drivers

jermudgeon

I'm seeing massive instability with the built-in Mellanox 4.0.0 drivers (mlx4_en). However, I haven't been able to compile the Mellanox drivers (for Debian 9.6).

Has anyone else had success or failure with this setup?

My particular cards are

lspci -k
82:00.0 Ethernet controller: Mellanox Technologies MT26448 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0)
Subsystem: Mellanox Technologies MT26448 [ConnectX EN 10GigE, PCIe 2.0 5GT/s]
Kernel driver in use: mlx4_core
Kernel modules: mlx4_core

Thanks.
 
82:00.0 Ethernet controller: Mellanox Technologies MT26448 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0)
If I am not mistaken, this is an old card. The mlx4 module on the Mellanox website seems to be for kernel 4.4. But you could check for a firmware update; maybe that's all that is needed to get rid of the instability.
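For anyone following along, here is a minimal sketch of one way to read the driver name and firmware version a port reports, using `ethtool -i`. The interface name `enp130s0` and the helper `ethtool_field` are hypothetical and only for illustration; substitute your own interface.

```shell
#!/bin/sh
# Extract one field (e.g. "driver" or "firmware-version") from
# `ethtool -i` output supplied on stdin.
ethtool_field() {
    awk -F': ' -v key="$1" '$1 == key { print $2; exit }'
}

# Hypothetical interface name; override with IFACE=... when running.
IFACE="${IFACE:-enp130s0}"

if command -v ethtool >/dev/null 2>&1 && [ -e "/sys/class/net/$IFACE" ]; then
    info=$(ethtool -i "$IFACE" 2>/dev/null || true)
    echo "driver:   $(printf '%s\n' "$info" | ethtool_field driver)"
    echo "firmware: $(printf '%s\n' "$info" | ethtool_field firmware-version)"
else
    echo "ethtool or interface $IFACE not available"
fi
```

The reported firmware version can then be compared by hand against whatever the vendor lists for the card.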
 
Thanks, Alwin. I've checked the firmware; it is current. This card was stable under PVE versions 3, 4 and 5. It also seems to be stable when not running openvswitch bonding. (Openvswitch is running stably on other adapters that are not this card.)

So the failure seems to be a combination of these cards, openvswitch, PVE6, and bonding.

Any other thoughts? The current Mellanox drivers are version 4.6 (vs 4.0 installed) but I haven't managed to get them built.
 
The Mellanox module is already built in (modinfo mlx4_core).
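As a rough sketch of that check, the snippet below prints the running kernel and the version field `modinfo` reports for the mlx4 modules. The helper name `module_version` is made up for this example, and it assumes the modules expose a version field at all.

```shell
#!/bin/sh
# Print the "version" field of a kernel module, or "unknown" if modinfo
# is missing, the module is not found, or it has no version field.
module_version() {
    if command -v modinfo >/dev/null 2>&1; then
        v=$(modinfo -F version "$1" 2>/dev/null || true)
    else
        v=""
    fi
    echo "${v:-unknown}"
}

echo "kernel:    $(uname -r)"
echo "mlx4_core: $(module_version mlx4_core)"
echo "mlx4_en:   $(module_version mlx4_en)"
```

If this still shows 4.0 after installing a vendor package, the in-tree module is most likely the one being loaded.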
 
For what it's worth, the devices have been stable since I removed the openvswitch bonding and switching layer.

I may try adding back in just the switching layer without the LACP.

Haven't worked any more on compiling a newer kernel module.

Ultimately I will need all components working again as they were before -- these cards, openvswitch, bonding, and PVE. Openvswitch is the most suspect component at this point -- there were instances where I could restart openvswitch and the link(s) would come back up. On other occasions the kernel would freeze. I have rsyslog running to be able to capture these instances locally and remotely.
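To go with the syslog capture, a small snapshot script along these lines might help record the Open vSwitch bond and LACP state right before or after a link drop. This is only a sketch: the function name `ovs_snapshot` is hypothetical, and it assumes the standard `ovs-vsctl` and `ovs-appctl` tools are installed.

```shell
#!/bin/sh
# Print a timestamped snapshot of Open vSwitch bond/LACP state and any
# mlx4 messages from the kernel ring buffer.
ovs_snapshot() {
    echo "=== snapshot at $(date) on kernel $(uname -r) ==="
    if command -v ovs-appctl >/dev/null 2>&1; then
        ovs-vsctl show || true
        ovs-appctl bond/show || true
        ovs-appctl lacp/show || true
    else
        echo "ovs-appctl not found; skipping Open vSwitch state"
    fi
    # Kernel messages from the mlx4 driver, if the ring buffer is readable.
    dmesg 2>/dev/null | grep -i mlx4 || true
}

ovs_snapshot
```

Appending the output to a file at regular intervals gives something to line up against the syslog timestamps when the interfaces go down.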
 
Hello, what do you mean by instabilities? VMs losing their network connection?

Just asking because I have the same setup but without Mellanox cards (Chelsio 10G on some boxes and Broadcom 10G on others). Two of the boxes have active-passive bonding, two others use LACP bonding. Same issue on both configurations.

All VMs I migrate to these Proxmox 6.x boxes running kernel 5.0 lose network after a while. When falling back to host kernel 4.15 on the same boxes, no issues.
Topic here: https://forum.proxmox.com/threads/proxmox-5-4-to-6-0-strange-network-issues.56086/
 
In this case the bond was not used for VM access, only for ceph (storage). 10G interfaces, similar to your configuration. I have removed both the bonding and the Open vSwitch bridge (both Open vSwitch features), and it has been stable for two days. Previously we could usually last 8-12 hours before those network interfaces would become unusable.

I have not tried reverting to kernel 4.15 -- I can't because of ceph, IIRC.
 
I have not tried reverting to kernel 4.15 -- I can't because of ceph, IIRC.
Only if you use KRBD for VMs/CTs; in that case the client version might be too low.
 
