Networking problems with Mellanox cards on the newest kernel 5.13

Pjotr

Member
Dec 5, 2018
Stuttgart
Proxmox cluster with Mellanox ConnectX-4 Lx network cards:


Everything worked fine under kernel 5.11. After upgrading to the newest kernel 5.13 we have massive networking problems: it is impossible to dump SFP info with ethtool, we get bit errors, massive problems with the locally installed Ceph instance, and so on. After reverting to the 5.11 kernel everything seems to be fine again...
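A sketch of how one might pin the known-good 5.11 kernel at boot on a GRUB-based Proxmox install until the problem is understood (the menu entry title below is an assumption; list your actual titles first, and note that UEFI/systemd-boot installs manage entries via proxmox-boot-tool instead):

```shell
# List the exact boot entry titles available on this host
awk -F\' '/menuentry /{print $2}' /boot/grub/grub.cfg

# In /etc/default/grub, point GRUB_DEFAULT at the known-good kernel,
# using the "submenu>entry" syntax (entry title is an example; use one
# printed by the command above):
# GRUB_DEFAULT="Advanced options for Proxmox VE GNU/Linux>Proxmox VE GNU/Linux, with Linux 5.11.22-7-pve"

# Apply the change
update-grub
```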

# ethtool -i enp1s0f1np1
driver: mlx5_core
version: 5.13.19-4-pve
firmware-version: 14.29.1016 (MT_2420110004)
expansion-rom-version:
bus-info: 0000:01:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes
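For context, the "SFP infos" referred to above are normally read from the transceiver module EEPROM with `ethtool -m` (the `supports-eeprom-access: no` line above concerns the NIC's own EEPROM via `ethtool -e`, which is a separate driver path). A sketch, with the interface name taken from this post:

```shell
# Dump the transceiver module EEPROM; on modules with digital
# diagnostics (DOM) this includes temperature, voltage and the
# TX ("laser output power") / RX power readings.
ethtool -m enp1s0f1np1

# Only the optical power lines:
ethtool -m enp1s0f1np1 | grep -i 'power'
```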


# pveversion -v
proxmox-ve: 7.1-1 (running kernel: 5.13.19-4-pve)
pve-manager: 7.1-10 (running version: 7.1-10/6ddebafe)
pve-kernel-helper: 7.1-12
pve-kernel-5.13: 7.1-7
pve-kernel-5.11: 7.0-10
pve-kernel-5.13.19-4-pve: 5.13.19-9
pve-kernel-5.13.19-3-pve: 5.13.19-7
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-1-pve: 5.11.22-2
ceph: 16.2.7
ceph-fuse: 16.2.7
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.1
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-6
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-3
libpve-guest-common-perl: 4.1-1
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.1-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.3.0-2
proxmox-backup-client: 2.1.5-1
proxmox-backup-file-restore: 2.1.5-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-6
pve-cluster: 7.1-3
pve-container: 4.1-4
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-5
pve-ha-manager: 3.3-3
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.1-2
pve-xtermjs: 4.16.0-1
qemu-server: 7.1-4
smartmontools: 7.2-1
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.2-pve1


A quick look at the corresponding network driver in the different kernel versions for this card:

lib/modules/5.11.22-7-pve/kernel/drivers/net/ethernet/mellanox/mlx5/core# ls -la
-rw-r--r-- 1 root root 2309448 Nov 7 21:46 mlx5_core.ko

lib/modules/5.13.19-4-pve/kernel/drivers/net/ethernet/mellanox/mlx5/core# ls -la
-rw-r--r-- 1 root root 2060120 Feb 7 11:01 mlx5_core.ko

tmp/kernel-ausgepackt/5.15.19-2/lib/modules/5.15.19-2-pve/kernel/drivers/net/ethernet/mellanox/mlx5/core# ls -la
-rw-r--r-- 1 root root 2522672 Feb 8 11:19 mlx5_core.ko

Ubuntu native Kernel:
tmp/kernel-ausgepackt/ubuntu-5.13.0-30/lib/modules/5.13.0-30-generic/kernel/drivers/net/ethernet/mellanox/mlx5/core# ls -la
-rw-r--r-- 1 root root 2652337 Feb 4 17:40 mlx5_core.ko



To me that looks a little strange. Does anyone else have similar problems with the newest kernel and Mellanox cards? Any explanations?
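The raw .ko file size depends heavily on build configuration and debug info, so a size diff alone isn't conclusive. One way to get a more meaningful comparison is `modinfo` on the module files (paths taken from the listings above; not every field is present for every in-tree build):

```shell
# Compare module metadata rather than file size alone
for ko in \
    /lib/modules/5.11.22-7-pve/kernel/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.ko \
    /lib/modules/5.13.19-4-pve/kernel/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.ko
do
    echo "== $ko"
    modinfo "$ko" | grep -E '^(version|vermagic|srcversion):'
done
```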
 

spirit

Famous Member
Apr 2, 2010
www.odiso.com
I have a ConnectX-4 Lx with kernel 5.13; it works fine for me.

Code:
82:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]


driver: mlx5_core
version: 5.13.19-2-pve
firmware-version: 14.23.1020 (MT_2420110004)
expansion-rom-version:
bus-info: 0000:82:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

Another one:

Code:
 ethtool -i eth0
driver: mlx5_core
version: 5.13.19-4-pve
firmware-version: 14.31.1014 (MT_2420110004)
expansion-rom-version: 
bus-info: 0000:04:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes
 

Pjotr

Hello spirit,

For your interface with driver version 5.13.19-4-pve: are you able to dump the interface module information, e.g. "laser output power"? And if you are, can you tell me what kind of SFP module you are using? We are using SFPs from Flexoptix in the Mellanox cards.

I'm wondering about the different size of the driver in this kernel version...

Thanks in advance

Peter
 

spirit

I don't use an SFP transceiver module / fiber cable; I'm using a copper DAC cable directly. (https://network.nvidia.com/related-docs/prod_cables/PB_MCP2M00-Axxx_25GbE_SFP28_DAC.pdf)
 

Pjotr

First of all, thank you for the information, and yes, I tested the opt-in kernel 5.15.19-1 with the same result.

For me it's not just a problem with the EEPROM dump: I also have bit errors on the interface and, as a consequence, trouble with the locally installed Ceph instance...

Next week I will change the SFPs and the network cards.
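Before swapping hardware, the physical-layer counters can indicate whether the bit errors are really happening on the link. A sketch (interface name is an example; exact counter names vary by driver, the `*_phy` CRC/symbol counters shown here are typical for mlx5):

```shell
# Generic per-interface error totals
ip -s link show enp1s0f1np1

# Driver statistics: steadily rising CRC/symbol/FEC error counters on
# 5.13 that stay flat on 5.11 would point at the driver/link setup
# rather than the SFP or fibre itself.
ethtool -S enp1s0f1np1 | grep -Ei 'crc|symbol|fec|err' | grep -v ': 0$'
```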
 

spirit

I really don't know, sorry :( . I really don't have any network errors or packet loss with 5.13. (I'm also using Ceph on these nodes and haven't seen any latency or throughput problems.)
 
We did a lot of research on this and here is the current result:

Vanilla Linux kernels starting with 5.13.0 have this issue. Somewhere between kernel 5.15.12 and kernel 5.15.25 the issue was fixed.

We assume that one of the kernel updates between 5.15.12 and 5.15.25 that contain changes to the mlx5 code fixed the issue; we will do further testing tomorrow.

Details can be found here (in German):
https://www.thomas-krenn.com/de/wiki/Mellanox_ConnectX-4/5/6_Bitfehler_ab_Linux_Kernel_5.13
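To narrow down which update contained the fix, one could list the mlx5 commits between the two stable tags; a sketch that assumes a local checkout of the linux-stable tree:

```shell
# Inside a clone of the stable kernel tree
# (git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git):
git log --oneline v5.15.12..v5.15.25 -- drivers/net/ethernet/mellanox/mlx5

# Or bisect directly for the fixing commit; since we are hunting a fix
# rather than a regression, custom terms keep the direction straight:
#   git bisect start --term-old=broken --term-new=fixed
#   git bisect fixed v5.15.25
#   git bisect broken v5.15.12
```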

I'll post an update as soon as we have further information on this.
 
@spirit it seems you are right:
Our testing shows:
  • Vanilla Kernel 5.15.12 has the issue
  • Vanilla Kernel 5.15.25 works
We will do the following tests tomorrow:
  • Tests with Vanilla Kernel 5.15.19 (should have the issue)
  • Tests with Vanilla Kernel 5.15.20 (should work without the problem)
We'll keep you updated :)
 
