[SOLVED] Mellanox issue after upgrade: cannot connect to Ceph network

RobFantini

hello - we use Mellanox ConnectX-4 NICs for our Ceph storage network.

After upgrading 1 of 7 nodes and rebooting, I cannot get a connection to the Ceph network.

From dmesg the device names look OK for /etc/network/interfaces:
Code:
[Thu Sep 10 14:19:57 2020] mlx5_core 0000:03:00.1: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[Thu Sep 10 14:19:57 2020] ixgbe 0000:05:00.1: Multiqueue Enabled: Rx Queue count = 48, Tx Queue count = 48 XDP Queue count = 0
[Thu Sep 10 14:19:57 2020] mlx5_ib: Mellanox Connect-IB Infiniband driver v5.0-0
[Thu Sep 10 14:19:57 2020] mlx5_core 0000:03:00.0 enp3s0f0: renamed from eth1
[Thu Sep 10 14:19:57 2020] mlx5_core 0000:03:00.1 enp3s0f1: renamed from eth2

Code:
# /etc/network/interfaces
..
iface enp3s0f0 inet manual
iface enp3s0f1 inet manual
auto bond2
iface bond2 inet static
      address 10.11.12.8
      netmask  255.255.255.0
      slaves  enp3s0f0 enp3s0f1
      bond_miimon 100
      bond_mode active-backup
      mtu 9000
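
(Not part of the original post, just a thought: Proxmox VE 6 ships ifupdown2, which prefers the `bond-` spelling of the bonding options over the legacy `slaves`/`bond_*` keywords. A sketch of the same stanza in that form, keeping the addressing from above, would be:)

Code:
auto bond2
iface bond2 inet static
      address 10.11.12.8
      netmask 255.255.255.0
      bond-slaves enp3s0f0 enp3s0f1
      bond-miimon 100
      bond-mode active-backup
      mtu 9000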

Code:
# pveversion -v
proxmox-ve: 6.2-1 (running kernel: 5.4.60-1-pve)
pve-manager: 6.2-11 (running version: 6.2-11/22fb4983)
pve-kernel-5.4: 6.2-6
pve-kernel-helper: 6.2-6
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.4.55-1-pve: 5.4.55-1
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
ceph: 14.2.11-pve1
ceph-fuse: 14.2.11-pve1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libpve-access-control: 6.1-2
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-1
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-6
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-12
pve-cluster: 6.1-8
pve-container: 3.1-13
pve-docs: 6.2-5
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-2
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-1
pve-qemu-kvm: 5.1.0-1
pve-xtermjs: 4.7.0-2
pve-zsync: 2.0-3
qemu-server: 6.2-14
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.4-pve1

any suggestions appreciated.
 
Do you have any error message?

We have no problems with the same kernel and also mlx cards:

Code:
root@prox2:~# dmesg | grep -e mlx
[    3.724951] mlx5_core 0000:af:00.0: firmware version: 14.23.1020
[    3.724981] mlx5_core 0000:af:00.0: 63.008 Gb/s available PCIe bandwidth (8 GT/s x8 link)
[    4.221505] mlx5_core 0000:af:00.0: E-Switch: Total vports 10, per vport: max uc(1024) max mc(16384)
[    4.225986] mlx5_core 0000:af:00.0: Port module event: module 0, Cable plugged
[    4.241749] mlx5_core 0000:af:00.1: firmware version: 14.23.1020
[    4.241795] mlx5_core 0000:af:00.1: 63.008 Gb/s available PCIe bandwidth (8 GT/s x8 link)
[    4.747349] mlx5_core 0000:af:00.1: E-Switch: Total vports 10, per vport: max uc(1024) max mc(16384)
[    4.752699] mlx5_core 0000:af:00.1: Port module event: module 1, Cable plugged
[    4.768556] mlx5_core 0000:af:00.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[    4.991011] mlx5_core 0000:af:00.1: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[    5.234099] mlx5_ib: Mellanox Connect-IB Infiniband driver v5.0-0
[    5.234426] mlx5_core 0000:af:00.1 ens6f1: renamed from eth1
[    5.251796] mlx5_core 0000:af:00.0 ens6f0: renamed from eth0
[   14.126333] mlx5_core 0000:af:00.0 ens6f0: Link down
[   14.133886] mlx5_core 0000:af:00.0 ens6f0: Link up
[   14.517751] mlx5_core 0000:af:00.1 ens6f1: Link down
[   14.526010] mlx5_core 0000:af:00.1 ens6f1: Link up
[   14.601835] mlx5_core 0000:af:00.0 ens6f0: S-tagged traffic will be dropped while C-tag vlan stripping is enabled
[   16.972955] mlx5_core 0000:af:00.0: lag map port 1:1 port 2:1
root@prox2:~# lspci | grep -e "Mella"
af:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
af:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
root@prox2:~# uname -a
Linux prox2 5.4.60-1-pve #1 SMP PVE 5.4.60-2 (Fri, 04 Sep 2020 10:24:50 +0200) x86_64 GNU/Linux
 
When using the prior kernel, the Ceph network works:
Code:
Linux sys8 5.4.55-1-pve #1 SMP PVE 5.4.55-1 (Mon, 10 Aug 2020 10:26:27 +0200) x86_64 GNU/Linux
 
Sorry, but without an error message I can't help you ¯\_(ツ)_/¯
Try sneaking around in /var/log/ceph and look at the logfiles, try pinging nodes on the Ceph public network, or check whether the interface came up .. or or or
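
(For reference, the checks above as concrete commands. The interface names and the 10.11.12.0/24 Ceph network are taken from the first post; 10.11.12.1 just stands in for some other node's address - adjust to your setup.)

Code:
# check that the slave NICs and the bond came up
ip -br link show enp3s0f0 enp3s0f1 bond2

# check which slave is active and whether the links are up
cat /proc/net/bonding/bond2

# try to reach another node on the ceph network,
# then again with a full-size jumbo frame (9000 MTU - 28 bytes of headers)
ping -c 3 10.11.12.1
ping -c 3 -M do -s 8972 10.11.12.1

# look for recent errors in the ceph logs
grep -i error /var/log/ceph/*.log | tail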
 
So it could be that with the current mlx5_ib kernel module a different bond mode is needed? Just guessing .. I remember a bond mode change was needed after a kernel update a few years ago. I can test suggestions, if any, on the weekend.
 
Have you tried updating the NICs' firmware to the latest version? AFAICT the latest firmware for ConnectX-4 Lx is 14.28.
 
The upgrade, which was very easy using mlxup from NVIDIA/Mellanox, fixed the issue. For me this was the easiest firmware upgrade I've done. Thank you both for the advice.
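
(For anyone finding this later, the mlxup flow is roughly the following - assuming the tool was downloaded from the NVIDIA/Mellanox firmware page and is run on the node with the NIC:)

Code:
# list detected Mellanox devices with installed vs. available firmware
./mlxup --query

# run interactively; it offers to download and flash the newer image
./mlxup

# the new firmware becomes active after a reboot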
 