[SOLVED] Mellanox issue after upgrade: cannot connect to Ceph network

RobFantini

Hello, we use Mellanox ConnectX-4 cards for our Ceph storage network.

After upgrading 1 of 7 nodes and rebooting, that node cannot connect to the Ceph network.

From dmesg, the device names look correct and match /etc/network/interfaces:
Code:
[Thu Sep 10 14:19:57 2020] mlx5_core 0000:03:00.1: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[Thu Sep 10 14:19:57 2020] ixgbe 0000:05:00.1: Multiqueue Enabled: Rx Queue count = 48, Tx Queue count = 48 XDP Queue count = 0
[Thu Sep 10 14:19:57 2020] mlx5_ib: Mellanox Connect-IB Infiniband driver v5.0-0
[Thu Sep 10 14:19:57 2020] mlx5_core 0000:03:00.0 enp3s0f0: renamed from eth1
[Thu Sep 10 14:19:57 2020] mlx5_core 0000:03:00.1 enp3s0f1: renamed from eth2

Code:
# /etc/network/interfaces
..
iface enp3s0f0 inet manual
iface enp3s0f1 inet manual
auto bond2
iface bond2 inet static
      address 10.11.12.8
      netmask  255.255.255.0
      slaves  enp3s0f0 enp3s0f1
      bond_miimon 100
      bond_mode active-backup
      mtu 9000

Code:
# pveversion -v
proxmox-ve: 6.2-1 (running kernel: 5.4.60-1-pve)
pve-manager: 6.2-11 (running version: 6.2-11/22fb4983)
pve-kernel-5.4: 6.2-6
pve-kernel-helper: 6.2-6
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.4.55-1-pve: 5.4.55-1
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
ceph: 14.2.11-pve1
ceph-fuse: 14.2.11-pve1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libpve-access-control: 6.1-2
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-1
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-6
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-12
pve-cluster: 6.1-8
pve-container: 3.1-13
pve-docs: 6.2-5
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-2
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-1
pve-qemu-kvm: 5.1.0-1
pve-xtermjs: 4.7.0-2
pve-zsync: 2.0-3
qemu-server: 6.2-14
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.4-pve1

Any suggestions appreciated.
 
Do you have any error message?

We have no problems with the same kernel and also mlx cards:

Code:
root@prox2:~# dmesg |grep -e mlx
[ 3.724951] mlx5_core 0000:af:00.0: firmware version: 14.23.1020
[ 3.724981] mlx5_core 0000:af:00.0: 63.008 Gb/s available PCIe bandwidth (8 GT/s x8 link)
[ 4.221505] mlx5_core 0000:af:00.0: E-Switch: Total vports 10, per vport: max uc(1024) max mc(16384)
[ 4.225986] mlx5_core 0000:af:00.0: Port module event: module 0, Cable plugged
[ 4.241749] mlx5_core 0000:af:00.1: firmware version: 14.23.1020
[ 4.241795] mlx5_core 0000:af:00.1: 63.008 Gb/s available PCIe bandwidth (8 GT/s x8 link)
[ 4.747349] mlx5_core 0000:af:00.1: E-Switch: Total vports 10, per vport: max uc(1024) max mc(16384)
[ 4.752699] mlx5_core 0000:af:00.1: Port module event: module 1, Cable plugged
[ 4.768556] mlx5_core 0000:af:00.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[ 4.991011] mlx5_core 0000:af:00.1: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[ 5.234099] mlx5_ib: Mellanox Connect-IB Infiniband driver v5.0-0
[ 5.234426] mlx5_core 0000:af:00.1 ens6f1: renamed from eth1
[ 5.251796] mlx5_core 0000:af:00.0 ens6f0: renamed from eth0
[ 14.126333] mlx5_core 0000:af:00.0 ens6f0: Link down
[ 14.133886] mlx5_core 0000:af:00.0 ens6f0: Link up
[ 14.517751] mlx5_core 0000:af:00.1 ens6f1: Link down
[ 14.526010] mlx5_core 0000:af:00.1 ens6f1: Link up
[ 14.601835] mlx5_core 0000:af:00.0 ens6f0: S-tagged traffic will be dropped while C-tag vlan stripping is enabled
[ 16.972955] mlx5_core 0000:af:00.0: lag map port 1:1 port 2:1
root@prox2:~# lspci |grep -e "Mella"
af:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
af:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
root@prox2:~# uname -a
Linux prox2 5.4.60-1-pve #1 SMP PVE 5.4.60-2 (Fri, 04 Sep 2020 10:24:50 +0200) x86_64 GNU/Linux
 
When booting the prior kernel, the Ceph network works:
Linux sys8 5.4.55-1-pve #1 SMP PVE 5.4.55-1 (Mon, 10 Aug 2020 10:26:27 +0200) x86_64 GNU/Linux
 
Sorry, but without an error message I can't help you ¯\_(ツ)_/¯
Try looking through the log files in /var/log/ceph, pinging nodes on the Ceph public network, checking whether the interface came up, and so on.
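
The checks suggested above could be sketched roughly as follows. This is only an illustration, not something posted in the thread: the helper names are made up, and the peer address in the usage comment is an assumed neighbour on the 10.11.12.0/24 Ceph network. Since the bond is configured with mtu 9000, the ping uses a payload that fills a full jumbo frame with fragmentation forbidden, so it also shows whether large frames pass end to end.

```shell
# ICMP payload that exactly fills an IPv4 packet of the given MTU
# (20-byte IPv4 header + 8-byte ICMP header = 28 bytes of overhead).
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Hypothetical helper: show bond state and test reachability of one peer.
# Arguments: bond name, peer address, MTU.
check_ceph_link() {
    bond=$1; peer=$2; mtu=$3
    # Is the bond interface UP, and does it carry the expected address?
    ip -br link show "$bond"
    ip -br addr show "$bond"
    # Which slave is active, and what is the MII status of each slave?
    grep -E 'Currently Active Slave|MII Status' "/proc/net/bonding/$bond"
    # -M do forbids fragmentation, so this fails if jumbo frames do not pass.
    ping -c 3 -M do -s "$(icmp_payload_for_mtu "$mtu")" "$peer"
}

# Usage (10.11.12.9 is an assumed peer address, not taken from the thread):
#   check_ceph_link bond2 10.11.12.9 9000
```

If the small default ping works but the full-size one does not, the problem is more likely MTU related than a dead link.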
 
So could it be that the current mlx5_ib kernel module needs a different bond mode to work? Just guessing. I remember a bond mode change being needed after a kernel update a few years ago. I can test any suggestions on the weekend.
 
Have you tried updating the NIC's firmware to the latest version? AFAICT the latest firmware for the ConnectX-4 Lx is 14.28.
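
To see which firmware a card is currently running, the mlx5_core "firmware version" lines in dmesg (as in the output posted earlier in the thread) can be filtered. A minimal sketch, with made-up function names, assuming GNU sort for the version comparison; `ethtool -i <interface>` also reports a firmware-version field.

```shell
# Print "<pci-address> <firmware-version>" for each mlx5_core
# firmware line read from stdin (for example from dmesg).
mlx5_fw_versions() {
    sed -n 's/.*mlx5_core \([0-9a-f:.]*\): firmware version: \([0-9.]*\).*/\1 \2/p'
}

# Succeed (exit 0) if version $1 is older than version $2.
# Relies on GNU sort -V for version ordering.
fw_older_than() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Usage:
#   dmesg | mlx5_fw_versions
#   fw_older_than 14.23.1020 14.28 && echo "firmware is older than 14.28"
```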
 
The firmware upgrade, which was very easy using mlxup from NVIDIA/Mellanox, fixed the issue. For me this was the easiest firmware upgrade I've done. Thank you both for the advice.
 