[SOLVED] ConnectX-4 Ceph cluster network issue

RobFantini

Hello

I am trying to move our Ceph network to ConnectX-4 cards in Ethernet mode.

I can mount OSDs, and ping/ssh between systems works.

However, "ceph -s" hangs. It seems like cluster communication is blocked somehow.

I've tried two types of switches, thinking the issue was with a switch.

I tested multicast on the first switch and it tested OK.
Multicast:
Code:
omping -c 200 -Cq   10.11.12.3 10.11.12.8 10.11.12.9 10.11.12.10  10.11.12.15

10.11.12.3  : multicast, xmt/rcv/%loss = 128/128/0%, min/avg/max/std-dev = 0.069/0.151/0.268/0.045
10.11.12.8  : multicast, xmt/rcv/%loss = 129/129/0%, min/avg/max/std-dev = 0.057/0.152/0.263/0.053
10.11.12.9  : multicast, xmt/rcv/%loss = 129/129/0%, min/avg/max/std-dev = 0.043/0.162/0.260/0.040
10.11.12.10 : multicast, xmt/rcv/%loss = 129/129/0%, min/avg/max/std-dev = 0.087/0.159/0.230/0.034



Here is some info about the cards:
Code:
# ethtool  -i  enp133s0f1
driver: mlx5_core
version: 5.0-0
firmware-version: 14.14.1100 (MT_2420110034)
expansion-rom-version:
bus-info: 0000:85:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes


# ethtool   enp133s0f1
Settings for enp133s0f1:
        Supported ports: [ FIBRE Backplane ]
        Supported link modes:   1000baseKX/Full
                                10000baseKR/Full
                                25000baseCR/Full
                                25000baseKR/Full
                                25000baseSR/Full
        Supported pause frame use: Symmetric
        Supports auto-negotiation: Yes
        Advertised link modes:  1000baseKX/Full
                                10000baseKR/Full
                                25000baseCR/Full
                                25000baseKR/Full
                                25000baseSR/Full
        Advertised pause frame use: Symmetric
        Advertised auto-negotiation: Yes
        Speed: 10000Mb/s
        Duplex: Full
        Port: Direct Attach Copper
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: d
        Wake-on: d
        Current message level: 0x00000004 (4)
                               link
        Link detected: yes
Interfaces:
Code:
iface enp130s0f0 inet manual
iface enp130s0f1 inet manual
auto bond2
iface bond2 inet static
      address 10.11.12.15
      netmask  255.255.255.0
      slaves enp130s0f0 enp130s0f1
      bond_miimon 100
      bond_mode active-backup
      mtu 9000
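
As a quick sanity check on a bond like this, something along these lines (just a sketch) shows the bonding mode, the currently active slave, and whether the MTU actually applied:
Code:
# bonding mode, MII status and the currently active slave
cat /proc/net/bonding/bond2
# effective MTU on the bond interface
ip link show bond2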

This is my first time using these cards.

The switch is a Quanta LB6M 24-port 10GbE SFP+ / 4x 1GbE L2/L3 switch.

It is flashed to Brocade TurboIron firmware. Multicast is set to passive.


Any clues to get this working?
 
I'm using around 100 servers with ConnectX-4 without any problem or tuning (I'm using Mellanox switches too, SN2100 and SN2700).

Code:
# ethtool enp130s0f0
Settings for enp130s0f0:
    Supported ports: [ FIBRE Backplane ]
    Supported link modes:   1000baseKX/Full 
                            10000baseKR/Full 
                            25000baseCR/Full 
                            25000baseKR/Full 
                            25000baseSR/Full 
    Supported pause frame use: Symmetric
    Supports auto-negotiation: Yes
    Advertised link modes:  1000baseKX/Full 
                            10000baseKR/Full 
                            25000baseCR/Full 
                            25000baseKR/Full 
                            25000baseSR/Full 
    Advertised pause frame use: Symmetric
    Advertised auto-negotiation: Yes
    Link partner advertised link modes:  Not reported
    Link partner advertised pause frame use: No
    Link partner advertised auto-negotiation: Yes
    Speed: 10000Mb/s
    Duplex: Full
    Port: Direct Attach Copper
    PHYAD: 0
    Transceiver: internal
    Auto-negotiation: on
    Supports Wake-on: d
    Wake-on: d
    Current message level: 0x00000004 (4)
                   link
    Link detected: yes

Code:
# ping -f x.x.x.x
PING x.x.x.x(x.x.x.x) 56(84) bytes of data.
.^C
--- otherserver.odiso.net ping statistics ---
92611 packets transmitted, 92610 received, 0% packet loss, time 2370ms
rtt min/avg/max/mdev = 0.014/0.016/1.302/0.013 ms, ipg/ewma 0.025/0.018 ms


Note that Ceph doesn't use multicast, so there's no need to test with omping.


If "ceph -s" hangs, it's because you can't reach the monitors. (Do you use cephx? If yes, have you copied the key?)
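
To double-check whether cephx is on, the auth settings can be read straight from ceph.conf; a rough sketch, assuming the usual Proxmox path /etc/pve/ceph.conf:
Code:
# with cephx disabled, these settings should all come back as "none"
grep auth /etc/pve/ceph.conf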
 
Not using cephx

Also

I did not restart any services.

I just did "ifdown bond1", then using the CLI reconfigured the network so the same Ceph network address was used on a different bond, then "ifup bond2".


Should I restart some services to get connected to the Ceph monitors? If so, which ones?
 
Are the Ceph services on the same server from which you launch "ceph -s" (and where you changed the IP address over to the Mellanox card)?

If yes, I don't know how the Ceph services behave in this case.

You can try to restart the ceph-mon service first (systemctl restart ceph-mon.target), then the OSDs (systemctl restart ceph-osd.target).

You can also restart them from the Proxmox Ceph GUI.
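
If restarting the whole target doesn't help, a single monitor can also be restarted and inspected on its own; a sketch, where "sys8" is just a placeholder for the local mon's id:
Code:
systemctl restart ceph-mon@sys8.service
systemctl status ceph-mon@sys8.service
# recent log lines for that mon
journalctl -u ceph-mon@sys8 -n 50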
 
'Are the Ceph services on the same server from which you launch "ceph -s"?'

Yes, we run Ceph on Proxmox, on the same server where we run "ceph -s" and changed the address.
7-node cluster: 2 nodes mainly for VMs, 5 mainly for OSDs, 3 mons.

Tried the restarts; still not working.


Let me know if you think of anything else to try.

Perhaps with Ceph running on Proxmox, all systems need to be shut down then started up one at a time.


Question: do you have Ceph running on non-Proxmox systems?
Or off of a PVE cluster?
 
Code:
# ping -f -c 1000 10.11.12.3
PING 10.11.12.3 (10.11.12.3) 56(84) bytes of data.
--- 10.11.12.3 ping statistics ---
1000 packets transmitted, 1000 received, 0% packet loss, time 25ms
rtt min/avg/max/mdev = 0.014/0.017/0.101/0.006 ms, ipg/ewma 0.025/0.018 ms

Does the switch seem OK to you?
 
Code:
# ping -f -c 1000 10.11.12.3
PING 10.11.12.3 (10.11.12.3) 56(84) bytes of data.
--- 10.11.12.3 ping statistics ---
1000 packets transmitted, 1000 received, 0% packet loss, time 25ms
rtt min/avg/max/mdev = 0.014/0.017/0.101/0.006 ms, ipg/ewma 0.025/0.018 ms

Does the switch seem OK to you?
No problem.
 
'Are the Ceph services on the same server from which you launch "ceph -s"?'

Yes, we run Ceph on Proxmox, on the same server where we run "ceph -s" and changed the address.
7-node cluster: 2 nodes mainly for VMs, 5 mainly for OSDs, 3 mons.

Tried the restarts; still not working.


Let me know if you think of anything else to try.

Perhaps with Ceph running on Proxmox, all systems need to be shut down then started up one at a time.
For now, have you changed the network on only one node?
If yes, does "ceph -s" work from the other nodes?
Do you use the standard MTU or a bigger MTU like 9000?

Maybe check the Ceph logs (/var/log/ceph/...).

Question: do you have Ceph running on non-Proxmox systems?
Or off of a PVE cluster?

Both. Small clusters on Proxmox nodes for some customers, big clusters on separate nodes.
 
For now, have you changed the network on only one node?
If yes, does "ceph -s" work from the other nodes?
Do you use the standard MTU or a bigger MTU like 9000?

Maybe check the Ceph logs (/var/log/ceph/...).



Both. Small clusters on Proxmox nodes for some customers, big clusters on separate nodes.

All nodes use the new network.

MTU 9000 on all nodes.


I am working from off site and noticed that one node does not show a link per ethtool.

Also, NFS is on the same network; ethtool shows a link but it does not ping.

So I've got to get some sleep then go in and fix those two items.

Working interface example on a cluster node:
Code:
iface ens6f1 inet manual
auto bond2
iface bond2 inet static
      address 10.11.12.10
      netmask  255.255.255.0
      slaves ens6f0 ens6f1
      bond_miimon 100
      bond_mode active-backup
      mtu 9000

NFS server on the same switch, not part of the cluster:
Code:
iface enp3s0f0 inet manual
iface enp3s0f1 inet manual
auto bond2
iface bond2 inet static
      address 10.11.12.41
      netmask  255.255.255.0
      slaves enp3s0f0 enp3s0f1
      bond_miimon 100
      bond_mode active-backup
      mtu 9000

Weird that the NFS server cannot ping the switch or the cluster nodes.

Looks like I've got at least one hardware issue. I'll fix that when I get to the server room.

However, it seems like the five nodes connected to the SFP switch should have the Ceph mons working.

So I'll fix the hardware issues.

If you have any other suggestions / ideas, please just post away!

Edit: NFS is probably a switch port setting; it is on a port that may have different settings...
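
One more check for the NFS box that shows link but won't ping: a capture on the bond (a rough sketch, assuming tcpdump is installed) shows whether ARP or ICMP frames are arriving at all while another node pings it:
Code:
tcpdump -ni bond2 arp or icmp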
 
From one mon log:
Code:
2018-11-29 02:45:32.186359 7fbeeff20700  0 mon.sys8@1(electing).data_health(0) update_stats avail 99% total 422GiB, used 3.37GiB, avail 418GiB
2018-11-29 02:46:32.186600 7fbeeff20700  0 mon.sys8@1(electing).data_health(0) update_stats avail 99% total 422GiB, used 3.37GiB, avail 418GiB
2018-11-29 02:47:32.186804 7fbeeff20700  0 mon.sys8@1(electing).data_health(0) update_stats avail 99% total 422GiB, used 3.37GiB, avail 418GiB
2018-11-29 02:48:32.187050 7fbeeff20700  0 mon.sys8@1(electing).data_health(0) update_stats avail 99% total 422GiB, used 3.37GiB, avail 418GiB
2018-11-29 02:49:32.187283 7fbeeff20700  0 mon.sys8@1(electing).data_health(0) update_stats avail 99% total 422GiB, used 3.37GiB, avail 418GiB
2018-11-29 02:50:32.187514 7fbeeff20700  0 mon.sys8@1(electing).data_health(0) update_stats avail 99% total 422GiB, used 3.37GiB, avail 418GiB
2018-11-29 02:51:32.187747 7fbeeff20700  0 mon.sys8@1(electing).data_health(0) update_stats avail 99% total 422GiB, used 3.37GiB, avail 418GiB
2018-11-29 02:52:32.187971 7fbeeff20700  0 mon.sys8@1(electing).data_health(0) update_stats avail 99% total 422GiB, used 3.37GiB, avail 418GiB
2018-11-29 02:53:32.188195 7fbeeff20700  0 mon.sys8@1(electing).data_health(0) update_stats avail 99% total 422GiB, used 3.37GiB, avail 418GiB
2018-11-29 02:54:32.188369 7fbeeff20700  0 mon.sys8@1(electing).data_health(0) update_stats avail 99% total 422GiB, used 3.37GiB, avail 418GiB
2018-11-29 02:55:32.188555 7fbeeff20700  0 mon.sys8@1(electing).data_health(0) update_stats avail 99% total 422GiB, used 3.37GiB, avail 418GiB
2018-11-29 02:56:32.188781 7fbeeff20700  0 mon.sys8@1(electing).data_health(0) update_stats avail 99% total 422GiB, used 3.37GiB, avail 418GiB
2018-11-29 02:57:31.137485 7fbee9f14700  0 -- 10.11.12.8:6789/0 >> 10.11.12.10:6789/0 conn(0x5583cd36b000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 4 vs existing csq=3 existing_state=STATE_OPEN
2018-11-29 02:57:32.189007 7fbeeff20700  0 mon.sys8@1(electing).data_health(0) update_stats avail 99% total 422GiB, used 3.37GiB, avail 418GiB
2018-11-29 02:58:32.189173 7fbeeff20700  0 mon.sys8@1(electing).data_health(0) update_stats avail 99% total 422GiB, used 3.37GiB, avail 418GiB
2018-11-29 02:59:32.189404 7fbeeff20700  0 mon.sys8@1(electing).data_health(0) update_stats avail 99% total 422GiB, used 3.37GiB, avail 418GiB
2018-11-29 03:00:32.189624 7fbeeff20700  0 mon.sys8@1(electing).data_health(0) update_stats avail 99% total 422GiB, used 3.37GiB, avail 418GiB
2018-11-29 03:00:43.374509 7fbeea715700  0 -- 10.11.12.8:6789/0 >> 10.11.12.3:6789/0 conn(0x5583cca01000 :-1 s=STATE_OPEN pgs=65432 cs=1 l=0).fault initiating reconnect
2018-11-29 03:01:32.189846 7fbeeff20700  0 mon.sys8@1(electing).data_health(0) update_stats avail 99% total 422GiB, used 3.37GiB, avail 418GiB
2018-11-29 03:02:32.190014 7fbeeff20700  0 mon.sys8@1(electing).data_health(0) update_stats avail 99% total 422GiB, used 3.37GiB, avail 418GiB
2018-11-29 03:03:32.190257 7fbeeff20700  0 mon.sys8@1(electing).data_health(0) update_stats avail 99% total 422GiB, used 3.37GiB, avail 418GiB
2018-11-29 03:04:32.190437 7fbeeff20700  0 mon.sys8@1(electing).data_health(0) update_stats avail 99% total 422GiB, used 3.37GiB, avail 418GiB
2018-11-29 03:05:32.190655 7fbeeff20700  0 mon.sys8@1(electing).data_health(0) update_stats avail 99% total 422GiB, used 3.37GiB, avail 418GiB
2018-11-29 03:06:32.190871 7fbeeff20700  0 mon.sys8@1(electing).data_health(0) update_stats avail 99% total 422GiB, used 3.37GiB, avail 418GiB
2018-11-29 03:07:32.191116 7fbeeff20700  0 mon.sys8@1(electing).data_health(0) update_stats avail 99% total 422GiB, used 3.37GiB, avail 418GiB
2018-11-29 03:08:32.191337 7fbeeff20700  0 mon.sys8@1(electing).data_health(0) update_stats avail 99% total 422GiB, used 3.37GiB, avail 418GiB
2018-11-29 03:09:32.191507 7fbeeff20700  0 mon.sys8@1(electing).data_health(0) update_stats avail 99% total 422GiB, used 3.37GiB, avail 418GiB
2018-11-29 03:10:32.191765 7fbeeff20700  0 mon.sys8@1(electing).data_health(0) update_stats avail 99% total 422GiB, used 3.37GiB, avail 418GiB

And from another:
Code:
2018-11-29 02:55:55.082091 7fb52259e700  0 mon.pve3@0(electing).data_health(0) update_stats avail 98% total 422GiB, used 4.55GiB, avail 417GiB
2018-11-29 02:56:55.082273 7fb52259e700  0 mon.pve3@0(electing).data_health(0) update_stats avail 98% total 422GiB, used 4.55GiB, avail 417GiB
2018-11-29 02:57:55.082474 7fb52259e700  0 mon.pve3@0(electing).data_health(0) update_stats avail 98% total 422GiB, used 4.55GiB, avail 417GiB
2018-11-29 02:58:55.082679 7fb52259e700  0 mon.pve3@0(electing).data_health(0) update_stats avail 98% total 422GiB, used 4.55GiB, avail 417GiB
2018-11-29 02:59:55.082917 7fb52259e700  0 mon.pve3@0(electing).data_health(0) update_stats avail 98% total 422GiB, used 4.55GiB, avail 417GiB
2018-11-29 03:00:41.601972 7fb51cd93700  0 -- 10.11.12.3:6789/0 >> 10.11.12.10:6789/0 conn(0x556769dd1000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 2 vs existing csq=1 existing_state=STATE_OPEN
2018-11-29 03:00:43.375164 7fb51c592700  0 -- 10.11.12.3:6789/0 >> 10.11.12.8:6789/0 conn(0x556769dcf800 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 2 vs existing csq=1 existing_state=STATE_OPEN
2018-11-29 03:00:55.083167 7fb52259e700  0 mon.pve3@0(electing).data_health(0) update_stats avail 98% total 422GiB, used 4.55GiB, avail 417GiB
2018-11-29 03:01:55.083471 7fb52259e700  0 mon.pve3@0(electing).data_health(0) update_stats avail 98% total 422GiB, used 4.55GiB, avail 417GiB
2018-11-29 03:02:55.083763 7fb52259e700  0 mon.pve3@0(electing).data_health(0) update_stats avail 98% total 422GiB, used 4.55GiB, avail 417GiB
2018-11-29 03:03:55.084037 7fb52259e700  0 mon.pve3@0(electing).data_health(0) update_stats avail 98% total 422GiB, used 4.55GiB, avail 417GiB
2018-11-29 03:04:55.084293 7fb52259e700  0 mon.pve3@0(electing).data_health(0) update_stats avail 98% total 422GiB, used 4.55GiB, avail 417GiB
2018-11-29 03:05:55.084587 7fb52259e700  0 mon.pve3@0(electing).data_health(0) update_stats avail 98% total 422GiB, used 4.55GiB, avail 417GiB
2018-11-29 03:06:55.084835 7fb52259e700  0 mon.pve3@0(electing).data_health(0) update_stats avail 98% total 422GiB, used 4.55GiB, avail 417GiB
2018-11-29 03:07:55.085090 7fb52259e700  0 mon.pve3@0(electing).data_health(0) update_stats avail 98% total 422GiB, used 4.55GiB, avail 417GiB
2018-11-29 03:08:55.085325 7fb52259e700  0 mon.pve3@0(electing).data_health(0) update_stats avail 98% total 422GiB, used 4.55GiB, avail 417GiB
2018-11-29 03:09:55.085564 7fb52259e700  0 mon.pve3@0(electing).data_health(0) update_stats avail 98% total 422GiB, used 4.55GiB, avail 417GiB
2018-11-29 03:10:55.085831 7fb52259e700  0 mon.pve3@0(electing).data_health(0) update_stats avail 98% total 422GiB, used 4.55GiB, avail 417GiB
2018-11-29 03:11:55.086056 7fb52259e700  0 mon.pve3@0(electing).data_health(0) update_stats avail 98% total 422GiB, used 4.55GiB, avail 417GiB
2018-11-29 03:12:55.086321 7fb52259e700  0 mon.pve3@0(electing).data_health(0) update_stats avail 98% total 422GiB, used 4.55GiB, avail 417GiB
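
Both mon logs above show the monitors apparently stuck in the "electing" state, i.e. they never seem to regain quorum over the new network. Even while "ceph -s" hangs, each monitor can still be queried locally through its admin socket; a sketch, using mon.sys8 from the first log:
Code:
# local query via the admin socket; does not need quorum
ceph daemon mon.sys8 mon_status
# equivalent, addressing the socket file directly
ceph --admin-daemon /var/run/ceph/ceph-mon.sys8.asok mon_status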
 
We run a cron job to check health.

After 5 minutes it gave up and sent an error report.

Perhaps the error message has a clue about why 'ceph -s' is not working.

Code:
Subject: Cron <root@fbcadmin> ssh pve3 "ceph -s | grep HEALTH_WARN > /dev/null && ceph -s"

2018-11-29 04:22:02.198029 7f1a21d5c700  0 monclient(hunting): authenticate timed out after 300
2018-11-29 04:22:02.198278 7f1a21d5c700  0 librados: client.admin authentication error (110) Connection timed out
[errno 110] error connecting to the cluster
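
The "authenticate timed out" from librados means the client never completed a session with any monitor. A quick way to check whether the monitor TCP port is even reachable over the new network (port 6789, matching the addresses in the mon logs above) is something like:
Code:
nc -zv 10.11.12.3 6789
nc -zv 10.11.12.8 6789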
 
This is not just a Ceph issue. NFS mounts are also broken.


These work:
- ping -f works fine between the NFS server and clients.
- ssh works both ways between NFS client and NFS server.
- OSD mounts.

These do not work:
- ceph -s
- NFS storage: trying to access an NFS mount hangs.



I tried restarting the NFS server and one of the clients. That did not fix the issue.

What changed when NFS and 'ceph -s' stopped working:
- changed NICs from Intel 10G to Mellanox ConnectX-4
- changed the switch from an RJ45 10G switch to a Quanta LB6M flashed to Brocade TurboIron

Any suggestions welcome.
 
Are you sure that MTU 9000 is set on your new switch?

Also, as you use active-backup bonding, are you sure that all active primary interfaces are on the same switch? (cat /proc/net/bonding/bond2)
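
A pattern where small packets (ping, ssh) pass but Ceph and NFS hang is typical of jumbo frames being dropped somewhere on the path. A rough way to test it is a large, unfragmentable ping between two Ceph nodes:
Code:
# 8972 = 9000 bytes MTU minus 28 bytes of IP + ICMP headers; -M do forbids fragmentation
ping -M do -s 8972 -c 3 10.11.12.3
# if this fails while a normal ping works, the switch is not passing jumbo frames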
 
To enable MTU 9000 (jumbo) on the Brocade:
Code:
telnet@quanta-1#configure terminal
telnet@quanta-1(config)#jumbo
Jumbo mode setting requires a reload to take effect!
telnet@quanta-1(config)#write memory
Write startup-config done.
telnet@quanta-1(config)#end
telnet@quanta-1#reload
Are you sure? (enter 'y' or 'n'): y
 
