[SOLVED] New node cannot connect to external Ceph cluster

SilkBC

New Member
Mar 28, 2024
Hello,

I just installed a new node (v8.3.4) and added it to my Proxmox cluster (nodes running v8.2.7), but for some reason it is not able to connect to my external Ceph cluster; the two Ceph storages I have defined just show up with grey question marks, and nothing I have tried so far will get them to connect. I have the networking and MTUs set identically to my other two hosts.

Here is the interfaces file from the new node:

Code:
auto lo
iface lo inet loopback

auto eno2
iface eno2 inet manual
#1GbE   

auto eno1
iface eno1 inet manual
#1GbE

auto ens1f0
iface ens1f0 inet manual
        mtu 9000
#10GbE

auto ens1f1
iface ens1f1 inet manual
        mtu 9000
#10GbE

auto bond0
iface bond0 inet manual
        bond-slaves eno1 eno2
        bond-miimon 100
        bond-mode active-backup
        bond-primary eno1
#Mgmt Network Bond interface

auto bond1
iface bond1 inet manual
        bond-slaves ens1f0 ens1f1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4
        mtu 9000
#VM Network Bond interface

auto vmbr0
iface vmbr0 inet static
        address 10.3.127.16/24
        gateway 10.3.127.1
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
#Management Network

auto vmbr1
iface vmbr1 inet manual
        bridge-ports bond1
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094
        mtu 9000
#VM Network

auto vmbr1.22
iface vmbr1.22 inet static
        address 10.22.0.16/24
        mtu 8972
#Storage Network

source /etc/network/interfaces.d/*

The vmbr1.22 VLAN interface is the connection to the storage VLAN where the Ceph cluster is located.

and here is the interfaces file from one of my nodes that can connect to the Ceph storage:

Code:
auto lo
iface lo inet loopback

auto eno8303
iface eno8303 inet manual
#1GbE

auto eno8403
iface eno8403 inet manual
#1GbE

auto eno12399np0
iface eno12399np0 inet manual
        mtu 9000
#10GbE

auto eno12409np1
iface eno12409np1 inet manual
        mtu 9000
#10GbE

auto bond1
iface bond1 inet manual
        bond-slaves eno12399np0 eno12409np1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4
        mtu 9000
#VM Network Bond interface

auto bond0
iface bond0 inet manual
        bond-slaves eno8303
        bond-miimon 100
        bond-mode active-backup
        bond-primary eno8303
#Mgmt Network Bond interface

auto vmbr0
iface vmbr0 inet static
        address 10.3.127.14/24
        gateway 10.3.127.1
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
#Management Network

auto vmbr1
iface vmbr1 inet manual
        bridge-ports bond1
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094
        mtu 9000
#VM Network

auto vmbr1.22
iface vmbr1.22 inet static
        address 10.22.0.14/24
        mtu 8972
#Storage Network

Except for the obvious things like interface names and IP addresses, I am not seeing any difference, but maybe another set of eyes or two can spot one?

I can, of course, ping from the vmbr1.22 interface IP to the 10.22.0.x IPs of the Ceph nodes, so there *is* connectivity to the Ceph cluster. I have verified with the network admin who manages the switches that the two ports the 10GbE interfaces are connected to are configured as an LACP bonded pair, and that the MTU is set to 9000 on both switch ports as well as on the LACP bond itself (he even sent me a screenshot of the config).
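
For completeness, this is roughly how I am checking the effective MTU at each layer on the node itself (physical NICs, bond, bridge, and the VLAN interface), in case the kernel is clamping one of them:

Code:
# Effective MTUs as the kernel sees them; the VLAN interface cannot end
# up larger than its parent bridge, and the bridge follows its ports
for dev in ens1f0 ens1f1 bond1 vmbr1 vmbr1.22; do
        echo -n "$dev: "; cat /sys/class/net/$dev/mtu
done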

I did also check the '/etc/ceph' directory, and the ceph.client.admin.keyring and ceph.conf files were missing, so I copied them over from one of the working nodes and then rebooted the new node, but that didn't fix anything; the Ceph storages still show up with a grey question mark and are inaccessible. journalctl shows timeouts (not surprisingly), but otherwise nothing at all helpful.
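
On that note, if I understand the storage docs correctly, the keyrings for externally-defined RBD storages are expected under /etc/pve/priv/ceph/<storage-id>.keyring (which lives in the cluster-wide /etc/pve, so in theory they should already be present on the new node). This is the sort of thing I am double-checking; the storage IDs below are the ones from my storage.cfg:

Code:
# Keyrings PVE expects for externally-defined RBD storages (cluster-wide path)
ls -l /etc/pve/priv/ceph/ds-ceph-standard.keyring \
      /etc/pve/priv/ceph/ds-ceph-performance.keyring
# The files I copied over into /etc/ceph from a working node
ls -l /etc/ceph/ceph.conf /etc/ceph/ceph.client.admin.keyring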

I am not sure what else to look at, or why the new host cannot connect to the Ceph cluster.

The only thing I can think of is that maybe the node is trying to reach the Ceph cluster through the management connection (which is only 1 Gbit and can also reach the storage VLAN). The idea of adding the vmbr1.22 VLAN interface was to give the nodes a direct connection to the storage VLAN, so any traffic destined for it *should* automatically go out that interface, as it is the more specific, directly connected route.
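
To rule that out, I assume a quick route lookup toward one of the Ceph monitor IPs should show which interface the node actually uses, something like:

Code:
# Should report "dev vmbr1.22" if the directly-connected storage VLAN
# route is being used rather than the default route via vmbr0
ip route get 10.22.0.41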

I can, of course, provide any other info you might need.



Your insight is appreciated :-)
 
Please post /etc/pve/ceph.conf and /etc/pve/storage.cfg so we get the full picture of the setup.

I would also take a look with tcpdump on the new node to check whether that host is getting any reply from the Ceph cluster when trying to access it.
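
Something along these lines, adjusting the interface and monitor IP to your setup, should show whether the monitors are answering at all on the messenger ports:

Code:
# Ceph monitors listen on 3300 (msgr2) and 6789 (legacy msgr1);
# -nn skips name/port resolution so the capture stays readable
tcpdump -nni vmbr1.22 'host 10.22.0.41 and (port 3300 or port 6789)'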
 
Have you verified that the large MTU works?
Code:
ping -M do -s 8972 {target host}
The -s parameter is set to 9000 minus the 28 bytes of ICMP and IPv4 overhead (20-byte IPv4 header + 8-byte ICMP header).
 
Please post /etc/pve/ceph.conf and /etc/pve/storage.cfg so we get the full picture of the setup.

I would also take a look with tcpdump on the new node to check whether that host is getting any reply from the Ceph cluster when trying to access it.
Here they are:

Code:
root@vhost06:~# cat /etc/pve/storage.cfg
dir: local
        path /var/lib/vz
        content vztmpl,iso
        shared 0

lvmthin: local-lvm
        thinpool data
        vgname pve
        content images,rootdir

rbd: ds-ceph-standard
        content images,rootdir
        krbd 1
        monhost 10.22.0.41 10.22.0.42 10.22.0.43
        pool pve_rbd-hdd
        username admin

rbd: ds-ceph-performance
        content images,rootdir
        krbd 1
        monhost 10.22.0.41 10.22.0.42 10.22.0.43
        pool pve_rbd-ssd
        username admin

pbs: PBS1-T2
        datastore Backup-1
        server 10.3.127.201
        content backup
        fingerprint 22:2e:1f:73:91:ac:e4:ce:e1:b9:e7:74:39:0e:7f:9c:1e:6f:87:b7:74:ca:7e:ce:fa:2c:88:2a:42:c7:4a:b1
        namespace datacenter-CLGX-Tier2
        prune-backups keep-all=1
        username pbs-1@pbs

pbs: PBS1-T1
        datastore Backup-1
        server 10.3.127.201
        content backup
        fingerprint 22:2e:1f:73:91:ac:e4:ce:e1:b9:e7:74:39:0e:7f:9c:1e:6f:87:b7:74:ca:7e:ce:fa:2c:88:2a:42:c7:4a:b1
        namespace datacenter-CLGX-Tier1
        prune-backups keep-all=1
        username pbs-1@pbs

esxi: CLGX-VHOST01
        server 10.3.127.11
        username root
        content import
        skip-cert-verification 1

esxi: CLGX-VHOST02
        server 10.3.127.12
        username root
        content import
        skip-cert-verification 1

esxi: CLGX-VHOST03
        server 10.3.127.13
        username root
        content import
        skip-cert-verification 1

Code:
root@vhost06:~# cat /etc/ceph/ceph.conf
# minimal ceph.conf for 474264fe-b00e-11ee-b586-ac1f6b0ff21a
[global]
        fsid = 474264fe-b00e-11ee-b586-ac1f6b0ff21a
        mon_host = [v2:10.22.0.41:3300/0,v1:10.22.0.41:6789/0] [v2:10.22.0.42:3300/0,v1:10.22.0.42:6789/0] [v2:10.22.0.43:3300/0,v1:10.22.0.43:6789/0] [v2:10.22.0.44:3300/0,v1:10.22.0.44:6789/0] [v2:10.22.0.45:3300/0,v1:10.22.0.45:6789/0]

Doing a tcpdump on the vmbr1.22 interface, I see two-way traffic between the node and the Ceph cluster, but it actually looks the same as the capture I see from a working node.
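
One thing I may try next is filtering the capture on packet size, to see whether any frames larger than a standard MTU actually make it across in either direction; I believe a filter like this should do it:

Code:
# Only show packets larger than a standard full-size frame; if jumbo
# frames are being dropped somewhere along the path, little or nothing
# should match here
tcpdump -nni vmbr1.22 'host 10.22.0.41 and greater 1600'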
 
Have you verified that the large MTU works?
Code:
ping -M do -s 8972 {target host}
The -s parameter is set to 9000 minus the 28 bytes of ICMP and IPv4 overhead (20-byte IPv4 header + 8-byte ICMP header).
Huh, so this is odd. Using the 8972-byte size says "message too long", HOWEVER I get the exact same message from a working node, even if I use -I to specify the node's vmbr1.22 IP to make sure the packet is going out that interface.

But we can see that an MTU of 8972 is working on the pre-existing nodes in my Proxmox cluster. (In fact, when I first set this cluster up I initially set the MTU to 9000 on the vmbr1.22 interface, and the storage wouldn't connect until I changed it to 8972; the new node, of course, still cannot connect with those settings.)

Just for shiggles, I changed the MTU on the vmbr1.22 VLAN interface to 1472 and the Ceph storage *does* connect now, but I can't use that MTU because I am using jumbo frames for the storage everywhere else.

So I guess the question now is: is it the 10G cards that can't support jumbo frames, or is there a misconfiguration on the switch ports? (The guy who manages the switches sent me the config and it looks correct, including the MTU set to 9000, but I don't have access to the switches, so unfortunately I cannot confirm things for myself.)

Here are the 10G cards from lspci:

Code:
81:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
81:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)

A Google search confirms that jumbo frames are supported on this card. I will go back to the guy who manages the switches and get him to confirm again; maybe he didn't apply the settings or something :-(
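
To rule the cards themselves out, I believe recent kernels report the MTU range a driver will accept in the detailed link output, so something like this should show whether the ixgbe ports can even be set above 1500:

Code:
# Look for "minmtu"/"maxmtu" in the detailed link output; if maxmtu is
# 9000 or more, the card itself is not the limiting factor
ip -d link show ens1f0 | grep -oE '(min|max)mtu [0-9]+'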
 
As aaron pointed out, check the MTU. I've just noticed that you've set mtu 8972 on the NICs for the 10.22.0.0/24 network, instead of the more typical 9000. In your case, ping -M do -s 8972 {target host} won't work, as you have to subtract 28 from the interface MTU; ping -M do -s 8944 {target host} should work. Check with the monitors and every host that has OSDs as {target host}. Also check the MTU set on the Ceph hosts; it should match the one set on the PVE hosts.
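
To check them all in one go, a small loop like this (using the monitor IPs from your ceph.conf, plus any OSD hosts) should do, assuming the interface MTU stays at 8972:

Code:
# 8944 = 8972 (interface MTU) - 20 (IPv4 header) - 8 (ICMP header)
for host in 10.22.0.41 10.22.0.42 10.22.0.43 10.22.0.44 10.22.0.45; do
        echo -n "$host: "
        ping -c1 -W1 -M do -s 8944 "$host" >/dev/null 2>&1 && echo OK || echo FAIL
done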
 
As aaron pointed out, check the MTU. I've just noticed that you've set mtu 8972 on the NICs for the 10.22.0.0/24 network, instead of the more typical 9000.
Actually, I did set the physical NICs, the bond they are in, and the bridge the bond is connected to, all to MTU 9000. It is just the VLAN interface (vmbr1.22) that I set to 8972, as that is what I had to use on the working nodes for the storage to connect (I assumed the VLAN encapsulation was eating into the MTU, though a VLAN tag only adds 4 bytes, so in hindsight that reasoning was probably off).

In any case, it looks like jumbo frames are not working, as I do not get a ping reply until I drop to a payload size of 1472 bytes (which, with the 28 bytes of overhead, corresponds to a standard 1500-byte MTU). I am going to confirm again with the guy who manages the switches that the config is actually applied, but other than that I am not sure why else jumbo frames would not be working.
 
Just as an update: on a whim later yesterday afternoon, I decided to try setting the MTU on the vmbr1.22 VLAN interface back to 8972, and the Ceph datastores remained connected and responsive. I left it for ten minutes and then rebooted, and the datastores were still connected and responsive.

I have no idea what changed; I *suspect* the guy who manages the switches realized an error when I asked him to check again, but that is pure speculation on my part. I did not change anything at all on my end. The only other thing it could be is that getting the datastores connected by using the 1472 MTU somehow made some change somewhere on the node that then allowed the jumbo frames to be recognized or work properly, but that is even more speculative than my thought about a mistake being corrected on the switches.

I will probably never know what the "fix" really was :-(