Ceph Cluster, can't create 2nd/3rd Monitor/Manager

nicedevil

Hey guys,

I have set up my new Thunderbolt 4 cluster; the nodes can reach each other over the Thunderbolt 4 cables (tested e.g. SSH from each node to the others, plus iperf, ping, etc.).

Unfortunately I'm struggling to get Ceph up and running; I always run into timeouts when creating monitors or managers. Is there something I can do to get rid of the timeout errors? This one, for example, occurred while creating the 3rd monitor.

[screenshot: timeout error while creating the 3rd monitor]
 
Please post your /etc/pve/ceph.conf and your /etc/network/interfaces files.

 
@jsterr
OK, I set up everything from scratch and have my Thunderbolt network up and running. Unfortunately it all leads to the same error.
I installed it with the following commands:

pve01 = 10.0.0.81
pve02 = 10.0.0.82
pve03 = 10.0.0.83

Bash:
pveceph install --repository no-subscription --version reef
pveceph init --network 10.0.0.81/29
pveceph mon create --mon-address 10.0.0.81

The first and last steps were done on all 3 nodes with their respective IP addresses, of course.
The 2nd step was only done once, on the first node.
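
For completeness, the same sequence on the other two nodes would then look like this (a sketch based on the IPs above; the init step runs only once, on pve01):

Bash:
# on pve02
pveceph install --repository no-subscription --version reef
pveceph mon create --mon-address 10.0.0.82

# on pve03
pveceph install --repository no-subscription --version reef
pveceph mon create --mon-address 10.0.0.83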

This is my ceph.conf
Bash:
root@pve01:~# cat /etc/pve/ceph.conf
[global]
         auth_client_required = cephx
         auth_cluster_required = cephx
         auth_service_required = cephx
         cluster_network = 10.0.0.81/29
         fsid = bfe7330e-54a8-47e7-9fff-9a0fe263d43b
         mon_allow_pool_delete = true
         mon_host = 10.0.0.81 10.0.0.82 10.0.0.83
         ms_bind_ipv4 = true
         ms_bind_ipv6 = false
         osd_pool_default_min_size = 2
         osd_pool_default_size = 3
         public_network = 10.0.0.81/29

[client]
         keyring = /etc/pve/priv/$cluster.$name.keyring

[mon.pve01]
         public_addr = 10.0.0.81

[mon.pve02]
         public_addr = 10.0.0.82

[mon.pve03]
         public_addr = 10.0.0.83

interfaces:
Bash:
root@pve01:~# cat /etc/network/interfaces
auto lo
iface lo inet loopback

auto lo:0
iface lo:0 inet static
        address 10.0.0.81/32

auto en05
iface en05 inet static
        mtu 4000

auto en06
iface en06 inet static
        mtu 4000

iface enp86s0 inet manual

auto vmbr0
iface vmbr0 inet static
        address 192.168.21.1/29
        gateway 192.168.21.6
        bridge-ports enp86s0
        bridge-stp off
        bridge-fd 0

source /etc/network/interfaces.d/*
 
Hello @nicedevil

I'm sorry, I'm not familiar with Thunderbolt technology.

Code:
auto lo:0
iface lo:0 inet static
        address 10.0.0.81/32

Why did you put a /32 there? You need to use the correct CIDR mask, in your case /29. Are you sure this IP should stay on your local loopback device? Is this Thunderbolt-related?
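
One way to check what actually ended up on the loopback, and which interface traffic to a peer would use (standard iproute2 commands; the target IP is just an example from this thread):

Bash:
ip -4 addr show dev lo      # address and mask configured on the loopback
ip route get 10.0.0.82      # route/interface used to reach a peer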
 
OK. Did you follow https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server#Routed_Setup_.28with_Fallback.29? There are lots of things to consider when using FRR.
I followed the guide from @scyto, but will now try to add the parts he doesn't mention to my config :)

Just for later comparison, this is my current frr.conf:

Bash:
root@pve03:~# cat /etc/frr/frr.conf
frr version 9.1
frr defaults traditional
hostname pve03
no ipv6 forwarding
service integrated-vtysh-config
!
interface en05
 ip router openfabric 1
exit
!
interface en06
 ip router openfabric 1
exit
!
interface lo
 ip router openfabric 1
 openfabric passive
exit
!
router openfabric 1
 net 49.0000.0000.0003.00
exit
!
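
For reference, changes to /etc/frr/frr.conf only take effect after a reload; a minimal sketch of the reload-and-verify cycle:

Bash:
systemctl restart frr.service
vtysh -c "show running-config"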
 
OK, I adjusted everything; same problem. It is now set up like the documentation behind your Proxmox link. Here, for comparison, the new frr file:

Code:
root@pve03:~# cat /etc/frr/frr.conf
# default to using syslog. /etc/rsyslog.d/45-frr.conf places the log in
# /var/log/frr/frr.log
#
# Note:
# FRR's configuration shell, vtysh, dynamically edits the live, in-memory
# configuration while FRR is running. When instructed, vtysh will persist the
# live configuration to this file, overwriting its contents. If you want to
# avoid this, you can edit this file manually before starting FRR, or instruct
# vtysh to write configuration to a different file.

frr defaults traditional
hostname pve03
log syslog informational
ip forwarding
no ipv6 forwarding
service integrated-vtysh-config
!
interface lo
 ip address 10.0.0.83/32
 ip router openfabric 1
 openfabric passive
!
interface en05
 ip router openfabric 1
 openfabric csnp-interval 2
 openfabric hello-interval 1
 openfabric hello-multiplier 2
!
interface en06
 ip router openfabric 1
 openfabric csnp-interval 2
 openfabric hello-interval 1
 openfabric hello-multiplier 2
!
line vty
!
router openfabric 1
 net 49.0000.0000.0003.00
 lsp-gen-interval 1
 max-lsp-lifetime 600
 lsp-refresh-interval 180

Maybe I'm just an idiot, but not all of those settings show up in vtysh's "show running-config" (ip forwarding and line vty are missing from the output). Yes, I restarted FRR and even the whole server, of course.

Bash:
root@pve03:~# vtysh -c "show running-config"
Building configuration...

Current configuration:
!
frr version 9.1
frr defaults traditional
hostname pve03
log syslog informational
no ipv6 forwarding
service integrated-vtysh-config
!
interface lo
 ip address 10.0.0.83/32
 ip router openfabric 1
 openfabric passive
exit
!
interface en05
 ip router openfabric 1
 openfabric csnp-interval 2
 openfabric hello-interval 1
 openfabric hello-multiplier 2
exit
!
interface en06
 ip router openfabric 1
 openfabric csnp-interval 2
 openfabric hello-interval 1
 openfabric hello-multiplier 2
exit
!
router openfabric 1
 net 49.0000.0000.0003.00
 lsp-gen-interval 1
 max-lsp-lifetime 600
 lsp-refresh-interval 180
exit
!
end
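
Independent of what vtysh persists, the decisive check is whether the OpenFabric adjacency is actually up and the peer routes are learned; a quick sketch (assuming fabricd is enabled, as in the config above):

Bash:
vtysh -c "show openfabric topology"   # should list paths to the other two nodes
ip route | grep 10.0.0                # learned routes to the peer loopbacks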

The first monitor creation looks like this:

Bash:
root@pve01:~# pveceph mon create
unable to get monitor info from DNS SRV with service name: ceph-mon
rados_connect failed - No such file or directory
creating new monitor keyring
creating /etc/pve/priv/ceph.mon.keyring
importing contents of /etc/pve/priv/ceph.client.admin.keyring into /etc/pve/priv/ceph.mon.keyring
monmaptool: monmap file /tmp/monmap
monmaptool: generated fsid def02e58-26ee-4b69-87c5-369e31ad953d
setting min_mon_release = pacific
epoch 0
fsid def02e58-26ee-4b69-87c5-369e31ad953d
last_changed 2024-01-17T22:47:35.008724+0100
created 2024-01-17T22:47:35.008724+0100
min_mon_release 16 (pacific)
election_strategy: 1
0: [v2:10.0.0.81:3300/0,v1:10.0.0.81:6789/0] mon.pve01
monmaptool: writing epoch 0 to /tmp/monmap (1 monitors)
created the first monitor, assume it's safe to disable insecure global ID reclaim for new setup
Created symlink /etc/systemd/system/ceph-mon.target.wants/ceph-mon@pve01.service -> /lib/systemd/system/ceph-mon@.service.
creating manager directory '/var/lib/ceph/mgr/ceph-pve01'
creating keys for 'mgr.pve01'
setting owner for directory
enabling service 'ceph-mgr@pve01.service'
Created symlink /etc/systemd/system/ceph-mgr.target.wants/ceph-mgr@pve01.service -> /lib/systemd/system/ceph-mgr@.service.
starting service 'ceph-mgr@pve01.service'

And here the content of my new /etc/network/interfaces file:

Bash:
root@pve03:~# cat /etc/network/interfaces
auto lo
iface lo inet loopback

#auto lo:0
#iface lo:0 inet static
#        address 10.0.0.83/29

auto en05
iface en05 inet static
        mtu 4000

auto en06
iface en06 inet static
        mtu 4000

iface enp86s0 inet manual

auto vmbr0
iface vmbr0 inet static
        address 192.168.21.3/29
        gateway 192.168.21.6
        bridge-ports enp86s0
        bridge-stp off
        bridge-fd 0


source /etc/network/interfaces.d/*

post-up /usr/bin/systemctl restart frr.service

The Ceph WebGUI overview shows 3 warnings (I guess the last one can be ignored, since nothing has been created yet):

[screenshot: Ceph WebGUI overview showing 3 health warnings]

EDIT:

OK, NTP was blocked by the firewall; I enabled it. Will report back if there is still an issue :)
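
For anyone checking the same thing, the time sync status can be verified on each node with standard tools (Proxmox VE ships chrony by default):

Bash:
timedatectl                 # "System clock synchronized: yes" is what you want
chronyc tracking            # current offset against the selected NTP source
chronyc sources             # reachability of the configured time servers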
 
Does the network work as expected? Running a 3-node cluster via Thunderbolt is not something we have seen a lot, and we haven't tried it ourselves.

Verify that the large MTU works as expected between the nodes (4000 minus 20 bytes IP header and 8 bytes ICMP header = 3972 bytes of ICMP payload):
Code:
ping -M do -s 3972 {target host}

Running pveceph mon create on the other nodes would be interesting to see at which step it fails.
 
Does the network work as expected? Running a 3-node cluster via Thunderbolt is not something we have seen a lot, and we haven't tried it ourselves.
I just edited my post above; right now time sync is not working :) Will report back.
I did ping/SSH tests from each node to the others, working on all 3 nodes, but this was also the case with @scyto's documentation.
 
Problem solved :)

NTP was the issue the whole day... grrrrr. All monitors up and running, all managers up and running!
 
I did ping/SSH tests from each node to the others,
These commands tend not to utilize the full MTU unless you transfer a lot over SSH. The more deterministic test of whether the MTU works is to use ping with the size parameter.
 
These commands tend not to utilize the full MTU unless you transfer a lot over SSH. The more deterministic test of whether the MTU works is to use ping with the size parameter.
Your command succeeded as well; this is the output:

Bash:
root@pve01:~# ping -M do -s 3972 10.0.0.82
PING 10.0.0.82 (10.0.0.82) 3972(4000) bytes of data.
3980 bytes from 10.0.0.82: icmp_seq=1 ttl=64 time=0.527 ms
3980 bytes from 10.0.0.82: icmp_seq=2 ttl=64 time=0.483 ms
3980 bytes from 10.0.0.82: icmp_seq=3 ttl=64 time=0.546 ms
3980 bytes from 10.0.0.82: icmp_seq=4 ttl=64 time=0.694 ms
3980 bytes from 10.0.0.82: icmp_seq=5 ttl=64 time=0.829 ms
3980 bytes from 10.0.0.82: icmp_seq=6 ttl=64 time=0.684 ms
3980 bytes from 10.0.0.82: icmp_seq=7 ttl=64 time=0.765 ms
 
Your command succeeded as well; this is the output:
Great. And with an MTU of 4000, increasing the size by just 1 should cause the ping to fail.
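
For illustration, that sanity check would look like this (the exact error wording may vary):

Bash:
ping -M do -s 3973 10.0.0.82
# expected to fail, e.g.: ping: local error: message too long, mtu=4000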

NTP was the issue the whole day... grrrrr. All monitors up and running, all managers up and running!
How far off was the time between nodes?
 
Great. And with an MTU of 4000, increasing the size by just 1 should cause the ping to fail.


How far off was the time between nodes?
So I guess you want to tell me I should use another MTU? 4000 is also the value from the guide.

The time differed from my local time by over 8 hours, but I didn't check how large the difference between the 3 nodes was.
 
So I guess you want to tell me I should use another MTU? 4000 is also the value from the guide.
Not really; how large the MTU can be depends on the NICs used. Pinging with a size that (after the IP + ICMP overhead) matches the configured MTU is the test to verify it. Increasing it by one byte is the sanity check, as it should not work anymore.

How big the MTU can be on these Thunderbolt NICs is something you could figure out by configuring a larger MTU on them and then pinging with the packet size increased accordingly.
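
A rough probing sketch (sizes and the target IP are just examples; the larger MTU must already be configured on both ends):

Bash:
# payload + 28 bytes (20 IP + 8 ICMP) = on-the-wire packet size
for size in 3972 4072 8972; do
    if ping -M do -c 1 -W 1 -s "$size" 10.0.0.82 >/dev/null 2>&1; then
        echo "payload $size: ok"
    else
        echo "payload $size: too big (or host unreachable)"
    fi
done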

The big issue you want to avoid is MTU mismatches across the network stack.

Those situations are difficult to troubleshoot because small packets will usually get through without problems (a regular ping, for example), but larger packets won't anymore. The resulting overall picture is a rather weird one, and the only way to know is to check that the MTU is working.
In your setup it is rather simple, but as soon as switches are in between, it is possible that they don't support the large MTU. I have had support tickets where the large MTU was set in the switch config but not stored in the boot config. Once the switch rebooted, the MTU reverted to the default 1500 and Ceph was in a half-broken state.
 
Problem solved :)

NTP was the issue the whole day... grrrrr. All monitors up and running, all managers up and running!
Hehe, told you in my first reply. I'm glad you got it working! ✅
 
