Ceph Cluster, can't create 2nd/3rd Monitor/Manager

nicedevil

Hey guys,

I have set up my new Thunderbolt 4 cluster; the nodes can reach each other over the Thunderbolt 4 cables (tested e.g. SSH from each node to the others, plus iperf, ping, etc.).

Unfortunately I'm struggling to get Ceph up and running; I always run into timeouts when creating monitors or managers. Is there something I can do to get rid of the timeout errors? This one, for example, occurred while creating the 3rd monitor.

[screenshot: timeout error while creating the 3rd monitor]
 
Please post your /etc/pve/ceph.conf and your /etc/network/interfaces files.

 
@jsterr
OK, I set up everything from scratch and have my Thunderbolt network up and running. Unfortunately it all leads to the same error.
I installed it with the following commands:

pve01 = 10.0.0.81
pve02 = 10.0.0.82
pve03 = 10.0.0.83

Bash:
pveceph install --repository no-subscription --version reef
pveceph init --network 10.0.0.81/29
pveceph mon create --mon-address 10.0.0.81

The first and last steps were done on all 3 nodes with their respective IP addresses, of course.
The 2nd step was only done once, on the first node.
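
For completeness, the same sequence on the other two nodes would then look like this (a sketch based on the IPs above; the init step runs only once, on pve01):

Bash:
# on pve02
pveceph install --repository no-subscription --version reef
pveceph mon create --mon-address 10.0.0.82

# on pve03
pveceph install --repository no-subscription --version reef
pveceph mon create --mon-address 10.0.0.83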

This is my ceph.conf
Bash:
root@pve01:~# cat /etc/pve/ceph.conf
[global]
         auth_client_required = cephx
         auth_cluster_required = cephx
         auth_service_required = cephx
         cluster_network = 10.0.0.81/29
         fsid = bfe7330e-54a8-47e7-9fff-9a0fe263d43b
         mon_allow_pool_delete = true
         mon_host = 10.0.0.81 10.0.0.82 10.0.0.83
         ms_bind_ipv4 = true
         ms_bind_ipv6 = false
         osd_pool_default_min_size = 2
         osd_pool_default_size = 3
         public_network = 10.0.0.81/29

[client]
         keyring = /etc/pve/priv/$cluster.$name.keyring

[mon.pve01]
         public_addr = 10.0.0.81

[mon.pve02]
         public_addr = 10.0.0.82

[mon.pve03]
         public_addr = 10.0.0.83

interfaces:
Bash:
root@pve01:~# cat /etc/network/interfaces
auto lo
iface lo inet loopback

auto lo:0
iface lo:0 inet static
        address 10.0.0.81/32

auto en05
iface en05 inet static
        mtu 4000

auto en06
iface en06 inet static
        mtu 4000

iface enp86s0 inet manual

auto vmbr0
iface vmbr0 inet static
        address 192.168.21.1/29
        gateway 192.168.21.6
        bridge-ports enp86s0
        bridge-stp off
        bridge-fd 0

source /etc/network/interfaces.d/*
 
Hello @nicedevil

I'm sorry, I'm not familiar with Thunderbolt technology.

Code:
auto lo:0
iface lo:0 inet static
        address 10.0.0.81/32

Why did you put a /32 there? You need to use the correct CIDR mask, in your case /29. Are you sure this IP should stay on your local loopback device? Is this Thunderbolt-related?
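
One way to check what actually ended up on the loopback, and which interface traffic to a peer would use (standard iproute2 commands; the target IP is just an example from this thread):

Bash:
ip -4 addr show dev lo      # address and mask configured on the loopback
ip route get 10.0.0.82      # route/interface used to reach a peer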
 
OK. Did you follow https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server#Routed_Setup_.28with_Fallback.29? There are lots of things to consider when using FRR.
I followed the guide from @scyto, but will now try to add the parts he doesn't mention to my config :)

Just for later comparison, this is my current frr.conf:

Bash:
root@pve03:~# cat /etc/frr/frr.conf
frr version 9.1
frr defaults traditional
hostname pve03
no ipv6 forwarding
service integrated-vtysh-config
!
interface en05
 ip router openfabric 1
exit
!
interface en06
 ip router openfabric 1
exit
!
interface lo
 ip router openfabric 1
 openfabric passive
exit
!
router openfabric 1
 net 49.0000.0000.0003.00
exit
!
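
For reference, changes to /etc/frr/frr.conf only take effect after a reload; a minimal sketch of the reload-and-verify cycle:

Bash:
systemctl restart frr.service
vtysh -c "show running-config"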
 
OK, I adjusted everything; same problem. It is now set up like the documentation behind your Proxmox link. Here, for comparison, the new frr file:

Code:
root@pve03:~# cat /etc/frr/frr.conf
# default to using syslog. /etc/rsyslog.d/45-frr.conf places the log in
# /var/log/frr/frr.log
#
# Note:
# FRR's configuration shell, vtysh, dynamically edits the live, in-memory
# configuration while FRR is running. When instructed, vtysh will persist the
# live configuration to this file, overwriting its contents. If you want to
# avoid this, you can edit this file manually before starting FRR, or instruct
# vtysh to write configuration to a different file.

frr defaults traditional
hostname pve03
log syslog informational
ip forwarding
no ipv6 forwarding
service integrated-vtysh-config
!
interface lo
 ip address 10.0.0.83/32
 ip router openfabric 1
 openfabric passive
!
interface en05
 ip router openfabric 1
 openfabric csnp-interval 2
 openfabric hello-interval 1
 openfabric hello-multiplier 2
!
interface en06
 ip router openfabric 1
 openfabric csnp-interval 2
 openfabric hello-interval 1
 openfabric hello-multiplier 2
!
line vty
!
router openfabric 1
 net 49.0000.0000.0003.00
 lsp-gen-interval 1
 max-lsp-lifetime 600
 lsp-refresh-interval 180

Maybe I'm just an idiot, but not all of those settings show up in vtysh's "show running-config" (ip forwarding and line vty are missing from the output). Yes, I restarted FRR and even the whole server, of course.

Bash:
root@pve03:~# vtysh -c "show running-config"
Building configuration...

Current configuration:
!
frr version 9.1
frr defaults traditional
hostname pve03
log syslog informational
no ipv6 forwarding
service integrated-vtysh-config
!
interface lo
 ip address 10.0.0.83/32
 ip router openfabric 1
 openfabric passive
exit
!
interface en05
 ip router openfabric 1
 openfabric csnp-interval 2
 openfabric hello-interval 1
 openfabric hello-multiplier 2
exit
!
interface en06
 ip router openfabric 1
 openfabric csnp-interval 2
 openfabric hello-interval 1
 openfabric hello-multiplier 2
exit
!
router openfabric 1
 net 49.0000.0000.0003.00
 lsp-gen-interval 1
 max-lsp-lifetime 600
 lsp-refresh-interval 180
exit
!
end
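
Independent of what vtysh persists, the decisive check is whether the OpenFabric adjacency is actually up and the peer routes are learned; a quick sketch (assuming fabricd is enabled, as in the config above):

Bash:
vtysh -c "show openfabric topology"   # should list paths to the other two nodes
ip route | grep 10.0.0                # learned routes to the peer loopbacks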

The first monitor creation looks like this:

Bash:
root@pve01:~# pveceph mon create
unable to get monitor info from DNS SRV with service name: ceph-mon
rados_connect failed - No such file or directory
creating new monitor keyring
creating /etc/pve/priv/ceph.mon.keyring
importing contents of /etc/pve/priv/ceph.client.admin.keyring into /etc/pve/priv/ceph.mon.keyring
monmaptool: monmap file /tmp/monmap
monmaptool: generated fsid def02e58-26ee-4b69-87c5-369e31ad953d
setting min_mon_release = pacific
epoch 0
fsid def02e58-26ee-4b69-87c5-369e31ad953d
last_changed 2024-01-17T22:47:35.008724+0100
created 2024-01-17T22:47:35.008724+0100
min_mon_release 16 (pacific)
election_strategy: 1
0: [v2:10.0.0.81:3300/0,v1:10.0.0.81:6789/0] mon.pve01
monmaptool: writing epoch 0 to /tmp/monmap (1 monitors)
created the first monitor, assume it's safe to disable insecure global ID reclaim for new setup
Created symlink /etc/systemd/system/ceph-mon.target.wants/ceph-mon@pve01.service -> /lib/systemd/system/ceph-mon@.service.
creating manager directory '/var/lib/ceph/mgr/ceph-pve01'
creating keys for 'mgr.pve01'
setting owner for directory
enabling service 'ceph-mgr@pve01.service'
Created symlink /etc/systemd/system/ceph-mgr.target.wants/ceph-mgr@pve01.service -> /lib/systemd/system/ceph-mgr@.service.
starting service 'ceph-mgr@pve01.service'

And here the content of my new /etc/network/interfaces file:

Bash:
root@pve03:~# cat /etc/network/interfaces
auto lo
iface lo inet loopback

#auto lo:0
#iface lo:0 inet static
#        address 10.0.0.83/29

auto en05
iface en05 inet static
        mtu 4000

auto en06
iface en06 inet static
        mtu 4000

iface enp86s0 inet manual

auto vmbr0
iface vmbr0 inet static
        address 192.168.21.3/29
        gateway 192.168.21.6
        bridge-ports enp86s0
        bridge-stp off
        bridge-fd 0


source /etc/network/interfaces.d/*

post-up /usr/bin/systemctl restart frr.service

The Ceph WebGUI overview shows 3 warnings (I guess the last one can be ignored, since nothing has been created yet):

[screenshot: Ceph WebGUI overview showing 3 health warnings]

EDIT:

OK, NTP was blocked by the firewall; I enabled it. Will report back if there is still an issue :)
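
For anyone checking the same thing, the time sync status can be verified on each node with standard tools (Proxmox VE ships chrony by default):

Bash:
timedatectl                 # "System clock synchronized: yes" is what you want
chronyc tracking            # current offset against the selected NTP source
chronyc sources             # reachability of the configured time servers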
 
Does the network work as expected? Running a 3-node cluster via Thunderbolt is not something we have seen a lot, and we haven't tried it ourselves.

Verify that the large MTU works as expected between the nodes (4000 minus 20 bytes IP header and 8 bytes ICMP header = 3972 bytes of ICMP payload):
Code:
ping -M do -s 3972 {target host}

Running pveceph mon create on the other nodes would be interesting to see at which step it fails.
 
Does the network work as expected? Running a 3-node cluster via Thunderbolt is not something we have seen a lot, and we haven't tried it ourselves.
I just edited my post above; right now time sync is not working :) Will report back.
I did ping/SSH tests from each node to the others, working on all 3 nodes, but this was also the case with @scyto's documentation.
 
Problem solved :)

NTP was the issue the whole day... grrrrr. All monitors up and running, all managers up and running!
 
I did ping/SSH tests from each node to the others,
These commands tend not to utilize the full MTU unless you transfer a lot over SSH. The more deterministic test of whether the MTU works is to use ping with the size parameter.
 
These commands tend not to utilize the full MTU unless you transfer a lot over SSH. The more deterministic test of whether the MTU works is to use ping with the size parameter.
Your command succeeded as well; this is the output:

Bash:
root@pve01:~# ping -M do -s 3972 10.0.0.82
PING 10.0.0.82 (10.0.0.82) 3972(4000) bytes of data.
3980 bytes from 10.0.0.82: icmp_seq=1 ttl=64 time=0.527 ms
3980 bytes from 10.0.0.82: icmp_seq=2 ttl=64 time=0.483 ms
3980 bytes from 10.0.0.82: icmp_seq=3 ttl=64 time=0.546 ms
3980 bytes from 10.0.0.82: icmp_seq=4 ttl=64 time=0.694 ms
3980 bytes from 10.0.0.82: icmp_seq=5 ttl=64 time=0.829 ms
3980 bytes from 10.0.0.82: icmp_seq=6 ttl=64 time=0.684 ms
3980 bytes from 10.0.0.82: icmp_seq=7 ttl=64 time=0.765 ms
 
Your command succeeded as well; this is the output:
Great. And with an MTU of 4000, increasing the size by just 1 should cause the ping to fail.
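
For illustration, that sanity check would look like this (the exact error wording may vary):

Bash:
ping -M do -s 3973 10.0.0.82
# expected to fail, e.g.: ping: local error: message too long, mtu=4000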

NTP was the issue the whole day... grrrrr. All monitors up and running, all managers up and running!
How far off was the time between nodes?
 
Great. And with an MTU of 4000, increasing the size by just 1 should cause the ping to fail.


How far off was the time between nodes?
So I guess you want to tell me I should use another MTU? 4000 is also the value from the guide.

The time differed from my local time by over 8 hours, but I didn't check how large the difference between the 3 nodes was.
 
So I guess you want to tell me I should use another MTU? 4000 is also the value from the guide.
Not really; how large the MTU can be depends on the NICs used. Pinging with a size that (after the IP + ICMP overhead) matches the configured MTU is the test to verify it. Increasing it by one byte is the sanity check, as it should not work anymore.

How big the MTU can be on these Thunderbolt NICs is something you could figure out by configuring a larger MTU on them and then pinging with the packet size increased accordingly.
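
A rough probing sketch (sizes and the target IP are just examples; the larger MTU must already be configured on both ends):

Bash:
# payload + 28 bytes (20 IP + 8 ICMP) = on-the-wire packet size
for size in 3972 4072 8972; do
    if ping -M do -c 1 -W 1 -s "$size" 10.0.0.82 >/dev/null 2>&1; then
        echo "payload $size: ok"
    else
        echo "payload $size: too big (or host unreachable)"
    fi
done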

The big issue you want to avoid is MTU mismatches across the network stack.

Those situations are difficult to troubleshoot because small packets will usually get through without problems (a regular ping, for example), but larger packets won't anymore. The resulting overall picture is a rather weird one, and the only way to know is to check that the MTU is working.
In your setup it is rather simple, but as soon as switches are in between, it is possible that they don't support the large MTU. I have had support tickets where the large MTU was set in the switch config but not stored in the boot config. Once the switch rebooted, the MTU reverted to the default 1500 and Ceph was in a half-broken state.
 
Problem solved :)

NTP was the issue the whole day... grrrrr. All monitors up and running, all managers up and running!
Hehe, told you in my first reply. I'm glad you got it working! ✅
 
