HomeLab Ceph Testing and Timeout (500)

frankyyy
Nov 17, 2025
I have reconfigured a small homelab from ZFS to Ceph, just out of interest, running a handful of services.
All has been running pretty smoothly over the last few days.
I have 3 nodes, each with a Samsung 980 Pro 2TB (50 GB for root etc., with the remaining ~1.8 TB used for Ceph). I know, not ideal, but OK for this purpose.
Thunderbolt networking runs between the nodes: two TB3 ports in each, each connected to the next.

management: 192.168.20.0/24
tb0/tb1 on each node: 192.168.245.x/30
I have set up OpenFabric via the Proxmox GUI on 192.168.248.0/24. Each node is able to communicate with the others. Overall I'm seeing ~25 Gbit/s between nodes with iperf3.
Config based on this: https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server#Routed_Setup_(with_Fallback)
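For anyone following along, a quick way to sanity-check the mesh from one node is to ping the other OpenFabric loopbacks and confirm the routes are installed (addresses and node names as per the routes further down):

Code:
# from proxmox-nuc01: the other two loopbacks should answer via tb0/tb1
ping -c 3 192.168.248.13
ping -c 3 192.168.248.14

# confirm the kernel actually has the OpenFabric routes installed
ip route show proto openfabric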

I have one issue I haven't been able to resolve: when I click either Host > Ceph > OSD > Manage Global Flags or Datacenter > Ceph, I see the spinning wheel followed by a "got timeout (500)" error.

I have disabled the firewall on the datacenter and the nodes, to no avail. For some reason this feels OpenFabric-related, but I am just guessing.

Any insight is appreciated.
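A couple of checks that should show whether the timeout sits in the Proxmox API daemons rather than in Ceph itself (node name is one of mine, path as per the standard PVE API):

Code:
# reproduce what the GUI asks for and time it
time pvesh get /nodes/proxmox-nuc01/ceph/status

# watch the API daemons while clicking the affected GUI pages
journalctl -f -u pveproxy -u pvedaemon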

ceph -s
Code:
  cluster:
    id:     dd1020e0-2316-436a-abd8-45ffe33aa28c
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum proxmox-nuc01,proxmox-nuc02,proxmox-nuc03 (age 15m)
    mgr: proxmox-nuc01(active, since 24h), standbys: proxmox-nuc02, proxmox-nuc03
    osd: 3 osds: 3 up (since 10m), 3 in (since 25h)

  data:
    pools:   2 pools, 33 pgs
    objects: 2.59M objects, 666 GiB
    usage:   1.6 TiB used, 3.7 TiB / 5.3 TiB avail
    pgs:     33 active+clean

  io:
    client:   2.0 KiB/s rd, 2.3 MiB/s wr, 0 op/s rd, 315 op/s wr

ip route

Code:
default via 192.168.20.1 dev vmbr0 proto kernel onlink
192.168.20.0/24 dev vmbr0 proto kernel scope link src 192.168.20.12
192.168.245.0/30 dev tb0 proto kernel scope link src 192.168.245.1
192.168.245.8/30 dev tb1 proto kernel scope link src 192.168.245.10
192.168.248.13 nhid 26 via 192.168.245.2 dev tb0 proto openfabric src 192.168.248.12 metric 20 onlink
192.168.248.14 nhid 27 via 192.168.245.9 dev tb1 proto openfabric src 192.168.248.12 metric 20 onlink
192.168.250.0/24 dev vmbr0.250 proto kernel scope link src 192.168.250.12

/etc/frr/frr.conf from one host (others the same)

Code:
frr version 10.3.1
frr defaults datacenter
hostname proxmox-nuc01
log syslog informational
service integrated-vtysh-config
!
router openfabric tb99
 net 49.0001.1921.6824.8012.00
exit
!
interface dummy_tb99
 ip router openfabric tb99
 openfabric passive
exit
!
interface tb0
 ip router openfabric tb99
 openfabric hello-interval 1
 openfabric csnp-interval 2
exit
!
interface tb1
 ip router openfabric tb99
 openfabric hello-interval 1
 openfabric csnp-interval 2
exit
!
access-list pve_openfabric_tb99_ips permit 192.168.248.0/24
!
route-map pve_openfabric permit 100
 match ip address pve_openfabric_tb99_ips
 set src 192.168.248.12
exit
!
ip protocol openfabric route-map pve_openfabric
!
!
line vty
!
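
For completeness, the OpenFabric state can also be checked from inside FRR; something along these lines (exact show commands depend on the FRR version):

Code:
# adjacencies and routes as fabricd sees them
vtysh -c "show openfabric neighbor"
vtysh -c "show openfabric route"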

ceph config

Code:
[global]
    auth_client_required = cephx
    auth_cluster_required = cephx
    auth_service_required = cephx
    cluster_network = 192.168.248.0/24
    fsid = dd1020e0-2316-436a-abd8-45ffe33aa28c
    mon_allow_pool_delete = true
    mon_host = 192.168.248.12 192.168.248.13 192.168.248.14
    ms_bind_ipv4 = true
    ms_bind_ipv6 = false
    osd_pool_default_min_size = 2
    osd_pool_default_size = 2
    public_network = 192.168.248.0/24
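
Since both public_network and cluster_network sit on the OpenFabric loopback subnet, it's worth confirming the monitors are really bound and reachable there; a quick sketch:

Code:
# which addresses do the mons advertise?
ceph mon dump

# is the local mon actually listening on its 192.168.248.x address?
ss -tlnp | grep ceph-mon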

EDIT:
Interestingly... just reading the output above, I notice osd_pool_default_min_size and osd_pool_default_size are both set to 2. The pool was switched to 3/2 via the GUI; it started at 2/2 while I was migrating data from the last host into the pool, before moving the 3rd host over to Ceph. I'll look at that separately.
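A quick way to confirm what the pool is actually using, independent of the osd_pool_default_* values (those only apply at creation time; pool name below is just an example):

Code:
# live size/min_size per pool
ceph osd pool ls detail

# or query/set a single pool explicitly
ceph osd pool get vm-pool size
ceph osd pool set vm-pool size 3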
 
How is the bandwidth of your Thunderbolt links? Do you have other network adapters for corosync and guest traffic?
 
TB4 speed is pretty good and appears to be consistent.
Code:
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  25.8 GBytes  22.2 Gbits/sec  858            sender
[  5]   0.00-10.00  sec  25.8 GBytes  22.2 Gbits/sec                  receiver
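
For reference, the output above comes from a plain iperf3 run between two of the nodes (server started on the peer first; the target address here is the nuc02 loopback as an example):

Code:
# on proxmox-nuc02
iperf3 -s

# on proxmox-nuc01
iperf3 -c 192.168.248.13 -t 10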

VM traffic is over a 1 Gbit link, VLANs etc.
Corosync traffic is over the same physical interface, although on a separate network. I don't appear to have any issues with the cluster itself, as it's been running in the same setup (with the exception of TB and Ceph) for over a year.
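For what it's worth, the cluster side can be sanity-checked separately from Ceph:

Code:
# quorum / membership
pvecm status

# link state of the corosync rings
corosync-cfgtool -s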
 
It's probably worth mentioning (I just discovered it):
If I use the GUI and click through Ceph > Pools > Create, I receive the same "got timeout (500)" error.
If I create a pool via the CLI with ceph osd pool create..., it is created without issue and visible in the GUI immediately.
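
For reference, the CLI path that works looks roughly like this (pool name and PG count are just examples; pveceph wraps the same call and can also register the RBD storage):

Code:
# plain Ceph CLI (what I used)
ceph osd pool create testpool 32
ceph osd pool application enable testpool rbd

# or the Proxmox wrapper, which also adds the matching storage entry
pveceph pool create testpool --add_storages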