Ceph monitor - Status 'unknown' and refuses to start

fredu

New Member
Dec 2, 2023
Hi
New here! I've just completed the installation of 3 Proxmox instances, created a cluster, and installed Ceph on all three, using the same process for each.
Network-wise all is good, and all three nodes seem perfectly operational.

For some reason, on only one of my nodes, I cannot get a monitor to start. It's continually at status Unknown, and when I try anything, it looks like the system doesn't even know about it.
Sorry, I'm not very skilled in this area.
There is nothing in the logs that's of any use.

Looking around, I suspect it's a communications thing with sockets. But why... they were all configured the same way.

Appreciate any help you can give me!
 
There is nothing in the logs that's of any use.
I can not imagine that. What does journalctl or the status of the service itself say? What is in the CEPH logs under /var/log/ceph? What is in the syslog?

Please post your CEPH Config here and a Screenshot from the Dashboard.
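For example, assuming the standard Proxmox/Ceph layout and that the mon id matches the hostname (`<nodename>` is a placeholder, not from this thread):

```shell
# systemd view of the mon on the affected node
systemctl status ceph-mon@<nodename>.service
journalctl -u ceph-mon@<nodename>.service --since "-1 hour"

# Ceph's own log files
tail -n 100 /var/log/ceph/ceph-mon.<nodename>.log
tail -n 100 /var/log/ceph/ceph.log
```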
 
Thanks sb-jw. Sorry, I shouldn't have been dismissive of the logs (tbh I didn't know where to look), although the GUI logs don't tell me an awful lot.

The service monitor status:
Code:
root@proxmoxfirebat2:~# systemctl status ceph-mon@proxmoxfirebat2.service
● ceph-mon@proxmoxfirebat2.service - Ceph cluster monitor daemon
     Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; preset: enabled)
    Drop-In: /usr/lib/systemd/system/ceph-mon@.service.d
             └─ceph-after-pve-cluster.conf
     Active: active (running) since Sat 2023-12-02 17:04:52 GMT; 16h ago
   Main PID: 836 (ceph-mon)
      Tasks: 25
     Memory: 65.6M
        CPU: 1min 55.809s
     CGroup: /system.slice/system-ceph\x2dmon.slice/ceph-mon@proxmoxfirebat2.service
             └─836 /usr/bin/ceph-mon -f --cluster ceph --id proxmoxfirebat2 --setuser ceph --setgroup ceph

Dec 02 17:04:52 proxmoxfirebat2 systemd[1]: Started ceph-mon@proxmoxfirebat2.service - Ceph cluster monitor daemon.

CEPH logs under /var/log/ceph
SysLog - Sorry, not sure where this is stored (unless this is the ceph.log above)
  • edit - just noticed in the GUI system tab, it tells me syslog (syslog.service) is not installed. Is this something I should install? It's like that for all nodes.
  • Also, oddly, the same for systemd-timesyncd (systemd-timesyncd.service)
Please let me know if there are any further logs you need, or information required.
Appreciate the help :)
 
I don't seem to have a syslog file at /var/log/syslog. Must be related to the syslog.service not being installed?
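As far as I understand (worth verifying), newer Debian-based Proxmox releases don't ship rsyslog by default, so there is no /var/log/syslog file; the systemd journal holds the same messages:

```shell
# Read what would have gone to syslog from the systemd journal
journalctl --since today

# Or follow it live, like `tail -f /var/log/syslog` used to
journalctl -f
```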

ceph -w :
Code:
cluster:
    id:     004ebb4c-45f9-46d5-9ba9-b284d2a2b6aa
    health: HEALTH_OK
 
  services:
    mon: 2 daemons, quorum proxmoxfirebat1,proxmoxchuwi (age 17h)
    mgr: proxmoxfirebat1(active, since 18h), standbys: proxmoxfirebat2, proxmoxchuwi
    osd: 3 osds: 3 up (since 16h), 3 in (since 18h)
 
  data:
    pools:   2 pools, 33 pgs
    objects: 2 objects, 705 KiB
    usage:   103 MiB used, 2.8 TiB / 2.8 TiB avail
    pgs:     33 active+clean
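The services line above lists only two mons. One way to double-check what the cluster itself knows about its monitors (standard ceph CLI, run from any node in quorum):

```shell
# Monitor map: every mon the cluster knows, with ids and addresses
ceph mon dump

# One-line quorum summary
ceph mon stat
```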

CEPH Config (Copied from the GUI Ceph.Configuration tab):
Code:
[global]
     auth_client_required = cephx
     auth_cluster_required = cephx
     auth_service_required = cephx
     cluster_network = 192.168.0.4/24
     fsid = 004ebb4c-45f9-46d5-9ba9-b284d2a2b6aa
     mon_allow_pool_delete = true
     mon_host = 192.168.0.4 192.168.0.5
     ms_bind_ipv4 = true
     ms_bind_ipv6 = false
     osd_pool_default_min_size = 2
     osd_pool_default_size = 3
     public_network = 192.168.0.4/24


[client]
     keyring = /etc/pve/priv/$cluster.$name.keyring


[mon.proxmoxchuwi]
     public_addr = 192.168.0.5


[mon.proxmoxfirebat1]
     public_addr = 192.168.0.4

Crush Map
Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54


# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd


# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root


# buckets
host proxmoxchuwi {
    id -3        # do not change unnecessarily
    id -4 class ssd        # do not change unnecessarily
    # weight 0.90970
    alg straw2
    hash 0    # rjenkins1
    item osd.0 weight 0.90970
}
host proxmoxfirebat1 {
    id -5        # do not change unnecessarily
    id -6 class ssd        # do not change unnecessarily
    # weight 0.93149
    alg straw2
    hash 0    # rjenkins1
    item osd.1 weight 0.93149
}
host proxmoxfirebat2 {
    id -7        # do not change unnecessarily
    id -8 class ssd        # do not change unnecessarily
    # weight 0.93149
    alg straw2
    hash 0    # rjenkins1
    item osd.2 weight 0.93149
}
root default {
    id -1        # do not change unnecessarily
    id -2 class ssd        # do not change unnecessarily
    # weight 2.77267
    alg straw2
    hash 0    # rjenkins1
    item proxmoxchuwi weight 0.90970
    item proxmoxfirebat1 weight 0.93149
    item proxmoxfirebat2 weight 0.93149
}


# rules
rule replicated_rule {
    id 0
    type replicated
    step take default
    step chooseleaf firstn 0 type host
    step emit
}


# end crush map
 

Attachments

  • ceph dashboard.jpg (136.7 KB)
  • ceph Config.jpg (210.3 KB)
If you click on Monitor on the left in the GUI, what does it look like?
 
Then simply remove the incorrect Mon via "destroy" and add it again.
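On Proxmox this can also be done from the CLI; a sketch, assuming the mon id equals the node name (run the create step on the node that should host the new mon):

```shell
# Remove the broken monitor (<nodename> is a placeholder)
pveceph mon destroy <nodename>

# Recreate a monitor on the current node
pveceph mon create
```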
 
What a journey! It's all sorted now, MANY thanks for the help.
In summary (for others), this is what I did:
  1. Stopped all mon services on all nodes ( systemctl stop ceph-mon.target )
    1. I did this from both the GUI and the CLI, just to be sure.
    2. Removing the monitor ( ceph mon remove {mon-id} ) wasn't useful, as the system thought it didn't exist.
  2. Checked (on all nodes) that ceph.conf is accurate and contains no reference to the faulty monitor: /etc/ceph/ceph.conf
  3. Deleted the directory of the faulty monitor (on the faulty node) under /var/lib/ceph/mon
    1. I didn't need to restart anything; it figured it out itself.
  4. In the GUI, recreated the monitor on the previously faulty node.

Et voila. Thanks!!
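Roughly, the steps above as commands (mon id and paths are the ones from this thread; step 3 is destructive, so only do it for a mon the cluster genuinely no longer knows about):

```shell
# 1. Stop all monitor services, on every node
systemctl stop ceph-mon.target

# 2. Verify ceph.conf no longer references the faulty mon
grep -n "proxmoxfirebat2" /etc/ceph/ceph.conf

# 3. On the faulty node, delete the stale mon's data directory
#    (the standard layout is /var/lib/ceph/mon/<cluster>-<id>)
rm -rf /var/lib/ceph/mon/ceph-proxmoxfirebat2

# 4. Recreate the monitor on that node (GUI, or:)
pveceph mon create
```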
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!