Ceph monitor - Status 'unknown' and refuses to start

fredu

New Member
Dec 2, 2023
Hi
New here! I've just completed the installation of 3 Proxmox instances, created a cluster, and installed Ceph on all three, using the same process for each.
Network-wise all is good, and all three nodes seem perfectly operational.

For some reason, on only one of my nodes, I cannot get a monitor to start. It's continually at status Unknown, and when I try anything, it looks like the system doesn't even know about it.
Sorry, I'm not very skilled in this area.
There is nothing in the logs that's of any use.

Looking around, I suspect it's a communications thing with sockets. But why... they were all configured the same way.

Appreciate any help you can give me!
 
There is nothing in the logs that's of any use.
I can not imagine that. What does journalctl or the status of the service itself say? What is in the CEPH logs under /var/log/ceph? What is in the syslog?

Please post your CEPH Config here and a Screenshot from the Dashboard.
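For example, assuming the standard Proxmox/Ceph layout and that the mon id matches the hostname (`<nodename>` is a placeholder, not from this thread):

```shell
# systemd view of the mon on the affected node
systemctl status ceph-mon@<nodename>.service
journalctl -u ceph-mon@<nodename>.service --since "-1 hour"

# Ceph's own log files
tail -n 100 /var/log/ceph/ceph-mon.<nodename>.log
tail -n 100 /var/log/ceph/ceph.log
```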
 
Thanks sb-jw. Sorry, I shouldn't have been dismissive of the logs (tbh I didn't know where to look), although the GUI logs don't tell me an awful lot.

The service monitor status:
Code:
root@proxmoxfirebat2:~# systemctl status ceph-mon@proxmoxfirebat2.service
● ceph-mon@proxmoxfirebat2.service - Ceph cluster monitor daemon
     Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; preset: enabled)
    Drop-In: /usr/lib/systemd/system/ceph-mon@.service.d
             └─ceph-after-pve-cluster.conf
     Active: active (running) since Sat 2023-12-02 17:04:52 GMT; 16h ago
   Main PID: 836 (ceph-mon)
      Tasks: 25
     Memory: 65.6M
        CPU: 1min 55.809s
     CGroup: /system.slice/system-ceph\x2dmon.slice/ceph-mon@proxmoxfirebat2.service
             └─836 /usr/bin/ceph-mon -f --cluster ceph --id proxmoxfirebat2 --setuser ceph --setgroup ceph

Dec 02 17:04:52 proxmoxfirebat2 systemd[1]: Started ceph-mon@proxmoxfirebat2.service - Ceph cluster monitor daemon.

CEPH logs under /var/log/ceph
SysLog - Sorry, not sure where this is stored (unless this is the ceph.log above)
  • edit - just noticed in the GUI system tab, it tells me syslog (syslog.service) is not installed. Is this something I should install? It's like that for all nodes.
  • Also, oddly, the same for systemd-timesyncd (systemd-timesyncd.service)
Please let me know if there are any further logs you need, or information required.
Appreciate the help :)
 
I don't seem to have a syslog file at /var/log/syslog. Must be related to the syslog.service not being installed?
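As far as I understand (worth verifying), newer Debian-based Proxmox releases don't ship rsyslog by default, so there is no /var/log/syslog file; the systemd journal holds the same messages:

```shell
# Read what would have gone to syslog from the systemd journal
journalctl --since today

# Or follow it live, like `tail -f /var/log/syslog` used to
journalctl -f
```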

ceph -w :
Code:
cluster:
    id:     004ebb4c-45f9-46d5-9ba9-b284d2a2b6aa
    health: HEALTH_OK
 
  services:
    mon: 2 daemons, quorum proxmoxfirebat1,proxmoxchuwi (age 17h)
    mgr: proxmoxfirebat1(active, since 18h), standbys: proxmoxfirebat2, proxmoxchuwi
    osd: 3 osds: 3 up (since 16h), 3 in (since 18h)
 
  data:
    pools:   2 pools, 33 pgs
    objects: 2 objects, 705 KiB
    usage:   103 MiB used, 2.8 TiB / 2.8 TiB avail
    pgs:     33 active+clean
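The services line above lists only two mons. One way to double-check what the cluster itself knows about its monitors (standard ceph CLI, run from any node in quorum):

```shell
# Monitor map: every mon the cluster knows, with ids and addresses
ceph mon dump

# One-line quorum summary
ceph mon stat
```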

CEPH Config (Copied from the GUI Ceph.Configuration tab):
Code:
[global]
     auth_client_required = cephx
     auth_cluster_required = cephx
     auth_service_required = cephx
     cluster_network = 192.168.0.4/24
     fsid = 004ebb4c-45f9-46d5-9ba9-b284d2a2b6aa
     mon_allow_pool_delete = true
     mon_host = 192.168.0.4 192.168.0.5
     ms_bind_ipv4 = true
     ms_bind_ipv6 = false
     osd_pool_default_min_size = 2
     osd_pool_default_size = 3
     public_network = 192.168.0.4/24


[client]
     keyring = /etc/pve/priv/$cluster.$name.keyring


[mon.proxmoxchuwi]
     public_addr = 192.168.0.5


[mon.proxmoxfirebat1]
     public_addr = 192.168.0.4

Crush Map
Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54


# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd


# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root


# buckets
host proxmoxchuwi {
    id -3        # do not change unnecessarily
    id -4 class ssd        # do not change unnecessarily
    # weight 0.90970
    alg straw2
    hash 0    # rjenkins1
    item osd.0 weight 0.90970
}
host proxmoxfirebat1 {
    id -5        # do not change unnecessarily
    id -6 class ssd        # do not change unnecessarily
    # weight 0.93149
    alg straw2
    hash 0    # rjenkins1
    item osd.1 weight 0.93149
}
host proxmoxfirebat2 {
    id -7        # do not change unnecessarily
    id -8 class ssd        # do not change unnecessarily
    # weight 0.93149
    alg straw2
    hash 0    # rjenkins1
    item osd.2 weight 0.93149
}
root default {
    id -1        # do not change unnecessarily
    id -2 class ssd        # do not change unnecessarily
    # weight 2.77267
    alg straw2
    hash 0    # rjenkins1
    item proxmoxchuwi weight 0.90970
    item proxmoxfirebat1 weight 0.93149
    item proxmoxfirebat2 weight 0.93149
}


# rules
rule replicated_rule {
    id 0
    type replicated
    step take default
    step chooseleaf firstn 0 type host
    step emit
}


# end crush map
 

Attachments

  • ceph dashboard.jpg (136.7 KB)
  • ceph Config.jpg (210.3 KB)
If you click on Monitor on the left in the GUI, what does it look like?
 
Then simply remove the incorrect Mon via "destroy" and add it again.
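On Proxmox this can also be done from the CLI; a sketch, assuming the mon id equals the node name (run the create step on the node that should host the new mon):

```shell
# Remove the broken monitor (<nodename> is a placeholder)
pveceph mon destroy <nodename>

# Recreate a monitor on the current node
pveceph mon create
```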
 
What a journey! It's all sorted now, MANY thanks for the help.
In summary (for others), this is what I did:
  1. Stopped all mon services on all nodes ( systemctl stop ceph-mon.target )
    1. I did this from both the GUI and the CLI, just to be sure.
    2. Removing the monitor ( ceph mon remove {mon-id} ) wasn't useful, as the system thought it didn't exist.
  2. Checked (on all nodes) that ceph.conf is accurate and contains no reference to the faulty monitor: /etc/ceph/ceph.conf
  3. Deleted the directory of the faulty monitor (on the faulty node) under /var/lib/ceph/mon
    1. I didn't need to restart anything; it figured it out itself.
  4. In the GUI, recreated the monitor on the previously faulty node.

Et voila. Thanks!!
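Roughly, the steps above as commands (mon id and paths are the ones from this thread; step 3 is destructive, so only do it for a mon the cluster genuinely no longer knows about):

```shell
# 1. Stop all monitor services, on every node
systemctl stop ceph-mon.target

# 2. Verify ceph.conf no longer references the faulty mon
grep -n "proxmoxfirebat2" /etc/ceph/ceph.conf

# 3. On the faulty node, delete the stale mon's data directory
#    (the standard layout is /var/lib/ceph/mon/<cluster>-<id>)
rm -rf /var/lib/ceph/mon/ceph-proxmoxfirebat2

# 4. Recreate the monitor on that node (GUI, or:)
pveceph mon create
```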
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!