Ceph Monitor showing "?" in Gui

vivekdelhi · Oct 23, 2019

I am running an experimental 3-node Proxmox VE 6 hyperconverged Ceph cluster (blade01, blade02, blade10). I had an issue with Ceph versions that has been corrected. However, I am now seeing an issue with the monitors on Blade02.
GUI screenshot attached. I see a "?" and on hovering I get "Address Unknown / Stopped".

If I go to the Monitor screen, I see only one monitor on Blade02. The Start, Stop, and Restart actions give a "Done" popup.

Why the discrepancy ??
I am unable to create a new Monitor on Blade02 as well.

Appreciate the help.
Thanks
Vivek

The syslog on Blade02 shows:

Code:

Oct 23 09:44:41  systemd[1]: Started Ceph cluster monitor daemon.
Oct 23 09:44:41  ceph-mon[39041]: 2019-10-23 09:44:41.764 7f36adf6a440 -1 rocksdb: IO error: while open a file for lock: /var/lib/ceph/mon/ceph-dell0104blade02/store.db/LOCK: Permission denied
Oct 23 09:44:41  ceph-mon[39041]: 2019-10-23 09:44:41.764 7f36adf6a440 -1 error opening mon data directory at '/var/lib/ceph/mon/ceph-dell0104blade02': (22) Invalid argument
Oct 23 09:44:41  systemd[1]: ceph-mon@dell0104blade02.service: Main process exited, code=exited, status=1/FAILURE
Oct 23 09:44:41  systemd[1]: ceph-mon@dell0104blade02.service: Failed with result 'exit-code'.
Oct 23 09:44:45  systemd[1]: Stopped Ceph cluster monitor daemon.
Oct 23 09:44:45  systemd[1]: Started Ceph cluster monitor daemon.

Oct 23 09:44:45  ceph-mon[39131]: 2019-10-23 09:44:45.244 7fd5fd4ad440 -1 rocksdb: IO error: while open a file for lock: /var/lib/ceph/mon/ceph-dell0104blade02/store.db/LOCK: Permission denied
Oct 23 09:44:45 dell0104blade10 ceph-mon[39131]: 2019-10-23 09:44:45.244 7fd5fd4ad440 -1 error opening mon data directory at '/var/lib/ceph/mon/ceph-dell0104blade02': (22) Invalid argument
Oct 23 09:44:45  systemd[1]: ceph-mon@dell0104blade02.service: Main process exited, code=exited, status=1/FAILURE
Oct 23 09:44:45  systemd[1]: ceph-mon@dell0104blade02.service: Failed with result 'exit-code'.
Oct 23 09:44:55  systemd[1]: ceph-mon@dell0104blade02.service: Service RestartSec=10s expired, scheduling restart.
Oct 23 09:44:55  systemd[1]: ceph-mon@dell0104blade02.service: Scheduled restart job, restart counter is at 1.
Oct 23 09:44:55  systemd[1]: Stopped Ceph cluster monitor daemon.
Oct 23 09:44:55  systemd[1]: Started Ceph cluster monitor daemon.

If I go to /var/lib/ceph/mon on Blade02, the directory is in fact empty. It is owned by user/group ceph:ceph, with rwxr-x--- permissions.

ceph -s shows only two monitors, on Blade01 and Blade10.

Code:

root@dell0104blade02:~# ceph -s
  cluster:
    id:     09fc106c-d4cf-4edc-867f-db170301f857
    health: HEALTH_OK
 
  services:
    mon: 2 daemons, quorum dell0104blade01,dell0104blade10 (age 2w)
    mgr: dell0104blade01(active, since 2w), standbys: dell0104blade10, dell0104blade02
    osd: 3 osds: 3 up (since 2w), 3 in (since 2w)
 
  data:
    pools:   1 pools, 128 pgs
    objects: 13.33k objects, 51 GiB
    usage:   121 GiB used, 995 GiB / 1.1 TiB avail
    pgs:     128 active+clean
 
  io:
    client:   1023 B/s wr, 0 op/s rd, 0 op/s wr

The Ceph global configuration in the GUI also shows two MONs:

Code:

[global]
     auth_client_required = cephx
     auth_cluster_required = cephx
     auth_service_required = cephx
     cluster_network = 192.168.15.31/24
     fsid = 09fc106c-d4cf-4edc-867f-db170301f857
     mon_allow_pool_delete = true
     mon_host = 192.168.15.31 192.168.15.204
     osd_pool_default_min_size = 2
     osd_pool_default_size = 3
     public_network = 192.168.15.31/24

[client]
     keyring = /etc/pve/priv/$cluster.$name.keyring

Alwin · Oct 23, 2019

Check if the systemd services for those MONs still exist. If so, remove them.

vivekdelhi · Oct 23, 2019

Alwin said:
Check if the systemd services for those MONs still exist. If so, remove them.

Thanks Alwin!
Indeed, I do see a systemd service:

Code:

root@dell0104blade02:/etc/systemd/system/ceph-mon.target.wants# ls -al
total 10
drwxr-xr-x  2 root root  3 Sep 20 16:19 .
drwxr-xr-x 14 root root 19 Sep 23 17:04 ..
lrwxrwxrwx  1 root root 37 Sep 20 16:19 ceph-mon@dell0104blade02.service -> /lib/systemd/system/ceph-mon@.service

Should I just delete the symlink ?
Much appreciated
Vivek

Alwin · Oct 23, 2019

vivekdelhi said:
Should I just delete the symlink ?

Yes.
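
A minimal sketch of that check, assuming the wants directory shown earlier in this thread. The helper name `list_mon_units` is made up for illustration. On the affected node, `systemctl disable` on the listed unit should remove the same symlink that would otherwise be deleted by hand, since the link is created by enabling the `ceph-mon@` instance.

```shell
#!/bin/sh
# List the ceph-mon@ units enabled under a systemd "wants" directory.
# list_mon_units is a hypothetical helper; the default path is the
# one shown in this thread.
list_mon_units() {
    wants_dir="${1:-/etc/systemd/system/ceph-mon.target.wants}"
    for link in "$wants_dir"/ceph-mon@*.service; do
        # Skip the unexpanded glob when nothing matches, but keep
        # dangling symlinks, which -e alone would miss.
        [ -e "$link" ] || [ -L "$link" ] || continue
        basename "$link"
    done
}

# Example, run as root on the affected node (not executed here):
#   list_mon_units
#   systemctl disable --now ceph-mon@dell0104blade02.service
```

Review the listed units against the monitors reported by `ceph -s` before disabling anything, so that a working MON is not removed by mistake.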

vivekdelhi · Nov 6, 2019

Alwin said:
Yes.

Thank You!

After following your advice, one ceph-mon is gone.
I still have one pesky mon left. On hovering, it shows up as present on dell0104blade10, while the name of the problematic mon is dell0104blade02 (screenshot attached).

Code:

root@dell0104blade10:~# ceph -s
  cluster:
    id:     09fc106c-d4cf-4edc-867f-db170301f857
    health: HEALTH_OK
 
  services:
    mon: 2 daemons, quorum dell0104blade01,dell0104blade10 (age 4w)
    mgr: dell0104blade01(active, since 4w), standbys: dell0104blade10, dell0104blade02
    osd: 3 osds: 3 up (since 4w), 3 in (since 4w)
 
  data:
    pools:   1 pools, 128 pgs
    objects: 14.99k objects, 57 GiB
    usage:   143 GiB used, 973 GiB / 1.1 TiB avail
    pgs:     128 active+clean

There is no systemd service on dell0104blade10 that can be deleted.

Code:

root@dell0104blade10:/etc/systemd/system/ceph-mon.target.wants# ls
ceph-mon@dell0104blade10.service

The syslog is not showing any Ceph-related errors either.

Any suggestions please ?

Thanks
Vivek

Alwin · Nov 7, 2019

Restart the pvestatd daemons on the cluster; the entry should then go away. On the other hand, the entry doesn't do anything; it is just irritating.

vivekdelhi · Apr 11, 2020

Hi Alwin,
I restarted the pvestatd service, but the monitor still shows up as "?" (screenshot attached).
I had to restart the cluster and had expected that the entry would disappear, but it still persists.
Any suggestions, please?
Thanks Again
Vivek

Alwin · Apr 12, 2020

If that MON doesn't exist, then go into /var/lib/ceph/mon and delete that MON's directory. The entry should then disappear from the GUI.
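
A rough sketch for spotting such a leftover directory, assuming the `ceph-<id>` naming seen in the log paths earlier in this thread. The helper name `stale_mon_dirs` is hypothetical, and the monitor IDs passed to it are the ones that should be kept, for example those listed in the quorum by `ceph -s`. It only prints candidates and deletes nothing.

```shell
#!/bin/sh
# Print monitor data directories under a root that do not belong to
# any of the monitor IDs given as the remaining arguments.
# stale_mon_dirs is a hypothetical helper; it never deletes anything.
stale_mon_dirs() {
    mon_root="$1"
    shift
    for dir in "$mon_root"/ceph-*; do
        [ -d "$dir" ] || continue
        # Strip everything up to and including the "ceph-" prefix.
        id="${dir##*/ceph-}"
        keep=no
        for active in "$@"; do
            if [ "$active" = "$id" ]; then
                keep=yes
            fi
        done
        if [ "$keep" = no ]; then
            printf '%s\n' "$dir"
        fi
    done
}

# Example, run on the node where the "?" entry is shown (not executed here):
#   stale_mon_dirs /var/lib/ceph/mon dell0104blade01 dell0104blade10
```

Anything it prints is only a candidate; check that no ceph-mon service on that node still uses the directory before removing it.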

vivekdelhi · Apr 13, 2020

Thanks Alwin,
Indeed this made the GUI entry disappear.
Much appreciated
Vivek
