Ceph Monitor showing "?" in GUI

vivekdelhi

New Member
Sep 13, 2019
I am running an experimental 3-node Proxmox VE 6 hyperconverged Ceph cluster (blade01, blade02, blade10). I had an issue with Ceph versions that has since been corrected. However, I am now seeing an issue with the monitors on Blade02.
A GUI screenshot is attached. I see a "?", and on hovering over it I get "Address Unknown / Stopped".
Screenshot from 2019-10-23 09-41-52.png

If I go to the Monitor screen, I see only one monitor on Blade02. The Start,
Stop, and Restart actions each give a "Done" popup.

Why the discrepancy?
I am also unable to create a new monitor on Blade02.

Appreciate the help.
Thanks
Vivek


The syslog for Blade02 shows:
Code:
Oct 23 09:44:41  systemd[1]: Started Ceph cluster monitor daemon.
Oct 23 09:44:41  ceph-mon[39041]: 2019-10-23 09:44:41.764 7f36adf6a440 -1 rocksdb: IO error: while open a file for lock: /var/lib/ceph/mon/ceph-dell0104blade02/store.db/LOCK: Permission denied
Oct 23 09:44:41  ceph-mon[39041]: 2019-10-23 09:44:41.764 7f36adf6a440 -1 error opening mon data directory at '/var/lib/ceph/mon/ceph-dell0104blade02': (22) Invalid argument
Oct 23 09:44:41  systemd[1]: ceph-mon@dell0104blade02.service: Main process exited, code=exited, status=1/FAILURE
Oct 23 09:44:41  systemd[1]: ceph-mon@dell0104blade02.service: Failed with result 'exit-code'.
Oct 23 09:44:45  systemd[1]: Stopped Ceph cluster monitor daemon.
Oct 23 09:44:45  systemd[1]: Started Ceph cluster monitor daemon.

Oct 23 09:44:45  ceph-mon[39131]: 2019-10-23 09:44:45.244 7fd5fd4ad440 -1 rocksdb: IO error: while open a file for lock: /var/lib/ceph/mon/ceph-dell0104blade02/store.db/LOCK: Permission denied
Oct 23 09:44:45 dell0104blade10 ceph-mon[39131]: 2019-10-23 09:44:45.244 7fd5fd4ad440 -1 error opening mon data directory at '/var/lib/ceph/mon/ceph-dell0104blade02': (22) Invalid argument
Oct 23 09:44:45  systemd[1]: ceph-mon@dell0104blade02.service: Main process exited, code=exited, status=1/FAILURE
Oct 23 09:44:45  systemd[1]: ceph-mon@dell0104blade02.service: Failed with result 'exit-code'.
Oct 23 09:44:55  systemd[1]: ceph-mon@dell0104blade02.service: Service RestartSec=10s expired, scheduling restart.
Oct 23 09:44:55  systemd[1]: ceph-mon@dell0104blade02.service: Scheduled restart job, restart counter is at 1.
Oct 23 09:44:55  systemd[1]: Stopped Ceph cluster monitor daemon.
Oct 23 09:44:55  systemd[1]: Started Ceph cluster monitor daemon.

If I go to the directory /var/lib/ceph/mon on Blade02, it is in fact empty. The mon directory is owned by user/group ceph/ceph (permissions rwxr-x---).
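Since the log reports "Permission denied" on store.db/LOCK, one quick check is to print the owner and mode of the monitor data directory and everything below it. This is only a minimal sketch, assuming GNU stat and find (as on a Debian-based Proxmox VE node); show_owner_mode is a hypothetical helper name, not a Ceph or Proxmox tool.

```shell
# Print "owner:group mode path" for a directory and everything below it.
# show_owner_mode is a hypothetical helper; it relies on GNU stat's -c option.
show_owner_mode() {
    dir="$1"
    # -exec ... {} + batches the paths into as few stat calls as possible.
    find "$dir" -exec stat -c '%U:%G %a %n' {} +
}

# Example (path taken from the log above):
#   show_owner_mode /var/lib/ceph/mon
```

Any entry that is not owned by ceph:ceph would be a candidate explanation for the lock error, although the output alone does not prove it.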


ceph -s shows only two monitors, on Blade01 and Blade10:

Code:
root@dell0104blade02:~# ceph -s
  cluster:
    id:     09fc106c-d4cf-4edc-867f-db170301f857
    health: HEALTH_OK
 
  services:
    mon: 2 daemons, quorum dell0104blade01,dell0104blade10 (age 2w)
    mgr: dell0104blade01(active, since 2w), standbys: dell0104blade10, dell0104blade02
    osd: 3 osds: 3 up (since 2w), 3 in (since 2w)
 
  data:
    pools:   1 pools, 128 pgs
    objects: 13.33k objects, 51 GiB
    usage:   121 GiB used, 995 GiB / 1.1 TiB avail
    pgs:     128 active+clean
 
  io:
    client:   1023 B/s wr, 0 op/s rd, 0 op/s wr

The Ceph global configuration in the GUI also shows two MONs:

Code:
[global]
     auth_client_required = cephx
     auth_cluster_required = cephx
     auth_service_required = cephx
     cluster_network = 192.168.15.31/24
     fsid = 09fc106c-d4cf-4edc-867f-db170301f857
     mon_allow_pool_delete = true
     mon_host = 192.168.15.31 192.168.15.204
     osd_pool_default_min_size = 2
     osd_pool_default_size = 3
     public_network = 192.168.15.31/24

[client]
     keyring = /etc/pve/priv/$cluster.$name.keyring
 
Check if the systemd services for those MONs still exist. If so, remove them.
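A rough way to do that check is to compare the ceph-mon@<id>.service links in the systemd "wants" directory against the monitor names that ceph -s reports in quorum. The sketch below only lists candidates and removes nothing; leftover_mon_units is a hypothetical helper name, and the list of live monitors has to be supplied by hand.

```shell
# List monitor IDs that have a ceph-mon@<id>.service link in a systemd
# "wants" directory but are not in the given list of live monitors.
# leftover_mon_units is a hypothetical helper name.
leftover_mon_units() {
    wants_dir="$1"   # e.g. /etc/systemd/system/ceph-mon.target.wants
    live_mons="$2"   # space-separated monitor names, e.g. from `ceph -s`
    for unit in "$wants_dir"/ceph-mon@*.service; do
        # Skip the unexpanded pattern when the directory has no matches.
        [ -e "$unit" ] || [ -L "$unit" ] || continue
        id=${unit##*/ceph-mon@}
        id=${id%.service}
        case " $live_mons " in
            *" $id "*) ;;        # monitor is live, leave it alone
            *) echo "$id" ;;     # no live monitor with this name
        esac
    done
}

# Example:
#   leftover_mon_units /etc/systemd/system/ceph-mon.target.wants \
#       "dell0104blade01 dell0104blade10"
```

For each ID printed, systemctl disable --now ceph-mon@<id>.service would stop the unit and remove its link; review the list before running that.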
 
Thanks, Alwin!
Indeed, I do see a systemd service:
Code:
root@dell0104blade02:/etc/systemd/system/ceph-mon.target.wants# ls -al
total 10
drwxr-xr-x  2 root root  3 Sep 20 16:19 .
drwxr-xr-x 14 root root 19 Sep 23 17:04 ..
lrwxrwxrwx  1 root root 37 Sep 20 16:19 ceph-mon@dell0104blade02.service -> /lib/systemd/system/ceph-mon@.service

Should I just delete the symlink?
Much appreciated
Vivek
 
Thank You!

After following your advice, one ceph-mon entry is gone.
I still have one pesky MON left. On hovering, it shows up as present on dell0104blade10, while the name of the problematic MON is dell0104blade02 (screenshot attached).

Code:
root@dell0104blade10:~# ceph -s
  cluster:
    id:     09fc106c-d4cf-4edc-867f-db170301f857
    health: HEALTH_OK
 
  services:
    mon: 2 daemons, quorum dell0104blade01,dell0104blade10 (age 4w)
    mgr: dell0104blade01(active, since 4w), standbys: dell0104blade10, dell0104blade02
    osd: 3 osds: 3 up (since 4w), 3 in (since 4w)
 
  data:
    pools:   1 pools, 128 pgs
    objects: 14.99k objects, 57 GiB
    usage:   143 GiB used, 973 GiB / 1.1 TiB avail
    pgs:     128 active+clean

There is no systemd service on dell0104blade10 that can be deleted.

Code:
root@dell0104blade10:/etc/systemd/system/ceph-mon.target.wants# ls
ceph-mon@dell0104blade10.service

The syslog is not showing any Ceph-related errors either.

Any suggestions, please?

Thanks
Vivek
 

Attachments

  • Screenshot from 2019-11-06 17-59-23.png
Restart the pvestatd daemon on each node of the cluster; the entry should then go away. That said, the stale entry doesn't do anything, it is just irritating.
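One way to restart pvestatd on every node in one go is a small loop over the node names. This is a sketch only: it assumes root SSH access between the nodes, restart_pvestatd and the RUNNER variable are hypothetical names introduced here, and setting RUNNER=echo gives a dry run that just prints what would be executed.

```shell
# Restart pvestatd on each node passed as an argument.
# RUNNER defaults to ssh; set RUNNER=echo to print the commands instead.
# restart_pvestatd and RUNNER are hypothetical names for this sketch.
restart_pvestatd() {
    runner="${RUNNER:-ssh}"
    for node in "$@"; do
        "$runner" "root@$node" systemctl restart pvestatd
    done
}

# Example:
#   restart_pvestatd dell0104blade01 dell0104blade02 dell0104blade10
```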
 
Hi Alwin,
I restarted the pvestatd service, but the monitor still shows up as "?" (screenshot attached).
I also had to restart the cluster and expected the entry to disappear, but it persists.
Any suggestions, please?
Thanks again
Vivek

mon.PNG
 
If that MON doesn't exist, go into /var/lib/ceph/mon and delete its directory. The entry should then disappear from the GUI.
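To find which directory is the leftover one, the names under /var/lib/ceph/mon can be compared with the monitors that are actually in quorum. The sketch below assumes the default cluster name "ceph", so directories are named ceph-<id> as in the log earlier in the thread; stale_mon_dirs is a hypothetical helper name, and it only prints paths without deleting anything.

```shell
# Print monitor data directories under mon_dir whose ID is not in the list
# of live monitors. Assumes the default cluster name "ceph", so directory
# names look like ceph-<id>. stale_mon_dirs is a hypothetical helper name.
stale_mon_dirs() {
    mon_dir="$1"     # e.g. /var/lib/ceph/mon
    live_mons="$2"   # space-separated monitor names, e.g. from `ceph -s`
    for path in "$mon_dir"/ceph-*; do
        [ -d "$path" ] || continue
        id=${path##*/ceph-}
        case " $live_mons " in
            *" $id "*) ;;          # belongs to a live monitor
            *) echo "$path" ;;     # no live monitor uses this directory
        esac
    done
}

# Example:
#   stale_mon_dirs /var/lib/ceph/mon "dell0104blade01 dell0104blade10"
```

Run it on each node in turn, since the leftover directory may sit on a different node than the one named in the monitor ID.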
 