[SOLVED] Ghost monitor in CEPH cluster

After an upgrade from 5.x to 6.x, one of the CEPH monitors became a "ghost",
with status "stopped" and address "unknown".
It can be neither started, created, nor destroyed; the errors are as below:
create: monitor address '10.10.10.104' already in use (500)
destroy: no such monitor id 'pve-node4' (500)

I deleted the "alive" mons, pools, OSDs and mgrs and tried to recreate everything from scratch - the mon on pve-node4 still had the status described above.

One more thing to note: even though the PVE GUI shows 4 mons (3 active), there is only one monitor entry in ceph.conf:

Code:
[global]
     auth_client_required = cephx
     auth_cluster_required = cephx
     auth_service_required = cephx
     cluster_network = 10.10.10.0/24
     fsid = c2d639ef-c720-4c85-ac77-2763ecaa0a5e
     mon_allow_pool_delete = true
     mon_host = 10.10.10.101 10.10.10.102 10.10.10.103
     osd_journal_size = 5120
     osd_pool_default_min_size = 2
     osd_pool_default_size = 3
     public_network = 10.10.10.0/24

[client]
     keyring = /etc/pve/priv/$cluster.$name.keyring

[osd]
     osd_class_update_on_start = false
     osd_max_backfills = 2
     osd_memory_target = 2147483648

[mon.pve-node1]
     host = pve-node1
     mon_addr = 10.10.10.101:6789

Code:
root@pve-node4:~# ceph -s
  cluster:
    id:     c2d639ef-c720-4c85-ac77-2763ecaa0a5e
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum pve-node1,pve-node3,pve-node2 (age 2h)
    mgr: pve-node1(active, since 12h), standbys: pve-node2, pve-node3
    osd: 0 osds: 0 up, 0 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:
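
For reference, the monitor map itself can be checked directly from any node; if the ghost exists only on the PVE side, it should not show up there. These are standard Ceph commands, nothing specific to this setup:
Code:
# print the current monitor map - pve-node4 should not be listed if it was really removed
ceph mon dump
# one-line summary of the same information
ceph mon stat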

Any ideas on how to delete the mon entry on pve-node4 or reinstall it?
Thanks in advance
 
The systemd service is probably still enabled. Try
Code:
systemctl disable ceph-mon@pve-node4
on pve-node4
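
If the unit also keeps getting restarted or is stuck in a failed state, stopping it and clearing that state at the same time may help; a slightly fuller variant of the same idea:
Code:
# stop and disable the leftover mon unit in one step
systemctl disable --now ceph-mon@pve-node4
# clear any 'failed' state left over from earlier start attempts
systemctl reset-failed ceph-mon@pve-node4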
 
Yep, the systemd service was enabled, but disabling it changes nothing.

Ceph log on pve-node4 when starting the mon:
Code:
Oct 04 13:41:25 pve-node4 systemd[1]: Started Ceph cluster monitor daemon.
Oct 04 13:41:25 pve-node4 ceph-mon[436732]: 2019-10-04 13:41:25.495 7f5aed4ec440 -1 mon.pve-node4@-1(???) e14 not in monmap and have been in a quorum before; must have been removed
Oct 04 13:41:25 pve-node4 ceph-mon[436732]: 2019-10-04 13:41:25.495 7f5aed4ec440 -1 mon.pve-node4@-1(???) e14 commit suicide!
Oct 04 13:41:25 pve-node4 ceph-mon[436732]: 2019-10-04 13:41:25.495 7f5aed4ec440 -1 failed to initialize
Oct 04 13:41:25 pve-node4 systemd[1]: ceph-mon@pve-node4.service: Main process exited, code=exited, status=1/FAILURE
Oct 04 13:41:25 pve-node4 systemd[1]: ceph-mon@pve-node4.service: Failed with result 'exit-code'.
Oct 04 13:41:35 pve-node4 systemd[1]: ceph-mon@pve-node4.service: Service RestartSec=10s expired, scheduling restart.
Oct 04 13:41:35 pve-node4 systemd[1]: ceph-mon@pve-node4.service: Scheduled restart job, restart counter is at 6.
Oct 04 13:41:35 pve-node4 systemd[1]: Stopped Ceph cluster monitor daemon.

Any other suggestions?
 
Not sure if it's related, but I don't have any OSDs in my cluster at the moment.

Code:
root@pve-node4:~# systemctl | grep ceph-
● ceph-mon@pve-node4.service                                                                                                     loaded failed     failed    Ceph cluster monitor daemon                                                  
● ceph-osd@32.service                                                                                                            loaded failed     failed    Ceph object storage daemon osd.32                                            
  ceph-mgr.target                                                                                                                loaded active     active    ceph target allowing to start/stop all ceph-mgr@.service instances at once   
  ceph-osd.target                                                                                                                loaded active     active
 
Nothing changes:
Code:
root@pve-node4:~# pveceph purge
detected running ceph services- unable to purge data
root@pve-node4:~# pveceph createmon
monitor 'pve-node4' already exists
root@pve-node4:~#
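
A guess at what is blocking it, in line with the staff reply further down: pveceph purge backs off while systemd still reports Ceph units, and the earlier listing showed ceph-mon@pve-node4 and ceph-osd@32 as loaded/failed. Under that assumption, clearing them before retrying would look roughly like:
Code:
# stop/disable the stale units and drop their 'failed' state
systemctl disable --now ceph-mon@pve-node4 ceph-osd@32
systemctl reset-failed ceph-mon@pve-node4 ceph-osd@32
# then retry the purge
pveceph purge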
 
As @dcsapak said, disable the mon service and then remove the directory /var/lib/ceph/mon/pve-node4.
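
A short sketch of those two steps; note that with the default cluster name the mon data directory usually carries a ceph- prefix, so the path may actually be /var/lib/ceph/mon/ceph-pve-node4 - check what is really there before removing anything:
Code:
systemctl disable --now ceph-mon@pve-node4
# list the existing mon data directories first
ls /var/lib/ceph/mon/
# then remove the stale one (assuming the default cluster name 'ceph')
rm -rf /var/lib/ceph/mon/ceph-pve-node4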
 
As @dcsapak said, disable the mon service and then remove the directory /var/lib/ceph/mon/pve-node4.
I did it. I even deleted the whole /var/lib/ceph folder and all ceph*-related services under /etc/systemd/... and rebooted that node

but pveceph purge still says:
Code:
root@pve-node4:~# pveceph purge
detected running ceph services- unable to purge data

What does pveceph purge check for as "running ceph services"? How can I completely remove Ceph from the node and reinstall it?

Thanks
 
What does pveceph purge check for as "running ceph services"? How can I completely remove Ceph from the node and reinstall it?
Any running Ceph service will keep 'pveceph purge' from removing configs. You don't need to re-install Ceph; once you have removed all Ceph services on all nodes and their directories (as well as ceph.conf), the Ceph cluster ceases to exist. Then you can create a new cluster.
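
A rough per-node sequence under those assumptions (default PVE/Ceph paths; the instance name below is just the example from this thread and needs adjusting per node):
Code:
# stop every Ceph daemon on this node
systemctl stop ceph.target
# disable leftover instance units (adjust the IDs)
systemctl disable ceph-mon@pve-node4
# wipe the local Ceph state
rm -rf /var/lib/ceph/mon/* /var/lib/ceph/mgr/* /var/lib/ceph/osd/*
# the cluster-wide config lives on the pmxcfs (/etc/ceph/ceph.conf is normally a symlink to it);
# removing it affects all nodes, so only do this when tearing down the whole Ceph cluster
rm /etc/pve/ceph.conf
# let PVE clean up whatever is left
pveceph purge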
 
There was another user with the same issue; they were able to get rid of the ghost monitor by manually adding it back using the Ceph tools and then removing it using the PVE GUI. Here is the thread.
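
Roughly, the idea from that thread: put the ghost back into the monmap so the destroy path has something real to act on, then remove it cleanly through PVE. Using the name and address from the error message earlier in this thread (adjust to your setup), the manual part might look like:
Code:
# re-register the ghost mon in the monmap
ceph mon add pve-node4 10.10.10.104:6789
# then destroy it via the GUI, or on the CLI:
pveceph mon destroy pve-node4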
 
Any running Ceph service will keep 'pveceph purge' from removing configs. You don't need to re-install Ceph; once you have removed all Ceph services on all nodes and their directories (as well as ceph.conf), the Ceph cluster ceases to exist. Then you can create a new cluster.
Thanks. I managed to delete Ceph and reinstall it.
 
The systemd service is probably still enabled. Try
Code:
systemctl disable ceph-mon@pve-node4
on pve-node4


This solved my ghosting issue on my node after the PVE 5 to PVE 6 upgrade. Interestingly, this node was at one time a MON, but using the older pveceph tools to remove/destroy it probably didn't clean things up enough. Anyway, thanks for the posted solution. Cheers!
 
systemctl disable ceph-mon@pve-node4
also solved the problem for me after upgrading from 5 to 6
Thanks
 
I tried to destroy the Monitor and Manager on one node. I now have a ghost Monitor that gives both 'no such monitor id 'pve11' (500)' and 'monitor 'pve11' already exists (500)'. The Manager was destroyed without issue. I also created a new Monitor/Manager on my 4th node to keep the count at 3.

While digging around for answers, I noticed that none of the OSD configs had been updated to reflect the new monitors:
Code:
ceph config show osd.2
NAME                       VALUE                             SOURCE    OVERRIDES  IGNORES
auth_client_required       cephx                             file
auth_cluster_required      cephx                             file
auth_service_required      cephx                             file
cluster_network            10.10.4.11/24                     file
daemonize                  false                             override
keyring                    $osd_data/keyring                 default
leveldb_log                                                  default
mon_allow_pool_delete      true                              file
mon_host                   10.10.3.11 10.10.3.12 10.10.3.13  file
osd_pool_default_min_size  2                                 file
osd_pool_default_size      3                                 file
public_network             10.10.3.11/24                     file
rbd_default_features       61                                default
setgroup                   ceph                              cmdline
setuser                    ceph                              cmdline

I restarted the OSD from the GUI and the monitors were updated:
Code:
ceph config show osd.2
NAME                       VALUE                             SOURCE    OVERRIDES  IGNORES
auth_client_required       cephx                             file
auth_cluster_required      cephx                             file
auth_service_required      cephx                             file
cluster_network            10.10.4.11/24                     file
daemonize                  false                             override
keyring                    $osd_data/keyring                 default
leveldb_log                                                  default
mon_allow_pool_delete      true                              file
mon_host                   10.10.3.12 10.10.3.13 10.10.3.14  file
osd_pool_default_min_size  2                                 file
osd_pool_default_size      3                                 file
public_network             10.10.3.11/24                     file
rbd_default_features       61                                default
setgroup                   ceph                              cmdline
setuser                    ceph                              cmdline

I then restarted all the OSDs so they would pick up the updated config.
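
As a shortcut, restarting the per-node OSD target should restart every ceph-osd@<id> instance on that node in one go (done node by node, so redundancy is kept):
Code:
# restarts all OSD daemons on this node
systemctl restart ceph-osd.target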

I then tried rebooting the node with the ghost monitor to see if that would remove the ghost, but the ghost monitor is still present...
 
I then tried rebooting the node with the ghost monitor to see if that would remove the ghost, but the ghost monitor is still present...
Did you try to disable the systemd unit of the mon?
 
