[SOLVED] Ghost monitor in CEPH cluster

Whatever

Active Member
After an upgrade from 5.x to 6.x, one of the CEPH monitors became a "ghost",
with status "stopped" and address "unknown".
It can be neither started, recreated nor destroyed; the errors are below:
create: monitor address '10.10.10.104' already in use (500)
destroy: no such monitor id 'pve-node4' (500)

I deleted the "alive" mons, pools, OSDs and mgrs and tried to recreate everything from scratch, but the mon on pve-node4 still has the status described above.

One more thing to note: even though the PVE GUI shows 4 mons (3 active), there is only one monitor entry in ceph.conf:

Code:
[global]
     auth_client_required = cephx
     auth_cluster_required = cephx
     auth_service_required = cephx
     cluster_network = 10.10.10.0/24
     fsid = c2d639ef-c720-4c85-ac77-2763ecaa0a5e
     mon_allow_pool_delete = true
     mon_host = 10.10.10.101 10.10.10.102 10.10.10.103
     osd_journal_size = 5120
     osd_pool_default_min_size = 2
     osd_pool_default_size = 3
     public_network = 10.10.10.0/24

[client]
     keyring = /etc/pve/priv/$cluster.$name.keyring

[osd]
     osd_class_update_on_start = false
     osd_max_backfills = 2
     osd_memory_target = 2147483648

[mon.pve-node1]
     host = pve-node1
     mon_addr = 10.10.10.101:6789
Code:
root@pve-node4:~# ceph -s
  cluster:
    id:     c2d639ef-c720-4c85-ac77-2763ecaa0a5e
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum pve-node1,pve-node3,pve-node2 (age 2h)
    mgr: pve-node1(active, since 12h), standbys: pve-node2, pve-node3
    osd: 0 osds: 0 up, 0 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:
Any ideas on how to delete the mon entry on pve-node4 or reinstall it?
Thanks in advance
 

dcsapak

Proxmox Staff Member
The systemd service is probably still enabled; try
Code:
systemctl disable ceph-mon@pve-node4
on pve-node4
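If the unit keeps getting restarted or already shows as failed, it may also need to be stopped and its failed state cleared; roughly (standard systemctl, nothing Ceph-specific):
Code:
# stop the running/restarting mon unit, then keep it from starting at boot
systemctl stop ceph-mon@pve-node4.service
systemctl disable ceph-mon@pve-node4.service
# clear the 'failed' entry from the systemctl listing
systemctl reset-failed ceph-mon@pve-node4.service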
 

Whatever

Active Member
Yep, the systemd service was enabled, but disabling it changes nothing.

Ceph log on pve-node4 when starting the mon:
Code:
Oct 04 13:41:25 pve-node4 systemd[1]: Started Ceph cluster monitor daemon.
Oct 04 13:41:25 pve-node4 ceph-mon[436732]: 2019-10-04 13:41:25.495 7f5aed4ec440 -1 mon.pve-node4@-1(???) e14 not in monmap and have been in a quorum before; must have been removed
Oct 04 13:41:25 pve-node4 ceph-mon[436732]: 2019-10-04 13:41:25.495 7f5aed4ec440 -1 mon.pve-node4@-1(???) e14 commit suicide!
Oct 04 13:41:25 pve-node4 ceph-mon[436732]: 2019-10-04 13:41:25.495 7f5aed4ec440 -1 failed to initialize
Oct 04 13:41:25 pve-node4 systemd[1]: ceph-mon@pve-node4.service: Main process exited, code=exited, status=1/FAILURE
Oct 04 13:41:25 pve-node4 systemd[1]: ceph-mon@pve-node4.service: Failed with result 'exit-code'.
Oct 04 13:41:35 pve-node4 systemd[1]: ceph-mon@pve-node4.service: Service RestartSec=10s expired, scheduling restart.
Oct 04 13:41:35 pve-node4 systemd[1]: ceph-mon@pve-node4.service: Scheduled restart job, restart counter is at 6.
Oct 04 13:41:35 pve-node4 systemd[1]: Stopped Ceph cluster monitor daemon.
Any other suggestions?
 

Whatever

Active Member
Not sure if it's somehow related, but I don't have any OSDs in my cluster at the moment:

Code:
root@pve-node4:~# systemctl | grep ceph-
● ceph-mon@pve-node4.service                                                                                                     loaded failed     failed    Ceph cluster monitor daemon                                                  
● ceph-osd@32.service                                                                                                            loaded failed     failed    Ceph object storage daemon osd.32                                            
  ceph-mgr.target                                                                                                                loaded active     active    ceph target allowing to start/stop all ceph-mgr@.service instances at once   
  ceph-osd.target                                                                                                                loaded active     active
 

Whatever

Active Member
Nothing changes. :(
Code:
root@pve-node4:~# pveceph purge
detected running ceph services- unable to purge data
root@pve-node4:~# pveceph createmon
monitor 'pve-node4' already exists
root@pve-node4:~#
 

Alwin

Proxmox Staff Member
As @dcsapak said, disable the mon service and then remove the directory /var/lib/ceph/mon/pve-node4.
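Roughly, and assuming the default cluster name 'ceph' (in that case the mon data directory is usually /var/lib/ceph/mon/ceph-pve-node4, so check which path actually exists on disk):
Code:
# stop and disable the leftover mon unit
systemctl stop ceph-mon@pve-node4.service
systemctl disable ceph-mon@pve-node4.service
# remove the stale mon data directory (adjust the path to what actually exists)
rm -rf /var/lib/ceph/mon/ceph-pve-node4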
 

Whatever

Active Member
As @dcsapak said, disable the mon service and then remove the directory /var/lib/ceph/mon/pve-node4.
I did. I even deleted the whole /var/lib/ceph folder and all ceph*-related services in /etc/systemd/.. and rebooted that node,

but pveceph purge still says:
Code:
root@pve-node4:~# pveceph purge
detected running ceph services- unable to purge data
What does pveceph purge check for as "running ceph services"? How can I completely remove Ceph from the node and reinstall it?

Thanks
 

Alwin

Proxmox Staff Member
What does pveceph purge check for as "running ceph services"? How can I completely remove Ceph from the node and reinstall it?
Any running Ceph service will keep 'pveceph purge' from removing configs. You don't need to re-install Ceph; once you have removed all Ceph services on all nodes and their directories (as well as the ceph.conf), the Ceph cluster ceases to exist. Then you can create a new cluster.
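To see what is still running on a node and stop it before the purge, something along these lines should do (ceph.target is the umbrella target shipped with the Ceph packages):
Code:
# list any ceph units that are still loaded on this node
systemctl list-units 'ceph*'
# stop all Ceph daemons on this node in one go
systemctl stop ceph.target
# then retry the purge
pveceph purge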
 

Fabian_E

Proxmox Staff Member
There was another user with the same issue and they were able to get rid of the ghost monitor by manually adding it using the Ceph tools and then removing it via the PVE GUI. Here is the thread.
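A rough sketch of that approach, with the name and address from this thread substituted in (treat it as an outline, not exact commands):
Code:
# re-add the ghost mon to the monmap so the cluster knows about it again
ceph mon add pve-node4 10.10.10.104:6789
# then remove it cleanly, via the PVE GUI or with pveceph
pveceph mon destroy pve-node4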
 

Whatever

Active Member
Any running Ceph service will keep 'pveceph purge' from removing configs. You don't need to re-install Ceph; once you have removed all Ceph services on all nodes and their directories (as well as the ceph.conf), the Ceph cluster ceases to exist. Then you can create a new cluster.
Thanks. I managed to delete Ceph and reinstall it.
 

dmulk

Member
The systemd service is probably still enabled; try
Code:
systemctl disable ceph-mon@pve-node4
on pve-node4

This solved the ghosting issue on my node after the PVE 5 to PVE 6 upgrade. Interestingly, this node was at one time a MON, but using the older pveceph tools to remove/destroy it probably didn't clean things up enough. Anyway, thanks for the posted solution. Cheers!
<D>
 

sserrgio

New Member
systemctl disable ceph-mon@pve-node4
also solved the problem for me after upgrading from 5 to 6
Thanks
 

James Pass

Member
I tried to destroy the Monitor and Manager on one node. I now have a ghost Monitor that gives both 'no such monitor id 'pve11' (500)' and 'monitor 'pve11' already exists (500)'. The Manager was destroyed with no issue. I also created a new Monitor/Manager on my 4th node to keep the count at 3.

While digging around for answers, I noticed none of the OSD configs had been updated to reflect the new monitors:
Code:
ceph config show osd.2
NAME VALUE SOURCE OVERRIDES IGNORES
auth_client_required cephx file
auth_cluster_required cephx file
auth_service_required cephx file
cluster_network 10.10.4.11/24 file
daemonize false override
keyring $osd_data/keyring default
leveldb_log default
mon_allow_pool_delete true file
mon_host 10.10.3.11 10.10.3.12 10.10.3.13 file
osd_pool_default_min_size 2 file
osd_pool_default_size 3 file
public_network 10.10.3.11/24 file
rbd_default_features 61 default
setgroup ceph cmdline
setuser ceph cmdline

I restarted the OSD from the GUI and the monitors were updated:
Code:
ceph config show osd.2
NAME VALUE SOURCE OVERRIDES IGNORES
auth_client_required cephx file
auth_cluster_required cephx file
auth_service_required cephx file
cluster_network 10.10.4.11/24 file
daemonize false override
keyring $osd_data/keyring default
leveldb_log default
mon_allow_pool_delete true file
mon_host 10.10.3.12 10.10.3.13 10.10.3.14 file
osd_pool_default_min_size 2 file
osd_pool_default_size 3 file
public_network 10.10.3.11/24 file
rbd_default_features 61 default
setgroup ceph cmdline
setuser ceph cmdline

I then restarted all the OSDs to update their config.
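For reference, the CLI equivalent per node would be roughly:
Code:
# restart all OSD daemons on this node so they pick up the new mon_host list
systemctl restart ceph-osd.target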

I then tried rebooting the node with the ghost monitor to see if that would remove the ghost, but the ghost monitor is still present...
 

Alwin

Proxmox Staff Member
I then tried rebooting the node with the ghost monitor to see if that would remove the ghost, but the ghost monitor is still present...
Did you try to disable the systemd unit of the mon?
 

Alwin

Proxmox Staff Member
Do you have a reference for implementing this?
systemctl disable ceph-mon@<node-name>.service; replace the node name with the one you want to disable.
 
