[SOLVED] Ghost monitor in CEPH cluster

Whatever

Active Member
After an upgrade from 5.x to 6.x, one of the CEPH monitors became a "ghost",
with status "stopped" and address "unknown".
It can be neither started, recreated nor destroyed; the errors are below:
create: monitor address '10.10.10.104' already in use (500)
destroy: no such monitor id 'pve-node4' (500)

I deleted the "alive" mons, pools, OSDs and mgrs and tried to recreate everything from scratch, but the mon on pve-node4 still has the status described above.

One more thing to note: even though the PVE GUI shows 4 mons (3 active), there is only one monitor entry in ceph.conf:

Code:
[global]
     auth_client_required = cephx
     auth_cluster_required = cephx
     auth_service_required = cephx
     cluster_network = 10.10.10.0/24
     fsid = c2d639ef-c720-4c85-ac77-2763ecaa0a5e
     mon_allow_pool_delete = true
     mon_host = 10.10.10.101 10.10.10.102 10.10.10.103
     osd_journal_size = 5120
     osd_pool_default_min_size = 2
     osd_pool_default_size = 3
     public_network = 10.10.10.0/24

[client]
     keyring = /etc/pve/priv/$cluster.$name.keyring

[osd]
     osd_class_update_on_start = false
     osd_max_backfills = 2
     osd_memory_target = 2147483648

[mon.pve-node1]
     host = pve-node1
     mon_addr = 10.10.10.101:6789
Code:
root@pve-node4:~# ceph -s
  cluster:
    id:     c2d639ef-c720-4c85-ac77-2763ecaa0a5e
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum pve-node1,pve-node3,pve-node2 (age 2h)
    mgr: pve-node1(active, since 12h), standbys: pve-node2, pve-node3
    osd: 0 osds: 0 up, 0 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:
Any ideas on how to delete the mon entry on pve-node4 or reinstall it?
Thanks in advance
 

dcsapak

Proxmox Staff Member
The systemd service is probably still enabled; try
Code:
systemctl disable ceph-mon@pve-node4
on pve-node4
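If the unit keeps getting restarted or already shows as failed, it may also need to be stopped and its failed state cleared; roughly (standard systemctl, nothing Ceph-specific):
Code:
# stop the running/restarting mon unit, then keep it from starting at boot
systemctl stop ceph-mon@pve-node4.service
systemctl disable ceph-mon@pve-node4.service
# clear the 'failed' entry from the systemctl listing
systemctl reset-failed ceph-mon@pve-node4.service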
 

Whatever

Active Member
Yep, the systemd service was enabled, but disabling it changes nothing.

Ceph log on pve-node4 when starting the mon:
Code:
Oct 04 13:41:25 pve-node4 systemd[1]: Started Ceph cluster monitor daemon.
Oct 04 13:41:25 pve-node4 ceph-mon[436732]: 2019-10-04 13:41:25.495 7f5aed4ec440 -1 mon.pve-node4@-1(???) e14 not in monmap and have been in a quorum before; must have been removed
Oct 04 13:41:25 pve-node4 ceph-mon[436732]: 2019-10-04 13:41:25.495 7f5aed4ec440 -1 mon.pve-node4@-1(???) e14 commit suicide!
Oct 04 13:41:25 pve-node4 ceph-mon[436732]: 2019-10-04 13:41:25.495 7f5aed4ec440 -1 failed to initialize
Oct 04 13:41:25 pve-node4 systemd[1]: ceph-mon@pve-node4.service: Main process exited, code=exited, status=1/FAILURE
Oct 04 13:41:25 pve-node4 systemd[1]: ceph-mon@pve-node4.service: Failed with result 'exit-code'.
Oct 04 13:41:35 pve-node4 systemd[1]: ceph-mon@pve-node4.service: Service RestartSec=10s expired, scheduling restart.
Oct 04 13:41:35 pve-node4 systemd[1]: ceph-mon@pve-node4.service: Scheduled restart job, restart counter is at 6.
Oct 04 13:41:35 pve-node4 systemd[1]: Stopped Ceph cluster monitor daemon.
Any other suggestions?
 

Whatever

Active Member
Not sure if it's somehow related, but I don't have any OSDs in my cluster at the moment:

Code:
root@pve-node4:~# systemctl | grep ceph-
● ceph-mon@pve-node4.service                                                                                                     loaded failed     failed    Ceph cluster monitor daemon                                                  
● ceph-osd@32.service                                                                                                            loaded failed     failed    Ceph object storage daemon osd.32                                            
  ceph-mgr.target                                                                                                                loaded active     active    ceph target allowing to start/stop all ceph-mgr@.service instances at once   
  ceph-osd.target                                                                                                                loaded active     active
 

Whatever

Active Member
Nothing changes. :(
Code:
root@pve-node4:~# pveceph purge
detected running ceph services- unable to purge data
root@pve-node4:~# pveceph createmon
monitor 'pve-node4' already exists
root@pve-node4:~#
 

Alwin

Proxmox Staff Member
As @dcsapak said, disable the mon service and then remove the directory /var/lib/ceph/mon/pve-node4.
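Roughly, and assuming the default cluster name 'ceph' (in that case the mon data directory is usually /var/lib/ceph/mon/ceph-pve-node4, so check which path actually exists on disk):
Code:
# stop and disable the leftover mon unit
systemctl stop ceph-mon@pve-node4.service
systemctl disable ceph-mon@pve-node4.service
# remove the stale mon data directory (adjust the path to what actually exists)
rm -rf /var/lib/ceph/mon/ceph-pve-node4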
 

Whatever

Active Member
As @dcsapak said, disable the mon service and then remove the directory /var/lib/ceph/mon/pve-node4.
I did. I even deleted the whole /var/lib/ceph folder and all ceph*-related services in /etc/systemd/.. and rebooted that node,

but pveceph purge still says:
Code:
root@pve-node4:~# pveceph purge
detected running ceph services- unable to purge data
What does pveceph purge check for as "running ceph services"? How can I completely remove Ceph from the node and reinstall it?

Thanks
 

Alwin

Proxmox Staff Member
What does pveceph purge check for as "running ceph services"? How can I completely remove Ceph from the node and reinstall it?
Any running Ceph service will keep 'pveceph purge' from removing configs. You don't need to re-install Ceph; once you have removed all Ceph services on all nodes and their directories (as well as the ceph.conf), the Ceph cluster ceases to exist. Then you can create a new cluster.
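To see what is still running on a node and stop it before the purge, something along these lines should do (ceph.target is the umbrella target shipped with the Ceph packages):
Code:
# list any ceph units that are still loaded on this node
systemctl list-units 'ceph*'
# stop all Ceph daemons on this node in one go
systemctl stop ceph.target
# then retry the purge
pveceph purge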
 

Fabian_E

Proxmox Staff Member
There was another user with the same issue and they were able to get rid of the ghost monitor by manually adding it using the Ceph tools and then removing it via the PVE GUI. Here is the thread.
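A rough sketch of that approach, with the name and address from this thread substituted in (treat it as an outline, not exact commands):
Code:
# re-add the ghost mon to the monmap so the cluster knows about it again
ceph mon add pve-node4 10.10.10.104:6789
# then remove it cleanly, via the PVE GUI or with pveceph
pveceph mon destroy pve-node4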
 

Whatever

Active Member
Any running Ceph service will keep 'pveceph purge' from removing configs. You don't need to re-install Ceph; once you have removed all Ceph services on all nodes and their directories (as well as the ceph.conf), the Ceph cluster ceases to exist. Then you can create a new cluster.
Thanks. I managed to delete Ceph and reinstall it.
 

dmulk

Member
The systemd service is probably still enabled; try
Code:
systemctl disable ceph-mon@pve-node4
on pve-node4

This solved the ghosting issue on my node after the PVE 5 to PVE 6 upgrade. Interestingly, this node was at one time a MON, but using the older pveceph tools to remove/destroy it probably didn't clean things up enough. Anyway, thanks for the posted solution. Cheers!
<D>
 

sserrgio

New Member
systemctl disable ceph-mon@pve-node4
also solved the problem for me after upgrading from 5 to 6
Thanks
 

James Pass

Member
I tried to destroy the Monitor and Manager on one node. I now have a ghost Monitor that gives both 'no such monitor id 'pve11' (500)' and 'monitor 'pve11' already exists (500)'. The Manager was destroyed with no issue. I also created a new Monitor/Manager on my 4th node to keep the count at 3.

While digging around for answers, I noticed none of the OSD configs had been updated to reflect the new monitors:
Code:
ceph config show osd.2
NAME VALUE SOURCE OVERRIDES IGNORES
auth_client_required cephx file
auth_cluster_required cephx file
auth_service_required cephx file
cluster_network 10.10.4.11/24 file
daemonize false override
keyring $osd_data/keyring default
leveldb_log default
mon_allow_pool_delete true file
mon_host 10.10.3.11 10.10.3.12 10.10.3.13 file
osd_pool_default_min_size 2 file
osd_pool_default_size 3 file
public_network 10.10.3.11/24 file
rbd_default_features 61 default
setgroup ceph cmdline
setuser ceph cmdline

I restarted the OSD from the GUI and the monitors were updated:
Code:
ceph config show osd.2
NAME VALUE SOURCE OVERRIDES IGNORES
auth_client_required cephx file
auth_cluster_required cephx file
auth_service_required cephx file
cluster_network 10.10.4.11/24 file
daemonize false override
keyring $osd_data/keyring default
leveldb_log default
mon_allow_pool_delete true file
mon_host 10.10.3.12 10.10.3.13 10.10.3.14 file
osd_pool_default_min_size 2 file
osd_pool_default_size 3 file
public_network 10.10.3.11/24 file
rbd_default_features 61 default
setgroup ceph cmdline
setuser ceph cmdline

I then restarted all the OSDs to update their config.
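For reference, the CLI equivalent per node would be roughly:
Code:
# restart all OSD daemons on this node so they pick up the new mon_host list
systemctl restart ceph-osd.target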

I then tried rebooting the node with the ghost monitor to see if that would remove the ghost, but the ghost monitor is still present...
 

Alwin

Proxmox Staff Member
I then tried rebooting the node with the ghost monitor to see if that would remove the ghost, but the ghost monitor is still present...
Did you try to disable the systemd unit of the mon?
 

Alwin

Proxmox Staff Member
Do you have a reference for implementing this?
systemctl disable ceph-mon@<node-name>.service; replace the node name with the one you want to disable.
 
