[SOLVED] I managed to create a ghost ceph monitor

icemox · Sep 26, 2019

During the upgrade from Luminous to Nautilus I somehow managed to create a ghost Ceph monitor. I have a four-host cluster, but only three of the hosts are Ceph monitors. Probably because I copy-pasted the Ceph monitor upgrade commands, the node that didn't have a monitor now appears in the Proxmox GUI to have one.

Let's call this node problemhost

Every command tells me that monitor does not exist:

Code:

# pveceph status|grep problemhost

is empty

Bash:

ceph mon rm problemhost
mon.problemhost does not exist or has already been removed

ceph.conf has nothing about this node either. /var/lib/ceph/mon is empty on that host.

But in the GUI I have a line:
Name: mon.problemhost
Host: problemhost
Status: stopped
Address: Unknown

If I click "destroy" I get no such monitor id 'problemhost' (500)

For kicks I tried to create it: monitor 'problemhost' already exists (500)

I tried rebooting all nodes and it's still there. Where is the GUI getting this information?

fiona · Sep 30, 2019

Hi,
could it be that there are systemd service files present for the 'problemhost', i.e. is there a file under '/etc/systemd/system/ceph-mon.target.wants/'? Please also run

Code:

systemctl status ceph-mon@<problemhost>.service

on 'problemhost'.
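The places named in this thread (the systemd `wants` directory, `/var/lib/ceph/mon` and `ceph.conf`) can be checked in one go. The following is only a minimal sketch: `find_mon_traces` and its optional root-prefix argument are hypothetical helpers, and it assumes the default cluster name `ceph` and the default paths. It does not claim to cover every source the GUI reads.

```shell
#!/bin/sh
# Sketch: list leftovers of a monitor ID on this host.
# find_mon_traces is a hypothetical helper, not part of pveceph or ceph.
find_mon_traces() {
    mon_id=$1
    root=${2:-}   # optional path prefix, handy for checking a copy of the tree

    unit="$root/etc/systemd/system/ceph-mon.target.wants/ceph-mon@$mon_id.service"
    data="$root/var/lib/ceph/mon/ceph-$mon_id"
    conf="$root/etc/ceph/ceph.conf"

    # -e is false for a dangling symlink, so test -L as well
    if [ -e "$unit" ] || [ -L "$unit" ]; then
        echo "unit: $unit"
    fi
    if [ -d "$data" ]; then
        echo "data: $data"
    fi
    # fixed-string match on the section header, e.g. [mon.problemhost]
    if [ -f "$conf" ] && grep -qF "[mon.$mon_id]" "$conf"; then
        echo "conf: [mon.$mon_id] in $conf"
    fi
    return 0
}
```

Running `find_mon_traces problemhost` on the affected node prints one line per leftover it finds and prints nothing if all three places are clean.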

icemox · Sep 30, 2019

Tried it, didn't work. What did work just now was to manually create the monitor using the instructions from here: https://docs.ceph.com/docs/master/rados/operations/add-or-rm-mons/ and then remove it from the GUI. pveceph didn't want to create it (because it exists) and didn't want to remove it (because it didn't exist), making it Schrödinger's mon, I guess.

Thanks !

psionic · Jan 13, 2020

Fabian_E said:
Hi,
could it be that there are systemd service files present for the 'problemhost', i.e. is there a file under '/etc/systemd/system/ceph-mon.target.wants/'? Please also run

Code:

systemctl status ceph-mon@<problemhost>.service

on 'problemhost'.

ls /etc/systemd/system/ceph-mon.target.wants/
ls: cannot access '/etc/systemd/system/ceph-mon.target.wants/': No such file or directory

systemctl status ceph-mon@pve11.service
● ceph-mon@pve11.service - Ceph cluster monitor daemon
Loaded: loaded (/lib/systemd/system/ceph-mon@.service; disabled; vendor preset: enabled)
Drop-In: /usr/lib/systemd/system/ceph-mon@.service.d
└─ceph-after-pve-cluster.conf
Active: inactive (dead)

Pourya Mehdinejad · Mar 17, 2020

Fabian_E said:
Hi,
could it be that there are systemd service files present for the 'problemhost', i.e. is there a file under '/etc/systemd/system/ceph-mon.target.wants/'? Please also run

Code:

systemctl status ceph-mon@<problemhost>.service

on 'problemhost'.

I have the same issue, and the service file exists at this path. What should I do, remove it?

/etc/systemd/system/ceph-mon.target.wants

icemox · Mar 17, 2020

Sorry I don't really remember and I gave up on ceph shortly afterwards.

Alwin · Mar 17, 2020

Maybe this thread may help.
https://forum.proxmox.com/threads/ghost-monitor-in-ceph-cluster.58683/

BurningHero · Apr 17, 2020

Solution that worked for my cluster:

Code:

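# extract the monmap from the monitor's store
ceph-mon -i pxc05 --extract-monmap tmp/map
# create the parent directory for monitor data (-p: no error if it already exists)
mkdir -p /var/lib/ceph/mon
# remove the leftover systemd unit link
rm /etc/systemd/system/ceph-mon.target.wants/ceph-mon@pxc05.service
# create the monitor again through Proxmox
pveceph createmon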

rekahsoft · Oct 23, 2020

I saw this behavior after a node crashed and needed to be completely replaced (somewhat of a long story that is not fully relevant). When bringing up the new proxmox replacement node, I noticed the following:

The systemd service for the node's monitor was enabled and running, but intermittently crashing
The impacted monitor was not listed by `ceph mon_status`
The mon keyring and monitor map files were in place in `/var/lib/ceph/mon/ceph-<node>`
The proxmox ui showed the monitor ip (no port), and had empty `version` and `quorum` columns
When attempting to destroy the impacted monitor, it would fail with `destroy : no such monitor id 'pve-<node>' (500)` (both cli and ui)

The way I resolved this was by creating the monitor with ceph, then removing and re-adding it with pveceph.

fxandrei · Apr 8, 2021

rekahsoft said:
I saw this behavior after a node crashed and needed to be completely replaced (somewhat of a long story that is not fully relevant). When bringing up the new proxmox replacement node, I noticed the following:

The systemd service for the nodes monitor was enabled and running but intermittently crashing

The impacted monitor was not listed by `ceph mon_status`

The mon keyring and monitor map files were in place in `/var/lib/ceph/mon/ceph-<node>`

The proxmox ui showed the monitor ip (no port), and had empty `version` and `quorum` columns

When attempting to destroy the impacted monitor, it would fail with `destroy : no such monitor id 'pve-<node>' (500)` (both cli and ui)

The way I resolved this was by creating the monitor with ceph, then removing and re-adding it with pveceph.

I'm having a similar problem. Could you go into detail about what you actually did?
How did you go about creating the monitor with ceph if the service was already created and running?

Jospeh Huber · May 11, 2021

rekahsoft said:
I saw this behavior after a node crashed and needed to be completely replaced (somewhat of a long story that is not fully relevant). When bringing up the new proxmox replacement node, I noticed the following:

The systemd service for the nodes monitor was enabled and running but intermittently crashing

The impacted monitor was not listed by `ceph mon_status`

The mon keyring and monitor map files were in place in `/var/lib/ceph/mon/ceph-<node>`

The proxmox ui showed the monitor ip (no port), and had empty `version` and `quorum` columns

When attempting to destroy the impacted monitor, it would fail with `destroy : no such monitor id 'pve-<node>' (500)` (both cli and ui)

The way I resolved this was by creating the monitor with ceph, then removing and re-adding it with pveceph.

I had the same problem today, but manually creating the monitor with the built-in Ceph commands did not help.
It is described here https://docs.ceph.com/en/latest/rados/operations/add-or-rm-mons/#adding-a-monitor-manual ... but in my case the result was the same as creating the monitor with the Proxmox tools.
After several hours I want to share my solution. I hope it helps others to save some time ;-)

The short version: the solution was to inject a monmap into the stale monitor:
https://docs.ceph.com/en/latest/rad...ing-mon/#recovering-a-monitor-s-broken-monmap

The long version:
1.) Remove all of the stale monitor's components and data manually:
pveceph mon destroy ceph-mon@myhost ... results in the "no such monitor id" error described above!

Code:

# stop and disable the service
service ceph-mon@myhost stop
systemctl disable ceph-mon@myhost
systemctl daemon-reload
# delete the data directory of the monitor
rm -r /var/lib/ceph/mon/ceph-myhost
# adjust /etc/ceph/ceph.conf manually:
# delete the IP of the stale monitor from mon_host
mon_host = xx.0.99.83 xx.0.99.82 xx.0.99.84
=> mon_host = xx.0.99.83 xx.0.99.82
# delete the stale monitor's section
[mon.myhost]
         public_addr = xx.0.99.84
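The manual `ceph.conf` edit above can also be sketched as a filter that writes the cleaned file to stdout, so the result can be compared with the original before anything is replaced. This is only an illustration: `strip_mon_from_conf` is a hypothetical helper, and it assumes `mon_host` is a plain list of IPs separated by spaces or commas (as in the example above), not the bracketed `v2:`/`v1:` address form.

```shell
#!/bin/sh
# Sketch: print a ceph.conf without the stale monitor's IP and section.
# strip_mon_from_conf is a hypothetical helper; it never modifies the input file.
strip_mon_from_conf() {
    conf=$1
    mon_id=$2
    mon_ip=$3
    awk -v sect="[mon.$mon_id]" -v ip="$mon_ip" '
        # Section header: skip from here on only if it is the stale monitor
        /^[ \t]*\[/ {
            hdr = $0
            gsub(/^[ \t]+|[ \t]+$/, "", hdr)
            skip = (hdr == sect)
        }
        skip { next }
        # mon_host line: rebuild the list without the stale IP
        /^[ \t]*mon_host[ \t]*=/ {
            eq = index($0, "=")
            key = substr($0, 1, eq)
            n = split(substr($0, eq + 1), parts, "[ \t,]+")
            out = ""
            for (i = 1; i <= n; i++)
                if (parts[i] != "" && parts[i] != ip)
                    out = out " " parts[i]
            print key out
            next
        }
        { print }
    ' "$conf"
}
```

For example, `strip_mon_from_conf /etc/ceph/ceph.conf myhost xx.0.99.84 > /tmp/ceph.conf.new` followed by `diff /etc/ceph/ceph.conf /tmp/ceph.conf.new` shows exactly what would change.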

2.)
After that I was able to add the monitor again with gui or cmdline

Code:

# on the missing monitor host
pveceph createmon

But the monitor could not join the cluster. I tried it again and again; the log (/var/log/ceph/ceph-mon@myhost.log) showed:

Code:

2021-05-11 16:37:47.465 7fdf7c47d700 -1 mon.myhost@-1(probing) e13 get_health_metrics reporting 11 slow ops, oldest is log(1 entries from seq 1 at 2021-05-11 16:33:35.201786)
2021-05-11 16:37:50.321 7fdf7e481700  1 mon.myhost@-1(probing) e13 handle_auth_request failed to assign global_id
2021-05-11 16:37:50.373 7fdf7e481700  1 mon.myhost@-1(probing) e13 handle_auth_request failed to assign global_id
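To see at a glance whether a monitor is still stuck, the state in parentheses can be pulled out of the log. A minimal sketch, relying only on the `mon.<id>@<rank>(<state>)` pattern visible in the lines above; `mon_last_state` is a hypothetical helper name, and the monitor ID is used inside a regular expression, so IDs containing regex metacharacters are matched loosely.

```shell
#!/bin/sh
# Sketch: print the most recent state a monitor reported in its log,
# e.g. "probing" or "synchronizing". mon_last_state is a hypothetical helper.
mon_last_state() {
    mon_id=$1
    logfile=$2
    # keep only the word between the parentheses after mon.<id>@<rank>
    sed -n "s/.* mon\.$mon_id@-\{0,1\}[0-9][0-9]*(\([a-z]*\)) .*/\1/p" "$logfile" | tail -n 1
}
```

Called as `mon_last_state myhost <logfile>`, it prints the state from the last matching log line, which is `probing` for the excerpt above.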

A nice feature is that you can query the admin socket of the stale monitor and get the monitor status directly from the monitor itself, which helps to understand what is going on:

Code:

# get the admin socket path
ceph-conf --name mon.myhost --show-config-value admin_socket
# list the commands the monitor supports
ceph --admin-daemon /var/run/ceph/ceph-mon.myhost.asok help
# query the monitor status
ceph --admin-daemon /var/run/ceph/ceph-mon.myhost.asok mon_status

Details can be found here https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/#understanding-mon-status

3.)
The only solution that worked for me to bring the monitor back into the cluster was to "inject a monmap into the monitor" as described here https://docs.ceph.com/en/latest/rad...ing-mon/#recovering-a-monitor-s-broken-monmap
plus a manual permission fix!

Code:

# log in to a working monitor host, stop the service and extract the map
service ceph-mon@myworkinghost stop
ceph-mon -i myworkinghost  --extract-monmap /tmp/monmap
2021-05-11 16:44:04.119 7f6087bd6400 -1 wrote monmap to /tmp/monmap
# start the service again and transfer the map to the stale host
service ceph-mon@myworkinghost start
scp /tmp/monmap mystalehost:/tmp/monmap

# on the stale monitor host, stop the monitor and inject the map
service ceph-mon@mystalehost stop
ceph-mon -i mystalehost --inject-monmap /tmp/monmap
# afterwards the ownership must be set manually for the user "ceph" (there were "permission denied" errors because some files belonged to root after the injection)
chown ceph:ceph /var/lib/ceph/mon/ceph-mystalehost/store.db/*
# and start again
service ceph-mon@mystalehost start
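Before and after the `chown`, it can help to see which files in the monitor's data directory actually have the wrong owner. A minimal sketch: `list_foreign_files` is a hypothetical helper name, and `ceph` as the default owner simply follows the `chown` in the steps above.

```shell
#!/bin/sh
# Sketch: list everything under a directory that is NOT owned by the expected user.
# list_foreign_files is a hypothetical helper; it only reads, it changes nothing.
list_foreign_files() {
    dir=$1
    owner=${2:-ceph}   # user name or numeric UID
    find "$dir" ! -user "$owner" -print
}
```

On the stale monitor host, `list_foreign_files /var/lib/ceph/mon/ceph-mystalehost` should print nothing once the ownership is fixed; any path it still prints is a candidate for the "permission denied" errors mentioned above.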

rekahsoft · Jun 19, 2021

I hit this again the other day, which resulted in me regretting not writing down clearly what I did to resolve it.

To reproduce this issue:

1. Add a new Proxmox node, but before adding it to the cluster, install and configure Ceph on it (the key part here is that you completed the configuration)
2. Add this new node to the cluster; this results in the monitor for the node becoming a ghost as described in this post

The ghost monitor in this case has never joined the cluster, so just needs to be removed. To do this I used:

Bash:

systemctl stop ceph-mon@<impacted-mon>
systemctl disable ceph-mon@<impacted-mon>
systemctl daemon-reload
systemctl reset-failed
pveceph mon destroy <impacted-mon>

mr44er · Dec 22, 2022

Jospeh Huber said:
The long version:
1.) Remove all of the stale monitors components and data manually:
pveceph mon destroy ceph-mon@myhost ... results in an error described above "no such monitor id "!

Code:

# stop and disable and remove service
service ceph-mon@myhost stop
systemctl disable ceph-mon@myhost
systemctl daemon-reload
# delete the datadir of the monitor
rm -r /var/lib/ceph/mon/ceph-myhost
# adjust /etc/ceph/ceph.conf manually
# delete IP of stale monitor
mon_host = xx.0.99.83 xx.0.99.82 xx.0.99.84
=> mon_host = xx.0.99.83 xx.0.99.82
# delete the section
[mon.myhost]
         public_addr = xx.0.99.84

2.)
After that I was able to add the monitor again with gui or cmdline

Code:

# on the missing monitor host
pveceph createmon

But the monitor could not join the cluster. I tried it again and again; the log (/var/log/ceph/ceph-mon@myhost.log) showed:

Code:

2021-05-11 16:37:47.465 7fdf7c47d700 -1 mon.myhost@-1(probing) e13 get_health_metrics reporting 11 slow ops, oldest is log(1 entries from seq 1 at 2021-05-11 16:33:35.201786)
2021-05-11 16:37:50.321 7fdf7e481700 1 mon.myhost@-1(probing) e13 handle_auth_request failed to assign global_id
2021-05-11 16:37:50.373 7fdf7e481700 1 mon.myhost@-1(probing) e13 handle_auth_request failed to assign global_id

This should be sticky or in the manual!
I had this problem also 1:1 with three monitors.

My solution was to follow the steps from @Jospeh Huber, then re-add the monitors via the GUI. As expected, the monitor stayed off. I also tried the injection; it did not work.
So I did all the steps again (but not the injection!) and double-checked that none of the mons were visible in the GUI anymore. Then I shut down the whole cluster and booted one node after another. After that everything was fresh, and recreating the monitors just worked as usual.
Quorum for all mons changed to "no" for some seconds, then everything went back to normal (attachment).

The interesting part is: what happened initially? I just deleted some monitors and re-added them via the GUI, and gave it plenty of time to settle down after every step. With some nodes it worked; three of them got the problem after that. Corosync too slow to sync the config? I don't know...

thomas.weingart · Feb 23, 2023

Hello,

I have the same problem with a monitor. I have followed Joseph Huber's steps including the injection. Unfortunately it did not help: the monitor service is displayed as "stopped" and cannot be started. Neither the monitor status nor ceph mon dump shows the monitor. What else can I do?

proxymax · Aug 2, 2023

Jospeh Huber said:
I had the same problem today, but a manual creation of the monitor with the ceph built in commands did not help.
It is described here https://docs.ceph.com/en/latest/rados/operations/add-or-rm-mons/#adding-a-monitor-manual ... but in my case the result was the same as creating the monitor with the proxmox tools.
After several hours I want to share my solution. I hope it helps others to save some time ;-)

The short version: the solution was to inject a monmap into the stale monitor:
https://docs.ceph.com/en/latest/rad...ing-mon/#recovering-a-monitor-s-broken-monmap

The long version:
1.) Remove all of the stale monitors components and data manually:
pveceph mon destroy ceph-mon@myhost ... results in an error described above "no such monitor id "!

Code:

# stop and disable and remove service
service ceph-mon@myhost stop
systemctl disable ceph-mon@myhost
systemctl daemon-reload
# delete the datadir of the monitor
rm -r /var/lib/ceph/mon/ceph-myhost
# adjust /etc/ceph/ceph.conf manually
# delete IP of stale monitor
mon_host = xx.0.99.83 xx.0.99.82 xx.0.99.84
=> mon_host = xx.0.99.83 xx.0.99.82
# delete the section
[mon.myhost]
         public_addr = xx.0.99.84

2.)
After that I was able to add the monitor again with gui or cmdline

Code:

# on the missing monitor host
pveceph createmon

But the monitor could not join the cluster. I tried it again and again; the log (/var/log/ceph/ceph-mon@myhost.log) showed:

Code:

2021-05-11 16:37:47.465 7fdf7c47d700 -1 mon.myhost@-1(probing) e13 get_health_metrics reporting 11 slow ops, oldest is log(1 entries from seq 1 at 2021-05-11 16:33:35.201786)
2021-05-11 16:37:50.321 7fdf7e481700 1 mon.myhost@-1(probing) e13 handle_auth_request failed to assign global_id
2021-05-11 16:37:50.373 7fdf7e481700 1 mon.myhost@-1(probing) e13 handle_auth_request failed to assign global_id

A nice feature is that you can query the admin socket of the stale monitor and get the monitor status directly from the monitor itself, which helps to understand what is going on:

Code:

# get the admin socket path
ceph-conf --name mon.myhost --show-config-value admin_socket
# list the commands the monitor supports
ceph --admin-daemon /var/run/ceph/ceph-mon.myhost.asok help
# query the monitor status
ceph --admin-daemon /var/run/ceph/ceph-mon.myhost.asok mon_status

Details can be found here https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/#understanding-mon-status

3.)
The only working solution for me to bring the monitor back in the cluster again was to "Inject a monmap into the monitor" as described here https://docs.ceph.com/en/latest/rad...ing-mon/#recovering-a-monitor-s-broken-monmap
With a manual permission fix!

Code:

# log in to a working monitor host, stop the service and extract the map
service ceph-mon@myworkinghost stop
ceph-mon -i myworkinghost --extract-monmap /tmp/monmap
2021-05-11 16:44:04.119 7f6087bd6400 -1 wrote monmap to /tmp/monmap
# start the service again and transfer the map to the stale host
service ceph-mon@myworkinghost start
scp /tmp/monmap mystalehost:/tmp/monmap
# on the stale monitor host, stop the monitor and inject the map
service ceph-mon@mystalehost stop
ceph-mon -i mystalehost --inject-monmap /tmp/monmap
# afterwards the ownership must be set manually for the user "ceph" (there were "permission denied" errors because some files belonged to root after the injection)
chown ceph:ceph /var/lib/ceph/mon/ceph-mystalehost/store.db/*
# and start again
service ceph-mon@mystalehost start

I wasted hours trying to resolve the ghost mon and nothing worked except these steps. Almost gave up until I tried this one last time.

aaxvig · Feb 24, 2024

rekahsoft said:
I hit this again the other day, which resulted in me regretting not writing down clearly what I did to resolve it.

To reproduce this issue:

1. Add a new proxmox node, but before adding it to the cluster, install and configure ceph (the key part here is that you completed the configuration)
2. Add this new node to the cluster; this results in the monitor for the node becoming a ghost as described in this post

The ghost monitor in this case has never joined the cluster, so just needs to be removed. To do this I used:

Bash:

systemctl stop ceph-mon@<impacted-mon>
systemctl disable ceph-mon@<impacted-mon>
systemctl daemon-reload
systemctl reset-failed
pveceph mon destroy <impacted-mon>

This was exactly how I got in the situation. Your solution almost worked fine. I couldn't figure out what <impacted-mon> was in the final line. It was autocompleting as proxmox5.service for the first two lines but that wasn't working for the last one.

I ended up with an alternate final step. Instead I removed the folder in /var/lib/ceph/mon/: rm -r /var/lib/ceph/mon/ceph-proxmox5.
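One possible reading of the mismatch: `systemctl` completes unit names (with the `.service` suffix), while `pveceph mon destroy` takes a bare monitor ID. Assuming the ID is the part after `ceph-` in the data directory name (`proxmox5` here), it can be derived from the unit name as sketched below. `mon_id_from_unit` is a hypothetical helper, and whether the destroy call would then have succeeded in this case is not known.

```shell
#!/bin/sh
# Sketch: derive the bare monitor ID from a systemd unit name.
# mon_id_from_unit is a hypothetical helper for illustration only.
mon_id_from_unit() {
    unit=$1
    id=${unit#ceph-mon@}   # drop the template prefix, if present
    id=${id%.service}      # drop the unit suffix, if present
    printf '%s\n' "$id"
}
```

With this, both `mon_id_from_unit ceph-mon@proxmox5.service` and `mon_id_from_unit proxmox5.service` print `proxmox5`, which is the value that would be passed to `pveceph mon destroy` under the assumption above.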

grin · Sep 22, 2024

Just adding a different story:

Host was offline for a while, came up again, and mon was stuck at "handle_auth_request failed to assign global_id".
I went out reading, and when I was back (after 15 minutes) the mon was in again. It tried probing for about 10 minutes, then started synchronising by itself:

Code:

2024-09-22T00:23:03.662+0200 7dcb5b2006c0  1 mon.h0@1(synchronizing) e7 sync_obtain_latest_monmap
2024-09-22T00:23:03.663+0200 7dcb5b2006c0  1 mon.h0@1(synchronizing) e7 sync_obtain_latest_monmap obtained monmap e7

Which took about a minute, then (with the new, up to date monmap) it started normally.
(This was 17.2.7 quincy)

It would probably have been faster to inject the current monmap manually.
