Ceph Monitor stopped

kahless2k

Jul 14, 2022
Hi All,

We have a 3 node ceph cluster which handles storage for 5 hypervisor nodes.

We are in the process of upgrading to Proxmox 7.4 and have all of the hypervisors and two of the storage nodes now running 7.4 (and ceph 16.2.11); the upgrade was from 7.2 and ceph 16.2.9.

We have halted our upgrade because on our second storage node (PVEMTS2), the Ceph monitor is displayed as stopped. The two other monitors are working and the cluster is fully operational - we just obviously don't want to leave only two monitors.

We tried to destroy the failed monitor but received an error that it doesn't exist.

We then stopped the service on that node, disabled it and removed the directory from /var/lib/ceph/mon; we also removed the mon entry from ceph.conf and removed the IP from the mon_host line in ceph.conf.

This removed the monitor from the cluster.

We then created a new monitor using pveceph mon create.
This updated ceph.conf and started the service, so things should have been good.

However, on the proxmox ceph monitor screen the service is displayed as stopped.

ceph -s gives us

  services:
    mon: 2 daemons, quorum PVEMTS1,PVEMTS3 (age 9h)
    mgr: PVEMTS3(active, since 4w), standbys: PVEMTS1, PVEMTS2
    mds: 1/1 daemons up, 1 standby
    osd: 15 osds: 15 up (since 40m), 15 in (since 18h)



systemctl status ceph-mon@PVEMTS2 gives us

● ceph-mon@PVEMTS2.service - Ceph cluster monitor daemon
     Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: enabled)
    Drop-In: /usr/lib/systemd/system/ceph-mon@.service.d
             └─ceph-after-pve-cluster.conf
     Active: active (running) since Fri 2023-05-12 11:43:09 EDT; 24min ago
   Main PID: 1254978 (ceph-mon)
      Tasks: 27
     Memory: 93.6M
        CPU: 6.864s
     CGroup: /system.slice/system-ceph\x2dmon.slice/ceph-mon@PVEMTS2.service
             └─1254978 /usr/bin/ceph-mon -f --cluster ceph --id PVEMTS2 --setuser ceph --setgroup ceph

May 12 11:43:09 PVEMTS2 systemd[1]: Started Ceph cluster monitor daemon.


So the service appears to be running but something isn't talking.


Any suggestions on where to go from here?
 
Did you run ceph mon remove {mon-id}?

The MONs also keep internal state, the so-called monmap, which lists every MON that they should be able to talk to.

Check out the troubleshooting MONs section in the Ceph docs. The part about opening the admin socket of the local MON is especially useful: https://docs.ceph.com/en/latest/rad...hooting-mon/#using-the-monitor-s-admin-socket

The path should be /run/ceph/.... Running mon_status or quorum_status against that socket should print out quite a bit. At the beginning of the mon_status output you will see the state the MON is currently in and the monmap it keeps internally. Compare those between the nodes; that should give you an idea of where something might be off. Worst case, remove the problematic MON again.
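To make the comparison between nodes less error-prone, something like the following could help. It is only a rough sketch: it assumes the mon_status JSON has a top-level "state" field and a "monmap" with a "mons" list whose entries carry a "name", and that `ceph daemon mon.<id> mon_status` is run as root on the node hosting that MON. The function names are made up for this example.

```python
import json
import subprocess


def summarize_mon_status(raw_json):
    """Return (state, sorted MON names in the monmap) from mon_status JSON."""
    status = json.loads(raw_json)
    names = sorted(mon["name"] for mon in status["monmap"]["mons"])
    return status["state"], names


def monmap_differences(summaries):
    """Map each node to the MON names that are missing from its monmap.

    `summaries` maps node name -> (state, names) as returned by
    summarize_mon_status(). Nodes whose monmap is complete are omitted.
    """
    all_names = set()
    for _state, names in summaries.values():
        all_names.update(names)
    differences = {}
    for node, (_state, names) in summaries.items():
        missing = sorted(all_names - set(names))
        if missing:
            differences[node] = missing
    return differences


def local_mon_status(mon_id):
    """Query the local MON through its admin socket (needs root on that node)."""
    result = subprocess.run(
        ["ceph", "daemon", f"mon.{mon_id}", "mon_status"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

On each node you would save the output of local_mon_status("PVEMTS1") and so on, feed every output through summarize_mon_status(), and hand the collected dict to monmap_differences(). A MON that stays in a state such as probing while the others do not list it in their monmap is the kind of mismatch to look for.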
 
Hi there,

We did run that when we removed the failed monitor.

I just rebooted one of the other storage nodes (the one that had the successful upgrade) and the monitor on this node came online - I'm wondering if the ceph.conf update didn't get processed on all of the ceph nodes when we added the new monitor.
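For anyone wanting to rule out the ceph.conf side of that guess, a small check like the one below could be used. It is a sketch under assumptions: the addresses are placeholders rather than our real ones, the helper names are made up, and it only looks at the mon_host line in [global]. It also says nothing about whether an already running daemon has picked up the change.

```python
import configparser
import re


def mon_hosts(ceph_conf_text):
    """Return the addresses listed under mon_host in the [global] section."""
    # ceph.conf lines are often indented; strip them so configparser does not
    # mistake an indented option for a continuation of the previous value.
    cleaned = "\n".join(line.strip() for line in ceph_conf_text.splitlines())
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(cleaned)
    raw = ""
    for key in ("mon_host", "mon host"):
        if parser.has_option("global", key):
            raw = parser.get("global", key)
            break
    # Entries may be separated by commas, spaces, or both.
    return [host for host in re.split(r"[,\s]+", raw.strip()) if host]


def missing_mon_hosts(ceph_conf_text, expected):
    """Return the expected addresses that do not appear in mon_host."""
    present = set(mon_hosts(ceph_conf_text))
    return [addr for addr in expected if addr not in present]


if __name__ == "__main__":
    # Placeholder content; on a real node you would read the actual ceph.conf.
    sample = "[global]\n\t mon_host = 10.0.0.1 10.0.0.3\n"
    print(missing_mon_hosts(sample, ["10.0.0.1", "10.0.0.2", "10.0.0.3"]))
```

If the new monitor's address shows up in mon_host on every node, the file itself is probably not the problem and the running daemons are the next thing to look at.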

Either way - the reboot seems to have resolved the issue for us.

I'll read through that link for future reference - thank you very much.
 
