URGENT & CRITICAL: Ceph cluster stopped after restart

MigF117

New Member
Jun 14, 2025
Hello

I have a 3-host Proxmox 7.4-17 cluster running Ceph; each host has 2 OSDs, and everything was working fine until I had to do a full shutdown and restart.
All the hosts came back up fine, but the Ceph cluster didn't: the monitors are all in an unknown status and no OSDs are listed.

I ran systemctl status ceph-mon@ and ceph-mgr@ on each host, and both services show as running.
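
For reference, the checks looked roughly like this on each host (HOST01 here is a placeholder for each host's own name):

systemctl status ceph-mon@HOST01
systemctl status ceph-mgr@HOST01
# quick overview of every Ceph unit on the host
systemctl list-units 'ceph*'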

After a lot of digging, I tried recreating the monmap and injecting it into all 3 hosts, but still had no luck bringing the Ceph cluster up.
I've tried everything I could find about recreating the monitor store and DB, with no luck.
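
In case it's relevant, the monmap rebuild I attempted followed the usual pattern below; the cluster fsid, host names and IPs are placeholders, not my real values:

# stop the monitor before touching its map
systemctl stop ceph-mon@HOST01
# build a fresh monmap listing all three monitors
monmaptool --create --clobber --fsid <cluster-fsid> --add HOST01 10.0.0.1 --add HOST02 10.0.0.2 --add HOST03 10.0.0.3 /tmp/monmap
# inject it into this monitor's store, then start it again
ceph-mon -i HOST01 --inject-monmap /tmp/monmap
systemctl start ceph-mon@HOST01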

When I try any of the Ceph commands, like ceph -s, I get nothing back.
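
I understand ceph -s needs monitor quorum, so presumably it hangs without one; the monitor admin socket should still answer, e.g. (default socket path assumed):

ceph daemon mon.HOST01 mon_status
# or directly against the socket file
ceph --admin-daemon /var/run/ceph/ceph-mon.HOST01.asok mon_status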



Here is a screenshot of ceph.conf
[screenshot attached: 1749895134486.png]

I'm stuck now and not sure what to do next.

Any help please.
 
Can you share the contents of /var/log/ceph/?

Can all three nodes ping and access each other over the network?
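
Something like the following from each node, against each of the other two, would confirm both plain reachability and that the monitor ports are open (the IP is just an example, use your public and Ceph network addresses):

ping -c 3 10.0.0.2
# Ceph monitors listen on 3300 (msgr2) and 6789 (msgr1) by default
nc -zv 10.0.0.2 3300
nc -zv 10.0.0.2 6789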


Fabián Rodríguez | Le Goût du Libre Inc. | Montreal, Canada | Mastodon
Proxmox Silver Partner, server and desktop enterprise support in French, English and Spanish

Yes, all 3 hosts can ping and access each other on the public and ceph network.
I'll get the log tomorrow morning when I get back to the office.
But from memory, when I looked at the ceph log window in the GUI, I couldn't see errors, just heaps of sync entries to AVHOST02.
 
After a lot more digging, I found that HOST01 is trying to start the OSDs with the wrong FSIDs, and I'm not sure where these FSIDs are coming from.

[2025-06-15 14:15:43,718][ceph_volume.process][INFO ] Running command: /usr/sbin/ceph-volume lvm trigger 0-6a05e0de-c1a9-4d95-95d5-22846b03604b
[2025-06-15 14:15:43,741][ceph_volume.process][INFO ] Running command: /usr/sbin/ceph-volume lvm trigger 1-1b5cfe78-2297-4f83-a65d-10bc42fb1c26
[2025-06-15 14:15:43,841][ceph_volume.process][INFO ] stderr --> RuntimeError: could not find osd.0 with osd_fsid 6a05e0de-c1a9-4d95-95d5-22846b03604b
[2025-06-15 14:15:43,849][systemd][WARNING] command returned non-zero exit status: 1
[2025-06-15 14:15:43,849][systemd][WARNING] failed activating OSD, retries left: 1
[2025-06-15 14:15:43,877][ceph_volume.process][INFO ] stderr --> RuntimeError: could not find osd.1 with osd_fsid 1b5cfe78-2297-4f83-a65d-10bc42fb1c26
[2025-06-15 14:15:43,885][systemd][WARNING] command returned non-zero exit status: 1
[2025-06-15 14:15:43,885][systemd][WARNING] failed activating OSD, retries left: 1
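
As far as I can tell, the fsid in those trigger calls comes from the ceph-volume systemd unit instance names (ceph-volume@lvm-<osd-id>-<osd-fsid>.service), so listing those shows what the host thinks it should be activating:

systemctl list-units --all 'ceph-volume@*'
# the enabled units live here
ls /etc/systemd/system/multi-user.target.wants/ | grep ceph-volume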

These are the wrong FSIDs for OSD.0 and OSD.1.
The correct ones are:
[osd.0] fsid = d0dc1dc3-5f80-40b1-9664-abd5e2f7c2f4
[osd.2] fsid = d3a70dd6-5eda-4068-9543-0fd7f853ce9c
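
For reference, this is how I pulled what I believe are the correct FSIDs, straight from the LVM tags and the mounted OSD data dirs (default paths assumed):

# prints the osd id and osd fsid stored in the LVM tags of each OSD device
ceph-volume lvm list
# an activated OSD's data dir also carries its fsid
cat /var/lib/ceph/osd/ceph-0/fsid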

All the hosts are still showing as Unknown in the GUI, but all the services are running, and looking at the HOST2 & 3 logs (ceph-volume-systemd.log), all the OSDs are mounted.
I think it's because I only have 3 hosts, and after the reboot the cluster couldn't form quorum with only 2 hosts up to start with.
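
With 3 monitors, quorum needs at least 2 of them. Whether they are stuck probing or electing can be checked per monitor via the admin socket (host name is a placeholder):

ceph daemon mon.HOST01 quorum_status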

Any idea on how to fix this issue, or how to get the data off the OSDs, would be a great help.
 
Hi again,

Thanks for the details.

I searched the forums and there is a similar situation discussed here:

This particular piece of documentation caught my attention:
https://docs.ceph.com/en/quincy/rados/troubleshooting/troubleshooting-mon/#recovery-using-osds
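
Very roughly, the core of that procedure is rebuilding the monitor store from the copies of the cluster map that every OSD keeps. A condensed sketch, with default paths and the Proxmox admin keyring location assumed, so please read the linked page carefully before running any of it:

# 1. with the OSD daemons stopped, scrape the cluster map from every OSD into a temporary store
ms=/tmp/mon-store
mkdir -p $ms
for osd in /var/lib/ceph/osd/ceph-*; do
  ceph-objectstore-tool --data-path $osd --no-mon-config --op update-mon-db --mon-store-path $ms
done
# 2. rebuild the monitor DB from it (needs a keyring with the admin and mon keys)
ceph-monstore-tool $ms rebuild -- --keyring /etc/pve/priv/ceph.client.admin.keyring
# 3. back up and replace the store on one monitor, fix ownership, then start it
mv /var/lib/ceph/mon/ceph-HOST01/store.db /var/lib/ceph/mon/ceph-HOST01/store.db.bak
cp -r $ms/store.db /var/lib/ceph/mon/ceph-HOST01/store.db
chown -R ceph:ceph /var/lib/ceph/mon/ceph-HOST01/store.db
systemctl start ceph-mon@HOST01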

Because this is rather urgent and may lead to data loss, I'd suggest getting help from Proxmox support directly. I lack the time to reply quickly here, but they should be best able to assist with a speedy recovery / rebuild.