URGENT & CRITICAL: Ceph cluster stopped after restart

MigF117

New Member
Jun 14, 2025
Hello

I have a 3-host Proxmox 7.4-17 cluster with Ceph. Each host has 2 OSDs, and everything was working fine until I had to do a full shutdown and restart.
All the hosts came up fine, but the Ceph cluster didn't. Looking at the monitors, they are all in an unknown status, and no OSDs are listed.

I ran systemctl status for the ceph-mon@ and ceph-mgr@ services on each host; they all show as running.
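
For reference, the checks looked roughly like this on each host (assuming the mon/mgr instance names match the short hostnames, which is the Proxmox default):

systemctl status ceph-mon@$(hostname -s)
systemctl status ceph-mgr@$(hostname -s)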

After a lot of digging, I tried recreating the monmap and injecting it on all 3 hosts, but still no luck bringing the Ceph cluster up.
I also tried everything I could find about recreating the monitor store and DB, with no luck.
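
In case it helps, the monmap steps I followed were roughly this on each host (mon name assumed to be the short hostname; /tmp/monmap is just an example path):

systemctl stop ceph-mon@$(hostname -s)
# extract the monmap from this monitor's store
ceph-mon -i $(hostname -s) --extract-monmap /tmp/monmap
# check which monitors it lists
monmaptool --print /tmp/monmap
# inject it back and restart the monitor
ceph-mon -i $(hostname -s) --inject-monmap /tmp/monmap
systemctl start ceph-mon@$(hostname -s)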

When I try any of the Ceph commands, like ceph -s, I get nothing back at all.
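To look at the monitors at all, I've been falling back to each one's local admin socket, which should answer even without quorum (assuming the default socket path under /var/run/ceph/):

ceph daemon mon.$(hostname -s) mon_status
# or with an explicit socket path:
ceph --admin-daemon /var/run/ceph/ceph-mon.$(hostname -s).asok mon_status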



Here is a screenshot of ceph.conf
[screenshot attached]

I'm stuck now and not sure what to do next.

Any help please.
 
Can you share the contents of "/var/log/ceph/"?

Can all three nodes ping and reach each other over the network?
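
For example, something like this on each node (replace the placeholder IP with the other nodes' Ceph-network addresses):

# confirm the monitor is listening (msgr2 on 3300, legacy on 6789)
ss -tlnp | grep ceph-mon
# test reachability of the other monitors
ping -c 3 <other-node-ceph-ip>
nc -vz <other-node-ceph-ip> 6789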


Fabián Rodríguez | Le Goût du Libre Inc. | Montreal, Canada | Mastodon
Proxmox Silver Partner, server and desktop enterprise support in French, English and Spanish

Yes, all 3 hosts can ping and access each other on the public and ceph network.
I'll get the logs tomorrow morning when I get back to the office.
But from memory, when I looked at the Ceph log window in the GUI, I couldn't see any errors, just heaps of sync entries to AVHOST02.
 
After a lot more digging,
I found that HOST01 is trying to start its OSDs with the wrong FSIDs. I'm not sure where these FSIDs are coming from.

[2025-06-15 14:15:43,718][ceph_volume.process][INFO ] Running command: /usr/sbin/ceph-volume lvm trigger 0-6a05e0de-c1a9-4d95-95d5-22846b03604b
[2025-06-15 14:15:43,741][ceph_volume.process][INFO ] Running command: /usr/sbin/ceph-volume lvm trigger 1-1b5cfe78-2297-4f83-a65d-10bc42fb1c26
[2025-06-15 14:15:43,841][ceph_volume.process][INFO ] stderr --> RuntimeError: could not find osd.0 with osd_fsid 6a05e0de-c1a9-4d95-95d5-22846b03604b
[2025-06-15 14:15:43,849][systemd][WARNING] command returned non-zero exit status: 1
[2025-06-15 14:15:43,849][systemd][WARNING] failed activating OSD, retries left: 1
[2025-06-15 14:15:43,877][ceph_volume.process][INFO ] stderr --> RuntimeError: could not find osd.1 with osd_fsid 1b5cfe78-2297-4f83-a65d-10bc42fb1c26
[2025-06-15 14:15:43,885][systemd][WARNING] command returned non-zero exit status: 1
[2025-06-15 14:15:43,885][systemd][WARNING] failed activating OSD, retries left: 1
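
Presumably those FSIDs come from the ceph-volume systemd unit names (the trigger call appears to take the id-fsid pair from the unit instance name), so I'm listing the enabled units to compare (the unit name below is the one from the log):

systemctl list-units 'ceph-volume@*'
ls /etc/systemd/system/multi-user.target.wants/ | grep ceph-volume
# a stale unit could then be disabled, e.g.:
systemctl disable ceph-volume@lvm-0-6a05e0de-c1a9-4d95-95d5-22846b03604b.service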

These are the wrong FSIDs for osd.0 and osd.1.
The correct ones are:
[osd.0] fsid = d0dc1dc3-5f80-40b1-9664-abd5e2f7c2f4
[osd.2] fsid = d3a70dd6-5eda-4068-9543-0fd7f853ce9c
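
For anyone hitting the same thing: the id/fsid pairs actually stored on the LVs can be listed on each host, and an OSD can be activated with an explicit pair (the fsid below is the correct one from my osd.0):

# show OSD metadata as ceph-volume sees it
ceph-volume lvm list
# or read the tags straight from LVM
lvs -o lv_name,lv_tags | grep ceph.osd_fsid
# activate one OSD with an explicit id/fsid pair
ceph-volume lvm activate 0 d0dc1dc3-5f80-40b1-9664-abd5e2f7c2f4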

All the hosts are still showing as Unknown in the GUI, but all the services are running, and looking at the logs on HOST02 and HOST03 (ceph-volume-systemd.log), all of their OSDs are mounted.
I think it's because I only have 3 hosts, and after the reboot the cluster couldn't form quorum when only 2 hosts were up to start with.

Any idea on how to fix this issue, or how to get the data off the OSDs, would be a great help.