URGENT & CRITICAL: Ceph cluster stopped after restart

MigF117

New Member
Jun 14, 2025
Hello

I have a 3-host Proxmox 7.4-17 cluster with Ceph. Each host has 2 OSDs, and everything was working fine until I had to do a full shutdown and restart.
All the hosts came up fine, but the Ceph cluster didn't. Looking at the monitors, they are all in an unknown status, and no OSDs are listed.

I ran systemctl status for the ceph-mon@ and ceph-mgr@ services on each host; they all show as running.
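
For reference, the checks looked roughly like this on each host (assuming the mon/mgr instance names match the short hostnames, which is the Proxmox default):

systemctl status ceph-mon@$(hostname -s)
systemctl status ceph-mgr@$(hostname -s)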

After a lot of digging, I tried recreating the monmap and injecting it on all 3 hosts, but still no luck bringing the Ceph cluster up.
I also tried everything I could find about recreating the monitor store and DB, with no luck.
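
In case it helps, the monmap steps I followed were roughly this on each host (mon name assumed to be the short hostname; /tmp/monmap is just an example path):

systemctl stop ceph-mon@$(hostname -s)
# extract the monmap from this monitor's store
ceph-mon -i $(hostname -s) --extract-monmap /tmp/monmap
# check which monitors it lists
monmaptool --print /tmp/monmap
# inject it back and restart the monitor
ceph-mon -i $(hostname -s) --inject-monmap /tmp/monmap
systemctl start ceph-mon@$(hostname -s)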

When I try any of the Ceph commands, like ceph -s, I get nothing back at all.
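To look at the monitors at all, I've been falling back to each one's local admin socket, which should answer even without quorum (assuming the default socket path under /var/run/ceph/):

ceph daemon mon.$(hostname -s) mon_status
# or with an explicit socket path:
ceph --admin-daemon /var/run/ceph/ceph-mon.$(hostname -s).asok mon_status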



Here is a screenshot of ceph.conf
[screenshot attached]

I'm stuck now and not sure what to do next.

Any help please.
 
Can you share the contents of "/var/log/ceph/"?

Can all three nodes ping and reach each other over the network?
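
For example, something like this on each node (replace the placeholder IP with the other nodes' Ceph-network addresses):

# confirm the monitor is listening (msgr2 on 3300, legacy on 6789)
ss -tlnp | grep ceph-mon
# test reachability of the other monitors
ping -c 3 <other-node-ceph-ip>
nc -vz <other-node-ceph-ip> 6789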


Fabián Rodríguez | Le Goût du Libre Inc. | Montreal, Canada | Mastodon
Proxmox Silver Partner, server and desktop enterprise support in French, English and Spanish

Yes, all 3 hosts can ping and access each other on the public and ceph network.
I'll get the logs tomorrow morning when I get back to the office.
But from memory, when I looked at the Ceph log window in the GUI, I couldn't see any errors, just heaps of sync entries to AVHOST02.
 
After a lot more digging,
I found that HOST01 is trying to start its OSDs with the wrong FSIDs. I'm not sure where these FSIDs are coming from.

[2025-06-15 14:15:43,718][ceph_volume.process][INFO ] Running command: /usr/sbin/ceph-volume lvm trigger 0-6a05e0de-c1a9-4d95-95d5-22846b03604b
[2025-06-15 14:15:43,741][ceph_volume.process][INFO ] Running command: /usr/sbin/ceph-volume lvm trigger 1-1b5cfe78-2297-4f83-a65d-10bc42fb1c26
[2025-06-15 14:15:43,841][ceph_volume.process][INFO ] stderr --> RuntimeError: could not find osd.0 with osd_fsid 6a05e0de-c1a9-4d95-95d5-22846b03604b
[2025-06-15 14:15:43,849][systemd][WARNING] command returned non-zero exit status: 1
[2025-06-15 14:15:43,849][systemd][WARNING] failed activating OSD, retries left: 1
[2025-06-15 14:15:43,877][ceph_volume.process][INFO ] stderr --> RuntimeError: could not find osd.1 with osd_fsid 1b5cfe78-2297-4f83-a65d-10bc42fb1c26
[2025-06-15 14:15:43,885][systemd][WARNING] command returned non-zero exit status: 1
[2025-06-15 14:15:43,885][systemd][WARNING] failed activating OSD, retries left: 1
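
Presumably those FSIDs come from the ceph-volume systemd unit names (the trigger call appears to take the id-fsid pair from the unit instance name), so I'm listing the enabled units to compare (the unit name below is the one from the log):

systemctl list-units 'ceph-volume@*'
ls /etc/systemd/system/multi-user.target.wants/ | grep ceph-volume
# a stale unit could then be disabled, e.g.:
systemctl disable ceph-volume@lvm-0-6a05e0de-c1a9-4d95-95d5-22846b03604b.service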

These are the wrong FSIDs for osd.0 and osd.1.
The correct ones are:
[osd.0] fsid = d0dc1dc3-5f80-40b1-9664-abd5e2f7c2f4
[osd.2] fsid = d3a70dd6-5eda-4068-9543-0fd7f853ce9c
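
For anyone hitting the same thing: the id/fsid pairs actually stored on the LVs can be listed on each host, and an OSD can be activated with an explicit pair (the fsid below is the correct one from my osd.0):

# show OSD metadata as ceph-volume sees it
ceph-volume lvm list
# or read the tags straight from LVM
lvs -o lv_name,lv_tags | grep ceph.osd_fsid
# activate one OSD with an explicit id/fsid pair
ceph-volume lvm activate 0 d0dc1dc3-5f80-40b1-9664-abd5e2f7c2f4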

All the hosts are still showing as Unknown in the GUI, but all the services are running, and looking at the logs on HOST02 and HOST03 (ceph-volume-systemd.log), all of their OSDs are mounted.
I think it's because I only have 3 hosts, and after the reboot the cluster couldn't form quorum when only 2 hosts were up to start with.

Any idea on how to fix this issue, or how to get the data off the OSDs, would be a great help.