URGENT & CRITICAL: Ceph cluster stopped after restart

MigF117

New Member
Jun 14, 2025
Hello

I have a 3-host Proxmox 7.4-17 cluster running Ceph; each host has 2 OSDs, and everything was working fine until I had to do a full shutdown and restart.
All the hosts came back up fine, but the Ceph cluster didn't: the monitors are all in an unknown status and no OSDs are listed.

I ran systemctl status ceph-mon@ and ceph-mgr@ on each host, and both services show as running.
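
For reference, the checks looked roughly like this on each host (HOST01 here is a placeholder for each host's own name):

systemctl status ceph-mon@HOST01
systemctl status ceph-mgr@HOST01
# quick overview of every Ceph unit on the host
systemctl list-units 'ceph*'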

After a lot of digging, I tried recreating the monmap and injecting it into all 3 hosts, but still had no luck bringing the Ceph cluster up.
I've tried everything I could find about recreating the monitor store and DB, with no luck.
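
In case it's relevant, the monmap rebuild I attempted followed the usual pattern below; the cluster fsid, host names and IPs are placeholders, not my real values:

# stop the monitor before touching its map
systemctl stop ceph-mon@HOST01
# build a fresh monmap listing all three monitors
monmaptool --create --clobber --fsid <cluster-fsid> --add HOST01 10.0.0.1 --add HOST02 10.0.0.2 --add HOST03 10.0.0.3 /tmp/monmap
# inject it into this monitor's store, then start it again
ceph-mon -i HOST01 --inject-monmap /tmp/monmap
systemctl start ceph-mon@HOST01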

When I try any of the Ceph commands, like ceph -s, I get nothing back.
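
I understand ceph -s needs monitor quorum, so presumably it hangs without one; the monitor admin socket should still answer, e.g. (default socket path assumed):

ceph daemon mon.HOST01 mon_status
# or directly against the socket file
ceph --admin-daemon /var/run/ceph/ceph-mon.HOST01.asok mon_status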



Here is a screenshot of ceph.conf
[screenshot attached: 1749895134486.png]

I'm stuck now and not sure what to do next.

Any help please.
 
Can you share the contents of /var/log/ceph/?

Can all three nodes ping and access each other over the network?
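
Something like the following from each node, against each of the other two, would confirm both plain reachability and that the monitor ports are open (the IP is just an example, use your public and Ceph network addresses):

ping -c 3 10.0.0.2
# Ceph monitors listen on 3300 (msgr2) and 6789 (msgr1) by default
nc -zv 10.0.0.2 3300
nc -zv 10.0.0.2 6789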


Fabián Rodríguez | Le Goût du Libre Inc. | Montreal, Canada | Mastodon
Proxmox Silver Partner, server and desktop enterprise support in French, English and Spanish

Yes, all 3 hosts can ping and access each other on the public and ceph network.
I'll get the log tomorrow morning when I get back to the office.
But from memory, when I looked at the ceph log window in the GUI, I couldn't see errors, just heaps of sync entries to AVHOST02.
 
After a lot more digging, I found that HOST01 is trying to start the OSDs with the wrong FSIDs, and I'm not sure where these FSIDs are coming from.

[2025-06-15 14:15:43,718][ceph_volume.process][INFO ] Running command: /usr/sbin/ceph-volume lvm trigger 0-6a05e0de-c1a9-4d95-95d5-22846b03604b
[2025-06-15 14:15:43,741][ceph_volume.process][INFO ] Running command: /usr/sbin/ceph-volume lvm trigger 1-1b5cfe78-2297-4f83-a65d-10bc42fb1c26
[2025-06-15 14:15:43,841][ceph_volume.process][INFO ] stderr --> RuntimeError: could not find osd.0 with osd_fsid 6a05e0de-c1a9-4d95-95d5-22846b03604b
[2025-06-15 14:15:43,849][systemd][WARNING] command returned non-zero exit status: 1
[2025-06-15 14:15:43,849][systemd][WARNING] failed activating OSD, retries left: 1
[2025-06-15 14:15:43,877][ceph_volume.process][INFO ] stderr --> RuntimeError: could not find osd.1 with osd_fsid 1b5cfe78-2297-4f83-a65d-10bc42fb1c26
[2025-06-15 14:15:43,885][systemd][WARNING] command returned non-zero exit status: 1
[2025-06-15 14:15:43,885][systemd][WARNING] failed activating OSD, retries left: 1
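
As far as I can tell, the fsid in those trigger calls comes from the ceph-volume systemd unit instance names (ceph-volume@lvm-<osd-id>-<osd-fsid>.service), so listing those shows what the host thinks it should be activating:

systemctl list-units --all 'ceph-volume@*'
# the enabled units live here
ls /etc/systemd/system/multi-user.target.wants/ | grep ceph-volume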

These are the wrong FSIDs for OSD.0 and OSD.1.
The correct ones are:
[osd.0] fsid = d0dc1dc3-5f80-40b1-9664-abd5e2f7c2f4
[osd.2] fsid = d3a70dd6-5eda-4068-9543-0fd7f853ce9c
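
For reference, this is how I pulled what I believe are the correct FSIDs, straight from the LVM tags and the mounted OSD data dirs (default paths assumed):

# prints the osd id and osd fsid stored in the LVM tags of each OSD device
ceph-volume lvm list
# an activated OSD's data dir also carries its fsid
cat /var/lib/ceph/osd/ceph-0/fsid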

All the hosts are still showing as Unknown in the GUI, but all the services are running, and looking at the HOST2 & 3 logs (ceph-volume-systemd.log), all the OSDs are mounted.
I think it's because I only have 3 hosts, and after the reboot the cluster couldn't form quorum with only 2 hosts up to start with.
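
With 3 monitors, quorum needs at least 2 of them. Whether they are stuck probing or electing can be checked per monitor via the admin socket (host name is a placeholder):

ceph daemon mon.HOST01 quorum_status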

Any idea on how to fix this issue, or how to get the data off the OSDs, would be a great help.
 
Hi again,

Thanks for the details.

I searched the forums and there is a similar situation discussed here:

This particular piece of documentation caught my attention:
https://docs.ceph.com/en/quincy/rados/troubleshooting/troubleshooting-mon/#recovery-using-osds
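
Very roughly, the core of that procedure is rebuilding the monitor store from the copies of the cluster map that every OSD keeps. A condensed sketch, with default paths and the Proxmox admin keyring location assumed, so please read the linked page carefully before running any of it:

# 1. with the OSD daemons stopped, scrape the cluster map from every OSD into a temporary store
ms=/tmp/mon-store
mkdir -p $ms
for osd in /var/lib/ceph/osd/ceph-*; do
  ceph-objectstore-tool --data-path $osd --no-mon-config --op update-mon-db --mon-store-path $ms
done
# 2. rebuild the monitor DB from it (needs a keyring with the admin and mon keys)
ceph-monstore-tool $ms rebuild -- --keyring /etc/pve/priv/ceph.client.admin.keyring
# 3. back up and replace the store on one monitor, fix ownership, then start it
mv /var/lib/ceph/mon/ceph-HOST01/store.db /var/lib/ceph/mon/ceph-HOST01/store.db.bak
cp -r $ms/store.db /var/lib/ceph/mon/ceph-HOST01/store.db
chown -R ceph:ceph /var/lib/ceph/mon/ceph-HOST01/store.db
systemctl start ceph-mon@HOST01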

Because this is rather urgent and may lead to data loss, I'd suggest getting help from Proxmox support directly. I lack the time to reply quickly here, but they should be best able to assist with a speedy recovery / rebuild.