Ceph recovery: Wiped out 3-node cluster with OSDs still intact

shadyabhi

This 3-node cluster also had a 4th node (r730), which didn't have any OSDs assigned.

This is what I have available to recover with:-
  • /etc/ceph/ceph.conf and /etc/ceph/ceph.client.admin.keyring, saved from a node that was part of the old cluster.
  • /var/lib/pve-cluster/config.db file from the r730 node (see the sqlite sketch just below this list for how I hope to pull the old VM configs out of it).
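
Since my real goal is the VM images, I'm assuming I can pull the old VM .conf files straight out of that config.db copy with sqlite3. This is only a sketch based on how I understand the pmxcfs database layout (a single tree table with name/data columns); the /root/recovery paths and the 100.conf example are placeholders from my setup:

Code:
# On any machine with sqlite3 installed, using the config.db copied off r730.
# List everything stored in the pmxcfs tree (VM configs live under qemu-server/).
sqlite3 /root/recovery/config.db "SELECT name FROM tree;"

# Dump a single VM config, e.g. VM 100 (adjust to whatever the listing shows).
sqlite3 /root/recovery/config.db \
  "SELECT data FROM tree WHERE name = '100.conf';" > /root/recovery/100.conf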


I now have the 3 Proxmox nodes reinstalled as a brand-new cluster, and I want to revive the Ceph cluster using the existing OSDs.

Overall goal: how can I recover just the VM images? That way I can attach them to new VMs and start them up. For recovery, I'm open to adding the "r730" node back again if it simplifies things.

To validate that the OSDs still exist, I ran the commands below. So far I've only tried them on one node, and the output hints that recovery is possible.

Code:
root@hp800g9-1:~# sudo ceph-volume lvm activate --all
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /usr/bin/ceph-authtool --gen-print-key
--> Activating OSD ID 0 FSID 8df70b91-28bf-4a7c-96c4-51f1e63d2e03
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-0
Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-a7873caa-1ef2-4b84-acfb-53448242a9c8/osd-block-8df70b91-28bf-4a7c-96c4-51f1e63d2e03 --path /var/lib/ceph/osd/ceph-0 --no-mon-config
Running command: /usr/bin/ln -snf /dev/ceph-a7873caa-1ef2-4b84-acfb-53448242a9c8/osd-block-8df70b91-28bf-4a7c-96c4-51f1e63d2e03 /var/lib/ceph/osd/ceph-0/block
Running command: /usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-0/block
Running command: /usr/bin/chown -R ceph:ceph /dev/dm-0
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-0
Running command: /usr/bin/systemctl enable ceph-volume@lvm-0-8df70b91-28bf-4a7c-96c4-51f1e63d2e03
Running command: /usr/bin/systemctl enable --runtime ceph-osd@0
Running command: /usr/bin/systemctl start ceph-osd@0
--> ceph-volume lvm activate successful for osd ID: 0
root@hp800g9-1:~#

root@hp800g9-1:~# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op update-mon-db --mon-store-path /mnt/osd-0/ --no-mon-config
osd.0   : 5593 osdmaps trimmed, 0 osdmaps added.
root@hp800g9-1:~# ls /mnt/osd-0/
kv_backend  store.db
root@hp800g9-1:~#

root@hp800g9-1:~# ceph-volume lvm list


====== osd.0 =======

  [block]       /dev/ceph-a7873caa-1ef2-4b84-acfb-53448242a9c8/osd-block-8df70b91-28bf-4a7c-96c4-51f1e63d2e03

      block device              /dev/ceph-a7873caa-1ef2-4b84-acfb-53448242a9c8/osd-block-8df70b91-28bf-4a7c-96c4-51f1e63d2e03
      block uuid                s7LJFW-5jYi-TFEj-w9hS-5ep5-jOLy-ZibL8t
      cephx lockbox secret
      cluster fsid              c3c25528-cbda-4f9b-a805-583d16b93e8f
      cluster name              ceph
      crush device class
      encrypted                 0
      osd fsid                  8df70b91-28bf-4a7c-96c4-51f1e63d2e03
      osd id                    0
      osdspec affinity
      type                      block
      vdo                       0
      devices                   /dev/nvme1n1
root@hp800g9-1:~#

How do we proceed?
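
From my reading of the Ceph docs on rebuilding the MON store from OSDs, I think the next step is to repeat that update-mon-db collection for every OSD on all three nodes, accumulating everything into one shared mon-store directory (I used /mnt/osd-0 above only as a test). Something roughly like the sketch below, untested on my side, with my hostnames, /root/mon-store as a placeholder path, and the ceph-osd daemons stopped while ceph-objectstore-tool runs:

Code:
# Untested sketch: gather cluster maps from every OSD on every node into one
# accumulating mon-store directory by shuffling it between the hosts.
ms=/root/mon-store
mkdir -p $ms

for host in hp800g9-1 intelnuc10 beelinku59pro; do
  # push the store built so far to this host
  rsync -avz $ms/ root@$host:$ms/
  # let every OSD on this host add its maps to the store
  ssh root@$host "for osd in /var/lib/ceph/osd/ceph-*; do
    ceph-objectstore-tool --data-path \$osd --no-mon-config \
      --op update-mon-db --mon-store-path $ms
  done"
  # pull the updated store back
  rsync -avz root@$host:$ms/ $ms/
done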

This is the cluster state right now. I've only installed the Ceph packages so far, nothing else.

Ceph Status:-
Code:
root@hp800g9-1:~# ceph -s
  cluster:
    id:     9c9daac0-736e-4dc1-8380-e6a3fa7d2c23
    health: HEALTH_WARN
            OSD count 0 < osd_pool_default_size 3

  services:
    mon: 1 daemons, quorum hp800g9-1 (age 17h)
    mgr: hp800g9-1(active, since 17h)
    osd: 0 osds: 0 up, 0 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:

root@hp800g9-1:~#

Nodes:-
Code:
root@hp800g9-1:~# pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         1          1 hp800g9-1 (local)
         2          1 intelnuc10
         3          1 beelinku59pro
root@hp800g9-1:~#

Hey @gurubert, thank you for the link. I'm also missing some understanding of how Ceph is integrated with Proxmox. Can you help list the overall steps I need to follow to recover?

A few questions:-

1. After following the "Recovering from OSD without monitor data" section of the Ceph wiki, I'll end up with a store.db built from all the OSDs. That's the command I already ran in my first post, so I can repeat it on all 3 nodes.

2. Once that is done, what else is needed for the new Ceph cluster to fully adopt these OSDs? Can you please provide a high-level overview? I've put my rough understanding below.
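
For (2), my rough understanding from the same doc is that the collected mon-store then gets rebuilt with ceph-monstore-tool and dropped into place on the new monitor. The sketch below is untested; the keyring path is the one I saved, the mon name is hp800g9-1 from my new cluster, and I've left out the extra mgr-key and --mon-ids handling the docs mention:

Code:
ms=/root/mon-store

# the docs adjust caps on the keyring used for the rebuild (it should contain
# both the mon. and client.admin keys)
ceph-authtool /etc/ceph/ceph.client.admin.keyring -n mon. --cap mon 'allow *'
ceph-authtool /etc/ceph/ceph.client.admin.keyring -n client.admin \
  --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *'

# rebuild the monitor store from the maps collected off the OSDs
ceph-monstore-tool $ms rebuild -- --keyring /etc/ceph/ceph.client.admin.keyring

# back up the new mon's fresh store and move the rebuilt one into place
systemctl stop ceph-mon@hp800g9-1
mv /var/lib/ceph/mon/ceph-hp800g9-1/store.db /var/lib/ceph/mon/ceph-hp800g9-1/store.db.bak
cp -r $ms/store.db /var/lib/ceph/mon/ceph-hp800g9-1/store.db
chown -R ceph:ceph /var/lib/ceph/mon/ceph-hp800g9-1/store.db
systemctl start ceph-mon@hp800g9-1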

I've made sure the node names are the same as last time, to avoid more problems during recovery. Let me know if there are other things to look out for. My goal is simply to regain access to the VM images.
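
For the images themselves, my assumption is that once the monitors are rebuilt and the OSDs rejoin, the VM disks will show up as RBD images named vm-<vmid>-disk-N in the old pool, and I can either export them to files or point new VMs at them. Pool and image names below are placeholders:

Code:
# list the images in the old pool once the cluster is healthy again
rbd ls <poolname>

# copy one image out as a raw file, e.g.
rbd export <poolname>/vm-100-disk-0 /mnt/backup/vm-100-disk-0.raw

# alternatively, add the pool as RBD storage in the new PVE cluster and attach
# the existing images to freshly created VMs instead of exporting them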

Thanks
 