Ceph recovery: Wiped out 3-node cluster with OSDs still intact

shadyabhi

This 3-node cluster also had a 4th node (r730), which didn't have any OSDs assigned.

This is what I have available to recover with:-
  • /etc/ceph/ceph.conf and /etc/ceph/ceph.client.admin.keyring, saved from a node that was part of the old cluster.
  • /var/lib/pve-cluster/config.db file from the r730 node (see the sqlite sketch just below this list for how I hope to pull the old VM configs out of it).
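
Since my real goal is the VM images, I'm assuming I can pull the old VM .conf files straight out of that config.db copy with sqlite3. This is only a sketch based on how I understand the pmxcfs database layout (a single tree table with name/data columns); the /root/recovery paths and the 100.conf example are placeholders from my setup:

Code:
# On any machine with sqlite3 installed, using the config.db copied off r730.
# List everything stored in the pmxcfs tree (VM configs live under qemu-server/).
sqlite3 /root/recovery/config.db "SELECT name FROM tree;"

# Dump a single VM config, e.g. VM 100 (adjust to whatever the listing shows).
sqlite3 /root/recovery/config.db \
  "SELECT data FROM tree WHERE name = '100.conf';" > /root/recovery/100.conf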


I now have the 3 Proxmox nodes reinstalled as a brand-new cluster, and I want to revive the Ceph cluster using the existing OSDs.

Overall goal: how can I recover just the VM images? That way I can attach them to new VMs and start them up. For recovery, I'm open to adding the "r730" node back again if it simplifies things.

To validate that the OSDs still exist, I ran the commands below. So far I've only tried them on one node, and the output hints that recovery is possible.

Code:
root@hp800g9-1:~# sudo ceph-volume lvm activate --all
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /usr/bin/ceph-authtool --gen-print-key
--> Activating OSD ID 0 FSID 8df70b91-28bf-4a7c-96c4-51f1e63d2e03
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-0
Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-a7873caa-1ef2-4b84-acfb-53448242a9c8/osd-block-8df70b91-28bf-4a7c-96c4-51f1e63d2e03 --path /var/lib/ceph/osd/ceph-0 --no-mon-config
Running command: /usr/bin/ln -snf /dev/ceph-a7873caa-1ef2-4b84-acfb-53448242a9c8/osd-block-8df70b91-28bf-4a7c-96c4-51f1e63d2e03 /var/lib/ceph/osd/ceph-0/block
Running command: /usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-0/block
Running command: /usr/bin/chown -R ceph:ceph /dev/dm-0
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-0
Running command: /usr/bin/systemctl enable ceph-volume@lvm-0-8df70b91-28bf-4a7c-96c4-51f1e63d2e03
Running command: /usr/bin/systemctl enable --runtime ceph-osd@0
Running command: /usr/bin/systemctl start ceph-osd@0
--> ceph-volume lvm activate successful for osd ID: 0
root@hp800g9-1:~#

root@hp800g9-1:~# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op update-mon-db --mon-store-path /mnt/osd-0/ --no-mon-config
osd.0   : 5593 osdmaps trimmed, 0 osdmaps added.
root@hp800g9-1:~# ls /mnt/osd-0/
kv_backend  store.db
root@hp800g9-1:~#

root@hp800g9-1:~# ceph-volume lvm list


====== osd.0 =======

  [block]       /dev/ceph-a7873caa-1ef2-4b84-acfb-53448242a9c8/osd-block-8df70b91-28bf-4a7c-96c4-51f1e63d2e03

      block device              /dev/ceph-a7873caa-1ef2-4b84-acfb-53448242a9c8/osd-block-8df70b91-28bf-4a7c-96c4-51f1e63d2e03
      block uuid                s7LJFW-5jYi-TFEj-w9hS-5ep5-jOLy-ZibL8t
      cephx lockbox secret
      cluster fsid              c3c25528-cbda-4f9b-a805-583d16b93e8f
      cluster name              ceph
      crush device class
      encrypted                 0
      osd fsid                  8df70b91-28bf-4a7c-96c4-51f1e63d2e03
      osd id                    0
      osdspec affinity
      type                      block
      vdo                       0
      devices                   /dev/nvme1n1
root@hp800g9-1:~#

How do we proceed?
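
From my reading of the Ceph docs on rebuilding the MON store from OSDs, I think the next step is to repeat that update-mon-db collection for every OSD on all three nodes, accumulating everything into one shared mon-store directory (I used /mnt/osd-0 above only as a test). Something roughly like the sketch below, untested on my side, with my hostnames, /root/mon-store as a placeholder path, and the ceph-osd daemons stopped while ceph-objectstore-tool runs:

Code:
# Untested sketch: gather cluster maps from every OSD on every node into one
# accumulating mon-store directory by shuffling it between the hosts.
ms=/root/mon-store
mkdir -p $ms

for host in hp800g9-1 intelnuc10 beelinku59pro; do
  # push the store built so far to this host
  rsync -avz $ms/ root@$host:$ms/
  # let every OSD on this host add its maps to the store
  ssh root@$host "for osd in /var/lib/ceph/osd/ceph-*; do
    ceph-objectstore-tool --data-path \$osd --no-mon-config \
      --op update-mon-db --mon-store-path $ms
  done"
  # pull the updated store back
  rsync -avz root@$host:$ms/ $ms/
done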

This is the cluster state right now. I've only installed the Ceph packages so far, nothing else.

Ceph Status:-
Code:
root@hp800g9-1:~# ceph -s
  cluster:
    id:     9c9daac0-736e-4dc1-8380-e6a3fa7d2c23
    health: HEALTH_WARN
            OSD count 0 < osd_pool_default_size 3

  services:
    mon: 1 daemons, quorum hp800g9-1 (age 17h)
    mgr: hp800g9-1(active, since 17h)
    osd: 0 osds: 0 up, 0 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:

root@hp800g9-1:~#

Nodes:-
Code:
root@hp800g9-1:~# pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         1          1 hp800g9-1 (local)
         2          1 intelnuc10
         3          1 beelinku59pro
root@hp800g9-1:~#

Hey @gurubert, thank you for the link. I'm also missing some understanding of how Ceph is integrated with Proxmox. Can you help list the overall steps I need to follow to recover?

A few questions:-

1. After following the "Recovering from OSD without monitor data" section of the Ceph wiki, I'll end up with a store.db built from all the OSDs. That's the command I already ran in my first post, so I can repeat it on all 3 nodes.

2. Once that is done, what else is needed for the new Ceph cluster to fully adopt these OSDs? Can you please provide a high-level overview? I've put my rough understanding below.
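
For (2), my rough understanding from the same doc is that the collected mon-store then gets rebuilt with ceph-monstore-tool and dropped into place on the new monitor. The sketch below is untested; the keyring path is the one I saved, the mon name is hp800g9-1 from my new cluster, and I've left out the extra mgr-key and --mon-ids handling the docs mention:

Code:
ms=/root/mon-store

# the docs adjust caps on the keyring used for the rebuild (it should contain
# both the mon. and client.admin keys)
ceph-authtool /etc/ceph/ceph.client.admin.keyring -n mon. --cap mon 'allow *'
ceph-authtool /etc/ceph/ceph.client.admin.keyring -n client.admin \
  --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *'

# rebuild the monitor store from the maps collected off the OSDs
ceph-monstore-tool $ms rebuild -- --keyring /etc/ceph/ceph.client.admin.keyring

# back up the new mon's fresh store and move the rebuilt one into place
systemctl stop ceph-mon@hp800g9-1
mv /var/lib/ceph/mon/ceph-hp800g9-1/store.db /var/lib/ceph/mon/ceph-hp800g9-1/store.db.bak
cp -r $ms/store.db /var/lib/ceph/mon/ceph-hp800g9-1/store.db
chown -R ceph:ceph /var/lib/ceph/mon/ceph-hp800g9-1/store.db
systemctl start ceph-mon@hp800g9-1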

I've made sure the node names are the same as last time, to avoid more problems during recovery. Let me know if there are other things to look out for. My goal is simply to regain access to the VM images.
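
For the images themselves, my assumption is that once the monitors are rebuilt and the OSDs rejoin, the VM disks will show up as RBD images named vm-<vmid>-disk-N in the old pool, and I can either export them to files or point new VMs at them. Pool and image names below are placeholders:

Code:
# list the images in the old pool once the cluster is healthy again
rbd ls <poolname>

# copy one image out as a raw file, e.g.
rbd export <poolname>/vm-100-disk-0 /mnt/backup/vm-100-disk-0.raw

# alternatively, add the pool as RBD storage in the new PVE cluster and attach
# the existing images to freshly created VMs instead of exporting them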

Thanks
 