Recover ceph from OSDs only

cwriter

Hi

While I was backing up my Proxmox instances, a misconfigured script nuked the entire /var/lib/ directory tree on all my Proxmox nodes.

Recovering basic system functionality (dpkg, symlinks, etc.) was very painful, but I've gotten Proxmox itself into a workable state again. Luckily, the installation was completely up to date, so I could patch the holes with data from a fresh and updated PVE install. However, Ceph is not looking good. I did have a few nodes in maintenance mode that were down when the script did its damage, so a somewhat recent copy of the cluster state is available.
However, the cephx keys of the OSDs got deleted from all monitors, osds, mds and mgrs.
Because of this,
Code:
ceph -s
hangs indefinitely.
However, the data on the OSDs themselves seems fine. I've attempted to follow different guides, e.g. https://docs.ceph.com/en/quincy/rad...leshooting-mon/#mon-store-recovery-using-osds , but since the OSD daemons themselves cannot run due to a missing data directory in
Code:
/var/lib/ceph/osd/osd-{nodename}
and because the keys are missing, I have not yet gotten the monitor back up.

Is there a way to recover not only the monitors but the entire Ceph cluster and configuration from the OSDs, or alternatively to create a new Ceph cluster and import the data from the OSDs?

Thanks!
 
If the OSD data is still available, you could go ahead and recreate the mon DB as described in the linked docs. To get the OSDs started, try running
Code:
ceph-volume lvm activate --all
On recent bluestore OSDs, the filesystem mounted at /var/lib/ceph/... is a tmpfs that is populated from data stored on the OSD itself.
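To see what ceph-volume can still find and whether the activation actually worked, something along these lines can help (the OSD id is just an example):
Code:
# list the OSDs that ceph-volume can discover from the LVM metadata
ceph-volume lvm list
# after activation, each OSD data dir should be a small tmpfs mount, e.g. for osd.7:
findmnt /var/lib/ceph/osd/ceph-7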

With a bit of luck, you can get the OSD services to start. Then follow the mon recovery guide.

I did something similar a while ago and it was tedious work. Once the mon had been recovered from the OSD data, it was necessary to change the cluster fsid to match the fsid the OSDs had configured (in the ceph.conf file as well, but that is easy).
Auth for the services also had to be recreated to get everything running well enough to access the data.
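As a rough sketch of how the two fsids can be compared (the device path is just a placeholder):
Code:
# cluster fsid recorded in the bluestore label of an OSD
ceph-bluestore-tool show-label --dev /dev/<vg>/<osd-block-lv> | grep ceph_fsid
# fsid the rest of the cluster expects
grep fsid /etc/ceph/ceph.conf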

Once you have recovered your data, I do recommend though that you wipe everything and set it up fresh. You never know whether an issue you run into in the future is a legitimate problem or something you missed when recovering the cluster :)
 
Hi aaron

Thank you very much for your quick reply!
Running ceph-volume returns
Code:
root@pve:~# ceph-volume lvm activate --all
--> OSD ID 7 FSID 8321bfa2-36ab-4349-b7df-872b6aae5d8d process is active. Skipping activation
--> OSD ID 4 FSID d9c68ecd-c5e5-4697-8454-392b3a5ff5c6 process is active. Skipping activation
--> OSD ID 9 FSID 4cda8130-162d-4c45-82b8-669909b9c41c process is active. Skipping activation
and the /var/lib/ceph/osd/osd-{id}/ directories are populated and fine.

However, getting the services up is hard.
systemctl list-units --failed returns
Code:
root@pve:~# systemctl list-units --failed
  UNIT                 LOAD   ACTIVE SUB    DESCRIPTION
● ceph-mds@pve.service loaded failed failed Ceph metadata server daemon
● ceph-mgr@pve.service loaded failed failed Ceph cluster manager daemon
● ceph-mon@pve.service loaded failed failed Ceph cluster monitor daemon
● ceph-osd@pve.service loaded failed failed Ceph object storage daemon osd.pve
The mon crashes straight away:
Code:
Aug 16 09:38:01 pve systemd[1]: ceph-mon@pve.service: Failed with result 'signal'.
Aug 16 09:38:11 pve systemd[1]: ceph-mon@pve.service: Scheduled restart job, restart counter is at 4.
Aug 16 09:38:11 pve systemd[1]: Stopped Ceph cluster monitor daemon.
Aug 16 09:38:11 pve systemd[1]: Started Ceph cluster monitor daemon.
Aug 16 09:38:11 pve ceph-mon[2436113]: ./src/mon/AuthMonitor.cc: In function 'virtual void AuthMonitor::update_from_paxos(bool*)' thread 7fcb85a94700 time 2022-08-
16T09:38:11.852524+0200
Aug 16 09:38:11 pve ceph-mon[2436113]: ./src/mon/AuthMonitor.cc: 316: FAILED ceph_assert(ret == 0)
Aug 16 09:38:11 pve ceph-mon[2436113]:  ceph version 16.2.9 (a569859f5e07da0c4c39da81d5fb5675cd95da49) pacific (stable)
Aug 16 09:38:11 pve ceph-mon[2436113]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x124) [0x7fcb86969fde]
Aug 16 09:38:11 pve ceph-mon[2436113]:  2: /usr/lib/ceph/libceph-common.so.2(+0x251169) [0x7fcb8696a169]
Aug 16 09:38:11 pve ceph-mon[2436113]:  3: (AuthMonitor::update_from_paxos(bool*)+0x18fc) [0x55a6672f77fc]
Aug 16 09:38:11 pve ceph-mon[2436113]:  4: (Monitor::refresh_from_paxos(bool*)+0x163) [0x55a667266703]
Aug 16 09:38:11 pve ceph-mon[2436113]:  5: (Monitor::preinit()+0x9af) [0x55a667292b0f]
Aug 16 09:38:11 pve ceph-mon[2436113]:  6: main()
Aug 16 09:38:11 pve ceph-mon[2436113]:  7: __libc_start_main()
Aug 16 09:38:11 pve ceph-mon[2436113]:  8: _start()
Aug 16 09:38:11 pve ceph-mon[2436113]: 2022-08-16T09:38:11.850+0200 7fcb85a94700 -1 ./src/mon/AuthMonitor.cc: In function 'virtual void AuthMonitor::update_from_pa
xos(bool*)' thread 7fcb85a94700 time 2022-08-16T09:38:11.852524+0200
Aug 16 09:38:11 pve ceph-mon[2436113]: ./src/mon/AuthMonitor.cc: 316: FAILED ceph_assert(ret == 0)
Aug 16 09:38:11 pve ceph-mon[2436113]:  ceph version 16.2.9 (a569859f5e07da0c4c39da81d5fb5675cd95da49) pacific (stable)
Aug 16 09:38:11 pve ceph-mon[2436113]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x124) [0x7fcb86969fde]
Aug 16 09:38:11 pve ceph-mon[2436113]:  2: /usr/lib/ceph/libceph-common.so.2(+0x251169) [0x7fcb8696a169]
Aug 16 09:38:11 pve ceph-mon[2436113]:  3: (AuthMonitor::update_from_paxos(bool*)+0x18fc) [0x55a6672f77fc]
Aug 16 09:38:11 pve ceph-mon[2436113]:  4: (Monitor::refresh_from_paxos(bool*)+0x163) [0x55a667266703]
Aug 16 09:38:11 pve ceph-mon[2436113]:  5: (Monitor::preinit()+0x9af) [0x55a667292b0f]
Aug 16 09:38:11 pve ceph-mon[2436113]:  6: main()
Aug 16 09:38:11 pve ceph-mon[2436113]:  7: __libc_start_main()
Aug 16 09:38:11 pve ceph-mon[2436113]:  8: _start()
Aug 16 09:38:11 pve ceph-mon[2436113]: *** Caught signal (Aborted) **
Aug 16 09:38:11 pve ceph-mon[2436113]:  in thread 7fcb85a94700 thread_name:ceph-mon
Aug 16 09:38:11 pve ceph-mon[2436113]:  ceph version 16.2.9 (a569859f5e07da0c4c39da81d5fb5675cd95da49) pacific (stable)
Aug 16 09:38:11 pve ceph-mon[2436113]:  1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x14140) [0x7fcb86451140]
Aug 16 09:38:11 pve ceph-mon[2436113]:  2: gsignal()
Aug 16 09:38:11 pve ceph-mon[2436113]:  3: abort()
Aug 16 09:38:11 pve ceph-mon[2436113]:  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x16e) [0x7fcb8696a028]
Aug 16 09:38:11 pve ceph-mon[2436113]:  5: /usr/lib/ceph/libceph-common.so.2(+0x251169) [0x7fcb8696a169]
Aug 16 09:38:11 pve ceph-mon[2436113]:  6: (AuthMonitor::update_from_paxos(bool*)+0x18fc) [0x55a6672f77fc]
Aug 16 09:38:11 pve ceph-mon[2436113]:  7: (Monitor::refresh_from_paxos(bool*)+0x163) [0x55a667266703]
Aug 16 09:38:11 pve ceph-mon[2436113]:  8: (Monitor::preinit()+0x9af) [0x55a667292b0f]
Aug 16 09:38:11 pve ceph-mon[2436113]:  9: main()
Aug 16 09:38:11 pve ceph-mon[2436113]:  10: __libc_start_main()
Aug 16 09:38:11 pve ceph-mon[2436113]:  11: _start()
Aug 16 09:38:11 pve ceph-mon[2436113]: 2022-08-16T09:38:11.850+0200 7fcb85a94700 -1 *** Caught signal (Aborted) **
Aug 16 09:38:11 pve ceph-mon[2436113]:  in thread 7fcb85a94700 thread_name:ceph-mon
Aug 16 09:38:11 pve ceph-mon[2436113]:  ceph version 16.2.9 (a569859f5e07da0c4c39da81d5fb5675cd95da49) pacific (stable)
Aug 16 09:38:11 pve ceph-mon[2436113]:  1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x14140) [0x7fcb86451140]
Aug 16 09:38:11 pve ceph-mon[2436113]:  2: gsignal()
Aug 16 09:38:11 pve ceph-mon[2436113]:  3: abort()

Of course, this seems to be a keyring / paxos error - something to debug later.
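If it helps with debugging, I assume the mon store can also be inspected offline with ceph-monstore-tool before starting the mon; something like this (paths are examples) should at least show whether any auth entries are present:
Code:
# peek into a (stopped) mon store
ceph-monstore-tool /root/mon-store dump-keys | grep '^auth' | head
ceph-monstore-tool /root/mon-store get monmap -- --out /tmp/monmap
monmaptool --print /tmp/monmap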

On the other hand, the proxmox osd wrapper seems to fail as well:
Code:
journalctl -u ceph-osd@pve

Aug 16 10:03:45 pve systemd[1]: Starting Ceph object storage daemon osd.pve...
Aug 16 10:03:45 pve ceph-osd-prestart.sh[2451367]: OSD data directory /var/lib/ceph/osd/ceph-pve does not exist; bailing out.
Aug 16 10:03:45 pve systemd[1]: ceph-osd@pve.service: Control process exited, code=exited, status=1/FAILURE
Aug 16 10:03:45 pve systemd[1]: ceph-osd@pve.service: Failed with result 'exit-code'.
Aug 16 10:03:45 pve systemd[1]: Failed to start Ceph object storage daemon osd.pve.
Of course, I created the directory as well and chowned it to ceph:ceph, but the service does not recover and still claims that the directory does not exist. What should the contents of this directory normally be?
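For comparison, on the OSDs that ceph-volume did activate here, the (tmpfs) data dir looks roughly like this; I assume this is what the prestart script expects (contents may vary by release):
Code:
ls /var/lib/ceph/osd/ceph-7
# block  ceph_fsid  fsid  keyring  ready  require_osd_release  type  whoami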


For reference, this is the script used to rebuild:
Code:
#!/bin/bash

set -xe

ms=/root/mon-store
mkdir $ms || true
hosts=( "pve01" "pve02" "pve" )

# collect the cluster map from stopped OSDs
for host in "${hosts[@]}"; do
  rsync -avz $ms/. root@$host:$ms.remote
  rm -rf $ms
  ssh root@$host <<EOF
    set -x
    for osd in /var/lib/ceph/osd/ceph-*; do
      ceph-objectstore-tool --data-path \$osd --no-mon-config --op update-mon-db --mon-store-path $ms.remote
    done
EOF
  rsync -avz root@$host:$ms.remote/. $ms
done

# Rebuild the monitor store from the collected map. If the cluster does not
# use cephx authentication, the following steps to update the keyring with
# the caps can be skipped, and there is no need to pass the "--keyring" option,
# i.e. just use "ceph-monstore-tool $ms rebuild" instead.
ceph-authtool /etc/pve/priv/ceph.mon.keyring -n mon. \
  --cap mon 'allow *'
ceph-authtool /etc/pve/priv/ceph.mon.keyring -n client.admin \
  --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *'
# Add one or more ceph-mgr keys to the keyring. In this case, an encoded key
# for mgr.pve is added; the encoded key can normally be found in
# /etc/ceph/${cluster}.${mgr_name}.keyring on the machine where ceph-mgr is
# deployed.
ceph-authtool /etc/pve/priv/ceph.mon.keyring --add-key 'AQDN8kBe9PLWARAAZwxXMr+n85SBYbSlLcZnMA==' -n mgr.pve \
  --cap mon 'allow profile mgr' --cap osd 'allow *' --cap mds 'allow *'
# If your monitors' ids are not sorted by IP address, please specify them in order.
# For example, if mon 'a' is 10.0.0.3, mon 'b' is 10.0.0.2, and mon 'c' is 10.0.0.4,
# pass "--mon-ids b a c".
# In addition, if your monitors' ids are not single characters like 'a', 'b', 'c',
# specify them on the command line as arguments of the "--mon-ids" option.
# If you are not sure, check your ceph.conf for sections named like '[mon.foo]'.
# Don't pass the "--mon-ids" option if you are using DNS SRV for looking up monitors.
ceph-monstore-tool $ms rebuild -- --keyring /etc/pve/priv/ceph.mon.keyring --mon-ids pve pve01 pve02 pve03


# make a backup of the corrupted store.db just in case!  repeat for
# all monitors.
mv /var/lib/ceph/mon/pve-ceph/store.db{,.corrupted}

# move the rebuilt store.db into place.  Repeat for all monitors.
mv $ms/store.db /var/lib/ceph/mon/pve-ceph/store.db
chown -R ceph:ceph /var/lib/ceph/mon/pve-ceph/store.db

The mgr key is probably not correct; however, I have not yet found a way to recover it. Does this need to be done before the mon rebuild?
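In case the old mgr key cannot be recovered at all, I assume a brand-new secret could simply be generated for the rebuild and reused for the recreated mgr afterwards; untested sketch:
Code:
# generate a fresh secret and register it for mgr.pve in the rebuild keyring
NEW_MGR_KEY=$(ceph-authtool --gen-print-key)
ceph-authtool /etc/pve/priv/ceph.mon.keyring --add-key "$NEW_MGR_KEY" -n mgr.pve \
  --cap mon 'allow profile mgr' --cap osd 'allow *' --cap mds 'allow *'
# the same key would then also have to end up in the recreated mgr's local keyring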

Thanks again for the quick response!


/EDIT:

I was able to find a monitor that had been offline, and by using monmaptool I was able to remove the other monitors and use this old monitor to at least get ceph status working again.
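Roughly, the monmap surgery looked like this (mon ids are placeholders, and the surviving mon has to be stopped first):
Code:
# extract the monmap from the surviving mon, drop the wiped mons, inject it back
ceph-mon -i <old-mon-id> --extract-monmap /tmp/monmap
monmaptool --print /tmp/monmap
monmaptool --rm <dead-mon-id> /tmp/monmap
ceph-mon -i <old-mon-id> --inject-monmap /tmp/monmap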

However, I'm still unable to copy the repaired store into place, as it errors out with the update_from_paxos assert.
While it is running, though, Ceph continuously marks the PGs as "Unknown" and throws OSDs out of the cluster ("down"), and I'm unable to re-add them, even though the ceph-osd@<id>.service units are running.

Do you know why this paxos assert keeps acting up?
 
Hello everyone

So, somehow, I was able to get the system to not crash. I'm still not entirely sure what I did differently, but I do have a suspicion.

For future reference, in case anyone else wants to experience the rush of hundreds of lost TiB and fancies running rm -rf /var/lib, here's a summary of how to recover. Ultimately, I cheated a bit by using an old manager (let's call it "backup") to be able to leverage the Proxmox tools to create new mons and mgrs:

  1. Most importantly: Shut Ceph down.
    Code:
    systemctl stop ceph.target
  2. Make absolutely sure that you need to rebuild and a restart does not solve the problem
  3. Then, make sure the osds are actually "fine". They do not need to quorate, but they must be accessible. On each host, run
    Code:
    ceph-volume lvm activate --all
    Chances are that they are already active, but it does not hurt.
  4. Then, select the node least likely to fail during recovery and that has enough free disk space. Create the following script in /root/rebuild.sh:
    Code:
    #!/bin/bash
    
    set -xe
    # Your path. Make sure it's large enough and empty (couple of GB for a big cluster, not the sum of OSD size)
    ms=/root/mon-store
    rm -r $ms || true
    mkdir $ms || true
    # Hosts that provide OSDs - if you don't specify a host here that has OSDs, they will become "Ghost OSDs" in rebuild and data may be lost
    hosts=( "pve" "pve01" "pve02" "pve03" )
    
    # collect the cluster map from stopped OSDs - basically, this daisy-chains the gathering. Make
    # sure to start with clean folders, or the rebuild will fail when starting ceph-mon
    # (update_from_paxos assertion error) (the rm -rf is no mistake here)
    for host in "${hosts[@]}"; do
      rsync -avz $ms/. root@$host:$ms.remote
      rm -rf $ms
      ssh root@$host <<EOF
        set -x
        for osd in /var/lib/ceph/osd/ceph-*; do
          # The || true is needed so the loop does not abort when the glob matches the osd-{node} directory present on some hosts
          ceph-objectstore-tool --data-path \$osd --no-mon-config --op update-mon-db --mon-store-path $ms.remote || true
        done
    EOF
      rsync -avz --remove-source-files root@$host:$ms.remote/. $ms
    done
    
    # You probably need this one on proxmox
    KEYRING="/etc/pve/priv/ceph.mon.keyring"
    
    # Rebuild the monitor store from the collected map. If the cluster does not
    # use cephx authentication, the following steps to update the keyring with
    # the caps can be skipped, and there is no need to pass the "--keyring" option,
    # i.e. just use "ceph-monstore-tool $ms rebuild" instead.
    ceph-authtool "$KEYRING" -n mon. \
      --cap mon 'allow *'
    ceph-authtool "$KEYRING" -n client.admin \
      --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *'
    # Add one or more ceph-mgr keys to the keyring. In this case, an encoded key
    # for mgr.pve is added; the encoded key can normally be found in
    # /etc/ceph/${cluster}.${mgr_name}.keyring on the machine where ceph-mgr is
    # deployed.
    ceph-authtool "$KEYRING" --add-key '<my_mgr_key>' -n mgr.pve \
      --cap mon 'allow profile mgr' --cap osd 'allow *' --cap mds 'allow *'
    # If your monitors' ids are not sorted by IP address, please specify them in order.
    # For example, if mon 'a' is 10.0.0.3, mon 'b' is 10.0.0.2, and mon 'c' is 10.0.0.4,
    # pass "--mon-ids b a c".
    # In addition, if your monitors' ids are not single characters like 'a', 'b', 'c',
    # specify them on the command line as arguments of the "--mon-ids" option.
    # If you are not sure, check your ceph.conf for sections named like '[mon.foo]'.
    # Don't pass the "--mon-ids" option if you are using DNS SRV for looking up monitors.
    # The rebuild will fail if the provided monitor ids are not in ceph.conf or if the number of ids does not match. SET YOUR OWN monitor IDs here.
    ceph-monstore-tool $ms rebuild -- --keyring "$KEYRING" --mon-ids pve01 pve02 pve03
    
    
    # make a backup of the corrupted store.db just in case!  repeat for
    # all monitors.
    # CAREFUL here: Running the script multiple times will overwrite the backup!
    mv /var/lib/ceph/mon/ceph-pve/store.db /var/lib/ceph/mon/ceph-pve/store.db.corrupted
    
    # move the rebuilt store.db into place.  Repeat for all monitors.
    cp -r $ms/store.db /var/lib/ceph/mon/ceph-pve/store.db
    chown -R ceph:ceph /var/lib/ceph/mon/ceph-pve/store.db
    # Now, rsync the files to the other monitor hosts as well. Keep in mind that "pve" in
    # "ceph-pve" is the hostname and needs to be adjusted for every host. This is also a good
    # moment to pause and make sure that the backup exists. Personally, I prefer copying the
    # rebuilt store to every host first and then applying it manually to be absolutely sure,
    # but this can be automated.
    # Also, make sure that $ms is empty on the target node.
    mons=( "pve01" "pve02" "pve03" )  # the other monitor hosts - adjust to your setup
    for host in "${mons[@]}"; do
        rsync -avz $ms root@$host:/root/
    done
  5. ADJUST THE VALUES! While the script itself should not damage anything on a default installation, there are some nasty rm -r commands in it! Double- and triple-check before running!
  6. Run the script (after checking steps 3 and 4 again)
  7. Once the rebuilt store is copied into place, start the monitors
    Code:
    systemctl start ceph-mon@<host>
    If everything went well, you should have regained the ability to run
    Code:
    ceph -s
    and watch the progress. If you get lots of PGs in "Unknown", you probably have an outdated mon store (a quick sanity check of the rebuilt store is sketched right after this list).
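The sanity check mentioned above: before copying the rebuilt store anywhere, I found it reassuring to confirm that it contains a reasonably recent osdmap (paths are examples):
Code:
ceph-monstore-tool /root/mon-store get osdmap -- --out /tmp/osdmap
osdmaptool --print /tmp/osdmap | head -n 20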
For me, these steps were mostly enough to get Ceph running again as if nothing had ever happened (zero data loss). Since I had deleted the bootstrap-osd keyring as well, I had to recreate and re-import it, like so:
Code:
ceph auth import -i /var/lib/ceph/bootstrap-osd/ceph.keyring
along with some other minor fixes.
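One of those minor fixes, in case anyone needs it: if individual OSDs still get rejected with auth errors after the rebuild, their keys can be re-registered from the keyring each OSD keeps in its own data dir, for example:
Code:
# re-register osd.7 with the key stored on the OSD itself (id and caps are examples)
ceph auth add osd.7 mon 'allow profile osd' mgr 'allow profile osd' osd 'allow *' \
  -i /var/lib/ceph/osd/ceph-7/keyring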

For my shared CephFS, the official ceph documentation provides the steps as well:
https://docs.ceph.com/en/quincy/cephfs/recover-fs-after-mon-store-loss/

For me, Proxmox was still showing the filesystem name and the data and metadata pool names, so the commands were quite easy to derive:
Code:
# Make ceph aware of the filesystem, set "--recover" (i.e. do not recreate/overwrite metadata), but force-overwrite the configuration
ceph fs new <fs_name> <metadata_pool> <data_pool> --force --recover
# Allow MDS to take over
ceph fs set <fs_name> joinable true
And yes, many configuration variables were lost, so be prepared to reconfigure. However, all data was recovered from the FS as well.


As aaron hinted, I will probably have to keep fixing things for quite some time - a complete reinstall could potentially be easier, but I do not have the nerve for it anymore.

So, in conclusion:
- Ceph is a really resilient beast that can sustain quite a lot of abuse
- /var/lib/ is not



After having dealt with this whole mess, I have a question/suggestion for Proxmox: Would it be sensible or even possible to back up the cephx keys within the Proxmox pmxcfs or a similar system (maybe the Backup Server), such that in a similar case (loss of /var/lib) there is no need to recreate the keys? To my understanding, Proxmox/Ceph keep each key on a single node only (as is good practice with private keys), but could it be sensible to have them backed up in an encrypted form?

Cheers
 
Great to hear that you got it back up :)

If you want to regularly export the auth data to have it on the side, have a look at the ceph auth export command.
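For example, something along these lines, run periodically from a node with an admin keyring, keeps a copy on the side (the target path is just an example):
Code:
# write all auth entities with their keys and caps into a keyring file
ceph auth export > /root/ceph-auth-backup-$(date +%F).keyring
# restore later with: ceph auth import -i <file>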

The auth keys used to access the storage are located in /etc/pve/priv/ceph and are therefore available on all nodes.
The data for MONs and MDSs is stored locally in their /var/lib/ceph/... directories. It should not be much of an issue to just create new ones though, as they don't store any data by themselves.

The OSDs store their keyrings on their own block devices, so as long as an OSD is not destroyed, its keyring should still be there and ceph-volume can access it and mount the tmpfs to /var/lib/ceph/osd/...
 
