Hello everyone
So, somehow, I was able to get the system to not crash. I'm still not entirely sure what I did differently, but I do have a suspicion.
For future reference: if anyone else wants to experience the rush of hundreds of lost TiB and fancies running an rm -rf /var/lib, here is a summary of how to recover. Ultimately, I cheated a bit by using an old manager (let's call it "backup") so I could leverage the Proxmox tools to create new mons and mgrs:
- Most importantly: Shut Ceph down.
Code:
systemctl stop ceph.target
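If you want to double-check that no Ceph daemons are still running, something like this should do (a read-only sanity check; the unit names are the stock ones):
Code:
# no mon/mgr/osd/mds should show up as running anymore
systemctl list-units 'ceph*' --state=running
ps aux | grep -E 'ceph-(mon|mgr|osd|mds)' | grep -v grep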
- Make absolutely sure that you actually need to rebuild and that a restart does not solve the problem
- Then, make sure the OSDs are actually "fine". The cluster does not need quorum at this point, but the OSD data must be accessible. On each host, run
Code:
ceph-volume lvm activate --all
Chances are that they are already activated, but running the command again does not hurt.
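To verify that the OSD volumes really are active, a quick, read-only check (paths are the defaults):
Code:
# every OSD should be listed with its LVs ...
ceph-volume lvm list
# ... and have a mounted directory under /var/lib/ceph/osd/
ls /var/lib/ceph/osd/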
- Then, select the node that is least likely to fail during recovery and that has enough free disk space, and create the following script as /root/rebuild.sh:
Code:
#!/bin/bash
set -xe
# Your path. Make sure it's large enough and empty (a couple of GB even for a big cluster, not the sum of the OSD sizes)
ms=/root/mon-store
rm -r $ms || true
mkdir $ms || true
# Hosts that provide OSDs - if you omit a host that has OSDs here, its OSDs will become "ghost OSDs" after the rebuild and data may be lost
hosts=( "pve" "pve01" "pve02" "pve03" )
# The other monitor hosts that should receive a copy of the rebuilt store at the end (match --mon-ids below)
mons=( "pve01" "pve02" "pve03" )
# collect the cluster map from stopped OSDs - basically, this daisy-chains the gathering. Make
# sure to start with clean folders, or the rebuild will fail when starting ceph-mon
# (update_from_paxos assertion error) (the rm -rf is no mistake here)
for host in "${hosts[@]}"; do
rsync -avz $ms/. root@$host:$ms.remote
rm -rf $ms
ssh root@$host <<EOF
set -x
for osd in /var/lib/ceph/osd/ceph-*; do
# The || true is needed so the loop does not abort when ceph-objectstore-tool hits the osd-{node} directory present on some hosts
ceph-objectstore-tool --data-path \$osd --no-mon-config --op update-mon-db --mon-store-path $ms.remote || true
done
EOF
rsync -avz --remove-source-files root@$host:$ms.remote/. $ms
done
# You probably need this one on proxmox
KEYRING="/etc/pve/priv/ceph.mon.keyring"
# rebuild the monitor store from the collected map, if the cluster does not
# use cephx authentication, we can skip the following steps to update the
# keyring with the caps, and there is no need to pass the "--keyring" option.
# i.e. just use "ceph-monstore-tool $ms rebuild" instead
ceph-authtool "$KEYRING" -n mon. \
--cap mon 'allow *'
ceph-authtool "$KEYRING" -n client.admin \
--cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *'
# add one or more ceph-mgr keys to the keyring. In this case, an encoded key
# for mgr.pve is added; you can find the encoded key in
# /etc/ceph/${cluster}.${mgr_name}.keyring on the machine where the ceph-mgr is
# deployed
ceph-authtool "$KEYRING" --add-key '<my_mgr_key>' -n mgr.pve \
--cap mon 'allow profile mgr' --cap osd 'allow *' --cap mds 'allow *'
# If your monitors' ids are not sorted by IP address, specify them in order.
# For example, if mon 'a' is 10.0.0.3, mon 'b' is 10.0.0.2, and mon 'c' is 10.0.0.4,
# pass "--mon-ids b a c".
# In addition, if your monitors' ids are not single characters like 'a', 'b', 'c',
# specify them explicitly via the "--mon-ids" option. If you are not sure, check your
# ceph.conf for sections named like '[mon.foo]'. Do not pass the "--mon-ids" option
# if you are using DNS SRV records for looking up monitors.
# This will fail if the given monitors are not in the ceph.conf or if the number of
# IDs does not match. SET YOUR OWN monitor IDs here!
ceph-monstore-tool $ms rebuild -- --keyring "$KEYRING" --mon-ids pve01 pve02 pve03
# make a backup of the corrupted store.db just in case! repeat for
# all monitors.
# CAREFUL here: Running the script multiple times will overwrite the backup!
mv /var/lib/ceph/mon/ceph-pve/store.db /var/lib/ceph/mon/ceph-pve/store.db.corrupted
# move rebuild store.db into place. repeat for all monitors.
cp -r $ms/store.db /var/lib/ceph/mon/ceph-pve/store.db
chown -R ceph:ceph /var/lib/ceph/mon/ceph-pve/store.db
# Now, rsync the files to the other monitor hosts as well. Keep in mind that "pve" in "ceph-pve"
# is the hostname and needs to be adjusted for every host. This is also a good moment to pause
# and make sure that the backup exists. Personally, I prefer copying the rebuilt store to every host
# first and then applying it manually to be absolutely sure (see the example after the script),
# but this can be automated.
# Also, make sure that $ms is empty (or absent) on the target node
for host in "${mons[@]}"; do
rsync -avz $ms root@$host:/root/
done
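For reference, the manual part on each of the other monitor hosts then looks roughly like this (pve01 is just an example hostname, adjust the paths per node):
Code:
# on pve01, after the rsync from the rebuild node has arrived in /root/mon-store
mv /var/lib/ceph/mon/ceph-pve01/store.db /var/lib/ceph/mon/ceph-pve01/store.db.corrupted
cp -r /root/mon-store/store.db /var/lib/ceph/mon/ceph-pve01/store.db
chown -R ceph:ceph /var/lib/ceph/mon/ceph-pve01/store.db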
- ADJUST THE VALUES! While the script itself should not damage anything on a default installation, it does contain some nasty rm -r commands! Double- and triple-check before running!
- Run the script (after double-checking the host lists and paths once more)
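Before starting the monitors, you can optionally sanity-check the rebuilt store. I believe the following should print the recreated monmap (exact syntax may vary slightly between Ceph releases):
Code:
# inspect the rebuilt monmap before touching the real mon directories
ceph-monstore-tool /root/mon-store get monmap -- --out /tmp/monmap
monmaptool --print /tmp/monmap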
- Once copied into place, start the monitors
Code:
systemctl start ceph-mon@<host>
If everything is good, you should have regained the ability to run ceph -s and watch the recovery progress. If you get lots of PGs in "unknown", you probably have an outdated mon store.
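If the monitors do not immediately form a quorum, the usual places to look are the quorum status and the mon logs, e.g.:
Code:
ceph quorum_status --format json-pretty
journalctl -u ceph-mon@<host> -e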
For me, these steps were mostly enough to get ceph to run again as if nothing ever happened (zero data loss). Since I had deleted the bootstrap-osd keyring as well, I had to recreate and re-import it like so
Code:
ceph auth import -i /var/lib/ceph/bootstrap-osd/ceph.keyring
, as well as some other minor fixes.
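Since the original keyring file under /var/lib/ceph/bootstrap-osd/ was gone too, it had to be recreated with a fresh key before that import, roughly like this (a sketch; the caps shown are the standard bootstrap-osd profile, adjust if your setup differs):
Code:
# recreate the bootstrap-osd keyring file with a new key, then import it as shown above
ceph-authtool --create-keyring /var/lib/ceph/bootstrap-osd/ceph.keyring \
    --gen-key -n client.bootstrap-osd \
    --cap mon 'profile bootstrap-osd' --cap mgr 'allow r'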
For my shared CephFS, the official ceph documentation provides the steps as well:
https://docs.ceph.com/en/quincy/cephfs/recover-fs-after-mon-store-loss/
For me, Proxmox was still showing the filesystem name as well as the data and metadata pool names, so the commands were quite easy to derive:
Code:
# Make ceph aware of the filesystem, set "--recover" (i.e. do not recreate/overwrite metadata), but force-overwrite the configuration
ceph fs new <fs_name> <metadata_pool> <data_pool> --force --recover
# Allow MDS to take over
ceph fs set <fs_name> joinable true
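If no MDS picks the filesystem up after that, restarting the MDS service and checking the status usually shows what is going on (the <host> placeholder follows the same convention as above):
Code:
systemctl restart ceph-mds@<host>
ceph fs status
ceph mds stat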
And yes, many configuration variables were lost, so be prepared to reconfigure. However, all data was recovered from the FS as well.
As aaron hinted, I will probably have to keep fixing things for quite some time - a complete reinstall could potentially be easier, but I do not have the nerve for it anymore.
So, in conclusion:
- Ceph is a really resilient beast that can sustain quite a lot of abuse
- /var/lib/ is not
After having dealt with that shitty thing, I have a question/suggestion for Proxmox: Would it be sensible/possible at all to back up the cephx keys within pmxcfs or a similar system (maybe the Backup Server), so that in a similar case (loss of /var/lib) there is no need to recreate the keys? To my understanding, Proxmox/Ceph keep each key on a single node only (as is good practice with private keys), but could it be sensible to have them backed up in an encrypted form?
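Until something like that exists, a crude manual approximation of the idea could look like this (just a sketch of the suggestion, not an official Proxmox mechanism; paths and passphrase handling are up to you):
Code:
# dump all cephx keys and keep an encrypted copy on pmxcfs (/etc/pve is replicated to all nodes)
ceph auth export > /root/ceph-auth-backup.keyring
gpg --symmetric --cipher-algo AES256 -o /etc/pve/priv/ceph-auth-backup.gpg /root/ceph-auth-backup.keyring
shred -u /root/ceph-auth-backup.keyring
Restoring after a disaster would then just be a gpg -d plus ceph auth import once the mons are back.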
Cheers