Hi aaron
Thank you very much for your quick reply!
Running ceph-volume returns
Code:
root@pve:~# ceph-volume lvm activate --all
--> OSD ID 7 FSID 8321bfa2-36ab-4349-b7df-872b6aae5d8d process is active. Skipping activation
--> OSD ID 4 FSID d9c68ecd-c5e5-4697-8454-392b3a5ff5c6 process is active. Skipping activation
--> OSD ID 9 FSID 4cda8130-162d-4c45-82b8-669909b9c41c process is active. Skipping activation
and the /var/lib/ceph/osd/ceph-{id}/ directories are populated and look fine.
However, getting the services back up is proving difficult.
systemctl list-units --failed returns
Code:
root@pve:~# systemctl list-units --failed
UNIT LOAD ACTIVE SUB DESCRIPTION
● ceph-mds@pve.service loaded failed failed Ceph metadata server daemon
● ceph-mgr@pve.service loaded failed failed Ceph cluster manager daemon
● ceph-mon@pve.service loaded failed failed Ceph cluster monitor daemon
● ceph-osd@pve.service loaded failed failed Ceph object storage daemon osd.pve
The mon crashes straight away:
Code:
Aug 16 09:38:01 pve systemd[1]: ceph-mon@pve.service: Failed with result 'signal'.
Aug 16 09:38:11 pve systemd[1]: ceph-mon@pve.service: Scheduled restart job, restart counter is at 4.
Aug 16 09:38:11 pve systemd[1]: Stopped Ceph cluster monitor daemon.
Aug 16 09:38:11 pve systemd[1]: Started Ceph cluster monitor daemon.
Aug 16 09:38:11 pve ceph-mon[2436113]: ./src/mon/AuthMonitor.cc: In function 'virtual void AuthMonitor::update_from_paxos(bool*)' thread 7fcb85a94700 time 2022-08-16T09:38:11.852524+0200
Aug 16 09:38:11 pve ceph-mon[2436113]: ./src/mon/AuthMonitor.cc: 316: FAILED ceph_assert(ret == 0)
Aug 16 09:38:11 pve ceph-mon[2436113]: ceph version 16.2.9 (a569859f5e07da0c4c39da81d5fb5675cd95da49) pacific (stable)
Aug 16 09:38:11 pve ceph-mon[2436113]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x124) [0x7fcb86969fde]
Aug 16 09:38:11 pve ceph-mon[2436113]: 2: /usr/lib/ceph/libceph-common.so.2(+0x251169) [0x7fcb8696a169]
Aug 16 09:38:11 pve ceph-mon[2436113]: 3: (AuthMonitor::update_from_paxos(bool*)+0x18fc) [0x55a6672f77fc]
Aug 16 09:38:11 pve ceph-mon[2436113]: 4: (Monitor::refresh_from_paxos(bool*)+0x163) [0x55a667266703]
Aug 16 09:38:11 pve ceph-mon[2436113]: 5: (Monitor::preinit()+0x9af) [0x55a667292b0f]
Aug 16 09:38:11 pve ceph-mon[2436113]: 6: main()
Aug 16 09:38:11 pve ceph-mon[2436113]: 7: __libc_start_main()
Aug 16 09:38:11 pve ceph-mon[2436113]: 8: _start()
Aug 16 09:38:11 pve ceph-mon[2436113]: 2022-08-16T09:38:11.850+0200 7fcb85a94700 -1 ./src/mon/AuthMonitor.cc: In function 'virtual void AuthMonitor::update_from_paxos(bool*)' thread 7fcb85a94700 time 2022-08-16T09:38:11.852524+0200
Aug 16 09:38:11 pve ceph-mon[2436113]: ./src/mon/AuthMonitor.cc: 316: FAILED ceph_assert(ret == 0)
Aug 16 09:38:11 pve ceph-mon[2436113]: ceph version 16.2.9 (a569859f5e07da0c4c39da81d5fb5675cd95da49) pacific (stable)
Aug 16 09:38:11 pve ceph-mon[2436113]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x124) [0x7fcb86969fde]
Aug 16 09:38:11 pve ceph-mon[2436113]: 2: /usr/lib/ceph/libceph-common.so.2(+0x251169) [0x7fcb8696a169]
Aug 16 09:38:11 pve ceph-mon[2436113]: 3: (AuthMonitor::update_from_paxos(bool*)+0x18fc) [0x55a6672f77fc]
Aug 16 09:38:11 pve ceph-mon[2436113]: 4: (Monitor::refresh_from_paxos(bool*)+0x163) [0x55a667266703]
Aug 16 09:38:11 pve ceph-mon[2436113]: 5: (Monitor::preinit()+0x9af) [0x55a667292b0f]
Aug 16 09:38:11 pve ceph-mon[2436113]: 6: main()
Aug 16 09:38:11 pve ceph-mon[2436113]: 7: __libc_start_main()
Aug 16 09:38:11 pve ceph-mon[2436113]: 8: _start()
Aug 16 09:38:11 pve ceph-mon[2436113]: *** Caught signal (Aborted) **
Aug 16 09:38:11 pve ceph-mon[2436113]: in thread 7fcb85a94700 thread_name:ceph-mon
Aug 16 09:38:11 pve ceph-mon[2436113]: ceph version 16.2.9 (a569859f5e07da0c4c39da81d5fb5675cd95da49) pacific (stable)
Aug 16 09:38:11 pve ceph-mon[2436113]: 1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x14140) [0x7fcb86451140]
Aug 16 09:38:11 pve ceph-mon[2436113]: 2: gsignal()
Aug 16 09:38:11 pve ceph-mon[2436113]: 3: abort()
Aug 16 09:38:11 pve ceph-mon[2436113]: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x16e) [0x7fcb8696a028]
Aug 16 09:38:11 pve ceph-mon[2436113]: 5: /usr/lib/ceph/libceph-common.so.2(+0x251169) [0x7fcb8696a169]
Aug 16 09:38:11 pve ceph-mon[2436113]: 6: (AuthMonitor::update_from_paxos(bool*)+0x18fc) [0x55a6672f77fc]
Aug 16 09:38:11 pve ceph-mon[2436113]: 7: (Monitor::refresh_from_paxos(bool*)+0x163) [0x55a667266703]
Aug 16 09:38:11 pve ceph-mon[2436113]: 8: (Monitor::preinit()+0x9af) [0x55a667292b0f]
Aug 16 09:38:11 pve ceph-mon[2436113]: 9: main()
Aug 16 09:38:11 pve ceph-mon[2436113]: 10: __libc_start_main()
Aug 16 09:38:11 pve ceph-mon[2436113]: 11: _start()
Aug 16 09:38:11 pve ceph-mon[2436113]: 2022-08-16T09:38:11.850+0200 7fcb85a94700 -1 *** Caught signal (Aborted) **
Aug 16 09:38:11 pve ceph-mon[2436113]: in thread 7fcb85a94700 thread_name:ceph-mon
Aug 16 09:38:11 pve ceph-mon[2436113]: ceph version 16.2.9 (a569859f5e07da0c4c39da81d5fb5675cd95da49) pacific (stable)
Aug 16 09:38:11 pve ceph-mon[2436113]: 1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x14140) [0x7fcb86451140]
Aug 16 09:38:11 pve ceph-mon[2436113]: 2: gsignal()
Aug 16 09:38:11 pve ceph-mon[2436113]: 3: abort()
Of course, this looks like a keyring / Paxos store error, which I'll probably have to debug separately.
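For when I get to it, my rough plan is to run the mon in the foreground with verbose logging (the debug levels here are just a guess on my part):
Code:
# with the ceph-mon@pve unit stopped, run the mon in the foreground,
# logging to stderr, with verbose monitor/auth/paxos output
ceph-mon -d -i pve --debug_mon 20 --debug_auth 20 --debug_paxos 20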
On the other hand, the Proxmox OSD wrapper seems to fail as well:
Code:
journalctl -u ceph-osd@pve
Aug 16 10:03:45 pve systemd[1]: Starting Ceph object storage daemon osd.pve...
Aug 16 10:03:45 pve ceph-osd-prestart.sh[2451367]: OSD data directory /var/lib/ceph/osd/ceph-pve does not exist; bailing out.
Aug 16 10:03:45 pve systemd[1]: ceph-osd@pve.service: Control process exited, code=exited, status=1/FAILURE
Aug 16 10:03:45 pve systemd[1]: ceph-osd@pve.service: Failed with result 'exit-code'.
Aug 16 10:03:45 pve systemd[1]: Failed to start Ceph object storage daemon osd.pve.
Of course, I created that directory as well and chowned it to ceph:ceph, but the service does not recover and still claims the directory does not exist. What should this directory normally contain?
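For comparison, I'm looking at one of the OSD directories that ceph-volume did populate (osd.7 here just as an example) next to the empty one the unit complains about. As far as I understand, an activated bluestore OSD dir should hold a block symlink plus small files like keyring, type and whoami, but please correct me if that's wrong:
Code:
ls -l /var/lib/ceph/osd/ceph-7     # populated by ceph-volume lvm activate
ls -l /var/lib/ceph/osd/ceph-pve   # the directory ceph-osd@pve complains about; I created it manually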
For reference, this is the script I used to rebuild the monitor store:
Code:
#!/bin/bash
set -xe
ms=/root/mon-store
mkdir $ms || true
hosts=( "pve01" "pve02" "pve" )
# collect the cluster map from stopped OSDs
for host in "${hosts[@]}"; do
  rsync -avz $ms/. root@$host:$ms.remote
  rm -rf $ms
  ssh root@$host <<EOF
  set -x
  for osd in /var/lib/ceph/osd/ceph-*; do
    ceph-objectstore-tool --data-path \$osd --no-mon-config --op update-mon-db --mon-store-path $ms.remote
  done
EOF
  rsync -avz root@$host:$ms.remote/. $ms
done
# rebuild the monitor store from the collected map. If the cluster does not
# use cephx authentication, we can skip the following steps to update the
# keyring with the caps, and there is no need to pass the "--keyring" option,
# i.e. just use "ceph-monstore-tool $ms rebuild" instead.
ceph-authtool /etc/pve/priv/ceph.mon.keyring -n mon. \
--cap mon 'allow *'
ceph-authtool /etc/pve/priv/ceph.mon.keyring -n client.admin \
--cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *'
# add one or more ceph-mgr keys to the keyring. In this case, an encoded key
# for mgr.pve is added; the encoded key can be found in
# /etc/ceph/${cluster}.${mgr_name}.keyring on the machine where ceph-mgr is
# deployed.
ceph-authtool /etc/pve/priv/ceph.mon.keyring --add-key 'AQDN8kBe9PLWARAAZwxXMr+n85SBYbSlLcZnMA==' -n mgr.pve \
--cap mon 'allow profile mgr' --cap osd 'allow *' --cap mds 'allow *'
# If your monitors' ids are not sorted by IP address, specify them in order.
# For example, if mon 'a' is 10.0.0.3, mon 'b' is 10.0.0.2, and mon 'c' is 10.0.0.4,
# pass "--mon-ids b a c".
# In addition, if your monitors' ids are not single characters like 'a', 'b', 'c',
# specify them on the command line as arguments of the "--mon-ids" option. If you
# are not sure, check your ceph.conf for sections named like '[mon.foo]'. Don't
# pass the "--mon-ids" option if you are using DNS SRV to look up monitors.
ceph-monstore-tool $ms rebuild -- --keyring /etc/pve/priv/ceph.mon.keyring --mon-ids pve pve01 pve02 pve03
# make a backup of the corrupted store.db just in case! repeat for
# all monitors.
mv /var/lib/ceph/mon/ceph-pve/store.db{,.corrupted}
# move the rebuilt store.db into place. repeat for all monitors.
mv $ms/store.db /var/lib/ceph/mon/ceph-pve/store.db
chown -R ceph:ceph /var/lib/ceph/mon/ceph-pve/store.db
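Before moving the rebuilt store into place, I also tried to sanity-check it along these lines. I'm not sure these are the right checks, so treat them as a sketch:
Code:
# list the keys in the rebuilt store to see whether the auth entries made it in
ceph-monstore-tool /root/mon-store dump-keys | grep -i auth
# extract and print the monmap from the rebuilt store
ceph-monstore-tool /root/mon-store get monmap -- --out /tmp/monmap.rebuilt
monmaptool --print /tmp/monmap.rebuilt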
The mgr key is probably not correct; however, I have not yet found a way to recover the mgr key. Does this need to be done before the mon rebuild?
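One idea I had, but have not tried yet: if the original mgr keyring is still on disk on the node that ran the mgr, the real key should be in there instead of the placeholder I used above. I'm assuming the default location here and haven't double-checked it on PVE:
Code:
# default mgr keyring location for cluster "ceph" and mgr id "pve"
cat /var/lib/ceph/mgr/ceph-pve/keyring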
Thanks again for the quick response!
/EDIT:
I was able to find a monitor that had been offline, and by using monmaptool I was able to remove the other monitors and use this old monitor to at least get ceph status working again.
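From memory, the monmaptool steps were roughly the following (the ids are from my setup and may be slightly off; "pve" is the surviving mon here):
Code:
# with all mons stopped: extract the monmap from the surviving mon,
# drop the dead mons from it and inject it back
ceph-mon -i pve --extract-monmap /tmp/monmap
monmaptool /tmp/monmap --rm pve01 --rm pve02
ceph-mon -i pve --inject-monmap /tmp/monmap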
However, I still can't bring up the rebuilt store: after copying it into place, the monitor errors out with the update_from_paxos assert.
While it is running, though, Ceph continuously marks the PGs as "unknown" and throws OSDs out of the cluster ("down"), and I'm unable to re-add them, even though the ceph-osd@<id>.service units are running.
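For reference, this is roughly what I'm checking while that happens (osd.7 is just an example; the daemon itself looks healthy in the journal):
Code:
ceph -s
ceph osd tree                      # the OSDs are listed but stay "down"
systemctl status ceph-osd@7
journalctl -u ceph-osd@7 -n 50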
Do you know why this Paxos assert keeps acting up?