Why is my Ceph OSD dead (and won't start again)?

proxwolfe

Hi,

I have a three node PVE cluster with Ceph installed and two pools between them, each pool with one disk (OSD) on each node.

For some reason that I haven't found yet, one PVE (and Ceph) node crashed yesterday, leaving both pools degraded. After a restart the node came back online, but while Ceph was rebalancing, a Ceph OSD on another node went down. That one is now "dead":

Code:
systemctl status ceph-osd@4.service
● ceph-osd@4.service - Ceph object storage daemon osd.4
     Loaded: loaded (/lib/systemd/system/ceph-osd@.service; disabled; vendor preset: enabled)
    Drop-In: /usr/lib/systemd/system/ceph-osd@.service.d
             └─ceph-after-pve-cluster.conf
     Active: inactive (dead)

And it won't start again (error code -1). journalctl -xe tells me nothing.

So what could this be?

Thanks!
 
Hi,
What's the output of journalctl -u ceph-osd@4.service? Are there any other errors in the journal?
 
journalctl -u ceph-osd@4.service
-- Journal begins at Thu 2022-11-24 13:49:04 CET, ends at Thu 2023-05-11 16:56:35 CEST. --
-- No entries --

Nothing (relating to this device).
 
Are you sure that you are querying the logs for the correct osd on the correct host? Try systemctl status ceph-osd@*.service to query all osd services running on that node.
 
Sorry, I sometimes forget which commands I can enter on any host and which commands I need to enter on a specific host...

So here it goes:

Code:
May 11 16:17:05 node2 ceph-osd[14881]: 2023-05-11T16:17:05.142+0200 7fbc2bc1d240 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-4>
May 11 16:17:05 node2 ceph-osd[14881]: 2023-05-11T16:17:05.142+0200 7fbc2bc1d240 -1 AuthRegistry(0x5571c7aa0140) no keyring found at /var/lib/>
May 11 16:17:05 node2 ceph-osd[14881]: 2023-05-11T16:17:05.142+0200 7fbc2bc1d240 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-4>
May 11 16:17:05 node2 ceph-osd[14881]: 2023-05-11T16:17:05.142+0200 7fbc2bc1d240 -1 AuthRegistry(0x7fffc2aa7700) no keyring found at /var/lib/>
May 11 16:17:05 node2 ceph-osd[14881]: failed to fetch mon config (--no-mon-config to skip)
May 11 16:17:05 node2 systemd[1]: ceph-osd@4.service: Main process exited, code=exited, status=1/FAILURE
May 11 16:17:05 node2 systemd[1]: ceph-osd@4.service: Failed with result 'exit-code'.
May 11 16:17:15 node2 systemd[1]: ceph-osd@4.service: Scheduled restart job, restart counter is at 1.
May 11 16:17:15 node2 systemd[1]: Stopped Ceph object storage daemon osd.4.
 
These logs are cut off and incomplete, please redirect the output of journalctl to a file and attach that here.
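
The trailing `>` on each line above is where the pager chopped the line. As a rough sketch, a saved log can be checked for such cut-off lines like this. The helper name `count_truncated` and the file name in the comment are only illustrative, and it assumes that a line ending in `>` was truncated (a log line could legitimately end that way):

```shell
# Hypothetical helper: count lines that end in '>', the marker the pager
# shows when it chops a long line. Prints 0 if no such line is found.
count_truncated() {
    grep -c '>$' "$1"
}

# Possible usage on the affected node (file name is an assumption):
#   journalctl -b -u ceph-osd@4.service --no-pager > journal-osd-4.txt
#   count_truncated journal-osd-4.txt
```

When the output is redirected to a file, no pager is involved, so the count should normally be 0 and the full keyring path should be visible in the messages.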
 
How about, for starters, everything since the last boot:

Code:
-- Boot 41578566d8984f7789232b8d7aa546e9 --
May 11 16:17:05 node2 systemd[1]: Starting Ceph object storage daemon osd.4...
May 11 16:17:05 node2 systemd[1]: Started Ceph object storage daemon osd.4.
May 11 16:17:05 node2 ceph-osd[14881]: 2023-05-11T16:17:05.142+0200 7fbc2bc1d240 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-4/k>
May 11 16:17:05 node2 ceph-osd[14881]: 2023-05-11T16:17:05.142+0200 7fbc2bc1d240 -1 AuthRegistry(0x5571c7aa0140) no keyring found at /var/lib/ce>
May 11 16:17:05 node2 ceph-osd[14881]: 2023-05-11T16:17:05.142+0200 7fbc2bc1d240 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-4/k>
May 11 16:17:05 node2 ceph-osd[14881]: 2023-05-11T16:17:05.142+0200 7fbc2bc1d240 -1 AuthRegistry(0x7fffc2aa7700) no keyring found at /var/lib/ce>
May 11 16:17:05 node2 ceph-osd[14881]: failed to fetch mon config (--no-mon-config to skip)
May 11 16:17:05 node2 systemd[1]: ceph-osd@4.service: Main process exited, code=exited, status=1/FAILURE
May 11 16:17:05 node2 systemd[1]: ceph-osd@4.service: Failed with result 'exit-code'.
May 11 16:17:15 node2 systemd[1]: ceph-osd@4.service: Scheduled restart job, restart counter is at 1.
May 11 16:17:15 node2 systemd[1]: Stopped Ceph object storage daemon osd.4.
May 11 16:17:15 node2 systemd[1]: Starting Ceph object storage daemon osd.4...
May 11 16:17:15 node2 systemd[1]: Started Ceph object storage daemon osd.4.
May 11 16:17:15 node2 ceph-osd[14979]: 2023-05-11T16:17:15.386+0200 7f052fc20240 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-4/k>
May 11 16:17:15 node2 ceph-osd[14979]: 2023-05-11T16:17:15.386+0200 7f052fc20240 -1 AuthRegistry(0x5571243e8140) no keyring found at /var/lib/ce>
May 11 16:17:15 node2 ceph-osd[14979]: 2023-05-11T16:17:15.386+0200 7f052fc20240 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-4/k>
May 11 16:17:15 node2 ceph-osd[14979]: 2023-05-11T16:17:15.386+0200 7f052fc20240 -1 AuthRegistry(0x7fff974f42d0) no keyring found at /var/lib/ce>
May 11 16:17:15 node2 ceph-osd[14979]: failed to fetch mon config (--no-mon-config to skip)
May 11 16:17:15 node2 systemd[1]: ceph-osd@4.service: Main process exited, code=exited, status=1/FAILURE
May 11 16:17:15 node2 systemd[1]: ceph-osd@4.service: Failed with result 'exit-code'.
May 11 16:17:25 node2 systemd[1]: ceph-osd@4.service: Scheduled restart job, restart counter is at 2.
May 11 16:17:25 node2 systemd[1]: Stopped Ceph object storage daemon osd.4.
May 11 16:17:25 node2 systemd[1]: Starting Ceph object storage daemon osd.4...
May 11 16:17:25 node2 systemd[1]: Started Ceph object storage daemon osd.4.
May 11 16:17:25 node2 ceph-osd[15071]: 2023-05-11T16:17:25.646+0200 7f129fe5d240 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-4/k>
May 11 16:17:25 node2 ceph-osd[15071]: 2023-05-11T16:17:25.646+0200 7f129fe5d240 -1 AuthRegistry(0x556deb03e140) no keyring found at /var/lib/ce>
May 11 16:17:25 node2 ceph-osd[15071]: 2023-05-11T16:17:25.646+0200 7f129fe5d240 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-4/k>
May 11 16:17:25 node2 ceph-osd[15071]: 2023-05-11T16:17:25.646+0200 7f129fe5d240 -1 AuthRegistry(0x7ffff9f51580) no keyring found at /var/lib/ce>
May 11 16:17:25 node2 ceph-osd[15071]: failed to fetch mon config (--no-mon-config to skip)
May 11 16:17:25 node2 systemd[1]: ceph-osd@4.service: Main process exited, code=exited, status=1/FAILURE
May 11 16:17:25 node2 systemd[1]: ceph-osd@4.service: Failed with result 'exit-code'.
May 11 16:17:35 node2 systemd[1]: ceph-osd@4.service: Scheduled restart job, restart counter is at 3.
May 11 16:17:35 node2 systemd[1]: Stopped Ceph object storage daemon osd.4.
May 11 16:17:35 node2 systemd[1]: ceph-osd@4.service: Start request repeated too quickly.
May 11 16:17:35 node2 systemd[1]: ceph-osd@4.service: Failed with result 'exit-code'.
May 11 16:17:35 node2 systemd[1]: Failed to start Ceph object storage daemon osd.4.
May 11 16:24:50 node2 systemd[1]: ceph-osd@4.service: Start request repeated too quickly.
May 11 16:24:50 node2 systemd[1]: ceph-osd@4.service: Failed with result 'exit-code'.
May 11 16:24:50 node2 systemd[1]: Failed to start Ceph object storage daemon osd.4.
 
Your output is still cut off, but I assume the keyring file is missing. Please run journalctl -b -u ceph-osd@4.service > journal-osd-4.txt and attach the resulting file.

If the keyring file is corrupt or missing, you can try the following:
Bash:
# Export key
ceph auth export osd.4 -o keyring.export
# Adapt user and permissions
chown ceph:ceph keyring.export
chmod 0600 keyring.export
# Remove caps lines
sed -i '/^\s*caps/d' keyring.export
# Put keyring into expected location
mv keyring.export /var/lib/ceph/osd/ceph-4/keyring
# Reset failed counter for service
systemctl reset-failed ceph-osd@4.service
# Start osd service
systemctl start ceph-osd@4.service
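
To see what the `sed` step above does before touching the real keyring, it can be tried on a throwaway file first. This is only a sketch: the sample content is an assumption about what the exported file roughly looks like, and the key is a placeholder, not a real one.

```shell
# Throwaway sample, assumed to resemble the output of `ceph auth export osd.4`.
# The key value is a placeholder.
sample_keyring="$(mktemp)"
cat > "$sample_keyring" <<'EOF'
[osd.4]
	key = PLACEHOLDER-NOT-A-REAL-KEY==
	caps mgr = "allow profile osd"
	caps mon = "allow profile osd"
	caps osd = "allow *"
EOF

# Same expression as in the steps above: delete every line that starts
# with optional whitespace followed by "caps" (GNU sed).
sed -i '/^\s*caps/d' "$sample_keyring"

# Show what is left, then clean up.
stripped="$(cat "$sample_keyring")"
printf '%s\n' "$stripped"
rm -f "$sample_keyring"
```

Only the `[osd.4]` section header and the `key = ...` line should remain, which is the form the file has when it is moved to /var/lib/ceph/osd/ceph-4/keyring in the steps above.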

Hope this helps!