Why is my Ceph OSD dead (and won't start again)?

proxwolfe

Hi,

I have a three node PVE cluster with Ceph installed and two pools between them, each pool with one disk (OSD) on each node.

For some reason that I haven't found yet, one PVE (and Ceph) node crashed yesterday, leaving both pools degraded. After a restart the node came back online, but while Ceph was rebalancing, a Ceph OSD on another node went down. This one is now "dead":

Code:
systemctl status ceph-osd@4.service
● ceph-osd@4.service - Ceph object storage daemon osd.4
     Loaded: loaded (/lib/systemd/system/ceph-osd@.service; disabled; vendor preset: enabled)
    Drop-In: /usr/lib/systemd/system/ceph-osd@.service.d
             └─ceph-after-pve-cluster.conf
     Active: inactive (dead)

And it won't start again (error code -1). journalctl -xe tells me nothing.

So what could this be?

Thanks!
 
Hi,
What's the output of journalctl -u ceph-osd@4.service? Are there any other errors in the journal?
 
journalctl -u ceph-osd@4.service
-- Journal begins at Thu 2022-11-24 13:49:04 CET, ends at Thu 2023-05-11 16:56:35 CEST. --
-- No entries --

Nothing (relating to this device).
 
Are you sure that you are querying the logs for the correct osd on the correct host? Try systemctl status ceph-osd@*.service to query all osd services running on that node.
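
If it is unclear which host carries a given OSD, a small wrapper along these lines may help. This is only a sketch: `ceph osd find` is a regular Ceph CLI command that can be run from any node holding an admin keyring, while the function name `locate_osd` is made up for this example.

```shell
#!/bin/sh
# locate_osd is a hypothetical helper name, not part of Ceph or PVE.
# It prints where an OSD sits in the cluster, so you know on which
# node to run systemctl/journalctl for that OSD.
locate_osd() {
    osd_id="$1"
    if ! command -v ceph >/dev/null 2>&1; then
        echo "ceph CLI not found on this host" >&2
        return 127
    fi
    # Cluster-wide query: works from any node with an admin keyring.
    ceph osd find "$osd_id"
}

# Usage (cluster-wide, any node):
#   locate_osd 4
# Then, on the node reported there (node-local commands):
#   systemctl status 'ceph-osd@*.service'
#   journalctl -u ceph-osd@4.service
```

The systemd unit and its journal only exist on the node that hosts the OSD, which is why the journal query on the other node returned "No entries".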
 
Sorry, I sometimes forget which commands I can enter on any host and which commands I need to enter on a specific host...

So here it goes:

Code:
May 11 16:17:05 node2 ceph-osd[14881]: 2023-05-11T16:17:05.142+0200 7fbc2bc1d240 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-4>
May 11 16:17:05 node2 ceph-osd[14881]: 2023-05-11T16:17:05.142+0200 7fbc2bc1d240 -1 AuthRegistry(0x5571c7aa0140) no keyring found at /var/lib/>
May 11 16:17:05 node2 ceph-osd[14881]: 2023-05-11T16:17:05.142+0200 7fbc2bc1d240 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-4>
May 11 16:17:05 node2 ceph-osd[14881]: 2023-05-11T16:17:05.142+0200 7fbc2bc1d240 -1 AuthRegistry(0x7fffc2aa7700) no keyring found at /var/lib/>
May 11 16:17:05 node2 ceph-osd[14881]: failed to fetch mon config (--no-mon-config to skip)
May 11 16:17:05 node2 systemd[1]: ceph-osd@4.service: Main process exited, code=exited, status=1/FAILURE
May 11 16:17:05 node2 systemd[1]: ceph-osd@4.service: Failed with result 'exit-code'.
May 11 16:17:15 node2 systemd[1]: ceph-osd@4.service: Scheduled restart job, restart counter is at 1.
May 11 16:17:15 node2 systemd[1]: Stopped Ceph object storage daemon osd.4.
 
These logs are cut off and incomplete. Please redirect the output of journalctl to a file and attach it here.
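
The trailing `>` on the lines above is most likely the pager's marker for a line chopped at the terminal width. As a rough sketch, a saved log can be checked for such lines before posting it; `count_truncated` is a made-up helper name, and the sample consists of two lines quoted in this thread.

```shell
#!/bin/sh
# count_truncated is a hypothetical helper: it counts lines ending in '>',
# the marker the pager shows when a journal line is cut at terminal width.
count_truncated() {
    grep -c '>$' "$1"
}

# Sample built from two lines quoted in this thread.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
May 11 16:17:05 node2 ceph-osd[14881]: 2023-05-11T16:17:05.142+0200 7fbc2bc1d240 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-4>
May 11 16:17:05 node2 ceph-osd[14881]: failed to fetch mon config (--no-mon-config to skip)
EOF
truncated="$(count_truncated "$sample")"
echo "truncated lines: $truncated"
rm -f "$sample"

# On the OSD's node, the full lines can be written to a file with:
#   journalctl -b -u ceph-osd@4.service --no-pager > journal-osd-4.txt
```

When the output is redirected to a file, journalctl does not start the pager, so the lines are written in full.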
 
How about, for starters, everything since the last boot:

Code:
-- Boot 41578566d8984f7789232b8d7aa546e9 --
May 11 16:17:05 node2 systemd[1]: Starting Ceph object storage daemon osd.4...
May 11 16:17:05 node2 systemd[1]: Started Ceph object storage daemon osd.4.
May 11 16:17:05 node2 ceph-osd[14881]: 2023-05-11T16:17:05.142+0200 7fbc2bc1d240 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-4/k>
May 11 16:17:05 node2 ceph-osd[14881]: 2023-05-11T16:17:05.142+0200 7fbc2bc1d240 -1 AuthRegistry(0x5571c7aa0140) no keyring found at /var/lib/ce>
May 11 16:17:05 node2 ceph-osd[14881]: 2023-05-11T16:17:05.142+0200 7fbc2bc1d240 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-4/k>
May 11 16:17:05 node2 ceph-osd[14881]: 2023-05-11T16:17:05.142+0200 7fbc2bc1d240 -1 AuthRegistry(0x7fffc2aa7700) no keyring found at /var/lib/ce>
May 11 16:17:05 node2 ceph-osd[14881]: failed to fetch mon config (--no-mon-config to skip)
May 11 16:17:05 node2 systemd[1]: ceph-osd@4.service: Main process exited, code=exited, status=1/FAILURE
May 11 16:17:05 node2 systemd[1]: ceph-osd@4.service: Failed with result 'exit-code'.
May 11 16:17:15 node2 systemd[1]: ceph-osd@4.service: Scheduled restart job, restart counter is at 1.
May 11 16:17:15 node2 systemd[1]: Stopped Ceph object storage daemon osd.4.
May 11 16:17:15 node2 systemd[1]: Starting Ceph object storage daemon osd.4...
May 11 16:17:15 node2 systemd[1]: Started Ceph object storage daemon osd.4.
May 11 16:17:15 node2 ceph-osd[14979]: 2023-05-11T16:17:15.386+0200 7f052fc20240 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-4/k>
May 11 16:17:15 node2 ceph-osd[14979]: 2023-05-11T16:17:15.386+0200 7f052fc20240 -1 AuthRegistry(0x5571243e8140) no keyring found at /var/lib/ce>
May 11 16:17:15 node2 ceph-osd[14979]: 2023-05-11T16:17:15.386+0200 7f052fc20240 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-4/k>
May 11 16:17:15 node2 ceph-osd[14979]: 2023-05-11T16:17:15.386+0200 7f052fc20240 -1 AuthRegistry(0x7fff974f42d0) no keyring found at /var/lib/ce>
May 11 16:17:15 node2 ceph-osd[14979]: failed to fetch mon config (--no-mon-config to skip)
May 11 16:17:15 node2 systemd[1]: ceph-osd@4.service: Main process exited, code=exited, status=1/FAILURE
May 11 16:17:15 node2 systemd[1]: ceph-osd@4.service: Failed with result 'exit-code'.
May 11 16:17:25 node2 systemd[1]: ceph-osd@4.service: Scheduled restart job, restart counter is at 2.
May 11 16:17:25 node2 systemd[1]: Stopped Ceph object storage daemon osd.4.
May 11 16:17:25 node2 systemd[1]: Starting Ceph object storage daemon osd.4...
May 11 16:17:25 node2 systemd[1]: Started Ceph object storage daemon osd.4.
May 11 16:17:25 node2 ceph-osd[15071]: 2023-05-11T16:17:25.646+0200 7f129fe5d240 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-4/k>
May 11 16:17:25 node2 ceph-osd[15071]: 2023-05-11T16:17:25.646+0200 7f129fe5d240 -1 AuthRegistry(0x556deb03e140) no keyring found at /var/lib/ce>
May 11 16:17:25 node2 ceph-osd[15071]: 2023-05-11T16:17:25.646+0200 7f129fe5d240 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-4/k>
May 11 16:17:25 node2 ceph-osd[15071]: 2023-05-11T16:17:25.646+0200 7f129fe5d240 -1 AuthRegistry(0x7ffff9f51580) no keyring found at /var/lib/ce>
May 11 16:17:25 node2 ceph-osd[15071]: failed to fetch mon config (--no-mon-config to skip)
May 11 16:17:25 node2 systemd[1]: ceph-osd@4.service: Main process exited, code=exited, status=1/FAILURE
May 11 16:17:25 node2 systemd[1]: ceph-osd@4.service: Failed with result 'exit-code'.
May 11 16:17:35 node2 systemd[1]: ceph-osd@4.service: Scheduled restart job, restart counter is at 3.
May 11 16:17:35 node2 systemd[1]: Stopped Ceph object storage daemon osd.4.
May 11 16:17:35 node2 systemd[1]: ceph-osd@4.service: Start request repeated too quickly.
May 11 16:17:35 node2 systemd[1]: ceph-osd@4.service: Failed with result 'exit-code'.
May 11 16:17:35 node2 systemd[1]: Failed to start Ceph object storage daemon osd.4.
May 11 16:24:50 node2 systemd[1]: ceph-osd@4.service: Start request repeated too quickly.
May 11 16:24:50 node2 systemd[1]: ceph-osd@4.service: Failed with result 'exit-code'.
May 11 16:24:50 node2 systemd[1]: Failed to start Ceph object storage daemon osd.4.
 
Your output is still cut off, but I assume the keyring file is missing. Please attach the output from journalctl -b -u ceph-osd@4.service > journal-osd-4.txt.

If the keyring file is corrupt or missing, you can try the following:
Bash:
# Export key
ceph auth export osd.4 -o keyring.export
# Adapt user and permissions
chown ceph:ceph keyring.export
chmod 0600 keyring.export
# Remove caps lines
sed -i '/^\s*caps/d' keyring.export
# Put keyring into expected location
mv keyring.export /var/lib/ceph/osd/ceph-4/keyring
# Reset failed counter for service
systemctl reset-failed ceph-osd@4.service
# Start osd service
systemctl start ceph-osd@4.service
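
To see what the `sed` step does before touching the real keyring, it can be tried on a throwaway file first. The sketch below assumes the usual layout of a `ceph auth export` file (a section header, a `key` line and several `caps` lines); the key is a made-up placeholder, not a real secret.

```shell
#!/bin/sh
# Sample keyring in the layout 'ceph auth export' is assumed to write.
# The key below is a placeholder for illustration only.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
[osd.4]
    key = AQBplaceholderplaceholderplaceholder00==
    caps mgr = "allow profile osd"
    caps mon = "allow profile osd"
    caps osd = "allow *"
EOF

# Same expression as in the steps above (\s requires GNU sed).
sed -i '/^\s*caps/d' "$sample"

# Only the section header and the key line should remain.
result="$(cat "$sample")"
printf '%s\n' "$result"
rm -f "$sample"
```

After the real OSD has been started, `ls -l /var/lib/ceph/osd/ceph-4/keyring` should show the file owned by ceph:ceph with mode 0600.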

Hope this helps!
 
