On a Proxmox VE 4.4 node running Ceph Jewel, osd.0 suddenly dropped out of the cluster and has stopped. I cannot get it to start again. Going through the various logs, I traced a number of slow request errors, which may have led to the OSD being excluded from the cluster.
When the errors started to appear, I was cloning a 1 TB VM onto the 3-OSD Ceph storage. The VM is now running from Ceph, but using only the two remaining OSDs. All three monitors are OK and can reach Ceph (including the monitor on the node whose OSD failed). Here is what systemctl reports for the failed OSD:
Code:
systemctl status ceph-osd@0.service
sept. 03 11:34:14 prox1 systemd[1]: ceph-osd@0.service start request repeated too quickly, refusing to start.
sept. 03 11:34:14 prox1 systemd[1]: Failed to start Ceph object storage daemon.
sept. 03 11:34:14 prox1 systemd[1]: Unit ceph-osd@0.service entered failed state.
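From what I understand, systemd refuses further start attempts once the unit has failed too quickly in a row, so the message above may be hiding the real error. My plan (assuming it is safe to do so) is to clear the failed state and run the daemon in the foreground to capture the actual failure:

Code:
# clear systemd's rate-limit / failed state for the unit
systemctl reset-failed ceph-osd@0.service
# run the OSD in the foreground so the real error shows up on the console
/usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph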
/var/log/ceph/ceph-osd.0.log.1.gz suddenly contains many errors such as this one:
Code:
log_channel(cluster) log [WRN] : slow request 30.382557 seconds old, received at 2017-09-02 10:06:57.257924: osd_repop(client.190397.0:143223 3.4c 3:32894d75:::rbd_object_map.2e7ba238e1f29:head v 38'118814) currently commit_sent
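Since slow requests confined to a single OSD often point at the underlying disk, I was also planning to check the drive itself (assuming /dev/sdb is the disk behind osd.0; adjust to your actual layout):

Code:
# look for I/O errors from the device backing osd.0 (hypothetical /dev/sdb)
dmesg | grep -i -E 'sdb|i/o error'
# check the drive's SMART health (requires smartmontools)
smartctl -a /dev/sdb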
ceph.log.2.gz suddenly contains many errors such as this one:
Code:
mon.0 192.168.100.11:6789/0 230679 : cluster [INF] HEALTH_WARN; 12 requests are blocked > 32 sec
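For reference, this is how I have been checking the overall cluster state; as far as I know, ceph health detail lists which OSDs the blocked requests belong to:

Code:
ceph -s              # overall cluster status
ceph health detail   # which requests are blocked, and on which OSDs
ceph osd tree        # confirms osd.0 is down/out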
I am trying to troubleshoot the issue, but I don't understand what actions I need to take to resolve it.
Any advice?