[SOLVED] OSD not starting

Fred Saunier

Aug 24, 2017
Brussels, BE
On a Proxmox 4.4 node running Ceph Jewel, osd.0 suddenly dropped out of the cluster and was stopped. I cannot get it to start again. Going through the various logs, I traced a number of slow-request errors, which may have led to the OSD being marked out of the cluster.

As the errors started to appear, I was cloning a 1 TB VM onto the 3-OSD Ceph storage. The VM is now running from Ceph, but only using the two remaining OSDs. All three monitors are OK and can access Ceph (including the monitor on the node whose OSD failed).

Code:
systemctl status ceph-osd@0.service
sept. 03 11:34:14 prox1 systemd[1]: ceph-osd@0.service start request repeated too quickly, refusing to start.
sept. 03 11:34:14 prox1 systemd[1]: Failed to start Ceph object storage daemon.
sept. 03 11:34:14 prox1 systemd[1]: Unit ceph-osd@0.service entered failed state.
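The "start request repeated too quickly" message means systemd has hit the unit's restart rate limit, so it refuses further start attempts until the failed state is cleared. A minimal sketch of clearing it and retrying (the unit name `ceph-osd@0.service` is taken from the log above; run as root):

```shell
# Clear systemd's failed/rate-limited state for the OSD unit,
# then try starting it again and inspect the result.
systemctl reset-failed ceph-osd@0.service
systemctl start ceph-osd@0.service
systemctl status ceph-osd@0.service
```

If the OSD dies again right after starting, the underlying cause (here, as it turned out, a failing disk) is still present and systemd will simply re-enter this state.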

/var/log/ceph/ceph-osd.0.log.1.gz (suddenly contains many errors such as this one):
Code:
log_channel(cluster) log [WRN] : slow request 30.382557 seconds old, received at 2017-09-02 10:06:57.257924: osd_repop(client.190397.0:143223 3.4c 3:32894d75:::rbd_object_map.2e7ba238e1f29:head v 38'118814) currently commit_sent

ceph.log.2.gz (suddenly contains many errors such as this one):
Code:
mon.0 192.168.100.11:6789/0 230679 : cluster [INF] HEALTH_WARN; 12 requests are blocked > 32 sec

I am trying to troubleshoot the issue, but I fail to understand what actions I need to take to solve it.
Any advice?
 
Hi!

I had a similar problem once; SMART showed the disk as working.
I removed the OSD (stop, out, remove),
wiped the disk: ceph-disk zap /dev/sd...
and added it back: pveceph createosd /dev/sd..
After that, the problem disappeared.
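The steps above can be sketched as follows. This is a sketch, not a verified recipe: the OSD id `0` and device `/dev/sdd` are assumptions here, so substitute your own before running anything, and note that removing an OSD triggers rebalancing.

```shell
# Sketch of the stop / out / remove, zap, re-create cycle for a failed OSD.
# osd.0 and /dev/sdd are hypothetical; use your own OSD id and device.

# 1. Stop the daemon and mark the OSD out of the cluster
systemctl stop ceph-osd@0.service
ceph osd out 0

# 2. Remove it from the CRUSH map, delete its auth key, remove the OSD entry
ceph osd crush remove osd.0
ceph auth del osd.0
ceph osd rm 0

# 3. Wipe the disk and let Proxmox re-create the OSD on it
ceph-disk zap /dev/sdd
pveceph createosd /dev/sdd
```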

Best regards,
Gosha
 
As a follow-up to the initial problem: although Ceph was back to HEALTH_OK, further monitoring of the drive showed an increasing number of errors in syslog, such as:
Code:
prox1 smartd[2141]: Device: /dev/sdd [SAT], 12 Currently unreadable (pending) sectors

I ended up changing the drive before it self-destructed, which put an end to my trouble.
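To keep an eye on a suspect drive the same way, the SMART health and pending-sector attribute can be queried directly with smartctl (from the smartmontools package; `/dev/sdd` is taken from the smartd message above and may differ on your system):

```shell
# Overall SMART health verdict for the drive
smartctl -H /dev/sdd

# Show vendor attributes and pick out the pending-sector counter
# (attribute 197, Current_Pending_Sector); a non-zero, growing value
# is a strong sign the drive should be replaced.
smartctl -A /dev/sdd | grep -i 'Current_Pending_Sector'
```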
 

Gosha's fix worked for me too. Thank you!