[SOLVED] OSD not starting

Fred Saunier

Well-Known Member
Aug 24, 2017
Brussels, BE
On a Proxmox 4.4 node running Ceph Jewel, osd.0 suddenly dropped out of the cluster and stopped, and I cannot get it to start again. Going through the logs, I traced a number of slow-request errors, which may have led to the OSD being marked out of the cluster.

As the errors started to appear, I was cloning a 1 TB VM into the 3-OSD Ceph storage. The VM is now running from Ceph, but only using the 2 remaining OSDs. All 3 monitors are OK and can access Ceph (including the monitor on the node whose OSD failed).

Code:
systemctl status ceph-osd@0.service
sept. 03 11:34:14 prox1 systemd[1]: ceph-osd@0.service start request repeated too quickly, refusing to start.
sept. 03 11:34:14 prox1 systemd[1]: Failed to start Ceph object storage daemon.
sept. 03 11:34:14 prox1 systemd[1]: Unit ceph-osd@0.service entered failed state.
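The "start request repeated too quickly" message only means the unit hit systemd's restart rate limit; the underlying error is in the OSD's own log. As a first step you can clear the failed state and retry while watching the logs. A minimal sketch, using the unit name and log path from this post:

```shell
# Clear systemd's rate-limit/failed state for the OSD unit, then retry.
systemctl reset-failed ceph-osd@0.service
systemctl start ceph-osd@0.service
# Check both the unit journal and the OSD's own log for the real error.
journalctl -u ceph-osd@0.service -n 50 --no-pager
tail -n 100 /var/log/ceph/ceph-osd.0.log
```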

/var/log/ceph/ceph-osd.0.log.1.gz suddenly contains many errors like this one:
Code:
log_channel(cluster) log [WRN] : slow request 30.382557 seconds old, received at 2017-09-02 10:06:57.257924: osd_repop(client.190397.0:143223 3.4c 3:32894d75:::rbd_object_map.2e7ba238e1f29:head v 38'118814) currently commit_sent

ceph.log.2.gz suddenly contains many errors like this one:
Code:
mon.0 192.168.100.11:6789/0 230679 : cluster [INF] HEALTH_WARN; 12 requests are blocked > 32 sec
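To see where the blocked requests are piling up before taking any action, Ceph can break the warning down per OSD. A sketch using standard Jewel-era commands (osd.0 matches this post; run the `daemon` command on the node that hosts it):

```shell
# Cluster-wide view: which requests are blocked, and on which OSDs.
ceph health detail
# Per-OSD commit/apply latency, useful for spotting one slow disk.
ceph osd perf
# On the affected node: dump the operations currently stuck in osd.0.
ceph daemon osd.0 dump_ops_in_flight
```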

I am trying to troubleshoot the issue, but I don't understand what actions I need to take to resolve it.
Any advice?
 
Hi!

I once had a similar problem. SMART showed the disk as healthy.
I removed the OSD (stop, out, remove),
cleared the disk: ceph-disk zap /dev/sd...
and added it again: pveceph createosd /dev/sd..
The problem disappeared.
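The stop → out → remove, zap, recreate cycle can be sketched as follows. This assumes the failed OSD is osd.0 on /dev/sdd; both are examples, so substitute your own OSD id and device:

```shell
# Assumes the failed OSD is osd.0 on /dev/sdd -- adjust both.
systemctl stop ceph-osd@0.service   # stop the daemon
ceph osd out 0                      # mark it out so data rebalances
ceph osd crush remove osd.0         # drop it from the CRUSH map
ceph auth del osd.0                 # remove its auth key
ceph osd rm 0                       # delete the OSD id itself
ceph-disk zap /dev/sdd              # wipe partition table and contents
pveceph createosd /dev/sdd          # recreate the OSD via Proxmox
```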

Best regards,
Gosha
 
As a follow-up to the initial problem: although Ceph was back to HEALTH_OK, further monitoring showed an increasing number of errors for the drive in syslog, such as
Code:
prox1 smartd[2141]: Device: /dev/sdd [SAT], 12 Currently unreadable (pending) sectors
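Pending-sector counts like this can be checked directly with smartctl. A quick sketch, using /dev/sdd from the log line above:

```shell
# Overall SMART health verdict for the drive smartd complained about.
smartctl -H /dev/sdd
# Raw attribute table, filtered to the sector-health counters.
smartctl -A /dev/sdd | grep -Ei 'pending|realloc|uncorrect'
```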

I ended up replacing the drive before it failed completely, which put an end to my trouble.
 
It worked for me too! Thank You!
 
