On a Proxmox VE 4.4 node running Ceph Jewel, osd.0 suddenly dropped out of the cluster and has stopped. I cannot get it to start again. Going through the various logs, I traced a number of slow request errors, which may have led to the OSD being excluded from the cluster.
When the errors started to appear, I was cloning a 1 TB VM onto the 3-OSD Ceph storage. The VM is now running from Ceph, but using only the two remaining OSDs. All three monitors are OK and can reach Ceph (including the monitor on the node whose OSD failed). Here is what systemctl reports for the failed OSD:
Code:
systemctl status ceph-osd@0.service
sept. 03 11:34:14 prox1 systemd[1]: ceph-osd@0.service start request repeated too quickly, refusing to start.
sept. 03 11:34:14 prox1 systemd[1]: Failed to start Ceph object storage daemon.
sept. 03 11:34:14 prox1 systemd[1]: Unit ceph-osd@0.service entered failed state.
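From what I understand, systemd refuses further start attempts once the unit has failed too quickly in a row, so the message above may be hiding the real error. My plan (assuming it is safe to do so) is to clear the failed state and run the daemon in the foreground to capture the actual failure:

Code:
# clear systemd's rate-limit / failed state for the unit
systemctl reset-failed ceph-osd@0.service
# run the OSD in the foreground so the real error shows up on the console
/usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph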
/var/log/ceph/ceph-osd.0.log.1.gz suddenly contains many errors such as this one:
Code:
log_channel(cluster) log [WRN] : slow request 30.382557 seconds old, received at 2017-09-02 10:06:57.257924: osd_repop(client.190397.0:143223 3.4c 3:32894d75:::rbd_object_map.2e7ba238e1f29:head v 38'118814) currently commit_sent
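Since slow requests confined to a single OSD often point at the underlying disk, I was also planning to check the drive itself (assuming /dev/sdb is the disk behind osd.0; adjust to your actual layout):

Code:
# look for I/O errors from the device backing osd.0 (hypothetical /dev/sdb)
dmesg | grep -i -E 'sdb|i/o error'
# check the drive's SMART health (requires smartmontools)
smartctl -a /dev/sdb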
ceph.log.2.gz suddenly contains many errors such as this one:
Code:
mon.0 192.168.100.11:6789/0 230679 : cluster [INF] HEALTH_WARN; 12 requests are blocked > 32 sec
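For reference, this is how I have been checking the overall cluster state; as far as I know, ceph health detail lists which OSDs the blocked requests belong to:

Code:
ceph -s              # overall cluster status
ceph health detail   # which requests are blocked, and on which OSDs
ceph osd tree        # confirms osd.0 is down/out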
I am trying to troubleshoot the issue, but I don't understand what actions I need to take to resolve it.
Any advice?