[SOLVED] [Improvement request] Ceph OSD start (restart) does not work after an OSD goes down.

BenediktS · Sep 27, 2023

One node of our Ceph 17.2.6 cluster went down.
We don't know why, but at the same time 3 OSDs on other nodes went down as well.

When we try to restart the OSD via the web interface, we only get an error message saying that the service failed to start.
We get these errors in the logs:

Code:

Sep 27 13:38:40 prox4 systemd[1]: ceph-osd@5.service: Start request repeated too quickly.
Sep 27 13:38:40 prox4 systemd[1]: ceph-osd@5.service: Failed with result 'exit-code'.
Sep 27 13:38:40 prox4 systemd[1]: Failed to start ceph-osd@5.service - Ceph object storage daemon osd.5.

So we have to perform these commands to start the service

Bash:

root@prox3:~# systemctl daemon-reload
root@prox3:~# systemctl reset-failed ceph-osd@4
root@prox3:~# systemctl start ceph-osd@4
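The sequence above can be wrapped in a small helper. This is only a sketch: the function name `osd_start` and the `DRY_RUN` variable are made up for illustration and are not part of Proxmox VE or Ceph. It leaves out `daemon-reload`, which only re-reads unit files from disk, and relies on `reset-failed`, which clears the unit's failed state together with its start rate limit counter.

```shell
# Hypothetical helper, not shipped with Proxmox VE or Ceph.
# Clears the failed state (and the start rate limit counter) of an OSD unit,
# then starts it. Set DRY_RUN=1 to print the commands instead of running them.
osd_start() {
    local id="${1:-}"
    # Accept only numeric OSD IDs such as 4 or 5.
    case "$id" in
        ''|*[!0-9]*) echo "usage: osd_start <numeric-osd-id>" >&2; return 2 ;;
    esac
    local unit="ceph-osd@${id}"
    # ${DRY_RUN:+echo} expands to "echo" only when DRY_RUN is set and non-empty.
    ${DRY_RUN:+echo} systemctl reset-failed "$unit" &&
        ${DRY_RUN:+echo} systemctl start "$unit"
}
```

Calling `DRY_RUN=1 osd_start 4` prints the two `systemctl` commands without touching the service; `osd_start 4` runs them.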

Maybe the "systemctl daemon-reload" is overkill, but here is my question.

Can you please add these extra commands behind the "Start" button in the OSD GUI? When I press "Start", I don't care how often the system itself has already tried to restart the OSD.
Or edit the service file of the OSDs so that the restarts are less frequent, or so that it can retry starting the service without a limit.
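For reference, the second idea would usually be done with a systemd drop-in rather than by editing the packaged unit file. A minimal sketch, assuming a hypothetical generator function `osd_start_limit_dropin` and purely illustrative default values; `StartLimitIntervalSec=` and `StartLimitBurst=` are standard `[Unit]` settings, and an interval of 0 disables the rate limit entirely. It only changes how often systemd retries and does nothing about the reason an OSD keeps failing.

```shell
# Hypothetical helper: print a drop-in that relaxes the start rate limit.
# $1 = interval (default 30min, illustrative), $2 = burst (default 10, illustrative).
osd_start_limit_dropin() {
    local interval="${1:-30min}"
    local burst="${2:-10}"
    cat <<EOF
[Unit]
StartLimitIntervalSec=${interval}
StartLimitBurst=${burst}
EOF
}
```

The output could be written to `/etc/systemd/system/ceph-osd@.service.d/override.conf`, followed by `systemctl daemon-reload`. The values actually in effect for a unit can be checked with `systemctl show ceph-osd@5 -p StartLimitBurst -p StartLimitIntervalUSec`.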

Philipp Hufnagl · Sep 27, 2023

Hello

I don't think that would be a good idea. When an OSD fails so often that this becomes a regular problem, it most likely means there is something significantly wrong in your setup. Adding this option would suggest to users that this can be ignored, which would probably lead to more problems in the long run.

I'd rather suggest reading through journalctl -b and cat /var/log/ceph/ceph-osd.<osd-id>.log to find the underlying issue.
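To narrow that down to a single OSD, something like the following may help. It is a sketch only: `osd_diag` and the `CEPH_LOG_DIR` override are hypothetical names, and the log path assumes the default `/var/log/ceph` layout mentioned above.

```shell
# Hypothetical helper: show this boot's error-level journal entries for one
# OSD unit, then the last 50 lines of that OSD's own log file.
osd_diag() {
    local id="${1:-}"
    case "$id" in
        ''|*[!0-9]*) echo "usage: osd_diag <numeric-osd-id>" >&2; return 2 ;;
    esac
    # CEPH_LOG_DIR is an assumed override for non-default log locations.
    local log="${CEPH_LOG_DIR:-/var/log/ceph}/ceph-osd.${id}.log"
    journalctl -b -u "ceph-osd@${id}" -p err --no-pager
    if [ -r "$log" ]; then
        tail -n 50 "$log"
    else
        echo "no readable log at $log" >&2
    fi
}
```

For example, `osd_diag 5` limits the journal to `ceph-osd@5` instead of the whole boot log.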

BenediktS · Sep 27, 2023

But then it is hard to tell my future self that it is his problem.

OK... OK.. You are right.

btw: My findings for the 3 OSDs going down:

While under high load (because of the failing node), 3 NVMe drives got very warm and produced I/O errors:
_aio_thread got r=-5 ((5) Input/output error)
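A quick way to see which OSDs logged this error is to count matching lines per log file. A minimal sketch, assuming the hypothetical function name `osd_io_error_counts` and the default log directory `/var/log/ceph`:

```shell
# Hypothetical helper: print "<count> <logfile>" for every OSD log in a
# directory, counting lines that contain "Input/output error".
osd_io_error_counts() {
    local dir="${1:-/var/log/ceph}"
    local f n
    for f in "$dir"/ceph-osd.*.log; do
        # Skip the unexpanded glob pattern when no log files exist.
        [ -e "$f" ] || continue
        # grep -c prints 0 but exits non-zero when nothing matches.
        n="$(grep -c -F 'Input/output error' "$f" || true)"
        printf '%s %s\n' "$n" "${f##*/}"
    done
}
```

Running `osd_io_error_counts` on each node lists every OSD log with its error count, so the affected OSDs stand out from the healthy ones.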

All OSDs that failed were of the same type. They were all "Micron 7400 MTFDKBG3T8TDZ".
So 100% of our Micron NVMe drives failed under load. I think it's a good thing that only 3 of our 33 OSDs are on Micron NVMe drives.

They have the newest firmware I could find on the manufacturer's website.
So I think we have to replace all 3 NVMe drives.

Thanks for reminding me not to take the fast way out.
