[SOLVED] [Improvement request] Ceph OSD start (restart) does not work after an OSD goes down.

One node of our Ceph 17.2.6 cluster went down.
We don't know why, but at the same time 3 other OSDs on other nodes went down as well.

When we try to restart the OSD via the web page, we only get an error message saying that the start of the service failed.
We get these errors in the logs:

Code:
Sep 27 13:38:40 prox4 systemd[1]: ceph-osd@5.service: Start request repeated too quickly.
Sep 27 13:38:40 prox4 systemd[1]: ceph-osd@5.service: Failed with result 'exit-code'.
Sep 27 13:38:40 prox4 systemd[1]: Failed to start ceph-osd@5.service - Ceph object storage daemon osd.5.

So we have to run these commands to start the service:

Bash:
root@prox3:~# systemctl daemon-reload
root@prox3:~# systemctl reset-failed ceph-osd@4
root@prox3:~# systemctl start ceph-osd@4
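
The manual steps above could be wrapped in a small helper. This is only a sketch: the function name `restart_failed_osd` and the `SYSTEMCTL` override variable are made up for illustration. `systemctl reset-failed` clears the failed state of the unit together with its start rate limit counter, which is what the "Start request repeated too quickly" message refers to. `daemon-reload` is left out on the assumption that no unit files were changed.

```shell
# Hypothetical helper: clear the failed state of one OSD unit, then start it.
# SYSTEMCTL can be overridden (for example for a dry run); it defaults to systemctl.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

restart_failed_osd() {
    osd_id="$1"
    # Accept only a numeric OSD id such as 4 or 5.
    case "$osd_id" in
        ''|*[!0-9]*)
            echo "usage: restart_failed_osd <numeric-osd-id>" >&2
            return 2
            ;;
    esac
    unit="ceph-osd@${osd_id}.service"
    # reset-failed also resets the start rate limit counter of the unit.
    "$SYSTEMCTL" reset-failed "$unit" && "$SYSTEMCTL" start "$unit"
}
```

On the affected node this would be called as `restart_failed_osd 4`.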

Maybe the "systemctl daemon-reload" is overkill, but here is my question.

Can you please add these extra commands behind the "Start" button in the OSD GUI? Because when I press the "Start" button, I don't care how often the system itself has already tried to restart the OSD.
Or edit the service file of the OSDs so that the restart attempts are less frequent, or so that the service can be restarted without limits.
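
For reference, the second idea could be tried locally with a systemd drop-in instead of editing the shipped service file. The sketch below is an assumption-heavy illustration: the helper name, the file name `start-limit.conf` and the values are made up, and only `StartLimitIntervalSec=` and `StartLimitBurst=` are real systemd directives. Relaxing the limit does not fix whatever makes the OSD crash.

```shell
# Hypothetical helper: write a drop-in that relaxes the start rate limit
# for all ceph-osd@ instances. The values are illustrative, not recommendations.
write_osd_start_limit_dropin() {
    dropin_dir="${1:-/etc/systemd/system/ceph-osd@.service.d}"
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/start-limit.conf" <<'EOF'
[Unit]
StartLimitIntervalSec=10min
StartLimitBurst=5
EOF
}
```

After writing the file, `systemctl daemon-reload` is needed so that systemd picks up the drop-in.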
 
Hello

I don't think that would be a good idea. When an OSD fails so often that this becomes a regular problem, it most likely means there is something significantly wrong in your setup. Adding this option would suggest to users that this can be ignored, which would probably lead to more problems in the long run.

I'd rather suggest reading through journalctl -b and cat /var/log/ceph/ceph-osd.<osd-id>.log to find out the underlying issue.
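
To narrow the journal down to a single OSD, the output can be filtered by unit. A minimal sketch, assuming the unit naming `ceph-osd@<id>.service` seen in the log excerpt earlier in this thread; the function name `osd_boot_journal` and the `JOURNALCTL` override variable are hypothetical.

```shell
# Hypothetical helper: print the journal of one OSD unit for the current boot.
# JOURNALCTL can be overridden (for example for a dry run); it defaults to journalctl.
JOURNALCTL="${JOURNALCTL:-journalctl}"

osd_boot_journal() {
    osd_id="$1"
    # Accept only a numeric OSD id.
    case "$osd_id" in
        ''|*[!0-9]*)
            echo "usage: osd_boot_journal <numeric-osd-id>" >&2
            return 2
            ;;
    esac
    # -b: current boot only, -u: restrict to one unit, --no-pager: plain output.
    "$JOURNALCTL" -b -u "ceph-osd@${osd_id}.service" --no-pager
}
```

For example, `osd_boot_journal 5 | tail -n 50` would show the last messages of osd.5 before systemd gave up restarting it.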
 
But then it is hard to tell my future self that it is his problem :(
OK... OK.. You are right. :)

btw: My findings for the 3 OSDs going down:

While under high load (because of the failing node), 3 NVMes got very warm and produced I/O errors:
_aio_thread got r=-5 ((5) Input/output error)
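
A quick way to check other OSD logs for the same symptom is to count lines that contain this error text. A minimal sketch; the function name `count_io_errors` is hypothetical, and the log path in the usage note is an assumption based on the path pattern mentioned earlier in this thread.

```shell
# Hypothetical helper: count log lines that contain "Input/output error".
# Note: grep -c prints 0 and exits with status 1 when nothing matches.
count_io_errors() {
    log_file="$1"
    if [ ! -r "$log_file" ]; then
        echo "cannot read log file: $log_file" >&2
        return 2
    fi
    # -F: match the text literally, -c: print the number of matching lines.
    grep -c -F 'Input/output error' "$log_file"
}
```

Usage would look like `count_io_errors /var/log/ceph/ceph-osd.5.log`.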

All OSDs that failed were of the same type. They were all "Micron 7400 MTFDKBG3T8TDZ".
So 100% of our Micron NVMes failed under load. I think it's a good thing that only 3 of our 33 OSDs are on Micron NVMes.

They have the newest firmware I could find on the manufacturer's website.
So I think we have to replace all 3 NVMes.

THX for reminding me not to take the fast way out.
 