CEPH OSD(s) failing to initialize after a hardware change

The issue, as I see it, is that the OSD service doesn't start. You need to reset the failure counter first:

systemctl reset-failed <servicename>
systemctl start <servicename>
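
For Ceph OSDs the unit is normally named ceph-osd@<id>.service, so the two commands above could be wrapped like this. This is only a rough sketch: the function names and the DRY_RUN switch are made up for illustration, and it assumes the standard package-based unit naming.

```shell
# Build the systemd unit name for a given OSD id
# (assumes the usual ceph-osd@<id>.service naming).
osd_unit() {
  printf 'ceph-osd@%s.service\n' "$1"
}

# Clear the failed state (and with it the start-limit counter),
# then try to start the OSD again.
# If DRY_RUN is set to a non-empty value, the commands are only printed.
restart_osd() {
  unit=$(osd_unit "$1")
  ${DRY_RUN:+echo} systemctl reset-failed "$unit"
  ${DRY_RUN:+echo} systemctl start "$unit"
}

# Example: preview the commands for OSD 3 without running them
#   DRY_RUN=1 restart_osd 3
```

If the OSD still fails to start afterwards, "journalctl -u ceph-osd@<id>.service" should show why.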


I had the same issue last week, as a result of some network changes. I never got it fully up again. Don't waste time: create new OSDs and restore.

Then it seems you have a mapping issue too.
 
So, should I just wipe the OSDs on that node and re-add them? I just want to clarify before attempting that. I am in a 2/3 for that setup, and CephFS did come back up and I am able to access everything, but those 15 OSDs are still down. So, to be completely clear: destroy them and re-add?
 
I would strongly advise it. As you wrote yourself, you have been messing around with it for days.
I had exactly the same experience. I played around for 3 days and nothing worked. In the end I reinstalled Proxmox completely. My cluster wasn't in production yet and, on a hunch, I had made a backup right before.
And my personal opinion: if you play around too much on a production system, you can never be sure later what the server will do. I didn't trust that server anymore, so I reinstalled it (but I am not too upset, because along the way I fixed an issue and increased the performance dramatically).

But wait for a staff member to advise on how to remove and re-add the node properly (this is only my advice).
 
Yes, you typically do not recover a node partially. What you did is effectively replace the node but put the old hard drives in, and now you’re trying to recover those in a clustered file system. At this point, the data on the ‘dead’ OSDs is out of sync and no longer matters; forcefully trying to re-insert them will only cause you more trouble.

So yes, wipe the disks. I would even say wipe the Proxmox installation at this point, start fresh, and treat it as a new node.
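
For reference, retiring a stale OSD and recreating it on the same disk usually comes down to a handful of commands. The sketch below only prints them, so they can be reviewed before anything destructive is run. The function name and the example id/device are hypothetical, and it assumes a reasonably recent Ceph (with "ceph osd purge") and Proxmox VE (with "pveceph osd create"). Check each step against your own versions before using it.

```shell
# Print (not run) the commands to retire one stale OSD and recreate it.
# $1 = OSD id, $2 = backing device; both are placeholders you supply.
plan_osd_rebuild() {
  osd_id=$1
  device=$2
  # Mark the OSD out and make sure its daemon is stopped.
  echo "ceph osd out ${osd_id}"
  echo "systemctl stop ceph-osd@${osd_id}.service"
  # Remove the OSD from the CRUSH map, auth keys and OSD map.
  echo "ceph osd purge ${osd_id} --yes-i-really-mean-it"
  # Wipe the old LVM/Ceph metadata from the disk (destroys its data).
  echo "ceph-volume lvm zap ${device} --destroy"
  # Create a fresh OSD on the now-empty disk.
  echo "pveceph osd create ${device}"
}

# Example: show the plan for OSD 12 on /dev/sdX
#   plan_osd_rebuild 12 /dev/sdX
```

Doing this one OSD at a time and letting the cluster settle between steps (watch "ceph -s") keeps the rebalancing load manageable.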
 