CEPH OSD Crash on Same Node

malhivm

New Member
Jun 13, 2024
Hi everyone, I'm a homelabber. I have a 3-node cluster running Ceph, and one OSD consistently causes me problems. It is the only one that ever crashes: either the monitor or the OSD on this node goes down. This causes PGs to show up as inconsistent. I can manually repair them to bring the cluster back to a healthy state, but it happens again. I think it typically happens when there are network issues. I should put the Ceph traffic on a separate network, but I haven't gotten to that point yet. Any advice based on the attached log?

The only thing that I can do is delete and recreate the OSD. It's strange to me that it's the same node all the time.
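
For anyone following along, the manual repair step could be scripted roughly like this. It is only a sketch: it assumes the `ceph` CLI is on the PATH, and that `ceph health detail` lists damaged PGs on lines of the form `pg <id> is <state>, acting [...]`. The `SAMPLE` text is illustrative and not taken from my cluster, and the function names are made up for this example.

```python
import re
import subprocess

# Assumed line format: "pg 2.1f is active+clean+inconsistent, acting [0,1,2]"
PG_LINE = re.compile(r"^\s*pg\s+(\S+)\s+is\s+(\S+)", re.MULTILINE)

# Illustrative sample only, not real output from the cluster in this thread.
SAMPLE = """HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
[ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
    pg 2.1f is active+clean+inconsistent, acting [0,1,2]
    pg 3.4 is active+clean, acting [2,1,0]
"""


def inconsistent_pgs(health_detail):
    """Return the IDs of PGs whose state includes 'inconsistent'."""
    pgs = []
    for pgid, state in PG_LINE.findall(health_detail):
        # The state token ends with a comma, e.g. "active+clean+inconsistent,"
        if "inconsistent" in state.rstrip(",").split("+"):
            pgs.append(pgid)
    return pgs


def repair_inconsistent(dry_run=True):
    """Print (or, with dry_run=False, run) `ceph pg repair` for each damaged PG."""
    out = subprocess.run(
        ["ceph", "health", "detail"],
        capture_output=True, text=True, check=True,
    ).stdout
    for pgid in inconsistent_pgs(out):
        cmd = ["ceph", "pg", "repair", pgid]
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Parse the sample first; swap in repair_inconsistent() on a real node.
    print(inconsistent_pgs(SAMPLE))
```

The dry-run default is deliberate: repairing on a loop hides how often the PGs go inconsistent, which is the thing I actually want to track down.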
 

Hey, thanks for being willing to help. I don't see anything obvious. When I went through the Ceph OSD and Mon logs, I didn't see anything specific to the sda device. I just installed rsyslog, so maybe it will pick up some I/O errors on sda.

dmesg shows: ceph[722223]: segfault at a9 ip 000000000050c66d sp 00007ffc7b948120 error 4 in python3.11[41f000+2b5000] likely on CPU 1 (core 1, socket 0). It could be completely unrelated, and I'm not able to recreate the failure on demand, nor do I really want to.
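
To make that kernel line easier to read, here is a small sketch that pulls the fields apart and decodes the `error` value using the x86 page-fault bits (bit 0: protection violation vs. page not mapped, bit 1: write vs. read, bit 2: user vs. kernel mode). The regex and the `decode_segfault` name are my own, and the computed offset is relative to the start of the mapping shown in brackets, which is not necessarily the offset in the file on disk.

```python
import re

# Pattern for the usual x86 "segfault at ..." kernel message; the trailing
# "in <object>[base+size]" part is optional because it is not always present.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
    r"(?: in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
)

LINE = (
    "ceph[722223]: segfault at a9 ip 000000000050c66d sp 00007ffc7b948120 "
    "error 4 in python3.11[41f000+2b5000] likely on CPU 1 (core 1, socket 0)"
)


def decode_segfault(line):
    """Return a dict describing the fault, or None if the line does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m["err"], 16)  # the kernel prints the error code in hex
    info = {
        "process": m["comm"],
        "pid": int(m["pid"]),
        "fault_address": int(m["addr"], 16),
        "cause": "protection violation" if err & 1 else "page not mapped",
        "access": "write" if err & 2 else "read",
        "mode": "user" if err & 4 else "kernel",
    }
    if m["obj"]:
        info["object"] = m["obj"]
        # Offset of the faulting instruction from the start of the mapping.
        info["offset_in_mapping"] = int(m["ip"], 16) - int(m["base"], 16)
    return info


if __name__ == "__main__":
    for key, value in decode_segfault(LINE).items():
        shown = hex(value) if key in ("fault_address", "offset_in_mapping") else value
        print(f"{key}: {shown}")
```

Read this way, the line describes a user-mode read of an unmapped address (0xa9) inside the python3.11 binary, in a process named `ceph`. My guess is that this means a Python-based Ceph tool crashed rather than the OSD daemon itself, but that is only my reading of one line.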

The problem could very well be the SATA controller, motherboard, memory, or processor. I have already replaced the drive and cable and changed the SATA port. If you have any other advice on troubleshooting this, I'd appreciate it. Thanks.