Hi,
I'm having an annoying problem with a 5-node cluster. All OSDs on one node are crashing just before midnight most nights.
In syslog I can see that, starting at 23:58:53, all the OSDs on the node start reporting heartbeat_check failures with no reply from other OSDs. A minute later all the OSDs on that node are shut down. They start again without problems in the morning and run fine until the next occurrence, which is again around midnight, though not every night.
Additionally, I have backup jobs running. I moved them from 00:00 to 23:00, but the midnight massacre still happens. I have three networks (corosync, client and ceph) and I can ping the other nodes from all three interfaces (about 0.2 ms).
Any pointers on how to troubleshoot?
