We have a new four-node cluster that is almost identical to our other clusters. However, since it has been up, at what seem to be random times we end up with errors similar to:
Code:
2018-02-05 06:48:16.581002 26686 : cluster [ERR] Health check update: Possible data damage: 4 pgs inconsistent, 2 pgs repair (PG_DAMAGED)
2018-02-05 06:49:14.237060 26687 : cluster [ERR] overall HEALTH_ERR 4 scrub errors; Possible data damage: 4 pgs inconsistent, 2 pgs repair
2018-02-05 06:49:16.626283 26688 : cluster [ERR] Health check update: 3 scrub errors (OSD_SCRUB_ERRORS)
This of course drives the cluster status to HEALTH_ERR. Simply running the commands:
Code:
ceph health detail
ceph pg repair <pgid>
Running the second command for each listed pgid brings the cluster back to HEALTH_OK for a while, but then the process repeats. We have been watching the OSDs involved, and at this point the errors span all of them, so we do not believe it is a single failing OSD.
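For what it's worth, the two-command cycle above can be wrapped in a small loop. A minimal sketch, assuming the standard "pg <pgid> is ... inconsistent" lines that `ceph health detail` prints; it only automates the manual repair, it does not address the root cause:

```shell
#!/bin/sh
# Repair every PG that `ceph health detail` reports as inconsistent.
# awk: on lines whose first field is "pg" and that mention "inconsistent",
# the second field is the pgid.
for pgid in $(ceph health detail | awk '/inconsistent/ && $1 == "pg" {print $2}'); do
    echo "repairing pg ${pgid}"
    ceph pg repair "${pgid}"
done
```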
Any thoughts on what might be causing this, or on how we can troubleshoot further to find the cause? We have not been able to find much of anything in the logs.
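One thing we were considering, before just repairing: dump the inconsistent objects to see whether the bad shards all sit on one OSD. A sketch, assuming `rados list-inconsistent-obj` and `jq` are available; the pgid `2.5` is only a placeholder taken from `ceph health detail`:

```shell
#!/bin/sh
# For an inconsistent PG, print one line per bad shard:
# <object name> osd.<id> <comma-separated shard errors>
# "2.5" is a placeholder pgid; substitute one from `ceph health detail`.
rados list-inconsistent-obj 2.5 --format=json \
  | jq -r '.inconsistents[] | .object.name as $o
           | .shards[] | select(.errors | length > 0)
           | "\($o) osd.\(.osd) \(.errors | join(","))"'
```

If the errors keep landing on the same osd.N, that would point back at a disk after all; if they are spread out, something cluster-wide (controller, firmware, network) seems more likely.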
Thanks!