Ceph issues on new cluster, 5.1 fully patched

lweidig

Active Member
We have a new four-node cluster that is almost identical to other clusters we are running. However, since it has been up, we end up with errors like the following at what seem to be random times:

Code:
2018-02-05 06:48:16.581002  26686 : cluster [ERR] Health check update: Possible data damage: 4 pgs inconsistent, 2 pgs repair (PG_DAMAGED)
2018-02-05 06:49:14.237060  26687 : cluster [ERR] overall HEALTH_ERR 4 scrub errors; Possible data damage: 4 pgs inconsistent, 2 pgs repair
2018-02-05 06:49:16.626283  26688 : cluster [ERR] Health check update: 3 scrub errors (OSD_SCRUB_ERRORS)

This of course drives the cluster status to HEALTH_ERR. Running the commands:

Code:
ceph health detail
ceph pg repair <pg_id>

where the second command is run once for each listed PG ID, brings the cluster back to HEALTH_OK for a while, but then the process repeats. We have been watching the OSDs involved, and at this point the errors span all of them, so we do not believe it is a single failing OSD.
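
For reference, a minimal sketch of the repair loop we run. The awk pattern and loop are just our own wrapper around the two commands above, assuming the "pg ... is ... inconsistent" detail lines that ceph health detail prints on Luminous:

Code:
# repair every PG that 'ceph health detail' currently flags as inconsistent
ceph health detail \
  | awk '$1 == "pg" && /inconsistent/ {print $2}' \
  | while read -r pg; do
      ceph pg repair "$pg"   # one repair per flagged PG ID
    done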

Any thoughts on what might be causing this, or on how we can troubleshoot further to find the source? There is not much logged that we have been able to see.
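
In case it helps anyone suggest something, this is roughly how we have been inspecting the flagged PGs with the standard Luminous rados tooling (<pool> and <pg_id> are placeholders; substitute values from ceph health detail):

Code:
# list the PGs in a pool that are currently inconsistent
rados list-inconsistent-pg <pool>

# show which objects/shards in a PG failed scrub, and the errors reported per OSD
rados list-inconsistent-obj <pg_id> --format=json-pretty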

Thanks!
 
Pretty sure we have narrowed this down to one of the four nodes, as ALL of the affected PG IDs have an OSD located on that node. Now to dig further into why this node is misbehaving.
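
For anyone following along, a sketch of how we mapped the inconsistent PGs to hosts; the loop is our own, while ceph pg map and ceph osd find are standard commands:

Code:
# for each inconsistent PG, print its up/acting OSD set
ceph health detail \
  | awk '$1 == "pg" && /inconsistent/ {print $2}' \
  | while read -r pg; do
      echo "== $pg: $(ceph pg map "$pg")"
    done

# then resolve an OSD from the acting set to its host, e.g. osd.7
ceph osd find 7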
 
