Ceph issues on new cluster, 5.1 fully patched

lweidig

Active Member
We have a new four-node cluster that is almost identical to other clusters we are running. However, since it has been up, we end up with errors like the following at what seem to be random times:

Code:
2018-02-05 06:48:16.581002  26686 : cluster [ERR] Health check update: Possible data damage: 4 pgs inconsistent, 2 pgs repair (PG_DAMAGED)
2018-02-05 06:49:14.237060  26687 : cluster [ERR] overall HEALTH_ERR 4 scrub errors; Possible data damage: 4 pgs inconsistent, 2 pgs repair
2018-02-05 06:49:16.626283  26688 : cluster [ERR] Health check update: 3 scrub errors (OSD_SCRUB_ERRORS)

This of course drives the cluster status to HEALTH_ERR. Running the commands:

Code:
ceph health detail
ceph pg repair <pg_id>

where the second command is run once for each listed PG ID, brings the cluster back to HEALTH_OK for a while, but then the process repeats. We have been watching the OSDs involved, and at this point the errors span all of them, so we do not believe it is a single failing OSD.
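
For reference, a minimal sketch of the repair loop we run. The awk pattern and loop are just our own wrapper around the two commands above, assuming the "pg ... is ... inconsistent" detail lines that ceph health detail prints on Luminous:

Code:
# repair every PG that 'ceph health detail' currently flags as inconsistent
ceph health detail \
  | awk '$1 == "pg" && /inconsistent/ {print $2}' \
  | while read -r pg; do
      ceph pg repair "$pg"   # one repair per flagged PG ID
    done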

Any thoughts on what might be causing this, or on how we can troubleshoot further to find the source? There is not much logged that we have been able to see.
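
In case it helps anyone suggest something, this is roughly how we have been inspecting the flagged PGs with the standard Luminous rados tooling (<pool> and <pg_id> are placeholders; substitute values from ceph health detail):

Code:
# list the PGs in a pool that are currently inconsistent
rados list-inconsistent-pg <pool>

# show which objects/shards in a PG failed scrub, and the errors reported per OSD
rados list-inconsistent-obj <pg_id> --format=json-pretty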

Thanks!
 
Pretty sure we have narrowed this down to one of the four nodes, as ALL of the affected PG IDs have an OSD located on that node. Now to dig further into why this node is misbehaving.
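
For anyone following along, a sketch of how we mapped the inconsistent PGs to hosts; the loop is our own, while ceph pg map and ceph osd find are standard commands:

Code:
# for each inconsistent PG, print its up/acting OSD set
ceph health detail \
  | awk '$1 == "pg" && /inconsistent/ {print $2}' \
  | while read -r pg; do
      echo "== $pg: $(ceph pg map "$pg")"
    done

# then resolve an OSD from the acting set to its host, e.g. osd.7
ceph osd find 7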
 
