Hi Alwin,
Thanks for your response. Yes, I've tried a scrub and a deep-scrub, but because the PG is undersized, the scrub never actually starts. Ceph keeps trying to backfill the undersized PG, and as soon as it starts backfilling one particular object, the OSD processes it's trying to replicate to (on two different servers) crash. Surprisingly enough, the OSD that initiates the backfill (the one that currently holds the copy of the data) does not crash.
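For reference, the scrub attempts were just the standard per-PG commands; they get queued but never actually run while the PG is undersized:
<code>
# queue a normal and a deep scrub for the affected PG
ceph pg scrub 1.3e4
ceph pg deep-scrub 1.3e4
</code>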
The PG in question:
<code>
ceph pg 1.3e4 query | head -n 30
{
    "state": "undersized+degraded+remapped+backfill_wait+peered",
    "snap_trimq": "[130~1,13d~1]",
    "snap_trimq_len": 2,
    "epoch": 41343,
    "up": [
        8,
        1,
        25
    ],
    "acting": [
        16
    ],
    "backfill_targets": [
        "1",
        "8",
        "25"
    ],
    "actingbackfill": [
        "1",
        "8",
        "16",
        "25"
    ],
</code>
In the case above, OSDs 1, 8 and 25 all crash simultaneously the moment the backfill of that object starts.
As soon as they've crashed, Ceph goes into recovery mode, the OSDs come back online after about 20 seconds, and as soon as Ceph tries to recover/backfill the same PG again, the whole cycle repeats like clockwork.
Initially I thought it was HDD trouble, so I removed the original target drives, but no change. I've also taken the original owner of the PG out, still no change (hence why, in the example above, the "acting" OSD is not one of the targets).
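For completeness, taking those OSDs out was done with the usual out/stop sequence, roughly like this (the OSD ID below is just an example):
<code>
# mark the suspect OSD out so data drains off it, then stop the daemon
ceph osd out 23
systemctl stop ceph-osd@23
</code>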
A snippet of the main Ceph log file, showing osd.8 and osd.23 flapping, and PG 1.3e4 starting the backfill process after which they go down again:
<code>
2019-03-31 06:26:02.406944 mon.prox1 mon.0 192.168.1.81:6789/0 744951 : cluster [INF] osd.8 failed (root=default,host=prox7) (connection refused reported by osd.0)
2019-03-31 06:26:02.409424 mon.prox1 mon.0 192.168.1.81:6789/0 744969 : cluster [INF] osd.23 failed (root=default,host=prox6) (connection refused reported by osd.6)
2019-03-31 06:26:03.324917 mon.prox1 mon.0 192.168.1.81:6789/0 745198 : cluster [WRN] Health check failed: 2 osds down (OSD_DOWN)
2019-03-31 06:26:03.325094 mon.prox1 mon.0 192.168.1.81:6789/0 745199 : cluster [WRN] Health check update: 489/5747322 objects misplaced (0.009%) (OBJECT_MISPLACED)
2019-03-31 06:26:04.550642 mon.prox1 mon.0 192.168.1.81:6789/0 745201 : cluster [WRN] Health check failed: Reduced data availability: 2 pgs peering (PG_AVAILABILITY)
2019-03-31 06:26:06.186950 mon.prox1 mon.0 192.168.1.81:6789/0 745203 : cluster [WRN] Health check update: Degraded data redundancy: 586/5747346 objects degraded (0.010%), 15 pgs degraded, 1 pg undersized (PG_DEGRADED)
2019-03-31 06:26:08.543045 mon.prox1 mon.0 192.168.1.81:6789/0 745204 : cluster [WRN] Health check update: 3668/5747490 objects misplaced (0.064%) (OBJECT_MISPLACED)
2019-03-31 06:26:10.873192 mon.prox1 mon.0 192.168.1.81:6789/0 745206 : cluster [WRN] Health check update: Reduced data availability: 2 pgs inactive (PG_AVAILABILITY)
2019-03-31 06:26:11.187406 mon.prox1 mon.0 192.168.1.81:6789/0 745207 : cluster [WRN] Health check update: Degraded data redundancy: 453462/5747499 objects degraded (7.890%), 297 pgs degraded (PG_DEGRADED)
2019-03-31 06:26:16.187967 mon.prox1 mon.0 192.168.1.81:6789/0 745208 : cluster [WRN] Health check update: 3668/5747499 objects misplaced (0.064%) (OBJECT_MISPLACED)
2019-03-31 06:26:34.987288 mon.prox1 mon.0 192.168.1.81:6789/0 745210 : cluster [WRN] Health check update: 3668/5747502 objects misplaced (0.064%) (OBJECT_MISPLACED)
2019-03-31 06:26:34.987356 mon.prox1 mon.0 192.168.1.81:6789/0 745211 : cluster [WRN] Health check update: Degraded data redundancy: 453462/5747502 objects degraded (7.890%), 297 pgs degraded (PG_DEGRADED)
2019-03-31 06:26:40.280592 mon.prox1 mon.0 192.168.1.81:6789/0 745214 : cluster [WRN] Health check update: 1 osds down (OSD_DOWN)
2019-03-31 06:26:40.430710 mon.prox1 mon.0 192.168.1.81:6789/0 745215 : cluster [INF] osd.23 192.168.1.86:6817/3115652 boot
2019-03-31 06:26:41.190103 mon.prox1 mon.0 192.168.1.81:6789/0 745217 : cluster [WRN] Health check update: 3668/5747508 objects misplaced (0.064%) (OBJECT_MISPLACED)
2019-03-31 06:26:41.190179 mon.prox1 mon.0 192.168.1.81:6789/0 745218 : cluster [WRN] Health check update: Degraded data redundancy: 453462/5747508 objects degraded (7.890%), 297 pgs degraded (PG_DEGRADED)
2019-03-31 06:26:43.794538 mon.prox1 mon.0 192.168.1.81:6789/0 745220 : cluster [INF] Health check cleared: OBJECT_MISPLACED (was: 3668/5747508 objects misplaced (0.064%))
2019-03-31 06:26:45.034919 mon.prox1 mon.0 192.168.1.81:6789/0 745221 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 2 pgs inactive)
2019-03-31 06:26:46.190647 mon.prox1 mon.0 192.168.1.81:6789/0 745222 : cluster [WRN] Health check update: Degraded data redundancy: 302190/5747508 objects degraded (5.258%), 247 pgs degraded (PG_DEGRADED)
2019-03-31 06:26:49.549834 mon.prox1 mon.0 192.168.1.81:6789/0 745225 : cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2019-03-31 06:26:49.598780 mon.prox1 mon.0 192.168.1.81:6789/0 745226 : cluster [WRN] Health check failed: 1834/5747508 objects misplaced (0.032%) (OBJECT_MISPLACED)
2019-03-31 06:26:49.690099 mon.prox1 mon.0 192.168.1.81:6789/0 745227 : cluster [INF] osd.8 192.168.1.87:6800/815810 boot
2019-03-31 06:26:51.191126 mon.prox1 mon.0 192.168.1.81:6789/0 745230 : cluster [WRN] Health check update: Degraded data redundancy: 288856/5747508 objects degraded (5.026%), 236 pgs degraded (PG_DEGRADED)
2019-03-31 06:26:52.142607 osd.14 osd.14 192.168.1.81:6812/3099 8370 : cluster [INF] 1.3e4 continuing backfill to osd.23 from (27492'9481638,39119'9526978] 1:27efe7b3:::rbd_data.f21da26b8b4567.00000000000010a4:head to 39119'9526978
2019-03-31 06:26:55.961547 mon.prox1 mon.0 192.168.1.81:6789/0 745231 : cluster [WRN] Health check update: 489/5747523 objects misplaced (0.009%) (OBJECT_MISPLACED)
2019-03-31 06:26:56.191516 mon.prox1 mon.0 192.168.1.81:6789/0 745232 : cluster [WRN] Health check update: Degraded data redundancy: 733/5747523 objects degraded (0.013%), 111 pgs degraded (PG_DEGRADED)
2019-03-31 06:27:01.191946 mon.prox1 mon.0 192.168.1.81:6789/0 745233 : cluster [WRN] Health check update: 489/5747541 objects misplaced (0.009%) (OBJECT_MISPLACED)
2019-03-31 06:27:01.192047 mon.prox1 mon.0 192.168.1.81:6789/0 745234 : cluster [WRN] Health check update: Degraded data redundancy: 728/5747541 objects degraded (0.013%), 103 pgs degraded (PG_DEGRADED)
2019-03-31 06:27:06.192432 mon.prox1 mon.0 192.168.1.81:6789/0 745237 : cluster [WRN] Health check update: Degraded data redundancy: 677/5747541 objects degraded (0.012%), 76 pgs degraded (PG_DEGRADED)
2019-03-31 06:27:10.147576 mon.prox1 mon.0 192.168.1.81:6789/0 745239 : cluster [WRN] Health check update: 489/5747544 objects misplaced (0.009%) (OBJECT_MISPLACED)
2019-03-31 06:27:11.192871 mon.prox1 mon.0 192.168.1.81:6789/0 745240 : cluster [WRN] Health check update: Degraded data redundancy: 662/5747544 objects degraded (0.012%), 66 pgs degraded (PG_DEGRADED)
2019-03-31 06:27:15.339742 mon.prox1 mon.0 192.168.1.81:6789/0 745241 : cluster [WRN] Health check update: 489/5747613 objects misplaced (0.009%) (OBJECT_MISPLACED)
2019-03-31 06:27:16.240006 mon.prox1 mon.0 192.168.1.81:6789/0 745244 : cluster [WRN] Health check update: Degraded data redundancy: 631/5747613 objects degraded (0.011%), 49 pgs degraded (PG_DEGRADED)
2019-03-31 06:27:21.240485 mon.prox1 mon.0 192.168.1.81:6789/0 745247 : cluster [WRN] Health check update: Degraded data redundancy: 623/5747613 objects degraded (0.011%), 44 pgs degraded (PG_DEGRADED)
2019-03-31 06:27:26.240859 mon.prox1 mon.0 192.168.1.81:6789/0 745248 : cluster [WRN] Health check update: Degraded data redundancy: 598/5747613 objects degraded (0.010%), 26 pgs degraded (PG_DEGRADED)
2019-03-31 06:27:29.354948 mon.prox1 mon.0 192.168.1.81:6789/0 745252 : cluster [INF] osd.8 failed (root=default,host=prox7) (connection refused reported by osd.19)
2019-03-31 06:27:29.390502 mon.prox1 mon.0 192.168.1.81:6789/0 745287 : cluster [INF] osd.23 failed (root=default,host=prox6) (connection refused reported by osd.0)
</code>
In the individual OSD logs, I can see the OSD process crashing and restarting itself after about 10-20 seconds, after which it recovers. If you'd like a copy of these logs, I'm happy to supply them, but as they're about 0.5-1 GB each, I didn't think it was a good idea to paste them straight in here.
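If the full logs aren't practical, I can also extract just the section around each crash with something along these lines (the exact patterns would need adjusting to whatever the assert/backtrace lines turn out to be):
<code>
# pull some context around the crash markers instead of attaching the whole log
grep -B 20 -A 200 -e 'Caught signal' -e 'FAILED assert' /var/log/ceph/ceph-osd.8.log > osd.8-crash.txt
</code>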
As it seems to be a software issue/bug rather than a configuration issue, I've also raised a ticket with the Ceph tracker, but haven't heard anything back yet. If you're interested, you can find it here:
https://tracker.ceph.com/issues/39055
Thanks for any light you can shed on this!
Alex