Ceph - OSDs crashing when trying to backfill a specific PG

alext

Hi,

A bit of a long shot, but does anyone have any experience with the following?
I have a specific PG that is trying to backfill its objects to the other peers, but the moment it tries to do so for some specific objects within the placement group, the other OSDs it's trying to backfill to crash (simultaneously).

As soon as they've crashed, Ceph goes into recovery mode, the OSDs come back online again after about 20 seconds, and as soon as Ceph tries to recover/backfill the same PG again, it all starts over like clockwork.

My initial thought was HDD issues, so I removed the original target drives, but no change. I've also taken the original owner of the PG out; still no change.

I've had a really good look around for current bugs or fixes, but couldn't find anything that came close. I've got a good idea which exact objects are causing the problems, so if anyone knows how to manually delete objects from a BlueStore OSD, that would already be great! (I'll happily take the hit on the lost data if the cluster can be healthy again.)

Any help much appreciated! - Apologies if this is the wrong place for it.

Running Ceph 12.2.11-pve1 on Proxmox 5.3-12.

Thanks in advance,

Alex
 
Is there anything related in the ceph logs? Did you try to scrub the PG?
 
Hi Alwin,

Thanks for your response. Yes, I've tried a scrub and a deep-scrub, but as the PG is undersized, it's not starting the scrub yet. It's trying to backfill the undersized PG, and as soon as it is backfilling one particular object, the OSD processes it's trying to replicate to (on two different servers) crash. Surprisingly enough, the OSD that initiates the backfill process (which currently has the copy of the data) does not crash.

The PG in question:

<code>
ceph pg 1.3e4 query | head -n 30
{
    "state": "undersized+degraded+remapped+backfill_wait+peered",
    "snap_trimq": "[130~1,13d~1]",
    "snap_trimq_len": 2,
    "epoch": 41343,
    "up": [
        8,
        1,
        25
    ],
    "acting": [
        16
    ],
    "backfill_targets": [
        "1",
        "8",
        "25"
    ],
    "actingbackfill": [
        "1",
        "8",
        "16",
        "25"
    ],
</code>

In the case above, it's crashing targets 1, 8 and 25 simultaneously the moment it tries to backfill.
As soon as they've crashed, Ceph goes into recovery mode, the OSDs come back online again after about 20 seconds, and as soon as Ceph tries to recover/backfill the same PG again, it all starts over like clockwork.

My initial thought was HDD issues, so I removed the original target drives, but no change. I've also taken the original owner of the PG out; still no change (hence why, in the example above, the "acting" OSD is not one of the targets).

A snippet of the main Ceph log file, showing osd.8 and osd.23 flapping, and PG 1.3e4 starting the backfill process after which they go down again:

2019-03-31 06:26:02.406944 mon.prox1 mon.0 192.168.1.81:6789/0 744951 : cluster [INF] osd.8 failed (root=default,host=prox7) (connection refused reported by osd.0)
2019-03-31 06:26:02.409424 mon.prox1 mon.0 192.168.1.81:6789/0 744969 : cluster [INF] osd.23 failed (root=default,host=prox6) (connection refused reported by osd.6)
2019-03-31 06:26:03.324917 mon.prox1 mon.0 192.168.1.81:6789/0 745198 : cluster [WRN] Health check failed: 2 osds down (OSD_DOWN)
2019-03-31 06:26:03.325094 mon.prox1 mon.0 192.168.1.81:6789/0 745199 : cluster [WRN] Health check update: 489/5747322 objects misplaced (0.009%) (OBJECT_MISPLACED)
2019-03-31 06:26:04.550642 mon.prox1 mon.0 192.168.1.81:6789/0 745201 : cluster [WRN] Health check failed: Reduced data availability: 2 pgs peering (PG_AVAILABILITY)
2019-03-31 06:26:06.186950 mon.prox1 mon.0 192.168.1.81:6789/0 745203 : cluster [WRN] Health check update: Degraded data redundancy: 586/5747346 objects degraded (0.010%), 15 pgs degraded, 1 pg undersized (PG_DEGRADED)
2019-03-31 06:26:08.543045 mon.prox1 mon.0 192.168.1.81:6789/0 745204 : cluster [WRN] Health check update: 3668/5747490 objects misplaced (0.064%) (OBJECT_MISPLACED)
2019-03-31 06:26:10.873192 mon.prox1 mon.0 192.168.1.81:6789/0 745206 : cluster [WRN] Health check update: Reduced data availability: 2 pgs inactive (PG_AVAILABILITY)
2019-03-31 06:26:11.187406 mon.prox1 mon.0 192.168.1.81:6789/0 745207 : cluster [WRN] Health check update: Degraded data redundancy: 453462/5747499 objects degraded (7.890%), 297 pgs degraded (PG_DEGRADED)
2019-03-31 06:26:16.187967 mon.prox1 mon.0 192.168.1.81:6789/0 745208 : cluster [WRN] Health check update: 3668/5747499 objects misplaced (0.064%) (OBJECT_MISPLACED)
2019-03-31 06:26:34.987288 mon.prox1 mon.0 192.168.1.81:6789/0 745210 : cluster [WRN] Health check update: 3668/5747502 objects misplaced (0.064%) (OBJECT_MISPLACED)
2019-03-31 06:26:34.987356 mon.prox1 mon.0 192.168.1.81:6789/0 745211 : cluster [WRN] Health check update: Degraded data redundancy: 453462/5747502 objects degraded (7.890%), 297 pgs degraded (PG_DEGRADED)
2019-03-31 06:26:40.280592 mon.prox1 mon.0 192.168.1.81:6789/0 745214 : cluster [WRN] Health check update: 1 osds down (OSD_DOWN)
2019-03-31 06:26:40.430710 mon.prox1 mon.0 192.168.1.81:6789/0 745215 : cluster [INF] osd.23 192.168.1.86:6817/3115652 boot
2019-03-31 06:26:41.190103 mon.prox1 mon.0 192.168.1.81:6789/0 745217 : cluster [WRN] Health check update: 3668/5747508 objects misplaced (0.064%) (OBJECT_MISPLACED)
2019-03-31 06:26:41.190179 mon.prox1 mon.0 192.168.1.81:6789/0 745218 : cluster [WRN] Health check update: Degraded data redundancy: 453462/5747508 objects degraded (7.890%), 297 pgs degraded (PG_DEGRADED)
2019-03-31 06:26:43.794538 mon.prox1 mon.0 192.168.1.81:6789/0 745220 : cluster [INF] Health check cleared: OBJECT_MISPLACED (was: 3668/5747508 objects misplaced (0.064%))
2019-03-31 06:26:45.034919 mon.prox1 mon.0 192.168.1.81:6789/0 745221 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 2 pgs inactive)
2019-03-31 06:26:46.190647 mon.prox1 mon.0 192.168.1.81:6789/0 745222 : cluster [WRN] Health check update: Degraded data redundancy: 302190/5747508 objects degraded (5.258%), 247 pgs degraded (PG_DEGRADED)
2019-03-31 06:26:49.549834 mon.prox1 mon.0 192.168.1.81:6789/0 745225 : cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2019-03-31 06:26:49.598780 mon.prox1 mon.0 192.168.1.81:6789/0 745226 : cluster [WRN] Health check failed: 1834/5747508 objects misplaced (0.032%) (OBJECT_MISPLACED)
2019-03-31 06:26:49.690099 mon.prox1 mon.0 192.168.1.81:6789/0 745227 : cluster [INF] osd.8 192.168.1.87:6800/815810 boot
2019-03-31 06:26:51.191126 mon.prox1 mon.0 192.168.1.81:6789/0 745230 : cluster [WRN] Health check update: Degraded data redundancy: 288856/5747508 objects degraded (5.026%), 236 pgs degraded (PG_DEGRADED)
2019-03-31 06:26:52.142607 osd.14 osd.14 192.168.1.81:6812/3099 8370 : cluster [INF] 1.3e4 continuing backfill to osd.23 from (27492'9481638,39119'9526978] 1:27efe7b3:::rbd_data.f21da26b8b4567.00000000000010a4:head to 39119'9526978
2019-03-31 06:26:55.961547 mon.prox1 mon.0 192.168.1.81:6789/0 745231 : cluster [WRN] Health check update: 489/5747523 objects misplaced (0.009%) (OBJECT_MISPLACED)
2019-03-31 06:26:56.191516 mon.prox1 mon.0 192.168.1.81:6789/0 745232 : cluster [WRN] Health check update: Degraded data redundancy: 733/5747523 objects degraded (0.013%), 111 pgs degraded (PG_DEGRADED)
2019-03-31 06:27:01.191946 mon.prox1 mon.0 192.168.1.81:6789/0 745233 : cluster [WRN] Health check update: 489/5747541 objects misplaced (0.009%) (OBJECT_MISPLACED)
2019-03-31 06:27:01.192047 mon.prox1 mon.0 192.168.1.81:6789/0 745234 : cluster [WRN] Health check update: Degraded data redundancy: 728/5747541 objects degraded (0.013%), 103 pgs degraded (PG_DEGRADED)
2019-03-31 06:27:06.192432 mon.prox1 mon.0 192.168.1.81:6789/0 745237 : cluster [WRN] Health check update: Degraded data redundancy: 677/5747541 objects degraded (0.012%), 76 pgs degraded (PG_DEGRADED)
2019-03-31 06:27:10.147576 mon.prox1 mon.0 192.168.1.81:6789/0 745239 : cluster [WRN] Health check update: 489/5747544 objects misplaced (0.009%) (OBJECT_MISPLACED)
2019-03-31 06:27:11.192871 mon.prox1 mon.0 192.168.1.81:6789/0 745240 : cluster [WRN] Health check update: Degraded data redundancy: 662/5747544 objects degraded (0.012%), 66 pgs degraded (PG_DEGRADED)
2019-03-31 06:27:15.339742 mon.prox1 mon.0 192.168.1.81:6789/0 745241 : cluster [WRN] Health check update: 489/5747613 objects misplaced (0.009%) (OBJECT_MISPLACED)
2019-03-31 06:27:16.240006 mon.prox1 mon.0 192.168.1.81:6789/0 745244 : cluster [WRN] Health check update: Degraded data redundancy: 631/5747613 objects degraded (0.011%), 49 pgs degraded (PG_DEGRADED)
2019-03-31 06:27:21.240485 mon.prox1 mon.0 192.168.1.81:6789/0 745247 : cluster [WRN] Health check update: Degraded data redundancy: 623/5747613 objects degraded (0.011%), 44 pgs degraded (PG_DEGRADED)
2019-03-31 06:27:26.240859 mon.prox1 mon.0 192.168.1.81:6789/0 745248 : cluster [WRN] Health check update: Degraded data redundancy: 598/5747613 objects degraded (0.010%), 26 pgs degraded (PG_DEGRADED)
2019-03-31 06:27:29.354948 mon.prox1 mon.0 192.168.1.81:6789/0 745252 : cluster [INF] osd.8 failed (root=default,host=prox7) (connection refused reported by osd.19)
2019-03-31 06:27:29.390502 mon.prox1 mon.0 192.168.1.81:6789/0 745287 : cluster [INF] osd.23 failed (root=default,host=prox6) (connection refused reported by osd.0)

In the individual OSD logs, I can see the OSD process crashing and restarting itself after about 10-20 seconds, after which it recovers. If you'd like a copy of these logs, I'm happy to supply them, but as they're about 0.5-1GB each, I didn't think it was a good idea to paste them straight in here ;)

As it seems to be a software issue / bug, rather than a configuration issue, I've raised a ticket with Ceph as well, but haven't heard anything back yet. If interested, you can find it here: https://tracker.ceph.com/issues/39055

Thanks for any light you can shed on this!

Alex
 
Yes, I've tried a scrub and a deep-scrub, but as the PG is undersized, it's not starting the scrub yet.
A scrub should also work during recovery. Try setting 'nobackfill', 'norecover' and maybe even 'norebalance' to stop the self-healing, and then try to deep-scrub the PG.
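Roughly something like this (1.3e4 being the PG from your output):

Code:
ceph osd set nobackfill
ceph osd set norecover
ceph osd set norebalance
ceph pg deep-scrub 1.3e4

The flags can be reverted later with 'ceph osd unset <flag>'.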

If you'd like a copy of these logs, I'm happy to supply them, but as they're about 0.5-1GB each, I didn't think it was a good idea to paste them straight in here ;)
Are there lines mentioning the PG in question?
 
Aahhh, I was looking for something like the nobackfill/norecover flags but couldn't find them - thanks for that in the meantime! This will at least take some stress away, now that the OSDs shouldn't be flapping every couple of minutes. ;)

The PG in question is 1.3e4, and I suspect the object causing the trouble is the one mentioned in this log entry?

2019-03-31 06:26:52.142607 osd.14 osd.14 192.168.1.81:6812/3099 8370 : cluster [INF] 1.3e4 continuing backfill to osd.23 from (27492'9481638,39119'9526978] 1:27efe7b3:::rbd_data.f21da26b8b4567.00000000000010a4:head to 39119'9526978

I've set all three flags and then instructed it first to do a scrub, gave it a minute, checked the PG, and then instructed a deep-scrub, but in both cases it seems it's still not taking it. See here the last couple of lines of "ceph pg 1.3e4 query":

Code:
    "scrub": {
        "scrubber.epoch_start": "0",
        "scrubber.active": false,
        "scrubber.state": "INACTIVE",
        "scrubber.start": "MIN",
        "scrubber.end": "MIN",
        "scrubber.max_end": "MIN",
        "scrubber.subset_last_update": "0'0",
        "scrubber.deep": false,
        "scrubber.waiting_on_whom": []
    }

Is there a way to delete a specific block object from Ceph? I've found a way to do this for filesystem-based systems, but as I'm using BlueStore, that won't work for me.
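The closest thing I've come across so far is ceph-objectstore-tool run against a stopped OSD. A rough, untested sketch of what I think the invocation would look like (data path for osd.14; PG and object taken from the log above):

Code:
# the OSD has to be stopped first (systemctl stop ceph-osd@14)
# list the objects in the PG to get the exact object spec:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-14 --pgid 1.3e4 --op list
# then remove a single object, using the JSON line printed by the list step:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-14 '<json-from-list-output>' remove

...but I'd rather hear from someone who has actually done this before I point it at a production OSD.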
 
See here the last couple of lines of "ceph pg 1.3e4 query"
What is the recovery state of the PG saying?

Is there a way to delete a certain block object from Ceph? I've found a way to do this for filesystem-based sytems, but as I'm using Bluestore, that won't work for me.
You could try with 'rados'; you could also check which VM the prefix 'rbd_data.f21da26b8b4567' belongs to (rbd info <image>).
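Something like this loop should find the image (pool name is a placeholder):

Code:
for img in $(rbd ls -p <pool>); do
    rbd info <pool>/$img | grep -q 'block_name_prefix: rbd_data.f21da26b8b4567' && echo "$img"
done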
 
What is the recovery state of the PG saying?

pg 1.3e4 is stuck undersized for 3880.378223, current state active+recovery_wait+undersized+degraded+remapped, last acting [14,16]


You could try with 'rados'; you could also check which VM the prefix 'rbd_data.f21da26b8b4567' belongs to (rbd info <image>).

Brilliant, now I'm finding out all the commands I've been struggling to decipher for the last week ;)
I've now indeed found out which VM it belongs to, and if I can't find the exact object by tonight, I'll try to move the VM disk to a different Ceph pool and see if that fixes it. (Seeing that it's a 1TB drive, it takes a while..)
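For anyone following along, the CLI equivalent of the GUI's Move Disk seems to be roughly the following - the VM ID, disk and target storage below are just placeholders for my setup:

Code:
# move the disk to a storage backed by the other pool and drop the source copy
qm move_disk 101 scsi0 <target-storage> --delete 1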
 
pg 1.3e4 is stuck undersized for 3880.378223, current state active+recovery_wait+undersized+degraded+remapped, last acting [14,16]
Ah, I meant the part from the query. :D

I've now indeed found out which VM it belongs to, and if I can't find the exact object by tonight, I'll try to move the VM disk to a different Ceph pool and see if that fixes it. (Seeing that it's a 1TB drive, it takes a while..)
If the other pool is also using the same OSDs, then the object may not travel away from the OSD (14).
 
Ah, I meant the part from the query. :D

Oops... :) For completeness, I've attached the full output of "ceph pg 1.3e4 query"

If the other pool is also using the same OSDs, then the object may not travel away from the OSD (14)

A couple of things I've done yesterday evening and this morning:
- Moved the VM disk to the different pool (using the Move Disk option in the Hardware section of the VM in the Proxmox GUI), and as the different pool has a different number of PGs, it should at least clean up the stuck PG. This worked, and the block_name_prefix has changed for this VM disk.
- Found the exact object using the rados tool (rados stat rbd_data.f21da26b8b4567.00000000000010a4 --pool <pool name>), and although I could find it yesterday, when I try the same command this morning (after the VM disk move), the object doesn't exist anymore. I've even dumped all rados object names in the pool (rados ls --pool <pool name> > radosblocks.txt) to manually check whether there were any remaining objects of this VM, but found none - see the recap below.
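For completeness, this is roughly what I ran to double-check (pool name is a placeholder):

Code:
# the object that keeps showing up in the backfill messages
rados stat rbd_data.f21da26b8b4567.00000000000010a4 --pool <pool name>
# dump all object names and look for anything left with the old prefix
rados ls --pool <pool name> > radosblocks.txt
grep f21da26b8b4567 radosblocks.txt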

This SHOULD be good news and the problem should have been resolved, were it not for the fact that Ceph is still trying to replicate the object...

These are the messages that I find in the ceph.log after removing the "norecover" flag, after which the disks crash again:
2019-04-03 09:55:01.994000 osd.14 osd.14 192.168.1.81:6812/861413 140 : cluster [INF] 1.3e4 continuing backfill to osd.23 from (46098'9548367,46574'9553313] 1:27efe7b3:::rbd_data.f21da26b8b4567.00000000000010a4:head to 46574'9553313
2019-04-03 09:55:01.994012 osd.14 osd.14 192.168.1.81:6812/861413 141 : cluster [INF] 1.3e4 continuing backfill to osd.26 from (46098'9548389,46574'9553313] 1:27efe7b3:::rbd_data.f21da26b8b4567.00000000000010a4:head to 46574'9553313

...so I get the impression that this object is stuck somewhere between rados and ceph.... Any suggestions? Thanks for your continued support ;)
 

Code:
"up_primary": 26,
"acting_primary": 14
Maybe a stop -> start of OSD 14; on my test system the primary values are identical OSD IDs.
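Something along these lines on the node hosting osd.14 (assuming the standard systemd units on PVE 5.x):

Code:
systemctl stop ceph-osd@14.service
# check 'ceph -s', then bring it back up
systemctl start ceph-osd@14.service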

EDIT: Up Set -> up_primary; Acting Set -> acting_primary;
Acting Set
The ordered list of OSDs who are (or were as of some epoch) responsible for a particular placement group.
Up Set
The ordered list of OSDs responsible for a particular placement group for a particular epoch according to CRUSH. Normally this is the same as the Acting Set, except when the Acting Set has been explicitly overridden via pg_temp in the OSD Map.
http://docs.ceph.com/docs/luminous/rados/operations/pg-concepts/
 
Code:
"up_primary": 26,
"acting_primary": 14
Maybe a stop -> start of OSD 14; on my test system the primary values are identical OSD IDs.

Hmm, that did something alright ;) The moment I stopped OSD 14, it also stopped OSDs 23 and 26 - the two it's currently trying to replicate to. Surprisingly, OSD 16, another acting drive for this PG which should also have the correct data for it, did NOT go down.

I've uploaded the log files for both osd.14 and osd.26, as well as the ceph.log file for that time period (sorry, they were too big to attach):
https://www.dropbox.com/sh/u1fs0g724f29rbw/AABaikJreUzMW3OwqhMd-Fsga?dl=0

Manual stop initiated for osd.14: 2019-04-03 12:53:50.168694
Manual start initiated for osd.14: 2019-04-03 12:56:19.508759

If interested, the ceph-osd.26.log also shows two previous events where it caused an OSD crash, around 10:45 and 10:53. For the osd.26 log, I've cut off the top 100k messages for good measure ;) All servers are in sync time-wise, so the times in the logs should correlate.

Thanks,
 
Not that anything in particular jumps out at me, but what if you shut down osd.14 and unset the norecover and nobackfill flags, and see if the cluster continues to recover? If it does, then something on osd.14 is the issue.
 
Not that anything in particular jumps out at me, but what if you shut down osd.14 and unset the norecover and nobackfill flags, and see if the cluster continues to recover? If it does, then something on osd.14 is the issue.

Fair suggestion, but I already tried that last week. There is still a copy of the data on osd.16, and that one then tries to replicate its data to the other ones, causing the same results.

Before I was aware of the norecover/nobackfill flags, the only way I had to stop the OSDs flapping was to stop osd.14 and osd.16, basically marking PG 1.3e4 down (...which came with its own raft of problems) ;)
 
These servers aren't under any heavy load and don't have data swapped out? What does 'ceph osd df tree' look like?
 
No, at the moment they're running only about 10-ish VMs (some websites and some Windows boxes) - all crucial VMs have been migrated to the backup cluster. Normally they run about 50 to 100-ish VMs that are used for a training lab environment.

Code:
root@prox7:~# ceph osd df tree
ID CLASS WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS TYPE NAME
-1 42.00000 - 42.8TiB 21.6TiB 21.2TiB 50.52 1.00 - root default
-3 11.00000 - 11.8TiB 5.30TiB 6.53TiB 44.80 0.89 - host prox1
0 hdd 2.00000 1.00000 1.82TiB 1.18TiB 655GiB 64.87 1.28 307 osd.0
2 hdd 3.00000 1.00000 3.64TiB 1.36TiB 2.28TiB 37.26 0.74 354 osd.2
6 hdd 2.00000 1.00000 2.73TiB 1.07TiB 1.66TiB 39.24 0.78 279 osd.6
13 hdd 2.00000 1.00000 1.82TiB 1.11TiB 727GiB 60.95 1.21 286 osd.13
14 hdd 2.00000 1.00000 1.82TiB 596GiB 1.24TiB 31.99 0.63 152 osd.14
-5 10.00000 - 9.10TiB 5.47TiB 3.63TiB 60.09 1.19 - host prox3
1 hdd 2.00000 1.00000 1.82TiB 1020GiB 843GiB 54.75 1.08 224 osd.1
7 hdd 2.00000 1.00000 1.82TiB 1.15TiB 682GiB 63.38 1.25 300 osd.7
9 hdd 2.00000 1.00000 1.82TiB 1.17TiB 666GiB 64.25 1.27 305 osd.9
10 hdd 2.00000 1.00000 1.82TiB 1009GiB 853GiB 54.19 1.07 259 osd.10
12 hdd 2.00000 1.00000 1.82TiB 1.16TiB 672GiB 63.91 1.26 298 osd.12
-11 10.00000 - 10.5TiB 5.28TiB 5.18TiB 50.52 1.00 - host prox6
16 hdd 1.00000 1.00000 1.82TiB 160GiB 1.66TiB 8.61 0.17 42 osd.16
19 hdd 2.00000 1.00000 1.82TiB 1.34TiB 495GiB 73.44 1.45 342 osd.19
20 hdd 1.00000 1.00000 931GiB 624GiB 307GiB 67.03 1.33 153 osd.20
21 hdd 1.00000 1.00000 1.36TiB 631GiB 767GiB 45.14 0.89 155 osd.21
22 hdd 1.00000 1.00000 931GiB 645GiB 287GiB 69.24 1.37 166 osd.22
23 hdd 1.00000 1.00000 931GiB 105GiB 826GiB 11.30 0.22 28 osd.23
24 hdd 1.00000 1.00000 931GiB 590GiB 342GiB 63.31 1.25 154 osd.24
25 hdd 2.00000 1.00000 1.82TiB 1.26TiB 575GiB 69.13 1.37 324 osd.25
-9 11.00000 - 11.4TiB 5.55TiB 5.82TiB 48.82 0.97 - host prox7
3 hdd 1.00000 1.00000 931GiB 662GiB 270GiB 71.03 1.41 169 osd.3
4 hdd 2.00000 1.00000 1.82TiB 1.13TiB 701GiB 62.38 1.23 298 osd.4
5 hdd 1.00000 1.00000 931GiB 604GiB 328GiB 64.80 1.28 153 osd.5
8 hdd 2.00000 1.00000 1.82TiB 1.11TiB 725GiB 61.06 1.21 285 osd.8
11 hdd 1.00000 1.00000 931GiB 468GiB 463GiB 50.29 1.00 118 osd.11
15 hdd 2.00000 1.00000 1.82TiB 1.02TiB 819GiB 56.04 1.11 264 osd.15
18 hdd 1.00000 1.00000 1.82TiB 284GiB 1.54TiB 15.22 0.30 68 osd.18
26 hdd 1.00000 1.00000 1.36TiB 323GiB 1.05TiB 23.12 0.46 80 osd.26
TOTAL 42.8TiB 21.6TiB 21.2TiB 50.52
MIN/MAX VAR: 0.17/1.45 STDDEV: 19.02
 
...Saying that, I did rebalance some of the OSDs yesterday to bring them back in (I had removed a bunch of them last week to see if it made a difference). It is still rebalancing the cluster and has about 12% of objects misplaced, which it is slowly putting back in the right place.
 
While not elegant, you could try to set the 'noout' flag and let the cluster recover; this will produce stuck requests if the OSDs in question go into a restart, but it may finish the recovery down to the broken one.
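Roughly, reverting the flags from earlier:

Code:
ceph osd set noout
ceph osd unset nobackfill
ceph osd unset norecover
ceph osd unset norebalance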

Another idea would be to move all VMs to a different pool and afterwards, when no data is left in the pool, mark the PG as lost. But this will still leave the question of why it happened in the first place.

TOTAL 42.8TiB 21.6TiB 21.2TiB 50.52 MIN/MAX VAR: 0.17/1.45 STDDEV: 19.02
Aside from that, your cluster is very unevenly balanced at the moment; this might introduce additional load on the cluster and can reduce the usable storage capacity.
 
While not elegant, you could try to set the 'noout' flag and let the cluster recover; this will produce stuck requests if the OSDs in question go into a restart, but it may finish the recovery down to the broken one.

I think I tried that last week, but let me indeed try again once the cluster is rebalanced and see what happens. From memory, I think it ended up just crashing the OSDs quicker ;)

Another idea would be to move all VMs to a different pool and afterwards, when no data is left in the pool, mark the PG as lost. But this will still leave the question of why it happened in the first place.

I've started to do that with a few disks - I was hoping I could avoid it, as there are a good 100 VMs or so and most of them rely on snapshots, which it seems I can't take along to the new pool. ...Maybe a good moment for me to do some housekeeping!
But yeah, it still makes me wonder what the root cause is and what the chance is that it pops up again.
Long shot, but have you got any idea whether there's a way to pull the raw blocks out of Ceph directly - or wipe them? (i.e. past the rados layer) I've had a look at ceph and ceph-osd but couldn't find anything usable.

Aside from that, your cluster is very unevenly balanced at the moment; this might introduce additional load on the cluster and can reduce the usable storage capacity.

Yeah, I've done that on purpose; I'm more worried about IOPS than storage capacity or CPU at the moment, as I'm relying heavily on HDDs without SSDs.
 
Long shot, but have you got any idea whether there's a way to pull the raw blocks out of Ceph directly - or wipe them? (i.e. past the rados layer) I've had a look at ceph and ceph-osd but couldn't find anything usable.
That's what I am wondering about as well, since it was removed from (and is no longer seen by) rados. Well, I guess it may be time to pump up the logging, especially for the OSDs. This will produce a lot of data, so increase logging only for a short time, just enough to catch the crashing OSDs and a little bit before and after. Maybe that will provide more clues.
http://docs.ceph.com/docs/luminous/rados/troubleshooting/log-and-debug/
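For example, something along these lines for the OSDs involved (adjust the IDs as needed; the second pair restores the default levels afterwards):

Code:
# raise the debug levels for a short window...
ceph tell osd.14 injectargs '--debug_osd 20 --debug_ms 1'
ceph tell osd.26 injectargs '--debug_osd 20 --debug_ms 1'
# ...reproduce the crash, save the logs, then go back to the defaults
ceph tell osd.14 injectargs '--debug_osd 1/5 --debug_ms 0/5'
ceph tell osd.26 injectargs '--debug_osd 1/5 --debug_ms 0/5'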

EDIT: Can you please tell me a little bit more about the cluster setup in general?
 
Thanks. I'm just moving the remainder of my VM disks across to the new pool - it should be finished by tonight/tomorrow morning, after which I'll rebalance the cluster and then enable the logging.

My cluster setup:
- 4 nodes, 24-32 CPU cores each, 128GB RAM each (CPU load normally between 10 and 40%)
- about 27 HDDs
- 1Gbit non-blocking switch (for both frontend and backend). I know I should really separate them, but for now I've kept a close eye on it and have not seen anything blocking in the system due to network congestion. (...Saving up for a 10G switch ;)
- Normally running between 50 and 100 VMs, most of which get reset back to a snapshot every week (they're used for training - are you familiar with F5?), as well as some small virtual network devices, webservers and some desktop environments.

Will keep you posted on the logging results, and thanks for your help so far! It's made a big difference.
 
