Hi Alwin,
I've got a nice steaming pile of log files for you!
https://www.dropbox.com/s/iykxek4hwqj3sj2/ceph-logs-extended.zip?dl=0
This contains:
OSD logs - I switched the log/memory debug levels to 20/20 (I assumed that was the best choice; rough commands are below this list)
Ceph log
Ceph Audit log
Ceph Mon/Mgr/Mds logs
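For reference, this is roughly how I raised and later restored the OSD debug levels (from memory, so treat the exact syntax as approximate - I believe it was debug_osd, but I may have touched other subsystems too):

  # raise OSD log/memory debug levels on all OSDs
  ceph tell osd.* injectargs '--debug_osd 20/20'

  # and later back to the defaults
  ceph tell osd.* injectargs '--debug_osd 1/5'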
Order of events:
Around 00:15, I cleared out most of the logs so they only show the relevant events (some I may have missed - sorry).
Then I set the log levels to 20/20 on the OSDs.
Around 00:16, I unset the norecover flag (the exact commands are listed after this timeline).
Almost immediately after, OSDs 17 and 23 crashed (presumably because there is now so little data in this pool, it got around to syncing PG 1.3e4 a lot quicker).
Then follow about 3 minutes of OSDs 17 and 23 flapping (17 more than 23, but both in pretty bad shape), with the cluster trying to recover in the meantime.
Around 00:20, I manually start OSD 23 again, as it doesn't seem to come back up on its own. That works for a moment; both OSDs are up for a few seconds, then both go down again.
Around 00:21, I set the norecover flag again, and within about 10-20 seconds both OSD 17 and 23 come back online and everything is stable again.
Finally, I set the log levels back to 1/5 and captured the log files.
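For completeness, the flag and restart steps in the timeline above were done with roughly these commands (again from memory; the manual start of OSD 23 via its systemd unit is my best recollection of what I ran):

  # 00:16 - allow recovery again
  ceph osd unset norecover

  # 00:20 - bring OSD 23 back up by hand
  systemctl start ceph-osd@23

  # 00:21 - stop recovery again to stabilise the cluster
  ceph osd set norecover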
Note: during the process there were still about 15% of objects in the wrong location, so that may add some background noise - sorry about that, I couldn't get the cluster to stabilize beforehand.
As I need to move forward with the cluster, I've now removed the offending pool, and although rebalancing isn't finished yet, things already look a lot better. So don't feel obliged to go through the log files if you don't have time, but hopefully they'll be useful to figure out what exactly happened and maybe fix a bug in the awesome Ceph code...