Hi Alwin,
I've got a nice steaming pile of log files for you!
https://www.dropbox.com/s/iykxek4hwqj3sj2/ceph-logs-extended.zip?dl=0
This contains:
OSD logs - I switched the log/memory debug levels to 20/20 (I assumed that was the best choice; rough commands are below this list)
Ceph log
Ceph Audit log
Ceph Mon/Mgr/Mds logs
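For reference, this is roughly how I raised and later restored the OSD debug levels (from memory, so treat the exact syntax as approximate - I believe it was debug_osd, but I may have touched other subsystems too):

  # raise OSD log/memory debug levels on all OSDs
  ceph tell osd.* injectargs '--debug_osd 20/20'

  # and later back to the defaults
  ceph tell osd.* injectargs '--debug_osd 1/5'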
Order of events:
Around 00:15, I cleared out most of the logs so they only show the relevant events (some I may have missed - sorry).
Then I set the log levels to 20/20 on the OSDs.
Around 00:16, I unset the norecover flag (the exact commands are listed after this timeline).
Almost immediately after, OSDs 17 and 23 crashed (presumably because there is now so little data in this pool, it got around to syncing PG 1.3e4 a lot quicker).
Then follow about 3 minutes of OSDs 17 and 23 flapping (17 more than 23, but both in pretty bad shape), with the cluster trying to recover in the meantime.
Around 00:20, I manually start OSD 23 again, as it doesn't seem to come back up on its own. That works for a moment; both OSDs are up for a few seconds, then both go down again.
Around 00:21, I set the norecover flag again, and within about 10-20 seconds both OSD 17 and 23 come back online and everything is stable again.
Finally, I set the log levels back to 1/5 and captured the log files.
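For completeness, the flag and restart steps in the timeline above were done with roughly these commands (again from memory; the manual start of OSD 23 via its systemd unit is my best recollection of what I ran):

  # 00:16 - allow recovery again
  ceph osd unset norecover

  # 00:20 - bring OSD 23 back up by hand
  systemctl start ceph-osd@23

  # 00:21 - stop recovery again to stabilise the cluster
  ceph osd set norecover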
Note: during the process there were still about 15% of objects in the wrong location, so that may add some background noise - sorry about that, I couldn't get the cluster to stabilize beforehand.
As I need to move forward with the cluster, I've now removed the offending pool, and although rebalancing isn't finished yet, things already look a lot better. So don't feel obliged to go through the log files if you don't have time, but hopefully they'll be useful to figure out what exactly happened and maybe fix a bug in the awesome Ceph code...