Ceph - OSD's crashing when trying to backfill a specific PG

- 1Gbit non-blocking switch (for both frontend and backend). I know I should really separate them, but for now I've kept a close eye on it and haven't seen any blocking in the system due to network congestion. (...saving up for a 10G switch ;)
You could use a full mesh network and skip the switch. ;)
https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server

- Normally running between 50-100 VMs, most of which get reset back to a snapshot every week (they are used for training - are you familiar with F5?), as well as some small virtual network devices, webservers and some desktop environments.
Yup. We do the same for our training environment, but use local storage.
 
Hi Alwin,

I've got a nice steaming pile of log files for you!
https://www.dropbox.com/s/iykxek4hwqj3sj2/ceph-logs-extended.zip?dl=0

This contains:
OSD logs - switched the log/memory levels to 20/20 (I assumed that was the best choice; see the command sketch after this list)
Ceph log
Ceph Audit log
Ceph Mon/Mgr/Mds logs
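
For anyone reading along later, the OSD debug levels can be changed at runtime; it comes down to something like this (a sketch, not necessarily the exact invocation I used):

    # raise OSD debug logging to 20/20 (log level / in-memory level) on the affected OSDs
    ceph tell osd.17 injectargs '--debug_osd 20/20'
    ceph tell osd.23 injectargs '--debug_osd 20/20'

    # and back to the defaults once the logs are captured
    ceph tell osd.17 injectargs '--debug_osd 1/5'
    ceph tell osd.23 injectargs '--debug_osd 1/5'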

Order of events (the corresponding commands are sketched after this list):
Around 00:15, I cleared out most logs so they would only show the exact events (some I may have missed - sorry).
Then I set the log levels to 20/20 on the OSDs.
Around 00:16, I unset the norecover flag.
This was almost immediately followed by OSDs 17 and 23 crashing (presumably because there is now so little data in this pool, it got around to syncing the data for PG 1.3e4 a lot quicker).
Then followed about 3 minutes of OSDs 17 and 23 flapping (17 flapping more than 23, but both in pretty bad shape), with the cluster trying to recover in the meantime.
Around 00:20, I manually started OSD 23 again, as it didn't seem to come back up on its own; that worked, both OSDs were up for a few seconds, and then both went down again.
Around 00:21, I set the norecover flag again, and within about 10 - 20 seconds both OSD 17 and 23 came back online and everything was stable again.
I then set the log levels back to 1/5 and captured the log files.
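
For completeness, the flag and restart steps above boil down to commands along these lines (again a sketch; osd.23 stands in for whichever daemon refuses to come back on its own):

    # 00:16 - let recovery/backfill run again
    ceph osd unset norecover

    # 00:20 - manually bring OSD 23 back up (systemd unit on Proxmox)
    systemctl start ceph-osd@23

    # 00:21 - stop recovery again so the OSDs stay up
    ceph osd set norecover

    # watching the flapping in the meantime
    ceph -w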

Note: during the process about 15% of the objects were still in the wrong locations, so that may add some background noise - sorry for that - I couldn't get the cluster to stabilize.

As I need to move forward with the cluster, I've now removed the offending pool, and although it's not done rebalancing yet, things are already looking a lot better. As such, don't feel obliged to look through the log files if you don't have time for it, but hopefully they will be useful to figure out what exactly happened and maybe fix a bug in the awesome Ceph code... ;)
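
In case anyone wants to do the same, removing a pool needs the monitor safety switch flipped first; roughly (a sketch - 'training-pool' is just a placeholder for the actual pool name):

    # allow pool deletion on the monitors (off by default)
    ceph tell mon.* injectargs '--mon_allow_pool_delete=true'

    # delete the pool; the name has to be given twice on purpose
    ceph osd pool delete training-pool training-pool --yes-i-really-really-mean-it

    # turn the safety back on afterwards
    ceph tell mon.* injectargs '--mon_allow_pool_delete=false'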
 
After skimming through the log, the only thing that I could see so far was that the failing object belonged to a snapshot.
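
If anyone wants to reproduce that check, the crash context can usually be pulled out of the crashed OSD's log and the PG state with something like the following (a sketch; the exact assert string depends on the Ceph release):

    # look for the backtrace around the crash and the object/snapshot it was handling
    # (the marker may read 'FAILED ceph_assert' on newer releases)
    grep -B 5 -A 30 'FAILED assert' /var/log/ceph/ceph-osd.17.log

    # query the problematic PG; snapshot clones show up in the recovery/missing entries
    ceph pg 1.3e4 query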
 