Ceph 16.2 Pacific cluster crash

Backups are important, m'kayyyy

Yo... seriously... you just saved my sanity!! I was going through my old NAS for old VMs, bringing them up one at a time to see how much data loss I was going to have to deal with, and then saw the response from @sippe.

As you suggested, I added the below to /etc/pve/ceph.conf, upgraded the packages, and rebooted the cluster.

[osd]
bluestore_allocator = bitmap

My CephFS is now up and running... VMs are up! One PG looks down, but 192/193 are up! I'll take it!
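If you ever want to confirm the override actually reached the daemons, the running value can be queried per OSD. These are standard Ceph commands; `osd.0` is just an example id to adjust for your cluster:

```shell
# Show the allocator value the cluster config has for osd.0
ceph config show osd.0 bluestore_allocator
# Ask the running daemon itself, via its admin socket on the OSD host
ceph daemon osd.0 config get bluestore_allocator
```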

Code:
root@sin-dc-196m-hv-cpu-001:~# ceph -s
  cluster:
    id:     9f0013c5-930e-4dc4-b359-02e8f0af74ad
    health: HEALTH_WARN
            1 nearfull osd(s)
            6 pgs not deep-scrubbed in time
            4 pool(s) nearfull
            94 daemons have recently crashed

  services:
    mon: 4 daemons, quorum sin-dc-196m-hv-cpu-001,sin-dc-196m-hv-cpu-002,sin-dc-196m-hv-cpu-003,sin-dc-196m-hv-cpu-004 (age 58m)
    mgr: sin-dc-196m-hv-cpu-001(active, since 65m), standbys: sin-dc-196m-hv-cpu-002, sin-dc-196m-hv-cpu-004, sin-dc-196m-hv-cpu-003
    mds: 3/3 daemons up, 1 standby
    osd: 4 osds: 4 up (since 58m), 4 in (since 69m)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 193 pgs
    objects: 451.33k objects, 1.7 TiB
    usage:   3.0 TiB used, 687 GiB / 3.6 TiB avail
    pgs:     192 active+clean
             1   active+clean+scrubbing+deep

  io:
    client: 12 KiB/s wr, 0 op/s rd, 0 op/s wr

Time to add more nodes with more space and, most importantly, be more vigilant about backups.
 
Glad to hear your system is back on the rails.

I suppose that after the update to Ceph 16.2.5 you don't need the bluestore_allocator = bitmap value in the configuration file anymore. It's easy to check which version you are using by looking at the Ceph section under Proxmox or by running ceph -v. It should tell you something like this: "ceph version 16.2.5 (9b9dd76e12f1907fe5dcc0c1fadadbb784022a42) pacific (stable)"
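One caveat: ceph -v only reports the locally installed packages. Before dropping the workaround, it's worth confirming that every running daemon has actually been restarted onto the new version; both of these are standard Ceph commands:

```shell
ceph -v          # version of the locally installed ceph binaries
ceph versions    # versions that every running mon/mgr/osd/mds actually reports
```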

Perhaps you can try asking Ceph to repair the damaged PG? https://docs.ceph.com/en/latest/rados/operations/pg-repair/ As you can see, there are plenty of commands to try for fixing it...
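Condensed from that page, the usual sequence looks like this (the pool name and PG id are placeholders to fill in from your own output):

```shell
ceph health detail                   # names any inconsistent/damaged PGs
rados list-inconsistent-pg <pool>    # list inconsistent PGs in one pool
ceph pg repair <pgid>                # a pgid such as 2.1f, taken from the output above
```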

And yes, you are right. Ceph is not a backup and RAID is not a backup. The only way to save your nerves is to make real backups. Actually, backing up VMs is very easy with Proxmox: just mount a NAS to your Proxmox system with iSCSI, SMB/CIFS, NFS or whatever you like to use, and set up an automatic backup schedule under the Datacenter section.
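As a rough CLI sketch of that setup (the storage name, server address, export path, and VM id are made-up examples; recurring schedules are configured under Datacenter -> Backup in the GUI):

```shell
# Register an NFS export from the NAS as a Proxmox storage for backups
pvesm add nfs nas-backup --server 192.168.1.50 \
    --export /volume1/pve-backup --content backup
# One-off test backup of VM 100 to that storage
vzdump 100 --storage nas-backup --mode snapshot --compress zstd
```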

Actually... your system says: 1 active+clean+scrubbing+deep. It's possible that Ceph can fix it by itself without your attention. It may take some time.
 
Yeah, I just let it do its thing, and all PGs are clean now. I couldn't be more relieved.

Setting up offsite backups now.
 
My final conclusion... This was a good disaster for me. I learned a lot about Ceph and realized that automatic snapshots/backups are a very good thing. There is no such thing as a safe update; something can always make things pretty messy. I used to make snapshots whenever I was going to update a VM, but now I have automated the whole thing offsite. So if a permanent disaster happens, I can start over without major data loss and use my snapshots. There is always something good when bad things happen. :) Perhaps it's a good moment to rebuild the whole system from scratch. My system is pretty old...
 
That's what I'm taking out of this as well...

It's a good opportunity to also reduce my VM footprint and stop procrastinating on moving stuff into containers. So many of these VMs are only running a single application that I could probably shrink the cluster's disk space usage by 55%. That would make backups even easier.
 
I have dedicated Docker and Kubernetes VMs running under Proxmox. Hopefully those will be supported directly in Proxmox some day. I must say that I like Docker containers. No need to buy a tanker when a tiny boat is enough. :)
 
For what it's worth, I had a similar issue which presented itself with:

Code:
Jul 15 00:28:53 sh-prox02 systemd[1]: ceph-osd@2.service: Scheduled restart job, restart counter is at 6.
Jul 15 00:28:53 sh-prox02 systemd[1]: Stopped Ceph object storage daemon osd.2.
Jul 15 00:28:53 sh-prox02 systemd[1]: ceph-osd@2.service: Consumed 14.647s CPU time.
Jul 15 00:28:53 sh-prox02 systemd[1]: ceph-osd@2.service: Start request repeated too quickly.
Jul 15 00:28:53 sh-prox02 systemd[1]: ceph-osd@2.service: Failed with result 'signal'.
Jul 15 00:28:53 sh-prox02 systemd[1]: Failed to start Ceph object storage daemon osd.2.


Digging through the logs revealed:
Code:
bluefs _allocate allocation failed, needed 0x400000

Looking at this bug report, there were some comments stating that changing bluestore_allocator from hybrid to bitmap seemed to triage the issue.
For my part, I've set bluestore_allocator to bitmap on my pure-NVMe OSDs, which allowed the OSDs to start.

https://tracker.ceph.com/issues/50656

EDIT: I was hit by this issue some time after upgrading from Ceph Octopus to Pacific, following a recent upgrade from PVE 6.4 to PVE 7.
Weirdly enough, it seemed to affect only the NVMe drives. All other drives (spinners (with db/wal on NVMe LVM) and ghetto SSDs) started fine and seem to be operational.
I just wanted to share that this post saved my butt. Thank you tmolberg and everyone else!

Our PVE cluster consumes storage from a standalone Ceph cluster running 16.2.4. About half of our 76 NVMe OSDs started crashing a few hours ago and my colleague's google-fu led us here. I only applied the bitmap fix to the NVMe OSDs; this cluster also has about 150 each of SSD and HDD OSDs. No ceph.conf for us these days:

Code:
ceph osd tree | grep -v root | grep \ nvme | awk '{ system ("ceph config set osd." $1 " bluestore_allocator bitmap")}'
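For anyone who wants to sanity-check the generated commands before running them, the same filter can key on the CLASS column and just print. The here-doc below mocks two rows of `ceph osd tree` output; in production, replace it with the real `ceph osd tree |` and pipe the result to `sh` once it looks right:

```shell
# Columns in `ceph osd tree` OSD rows: ID CLASS WEIGHT TYPE NAME ...
# $2 == "nvme" keeps only NVMe OSDs; the here-doc is mock output.
awk '$2 == "nvme" {print "ceph config set osd." $1 " bluestore_allocator bitmap"}' <<'EOF'
 0   nvme  0.90970  osd.0      up   1.00000  1.00000
 1   hdd   3.63869  osd.1      up   1.00000  1.00000
EOF
# prints: ceph config set osd.0 bluestore_allocator bitmap
```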

I'm going to be looking closely at the commit in 16.2.5 that was mentioned above. I would have already upgraded to 16.2.5, but I'm waiting on a bluestore patch that barely missed that release but should be in 16.2.6.

I have two other Ceph clusters (one standalone, one PVE hyperconverged) to keep an eye on now... FWIW, this cluster skipped from Nautilus to Pacific and some of the NVMe OSDs likely have been created in Luminous (or even Jewel???). o_O
 
