Ceph 16.2 Pacific cluster crash

Backups are important, m'kayyyy

Yo... seriously... you just saved my sanity!! I was going through my old NAS for old VMs, bringing them up one at a time to see how much data loss I was going to have to deal with, and then saw the response from @sippe.

As you suggested, I added the below to /etc/pve/ceph.conf, upgraded the packages, and rebooted the cluster.

[osd]
bluestore_allocator = bitmap

My CephFS is now up and running... VMs are up! One PG looks down, but 192/193 are up! I'll take it!
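If you ever want to confirm the override actually reached the daemons, the running value can be queried per OSD. These are standard Ceph commands; `osd.0` is just an example id to adjust for your cluster:

```shell
# Show the allocator value the cluster config has for osd.0
ceph config show osd.0 bluestore_allocator
# Ask the running daemon itself, via its admin socket on the OSD host
ceph daemon osd.0 config get bluestore_allocator
```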

Code:
root@sin-dc-196m-hv-cpu-001:~# ceph -s
  cluster:
    id:     9f0013c5-930e-4dc4-b359-02e8f0af74ad
    health: HEALTH_WARN
            1 nearfull osd(s)
            6 pgs not deep-scrubbed in time
            4 pool(s) nearfull
            94 daemons have recently crashed

  services:
    mon: 4 daemons, quorum sin-dc-196m-hv-cpu-001,sin-dc-196m-hv-cpu-002,sin-dc-196m-hv-cpu-003,sin-dc-196m-hv-cpu-004 (age 58m)
    mgr: sin-dc-196m-hv-cpu-001(active, since 65m), standbys: sin-dc-196m-hv-cpu-002, sin-dc-196m-hv-cpu-004, sin-dc-196m-hv-cpu-003
    mds: 3/3 daemons up, 1 standby
    osd: 4 osds: 4 up (since 58m), 4 in (since 69m)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 193 pgs
    objects: 451.33k objects, 1.7 TiB
    usage:   3.0 TiB used, 687 GiB / 3.6 TiB avail
    pgs:     192 active+clean
             1   active+clean+scrubbing+deep

  io:
    client: 12 KiB/s wr, 0 op/s rd, 0 op/s wr

Time to add more nodes with more space and, most importantly, be more vigilant about backups.
 
Glad to hear your system is back on the rails.

I suppose that after the update to Ceph 16.2.5 you don't need the bluestore_allocator = bitmap value in the configuration file anymore. It's easy to check which version you are using by looking at the Ceph section under Proxmox or by running ceph -v. It should tell you something like this: "ceph version 16.2.5 (9b9dd76e12f1907fe5dcc0c1fadadbb784022a42) pacific (stable)"
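One caveat: ceph -v only reports the locally installed packages. Before dropping the workaround, it's worth confirming that every running daemon has actually been restarted onto the new version; both of these are standard Ceph commands:

```shell
ceph -v          # version of the locally installed ceph binaries
ceph versions    # versions that every running mon/mgr/osd/mds actually reports
```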

Perhaps you can try asking Ceph to repair the damaged PG? https://docs.ceph.com/en/latest/rados/operations/pg-repair/ As you can see, there are plenty of commands to try for fixing it...
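Condensed from that page, the usual sequence looks like this (the pool name and PG id are placeholders to fill in from your own output):

```shell
ceph health detail                   # names any inconsistent/damaged PGs
rados list-inconsistent-pg <pool>    # list inconsistent PGs in one pool
ceph pg repair <pgid>                # a pgid such as 2.1f, taken from the output above
```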

And yes, you are right. Ceph is not a backup and RAID is not a backup. The only way to save your nerves is to make real backups. Actually, backing up VMs is very easy with Proxmox: just mount a NAS to your Proxmox system with iSCSI, SMB/CIFS, NFS or whatever you like to use, and set up an automatic backup schedule under the Datacenter section.
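As a rough CLI sketch of that setup (the storage name, server address, export path, and VM id are made-up examples; recurring schedules are configured under Datacenter -> Backup in the GUI):

```shell
# Register an NFS export from the NAS as a Proxmox storage for backups
pvesm add nfs nas-backup --server 192.168.1.50 \
    --export /volume1/pve-backup --content backup
# One-off test backup of VM 100 to that storage
vzdump 100 --storage nas-backup --mode snapshot --compress zstd
```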

Actually... your system says: 1 active+clean+scrubbing+deep. It's possible that Ceph can fix it by itself without your attention. It may take some time.
 
Yeah, I just let it do its thing, and all PGs are clean now. I couldn't be more relieved.

Setting up offsite backups now.
 
My final conclusion... This was a good disaster for me. I learned a lot about Ceph and realized that automatic snapshots/backups are a very good thing. There is no such thing as a safe update; something can always make things pretty messy. I used to make snapshots whenever I was going to update a VM, but now I have automated the whole thing offsite. So if a permanent disaster happens, I can start over without major data loss and use my snapshots. There is always something good when bad things happen. :) Perhaps it's a good moment to rebuild the whole system from scratch. My system is pretty old...
 
That's what I'm taking out of this as well...

It's a good opportunity to also reduce my VM footprint and stop procrastinating on moving stuff into containers. So many of these VMs are only running a single application that I could probably shrink the cluster's disk space usage by 55%. That would make backups even easier.
 
I have dedicated Docker and Kubernetes VMs running under Proxmox. Hopefully those will be supported directly in Proxmox some day. I must say that I like Docker containers. No need to buy a tanker when a tiny boat is enough. :)
 
For what it's worth, I had a similar issue which presented itself with:

Code:
Jul 15 00:28:53 sh-prox02 systemd[1]: ceph-osd@2.service: Scheduled restart job, restart counter is at 6.
Jul 15 00:28:53 sh-prox02 systemd[1]: Stopped Ceph object storage daemon osd.2.
Jul 15 00:28:53 sh-prox02 systemd[1]: ceph-osd@2.service: Consumed 14.647s CPU time.
Jul 15 00:28:53 sh-prox02 systemd[1]: ceph-osd@2.service: Start request repeated too quickly.
Jul 15 00:28:53 sh-prox02 systemd[1]: ceph-osd@2.service: Failed with result 'signal'.
Jul 15 00:28:53 sh-prox02 systemd[1]: Failed to start Ceph object storage daemon osd.2.


Digging through the logs revealed:
Code:
bluefs _allocate allocation failed, needed 0x400000

Looking at this bug report, there were some comments stating that changing bluestore_allocator from hybrid to bitmap seemed to triage the issue.
For my part, I've set bluestore_allocator to bitmap on my pure-NVMe OSDs, which allowed the OSDs to start.

https://tracker.ceph.com/issues/50656

EDIT: I was hit by this issue some time after upgrading from Ceph Octopus to Pacific, following a recent upgrade from PVE 6.4 to PVE 7.
Weirdly enough, it seemed to affect only the NVMe drives. All other drives (spinners (with db/wal on NVMe LVM) and ghetto SSDs) started fine and seem to be operational.
I just wanted to share that this post saved my butt. Thank you tmolberg and everyone else!

Our PVE cluster consumes storage from a standalone Ceph cluster running 16.2.4. About half of our 76 NVMe OSDs started crashing a few hours ago and my colleague's google-fu led us here. I only applied the bitmap fix to the NVMe OSDs; this cluster also has about 150 each of SSD and HDD OSDs. No ceph.conf for us these days:

Code:
ceph osd tree | grep -v root | grep \ nvme | awk '{ system ("ceph config set osd." $1 " bluestore_allocator bitmap")}'
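For anyone who wants to sanity-check the generated commands before running them, the same filter can key on the CLASS column and just print. The here-doc below mocks two rows of `ceph osd tree` output; in production, replace it with the real `ceph osd tree |` and pipe the result to `sh` once it looks right:

```shell
# Columns in `ceph osd tree` OSD rows: ID CLASS WEIGHT TYPE NAME ...
# $2 == "nvme" keeps only NVMe OSDs; the here-doc is mock output.
awk '$2 == "nvme" {print "ceph config set osd." $1 " bluestore_allocator bitmap"}' <<'EOF'
 0   nvme  0.90970  osd.0      up   1.00000  1.00000
 1   hdd   3.63869  osd.1      up   1.00000  1.00000
EOF
# prints: ceph config set osd.0 bluestore_allocator bitmap
```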

I'm going to be looking closely at the commit in 16.2.5 that was mentioned above. I would have already upgraded to 16.2.5, but I'm waiting on a bluestore patch that barely missed that release but should be in 16.2.6.

I have two other Ceph clusters (one standalone, one PVE hyperconverged) to keep an eye on now... FWIW, this cluster skipped from Nautilus to Pacific and some of the NVMe OSDs likely have been created in Luminous (or even Jewel???). o_O
 
