[SOLVED] Ceph Health - backfillfull / OSDs marked as out and down

Hello Proxmox community,

Today we noticed a health error on our cluster with the following message:

Code:
HEALTH_ERR 1 backfillfull osd(s); 1 nearfull osd(s); 1 pool(s) backfillfull; Degraded data redundancy: 99961/8029671 objects degraded (1.245%), 19 pgs degraded, 19 pgs undersized; Degraded data redundancy (low space): 19 pgs backfill_toofull
For more detail:
Code:
HEALTH_ERR 1 backfillfull osd(s); 1 nearfull osd(s); 1 pool(s) backfillfull; Degraded data redundancy: 99961/8029671 objects degraded (1.245%), 19 pgs degraded, 19 pgs undersized; Degraded data redundancy (low space): 19 pgs backfill_toofull
OSD_BACKFILLFULL 1 backfillfull osd(s)
    osd.10 is backfill full
OSD_NEARFULL 1 nearfull osd(s)
    osd.13 is near full
POOL_BACKFILLFULL 1 pool(s) backfillfull
    pool 'hdd_mainpool' is backfillfull
PG_DEGRADED Degraded data redundancy: 99961/8029671 objects degraded (1.245%), 19 pgs degraded, 19 pgs undersized
    pg 10.0 is stuck undersized for 304111.529099, current state active+undersized+degraded+remapped+backfill_toofull, last acting [1,6]
    pg 10.3 is stuck undersized for 1702843.879105, current state active+undersized+degraded+remapped+backfill_toofull, last acting [2,8]
    pg 10.4f is stuck undersized for 304111.866785, current state active+undersized+degraded+remapped+backfill_toofull, last acting [3,6]
    pg 10.59 is stuck undersized for 304111.866899, current state active+undersized+degraded+remapped+backfill_toofull, last acting [3,6]
    pg 10.6b is stuck undersized for 1702843.872904, current state active+undersized+degraded+remapped+backfill_toofull, last acting [2,7]
    pg 10.a3 is stuck undersized for 1702843.822618, current state active+undersized+degraded+remapped+backfill_toofull, last acting [1,9]
    pg 10.a7 is stuck undersized for 304111.865494, current state active+undersized+degraded+remapped+backfill_toofull, last acting [3,6]
    pg 10.ba is stuck undersized for 304111.529376, current state active+undersized+degraded+remapped+backfill_toofull, last acting [1,6]
    pg 10.c7 is stuck undersized for 304111.866182, current state active+undersized+degraded+remapped+backfill_toofull, last acting [3,6]
    pg 10.d4 is stuck undersized for 1702843.875008, current state active+undersized+degraded+remapped+backfill_toofull, last acting [2,7]
    pg 10.106 is stuck undersized for 1702843.815712, current state active+undersized+degraded+remapped+backfill_toofull, last acting [3,7]
    pg 10.109 is stuck undersized for 1702843.856827, current state active+undersized+degraded+remapped+backfill_toofull, last acting [8,3]
    pg 10.110 is stuck undersized for 1702843.814649, current state active+undersized+degraded+remapped+backfill_toofull, last acting [3,9]
    pg 10.12b is stuck undersized for 1702843.808098, current state active+undersized+degraded+remapped+backfill_toofull, last acting [3,9]
    pg 10.1a4 is stuck undersized for 304111.867262, current state active+undersized+degraded+remapped+backfill_toofull, last acting [3,6]
    pg 10.1c2 is stuck undersized for 1702843.812091, current state active+undersized+degraded+remapped+backfill_toofull, last acting [1,8]
    pg 10.1d9 is stuck undersized for 304111.867532, current state active+undersized+degraded+remapped+backfill_toofull, last acting [3,6]
    pg 10.1e1 is stuck undersized for 1702843.817784, current state active+undersized+degraded+remapped+backfill_toofull, last acting [3,8]
    pg 10.1e8 is stuck undersized for 1702843.813599, current state active+undersized+degraded+remapped+backfill_toofull, last acting [3,9]
PG_DEGRADED_FULL Degraded data redundancy (low space): 19 pgs backfill_toofull
    pg 10.0 is active+undersized+degraded+remapped+backfill_toofull, acting [1,6]
    pg 10.3 is active+undersized+degraded+remapped+backfill_toofull, acting [2,8]
    pg 10.4f is active+undersized+degraded+remapped+backfill_toofull, acting [3,6]
    pg 10.59 is active+undersized+degraded+remapped+backfill_toofull, acting [3,6]
    pg 10.6b is active+undersized+degraded+remapped+backfill_toofull, acting [2,7]
    pg 10.a3 is active+undersized+degraded+remapped+backfill_toofull, acting [1,9]
    pg 10.a7 is active+undersized+degraded+remapped+backfill_toofull, acting [3,6]
    pg 10.ba is active+undersized+degraded+remapped+backfill_toofull, acting [1,6]
    pg 10.c7 is active+undersized+degraded+remapped+backfill_toofull, acting [3,6]
    pg 10.d4 is active+undersized+degraded+remapped+backfill_toofull, acting [2,7]
    pg 10.106 is active+undersized+degraded+remapped+backfill_toofull, acting [3,7]
    pg 10.109 is active+undersized+degraded+remapped+backfill_toofull, acting [8,3]
    pg 10.110 is active+undersized+degraded+remapped+backfill_toofull, acting [3,9]
    pg 10.12b is active+undersized+degraded+remapped+backfill_toofull, acting [3,9]
    pg 10.1a4 is active+undersized+degraded+remapped+backfill_toofull, acting [3,6]
    pg 10.1c2 is active+undersized+degraded+remapped+backfill_toofull, acting [1,8]
    pg 10.1d9 is active+undersized+degraded+remapped+backfill_toofull, acting [3,6]
    pg 10.1e1 is active+undersized+degraded+remapped+backfill_toofull, acting [3,8]
    pg 10.1e8 is active+undersized+degraded+remapped+backfill_toofull, acting [3,9]
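To see which OSDs are actually over the thresholds, the per-OSD utilisation and the configured ratios can be checked, and the ratios can be raised slightly to let the backfill continue. This is only a sketch; the values below are examples just above the Luminous defaults (nearfull 0.85, backfillfull 0.90) and should be treated as a temporary measure:
Code:
# per-OSD utilisation, weight and variance
ceph osd df tree

# currently configured full/backfillfull/nearfull ratios
ceph osd dump | grep ratio

# temporarily raise the thresholds (example values, revert after rebalancing)
ceph osd set-nearfull-ratio 0.90
ceph osd set-backfillfull-ratio 0.92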

Also, two OSDs went down and out for no apparent reason.
Code:
ceph osd tree
ID  CLASS WEIGHT   TYPE NAME      STATUS REWEIGHT PRI-AFF
 11   hdd  5.45740         osd.11   down        0 1.00000
 12   hdd  5.45789         osd.12   down        0 1.00000

We set the OSDs back to "in" and tried to start them, but had no success (no response):
Code:
sudo systemctl start ceph-osd@12
sudo systemctl start ceph-osd@11
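If the start commands give no response, the unit status and journal usually show why an OSD refuses to come up; a minimal check, using osd.11 as the example ID:
Code:
systemctl status ceph-osd@11
journalctl -u ceph-osd@11 -n 100 --no-pager
# the OSD's own log is usually the most detailed
tail -n 200 /var/log/ceph/ceph-osd.11.log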
 
There are some errors in the logs that I can't interpret and don't know how to fix, such as:
Code:
/var/log/ceph/ceph-osd.11.log
Code:
     0> 2019-05-09 13:03:43.860878 7fb79ab8e700 -1 /mnt/pve/store/tlamprecht/sources/ceph/ceph-12.2.12/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_kv_sync_thread()' thread 7fb79ab8e700 time 2019-05-09 13:03:43.859862
/mnt/pve/store/tlamprecht/sources/ceph/ceph-12.2.12/src/os/bluestore/BlueStore.cc: 8877: FAILED assert(r == 0)
Code:
    -3> 2019-05-09 13:03:42.882799 7fb79938b700  2 rocksdb: [/mnt/pve/store/tlamprecht/sources/ceph/ceph-12.2.12/src/rocksdb/db/db_impl_compaction_flush.cc:1275] Waiting after background compaction error: Corruption: block checksum mismatch, Accumulated background error counts: 1
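The "block checksum mismatch" means RocksDB read data that fails its checksum, which usually points at the underlying device or controller, so checking SMART data may be worthwhile. A minimal sketch, assuming smartmontools is installed; /dev/sdX is a placeholder for the disk behind osd.11, and behind a MegaRAID controller the megaraid device type with the controller's device ID N is needed:
Code:
smartctl -a /dev/sdX
# if the disk sits behind an LSI/MegaRAID controller:
smartctl -a -d megaraid,N /dev/sdX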
 
Can you please post more than just these 2 lines?
Before digging into this, was your cluster near full before those OSDs went down?
 
Before digging into this, was your cluster near full before those OSDs went down?
No, the cluster wasn't near full.

Current usage (attached screenshot): upload_2019-5-9_14-47-13.png

Maybe the problem is that we set our replica count to 3 and we have 4 OSDs per node.
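For reference, the pool's replication settings can be confirmed like this (hdd_mainpool is the pool name from the health output above):
Code:
ceph osd pool get hdd_mainpool size
ceph osd pool get hdd_mainpool min_size
# the CRUSH rule decides whether replicas are spread across hosts or OSDs
ceph osd pool get hdd_mainpool crush_rule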

Can you please post more than just these 2 lines?
I have uploaded the log file. I hope it helps.
 

Attachments

  • ceph-osd.11.zip (362.9 KB)
We fixed the problem by removing the disks and reusing the same disks again. We don't know why Ceph threw out the OSDs even though the disks are still good.
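For anyone searching for the rough procedure: on PVE 5.x with Luminous, removing and re-creating an OSD on the same disk looks roughly like the sketch below. The OSD ID and device name are examples, and the exact pveceph subcommands differ on newer releases:
Code:
# take the broken OSD out and stop it (osd.11 as example)
ceph osd out 11
systemctl stop ceph-osd@11
# remove it from the cluster
pveceph destroyosd 11
# wipe the disk and create a fresh OSD on it (/dev/sdX is a placeholder)
ceph-volume lvm zap /dev/sdX --destroy
pveceph createosd /dev/sdX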
Currently, we are struggling with an LSI MegaRAID controller.
We put the same disk back into the same slot on the RAID controller (RAID0 / single non-RAID disk), but the controller doesn't pass the disk through to Proxmox.
 
We strongly advise against using any RAID controller with Ceph. I'm very sorry that you are facing these issues, but that's exactly why we always recommend not using RAID controllers with Ceph or ZFS.
 
Thank you, Tim, for the information.

In case anyone wants to know how we fixed it:

As already mentioned, we have an LSI MegaRAID MR9260-i4 controller. This controller isn't able to put a disk into JBOD mode.
The workaround is to create a RAID0 with a single disk.

We noticed that the RAID0 volume associated with the missing disk was gone.
We don't know how that happened, but the disk was no longer in any RAID array (unconfigured(good)).

So we created a new RAID0 and added the disk to it.
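For reference, on these controllers a single-drive RAID0 can also be created from the OS with MegaCli (or storcli) instead of the controller BIOS; the enclosure:slot and adapter numbers below are placeholders that have to be looked up first:
Code:
# find the enclosure:slot of the unconfigured disk
MegaCli64 -PDList -aALL | grep -E "Enclosure Device ID|Slot Number|Firmware state"
# create a single-drive RAID0 (example: enclosure 252, slot 2, adapter 0)
MegaCli64 -CfgLdAdd -r0 [252:2] -a0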

After that, we were able to see the disk again in the OS.

Our Ceph status changed from error to warning. Now we just have to wait while Ceph heals the "misplaced objects".
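The recovery progress can be followed with the usual status commands, for example:
Code:
# overall health and recovery/backfill progress
ceph -s
# follow cluster events continuously
ceph -w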
 
