Please help! Ceph pool stuck at "undersized+degraded+remapped+backfill_toofull+peered"

proxwolfe

Backstory:

I have a three-node Ceph cluster with 3 disks (one per node). Somehow (still to be investigated) two OSDs went down. The third one started backfilling. As the whole cluster became unresponsive, I started looking for the cause and found the pool in this state. Since the two OSDs couldn't be restarted, I destroyed and recreated them. The pool started rebalancing but stopped short of being whole again (at 99.xy%). 4 PGs are "undersized+degraded+remapped+backfill_toofull+peered". I assumed that was because two of the three disks had reached their limits (at slightly under 90%), while the third was over its limit (at 93%), which it had reached while backfilling when the other two were down and out.


So I thought: there is simply too much data in the pool overall. Deleting some experimental VMs did not change the result (surprisingly). Neither did scrubbing.

Then I had what I thought was a stroke of genius: add another disk to the node with the overly full OSD, reweight that OSD down, and watch the data rebalance between the two, bringing the utilization of the old one back below its critical threshold.
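(For reference: this kind of reweighting can be done on the CLI with commands along these lines - osd.3 and the values here are only illustrative, not the exact figures used.)

ceph osd crush reweight osd.3 0.30   # permanently lower the CRUSH weight (roughly TiB of capacity)
ceph osd reweight 3 0.80             # or: temporary override between 0.0 and 1.0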

Well, the rebalancing happened: while the other two almost-full OSDs didn't change (as planned), the overly full OSD went down to 72% (and the new OSD went up to almost 90% - it is smaller than the other three).

BUT: for some reason I fail to see, that did not change anything about the 4 PGs being "undersized+degraded+remapped+backfill_toofull+peered", which seems to be the reason the pool is still completely unusable. It did clear some warnings, but I would have expected Ceph to take care of the worst problems first.

Since then I have tinkered with the weights of the old, previously overly full OSD and the new one. It keeps rebalancing but keeps stopping (by smaller and smaller margins, but still) short of 100%, and the pool remains unresponsive.

Can someone who understands how Ceph works (I don't) please tell me how to sort out this mess?

Thanks!
 
Well, I don't know about that.

But this is from the config:

osd_pool_default_min_size = 2
osd_pool_default_size = 3
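(Those are only the cluster-wide defaults; the effective values for a given pool can be checked with, e.g., the following, where <poolname> is a placeholder:)

ceph osd pool get <poolname> size
ceph osd pool get <poolname> min_size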
 
ceph health:
HEALTH_WARN 2 nearfull osd(s); Reduced data availability: 4 pgs inactive; Low space hindering backfill (add storage if this doesn't resolve itself): 4 pgs backfill_toofull; Degraded data redundancy: 7240/1384452 objects degraded (0.523%), 4 pgs degraded, 4 pgs undersized; 4 pool(s) nearfull

ceph osd tree:
ID  CLASS  WEIGHT   TYPE NAME             STATUS  REWEIGHT  PRI-AFF
-1         9.77612  root default
-3         3.43817      host tx1330m2-1
 0  hdd    2.72899          osd.0         up      1.00000   1.00000
 3  nvme   0.59999          osd.3         up      1.00000   1.00000
 6  nvme   0.10919          osd.6         up      1.00000   1.00000
-5         3.16898      host tx1330m3-1
 1  hdd    2.72899          osd.1         up      1.00000   1.00000
 4  nvme   0.43999          osd.4         up      1.00000   1.00000
-7         3.16898      host vtx1330m2-2
 2  hdd    2.72899          osd.2         up      1.00000   1.00000
 5  nvme   0.43999          osd.5         up      1.00000   1.00000

osd.0, osd.1 and osd.2 are not part of the problem and belong to another pool.

Originally, the weight of osd.3, osd.4 and osd.5 was 0.45-something (they are all the same size: 500 GB). osd.3 is the one that was beyond 90% before I added osd.6 (120 GB) and reduced the weight of osd.3.

When that didn't help, I started increasing the weight of osd.3 again. When that didn't help, I tried reducing the weights of osd.4 and osd.5. But that, too, didn't help.
 
Looks like your pool(s) are too full. Add more drives.

Also, the lopsided OSD distribution means that not all of your physical capacity can actually be used. You have ~700G of NVMe OSD space on node m2-1, but only ~440G on each of the other two nodes; the total theoretical space you can actually USE is ~80% of 440G, or ~350GB. And because you have that 120GB NVMe in there, it distorts the distribution and potentially leads to an even smaller usable capacity.

This isn't really workable for production.
 
Yeah, it wasn't planned for production like this. I used to have only 3 x 500 GB.

I added the additional 120 GB disk to try to get the OSD that had gone above 90% back down below 90%. So that wasn't the great idea I took it for...

So what do I do now? Add a 120 GB disk to each of the other nodes? Or swap the 500 GB disks in the other nodes for 1 TB ones?
 
Please show the output of
ceph osd pool ls detail
pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 1377 flags hashpspool,nearfull stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
pool 2 'pool_nvme' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 1377 flags hashpspool,nearfull,selfmanaged_snaps stripe_width 0 application rbd
pool 3 'pool_hdd' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 643 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 4 'cephfs_data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 1377 flags hashpspool,nearfull stripe_width 0 application cephfs
pool 5 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 1377 flags hashpspool,nearfull stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
 
So what do I do now? Add a 120 GB disk to each of the other nodes? Or swap the 500 GB disks in the other nodes for 1 TB ones?
There's a short-term answer and a long-term answer.

If you just want to regain use of your pool TEMPORARILY, you can raise the cutoff thresholds to a higher percentage (ceph osd set-full-ratio, ceph osd set-backfillfull-ratio), for example as shown below.
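(A sketch with illustrative values only - the defaults are nearfull 0.85, backfillfull 0.90 and full 0.95, and you should set them back once the pool has been cleaned up:)

ceph osd set-backfillfull-ratio 0.92
ceph osd set-full-ratio 0.97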

Long term: redeploy with a sane configuration. Have a minimum of 4 OSDs per node, all the same size, with a total capacity of at MINIMUM 375% of your data (3 replicas divided by the ~80% fill limit is roughly 3.75x), e.g.:

Say you have 80G of data; 80G is 80% of 100G per replica, so 4 x 25G per node x 3 nodes = 300GB raw (375% of 80G).
 
Thanks. For the moment, I would be happy to just get the pool back to a working state. In the long run, I will redeploy.

But I still don't understand (and would be happy to learn): while the current configuration is less than ideal (or may even be terrible), what it apparently did achieve was to bring the overfull OSD down below 90%. That would seem to mean it is now below the (standard) backfill threshold, so it shouldn't be backfill_toofull anymore. And if that is the case, those 4 PGs should not need to stay disabled. Or maybe I just completely misunderstand how this all fits together...

Or are the 4 PGs still disabled because Ceph can't re-allocate them without going over the backfill-threshold again?
 

You have a 3:2 pool configuration (size 3, min_size 2).

That means each placement group needs 3 OSDs (on three separate nodes) in order to be whole. (It will allow you to complete an operation with 2 OSDs, but the PG will not be "whole" until a third is available.)

Once the ONLY OSDs for this pool on nodes m3-1 and m2-2 tripped their full threshold, they both became unavailable to process NEW commits (e.g. writes, deletions, etc.). With only one node remaining available, it didn't matter that you have 2 OSDs on it; the rules say you need 3 separate nodes to process a write.

Your virtual machines are oblivious to this predicament and proceeded to operate normally until a sync-to-disk operation did not return an acknowledgement. The write IS pending and will not be released: the transaction(s) completed to the one OSD and will continue to wait until there are OSDs available on the other nodes OR you knock the whole system over. Until then, those 4 PGs will remain in a locked state so as not to lose any committed data. (Before you conclude that knocking the system over is an option: chances are that when the system comes back up and the VMs boot, you're just going to create more locked PGs.)
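(To see exactly which PGs are affected and what they are waiting for, the usual queries are, e.g.:)

ceph health detail
ceph pg dump_stuck undersized degraded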

while the current configuration is less than ideal (or may even be terrible), what it apparently did achieve was to bring the overfull OSD down below 90%,
I don't really know what this is in reference to.
 
Your cephfs_data and cephfs_metadata pools use the default CRUSH rule, which means they place objects without looking at the device class. This is why you saw some recovery going on.
I recommend changing that: you do not want to mix HDD and NVMe OSDs in one pool.
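(A sketch of what that change typically looks like - the rule name replicated_nvme is just an example, the pool names are taken from the ls detail output above:)

ceph osd crush rule create-replicated replicated_nvme default host nvme
ceph osd pool set cephfs_data crush_rule replicated_nvme
ceph osd pool set cephfs_metadata crush_rule replicated_nvme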
 
I don't really know what this is in reference to.
My understanding was that the standard backfill threshold is 90%. So I thought that once all OSDs are under 90%, backfilling should be possible again. And if that were possible, there should be no backfill_toofull anymore and my problem would go away. I guess I was wrong. As I said, I don't know how Ceph works... still trying to find my feet.

So for the moment, I understand from one of your previous posts that I could change the thresholds:
If you just want to regain use of your pool TEMPORARILY, you can raise the cutoff thresholds to a higher percentage (ceph osd set-full-ratio, ceph osd set-backfillfull-ratio)
That will allow me to delete some unnecessary stuff and migrate my VMs off the pool, right? So what would I set the thresholds to? Both to 95% (a bit above what is used now)?

Thanks!
 
My understanding was that the standard backfill threshold is 90%. So I thought that once all OSDs are under 90%,
I haven't seen any evidence of this. Where are you looking for your OSD utilization? The simplest way to get it is
ceph osd df tree

Also, if the pending writes would push an OSD over 90% once written, they may trigger the threshold even though the OSDs aren't over the limit yet.

That will allow me to delete some unnecessary stuff and migrate my VMs off the pool, right?
Yes, as long as your pending writes aren't going to push you past the threshold again. Also, as the disks get FULL, bad things can happen, which is why the limit exists. There is also the nuclear option, which is to reduce the pool's min_size (the minimum number of OSDs required per write) to 1 - bear in mind that this will allow writes to be acknowledged by a single OSD, which means those writes have no data integrity assurance at all, but it will allow you to resume operation.
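(A sketch, using the pool name from the ls detail output - last resort only, and revert afterwards:)

ceph osd pool set pool_nvme min_size 1
# ...and once the data has been moved off / the pool has recovered:
ceph osd pool set pool_nvme min_size 2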
 
Your cephfs_data and cephfs_metadata pools use the default CRUSH rule, which means they place objects without looking at the device class. This is why you saw some recovery going on.
I recommend changing that: you do not want to mix HDD and NVMe OSDs in one pool.
I must have missed that when I changed the rule for the normal pool.

Thanks for pointing that out to me. I shall change that as soon as I am back in the game.
 
I haven't seen any evidence of this. Where are you looking for your OSD utilization? The simplest way to get it is
ceph osd df tree
I wrote that in my OP - but I didn't provide any "evidence" from the system. I checked with ceph osd df tree and it shows the same values as PVE does in the GUI under Ceph > OSD. That is where I took my numbers from.

The values are currently:

osd.3: 71.69%
osd.4: 89.53%
osd.5: 89.51%
osd.6: 87.38%

So this brings me back to my above question:
But I still don't understand (and would be happy to learn): while the current configuration is less than ideal (or may even be terrible), what it apparently did achieve was to bring the overfull OSD down below 90%. That would seem to mean it is now below the (standard) backfill threshold, so it shouldn't be backfill_toofull anymore. And if that is the case, those 4 PGs should not need to stay disabled. Or maybe I just completely misunderstand how this all fits together...

Or are the 4 PGs still disabled because Ceph can't re-allocate them without going over the backfill-threshold again?
And I guess you answered it:
Also, if the pending writes would push an OSD over 90% once written, they may trigger the threshold even though the OSDs aren't over the limit yet.
But with osd.3 now having "so much" room, why is Ceph not rebalancing to move the 4 PGs? Or are they sitting on osd.3, and Ceph wants to get them onto osd.4 and osd.5 but can't, because those two are just under the threshold and rebalancing would push them over?

If so, would Ceph rebalance if I could get osd.4 and osd.5 farther away from the threshold, down to, say, 70%?

as long as your pending writes aren't going to push you past the threshold again.
Is there a way to check what volume the pending writes have?

There is also the nuclear option, which is to reduce the pool's min_size (the minimum number of OSDs required per write) to 1 - bear in mind that this will allow writes to be acknowledged by a single OSD, which means those writes have no data integrity assurance at all, but it will allow you to resume operation.
So that would let Ceph put those pending writes on osd.3 and then immediately allow me to use the pool again - without rebalancing - because osd.3 at 72% (+ the pending writes) currently has enough capacity left to resume operations, right?

I would just use this opportunity to move the VMs off this pool and then destroy it.

Thanks!
 
Okay, so I took the plunge.

Originally, I wanted to create a new pool and add new OSDs to it in order to start fresh, but I was wondering how I would prevent the old pool from spilling onto the new OSDs. So I considered (falsely) assigning the SSD device class to the new NVMe OSDs just to keep the existing pool from using them (I have a CRUSH rule for this pool that only uses NVMe-class OSDs).

But then it occurred to me that more than one pool can exist on a class of OSDs, so I could create another pool even if the old one spilled over. And if the old pool did spill over onto the new OSDs, that would probably bring down the utilization of the OSDs that were too full. So I just added the new OSDs as NVMe, and they were immediately used by the existing pool. As expected, the utilization of the old OSDs started coming down and, as hoped, the 4 PGs were finally backfilled and the backfill_toofull issue went away.

I also used the momentum to change the CRUSH rules for the cephfs_data and cephfs_metadata pools to use NVMe only (I figured it wouldn't hurt to bring everything up to NVMe speed instead of everything down to HDD speed).

And I also removed osd.6, which I had brought in only to bring the utilization of osd.3 down - concluding that this approach did work, but would also have been needed on the other two OSDs, osd.4 and osd.5, to get the pool working again.
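(For anyone in the same situation, removing an OSD again generally looks something like the following sketch - osd.6 as in my case; the PVE GUI can do the equivalent:)

ceph osd out osd.6
# wait for rebalancing to finish, then stop the daemon and purge the OSD:
systemctl stop ceph-osd@6
ceph osd purge 6 --yes-i-really-mean-it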

All warnings and errors have disappeared and the rebalancing has just completed.

So I am very happy now. I am mindful that this is still not a good configuration (two OSDs per node), and I am considering adding more OSDs over time.

Thanks for everyone's help!
 