Please help! Ceph pool stuck at "undersized+degraded+remapped+backfill_toofull+peered"

proxwolfe

Backstory:

I have a three-node Ceph cluster with 3 disks (one per node). Somehow (still to be investigated) two OSDs went down. The third one started backfilling. As the whole cluster became unresponsive, I started looking for the cause and found the pool in this state. Since the two OSDs couldn't be restarted, I destroyed and recreated them. The pool started rebalancing but stopped short of being whole again (at 99.xy%). 4 PGs are "undersized+degraded+remapped+backfill_toofull+peered". I assumed that was because two of the three disks had reached their limits (at slightly under 90%), while the third was over its limit (at 93%), which it had reached while backfilling when the other two were down and out.


So I thought: there is simply too much data in the pool overall. Deleting some experimental VMs did not change the result (surprisingly). Neither did scrubbing.

Then I had what I thought was a stroke of genius: add another disk to the node with the overly full OSD, reweight that OSD down, and watch the data rebalance between the two, bringing the utilization of the old one back below its critical threshold.
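(For reference: this kind of reweighting can be done on the CLI with commands along these lines - osd.3 and the values here are only illustrative, not the exact figures used.)

ceph osd crush reweight osd.3 0.30   # permanently lower the CRUSH weight (roughly TiB of capacity)
ceph osd reweight 3 0.80             # or: temporary override between 0.0 and 1.0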

Well, the rebalancing happened: while the other two almost-full OSDs didn't change (as planned), the overly full OSD went down to 72% (and the new OSD went up to almost 90% - it is smaller than the other three).

BUT: for some reason I fail to see, that did not change anything about the 4 PGs being "undersized+degraded+remapped+backfill_toofull+peered", which seems to be the reason the pool is still completely unusable. It did clear some warnings, but I would have expected Ceph to take care of the worst problems first.

Since then I have tinkered with the weights of the old, previously overly full OSD and the new one. It keeps rebalancing but keeps stopping (by smaller and smaller margins, but still) short of 100%, and the pool remains unresponsive.

Can someone who understands how Ceph works (I don't) please tell me how to sort out this mess?

Thanks!
 
Well, I don't know about that.

But this is from the config:

osd_pool_default_min_size = 2
osd_pool_default_size = 3
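(Those are only the cluster-wide defaults; the effective values for a given pool can be checked with, e.g., the following, where <poolname> is a placeholder:)

ceph osd pool get <poolname> size
ceph osd pool get <poolname> min_size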
 
ceph health:
HEALTH_WARN 2 nearfull osd(s); Reduced data availability: 4 pgs inactive; Low space hindering backfill (add storage if this doesn't resolve itself): 4 pgs backfill_toofull; Degraded data redundancy: 7240/1384452 objects degraded (0.523%), 4 pgs degraded, 4 pgs undersized; 4 pool(s) nearfull

ceph osd tree:
ID  CLASS  WEIGHT   TYPE NAME             STATUS  REWEIGHT  PRI-AFF
-1         9.77612  root default
-3         3.43817      host tx1330m2-1
 0  hdd    2.72899          osd.0         up      1.00000   1.00000
 3  nvme   0.59999          osd.3         up      1.00000   1.00000
 6  nvme   0.10919          osd.6         up      1.00000   1.00000
-5         3.16898      host tx1330m3-1
 1  hdd    2.72899          osd.1         up      1.00000   1.00000
 4  nvme   0.43999          osd.4         up      1.00000   1.00000
-7         3.16898      host vtx1330m2-2
 2  hdd    2.72899          osd.2         up      1.00000   1.00000
 5  nvme   0.43999          osd.5         up      1.00000   1.00000

osd.0, osd.1 and osd.2 are not part of the problem and belong to another pool.

Originally, the weight of osd.3, osd.4 and osd.5 was 0.45-something (they are all the same size: 500 GB). osd.3 is the one that was beyond 90% before I added osd.6 (120 GB) and reduced the weight of osd.3.

When that didn't help, I started increasing the weight of osd.3 again. When that didn't help, I tried reducing the weights of osd.4 and osd.5. But that, too, didn't help.
 
Looks like your pool(s) are too full. Add more drives.

Also, the lopsided OSD distribution means that not all of your physical capacity can actually be used. You have ~700G of NVMe OSD space on node m2-1, but only ~440G on each of the other two nodes; the total theoretical space you can actually USE is ~80% of 440G, or ~350GB. And because you have that 120GB NVMe in there, it distorts the distribution and potentially leads to an even smaller usable capacity.

This isn't really workable for production.
 
Yeah, it wasn't planned for production like this. I used to have only 3 x 500 GB.

I added the additional 120 GB disk to try to get the OSD that had gone above 90% back down below 90%. So that wasn't the great idea I took it for...

So what do I do now? Add a 120 GB disk to each of the other nodes? Or swap the 500 GB disks in the other nodes for 1 TB ones?
 
Please show the output of
ceph osd pool ls detail
pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 1377 flags hashpspool,nearfull stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
pool 2 'pool_nvme' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 1377 flags hashpspool,nearfull,selfmanaged_snaps stripe_width 0 application rbd
pool 3 'pool_hdd' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 643 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 4 'cephfs_data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 1377 flags hashpspool,nearfull stripe_width 0 application cephfs
pool 5 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 1377 flags hashpspool,nearfull stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
 
So what do I do now? Add a 120 GB disk to each of the other nodes? Or swap the 500 GB disks in the other nodes for 1 TB ones?
There's a short-term answer and a long-term answer.

If you just want to regain use of your pool TEMPORARILY, you can raise the cutoff thresholds to a higher percentage (ceph osd set-full-ratio, ceph osd set-backfillfull-ratio), for example as shown below.
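(A sketch with illustrative values only - the defaults are nearfull 0.85, backfillfull 0.90 and full 0.95, and you should set them back once the pool has been cleaned up:)

ceph osd set-backfillfull-ratio 0.92
ceph osd set-full-ratio 0.97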

Long term: redeploy with a sane configuration. Have a minimum of 4 OSDs per node, all the same size, with a total capacity of at MINIMUM 375% of your data (3 replicas divided by the ~80% fill limit is roughly 3.75x), e.g.:

Say you have 80G of data; 80G is 80% of 100G per replica, so 4 x 25G per node x 3 nodes = 300GB raw (375% of 80G).
 
Thanks. For the moment, I would be happy to just get the pool back to a working state. In the long run, I will redeploy.

But I still don't understand (and would be happy to learn): while the current configuration is less than ideal (or may even be terrible), what it apparently did achieve was to bring the overfull OSD down below 90%. That would seem to mean it is now below the (standard) backfill threshold, so it shouldn't be backfill_toofull anymore. And if that is the case, those 4 PGs should not need to stay disabled. Or maybe I just completely misunderstand how this all fits together...

Or are the 4 PGs still disabled because Ceph can't re-allocate them without going over the backfill-threshold again?
 

You have a 3:2 pool configuration (size 3, min_size 2).

That means each placement group needs 3 OSDs (on three separate nodes) in order to be whole. (It will allow you to complete an operation with 2 OSDs, but the PG will not be "whole" until a third is available.)

Once the ONLY OSDs for this pool on nodes m3-1 and m2-2 tripped their full threshold, they both became unavailable to process NEW commits (e.g. writes, deletions, etc.). With only one node remaining available, it didn't matter that you have 2 OSDs on it; the rules say you need 3 separate nodes to process a write.

Your virtual machines are oblivious to this predicament and proceeded to operate normally until a sync-to-disk operation did not return an acknowledgement. The write IS pending and will not be released: the transaction(s) completed to the one OSD and will continue to wait until there are OSDs available on the other nodes OR you knock the whole system over. Until then, those 4 PGs will remain in a locked state so as not to lose any committed data. (Before you conclude that knocking the system over is an option: chances are that when the system comes back up and the VMs boot, you're just going to create more locked PGs.)
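(To see exactly which PGs are affected and what they are waiting for, the usual queries are, e.g.:)

ceph health detail
ceph pg dump_stuck undersized degraded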

while the current configuration is less than ideal (or may even be terrible), what it apparently did achieve was to bring the overfull OSD down below 90%,
I don't really know what this is in reference to.
 
Your cephfs_data and cephfs_metadata pools use the default CRUSH rule, which means they place objects without looking at the device class. This is why you saw some recovery going on.
I recommend changing that: you do not want to mix HDD and NVMe OSDs in one pool.
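(A sketch of what that change typically looks like - the rule name replicated_nvme is just an example, the pool names are taken from the ls detail output above:)

ceph osd crush rule create-replicated replicated_nvme default host nvme
ceph osd pool set cephfs_data crush_rule replicated_nvme
ceph osd pool set cephfs_metadata crush_rule replicated_nvme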
 
I don't really know what this is in reference to.
My understanding was that the standard backfill threshold is 90%. So I thought that once all OSDs are under 90%, backfilling should be possible again. And if that were possible, there should be no backfill_toofull anymore and my problem would go away. I guess I was wrong. As I said, I don't know how Ceph works... still trying to find my feet.

So for the moment, I understand from one of your previous posts that I could change the thresholds:
If you just want to regain use of your pool TEMPORARILY, you can raise the cutoff thresholds to a higher percentage (ceph osd set-full-ratio, ceph osd set-backfillfull-ratio)
That will allow me to delete some unnecessary stuff and migrate my VMs off the pool, right? So what would I set the thresholds to? Both to 95% (a bit above what is used now)?

Thanks!
 
My understanding was that the standard backfill threshold is 90%. So I thought that once all OSDs are under 90%,
I haven't seen any evidence of this. Where are you looking for your OSD utilization? The simplest way to get it is
ceph osd df tree

Also, if the pending writes would push an OSD over 90% once written, they may trigger the threshold even though the OSDs aren't over the limit yet.

That will allow me to delete some unnecessary stuff and migrate my VMs off the pool, right?
Yes, as long as your pending writes aren't going to push you past the threshold again. Also, as the disks get FULL, bad things can happen, which is why the limit exists. There is also the nuclear option, which is to reduce the pool's min_size (the minimum number of OSDs required per write) to 1 - bear in mind that this will allow writes to be acknowledged by a single OSD, which means those writes have no data integrity assurance at all, but it will allow you to resume operation.
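(A sketch, using the pool name from the ls detail output - last resort only, and revert afterwards:)

ceph osd pool set pool_nvme min_size 1
# ...and once the data has been moved off / the pool has recovered:
ceph osd pool set pool_nvme min_size 2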
 
Your cephfs_data and cephfs_metadata pools use the default CRUSH rule, which means they place objects without looking at the device class. This is why you saw some recovery going on.
I recommend changing that: you do not want to mix HDD and NVMe OSDs in one pool.
I must have missed that when I changed the rule for the normal pool.

Thanks for pointing that out to me. I shall change that as soon as I am back in the game.
 
I haven't seen any evidence of this. Where are you looking for your OSD utilization? The simplest way to get it is
ceph osd df tree
I wrote that in my OP - but I didn't provide any "evidence" from the system. I checked with ceph osd df tree and it shows the same values as PVE does in the GUI under Ceph > OSD. That is where I took my numbers from.

The values are currently:

osd.3: 71.69%
osd.4: 89.53%
osd.5: 89.51%
osd.6: 87.38%

So this brings me back to my above question:
But I still don't understand (and would be happy to learn): while the current configuration is less than ideal (or may even be terrible), what it apparently did achieve was to bring the overfull OSD down below 90%. That would seem to mean it is now below the (standard) backfill threshold, so it shouldn't be backfill_toofull anymore. And if that is the case, those 4 PGs should not need to stay disabled. Or maybe I just completely misunderstand how this all fits together...

Or are the 4 PGs still disabled because Ceph can't re-allocate them without going over the backfill-threshold again?
And I guess you answered it:
Also, if the pending writes would push an OSD over 90% once written, they may trigger the threshold even though the OSDs aren't over the limit yet.
But with osd.3 now having "so much" room, why is Ceph not rebalancing to move the 4 PGs? Or are they sitting on osd.3, and Ceph wants to get them onto osd.4 and osd.5 but can't, because those two are just under the threshold and rebalancing would push them over?

If so, would Ceph rebalance if I could get osd.4 and osd.5 farther away from the threshold, down to, say, 70%?

as long as your pending writes aren't going to push you past the threshold again.
Is there a way to check what volume the pending writes have?

There is also the nuclear option, which is to reduce the pool's min_size (the minimum number of OSDs required per write) to 1 - bear in mind that this will allow writes to be acknowledged by a single OSD, which means those writes have no data integrity assurance at all, but it will allow you to resume operation.
So that would let Ceph put those pending writes on osd.3 and then immediately allow me to use the pool again - without rebalancing - because osd.3 at 72% (+ the pending writes) currently has enough capacity left to resume operations, right?

I would just use this opportunity to move the VMs off this pool and then destroy it.

Thanks!
 
Okay, so I took the plunge.

Originally, I wanted to create a new pool and add new OSDs to it in order to start fresh, but I was wondering how I would prevent the old pool from spilling onto the new OSDs. So I considered (falsely) assigning the SSD device class to the new NVMe OSDs just to keep the existing pool from using them (I have a CRUSH rule for this pool that only uses NVMe-class OSDs).

But then it occurred to me that more than one pool can exist on a class of OSDs, so I could create another pool even if the old one spilled over. And if the old pool did spill over onto the new OSDs, that would probably bring down the utilization of the OSDs that were too full. So I just added the new OSDs as NVMe, and they were immediately used by the existing pool. As expected, the utilization of the old OSDs started coming down and, as hoped, the 4 PGs were finally backfilled and the backfill_toofull issue went away.

I also used the momentum to change the CRUSH rules for the cephfs_data and cephfs_metadata pools to use NVMe only (I figured it wouldn't hurt to bring everything up to NVMe speed instead of everything down to HDD speed).

And I also removed osd.6, which I had brought in only to bring the utilization of osd.3 down - concluding that this approach did work, but would also have been needed on the other two OSDs, osd.4 and osd.5, to get the pool working again.
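(For anyone in the same situation, removing an OSD again generally looks something like the following sketch - osd.6 as in my case; the PVE GUI can do the equivalent:)

ceph osd out osd.6
# wait for rebalancing to finish, then stop the daemon and purge the OSD:
systemctl stop ceph-osd@6
ceph osd purge 6 --yes-i-really-mean-it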

All warnings and errors have disappeared and the rebalancing has just completed.

So I am very happy now. I am mindful that this is still not a good configuration (two OSDs per node), and I am considering adding more OSDs over time.

Thanks for everyone's help!
 