Ceph: Balancing disk space unequally!?!?!?!

proxwolfe · Nov 27, 2023

sb-jw said:
If you only had one OSD of each type, then your thinking would be correct, because CEPH can no longer produce its replica and would then automatically go into degraded+undersized, but there would be no standstill because the other OSDs can not receive this data according to Crush Rule. With two OSDs per node, things have to be thought differently.

Yes, I had one disk (per type) per node for the longest time.

sb-jw said:
CEPH wants to keep the replica of 3 and distributes your bulk of data across three hosts. Each of your nodes must therefore keep a complete copy of the data. You are currently distributing this across 2 OSDs (HDD). If one fails, CEPH will have to pack this data from the failed OSD somewhere else and will therefore want to pack the entire fill level from, for example, OSD.1 to OSD.9. In this scenario, you can only fill the two HDDs up to a level of 42.5% each so that the other can hold all the data in the event of a failure. But you now have 167.94% of data per node, if an HDD can hold a maximum of 100%, where should the remaining 67.94% go? CEPH will run with the OSD.9 in full ratio and then pull the emergency brake and switch the pool to read-only to protect the integrity.

Huh, and there was I thinking that adding an HDD per node to the HDD pool would actually improve operational safety...

At the moment, I have approx. 14TB worth of data across the two HDDs per node. What you are telling me, if I understand you correctly, is that I need three 14TB drives per node, because two would not be enough (as each would get filled to 50%, whereas I should not exceed 42.5%).

In that case, I might be better off removing the 3 4TB drives and replacing the 3 14TB drives with 3 18TB drives. Then I would again only have 1 HDD per node, like I used to have before.

This being a hobby, I need to be mindful of the costs: Adding 4 (14TB) 3.5" drives to the pool would drive up my power bill considerably. Its already high as it is.

Or would it make sense to reduce the replication number?

sb-jw · Nov 27, 2023

proxwolfe said:
At the moment, I have approx. 14TB worth of data across the two HDDs per node. What you are telling me, if I understand you correctly, is that I need three 14TB drives per node, because two would not be enough (as each would get filled to 50%, whereas I should not exceed 42.5%).

At least I wouldn't recommend breaking the 85% threshold, you can of course still do it and set e.g. nearfull to 90% and full to e.g. 98%. Certainly okay for home use. I've never had a data loss like this with CEPH, not even when all mons were lost. But with read-only, new data can no longer be written; with metrics, for example, you would then have a data loss. You have to know for yourself whether you can live with it.

proxwolfe said:
In that case, I might be better off removing the 3 4TB drives and replacing the 3 14TB drives with 3 18TB drives. Then I would again only have 1 HDD per node, like I used to have before.

Then at least you won't have the problem of a pool running into the full ratio. If you then put a 4th on spare, you can at least quickly restore a healthy state.
But also keep in mind that even wear and tear of hard drives can be a problem and several can die at the same time. Recovery in particular can be very stressful. This applies to HDDs as well as flash memory. It is therefore advisable to choose one plate as a sacrifice and swap others every now and then - especially in scenarios like yours.

proxwolfe said:
Or would it make sense to reduce the replication number?

Absolute no-go, a disk dies while you're rebooting for maintenance. Total disaster, if you're lucky everything comes back up afterwards. I wouldn't recommend this at all, especially if there are few OSDs. The probability that another one will die is potentially high, so you should keep at least 3 copies. At 2/1 data loss is otherwise more obvious.

Unless of course the data is not important to you or you have backups, then you can also go 2/1. For a while I used a single CEPH node with Replica 2/1 as a backup target so that I didn't lose the scaling like with a fixed RAID. The server simply got all the old disks. Never had any problems, but you just have to be aware of the risks.

alexskysilk · Nov 27, 2023

sb-jw said:
CEPH is never able to distribute it 100% optimally. A discrepancy of +/- 10 - 15 PGs is definitely the rule and that also applies to you.

the effect of this goes down the more OSDs you have. OP has 6 OSDs per device class. I'd say it would be borderline unmanageable at anything over 50% full.

proxwolfe · Dec 1, 2023

sb-jw said:
At least I wouldn't recommend breaking the 85% threshold, you can of course still do it and set e.g. nearfull to 90% and full to e.g. 98%. Certainly okay for home use. I've never had a data loss like this with CEPH, not even when all mons were lost. But with read-only, new data can no longer be written; with metrics, for example, you would then have a data loss. You have to know for yourself whether you can live with it.

Then at least you won't have the problem of a pool running into the full ratio. If you then put a 4th on spare, you can at least quickly restore a healthy state.

One more question for my understanding please:

So when I have two disks on a node and one drive goes down, Ceph will try to push its content to the other drive (and if that one isn't large enough, I have a problem). If I only have one drive on a node that doesn't happen. Understood.

But what happens, if I combine the two drives into one OSD (I think that is possible)? I'm guessing that if then one drive goes down, the entire OSD o that node goes down and Ceph would not push data around on that node? And with my minimum of 2 replica rule the other two nodes would keep the lights on in the cluster? Could that be an option for my case? Or would I create other issues with that?

sb-jw · Dec 1, 2023

proxwolfe said:
But what happens, if I combine the two drives into one OSD (I think that is possible)?

No, thats not possible.

Search

Search

Ceph: Balancing disk space unequally!?!?!?!

proxwolfe

Renowned Member

sb-jw

Famous Member

alexskysilk

Distinguished Member

proxwolfe

Renowned Member

sb-jw

Famous Member

We value your privacy