Ceph: Balancing disk space unequally!?!?!?!

proxwolfe

Hi,

I have a three node PVE cluster in which each node is also a Ceph node.

Each Ceph node used to have one identical HDD and the pool was getting full. Therefore, and because one is supposed to have more OSDs anyway, I added one additional HDD OSD per node (identical across the three nodes, but smaller than the original drives). Ceph rebalanced between them but the outcome is strange:

Node 1: OSD1 73.5%, OSD2 75%
Node 2: OSD3 71%, OSD4 86%
Node 3: OSD5 71%, OSD6 86%

OSDs 1, 3 and 5 are 14TB each and weighted 12.73340, while OSDs 2, 4 and 6 are 4TB each and weighted 3.63869.

I did not change "reweight" and it is 1 for all OSDs.

The balance on node 1 seems more or less alright and expected. But what might cause the imbalances on nodes 2 and 3?

Thanks!
 
Not sure - I have what comes as standard in PVE. If you are referring to a separate piece of software, then I don't have that installed.

In any case, I can see the Crush Map in the PVE GUI. It shows the same weights I reported above.
 
On the (more or less) balanced node there are 226 and 63 PGs on the two OSDs, while on the unbalanced nodes there are 218 vs 71 and 225 vs 64, respectively.

There doesn't seem to be any rhyme or reason behind it.
 
The difference in used % is not unreasonable, but it will probably go down a bit with a higher number of PGs. Could you please share the output of `pveceph pool ls` with us?
 
Bash:
Name               │ Size │ Min Size │ PG Num │ min. PG Num │ Optimal PG Num │ PG Autoscale Mode │ PG Autoscale Target Size │ PG Autoscale Target Ratio │ C
╞════════════════════╪══════╪══════════╪════════╪═════════════╪════════════════╪═══════════════════╪══════════════════════════╪═══════════════════════════╪══
│ .mgr               │    3 │        2 │      1 │             │                │ on                │                          │                           │ r
├────────────────────┼──────┼──────────┼────────┼─────────────┼────────────────┼───────────────────┼──────────────────────────┼───────────────────────────┼──
│ cephfs_data        │    3 │        2 │    128 │             │                │ on                │                          │                           │ c
├────────────────────┼──────┼──────────┼────────┼─────────────┼────────────────┼───────────────────┼──────────────────────────┼───────────────────────────┼──
│ cephfs_metadata    │    3 │        2 │     32 │             │                │ on                │                          │                           │ c
├────────────────────┼──────┼──────────┼────────┼─────────────┼────────────────┼───────────────────┼──────────────────────────┼───────────────────────────┼──
│ cephfshdd_data     │    3 │        2 │    128 │             │                │ on                │                          │                           │ c
├────────────────────┼──────┼──────────┼────────┼─────────────┼────────────────┼───────────────────┼──────────────────────────┼───────────────────────────┼──
│ cephfshdd_metadata │    3 │        2 │     32 │             │                │ on                │                          │                           │ c
├────────────────────┼──────┼──────────┼────────┼─────────────┼────────────────┼───────────────────┼──────────────────────────┼───────────────────────────┼──
│ pool_hdd           │    3 │        2 │    128 │             │                │ on                │                          │                           │ c
├────────────────────┼──────┼──────────┼────────┼─────────────┼────────────────┼───────────────────┼──────────────────────────┼───────────────────────────┼──
│ pool_nvme          │    3 │        2 │    128 │             │                │ on                │                          │                           │ c
└────────────────────┴──────┴──────────┴────────┴─────────────┴────────────────┴───────────────────┴──────────────────────────┴───────────────────────────┴──

There is another Ceph pool (pool_nvme) that I am having no issues with (and so I didn't mention it before).
 
At 6 OSDs you should have the PG number set to 256 according to [1], but before setting it manually I would advise letting the autoscaler determine the optimal number of PGs and *then* setting the number of PGs to that optimal value. For the autoscaler to work you have to set the "Target Ratio" of the pool to *a* value (any value).

Once you set the target ratio, you can query the state of the autoscaler with

Code:
ceph osd pool get POOL_NAME pg_autoscale_mode

and use `ceph osd pool set ...` to set it to `on`, in case it is not enabled. You can use `pveceph pool ls` to see the optimal number of PGs. You should be able to do all of these operations from the web UI too.
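
As a concrete sketch of those steps on the CLI (using pool_hdd from your list as the example; the target ratio value itself is arbitrary):

Code:
# Enable the PG autoscaler for the pool, if it is not already on
ceph osd pool set pool_hdd pg_autoscale_mode on

# Give the pool a target ratio so the autoscaler has something to work with
# (the absolute number does not matter, only the ratio between pools does)
ceph osd pool set pool_hdd target_size_ratio 1

# Check what the autoscaler now considers optimal
ceph osd pool autoscale-status
pveceph pool ls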

Do note that if your OSDs are not of the same size, some discrepancy in used % is bound to happen, but it is hard to say how big it will be.

[1] https://old.ceph.com/pgcalc/
 
For the autoscaler to work you have to set the "Target Ratio" of the pool to *a* value (any value).
Done

Once you set the target ratio, you can query the state of the autoscaler with

Code:
ceph osd pool get POOL_NAME pg_autoscale_mode
Confirmed, it's on.

You can use `pveceph pool ls` to see the optimal number of PGs.
Column "Optimal PG Num" remains empty

Do note that if your OSDs are not of the same size, some discrepancy in used % is bound to happen, but it is hard to say how big it will be.
The 14TB OSDs are all the exact same make and model, and the 4TB OSDs are too.

The allocation remains unchanged (i.e. uneven on two of the three nodes).

What else could I try?

Thanks!
 
I noticed from the output that your .mgr pool uses a rule starting with `r`, probably the default `replicated_rule`, while your other pools use rules starting with `c` (the output of `pveceph pool ls` is cropped). For the autoscaler to work you will have to assign the .mgr pool to one of those rules starting with `c`. You can set the rule through the web UI.
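
If you prefer the CLI over the web UI, a minimal sketch would look like this (the rule name ceph_hdd is only a guess based on the cropped output; check `ceph osd crush rule ls` for the real names):

Code:
# List the existing CRUSH rules
ceph osd crush rule ls

# Assign the .mgr pool to one of the device-class-specific rules
ceph osd pool set .mgr crush_rule ceph_hdd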
 
Yes, the "r" was from the "replicated_rule" whereas the "c" was from "ceph_hdd" and "ceph_ssd" - my own replicated rules for two of the pools.

I did change the default replicated rule for .mgr to "ceph_ssd" as suggested. There was a very brief spike of activity in Ceph, but overall nothing has changed. I will give it some time and see if something happens overnight.
 
I thought it was balancing automatically.

This is the output:

Code:
{
    "active": true,
    "last_optimize_duration": "0:00:00.001408",
    "last_optimize_started": "Mon Nov 27 20:02:25 2023",
    "mode": "upmap",
    "no_optimization_needed": true,
    "optimize_result": "Unable to find further optimization, or pool(s) pg_num is decreasing, or distribution is already perfect",
    "plans": []
}

So apparently, the balancer thinks the uneven distribution is not just not a problem but actually optimal o_O
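
For what it's worth, that JSON looks like the output of `ceph balancer status`. A hedged way to dig a little deeper is to ask the balancer to score the current distribution (the pool name is just taken from the earlier `pveceph pool ls` output):

Code:
# Score the current PG distribution; lower is better, 0 would be perfect
ceph balancer eval
ceph balancer eval pool_hdd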
 
Code:
ID  CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
-1         79.43169         -   65 TiB   44 TiB   44 TiB  446 KiB   79 GiB   21 TiB  67.88  1.00    -          root default
-3         21.61177         -   22 TiB   15 TiB   15 TiB  153 KiB   26 GiB  6.9 TiB  67.88  1.00    -              host 1
 0    hdd  12.73340   1.00000   13 TiB   10 TiB   10 TiB   30 KiB   16 GiB  2.6 TiB  79.77  1.18  225      up          osd.0       
 3    hdd   3.63869   1.00000  3.6 TiB  2.8 TiB  2.8 TiB   15 KiB  5.2 GiB  813 GiB  78.18  1.15   60      up          osd.3       
 6   nvme   1.74660   1.00000  1.7 TiB  169 GiB  168 GiB   51 KiB  1.4 GiB  1.6 TiB   9.48  0.14   32      up          osd.6       
11    ssd   3.49309   1.00000  3.5 TiB  1.5 TiB  1.5 TiB   57 KiB  3.3 GiB  2.0 TiB  43.01  0.63   97      up          osd.11     
-9         21.61177         -   22 TiB   15 TiB   15 TiB  157 KiB   27 GiB  6.9 TiB  67.88  1.00    -              host 2
 7    hdd   3.63869   1.00000  3.6 TiB  3.4 TiB  3.3 TiB   16 KiB  5.8 GiB  291 GiB  92.18  1.36   63      up          osd.7       
 8    hdd  12.73340   1.00000   13 TiB  9.6 TiB  9.6 TiB   42 KiB   16 GiB  3.1 TiB  75.77  1.12  222      up          osd.8       
10   nvme   1.74660   1.00000  1.7 TiB  170 GiB  168 GiB   42 KiB  1.9 GiB  1.6 TiB   9.50  0.14   32      up          osd.10     
13    ssd   3.49309   1.00000  3.5 TiB  1.5 TiB  1.5 TiB   57 KiB  3.3 GiB  2.0 TiB  43.01  0.63   97      up          osd.13     
-5         21.61177         -   22 TiB   15 TiB   15 TiB  136 KiB   27 GiB  6.9 TiB  67.88  1.00    -              host 3
 1    hdd  12.73340   1.00000   13 TiB  9.6 TiB  9.6 TiB   22 KiB   15 GiB  3.1 TiB  75.77  1.12  215      up          osd.1       
 9    hdd   3.63869   1.00000  3.6 TiB  3.4 TiB  3.3 TiB    7 KiB  6.0 GiB  292 GiB  92.17  1.36   70      up          osd.9       
 5   nvme   1.74660   1.00000  1.7 TiB  170 GiB  168 GiB   50 KiB  2.0 GiB  1.6 TiB   9.51  0.14   32      up          osd.5       
14    ssd   3.49309   1.00000  3.5 TiB  1.5 TiB  1.5 TiB   57 KiB  3.3 GiB  2.0 TiB  43.01  0.63   97      up          osd.14           
                        TOTAL   65 TiB   44 TiB   44 TiB  452 KiB   79 GiB   21 TiB  67.88                                         
MIN/MAX VAR: 0.14/1.36  STDDEV: 33.71

The issue is with the HDDs on Host 2 (osd.7/osd.8) and Host 3 (osd.1/osd.9).
 
The HDDs, the NVMe and the SSDs each form their own pools with their own crush rule?

If that's the case and the 14 TB and 4 TB HDDs run together in one pool, then it doesn't surprise me. You can't throw two different sizes together and expect an optimal distribution. The large HDD naturally gets many more PGs than the small one: each PG holds roughly the same amount of data, and Ceph distributes the PGs based on the number and size of the disks. But the gap between 4 and 14 TB is simply too big.
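
Just to put rough numbers on that (back-of-the-envelope arithmetic using the CRUSH weights from this thread, not a Ceph command):

Code:
# Expected PG share per HDD on one host, based purely on the CRUSH weights
awk 'BEGIN { big=12.73340; small=3.63869; t=big+small; printf "14 TB OSD: %.1f%%, 4 TB OSD: %.1f%%\n", 100*big/t, 100*small/t }'
# -> roughly 77.8% vs 22.2%, close to the observed 225 vs 60 PGs on host 1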

If you are running Replica 3, then I have to tell you that you are already driving your cluster well over the limit. If an HDD fails, your HDD pool is immediately in read-only state. The failure of an SSD or NVMe also directly results in a degraded and undersized state.

OSDs 7 and 9 are both at or above the nearfull ratio of 85% and are already closing in on the full ratio. Your cluster cannot be running in a healthy state; either you have raised those thresholds drastically or you have simply been ignoring the warnings.

This is a ticking time bomb: if you care about the data, you should urgently resolve the situation by reducing data or adding disks.
 
The HDDs, the NVMe and the SSDs each form their own pools with their own crush rule?
Yes.

The large HDD naturally gets many more PGs than the small one: each PG holds roughly the same amount of data, and Ceph distributes the PGs based on the number and size of the disks. But the gap between 4 and 14 TB is simply too big.
Maybe there was a misunderstanding. I am not wondering why the 14TB and the 4TB HDDs get a different number of PGs. That is expected, as you explain.

My issue is that on Host 2 and Host 3 the 14TB HDD only gets filled to 76% while the 4TB HDD gets filled to 92% of capacity, whereas on Host 1 the balancing results in the 14TB and the 4TB HDDs each being filled to roughly 78-79% of capacity, which I consider the optimal distribution. Why does it work on Host 1 but not on Hosts 2 and 3?

OSDs 7 and 9 are both at or above the nearfull ratio of 85% and are already closing in on the full ratio.
Exactly, that's my point. If the distribution worked on Host 2 and Host 3 like it does on Host 1, OSD7 and OSD9 would only be filled to 79%. I can put in another disk but I want to understand what is going on first.

If you are running Replica 3, then I have to tell you that you are already driving your cluster well over the limit.
My rule is 2/3, so 3 replicas with a minimum of 2. That means, if I understand it correctly, that the loss of 1 disk per pool would not be catastrophic.
 
My issue is that on Host 2 and Host 3 the 14TB HDD only gets filled to 76% while the 4TB HDD gets filled to 92% of capacity, whereas on Host 1 the balancing results in the 14TB and the 4TB HDDs each being filled to roughly 78-79% of capacity, which I consider the optimal distribution. Why does it work on Host 1 but not on Hosts 2 and 3?
Ceph is never able to distribute data 100% optimally. A discrepancy of +/- 10-15 PGs per OSD is the rule rather than the exception, and that is what you are seeing here.

Ceph distributes PGs, not files. In your pool a PG holds around 45 GB of data, so everything is absolutely within limits and the distribution by PG count is fine. If you want it tighter, you would have to adjust it manually: configure the balancer differently or change the reweight of osd.7. If you reduce it, osd.8 automatically receives more data, but do this slowly so that you don't fill up the other disk.
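
A minimal sketch of that manual route, if you decide to go that way (the 0.95 is only an example value; lower it in small steps and watch the fill level of osd.8 in between):

Code:
# Check the current fill levels and reweight values first
ceph osd df tree

# Slightly lower the temporary reweight of osd.7 so it receives fewer PGs;
# most of the freed PGs will move to osd.8 on the same host
ceph osd reweight 7 0.95

# Alternatively, do a dry run of utilization-based reweighting
ceph osd test-reweight-by-utilization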

That means, if I understand it correctly, that the loss of 1 disk per pool would not be catastrophic.
Yes, it will be. There is a mistake in your thinking here; you have to think more abstractly.

Ceph wants to keep the replica count of 3 and distributes your bulk data across three hosts, so each of your nodes must keep a complete copy of the data. On each node this copy is currently spread across 2 HDD OSDs. If one of them fails, Ceph has to put the data from the failed OSD somewhere else and will therefore try to move the entire fill level of, for example, osd.1 onto osd.9. In this scenario you can only fill the two HDDs to about 42.5% each so that the surviving one can hold all the data in the event of a failure. But you currently have 167.94% of data (in terms of a single disk's capacity) per node; if an HDD can hold a maximum of 100%, where is the remaining 67.94% supposed to go? Ceph will run osd.9 into the full ratio, then pull the emergency brake and switch the pool to read-only to protect data integrity.
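
Using the `ceph osd df tree` output above, the arithmetic for host 3 looks roughly like this (illustration only):

Code:
# osd.1: 12.73 TiB at 75.77% full, osd.9: 3.64 TiB at 92.17% full.
# If osd.1 dies, its data has to land on osd.9 - which is impossible:
awk 'BEGIN { printf "to move from osd.1: %.2f TiB, free on osd.9: %.2f TiB\n", 12.73340*0.7577, 3.63869*(1-0.9217) }'
# -> about 9.6 TiB to move versus about 0.28 TiB free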

If you only had one OSD of each type per node, your thinking would be correct: Ceph could no longer produce its third replica and would automatically go into degraded+undersized, but there would be no standstill, because according to the CRUSH rule the other OSDs cannot receive this data anyway. With two OSDs per node, you have to think about it differently.

With replica 3 and three Ceph nodes, you basically have to look at one server in isolation and ignore the others. If you think of it like that, you will see that the data from one HDD can never fit onto the other one.

So you have currently pushed your HDD pool so hard against the limit that expanding it is the only way to acutely eliminate the risk of a standstill. Theoretically, you could also set noout and swap the current 4 TB drive for a larger one, but I would recommend putting in another HDD so that you get the space problem under control.
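
If you do go the swap route, the noout part would look roughly like this (a sketch only; double-check the procedure in the official docs before relying on it):

Code:
# Keep Ceph from marking OSDs out (and rebalancing) while the disk is swapped
ceph osd set noout

# ... stop the OSD, replace the 4 TB drive with a larger one, recreate the OSD ...

# Re-enable normal behaviour afterwards
ceph osd unset noout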

For precisely these reasons, I would also advise you to always use disks of the same size, and to keep in mind that in your scenario the remaining OSD in the CRUSH rule MUST always be able to absorb the loss of the other one.

If you have three 4 TB disks in there and each is 50% full, then if one fails the other two each take on about 25% more data and end up at 75%. But if you have two 4 TB disks and one 14 TB disk and all three are 50% full, then the 4 TB disks have at most 2 TB free each, while the failed 14 TB disk would have to move 3.5 TB onto each of them.

But always remember that a 4 TB disk cannot actually store 4 TB of data: the nearfull ratio is 85% and the full ratio is 95% (= read-only for the affected PGs). That's why you can't park 4 TB of data on it, and you should always keep the 85% limit in mind.
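
In case it is useful, the configured thresholds can be checked like this (the values in the comments are the usual defaults, not necessarily what is set on this cluster):

Code:
# Show the configured ratios (defaults: nearfull 0.85, backfillfull 0.90, full 0.95)
ceph osd dump | grep ratio

# Nearfull/full OSD warnings also show up here
ceph health detail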
 