[SOLVED] Usable space on a Ceph cluster with LZ4 compression

lucaferr

Hi! We have a 5-node Proxmox+Ceph cluster (we use the same nodes for compute and distributed storage). We have LZ4 compression enabled, which works pretty well (we're saving more than 16%). My 'ceph df detail' output looks like this:
[screenshot: 'ceph df detail' output]

As you can see, I have 15 TB of physical storage (each of the 5 nodes has 3 x 1TB NVMe SSDs) and I'm using 62.38% of the physical space with 3x replication. But the pool reports 80.27% usage (the difference is due to LZ4 compression, I guess). So it says I can only write about 872GB of additional data (MAX AVAIL column), but it should be much more thanks to compression.
Will writes be denied once my pool reaches 100% usage, even though I still have physical space (global usage would be around 80%)? Or is it just an estimate, and I'll be able to get to 120% pool usage?
I need this information to plan my OSD upgrades...
Thank you very much!
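For anyone comparing, the compression settings and savings can be inspected from the CLI roughly like this (the pool name 'vm-pool' is just a placeholder, and the compression columns in 'ceph df detail' only appear on recent Ceph releases):
Code:
# per-pool usage; recent releases also show USED COMPR / UNDER COMPR columns
ceph df detail

# compression settings of a single pool (may return an error if never set at pool level)
ceph osd pool get vm-pool compression_algorithm
ceph osd pool get vm-pool compression_mode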
 
Don't know if this is applicable, but I was running into Ceph 'pool too full' warnings for a long time even though I had plenty of space...

So what I ended up doing was increasing the number of PGs for that pool, and I went from 80% usage to about 35% in one night. I had turned on compression right from the start (before the warnings), so in my opinion it shouldn't have been that, but YMMV...

RAM usage is considerably higher since I did that, BUT I have more available space now AND Ceph feels much faster.
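In case someone wants to try the same, this is roughly what raising the PG count of a pool looks like (pool name and target count are placeholders, and on older releases pgp_num has to be raised along with pg_num):
Code:
# check the current values first (pool name 'vm-pool' is a placeholder)
ceph osd pool get vm-pool pg_num
ceph osd pool get vm-pool pgp_num

# raise the PG count (512 is only an example target)
ceph osd pool set vm-pool pg_num 512
ceph osd pool set vm-pool pgp_num 512
Keep in mind that every PG costs RAM on the OSDs, which matches what I saw after the change.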
 
Do you have more than one pool? What does your 'ceph osd df tree' look like?
 
Here's the output of 'ceph osd df tree':
The 'tree' part at the end of the command adds the CRUSH hierarchy to the output. ;)

The PG distribution has a delta of 86 to 126 PGs across the OSDs. This may reduce the available disk space and performance. You could try the 'ceph osd reweight-by-*' commands to get a better distribution, but this will redistribute data.

As an example, my test cluster has a more even distribution, hence its VAR / STDDEV values are way lower.
Code:
root@p5c02:~# ceph osd df tree
ID  CLASS WEIGHT  REWEIGHT SIZE    USE     AVAIL   %USE VAR  PGS TYPE NAME     
 -1       0.31189        -  319GiB 13.6GiB  305GiB 4.28 1.00   - root default   
 -3       0.06238        - 63.8GiB 2.68GiB 61.1GiB 4.19 0.98   -     host p5c01 
  1   hdd 0.03119  1.00000 31.9GiB 1.27GiB 30.6GiB 3.98 0.93  82         osd.1 
  2   hdd 0.03119  1.00000 31.9GiB 1.41GiB 30.5GiB 4.41 1.03  83         osd.2 
 -5       0.06238        - 63.8GiB 2.68GiB 61.1GiB 4.21 0.98   -     host p5c02 
  0   hdd 0.03119  1.00000 31.9GiB 1.31GiB 30.6GiB 4.10 0.96  94         osd.0 
  3   hdd 0.03119  1.00000 31.9GiB 1.37GiB 30.5GiB 4.31 1.01  79         osd.3 
 -7       0.06238        - 63.8GiB 2.73GiB 61.1GiB 4.28 1.00   -     host p5c03 
  4   hdd 0.03119  1.00000 31.9GiB 1.35GiB 30.5GiB 4.25 0.99  89         osd.4 
  5   hdd 0.03119  1.00000 31.9GiB 1.38GiB 30.5GiB 4.31 1.01  92         osd.5 
 -9       0.06238        - 63.8GiB 2.76GiB 61.0GiB 4.33 1.01   -     host p5c04 
  6   hdd 0.03119  1.00000 31.9GiB 1.43GiB 30.5GiB 4.49 1.05  99         osd.6 
  7   hdd 0.03119  1.00000 31.9GiB 1.33GiB 30.6GiB 4.17 0.98  78         osd.7 
-11       0.06238        - 63.8GiB 2.79GiB 61.0GiB 4.37 1.02   -     host p5c05 
  8   hdd 0.03119  1.00000 31.9GiB 1.32GiB 30.6GiB 4.15 0.97  85         osd.8 
  9   hdd 0.03119  1.00000 31.9GiB 1.46GiB 30.4GiB 4.59 1.07  83         osd.9 
                     TOTAL  319GiB 13.6GiB  305GiB 4.28                         
MIN/MAX VAR: 0.93/1.07  STDDEV: 0.18
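The reweight-by-utilization commands also take an optional overload threshold (a percentage of the average OSD utilization, 120 by default); the 'test-' variant is a dry run that only prints what would change. Roughly:
Code:
# dry run: show which OSDs would get a lower reweight at a 110% threshold
ceph osd test-reweight-by-utilization 110

# apply it (this moves data)
ceph osd reweight-by-utilization 110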
 
Thank you very much Alwin, you were right: my data distribution was not balanced. I ran
Code:
ceph osd reweight-by-utilization
preceded by
Code:
ceph osd test-reweight-by-utilization
just to make sure of what was about to happen. Ceph moved some data (not much, just a few gigabytes in a few minutes) and the MAX AVAIL space grew from 734G to 888G. Then I ran it again, lowering the overload threshold from the default 120% to 110% (the complete command is 'ceph osd reweight-by-utilization 110', preceded by 'ceph osd test-reweight-by-utilization 110' as a dry run), and now my MAX AVAIL is around 1000G :)
Probably everyone should run 'ceph osd test-reweight-by-utilization' once in a while, followed by 'ceph osd reweight-by-utilization' if everything looks good; it increases both cluster capacity and cluster performance.
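To verify the effect, re-running the commands from earlier in the thread shows the improvement (the VAR and MIN/MAX VAR / STDDEV figures in 'ceph osd df tree', and MAX AVAIL in 'ceph df detail'):
Code:
# per-OSD variance should shrink after the reweight
ceph osd df tree
# MAX AVAIL of the pool should have grown
ceph df detail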
 
