Usable space on Ceph Storage

aychprox · Dec 18, 2015

Hi,

I am trying to understand the usable space shown in Proxmox under Ceph storage. I tried to Google it but had no luck finding a direct answer. I would appreciate it if a senior member here could guide me on how to calculate usable space.

I referred to https://forum.proxmox.com/threads/newbie-need-your-input.24176/page-2 but my numbers seem different from the example given by Q-wulf.

Current setup:

4 nodes, each with 4 x 1TB OSDs, 1 x 120GB SSD for journal and 1 x 500GB HDD for OS

Pool:

Size: 3
Min: 1
Pg_number: 1024

In ceph storage summary:

Type: RBD
Size: 14.55TB

Ceph Configuration:

[global]
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
cluster network = 10.50.51.0/24
filestore xattr use omap = true
fsid = bf5d56ae-xxx-4db1-xxx-b11ddxxxcbd6a
keyring = /etc/pve/priv/$cluster.$name.keyring
osd journal size = 5120
osd pool default min size = 1
public network = 10.50.51.0/24

[osd]
keyring = /var/lib/ceph/osd/ceph-$id/keyring
osd max backfills = 1
osd recovery max active = 1
filestore flusher = false

[mon.1]
host = node2
mon addr = 10.50.51.16:6789

[mon.0]
host = node1
mon addr = 10.50.51.15:6789

[mon.2]
host = node3
mon addr = 10.50.51.17:6789

Crush map rules:

# rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}

spirit · Dec 18, 2015

The available space shown in the Proxmox GUI is the total raw space of all OSDs.

If you have a pool with size=3, you need to divide that space by 3.

(You can have multiple pools with different size values.)
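As a rough sketch of that division for the setup in the first post: the snippet below assumes the 14.55TB shown in the GUI is simply 16 x 1 TB (decimal) disks expressed in binary units, and that a single pool with size=3 uses all of it. The function names are made up for illustration.

```python
TIB = 2 ** 40  # bytes in one tebibyte


def raw_capacity_tib(num_osds, osd_bytes):
    """Total raw capacity of all OSDs, in TiB."""
    return num_osds * osd_bytes / TIB


def usable_tib(raw_tib, pool_size):
    """Space left for actual data when every object is stored pool_size times."""
    return raw_tib / pool_size


# 4 nodes x 4 OSDs, each a 1 TB (10**12 bytes) disk (assumption: decimal TB drives)
raw = raw_capacity_tib(16, 10 ** 12)
print(round(raw, 2))                 # 14.55
print(round(usable_tib(raw, 3), 2))  # 4.85
```

So with size=3 this cluster would hold a bit under 5 TB of actual data, before leaving any headroom for recovery.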

aychprox · Dec 18, 2015

spirit said:
(You can have multiple pools with different size values.)

Do you mean the same storage but with a different pool configuration, e.g. size = 2 to keep 2 replicas per file?

spirit · Dec 18, 2015

aychprox said:
Do you mean the same storage but with a different pool configuration, e.g. size = 2 to keep 2 replicas per file?

yes.

Q-wulf · Dec 18, 2015

Let's assume you ran "ceph -w" and got the following output:

2015-12-18 12:57:55.812746 mon.0 [INF] pgmap v455862: 512 pgs: 512 active+clean; 30904 MB data, 116 GB used, 6871 GB / 6988 GB avail; 28511 B/s wr, 10 op/s

That means the following:
available: 6871 GB / 6988 GB --> the total capacity of all your OSDs is 6988 GB. Out of that, 6871 GB are free.
used: 116 GB --> there is 116 GB of space taken up by data in placement groups (of all pools).
data: 30904 MB --> there is about 31 GB of actual data (before replication or erasure coding) residing on all of your pools combined.
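If you want to pull those three figures out of the line programmatically, a minimal sketch could look like this. The regular expression is written against the sample line above only; the status format differs between Ceph releases, and the binary unit steps are an assumption.

```python
import re

# Sample line copied from the output above.
LINE = ("2015-12-18 12:57:55.812746 mon.0 [INF] pgmap v455862: 512 pgs: "
        "512 active+clean; 30904 MB data, 116 GB used, "
        "6871 GB / 6988 GB avail; 28511 B/s wr, 10 op/s")

# Size of each unit expressed in GB (assumption: 1024-based steps).
UNIT_IN_GB = {"kB": 1 / 1024 ** 2, "MB": 1 / 1024, "GB": 1.0, "TB": 1024.0}

PATTERN = re.compile(
    r"(\d+) (\w+) data, (\d+) (\w+) used, (\d+) (\w+) / (\d+) (\w+) avail"
)


def parse_pgmap(line):
    """Return data/used/free/total in GB from a pgmap status line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("no pgmap usage figures found")
    groups = match.groups()
    names = ("data", "used", "free", "total")
    return {
        name: int(groups[2 * i]) * UNIT_IN_GB[groups[2 * i + 1]]
        for i, name in enumerate(names)
    }


stats = parse_pgmap(LINE)
print(stats["used"], stats["free"], stats["total"])  # 116.0 6871.0 6988.0
print(round(stats["data"], 1))                       # 30.2
```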

Math

For replicated pools it works like this:
example 4/2 (size/min_size) --> each Gigabyte of actual data you put into the pool gets multiplied by "size" - so 4.

For Erasure Coded pools it works like this:

example k=3 m=1 --> each Gigabyte of actual data you put into the pool gets multiplied by a factor of (1+ m/k) so 1+1/3 roughly 1,33.
example k=2 m=2 --> each Gigabyte of actual data you put into the pool gets multiplied by a factor of (1+ m/k) so 1+2/2 so 2.
example k=20 m = 4 --> each Gigabyte of actual data you put into the pool gets multiplied by a factor of (1+ m/k) so 1+4/20 so 1,2.
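The same math as a small sketch, using the factor (k + m) / k, i.e. 1 + m/k, which matches the numbers worked out in the examples above. The function names are only for illustration.

```python
def replicated_factor(size):
    """Raw space consumed per unit of data in a replicated pool."""
    return float(size)


def ec_factor(k, m):
    """Raw space consumed per unit of data in an erasure coded pool.

    k data chunks plus m coding chunks are stored for every k chunks of data.
    """
    return (k + m) / k


print(replicated_factor(4))        # 4.0
print(round(ec_factor(3, 1), 2))   # 1.33
print(round(ec_factor(2, 2), 2))   # 2.0
print(round(ec_factor(20, 4), 2))  # 1.2
```

To go the other way (how much actual data fits into a given amount of raw space), divide the raw space by the factor.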

hope that helps (and should be the same as the example you quoted (unless my math was off then))

Just an FYI: your available space is mostly meaningless for the operation of your nodes, btw. It only gives you a general idea. This is because you can have multiple pools that use different sizes and types (replicated vs erasure coded) and make use of different failure domains on a per-pool basis (if you set it, e.g. OSD, Host, Room, Building, Datacenter).

You also can end up having different OSD sizes (once your ceph cluster grows organically) or different number of OSDs per Host.

What you have to do is look at the "Ceph" > "OSD" overview in Proxmox and make sure that none of your OSDs ever gets to 100%. If one gets close, you can reweight it, so some data is moved off it and onto other OSDs according to your crush ruleset.
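A tiny sketch of that check, in case you want to script it. The usage numbers are invented for illustration and the 0.85 warning threshold is an assumption, so use whatever ratio your cluster is configured with; in practice you would feed in the percentages from the OSD overview.

```python
NEARFULL = 0.85  # assumed warning threshold, adjust to your cluster


def osds_over(usage, threshold=NEARFULL):
    """Return the names of OSDs whose fill ratio is at or above threshold."""
    return sorted(name for name, ratio in usage.items() if ratio >= threshold)


# Hypothetical fill ratios (fraction of each OSD that is used)
usage = {"osd.0": 0.62, "osd.1": 0.88, "osd.2": 0.57, "osd.3": 0.91}
print(osds_over(usage))  # ['osd.1', 'osd.3']
```

The OSDs it lists are the candidates for reweighting, even if the cluster-wide "avail" figure still looks comfortable.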

Hope that sheds some light onto it.

aychprox · Dec 18, 2015

Thanks for the clear example and explanation.
Definitely another meaningful class for me today!

Q-wulf · Dec 19, 2015

aychprox said:
Thanks for the clear example and explanation.
Definitely another meaningful class for me today!

2 things I forgot:

On EC pools you typically choose a leaf of "OSD", since you want as many OSDs in that EC pool as you can, to get the lowest overhead for parity chunks you can get.
On replicated pools you typically go with at least a leaf of "Host" (unless it is a single-node Ceph cluster), since you want to make sure you can survive not only failing OSDs but also a failing host. Depending on the requirement(s) you might want to take a leaf with a higher id (check your crushmap).

lemme know how your SSHD's are turning out

aychprox · Dec 19, 2015

Q-wulf said:
lemme know how your SSHD's are turning out

Sorry about this, unfortunately at this moment I can't bring down the in-operation SSHD servers for a Ceph pool. But I just updated the benchmark for a single SSHD; the results seem weird and much slower compared to those with hardware RAID. https://forum.proxmox.com/threads/newbie-need-your-input.24176/

For general use, in the above setup, we use 7200rpm HGST drives as OSDs.

I will post an update once the SSHD pool is ready!

fips · Aug 5, 2017

I am bumping this thread, but I still don't understand how it is counted:

My 3 node pool has 24x 136GB disks, so 3.2TB total space.

The Images in the lxc pool need 178GB, the images in the vm pool 784GB.
Then why does it say usage: 1,43 TB??

Even if it's counted twice because of replication it should then be 1,9 TB..
Somehow it confuses me...

Q-wulf · Aug 7, 2017

fips said:
I am bumping this thread, but I still don't understand how it is counted:

Been less than 20 months, it's fine

fips said:
(...)
My 3 node pool has 24x 136GB disks, so 3.2TB total space.

The Images in the lxc pool need 178GB, the images in the vm pool 784GB.
Then why does it say usage: 1,43 TB??

Even if it's counted twice because of replication it should then be 1,9 TB..
Somehow it confuses me...

Q1:
Replicated Pool ?

Q2:
8 OSD per node ?

Q3:
Same failure Domain ? (As in Host/Node as opposed to OSD)

Q4 (if Q1, Q2 and Q3 = yes):

Did you set size == 3 and min_size == 1 for said replicated pool ?
Is that your only pool ? What settings do those other pools use ?

Q5:
Can you provide the output of "ceph -w" from the CLI of one of your mons?

fips · Aug 10, 2017

Well, I set size = 3 and min = 2. I have 2 pools called ceph-lxc and ceph-vm; they have been configured exactly as described in the Proxmox video.

Output:

Code:

cluster c4d0e591-a919-4df0-8627-d2fda956f7ff
    health HEALTH_OK
    monmap e3: 3 mons at {0=172.30.3.21:6789/0,1=172.30.3.22:6789/0,2=172.30.3.23:6789/0}
           election epoch 58, quorum 0,1,2 0,1,2
    osdmap e1964: 24 osds: 24 up, 24 in
           flags sortbitwise,require_jewel_osds
     pgmap v679695: 1024 pgs, 2 pools, 666 GB data, 167 kobjects
           1991 GB used, 1287 GB / 3279 GB avail
               1024 active+clean
 client io 41109 B/s rd, 430 kB/s wr, 5 op/s rd, 63 op/s wr

fabian · Aug 11, 2017

You have around 666GB of actual (logical) data, which is replicated 3x, for a total of 1991GB of physically used space on your OSD disks. The 666GB figure is rounded, which is why you get a bit less than the expected 1998GB used. The (logical) usage as seen by Ceph is sometimes higher than what you see from the client side, because Ceph chunks your data into objects and counts those. Also, reclaiming via trim does not always recover the full space like it would on a physical device.
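The arithmetic spelled out as a short sketch, using only the numbers from the status output posted above and size = 3. The last figure assumes both pools keep size = 3 and ignores any headroom you would want to leave free.

```python
# Figures from the posted status output, all in GB
data_gb = 666
used_gb = 1991
free_gb = 1287
total_gb = 3279
size = 3  # replicas per object, as set on both pools

expected_used = data_gb * size        # what 666 GB x 3 would occupy
implied_data = used_gb / size         # logical data implied by the raw usage
remaining_logical = free_gb / size    # logical data that still fits

print(expected_used)                  # 1998
print(round(implied_data, 1))         # 663.7
print(remaining_logical)              # 429.0
print(used_gb + free_gb)              # 3278, vs. 3279 total due to rounding
```

So the 7GB gap between 1991 and 1998 is just the rounding of roughly 663.7GB up to 666GB in the summary line.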

fips · Aug 11, 2017

Thanks Fabian for that great explanation.
