Ceph Question about image size

adamb

Been playing with Ceph and there is one question I can't seem to find the correct answer for.

Here is a quick example.

root@cephprimary1:~# ceph osd map Test vm-100-disk-0
osdmap e563 pool 'Test' (6) object 'vm-100-disk-0' -> pg 6.720ca493 (6.93) -> up ([19,21,6,2], p19) acting ([19,21,6,2], p19)

Based on the above output, vm-100-disk-0 has copies on OSDs 19, 21, 6 and 2. Each of these OSDs is 3.6TB, so what happens when the above image grows beyond 3.6TB? Will the image then get mapped to more OSDs? I dug quite a bit into this question and didn't come up with much on any of the mailing lists. I appreciate the input!
 
Normally the image is split into multiple objects which are distributed over multiple OSDs, depending on your pool configuration. With the defaults there should be 3 copies of each object, spread across multiple placement groups.
AFAIK every object is 4 MiB, and a placement group stores many objects, so every PG can end up holding around 6 GiB or more. So it is not just a single object that gets distributed, it is the PGs themselves.
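
You can see the object size for yourself with 'rbd info' (just a sketch, assuming the pool is still called 'Test' and the image is vm-100-disk-0; adjust the names for your setup):

root@cephprimary1:~# rbd info Test/vm-100-disk-0

The output should contain a line like 'order 22 (4 MiB objects)', which is the object size the image is striped into, and a 'block_name_prefix' line, which is the common name prefix of all the RADOS objects that back this image.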
 

Still having issues grasping this.

root@cephprimary5:~# ceph osd map Test vm-101-disk-0
osdmap e606 pool 'Test' (6) object 'vm-101-disk-0' -> pg 6.34927f8e (6.18e) -> up ([0,6,19,15], p0) acting ([0,6,19,15], p0)

The above makes me think vm-101-disk-0 is on OSDs 0, 6, 19 and 15. I am keeping 4 replicas, so 4 OSDs makes sense. I just don't understand what happens when this image becomes larger than the OSD it resides on.
 
The first OSD that hits the 85% capacity limit will give you a 'near full OSD' warning. When an OSD hits 95% capacity, all pools that use that specific OSD will be put into read-only mode.
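
You can check which thresholds your cluster actually uses (a quick sketch; on recent Ceph releases the ratios are stored in the OSD map):

root@cephprimary1:~# ceph osd dump | grep ratio
root@cephprimary1:~# ceph osd df

The first command should print the nearfull_ratio (default 0.85) and full_ratio (default 0.95), and 'ceph osd df' shows how full each individual OSD currently is.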
 

So then, based on that, a VM image can never grow larger than the physical size of an OSD? So if my OSDs are all 4TB, I can't have images larger than 4TB?
 
An image is not limited to the size of a single OSD. You can grow and shrink the usable space by adding or removing OSDs. You can check the usage with 'ceph df detail'.
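
For reference (just a sketch of the command, the actual numbers will of course depend on your cluster):

root@cephprimary5:~# ceph df detail

The per-pool 'MAX AVAIL' column is the interesting one here: it is an estimate of how much more data the pool can take, already divided by the pool's replica count, so it grows when you add OSDs and shrinks as the existing ones fill up.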
 

Ok, well that is good. Once the VM image is larger than a single OSD, would the following output be different?

root@cephprimary5:~# ceph osd map Test vm-101-disk-0
osdmap e606 pool 'Test' (6) object 'vm-101-disk-0' -> pg 6.34927f8e (6.18e) -> up ([0,6,19,15], p0) acting ([0,6,19,15], p0)

Will I see the image on more than 4 OSDs?
 

Ceph splits an image into 4MB chunks and spreads those chunks across the pool's PGs; each PG is assigned to 4 OSDs (matching your replica count), with each PG getting a different ordering of OSDs to spread the load.

The command you are running just shows one example of a PG placement for that name. The image will most probably have a small bit of data in every single PG of the pool, which is why, if even one PG becomes down/unavailable, there is a good chance that a small subset of your data will be unreadable during that time.

The benefit of this is that, because each object is at most 4MB, you will never hit a single-disk limit; your limit is set by the total amount of storage you have available across all your OSDs, combined with a sensible number of PGs (pg_num) set for the pool.
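
If you want to see the spread for yourself, you can map a few of the image's actual data objects instead of the bare image name (a rough sketch, assuming the pool 'Test' and image vm-101-disk-0 from your earlier output; the rbd_data prefix will be different on your cluster, and 'rados ls' can take a while on a large pool):

root@cephprimary5:~# PREFIX=$(rbd info Test/vm-101-disk-0 | awk '/block_name_prefix/ {print $2}')
root@cephprimary5:~# rados -p Test ls | grep "$PREFIX" | head -5 | while read obj; do ceph osd map Test "$obj"; done

You will most likely see a different PG and a different set of 4 OSDs for each object, which is exactly the distribution described above.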
 

This helped me understand it a lot. I appreciate the input!
 
