Ceph Question about image size

adamb

Been playing with Ceph and there is one question I can't seem to find the correct answer for.

Here is a quick example.

root@cephprimary1:~# ceph osd map Test vm-100-disk-0
osdmap e563 pool 'Test' (6) object 'vm-100-disk-0' -> pg 6.720ca493 (6.93) -> up ([19,21,6,2], p19) acting ([19,21,6,2], p19)

Based on the above output, vm-100-disk-0 has copies on OSDs 19, 21, 6 and 2. Each of these OSDs is 3.6TB, so what happens when the above image grows beyond 3.6TB? Will the image then get mapped to more OSDs? I dug quite a bit into this question and didn't come up with much on any of the mailing lists. I appreciate the input!
 
Normally the image is split into multiple objects which are distributed over multiple OSDs, depending on your pool configuration. With the defaults there should be 3 copies of each object, spread across multiple placement groups.
AFAIK every object is 4 MiB, and a placement group stores many objects, so every PG can end up holding around 6 GiB or more. So it is not just a single object that gets distributed, it is the PGs themselves.
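
You can see the object size for yourself with 'rbd info' (just a sketch, assuming the pool is still called 'Test' and the image is vm-100-disk-0; adjust the names for your setup):

root@cephprimary1:~# rbd info Test/vm-100-disk-0

The output should contain a line like 'order 22 (4 MiB objects)', which is the object size the image is striped into, and a 'block_name_prefix' line, which is the common name prefix of all the RADOS objects that back this image.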
 

Still having issues grasping this.

root@cephprimary5:~# ceph osd map Test vm-101-disk-0
osdmap e606 pool 'Test' (6) object 'vm-101-disk-0' -> pg 6.34927f8e (6.18e) -> up ([0,6,19,15], p0) acting ([0,6,19,15], p0)

The above makes me think vm-101-disk-0 is on OSDs 0, 6, 19 and 15. I am keeping 4 replicas, so 4 OSDs makes sense. I just don't understand what happens when this image becomes larger than the OSD it resides on.
 
The first OSD that hits the 85% capacity limit will give you a 'near full OSD' warning. When an OSD hits 95% capacity, all pools that use that specific OSD will be put into read-only mode.
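
You can check which thresholds your cluster actually uses (a quick sketch; on recent Ceph releases the ratios are stored in the OSD map):

root@cephprimary1:~# ceph osd dump | grep ratio
root@cephprimary1:~# ceph osd df

The first command should print the nearfull_ratio (default 0.85) and full_ratio (default 0.95), and 'ceph osd df' shows how full each individual OSD currently is.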
 

So then, based on that, a VM image can never grow larger than the physical size of an OSD? So if my OSDs are all 4TB, I can't have images larger than 4TB?
 
An image is not limited to the size of a single OSD. You can grow and shrink the usable space by adding or removing OSDs. You can check the usage with 'ceph df detail'.
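
For reference (just a sketch of the command, the actual numbers will of course depend on your cluster):

root@cephprimary5:~# ceph df detail

The per-pool 'MAX AVAIL' column is the interesting one here: it is an estimate of how much more data the pool can take, already divided by the pool's replica count, so it grows when you add OSDs and shrinks as the existing ones fill up.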
 

Ok, well that is good. Once the VM image is larger than a single OSD, would the following output be different?

root@cephprimary5:~# ceph osd map Test vm-101-disk-0
osdmap e606 pool 'Test' (6) object 'vm-101-disk-0' -> pg 6.34927f8e (6.18e) -> up ([0,6,19,15], p0) acting ([0,6,19,15], p0)

Will I see the image on more than 4 OSDs?
 

Ceph splits an image into 4MB chunks and spreads those chunks across the pool's PGs; each PG is assigned to 4 OSDs (matching your replica count), with each PG getting a different ordering of OSDs to spread the load.

The command you are running just shows one example of a PG placement for that name. The image will most probably have a small bit of data in every single PG of the pool, which is why, if even one PG becomes down/unavailable, there is a good chance that a small subset of your data will be unreadable during that time.

The benefit of this is that, because each object is at most 4MB, you will never hit a single-disk limit; your limit is set by the total amount of storage you have available across all your OSDs, combined with a sensible number of PGs (pg_num) set for the pool.
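
If you want to see the spread for yourself, you can map a few of the image's actual data objects instead of the bare image name (a rough sketch, assuming the pool 'Test' and image vm-101-disk-0 from your earlier output; the rbd_data prefix will be different on your cluster, and 'rados ls' can take a while on a large pool):

root@cephprimary5:~# PREFIX=$(rbd info Test/vm-101-disk-0 | awk '/block_name_prefix/ {print $2}')
root@cephprimary5:~# rados -p Test ls | grep "$PREFIX" | head -5 | while read obj; do ceph osd map Test "$obj"; done

You will most likely see a different PG and a different set of 4 OSDs for each object, which is exactly the distribution described above.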
 

This helped me understand it a lot. I appreciate the input!
 
