Number of disks on Ceph storage?

Mayank006

Currently, I have 1 disk in each server (3 in total) used for Ceph shared storage. I noticed that if 2 out of 3 disks stop working, the whole Ceph cluster stops working as well.
I am not sure if the same (n/2 + 1) formula applies here too.

This is concerning because we are using CephFS as well.

Does that mean I need 6 disks to tolerate a 2-disk failure, or just 4?
 
Ceph is self-healing.

What that means is: as long as you have enough free space on each node, you can sustain multiple failures, just not simultaneous ones, provided enough time is allowed for the data to be redistributed to the survivors.
Does that mean I need 6 disks to tolerate a 2-disk failure, or just 4?

If you have 3 disks per node and 3 nodes, that's 9 drives, no? Please don't say you have two nodes.
 
Ceph is self-healing.

What that means is: as long as you have enough free space on each node, you can sustain multiple failures, just not simultaneous ones, provided enough time is allowed for the data to be redistributed to the survivors.


If you have 3 disks per node and 3 nodes, that's 9 drives, no? Please don't say you have two nodes.
I have 3 servers in HA and each server has one 1TB disk that we use for Ceph. I am not thinking about server failure in this scenario.
What I mean is that with this setup, if 2 out of 3 disks die, Ceph will not work.

I was wondering whether having two 500GB disks on each server decreases the chance of disk failure. Does that mean that if 3 out of 6 disks die, then Ceph will stop working?

Can I increase the redundancy with a 6 x 500GB disk setup, since it reduces the failure point?
 
What I mean is that with this setup, if 2 out of 3 disks die, Ceph will not work.

If you have min_size=2 for your pool (and size=3) and you lose 2 disks (ON DIFFERENT SERVERS), you're going to have read-only PGs (until they have been replicated again).

If you use min_size=1, it still works with 1 disk.
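
For what it's worth, here is a toy Python sketch of that size/min_size rule (an illustration only, not how Ceph actually implements it, and the state names are simplified):

```python
# Toy model of the size/min_size behaviour described above (illustrative only,
# not Ceph's actual implementation). `surviving` is how many replicas of a
# placement group are still online.

def pg_state(size: int, min_size: int, surviving: int) -> str:
    if surviving >= size:
        return "active+clean"        # all replicas present
    if surviving >= min_size:
        return "active+degraded"     # still serving I/O, some copies missing
    if surviving >= 1:
        return "below min_size"      # I/O stalls (the read-only PGs above)
    return "data lost"

# size=3/min_size=2: losing 2 of the 3 replicas drops a PG below min_size.
print(pg_state(size=3, min_size=2, surviving=1))   # below min_size
# min_size=1 keeps serving I/O with a single surviving replica (risky).
print(pg_state(size=3, min_size=1, surviving=1))   # active+degraded
```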
 
If you have min_size=2 for your pool (and size=3) and you lose 2 disks (ON DIFFERENT SERVERS), you're going to have read-only PGs (until they have been replicated again).

If you use min_size=1, it still works with 1 disk.
I was thinking of doing that, but it gives a warning. Should I do this?
[attached screenshot of the warning]
 
I was thinking of doing that, but it gives a warning. Should I do this?
No.

Overall, the config you are proposing is not going to be performant or particularly satisfying. What I would suggest is to go back to the drawing board: what is the use case you are trying to fulfill? Traffic load, minimum performance, that kind of thing. Understand that node count, OSDs per node, network makeup (physical and logical), and CPU/RAM considerations will all come into play.

oh, one more thing:

Consider your maximum space utilization. You should never plan on operating at more than 80% OSD utilization OR pool utilization. With only 3 x 1TB OSDs (roughly 1TB usable at size=3 replication), that means your pool is effectively full at 640GB (80% of 80%).
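
As a quick back-of-the-envelope check of that 640GB figure (a sketch assuming size=3 replication plus the two 80% margins mentioned above):

```python
# Back-of-the-envelope check of the 640GB figure above.
# Assumptions: 3 x 1TB OSDs, a replicated pool with size=3, and 80% margins
# for both per-OSD utilization and pool utilization.

raw_tb = 3 * 1.0                 # three 1TB OSDs
replicas = 3                     # size=3: every object is stored three times
usable_tb = raw_tb / replicas    # ~1.0 TB of logical capacity
osd_margin = 0.80                # keep OSDs below 80% full
pool_margin = 0.80               # keep the pool below 80% full

effective_gb = usable_tb * osd_margin * pool_margin * 1000
print(f"effective capacity ~= {effective_gb:.0f} GB")   # ~= 640 GB
```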
 
No.

Overall, the config you are proposing is not going to be performant or particularly satisfying. What I would suggest is to go back to the drawing board: what is the use case you are trying to fulfill? Traffic load, minimum performance, that kind of thing. Understand that node count, OSDs per node, network makeup (physical and logical), and CPU/RAM considerations will all come into play.

oh, one more thing:

Consider your maximum space utilization. You should never plan on operating at more than 80% OSD utilization OR pool utilization. With only 3 x 1TB OSDs (roughly 1TB usable at size=3 replication), that means your pool is effectively full at 640GB (80% of 80%).
At no point are we going to use more than 500GB, since we also have OS disks for local storage. I just wanted to know: what are the drawbacks of using two 500GB disks per server instead of one 1TB disk per server?

Does using 500GB disks give me less usable space in Ceph because of the 80% rule? But does it also give me more redundancy? Doesn't it all combine to the equivalent of 3TB anyway?
 
2 x 500GB is functionally the same as 1 x 1TB for the purposes of storage capacity, but it is AT LEAST DOUBLE in performance. It also gives you a little more flexibility, since your OSD failure domain count has doubled.
That is what I just wanted to know... that now, if 3 disks fail, Ceph may not work.
 
That is what I just wanted to know... that now, if 3 disks fail, Ceph may not work.
I am not sure if I understand everything in this thread correctly. To still have a working system if three OSDs fail, you need three more copies of the data than min_size.

Take "size=5" and "min_size=2", for example: 5 regular - 2 min = 3 may fail. If three fail, the required two are still available for "normal" operation, including write access.

----
Edit: actually, I do run a very small (tiny!) Ceph cluster with size=4/min_size=2 because I want to be able to survive problems on two nodes.
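
Put as a tiny illustrative calculation (assuming a plain replicated pool with replicas on different hosts): the number of replica losses a PG can absorb while staying writable is size minus min_size.

```python
# Illustrative only: per-PG write availability of a replicated pool.
# A PG keeps accepting I/O as long as at least min_size of its `size` replicas
# survive, so it can lose (size - min_size) replicas before I/O stops.

def tolerated_failures(size: int, min_size: int) -> int:
    return size - min_size

for size, min_size in [(3, 2), (4, 2), (5, 2)]:
    print(f"size={size}/min_size={min_size}: "
          f"can lose {tolerated_failures(size, min_size)} replica(s) per PG")
```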
 
I am not sure if I understand everything in this thread correctly. To still have a working system if three OSDs fail, you need three more copies of the data than min_size.

Take "size=5" and "min_size=2", for example: 5 regular - 2 min = 3 may fail. If three fail, the required two are still available for "normal" operation, including write access.

----
Edit: actually, I do run a very small (tiny!) Ceph cluster with size=4/min_size=2 because I want to be able to survive problems on two nodes.
Thanks for your reply... The idea was to grow the Ceph setup from 3 disks to 6. Having 6 disks would raise the failure point from 2 disks to 3 before Ceph stops working.

"size=6" and "min_size=2" fail when 3 disks die.
"size=3" and "min_size=2" fail when 2 disks die.

n > 2f + 1, where n is the number of OSDs/disks needed and f is the number of OSD failures the system can tolerate.
 
"size=6" and "min_size=2" fail when 3 disks die.
Again I am not sure if I understand this.

If you have size=6 then you are writing 6 copies of the same data, distributed onto 6 OSDs.

If 4 disks die, then 2 are left. With "min_size=2" everything will continue to work, including write access. The cluster is degraded, but fine.

There is no "2f", there is no mirroring in the sense of "must be two". This would be a "sixfold-mirror" approach. And two instances of these six must be available to be allowed to write to it, which is -of course- continuously the case for normal operating systems.


"size=3" and "min_size=2" fail when 2 disks die.
Yes. When two disks fail your whole cluster comes to an immediate stop. None of the VMs can write a single bit of data to its storage...


Disclaimer: I am not a Ceph specialist; my statements are based on my own, possibly incomplete, understanding...
 
"size=6" and "min_size=2" fail when 3 disks die.
That's not how it works.

The NUMBER of OSDs doesn't matter; what matters is the number of OSDs required for normal operation per PG (placement group) and the number of OSDs required for degraded operation per PG.

Your logical arrangement doesn't change; it's still 3/2. It's worth noting what these numbers mean here: the first is the number of discrete parts that must be present for a PG to be considered healthy, and the second is the MINIMUM number of parts that must be present to allow operation. In a normal CRUSH rule implementation, each part MUST BE on a different node.

SO, some examples:

Example 1: let's say you have one failed OSD out of the 6 you have deployed. SOME of your PGs will be degraded, but since all PGs still have at least 2 surviving OSDs, all PGs will be available. The subsystem will attempt to re-establish full health by rewriting the affected PG data to the remaining free space on the node housing the missing OSD.

Example 2: let's say two OSDs fail on the same node (or the node is missing). ALL of your PGs will be degraded, but since all PGs still have at least 2 surviving OSDs, all PGs will be available. Since there is no available space on a third node to fulfill the "3" rule, the system cannot recover and will remain degraded.

Example 3: let's say you have two OSDs out, one on each of two separate nodes. Most (I'm too lazy to do the actual math) of your PGs will be degraded, but some will now be under the minimum required fragments per PG. The affected PGs will no longer allow you to use them and the file system will become read-only. Data can still be recovered with manual intervention, but most if not all dependent VMs will be hung.
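
Those three examples can be reproduced with a small toy simulation (a sketch under the assumptions above: 3 nodes with 2 OSDs each, a 3/2 pool, one replica per node; not real CRUSH placement). Example 1 leaves some PGs degraded but everything active, example 2 leaves all PGs degraded but active, and example 3 pushes some PGs below min_size.

```python
import random

# Toy simulation of the three examples above. Assumptions: 3 nodes with 2 OSDs
# each, a replicated 3/2 pool, and one replica per node (as a normal CRUSH rule
# would enforce). Not real CRUSH placement, just an illustration.

NODES = {"node1": ["osd.0", "osd.1"],
         "node2": ["osd.2", "osd.3"],
         "node3": ["osd.4", "osd.5"]}
SIZE, MIN_SIZE = 3, 2

def make_pgs(n_pgs, seed=1):
    """Give each PG one randomly chosen OSD on every node."""
    rng = random.Random(seed)
    return [tuple(rng.choice(osds) for osds in NODES.values()) for _ in range(n_pgs)]

def summarize(pgs, failed):
    """Count PG states for a given set of failed OSDs."""
    states = {"active+clean": 0, "active+degraded": 0, "inactive": 0}
    for pg in pgs:
        up = sum(osd not in failed for osd in pg)
        if up == SIZE:
            states["active+clean"] += 1
        elif up >= MIN_SIZE:
            states["active+degraded"] += 1
        else:
            states["inactive"] += 1      # below min_size: I/O stops
    return states

pgs = make_pgs(128)
print("example 1, one OSD down:              ", summarize(pgs, {"osd.0"}))
print("example 2, one whole node down:       ", summarize(pgs, {"osd.0", "osd.1"}))
print("example 3, one OSD on each of 2 nodes:", summarize(pgs, {"osd.0", "osd.2"}))
```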
 
