Number of disks on Ceph storage?

Mayank006

Currently, I have 1 disk in each server (3 in total) used for Ceph shared storage. I noticed that if 2 out of 3 disks stop working, the whole Ceph cluster stops working as well.
I am not sure if the same (n/2 + 1) formula applies here too.

This is concerning because we are using CephFS as well.

Does that mean I need 6 disks to tolerate a 2-disk failure, or just 4?
 
Ceph is self-healing.

What that means is: as long as you have enough free space on each node, you can sustain multiple failures, just not simultaneous ones, provided enough time is allowed for the data to be redistributed to the survivors.
Does that mean I need 6 disks to tolerate a 2-disk failure, or just 4?

If you have 3 disks per node and 3 nodes, that's 9 drives, no? Please don't say you have two nodes.
 
Ceph is self-healing.

What that means is: as long as you have enough free space on each node, you can sustain multiple failures, just not simultaneous ones, provided enough time is allowed for the data to be redistributed to the survivors.


If you have 3 disks per node and 3 nodes, that's 9 drives, no? Please don't say you have two nodes.
I have 3 servers in HA and each server has one 1TB disk that we use for Ceph. I am not thinking about server failure in this scenario.
What I mean is that with this setup, if 2 out of 3 disks die, Ceph will not work.

I was wondering whether having two 500GB disks on each server decreases the chance of disk failure. Does that mean that if 3 out of 6 disks die, then Ceph will stop working?

Can I increase the redundancy with a 6 x 500GB disk setup, since it reduces the failure point?
 
What I mean is that with this setup, if 2 out of 3 disks die, Ceph will not work.

If you have min_size=2 for your pool (and size=3) and you lose 2 disks (ON DIFFERENT SERVERS), you're going to have read-only PGs (until they have been replicated again).

If you use min_size=1, it still works with 1 disk.
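
For what it's worth, here is a toy Python sketch of that size/min_size rule (an illustration only, not how Ceph actually implements it, and the state names are simplified):

```python
# Toy model of the size/min_size behaviour described above (illustrative only,
# not Ceph's actual implementation). `surviving` is how many replicas of a
# placement group are still online.

def pg_state(size: int, min_size: int, surviving: int) -> str:
    if surviving >= size:
        return "active+clean"        # all replicas present
    if surviving >= min_size:
        return "active+degraded"     # still serving I/O, some copies missing
    if surviving >= 1:
        return "below min_size"      # I/O stalls (the read-only PGs above)
    return "data lost"

# size=3/min_size=2: losing 2 of the 3 replicas drops a PG below min_size.
print(pg_state(size=3, min_size=2, surviving=1))   # below min_size
# min_size=1 keeps serving I/O with a single surviving replica (risky).
print(pg_state(size=3, min_size=1, surviving=1))   # active+degraded
```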
 
If you have min_size=2 for your pool (and size=3) and you lose 2 disks (ON DIFFERENT SERVERS), you're going to have read-only PGs (until they have been replicated again).

If you use min_size=1, it still works with 1 disk.
I was thinking of doing that, but it gives a warning. Should I do this?
[attached screenshot of the warning]
 
I was thinking of doing that, but it gives a warning. Should I do this?
No.

Overall, the config you are proposing is not going to be performant or particularly satisfying. What I would suggest is to go back to the drawing board: what is the use case you are trying to fulfill? Traffic load, minimum performance, that kind of thing. Understand that node count, OSDs per node, network makeup (physical and logical), and CPU/RAM considerations will all come into play.

oh, one more thing:

Consider your maximum space utilization. You should never plan on operating at more than 80% OSD utilization OR pool utilization. With only 3 x 1TB OSDs (roughly 1TB usable at size=3 replication), that means your pool is effectively full at 640GB (80% of 80%).
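
As a quick back-of-the-envelope check of that 640GB figure (a sketch assuming size=3 replication plus the two 80% margins mentioned above):

```python
# Back-of-the-envelope check of the 640GB figure above.
# Assumptions: 3 x 1TB OSDs, a replicated pool with size=3, and 80% margins
# for both per-OSD utilization and pool utilization.

raw_tb = 3 * 1.0                 # three 1TB OSDs
replicas = 3                     # size=3: every object is stored three times
usable_tb = raw_tb / replicas    # ~1.0 TB of logical capacity
osd_margin = 0.80                # keep OSDs below 80% full
pool_margin = 0.80               # keep the pool below 80% full

effective_gb = usable_tb * osd_margin * pool_margin * 1000
print(f"effective capacity ~= {effective_gb:.0f} GB")   # ~= 640 GB
```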
 
No.

Overall, the config you are proposing is not going to be performant or particularly satisfying. What I would suggest is to go back to the drawing board: what is the use case you are trying to fulfill? Traffic load, minimum performance, that kind of thing. Understand that node count, OSDs per node, network makeup (physical and logical), and CPU/RAM considerations will all come into play.

oh, one more thing:

Consider your maximum space utilization. You should never plan on operating at more than 80% OSD utilization OR pool utilization. With only 3 x 1TB OSDs (roughly 1TB usable at size=3 replication), that means your pool is effectively full at 640GB (80% of 80%).
At no point are we going to use more than 500GB, since we also have OS disks for local storage. I just wanted to know: what are the drawbacks of using two 500GB disks per server instead of one 1TB disk per server?

Does using 500GB disks give me less usable space in Ceph because of the 80% rule? But does it also give me more redundancy? Doesn't it all combine to the equivalent of 3TB anyway?
 
2 x 500GB is functionally the same as 1 x 1TB for the purposes of storage capacity, but it is AT LEAST DOUBLE in performance. It also gives you a little more flexibility, since your OSD failure domain count has doubled.
That is what I just wanted to know... that now, if 3 disks fail, Ceph may not work.
 
That is what I just wanted to know... that now, if 3 disks fail, Ceph may not work.
I am not sure if I understand everything in this thread correctly. To still have a working system if three OSDs fail, you need three more copies of the data than min_size.

Take "size=5" and "min_size=2", for example: 5 regular - 2 min = 3 may fail. If three fail, the required two are still available for "normal" operation, including write access.

----
Edit: actually, I do run a very small (tiny!) Ceph cluster with size=4/min_size=2 because I want to be able to survive problems on two nodes.
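
Put as a tiny illustrative calculation (assuming a plain replicated pool with replicas on different hosts): the number of replica losses a PG can absorb while staying writable is size minus min_size.

```python
# Illustrative only: per-PG write availability of a replicated pool.
# A PG keeps accepting I/O as long as at least min_size of its `size` replicas
# survive, so it can lose (size - min_size) replicas before I/O stops.

def tolerated_failures(size: int, min_size: int) -> int:
    return size - min_size

for size, min_size in [(3, 2), (4, 2), (5, 2)]:
    print(f"size={size}/min_size={min_size}: "
          f"can lose {tolerated_failures(size, min_size)} replica(s) per PG")
```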
 
I am not sure if I understand everything in this thread correctly. To still have a working system if three OSDs fail, you need three more copies of the data than min_size.

Take "size=5" and "min_size=2", for example: 5 regular - 2 min = 3 may fail. If three fail, the required two are still available for "normal" operation, including write access.

----
Edit: actually, I do run a very small (tiny!) Ceph cluster with size=4/min_size=2 because I want to be able to survive problems on two nodes.
Thanks for your reply... The idea was to grow the Ceph setup from 3 disks to 6. Having 6 disks would raise the failure point from 2 disks to 3 before Ceph stops working.

"size=6" and "min_size=2" fail when 3 disks die.
"size=3" and "min_size=2" fail when 2 disks die.

n > 2f + 1, where n is the number of OSDs/disks needed and f is the number of OSD failures the system can tolerate.
 
"size=6" and "min_size=2" fail when 3 disks die.
Again I am not sure if I understand this.

If you have size=6 then you are writing 6 copies of the same data, distributed onto 6 OSDs.

If 4 disks die, then 2 are left. With "min_size=2" everything will continue to work, including write access. The cluster is degraded, but fine.

There is no "2f", there is no mirroring in the sense of "must be two". This would be a "sixfold-mirror" approach. And two instances of these six must be available to be allowed to write to it, which is -of course- continuously the case for normal operating systems.


"size=3" and "min_size=2" fail when 2 disks die.
Yes. When two disks fail your whole cluster comes to an immediate stop. None of the VMs can write a single bit of data to its storage...


Disclaimer: I am not a Ceph specialist; my statements are based on my own, possibly incomplete, understanding...
 
"size=6" and "min_size=2" fail when 3 disks die.
That's not how it works.

The NUMBER of OSDs doesn't matter; what matters is the number of OSDs required for normal operation per PG (placement group) and the number of OSDs required for degraded operation per PG.

Your logical arrangement doesn't change; it's still 3/2. It's worth noting what these numbers mean here: the first is the number of discrete parts that must be present for a PG to be considered healthy, and the second is the MINIMUM number of parts that must be present to allow operation. In a normal CRUSH rule implementation, each part MUST BE on a different node.

SO, some examples:

Example 1: let's say you have one failed OSD out of the 6 you have deployed. SOME of your PGs will be degraded, but since all PGs still have at least 2 surviving OSDs, all PGs will be available. The subsystem will attempt to re-establish full health by rewriting the affected PG data to the remaining free space on the node housing the missing OSD.

Example 2: let's say two OSDs fail on the same node (or the node is missing). ALL of your PGs will be degraded, but since all PGs still have at least 2 surviving OSDs, all PGs will be available. Since there is no available space on a third node to fulfill the "3" rule, the system cannot recover and will remain degraded.

Example 3: let's say you have two OSDs out, one on each of two separate nodes. Most (I'm too lazy to do the actual math) of your PGs will be degraded, but some will now be under the minimum required fragments per PG. The affected PGs will no longer allow you to use them and the file system will become read-only. Data can still be recovered with manual intervention, but most if not all dependent VMs will be hung.
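
Those three examples can be reproduced with a small toy simulation (a sketch under the assumptions above: 3 nodes with 2 OSDs each, a 3/2 pool, one replica per node; not real CRUSH placement). Example 1 leaves some PGs degraded but everything active, example 2 leaves all PGs degraded but active, and example 3 pushes some PGs below min_size.

```python
import random

# Toy simulation of the three examples above. Assumptions: 3 nodes with 2 OSDs
# each, a replicated 3/2 pool, and one replica per node (as a normal CRUSH rule
# would enforce). Not real CRUSH placement, just an illustration.

NODES = {"node1": ["osd.0", "osd.1"],
         "node2": ["osd.2", "osd.3"],
         "node3": ["osd.4", "osd.5"]}
SIZE, MIN_SIZE = 3, 2

def make_pgs(n_pgs, seed=1):
    """Give each PG one randomly chosen OSD on every node."""
    rng = random.Random(seed)
    return [tuple(rng.choice(osds) for osds in NODES.values()) for _ in range(n_pgs)]

def summarize(pgs, failed):
    """Count PG states for a given set of failed OSDs."""
    states = {"active+clean": 0, "active+degraded": 0, "inactive": 0}
    for pg in pgs:
        up = sum(osd not in failed for osd in pg)
        if up == SIZE:
            states["active+clean"] += 1
        elif up >= MIN_SIZE:
            states["active+degraded"] += 1
        else:
            states["inactive"] += 1      # below min_size: I/O stops
    return states

pgs = make_pgs(128)
print("example 1, one OSD down:              ", summarize(pgs, {"osd.0"}))
print("example 2, one whole node down:       ", summarize(pgs, {"osd.0", "osd.1"}))
print("example 3, one OSD on each of 2 nodes:", summarize(pgs, {"osd.0", "osd.2"}))
```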
 
