Ceph pool size (is 2/1 really a bad idea?)

OK, that makes sense. I was just hopeful that I had missed something, based on aaron's post referencing 4 nodes with 2 nodes down.

Perhaps there is a way to rig it, just like we can run pvecm expected 1 to keep PVE working when it loses quorum. Is there something similar that can be done for Ceph in the 4/2 situation to make it work manually for the time being, until the down nodes are back in service? For example, setting the replicas to 2/1, as in the sketch below - but wouldn't that cause issues on node recovery?
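For reference, something like this is what I have in mind; "vm_pool" is just a placeholder for the actual pool name, and the Ceph commands would of course only work while the monitors still have quorum:

Code:
# regain Proxmox VE quorum manually on a surviving node
pvecm expected 1

# temporarily drop the Ceph pool to 2 replicas with min_size 1
ceph osd pool set vm_pool size 2
ceph osd pool set vm_pool min_size 1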

Also, when 2 out of 4 nodes are down, Ceph seems unresponsive; I cannot get any commands such as ceph mon stat or ceph -s to work.

thx
 
Thanks @Bengt Nolin for noticing the even number of MONs, and sorry that I missed the MONs.

You will always need a majority of the Monitors. In a 4-node cluster, if you run the minimum of 3 MONs, you can lose up to 2 nodes, if they are the right ones: one node with a MON and the one without any. If you want to be able to lose any 2 nodes, I am afraid you will need to add another node (and run a MON on every node) to have 5 in your cluster.
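To see which MONs are currently part of the quorum, the usual status commands are enough:

Code:
ceph mon stat
ceph quorum_status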

Since I guess it is not likely that 2 nodes will fail within a short time, one could try to remove the MON of the failed node from the monmap and create a new one on the node that did not have a MON yet, to be back up to 3. With this, any 1 node could fail again.

But this is something that needs to be tested before it goes into production, and without trying it myself I am not sure how easy that procedure would be and how the cluster will react once the first failed node comes back up. A rough sketch of the steps is below.
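A rough, untested sketch of that procedure; the node names are made up, assuming the MON of the failed node "pve2" has to go and "pve4" is the node without a MON so far:

Code:
# on a surviving node: check that the remaining MONs still have quorum
ceph mon stat

# remove the dead MON from the monmap (only possible while there is quorum)
ceph mon remove pve2
# the leftover mon entry in /etc/pve/ceph.conf may need manual cleanup as well

# on pve4: create a new MON to get back to 3
pveceph mon create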


Another thing I noticed: 2 OSDs per node. While this can work for large clusters, smaller clusters need a bit more consideration in this regard. A 4-node cluster should not be affected that much, but a 3-node cluster certainly is.

Let's focus on a 3-node cluster: with 2 OSDs per node, how will Ceph react if not a whole node, but a single OSD within a node fails?

Since there is still one OSD left in that node, Ceph will try to recover all the data that should be on that node onto the remaining OSD. This is fine if both OSDs were filled to below 50%, or rather ~40%, as the remaining OSD would then end up at about 80%, which is getting close to the default near-full warning threshold (85%).

If the OSDs contain more data than that, you are likely to run into the situation where the remaining OSD runs out of space, and then there will be trouble. Therefore, it is a good idea to have more, but smaller OSDs, so the loss of one can be handled better and the data can be spread across multiple other OSDs.

In clusters with more nodes than that, it is not as much of a problem, as the data can also be stored on other nodes while still keeping redundancy at the host level.
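To keep an eye on how full the individual OSDs are, and how much headroom the loss of one would leave:

Code:
# per-OSD utilization, grouped by host
ceph osd df tree
# pool-level usage
ceph df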
 
Empirically tested, working with 2 out of 4 nodes if they are the right ones :)

I cannot change the number of OSDs, as the servers have only 4 bays and two are needed for the system RAID - it is kind of a small cluster with limited RAM for non-demanding VMs. No issues with the OSDs; I have relatively large drives, and the pool will never be filled beyond 50% for other reasons, like networking throughput. I know how much the VMs will consume in processing/network, so I know how many of them I can put in the pool. From my tests, the cluster worked perfectly on 2 nodes.

I am running out of time as the servers need to go into production, but I will attempt to figure out how to make it work with any 2 out of 4 nodes (meaning monitors on all nodes).

Perhaps the best option would be to add a VM running on another cluster as an extra node with no storage, just an additional monitor - what do you think?

Thank you
 
Hmm, another question that comes up is how you maintain quorum for Proxmox VE itself. While there is the QDevice mechanism for that, there is no such thing for Ceph monitors. pvecm expected is also not a good approach, as it needs manual intervention and can easily lead to unexpected behavior when not handled with great care.

In that situation, I would recommend adding a smaller 5th node to the cluster. It can provide another vote for Proxmox VE and, with a Ceph MON installed, also for the Ceph cluster. Since it is part of the Proxmox VE cluster, any changes to the Ceph config file are synced via the Proxmox Cluster FS, like on any other node.
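Roughly, that would look like this; the IP address is just an example for one of the existing cluster nodes:

Code:
# on the new 5th node: join the existing Proxmox VE cluster
pvecm add 192.0.2.10

# install Ceph on it and create the additional monitor
pveceph install
pveceph mon create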

You should just make sure that any storage is only configured for the 4 full nodes, and if you plan to use HA, use HA groups to explicitly only let the VMs recover on the 4 full nodes.
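For example, with placeholder storage and node names:

Code:
# restrict an existing storage definition to the 4 full nodes
pvesm set ceph-vm --nodes pve1,pve2,pve3,pve4

# restricted HA group so VMs only recover on the 4 full nodes
ha-manager groupadd full-nodes --nodes pve1,pve2,pve3,pve4 --restricted 1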

Any other way, unfortunately, will be hacky and will cause problems in one way or another, depending on the failure scenarios you want to protect against.
 
