Ceph pool size (is 2/1 really a bad idea?)

OK, that makes sense. I was just hopeful that I had missed something, based on aaron's post referencing 4 nodes with 2 nodes down.

Perhaps there is a way to rig it, just like we can run pvecm expected 1 to keep PVE working when it loses quorum. Is there something similar that can be done for Ceph in the 4/2 situation to make it work manually for the time being, until the down nodes are back in service? For example, setting the replicas to 2/1, as in the sketch below - but wouldn't that cause issues on node recovery?
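For reference, something like this is what I have in mind; "vm_pool" is just a placeholder for the actual pool name, and the Ceph commands would of course only work while the monitors still have quorum:

Code:
# regain Proxmox VE quorum manually on a surviving node
pvecm expected 1

# temporarily drop the Ceph pool to 2 replicas with min_size 1
ceph osd pool set vm_pool size 2
ceph osd pool set vm_pool min_size 1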

Also, when 2 out of 4 nodes are down, Ceph seems unresponsive; I cannot get any commands such as ceph mon stat or ceph -s to work.

thx
 
Thanks @Bengt Nolin for noticing the even number of MONs, and sorry that I missed the MONs.

You will always need a majority of the Monitors. In a 4-node cluster, if you run the minimum of 3 MONs, you can lose up to 2 nodes, if they are the right ones: one node with a MON and the one without any. If you want to be able to lose any 2 nodes, I am afraid you will need to add another node (and run a MON on every node) to have 5 in your cluster.
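To see which MONs are currently part of the quorum, the usual status commands are enough:

Code:
ceph mon stat
ceph quorum_status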

Since I guess it is not likely that 2 nodes will fail within a short time, one could try to remove the MON of the failed node from the monmap and create a new one on the node that did not have a MON yet, to be back up to 3. With this, any 1 node could fail again.

But this is something that needs to be tested before it goes into production, and without trying it myself I am not sure how easy that procedure would be and how the cluster will react once the first failed node comes back up. A rough sketch of the steps is below.
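A rough, untested sketch of that procedure; the node names are made up, assuming the MON of the failed node "pve2" has to go and "pve4" is the node without a MON so far:

Code:
# on a surviving node: check that the remaining MONs still have quorum
ceph mon stat

# remove the dead MON from the monmap (only possible while there is quorum)
ceph mon remove pve2
# the leftover mon entry in /etc/pve/ceph.conf may need manual cleanup as well

# on pve4: create a new MON to get back to 3
pveceph mon create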


Another thing I noticed: 2 OSDs per node. While this can work for large clusters, smaller clusters need a bit more consideration in this regard. A 4-node cluster should not be affected that much, but a 3-node cluster certainly is.

Let's focus on a 3-node cluster: with 2 OSDs per node, how will Ceph react if not a whole node, but a single OSD within a node fails?

Since there is still one OSD left in that node, Ceph will try to recover all the data that should be on that node onto the remaining OSD. This is fine if both OSDs were filled to below 50%, or rather ~40%, as the remaining OSD would then end up at about 80%, which is getting close to the default near-full warning threshold (85%).

If the OSDs contain more data than that, you are likely to run into the situation where the remaining OSD runs out of space, and then there will be trouble. Therefore, it is a good idea to have more, but smaller OSDs, so the loss of one can be handled better and the data can be spread across multiple other OSDs.

In clusters with more nodes than that, it is not as much of a problem, as the data can also be stored on other nodes while still keeping redundancy at the host level.
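To keep an eye on how full the individual OSDs are, and how much headroom the loss of one would leave:

Code:
# per-OSD utilization, grouped by host
ceph osd df tree
# pool-level usage
ceph df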
 
Empirically tested, working with 2 out of 4 nodes if they are the right ones :)

I cannot change the number of OSDs, as the servers have only 4 bays and two are needed for the system RAID - it is kind of a small cluster with limited RAM for non-demanding VMs. No issues with the OSDs; I have relatively large drives, and the pool will never be filled beyond 50% for other reasons, like networking throughput. I know how much the VMs will consume in processing/network, so I know how many of them I can put in the pool. From my tests, the cluster worked perfectly on 2 nodes.

I am running out of time as the servers need to go into production, but I will attempt to figure out how to make it work with any 2 out of 4 nodes (meaning monitors on all nodes).

Perhaps the best option would be to add a VM running on another cluster as an extra node with no storage, just an additional monitor - what do you think?

Thank you
 
Hmm, another question that comes up is how you maintain quorum for Proxmox VE itself. While there is the QDevice mechanism for that, there is no such thing for Ceph monitors. pvecm expected is also not a good approach, as it needs manual intervention and can easily lead to unexpected behavior when not handled with great care.

In that situation, I would recommend adding a smaller 5th node to the cluster. It can provide another vote for Proxmox VE and, with a Ceph MON installed, also for the Ceph cluster. Since it is part of the Proxmox VE cluster, any changes to the Ceph config file are synced via the Proxmox Cluster FS, like on any other node.
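Roughly, that would look like this; the IP address is just an example for one of the existing cluster nodes:

Code:
# on the new 5th node: join the existing Proxmox VE cluster
pvecm add 192.0.2.10

# install Ceph on it and create the additional monitor
pveceph install
pveceph mon create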

You should just make sure that any storage is only configured for the 4 full nodes, and if you plan to use HA, use HA groups to explicitly only let the VMs recover on the 4 full nodes.
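For example, with placeholder storage and node names:

Code:
# restrict an existing storage definition to the 4 full nodes
pvesm set ceph-vm --nodes pve1,pve2,pve3,pve4

# restricted HA group so VMs only recover on the 4 full nodes
ha-manager groupadd full-nodes --nodes pve1,pve2,pve3,pve4 --restricted 1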

Any other way, unfortunately, will be hacky and will cause problems in one way or another, depending on the failure scenarios you want to protect against.
 
