Understanding Ceph Failure Modes

gdi2k

Active Member
Aug 13, 2016
I was reading the thread on recent Ceph benchmarks stickied in this forum, and saw some comments from PigLover about how the author of the benchmarks "make the claim about being able to run a 3-node cluster and still access the data with a node OOS. While it is "true", it is also dangerous guidance and shouldn't be given without a caution - even in a benchmarking note."

I was wondering why this is exactly. We have been running a 3-node cluster for a couple of years without issue, but if it is dangerous, we could add a fourth node.

(Later there is also discussion of how an odd number of cluster nodes is better, so are we then in for a 5-node cluster?)

We run a stand-alone backup Proxmox server that has all the VMs replicated to it on a daily basis, should the whole cluster fail. Is that a reasonable strategy? It is powerful enough to run all our critical VMs.
 
I was reading the thread on recent Ceph benchmarks stickied in this forum, and saw some comments from PigLover about how the author of the benchmarks "make the claim about being able to run a 3-node cluster and still access the data with a node OOS. While it is "true", it is also dangerous guidance and shouldn't be given without a caution - even in a benchmarking note."

I was wondering why this is exactly. We have been running a 3-node cluster for a couple of years without issue, but if it is dangerous, we could add a fourth node.
As one of the posters in the original thread: the argument is subject to your own risk evaluation and, in my opinion, does not belong in a benchmark paper. The statement on its own is correct, as you are still able to access the data on the cluster while one node is out of service [1].

The above assumes that the Ceph pool keeps three replicas (size/min_size = 3/2), as by default the replicas are distributed at the host level.
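If you want to verify this on your own cluster, something along these lines should show the replica settings (the pool name "rbd" is only an example, use your own pool name):

# list all pools with their size/min_size
ceph osd pool ls detail

# or query a single pool
ceph osd pool get rbd size
ceph osd pool get rbd min_size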

On a three-node cluster, the cluster stays in a degraded state until the admin handles it. This leaves a bigger window for another failure to happen on the remaining nodes. With a fourth node, the cluster can rebalance and get back to a healthy state while one node is out of service.
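To see whether the cluster is currently sitting in such a degraded state, roughly:

# overall cluster state, including degraded/undersized placement groups
ceph status

# more detail on what exactly is degraded
ceph health detail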

(Later there is also discussion of how an odd number of cluster nodes is better, so are we then in for a 5-node cluster?)
This has to do with the number of nodes needed to have quorum (consensus) [2] [3]: a majority of floor(N/2) + 1 nodes is required. To give an example: with three nodes, only one node can fail; any subsequent node failure results in loss of quorum (2 out of 3 are needed). With four nodes, still only one node can fail, since the majority is 3 out of 4. Adding a fifth node allows two nodes to fail while still keeping quorum (3 out of 5).

But as the discussion is centered around the number of nodes needed for quorum, the above example applies to Ceph's MONs (and to the PVE cluster itself), not to Ceph OSDs. OSD-wise, you could (depending on the replica count) start with a single server and a single OSD. In a hyper-converged setup (like yours), this is definitely something you need to be aware of.
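As a rough sketch of how to check both quorums on a hyper-converged PVE/Ceph node (output details vary by version):

# PVE/corosync quorum: expected votes, total votes, quorate yes/no
pvecm status

# Ceph MON quorum: which monitors are currently in quorum
ceph quorum_status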

We run a stand-alone backup Proxmox server that has all the VMs replicated to it on a daily basis, should the whole cluster fail. Is that a reasonable strategy? It is powerful enough to run all our critical VMs.
It is always good to have a recovery/backup strategy. As written above, whether this is a reasonable strategy is up to your own risk/resilience planning [4].

[1] https://en.wikipedia.org/wiki/High_availability, http://www.ha-cc.org/en/high_availability/
[2] https://en.wikipedia.org/wiki/Byzantine_fault_tolerance
[3] https://en.wikipedia.org/wiki/Quorum_(distributed_computing)
[4] https://en.wikipedia.org/wiki/IT_risk, https://en.wikipedia.org/wiki/Risk_matrix
 
Alwin, many thanks for taking the time to explain things in such detail, and for the references, that's extremely helpful. It sounds like 5 servers is a nice luxury and something for us to work towards, but for now, a 3-way cluster with an independent warm standby is a workable level of risk for us.

In all these scenarios with bigger clusters etc., I am always worried that the whole cluster may go down due to some problem with the glue that sticks the cluster together (corosync), or more likely, a misconfiguration thereof. In our early days deploying our first Proxmox VE cluster, it failed miserably due to something I did: the whole cluster went down and I could not recover it fast enough (even with a support contract). Since then we run a completely independent backup server alongside.
 
In all these scenarios with bigger clusters etc., I am always worried that the whole cluster may go down due to some problem with the glue that sticks the cluster together (corosync), or more likely, a misconfiguration thereof. In our early days deploying our first Proxmox VE cluster, it failed miserably due to something I did: the whole cluster went down and I could not recover it fast enough (even with a support contract). Since then we run a completely independent backup server alongside.
It is always a good idea to have a backup. ;)

Just a word in general to anyone reading :)
A cluster gives you redundancy and fail-over capabilities, but it's not a substitute for backups. Backups must be stored off-system/off-site to have any value when disaster strikes. Do test your backup/restore procedure regularly! Validate the result (audit)! Use different methods of backup/restore for different data. Then, live long and prosper.
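As a minimal sketch of such a test (assuming a VM with ID 100, a backup storage called "backup-nfs" and a free VM ID 9100 for the restore; the names, flags and paths are examples and depend on your PVE version and storage setup):

# back up VM 100 to the backup storage
vzdump 100 --storage backup-nfs --mode snapshot --compress zstd

# restore the resulting archive under a new VM ID and boot it to verify
qmrestore /mnt/pve/backup-nfs/dump/vzdump-qemu-100-<timestamp>.vma.zst 9100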
 
