Fault tolerance of Ceph

proxmox_larry

Hey guys,

I'm currently running a 4-node HA Ceph cluster and I was curious about testing its fault tolerance.
From what I have observed, the cluster itself is still accessible even with only two remaining nodes (due to a changed votequorum configuration).

What do I have to configure so that Ceph is still accessible with 2 remaining nodes?
size=4
min_size=2
Install 4 managers and 4 monitors?

Did I miss something?

Thanks!
 
Did I miss something?
The problematic case here is that you have a potential split-brain situation:
* if 2 of your hosts are enough for quorum and your cluster is still writable after 2 nodes are down, then those 2 nodes (that are down) can also become quorate -> you have 2 conflicting views of the cluster (both of which are quorate)

This is why you always need more than half of the votes, not just exactly half of them. It is also why you usually have odd-sized clusters (3, 5, 7) in such setups; both Ceph and PVE's cluster stack belong in this category.
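To make that concrete, here is a minimal sketch in Python (my own illustration, not anything from the PVE or Ceph code) of the strict-majority rule:

```python
def quorum(total_votes: int) -> int:
    """Strict majority: strictly more than half of all votes."""
    return total_votes // 2 + 1

# 4-node cluster: quorum is 3, so in a 2/2 network split
# *neither* half is quorate -- split brain is impossible.
print(quorum(4))       # 3
print(2 >= quorum(4))  # False: two nodes alone can never be quorate

# If quorum were exactly half (2 of 4), both halves of a 2/2 split
# would consider themselves quorate at the same time.
```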

I hope this explains it!
 
Thank you for your fast reply - I understand your concern when it comes to quorum. But here is the thing: votequorum regulates the number of votes my cluster needs.
When I have 4 nodes, 3 votes are needed.
When I shut down one node, it needs 2 votes from the 3 active ones.
When I shut down a second node, two nodes remain and the expected vote count is two. That's the maximum number of nodes I can lose while keeping the cluster running.

I don’t understand how the powered-off nodes could become quorate...

My goal is to keep the cluster and Ceph running even if I lose 2 out of 4 nodes.
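If the votequorum change in play is corosync's last_man_standing option (an assumption on my part, the thread doesn't name it), the behaviour described above matches how that option steps expected_votes down as nodes leave cleanly; a very rough model:

```python
def recalced_expected(expected: int, nodes_up: int) -> int:
    """Very rough model of corosync's last_man_standing: after a
    stable window in a quorate cluster, expected_votes is lowered
    to the number of nodes currently alive."""
    return min(expected, nodes_up)

expected = 4
for up in (4, 3, 2):                 # shutting nodes down one at a time
    expected = recalced_expected(expected, up)
    needed = expected // 2 + 1       # strict majority of *current* expected_votes
    print(f"up={up} expected_votes={expected} quorum={needed} quorate={up >= needed}")

# The catch: this only works when nodes leave one at a time with a
# stable window in between. A sudden 2/2 network split still sees
# expected_votes=4 on both sides, so neither side becomes quorate.
```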
 
Like I said, my 4-node cluster is still quorate when it loses two nodes, and Ceph is also accessible.
It works.

What others are saying is this: in the unlikely case that all 4 servers are up but 2 of them cannot talk to the other two, you would end up with two separate halves of your cluster, both working and both making changes. When the 4 servers are then able to talk to each other again, you would end up with corruption and a bunch of issues.

Hence it is always suggested to set things up so that this is never possible and only one "half" of the cluster could ever reach quorum.

But as others have said, if you had 4 Ceph mons and 2 went down, Ceph would go read-only. You can't force Ceph to have extra votes: a mon is more than just a vote, it manages every part of I/O and interaction with the Ceph cluster.
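The monitor math follows the same strict-majority rule; a quick sketch (the helper name is mine):

```python
def mon_failures_tolerated(monmap_size: int) -> int:
    """Ceph monitors form quorum with a strict majority of the monmap,
    so this is how many mon failures the cluster can survive."""
    return (monmap_size - 1) // 2

for mons in (3, 4, 5):
    print(f"{mons} mons -> survives {mon_failures_tolerated(mons)} mon failure(s)")

# 3 mons -> 1, 4 mons -> 1, 5 mons -> 2: a fourth mon adds no fault
# tolerance, it only raises the quorum requirement from 2 to 3.
# The same rule applies to corosync votes on the PVE side.
```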
 
I see your point, guys, that helps me a lot!

So in the case of HA there isn't really a difference between a 3- and a 4-node cluster?
The availability and performance are nearly the same?
 
I see your point, guys, that helps me a lot!

So in the case of HA there isn't really a difference between a 3- and a 4-node cluster?
The availability and performance are nearly the same?
No, indeed. For availability: with 3 nodes you can lose 1, with 4 nodes you can lose 1, with 5 nodes you can lose 2, and with 6 nodes you can lose 2.
 
Hi,
Interested in this matter too...
Is there a rough rule of thumb for calculating the fault-tolerance level of a hyperconverged setup (all nodes running OSDs as well)?
I.e.: in a hyperconverged cluster of X nodes, you can lose Y nodes and still read from and write to Ceph (regardless of Ceph reporting undersized/degraded PGs and the VMs running slowly).
I assume it also has to do with the number of OSDs per node, and if that's so, let's consider it to be 6.
Any thoughts?
Cheers,

Leo
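
Not an authoritative answer, but the usual back-of-the-envelope reasoning can be sketched, assuming replicated pools, CRUSH failure domain = host (the PVE default) and a monitor on every node; the helper and its names are my own illustration:

```python
def node_losses_while_rw(total_nodes: int, size: int, min_size: int) -> int:
    """Rough rule of thumb for a hyperconverged cluster
    (replicated pool, failure domain = host, mon on every node).

    Two independent limits apply and the stricter one wins:
      - quorum: a strict majority of mons/nodes must stay up
      - pool:   each PG needs at least min_size replicas on distinct hosts
    """
    quorum_limit = (total_nodes - 1) // 2  # node losses quorum survives
    pool_limit = size - min_size           # host losses before PGs drop below min_size
    return min(quorum_limit, pool_limit)

print(node_losses_while_rw(4, 3, 2))  # 1 (size=3/min_size=2 defaults)
print(node_losses_while_rw(5, 3, 2))  # 1 (here the pool is the limit, not quorum)
print(node_losses_while_rw(5, 4, 2))  # 2 (size=4 lets the pool match quorum)
```

Under the failure-domain-host assumption each PG keeps at most one replica per host, so the number of OSDs per node (6 in your example) mostly affects rebalancing load and recovery time, not how many whole nodes you can lose.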
 
