[SOLVED] CEPH - 4 servers plus monitor

IvanF

New Member
Mar 9, 2023
Hello.

Sorry, I know it's been discussed more than once, but I'm still confused how to set it up correctly.

I have 4 identical servers: 2 in each of 2 datacenters.
Plus 1 smaller one, which will be in the third location, acting only as a cluster quorum vote and Ceph monitor.

The cluster works great. I have also set up groups for the VMs so that they are either bound to a specific datacenter (clustered VMs) or can move across all 4 DC servers (simple VMs).

I have a problem with Ceph. Everything works OK until 2 servers go down (a simulated outage of 1 datacenter).
I built Ceph with the default 3/2 parameters in Proxmox (the Ceph pool as well).
Each of these 4 servers has 5 OSDs plus a separate disk for DB/WAL.
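For context, the pool was created with the Proxmox defaults, which corresponds to roughly this on the CLI (the pool name is just an example):

Code:
pveceph pool create vm-pool --size 3 --min_size 2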

Where am I going wrong? What have I not understood?

Thank you very much for every opinion.
 
This is a typical "Split Brain" scenario:
Imagine only the link between the two DCs goes down: to each DC it would look as if it were the only one surviving, and both would keep providing services.

If each DC now accepts writes to the data, you might end up with data corruption once the link comes back up (the mirrored data was changed differently in each DC).


So Ceph is protecting you from data corruption by not serving data when out of quorum.
(edit: or maybe it's even Proxmox itself fencing the hosts, as it also relies on quorum!? https://pve.proxmox.com/pve-docs/chapter-ha-manager.html#_how_proxmox_ve_fences)
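To see what is actually in quorum during such a simulated outage, the standard status commands help, e.g.:

Code:
pvecm status          # Proxmox cluster quorum
ceph quorum_status    # Ceph monitor quorum
ceph -s               # overall Ceph health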

I am not yet an expert on Ceph and CRUSH maps with custom placement rules.
If you had 5 nodes, the DC with 3 nodes would still be up (but it might hold only 1 copy of the data, if the other 2 copies resided in the failed DC with 2 nodes).
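If you want to check where the copies of a given object actually land, you can ask Ceph for the mapping (pool and object names are placeholders):

Code:
ceph osd map <pool-name> <object-name>   # shows the PG and its acting set of OSDs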



I guess there are good blog posts on that topic; here are some links to the official docs:

https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/
When a cluster encounters monitor-related troubles there’s a tendency to panic, and sometimes with good reason. Losing one or more monitors doesn’t necessarily mean that your cluster is down, so long as a majority are up, running, and form a quorum.


https://docs.ceph.com/en/quincy/rados/configuration/common/
Production Ceph clusters typically provision a minimum of three Ceph Monitor daemons to ensure availability should a monitor instance crash. A minimum of three ensures that the Paxos algorithm can determine which version of the Ceph Cluster Map is the most recent from a majority of Ceph Monitors in the quorum.

https://docs.ceph.com/en/quincy/rados/configuration/mon-config-ref/
When a Ceph Storage Cluster runs multiple Ceph Monitors for high availability, Ceph Monitors use Paxos to establish consensus about the master cluster map. A consensus requires a majority of monitors running to establish a quorum for consensus about the cluster map (e.g., 1; 2 out of 3; 3 out of 5; 4 out of 6; etc.).
 
Hmm. Split-brain? I thought the 5th Ceph monitor on the small server in location 3 was supposed to prevent this. But apparently that's not enough, because that's exactly how it behaves.

But I have 5 nodes: 2+2+1. The last one is only for cluster quorum and a Ceph monitor.

My problem is when 3 monitors are UP and 2 are DOWN.

In any case, thanks for the response and the useful links for studying.
 
You are right, my mind basically ignored your mention of that fifth node because you wrote "will" as in the future...
Plus 1 smaller one, which will be in the third location
So you already know about Split Brain, sorry ;-)

If you have only 3 copies and the DC holding 2 of them goes down, you're left with 1 copy, which is below min_size (2), so Ceph stops serving I/O.
You would have to run 4/2 instead of 3/2 to make sure you always have 2 copies in the remaining DC.
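Changing an existing pool should be possible on the fly with something like this (replace <pool> with your pool's name):

Code:
ceph osd pool set <pool> size 4
ceph osd pool set <pool> min_size 2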

Here it's explained exactly for your situation, with mon.e as the 5th node working as tiebreaker in a 3rd location:
https://docs.ceph.com/en/latest/rados/operations/stretch-mode/
...
Next, generate a CRUSH rule which will place 2 copies in each data center. This will require editing the CRUSH map directly:
...
Pools will increase in size from the default 3 to 4, expecting 2 copies in each site.
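From that doc, the CRUSH rule placing 2 copies in each datacenter looks roughly like this (the bucket type "datacenter" must exist in your CRUSH hierarchy):

Code:
rule stretch_rule {
    id 1
    type replicated
    step take default
    step choose firstn 0 type datacenter
    step chooseleaf firstn 2 type host
    step emit
}

Stretch mode with the tiebreaker monitor is then enabled with commands along these lines (site names are the doc's examples):

Code:
ceph mon set election_strategy connectivity
ceph mon set_location e datacenter=site3
ceph mon enable_stretch_mode e stretch_rule datacenter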
 
I apologize, I don't speak English well. I use a translator and sometimes it is not 100% correct.

So I'll use 4/2 instead of the default 3/2, and I will study the necessary CRUSH modifications.

Thanks for your help. I'll close the thread.
 
