Ceph Monitor Planning

adamb

Just looking for some opinions on my Ceph monitor setup. We are looking to deploy a 6-node cluster across two sites connected via 40Gb dark fiber with around 0.5 ms latency. Just for the sake of providing details, we are keeping 4 copies in total (2 at each site).
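Roughly what I have in mind on the pool side, as a sketch (the rule name and datacenter buckets are placeholders; this assumes the hosts are grouped under datacenter buckets in the CRUSH map, with the pool set to size=4, min_size=2):

```
# Sketch: 4 replicas, 2 per datacenter
rule replicated_two_sites {
    id 1
    type replicated
    step take default
    step choose firstn 2 type datacenter   # pick both sites
    step chooseleaf firstn 2 type host     # 2 copies on distinct hosts per site
    step emit
}
```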

My plan is to run a monitor on each Ceph front end, which gives me 3 monitors at each site. Then I was thinking of setting up a small Proxmox VM, running in a separate cluster, to act as the 7th vote. This would give us the ability to run this 7th vote at either of our two sites; for the most part it would run at our main site. If our dark fiber goes down, we still have 4 votes. If our building here were to burn down, we would fire up the 7th-vote VM at our other site and have 4 votes out there.
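(For reference, MON quorum is a simple majority, quorum = floor(n/2) + 1, so with n = 7 monitors that is floor(7/2) + 1 = 4 votes, which matches the counts above.)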

Does this 7th monitor need to be part of the ceph cluster?

My next biggest question is: is there any easy way to make this 7th monitor as lightly used as possible? I read some place that monitors with the lowest IP octet get chosen first. So if I give this monitor the highest IP, would it be the least-favored monitor?

Appreciate the input!
 
Does this 7th monitor need to be part of the ceph cluster?
What do you mean by that? A MON is by definition part of a Ceph cluster. I assume you mean whether it needs to run on one of the Ceph hosts; the answer is no. In bigger clusters the MONs run on their own dedicated hardware, and any other Ceph service just needs to be able to reach them.
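As a rough sketch, a MON can be brought up manually on a host outside PVE along the lines of the manual-deployment steps in the Ceph docs (the MON id "mon7" is a placeholder; a ceph.conf and network access to the cluster are assumed):

```
ceph auth get mon. -o /tmp/mon-keyring     # fetch the MON keyring
ceph mon getmap -o /tmp/monmap             # fetch the current monmap
ceph-mon --mkfs -i mon7 --monmap /tmp/monmap --keyring /tmp/mon-keyring
systemctl start ceph-mon@mon7              # the new MON joins the quorum
```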

My next biggest question is: is there any easy way to make this 7th monitor as lightly used as possible?
No.

I read some place that monitors with the lowest IP octet get chosen first. So if I give this monitor the highest IP, would it be the least-favored monitor?
This only applies on initial startup; after that, clients will try to use the MON with the lowest latency. In the end, it can happen that MONs from the different sites will be queried.
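You can check the ranks (rank 0 goes to the lowest IP:port) and the current leader like this:

```
ceph mon stat                            # shows MON ranks and the elected leader
ceph quorum_status --format json-pretty  # 'quorum_leader_name' plus per-MON details
```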

I assume your data centers are ~93 miles apart (light speed: 299.792 m/µs, so 0.5 ms = ~150 km distance). While latency may be stable under light load, it can look very different in production, especially when disaster strikes and things have to recover. Apart from that, every OSD is involved in writes from a client; in the worst case a client contacts a remote primary OSD and, with two copies per site, the local OSDs then get their copies delivered by that remote OSD. While a primary OSD can be pinned, it will change once that OSD is removed (for whatever reason).
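For reference, the estimate works out as d = c · t ≈ 299,792 km/s × 0.0005 s ≈ 150 km ≈ 93 miles, assuming the 0.5 ms is one-way delay at vacuum light speed; in fiber the signal travels at roughly 2/3 c, so the same latency corresponds to ~100 km.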

On a more general note, connecting two DCs and running them as one cluster creates more complexity. The whole setup has more constraints to work with and, in disaster scenarios, more points that can fail. IMHO it is better to run each cluster separately: use rbd-mirror to sync the RBD images to the remote side (and vice versa) and work with VMs / CTs through the API on both PVE clusters. In this scenario you can operate each cluster in a healthy state even when a disaster renders one DC down.
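As a rough sketch of that setup (the pool name "vmpool", the image name, and the peer names are placeholders; an rbd-mirror daemon has to run on each side):

```
# On both clusters: enable per-image mirroring on the pool
rbd mirror pool enable vmpool image
# On site A: register site B as a peer (and vice versa on site B)
rbd mirror pool peer add vmpool client.rbd-mirror-peer@site-b
# Per image: enable journal-based mirroring
rbd mirror image enable vmpool/vm-100-disk-0 journal
```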
 
I worded that wrong. In the Proxmox world, does a Ceph monitor need to be part of the Proxmox cluster to be a Ceph monitor? It seems like, to utilize the Proxmox GUI, it does.

Yep, data centers are roughly 35 miles apart. I did go down the road of two independent clusters with mirroring, but it was quite the mess. The first issue is that there is no real easy way to utilize Proxmox for it, since you guys use the default cluster name. The next, and largest, issue was that write performance is absolutely horrible on spinning disks once journaling is enabled (which is expected, IMO). This specific Ceph cluster is basically archival and will have data written to it in bulk, then rarely read. IMO the mirror setup has its own issues as well on the flip side, and would cost significantly more money and work to get going.
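For context, journal-based rbd-mirror needs the journaling image feature, which roughly doubles every write (each I/O hits the journal first, then the image), e.g.:

```
# journaling requires exclusive-lock; every write now goes to the journal too
rbd feature enable vmpool/vm-100-disk-0 exclusive-lock journaling
```

(Newer Ceph releases also offer snapshot-based mirroring, which avoids this per-write journal cost.)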
 
I worded that wrong. In the Proxmox world, does a Ceph monitor need to be part of the Proxmox cluster to be a Ceph monitor? It seems like, to utilize the Proxmox GUI, it does.
Our integration is meant for hyper-converged Ceph + PVE and utilizes our stack. But aside from the management integration, Ceph services don't need to run on PVE. The configuration and management of a separate service need to be handled separately, though.

Yep, data centers are roughly 35 miles apart. I did go down the road of two independent clusters with mirroring, but it was quite the mess. The first issue is that there is no real easy way to utilize Proxmox for it, since you guys use the default cluster name. The next, and largest, issue was that write performance is absolutely horrible on spinning disks once journaling is enabled (which is expected, IMO). This specific Ceph cluster is basically archival and will have data written to it in bulk, then rarely read. IMO the mirror setup has its own issues as well on the flip side, and would cost significantly more money and work to get going.
Apart from the performance, the Ceph cluster will use a lot of bandwidth either way, IMHO. But just keeping data available for a disaster might be possible. An unknown is how to get the 7th MON an up-to-date (or close to it) monstore on a manual failover. If it has its disk on the Ceph cluster itself, it will drag down the rest of the cluster due to the higher disk latency. A ZFS storage with storage replication (pvesr) might fit better at this point, but I have never tested such a scenario.
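As a sketch of that storage-replication idea (VMID 107 and the target node name are placeholders; the VM's disks would have to live on local ZFS):

```
# Replicate the tiebreaker VM's ZFS disks to a node at the other site every 15 min
pvesr create-local-job 107-0 nodeb --schedule "*/15"
pvesr status    # check replication state
```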

In any case, while technically possible, I wouldn't recommend it.
 

The 7th MON would run on a completely independent Proxmox cluster which utilizes high-end HPE Nimble Storage. We have a complete replicated setup at our 2nd site as well. This 7th MON is 100% independent of this Ceph solution besides being on the Ceph public network.

I guess maybe I should give more thought to running Ceph on plain old CentOS; I have enjoyed the ease that Proxmox brings to the table!
 
