Ceph and a Datacenter failure

Baader-IT

Oct 29, 2018
Hello,

we are currently evaluating a new Proxmox cluster with Ceph.
Facts:
10 servers in a cluster
2 data centers
5 servers per data center
12 hard disks per server

The installation and configuration are already finished; currently only test VMs are running.

During failure tests I noticed that neither Ceph nor Proxmox keeps working properly if a datacenter fails. So I created another (virtual) Proxmox node in a third datacenter, which serves as a quorum vote for Proxmox and Ceph. With this quorum VM the cluster now keeps working if a datacenter fails.
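
A rough sketch of the vote arithmetic behind this, assuming one vote per node and a strict-majority rule (the function name is made up for illustration; it is not a corosync or Ceph API):

```python
def has_quorum(votes_alive: int, votes_total: int) -> bool:
    """A partition is quorate only if it holds a strict majority of all votes."""
    return votes_alive > votes_total // 2

# Two data centers with 5 nodes each: losing one leaves 5 of 10 votes.
print("10 nodes, one DC down:", has_quorum(5, 10))
# With the extra quorum node in a third location: 5 + 1 of 11 votes remain.
print("11 nodes, one DC down:", has_quorum(6, 11))
```

The same reasoning applies to any even split: without a tie-breaker vote, neither half can ever hold a majority on its own.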

My question:
If one datacenter is down, the cluster works properly, but the VMs aren't accessible because Ceph switches the disks to read-only.

Right now we have a Ceph pool with 3/2 (size/min_size) and a correct OSD CRUSH map (correct assignment of datacenter, rack, host, osd).

Could a pool with size 3/1 solve this problem? Any other ideas?
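
To make the question concrete, here is a small sketch of why 3/2 can block I/O. It assumes a placement group only serves I/O while at least min_size replicas are up, and it simply enumerates every way three replicas could be split over the two datacenters; the helper names are hypothetical, not Ceph calls:

```python
def surviving_copies(split, failed_dc):
    """split maps datacenter name -> replica count of one placement group."""
    return sum(n for dc, n in split.items() if dc != failed_dc)

def pg_accepts_io(split, failed_dc, min_size):
    """I/O continues only while at least min_size replicas are still up."""
    return surviving_copies(split, failed_dc) >= min_size

# Every way 3 replicas can be spread over the two datacenters.
splits = [{"RZA": a, "RZB": 3 - a} for a in range(4)]
for min_size in (2, 1):
    for split in splits:
        state = "ok" if pg_accepts_io(split, "RZA", min_size) else "blocked"
        print(f"min_size={min_size} split={split} RZA down -> I/O {state}")
```

With three copies over two sites, one site always holds at least two of them, so its loss leaves at most one copy. Dropping min_size to 1 would keep such placement groups active, but on a single remaining replica.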

Greetings
 
The first problem to solve is the matter of fencing. With Ceph this is easier since CRUSH is hierarchical: you can create datacenter-level objects and distribute replication down; see https://ceph.io/geen-categorie/manage-a-multi-datacenter-crush-map-with-the-command-line/ for more discussion. Understand that you will now have two failure domains with replication, so you will need to fit all your data in the equivalent of 20 disks TOTAL.

As for clustering Proxmox across multiple locations, I didn't know you could even do that... does corosync support multiple hierarchies?
 
I think scheduled replication would be ideal: that way the data centers are independent of each other and you don't lose everything in one shot.
 
you can create datacenter level objects

We have already created the right CRUSH map:


Code:
root@vm18111:~# ceph osd crush tree
ID  CLASS WEIGHT    TYPE NAME               
 -1       873.24554 root default             
-23       436.60767     datacenter RZA       
-27       261.96460         rack 106         
 -3        87.32153             host sv18101
  0   hdd   7.27679                 osd.0   
  1   hdd   7.27679                 osd.1   
  2   hdd   7.27679                 osd.2   
  3   hdd   7.27679                 osd.3   
  4   hdd   7.27679                 osd.4   
  5   hdd   7.27679                 osd.5   
  6   hdd   7.27679                 osd.6   
  7   hdd   7.27679                 osd.7   
  8   hdd   7.27679                 osd.8   
  9   hdd   7.27679                 osd.9   
 10   hdd   7.27679                 osd.10   
 11   hdd   7.27679                 osd.11   
 -7        87.32153             host sv18103
 24   hdd   7.27679                 osd.24   
 25   hdd   7.27679                 osd.25   
 26   hdd   7.27679                 osd.26   
 27   hdd   7.27679                 osd.27   
 28   hdd   7.27679                 osd.28   
 29   hdd   7.27679                 osd.29   
 30   hdd   7.27679                 osd.30   
 31   hdd   7.27679                 osd.31   
 32   hdd   7.27679                 osd.32   
 33   hdd   7.27679                 osd.33   
 34   hdd   7.27679                 osd.34   
 35   hdd   7.27679                 osd.35   
-11        87.32153             host sv18105
 48   hdd   7.27679                 osd.48   
 49   hdd   7.27679                 osd.49   
 50   hdd   7.27679                 osd.50   
 51   hdd   7.27679                 osd.51   
 52   hdd   7.27679                 osd.52   
 53   hdd   7.27679                 osd.53   
 54   hdd   7.27679                 osd.54   
 55   hdd   7.27679                 osd.55   
 56   hdd   7.27679                 osd.56   
 57   hdd   7.27679                 osd.57   
 58   hdd   7.27679                 osd.58   
 59   hdd   7.27679                 osd.59   
-28       174.64307         rack 107         
-15        87.32153             host sv18107
 72   hdd   7.27679                 osd.72   
 73   hdd   7.27679                 osd.73   
 74   hdd   7.27679                 osd.74   
 75   hdd   7.27679                 osd.75   
 76   hdd   7.27679                 osd.76   
 77   hdd   7.27679                 osd.77   
 78   hdd   7.27679                 osd.78   
 79   hdd   7.27679                 osd.79   
 80   hdd   7.27679                 osd.80   
 81   hdd   7.27679                 osd.81   
 82   hdd   7.27679                 osd.82   
 83   hdd   7.27679                 osd.83   
-19        87.32153             host sv18109
 96   hdd   7.27679                 osd.96   
 97   hdd   7.27679                 osd.97   
 98   hdd   7.27679                 osd.98   
 99   hdd   7.27679                 osd.99   
100   hdd   7.27679                 osd.100 
101   hdd   7.27679                 osd.101 
102   hdd   7.27679                 osd.102 
103   hdd   7.27679                 osd.103 
104   hdd   7.27679                 osd.104 
105   hdd   7.27679                 osd.105 
106   hdd   7.27679                 osd.106 
107   hdd   7.27679                 osd.107 
-24       436.63788     datacenter RZB       
-29       261.98273         rack 04-03       
 -5        87.32758             host sv18102
 12   hdd   7.27730                 osd.12   
 13   hdd   7.27730                 osd.13   
 14   hdd   7.27730                 osd.14   
 15   hdd   7.27730                 osd.15   
 16   hdd   7.27730                 osd.16   
 17   hdd   7.27730                 osd.17   
 18   hdd   7.27730                 osd.18   
 19   hdd   7.27730                 osd.19   
 20   hdd   7.27730                 osd.20   
 21   hdd   7.27730                 osd.21   
 22   hdd   7.27730                 osd.22   
 23   hdd   7.27730                 osd.23   
 -9        87.32758             host sv18104
 36   hdd   7.27730                 osd.36   
 37   hdd   7.27730                 osd.37   
 38   hdd   7.27730                 osd.38   
 39   hdd   7.27730                 osd.39   
 40   hdd   7.27730                 osd.40   
 41   hdd   7.27730                 osd.41   
 42   hdd   7.27730                 osd.42   
 43   hdd   7.27730                 osd.43   
 44   hdd   7.27730                 osd.44   
 45   hdd   7.27730                 osd.45   
 46   hdd   7.27730                 osd.46   
 47   hdd   7.27730                 osd.47   
-13        87.32758             host sv18106
 60   hdd   7.27730                 osd.60   
 61   hdd   7.27730                 osd.61   
 62   hdd   7.27730                 osd.62   
 63   hdd   7.27730                 osd.63   
 64   hdd   7.27730                 osd.64   
 65   hdd   7.27730                 osd.65   
 66   hdd   7.27730                 osd.66   
 67   hdd   7.27730                 osd.67   
 68   hdd   7.27730                 osd.68   
 69   hdd   7.27730                 osd.69   
 70   hdd   7.27730                 osd.70   
 71   hdd   7.27730                 osd.71   
-30       174.65515         rack 04-05       
-17        87.32758             host sv18108
 84   hdd   7.27730                 osd.84   
 85   hdd   7.27730                 osd.85   
 86   hdd   7.27730                 osd.86   
 87   hdd   7.27730                 osd.87   
 88   hdd   7.27730                 osd.88   
 89   hdd   7.27730                 osd.89   
 90   hdd   7.27730                 osd.90   
 91   hdd   7.27730                 osd.91   
 92   hdd   7.27730                 osd.92   
 93   hdd   7.27730                 osd.93   
 94   hdd   7.27730                 osd.94   
 95   hdd   7.27730                 osd.95   
-21        87.32758             host sv18110
108   hdd   7.27730                 osd.108 
109   hdd   7.27730                 osd.109 
110   hdd   7.27730                 osd.110 
111   hdd   7.27730                 osd.111 
112   hdd   7.27730                 osd.112 
113   hdd   7.27730                 osd.113 
114   hdd   7.27730                 osd.114 
115   hdd   7.27730                 osd.115 
116   hdd   7.27730                 osd.116 
117   hdd   7.27730                 osd.117 
118   hdd   7.27730                 osd.118 
119   hdd   7.27730                 osd.119


You need to have monitors in 3 datacenters to keep quorum.
We have 6 monitors split across the 2 datacenters and 1 monitor in another datacenter for quorum.
 
We have already created the right CRUSH map:
Then all you have to do is edit your CRUSH rules to use your objects, like so:
Code:
rule replicated_ruleset {
  ruleset X
  type replicated
  min_size 2
  max_size 3
  step take default
  step choose firstn 2 type datacenter
  step chooseleaf firstn -1 type host
  step emit
}
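
For anyone puzzling over the firstn numbers in a rule like this: as far as I understand the CRUSH documentation, a positive number asks for exactly that many buckets, 0 asks for as many as the pool size, and a negative number asks for the pool size minus that many. Below is a sketch of that arithmetic only, assuming two datacenters and that the result is cut off at the pool size; the helper names are invented and this is not Ceph code:

```python
def resolve_firstn(num: int, pool_size: int) -> int:
    """How many buckets a 'firstn <num>' step asks for, given the pool size."""
    if num > 0:
        return num
    return pool_size + num  # 0 -> pool_size, negative -> pool_size - |num|

def replicas_per_datacenter(pool_size, dc_num, host_num, datacenters_available=2):
    """Replica count per chosen datacenter, in order, truncated to pool_size."""
    dcs = min(resolve_firstn(dc_num, pool_size), datacenters_available)
    hosts_each = resolve_firstn(host_num, pool_size)
    counts, left = [], pool_size
    for _ in range(dcs):
        take = min(hosts_each, left)
        counts.append(take)
        left -= take
    return counts

# 'choose firstn 2 type datacenter' + 'chooseleaf firstn -1 type host', size 3
print(replicas_per_datacenter(3, 2, -1))
```

Under these assumptions a size 3 pool ends up with two copies in the first datacenter chosen and one in the other, so it is worth checking the result against min_size for the case where the datacenter holding two copies is the one that fails.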

I am curious about how to fail proxmox on a datacenter. If one DC becomes fenced from the other DC AND the quorum node, how does it know to fence all surviving nodes off since there are still enough nodes to maintain quorum within the "fenced" datacenter? Is there a way to identify a node as a supernode for the purposes of quorum?
 
No, there is no supernode for quorum. As I said before, we have one additional node (in a third datacenter) which doesn't have any storage or VMs on it; it exists only to provide one more cluster vote. If one DC becomes fenced, the 5 hosts from the 1st DC and the 1 node from the 3rd DC form the majority, so the cluster should work as expected.
 
@TwiX
I can only say that we use 25G bandwidth for the connection between the DCs.

I don't think there is a required bandwidth; it only depends on the latency, which has to be < 2 ms.
 
Thanks.

Did you solve your issue?
Was it related to Ceph min replicas?
 
Our solution is:
Code:
# rules

rule replicated_rule {
  id 0
  type replicated
  min_size 2
  max_size 4
  step take default
  step choose firstn 0 type datacenter
  step chooseleaf firstn 2 type host
  step emit
}
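
A toy illustration of what this rule is meant to achieve, assuming the pool size is set to 4 (the thread only shows max_size 4) with min_size 2. Random sampling stands in for CRUSH's deterministic hashing, so this only shows the shape of the result: two distinct hosts in each datacenter, and two copies left whichever datacenter fails. Host names are taken from the crush tree posted earlier; the function names are made up:

```python
import random

HOSTS = {
    "RZA": ["sv18101", "sv18103", "sv18105", "sv18107", "sv18109"],
    "RZB": ["sv18102", "sv18104", "sv18106", "sv18108", "sv18110"],
}

def place_pg(rng, hosts_per_dc=2):
    """Mimic the rule: take every datacenter, then pick distinct hosts in each."""
    return {dc: rng.sample(hosts, hosts_per_dc) for dc, hosts in HOSTS.items()}

def copies_left(placement, failed_dc):
    """Replicas still reachable after one whole datacenter is lost."""
    return sum(len(h) for dc, h in placement.items() if dc != failed_dc)

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
for pg in range(3):
    placement = place_pg(rng)
    print(pg, placement, {dc: copies_left(placement, dc) for dc in HOSTS})
```

Every placement keeps 2 copies after a datacenter loss, which meets min_size 2, at the cost of storing each object four times.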
 
This is an interesting thread. I have a couple of questions:

1- The rules above: were those put in a .conf file, or where? The format does not look the same as the output in my next question.

2- Could you post the output from
Code:
ceph osd crush rule dump
 
@RobFantini
1: this is the rule set from the Proxmox GUI :)

2:
Code:
{
    "rule_id": 0,
    "rule_name": "replicated_rule",
    "ruleset": 0,
    "type": 1,
    "min_size": 2,
    "max_size": 4,
    "steps": [
        {
            "op": "take",
            "item": -1,
            "item_name": "default"
        },
        {
            "op": "choose_firstn",
            "num": 0,
            "type": "datacenter"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 2,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}
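
Since the two formats look different, here is a small sketch that maps the JSON steps back to the text form of the rule. It assumes the dump is a JSON array of rules shaped like the excerpt above (the excerpt is wrapped in a list here for that reason) and only handles the step types that appear in it; the function name is made up:

```python
import json

DUMP = """
[{"rule_id": 0, "rule_name": "replicated_rule", "ruleset": 0, "type": 1,
  "min_size": 2, "max_size": 4,
  "steps": [
    {"op": "take", "item": -1, "item_name": "default"},
    {"op": "choose_firstn", "num": 0, "type": "datacenter"},
    {"op": "chooseleaf_firstn", "num": 2, "type": "host"},
    {"op": "emit"}]}]
"""

def step_to_text(step):
    """Render one JSON step as the matching line of the text rule."""
    op = step["op"]
    if op == "take":
        return f"step take {step['item_name']}"
    if op == "emit":
        return "step emit"
    mode, _, how = op.partition("_")  # e.g. "chooseleaf_firstn"
    return f"step {mode} {how} {step['num']} type {step['type']}"

for rule in json.loads(DUMP):
    print(f"rule {rule['rule_name']} {{")
    print(f"  id {rule['rule_id']}")
    for step in rule["steps"]:
        print("  " + step_to_text(step))
    print("}")
```

The step lines it prints match the ones in the text rule, so both outputs describe the same rule.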
 
Good to see others doing this as well. We also run an almost identical setup. It has been in production for almost 6 months now. We do lots of failover testing and everything has been rock solid. We have 40Gbps between the two sites with 1ms latency.
 
@adamb
Nice to hear that!

Just one question:
Do you use HA?

We activated HA and have had some problems after a lost node rejoined the cluster.