Ceph CRUSH map and Ceph cluster HEALTH

Playing around with a 3-node PVE cluster with Ceph:

Node1: pve-node, ceph-mon
Node2: pve-node, ceph-mon, 4 ceph osds
Node3: pve-node, ceph-mon, 4 ceph osds

In order to have 3 replicas for the data pool (distributing PGs between hosts and disks), I've defined the CRUSH map as follows:

Code:
# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7


# types
type 0 osd
type 1 sets     # custom bucket type: a group of OSDs within a host
type 2 host
type 3 rack
type 4 datacenter
type 5 region
type 6 root


# buckets


sets pve02A_ssd_set1 {
        id -4           # do not change unnecessarily
        # weight 0.920
        alg straw
        hash 0  # rjenkins1
        item osd.0 weight 0.460
        item osd.1 weight 0.460
}


sets pve02A_ssd_set2 {
        id -5           # do not change unnecessarily
        # weight 0.920
        alg straw
        hash 0  # rjenkins1
        item osd.2 weight 0.460
        item osd.3 weight 0.460
}


sets pve02B_ssd_set1 {
        id -6           # do not change unnecessarily
        # weight 0.920
        alg straw
        hash 0  # rjenkins1
        item osd.4 weight 0.460
        item osd.5 weight 0.460
}


sets pve02B_ssd_set2 {
        id -7           # do not change unnecessarily
        # weight 0.920
        alg straw
        hash 0  # rjenkins1
        item osd.6 weight 0.460
        item osd.7 weight 0.460
}




host pve02A_ssd {
        id -2           # do not change unnecessarily
        # weight 1.840
        alg straw
        hash 0  # rjenkins1
        item pve02A_ssd_set1 weight 0.920
        item pve02A_ssd_set2 weight 0.920
}


host pve02B_ssd {
        id -3           # do not change unnecessarily
        # weight 1.840
        alg straw
        hash 0  # rjenkins1
        item pve02B_ssd_set1 weight 0.920
        item pve02B_ssd_set2 weight 0.920
}


root default {
        id -1           # do not change unnecessarily
        # weight 3.680
        alg straw
        hash 0  # rjenkins1
        item pve02A_ssd weight 1.840
        item pve02B_ssd weight 1.840
}


# rules
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host


#       step choose firstn 2 type host
#       step chooseleaf firstn -2 type sets


        step emit
}




# end crush map

# ceph osd tree

# id    weight  type name                       up/down reweight
-1      3.68    root default
-2      1.84        host pve02A_ssd
-4      0.92            sets pve02A_ssd_set1
0       0.46                osd.0               up      1
1       0.46                osd.1               up      1
-5      0.92            sets pve02A_ssd_set2
2       0.46                osd.2               up      1
3       0.46                osd.3               up      1
-3      1.84        host pve02B_ssd
-6      0.92            sets pve02B_ssd_set1
4       0.46                osd.4               up      1
5       0.46                osd.5               up      1
-7      0.92            sets pve02B_ssd_set2
6       0.46                osd.6               up      1
7       0.46                osd.7               up      1
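
For completeness, a custom map like this is edited and loaded with the standard getcrushmap/crushtool/setcrushmap round trip (file names below are just placeholders):

Code:
# pull the compiled map from the cluster and decompile it
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# ... edit crushmap.txt (buckets, rules) ...
# recompile and inject it back
crushtool -c crushmap.txt -o crushmap.new.bin
ceph osd setcrushmap -i crushmap.new.bin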

However, when I set the number of replicas to 3 or 4, I always get a HEALTH_WARN status:

# ceph -s
  cluster d8803d92-98dc-40b3-8f80-83e08b21e500
   health HEALTH_WARN 10 pgs degraded; 64 pgs stuck unclean
   monmap e9: 3 mons at {0=172.16.253.16:6789/0,1=172.16.253.15:6789/0,2=172.16.253.14:6789/0}, election epoch 18, quorum 0,1,2 2,1,0
   osdmap e104: 8 osds: 8 up, 8 in
   pgmap v247: 192 pgs, 3 pools, 0 bytes data, 0 objects
         297 MB used, 3773 GB / 3773 GB avail
               10 active+degraded
               54 active+remapped
              128 active+clean
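
(The replica count is being set per pool in the standard way; the pool name below is just an example.)

Code:
# check and change the replica count of a pool
ceph osd pool get data size
ceph osd pool set data size 3
# min_size controls how many replicas must be up to serve I/O
ceph osd pool set data min_size 1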




Could anyone point out what is wrong with my setup?

Thanks in advance!
 
To have N replicas you need at least N nodes with OSDs. In your setup you have only 2 nodes, with 4 OSDs each, so the maximum number of replicas you can have is 2.
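
You can verify this without touching the cluster by replaying the rule with crushtool (a sketch; it assumes the decompiled map from the first post is saved as crushmap.txt):

Code:
# compile the map and simulate rule 0 with 3 replicas
crushtool -c crushmap.txt -o crushmap.bin
# list placements for which the rule cannot find 3 separate hosts
crushtool -i crushmap.bin --test --rule 0 --num-rep 3 --show-bad-mappings
# with only 2 hosts under "default", every mapping comes back with just 2 OSDs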
 
To have N replicas you need at least N nodes with OSDs. In your setup you have only 2 nodes, with 4 OSDs each, so the maximum number of replicas you can have is 2.

With the standard CRUSH map you are correct.
However, with the user-defined CRUSH map from my first post, a 3-replica distribution can be achieved (I've found a solution):

Code:
...

# rules
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type sets
        step emit
}

...

Since I've defined 2 sets of disks on each host (4 sets in total), with a replica count of 3 the data will be distributed across both hosts: 2 replicas on one node (one in each of its sets) and 1 replica on the other node (in one of its sets).
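
To apply this, recompile and inject the edited map, bump the pool size, and check the PG states (a sketch; the pool name is only an example):

Code:
crushtool -c crushmap.txt -o crushmap.new.bin
ceph osd setcrushmap -i crushmap.new.bin
ceph osd pool set data size 3     # pool name is an example
ceph -s                           # PGs should settle to active+clean
ceph pg dump pgs_brief | head     # spot-check which OSDs each PG maps to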

Hope it will be useful to somebody!
 
I re-examined your crushmap. Unless I am missing something, I do not see how you can achieve more than 2 replicas in this setup. I must commend you for your effort with the crushmap; there is no doubt you have put a lot of thought into it.

May I ask what you are trying to achieve this way? Do you need to have more than 2 replicas with only 2 nodes?
 
I re-examined your crushmap. Unless I am missing something, I do not see how you can achieve more than 2 replicas in this setup. I must commend you for your effort with the crushmap; there is no doubt you have put a lot of thought into it.

With the ruleset:
step take default
step chooseleaf firstn 0 type sets

all replicas (up to the pool "size") will be distributed among the buckets of type "sets". Probably 3 replicas is not the best choice here; 4 would be a more appropriate choice.
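This is easy to check with crushtool: with 4 "sets" buckets in the map, 3 replicas always land in 3 of the 4 sets (so one host carries two of them), while 4 replicas use all 4 sets, one per replica (a sketch against the recompiled map from above):

Code:
# simulate the sets-based rule with 3 and with 4 replicas
crushtool -i crushmap.new.bin --test --rule 0 --num-rep 3 --show-utilization
crushtool -i crushmap.new.bin --test --rule 0 --num-rep 4 --show-utilization
# add --show-mappings to print the chosen OSDs for every PG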

May I ask what you are trying to achieve this way? Do you need to have more than 2 replicas with only 2 nodes?

I'm trying to achieve a kind of RAID1 on each node (distributing PGs not only between the nodes, but also between the disks/OSDs within each node).
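
For what it's worth, the two steps commented out in the original rule describe exactly that layout: with a pool size of 4 they put 2 replicas on each host, one in each set. An untested sketch as a separate rule (the rule name and ruleset number are arbitrary):

Code:
rule replicated_per_node_mirror {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take default
        step choose firstn 2 type host        # pick both hosts
        step chooseleaf firstn -2 type sets   # per host: num_rep-2 = 2 sets, one OSD from each
        step emit
}

A pool would then be pointed at the new rule with something like ceph osd pool set <pool> crush_ruleset 1.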
 
Hello everybody.

After a quick look at your configuration, everything actually looks fine to me: the warning you get is not that bad, it just means that Ceph noticed you changed the configuration from 2 to 3 copies.

The system is currently self-balancing; you just have to wait for it to complete the process.
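
To follow the recovery, the standard commands are enough:

Code:
ceph -w                        # live stream of cluster and PG state changes
ceph health detail             # shows which PGs are degraded/unclean and why
ceph pg dump_stuck unclean     # lists PGs stuck in an unclean state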
 