ceph stretch cluster

przemekk

New Member
Apr 4, 2024
We created a stretch cluster with 6 nodes, 4 disks in each.

This rule only works with 3 or 4 replicas:

rule StretchRuleReadRandom {
    id 2
    type replicated
    step take default
    step choose firstn 0 type datacenter
    step chooseleaf firstn 2 type host
    step emit
}

This rule, however, is not working:

rule W1 {
    id 3
    type replicated
    step take W1
    step chooseleaf firstn 0 type host
    step emit
}
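
For reference, what each rule actually maps can be checked offline with crushtool (a sketch; the file names are only placeholders):

# dump and decompile the current CRUSH map
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# show which OSDs rule id 2 (StretchRuleReadRandom) would pick for 2 and 4 replicas
crushtool -i crushmap.bin --test --rule 2 --num-rep 2 --show-mappings
crushtool -i crushmap.bin --test --rule 2 --num-rep 4 --show-mappings

# same for rule id 3 (W1); --show-bad-mappings reports inputs the rule cannot satisfy
crushtool -i crushmap.bin --test --rule 3 --num-rep 4 --show-bad-mappings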

ID   CLASS  WEIGHT    TYPE NAME                STATUS  REWEIGHT  PRI-AFF
 -1         11.71912  root default
-19          5.85956      datacenter W1
 -3          1.95319          host ceph-w1-01
  0    ssd   0.48830              osd.0            up   1.00000  1.00000
  1    ssd   0.48830              osd.1            up   1.00000  1.00000
  2    ssd   0.48830              osd.2            up   1.00000  1.00000
 23    ssd   0.48830              osd.23           up   1.00000  1.00000
 -7          1.95319          host ceph-w1-02
  3    ssd   0.48830              osd.3            up   1.00000  1.00000
  4    ssd   0.48830              osd.4            up   1.00000  1.00000
  5    ssd   0.48830              osd.5            up   1.00000  1.00000
  6    ssd   0.48830              osd.6            up   1.00000  1.00000
-10          1.95319          host ceph-w1-03
  7    ssd   0.48830              osd.7            up   1.00000  1.00000
  8    ssd   0.48830              osd.8            up   1.00000  1.00000
  9    ssd   0.48830              osd.9            up   1.00000  1.00000
 10    ssd   0.48830              osd.10           up   1.00000  1.00000
-20          5.85956      datacenter W2
-13          1.95319          host ceph-w2-01
 11    ssd   0.48830              osd.11           up   1.00000  1.00000
 12    ssd   0.48830              osd.12           up   1.00000  1.00000
 13    ssd   0.48830              osd.13           up   1.00000  1.00000
 14    ssd   0.48830              osd.14           up   1.00000  1.00000
-16          1.95319          host ceph-w2-02
 15    ssd   0.48830              osd.15           up   1.00000  1.00000
 16    ssd   0.48830              osd.16           up   1.00000  1.00000
 17    ssd   0.48830              osd.17           up   1.00000  1.00000
 18    ssd   0.48830              osd.18           up   1.00000  1.00000
-25          1.95319          host ceph-w2-03
 19    ssd   0.48830              osd.19           up   1.00000  1.00000
 20    ssd   0.48830              osd.20           up   1.00000  1.00000
 21    ssd   0.48830              osd.21           up   1.00000  1.00000
 22    ssd   0.48830              osd.22           up   1.00000  1.00000
 
I assume you're asking why rule W1 isn't working? In short, Ceph doesn't know about locality.

Long answer:
  • Every pool has a size and a min_size (see the command sketch after this list).
    The size tells Ceph how many copies of each object there need to be; Ceph always strives for this number of copies. The min_size is the number of copies at which Ceph still allows IO: when the actual number of copies drops below min_size, the affected PG is frozen.
  • A Ceph client calculates on the fly where a particular object is (or should be), i.e. in which PG it is located.
    The client then talks to the primary OSD (the one responsible for that PG). Since there is always only one primary, its location is arbitrary.
  • The CRUSH rules define how Ceph distributes objects and their copies.
    Each rule works across a hierarchy of failure domains (these need to be defined). Your rule StretchRuleReadRandom tells Ceph to choose as many DCs as there are copies (only two exist) and then place 2 copies in each DC on different hosts.
  • In a stretched cluster with only two locations, only a pool with size=4, min_size=2 can work, as either side needs at least two copies to still facilitate IO when the other location is down.
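
As a minimal sketch of the size/min_size and PG-mapping points above (the pool name "testpool" and object name "obj1" are only examples):

# set the replica counts for a stretched pool: 4 copies total, IO still allowed at 2
ceph osd pool set testpool size 4
ceph osd pool set testpool min_size 2

# have the pool place its data according to the stretch rule
ceph osd pool set testpool crush_rule StretchRuleReadRandom

# show which PG and which OSDs (primary first) a given object maps to
ceph osd map testpool obj1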
Aside from that, you need a third location for quorum (PVE & Ceph). Otherwise both sides lose quorum when one side is down.
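
One way to give Ceph that locality awareness and a tiebreaker is its built-in stretch mode. A rough sketch, assuming monitors run on the hosts shown above plus a small third-site node (the names "tiebreaker" and "W3" are placeholders, and the rule passed to enable_stretch_mode must place 2 copies per DC, as StretchRuleReadRandom does):

# tag each monitor with its datacenter
ceph mon set_location ceph-w1-01 datacenter=W1
ceph mon set_location ceph-w2-01 datacenter=W2
ceph mon set_location tiebreaker datacenter=W3

# enable stretch mode: tiebreaker monitor, CRUSH rule to switch pools to, dividing bucket type
ceph mon enable_stretch_mode tiebreaker StretchRuleReadRandom datacenter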
 