Ceph Stretch cluster - Crush rules assistance

KKWait

New Member
Feb 14, 2024
Hi!
We're validating a stretched cluster design along these lines:
- Datacenter 1
  - 3 PVE nodes (Dell R650), each with 5 NVMe drives (1 OSD per disk)
- Datacenter 2
  - 3 PVE nodes (Dell R650), each with 5 NVMe drives (1 OSD per disk)
- Datacenter 3
  - 1 virtual PVE node as witness

So far so good: stretch mode works well with the following (stretch-cluster-specific) configuration:

Code:
ceph osd crush add-bucket dc1 datacenter
ceph osd crush add-bucket dc2 datacenter
ceph osd crush move dc1 root=default
ceph osd crush move dc2 root=default
ceph osd crush move dc1pve1 datacenter=dc1
ceph osd crush move dc1pve2 datacenter=dc1
ceph osd crush move dc1pve3 datacenter=dc1
ceph osd crush move dc2pve1 datacenter=dc2
ceph osd crush move dc2pve2 datacenter=dc2
ceph osd crush move dc2pve3 datacenter=dc2
ceph mon set_location dc1pve1 datacenter=dc1
ceph mon set_location dc1pve2 datacenter=dc1
ceph mon set_location dc1pve3 datacenter=dc1
ceph mon set_location dc2pve1 datacenter=dc2
ceph mon set_location dc2pve2 datacenter=dc2
ceph mon set_location dc2pve3 datacenter=dc2
ceph mon set_location dc3pve1 datacenter=dc3
ceph mon set election_strategy connectivity
ceph mon enable_stretch_mode dc3pve1 stretch_rule datacenter

Plus the following CRUSH rule:
Code:
rule stretch_rule {
  id 2
  type replicated
  step take default
  step choose firstn 0 type datacenter
  step chooseleaf firstn 2 type host
  step emit
}

And a pool with 4/2 replication, 128 PGs, and stretch_rule as its replication policy. Coupled with a proper HA group, losing a whole datacenter restarts all VMs on the other datacenter, which is exactly what we needed.
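
For reference, a minimal sketch of how such a pool can be created from the CLI (the pool name is just an example):
Code:
# pool name is an example; PG count and 4/2 sizing as described above
ceph osd pool create stretched_pool 128 128 replicated stretch_rule
ceph osd pool set stretched_pool size 4
ceph osd pool set stretched_pool min_size 2
ceph osd pool application enable stretched_pool rbd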

Now I'd like to add two more pools with datacenter affinity (3/2 replication), so that a VM sticks to the OSDs of its own datacenter. This is meant for "natively HA" applications such as web servers, Active Directory, and so on. I tried adding the following CRUSH rules:
Code:
rule dc1_rule {
    id 3
    type replicated
    step take dc1
    step chooseleaf firstn 3 type host
    step emit
}
rule dc2_rule {
    id 4
    type replicated
    step take dc2
    step chooseleaf firstn 3 type host
    step emit
}
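
For reference, these per-DC rules can also be sanity-checked offline with crushtool before creating the pools (a sketch; the map file name is just an example):
Code:
ceph osd getcrushmap -o crushmap.bin
crushtool -i crushmap.bin --test --rule 3 --num-rep 3 --show-mappings | head
crushtool -i crushmap.bin --test --rule 3 --num-rep 3 --show-bad-mappings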

I then created 2 new pools (3/2, 64 PGs), each using one of these CRUSH rules. Unfortunately, Ceph health reports those 128 PGs stuck as clean+peered, and they never become active.
Code:
root@dc1pve1:~# ceph health detail
HEALTH_WARN Reduced data availability: 128 pgs inactive
[WRN] PG_AVAILABILITY: Reduced data availability: 128 pgs inactive
    pg 25.26 is stuck inactive for 3h, current state clean+peered, last acting [4,9,2]
    pg 25.28 is stuck inactive for 3h, current state clean+peered, last acting [3,5,9]
    pg 25.29 is stuck inactive for 3h, current state clean+peered, last acting [1,9,7]
    pg 25.2a is stuck inactive for 3h, current state clean+peered, last acting [0,11,5]
    pg 25.2b is stuck inactive for 3h, current state clean+peered, last acting [10,5,2]
    pg 25.2c is stuck inactive for 3h, current state clean+peered, last acting [2,6,9]
    pg 25.2d is stuck inactive for 3h, current state clean+peered, last acting [10,5,3]
    pg 25.2e is stuck inactive for 3h, current state clean+peered, last acting [7,11,2]
    pg 25.2f is stuck inactive for 3h, current state clean+peered, last acting [2,10,5]
    pg 25.30 is stuck inactive for 3h, current state clean+peered, last acting [10,0,5]
    pg 25.31 is stuck inactive for 3h, current state clean+peered, last acting [6,11,0]
    pg 25.32 is stuck inactive for 3h, current state clean+peered, last acting [5,0,10]
    pg 25.33 is stuck inactive for 3h, current state clean+peered, last acting [9,4,0]
    pg 25.34 is stuck inactive for 3h, current state clean+peered, last acting [9,7,1]
    pg 25.35 is stuck inactive for 3h, current state clean+peered, last acting [4,9,3]
    pg 25.36 is stuck inactive for 3h, current state clean+peered, last acting [0,11,6]
    pg 25.37 is stuck inactive for 3h, current state clean+peered, last acting [5,11,2]
    pg 25.38 is stuck inactive for 3h, current state clean+peered, last acting [8,2,7]
    pg 25.39 is stuck inactive for 3h, current state clean+peered, last acting [4,0,9]
    pg 25.3a is stuck inactive for 3h, current state clean+peered, last acting [1,8,4]
    pg 25.3b is stuck inactive for 3h, current state clean+peered, last acting [9,5,1]
    pg 25.3c is stuck inactive for 3h, current state clean+peered, last acting [6,3,11]
    pg 25.3d is stuck inactive for 3h, current state clean+peered, last acting [11,5,3]
    pg 25.3e is stuck inactive for 3h, current state clean+peered, last acting [6,3,8]
    pg 25.3f is stuck inactive for 3h, current state clean+peered, last acting [7,0,11]
    pg 26.24 is stuck inactive for 3h, current state clean+peered, last acting [13,19,17]
    pg 26.25 is stuck inactive for 3h, current state clean+peered, last acting [13,20,18]
    pg 26.28 is stuck inactive for 3h, current state clean+peered, last acting [20,13,18]
    pg 26.29 is stuck inactive for 3h, current state clean+peered, last acting [22,15,13]
    pg 26.2a is stuck inactive for 3h, current state clean+peered, last acting [19,12,15]
    pg 26.2b is stuck inactive for 3h, current state clean+peered, last acting [16,23,20]
    pg 26.2c is stuck inactive for 3h, current state clean+peered, last acting [14,19,17]
    pg 26.2d is stuck inactive for 3h, current state clean+peered, last acting [19,15,14]
    pg 26.2e is stuck inactive for 3h, current state clean+peered, last acting [21,18,14]
    pg 26.2f is stuck inactive for 3h, current state clean+peered, last acting [12,15,21]
    pg 26.30 is stuck inactive for 3h, current state clean+peered, last acting [17,21,14]
    pg 26.31 is stuck inactive for 3h, current state clean+peered, last acting [15,14,19]
    pg 26.32 is stuck inactive for 3h, current state clean+peered, last acting [12,17,21]
    pg 26.33 is stuck inactive for 3h, current state clean+peered, last acting [17,20,13]
    pg 26.34 is stuck inactive for 3h, current state clean+peered, last acting [12,19,15]
    pg 26.35 is stuck inactive for 3h, current state clean+peered, last acting [21,13,16]
    pg 26.36 is stuck inactive for 3h, current state clean+peered, last acting [13,19,18]
    pg 26.37 is stuck inactive for 3h, current state clean+peered, last acting [19,17,13]
    pg 26.38 is stuck inactive for 3h, current state clean+peered, last acting [13,15,19]
    pg 26.39 is stuck inactive for 3h, current state clean+peered, last acting [16,13,21]
    pg 26.3a is stuck inactive for 3h, current state clean+peered, last acting [14,20,17]
    pg 26.3b is stuck inactive for 3h, current state clean+peered, last acting [20,15,12]
    pg 26.3c is stuck inactive for 3h, current state clean+peered, last acting [16,23,22]
    pg 26.3d is stuck inactive for 3h, current state clean+peered, last acting [23,21,18]
    pg 26.3e is stuck inactive for 3h, current state clean+peered, last acting [20,17,14]
    pg 26.3f is stuck inactive for 3h, current state clean+peered, last acting [12,22,15]

Here's the OSD tree :
Code:
root@dc1pve1:~# ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME                 STATUS  REWEIGHT  PRI-AFF
 -1         2.34445  root default                                       
-15         1.17223      datacenter dc1                               
 -3         0.39075          host dc1pve1                           
  0    nvme  0.09769              osd.0             up   1.00000  1.00000
  1    nvme  0.09769              osd.1             up   1.00000  1.00000
  2    nvme  0.09769              osd.2             up   1.00000  1.00000
  3    nvme  0.09769              osd.3             up   1.00000  1.00000
 -5         0.39075          host dc1pve2                           
  4    nvme  0.09769              osd.4             up   1.00000  1.00000
  5    nvme  0.09769              osd.5             up   1.00000  1.00000
  6    nvme  0.09769              osd.6             up   1.00000  1.00000
  7    nvme  0.09769              osd.7             up   1.00000  1.00000
 -7         0.39075          host dc1pve3                           
  8    nvme  0.09769              osd.8             up   1.00000  1.00000
  9    nvme  0.09769              osd.9             up   1.00000  1.00000
 10    nvme  0.09769              osd.10            up   1.00000  1.00000
 11    nvme  0.09769              osd.11            up   1.00000  1.00000
-16         1.17223      datacenter dc2                               
 -9         0.39075          host dc2pve1                           
 12    nvme  0.09769              osd.12            up   1.00000  1.00000
 13    nvme  0.09769              osd.13            up   1.00000  1.00000
 14    nvme  0.09769              osd.14            up   1.00000  1.00000
 23    nvme  0.09769              osd.23            up   1.00000  1.00000
-11         0.39075          host dc2pve2                           
 15    nvme  0.09769              osd.15            up   1.00000  1.00000
 16    nvme  0.09769              osd.16            up   1.00000  1.00000
 17    nvme  0.09769              osd.17            up   1.00000  1.00000
 18    nvme  0.09769              osd.18            up   1.00000  1.00000
-13         0.39075          host dc2pve3                           
 19    nvme  0.09769              osd.19            up   1.00000  1.00000
 20    nvme  0.09769              osd.20            up   1.00000  1.00000
 21    nvme  0.09769              osd.21            up   1.00000  1.00000
 22    nvme  0.09769              osd.22            up   1.00000  1.00000

Maybe a clue, though I can't tell whether it's relevant: I can see blacklisted connections, and the IP corresponds to the witness PVE node:
Code:
root@dc1pve1:~# ceph osd blacklist ls
192.168.114.2:0/3316608452 2024-11-27T18:19:13.728470+0100
192.168.114.2:0/5968008 2024-11-27T18:19:13.728470+0100
192.168.114.2:0/1051452434 2024-11-27T18:19:13.728470+0100
192.168.114.2:6817/18177 2024-11-27T18:19:13.728470+0100
192.168.114.2:0/2390757461 2024-11-27T18:19:13.728470+0100
192.168.114.2:6816/18177 2024-11-27T18:19:13.728470+0100
192.168.114.2:0/2298991455 2024-11-27T18:18:16.144679+0100
192.168.114.2:0/3816873361 2024-11-27T18:18:16.144679+0100
192.168.114.2:0/3896422733 2024-11-27T18:18:16.144679+0100
192.168.114.2:0/1685705789 2024-11-27T18:18:16.144679+0100
192.168.114.2:6817/1068 2024-11-27T18:18:16.144679+0100
192.168.114.2:6816/1068 2024-11-27T18:18:16.144679+0100
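
In case it matters, I suppose one of those entries could also be removed manually with something like:
Code:
ceph osd blacklist rm 192.168.114.2:0/3316608452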

I tried playing with the PG counts (512 for the stretched pool, 256 for the DC pools), but it made no difference.
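
(That is, something along these lines; the pool name is an example:)
Code:
ceph osd pool set dc1_pool pg_num 256
ceph osd pool set dc1_pool pgp_num 256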

Does anyone see what I'm missing?

Thanks!
 
You have enabled stretch mode for the whole cluster. This means that all PGs from all pools need to be distributed across both DCs before they become active.

Quoting the docs https://docs.ceph.com/en/squid/rados/operations/stretch-mode/ :

When stretch mode is enabled, PGs will become active only when they peer across data centers (or across whichever CRUSH bucket type was specified), assuming both are alive. Pools will increase in size from the default 3 to 4, and two copies will be expected in each site.
What you want can be achieved by enabling stretch mode only for specific pools: https://docs.ceph.com/en/squid/rado...?highlight=stretched#individual-stretch-pools
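
As a quick check, the monmap should show whether stretch mode is flagged cluster-wide, something like this (the exact fields vary by release):
Code:
ceph mon dump | grep -i -e stretch -e tiebreaker -e location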
 
Thanks for the pointers! So I assume I have to disable stretch mode globally before applying per-pool stretch, but I cannot find any command to disable this mode (i.e. something like ceph mon disable_stretch_mode). Do you know if that's possible?
In addition, can you confirm that the CRUSH rules shown above are correct for what I want to achieve?
 
Thanks for your feedback. I found more detailed documentation here https://www.ibm.com/docs/en/storage-ceph/8.0?topic=storage-stretch-mode stating:

Stretch mode limitations

  • It is not possible to exit from stretch mode once it is entered.

So I've reinstalled the Proxmox test cluster, but the command to set a pool as stretched is not found:
Code:
root@dc1pve1:~# ceph osd pool stretch set stretch_pool 1 2 datacenter stretch_rule 4 2
no valid command found; 10 closest matches:
osd pool stats [<pool_name>]
osd pool scrub <who>...
osd pool deep-scrub <who>...
osd pool repair <who>...
osd pool force-recovery <who>...
osd pool force-backfill <who>...
osd pool cancel-force-recovery <who>...
osd pool cancel-force-backfill <who>...
osd pool autoscale-status [--format <value>]
osd pool set threshold <num:float>
Error EINVAL: invalid command
I've also upgraded from Reef to Squid, but I still don't have that subcommand. Could you check on your side whether it's available?
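
I assume it's also worth double-checking that all daemons, and the monitors in particular, really report the new release after the upgrade (new mon commands only show up once the mons run it), e.g.:
Code:
ceph versions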
 
OK, so you confirm that
Code:
ceph osd pool stretch
is not (yet?) available on Proxmox.
So, to conclude the thread: a stretched cluster in a Proxmox environment is currently an all-or-nothing matter, with no way to be selective per pool.
I'll keep an eye on upcoming releases; it would be nice to see this land soon.
Thanks for your answers anyway!
 
For anyone who wants to achieve this: it is possible without enabling any stretch configuration at all (neither global nor per-pool).
Just apply all the configuration detailed above except "ceph mon enable_stretch_mode dc3pve1 stretch_rule datacenter".
Manually map the datastores (see the storage.cfg sketch below):
- "stretched_pool" to the dc1 and dc2 nodes
- "dc1_pool" to the dc1 nodes
- "dc2_pool" to the dc2 nodes
Create 3 HA groups:
- "stretched_group" with the dc1 and dc2 nodes
- "dc1_group" with the dc1 nodes, checking "restricted"
- "dc2_group" with the dc2 nodes, checking "restricted"
Checking "restricted" prevents HA from restarting VMs on surviving nodes that are not mapped to the original datastore.
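For reference, a sketch of creating those groups from the CLI (same names as above):
Code:
ha-manager groupadd stretched_group --nodes dc1pve1,dc1pve2,dc1pve3,dc2pve1,dc2pve2,dc2pve3
ha-manager groupadd dc1_group --nodes dc1pve1,dc1pve2,dc1pve3 --restricted 1
ha-manager groupadd dc2_group --nodes dc2pve1,dc2pve2,dc2pve3 --restricted 1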
Obviously, the witness node must not be part of any datastore or HA group mapping.
Hence, losing dc1 will restart all VMs in "stretched_group" on dc2 and will leave the VMs in "dc1_group" as failed. When dc1 comes back, you'll need to switch each previously failed VM's HA resource from started to disabled and then back to started, since HA may have exhausted its "max restart" counter.
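A sketch of that reset for one VM (the VM ID is an example):
Code:
ha-manager set vm:100 --state disabled
ha-manager set vm:100 --state started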
Please note that after a long (a few hours) datacenter outage, automatic Ceph recovery didn't work and kept complaining that the dc1 OSDs were down. I had to restart each failed OSD manually via the CLI (systemctl restart ceph-osd@*.service); hitting the restart OSD button in the GUI had no effect.
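What that looks like per recovered node (a sketch, assuming systemd-managed OSDs; the quoted wildcard covers every OSD instance on the node, and the per-node target unit should do the same):
Code:
systemctl restart 'ceph-osd@*.service'
# or
systemctl restart ceph-osd.target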
 
Hmm, what would your pool settings be then? 4/2 for the stretched one?
The problem would be that you might end up with only one replica at the other DC (so I/O would block until recovery?), unless you actually set it down to 4/1...

Also, I may be wrong, but in that scenario you could have monitors at dc1 connecting to OSDs at dc2 over your inter-datacenter link, wouldn't you?