Ceph Stretch cluster - Crush rules assistance

KKWait · Nov 26, 2024

Hi !
We're being validating a stretched cluster design such as :
- Datacenter 1
- 3 PVE (Dell R650) with 5 NVME (1 OSD per disk)
- Datacenter 2
- 3 PVE (Dell R650) with 5 NVME (1 OSD per disk)
- Datacenter 3
- 1 Virtual PVE as witness

So far so good, the stretched mode work well with the following (specific stretched cluster) configuration :

Code:

ceph osd crush add-bucket dc1 datacenter
ceph osd crush add-bucket dc2 datacenter
ceph osd crush move dc1 root=default
ceph osd crush move dc2 root=default
ceph osd crush move dc1pve1 datacenter=dc1
ceph osd crush move dc1pve2 datacenter=dc1
ceph osd crush move dc1pve3 datacenter=dc1
ceph osd crush move dc2pve1 datacenter=dc2
ceph osd crush move dc2pve2 datacenter=dc2
ceph osd crush move dc2pve3 datacenter=dc2
ceph mon set_location dc1pve1 datacenter=dc1
ceph mon set_location dc1pve2 datacenter=dc1
ceph mon set_location dc1pve3 datacenter=dc1
ceph mon set_location dc2pve1 datacenter=dc2
ceph mon set_location dc2pve2 datacenter=dc2
ceph mon set_location dc2pve3 datacenter=dc2
ceph mon set_location dc3pve1 datacenter=dc3
ceph mon set election_strategy connectivity
ceph mon set_location dc3pve1 datacenter=dc3
ceph mon enable_stretch_mode dc3pve1 stretch_rule datacenter

The following crush rule :

Code:

rule stretch_rule {
  id 2
  type replicated
  step take default
  step choose firstn 0 type datacenter
  step chooseleaf firstn 2 type host
  step emit
}

A pool with 4/2 replication, 128PGs and the stretched_rule as replication policy. Coupled with a proper HA group, loosing a whole datacenter restarts all VMs on the other datacenter, exactly what we needed.

Now, I'd like to add two more pools with Datacenter affinities with a 3/2 crush rule to ensure a VM sticks its datacenter osds to address "native HA applications" such as web servers, active directory ... I tried to add the following crush rules :

Code:

rule dc1_rule {
    id 3
    type replicated
    step take dc1
    step chooseleaf firstn 3 type host
    step emit
}
rule dc2_rule {
    id 4
    type replicated
    step take dc2
    step chooseleaf firstn 3 type host
    step emit
}

And created 2 new pools (3/2, 64PGS), each based on crush rules. Unfortunatly, Ceph health reports those 128PGs stucked as clean+peered and never ends as active.

Code:

root@dc1pve1:~# ceph health detail
HEALTH_WARN Reduced data availability: 128 pgs inactive
[WRN] PG_AVAILABILITY: Reduced data availability: 128 pgs inactive
    pg 25.26 is stuck inactive for 3h, current state clean+peered, last acting [4,9,2]
    pg 25.28 is stuck inactive for 3h, current state clean+peered, last acting [3,5,9]
    pg 25.29 is stuck inactive for 3h, current state clean+peered, last acting [1,9,7]
    pg 25.2a is stuck inactive for 3h, current state clean+peered, last acting [0,11,5]
    pg 25.2b is stuck inactive for 3h, current state clean+peered, last acting [10,5,2]
    pg 25.2c is stuck inactive for 3h, current state clean+peered, last acting [2,6,9]
    pg 25.2d is stuck inactive for 3h, current state clean+peered, last acting [10,5,3]
    pg 25.2e is stuck inactive for 3h, current state clean+peered, last acting [7,11,2]
    pg 25.2f is stuck inactive for 3h, current state clean+peered, last acting [2,10,5]
    pg 25.30 is stuck inactive for 3h, current state clean+peered, last acting [10,0,5]
    pg 25.31 is stuck inactive for 3h, current state clean+peered, last acting [6,11,0]
    pg 25.32 is stuck inactive for 3h, current state clean+peered, last acting [5,0,10]
    pg 25.33 is stuck inactive for 3h, current state clean+peered, last acting [9,4,0]
    pg 25.34 is stuck inactive for 3h, current state clean+peered, last acting [9,7,1]
    pg 25.35 is stuck inactive for 3h, current state clean+peered, last acting [4,9,3]
    pg 25.36 is stuck inactive for 3h, current state clean+peered, last acting [0,11,6]
    pg 25.37 is stuck inactive for 3h, current state clean+peered, last acting [5,11,2]
    pg 25.38 is stuck inactive for 3h, current state clean+peered, last acting [8,2,7]
    pg 25.39 is stuck inactive for 3h, current state clean+peered, last acting [4,0,9]
    pg 25.3a is stuck inactive for 3h, current state clean+peered, last acting [1,8,4]
    pg 25.3b is stuck inactive for 3h, current state clean+peered, last acting [9,5,1]
    pg 25.3c is stuck inactive for 3h, current state clean+peered, last acting [6,3,11]
    pg 25.3d is stuck inactive for 3h, current state clean+peered, last acting [11,5,3]
    pg 25.3e is stuck inactive for 3h, current state clean+peered, last acting [6,3,8]
    pg 25.3f is stuck inactive for 3h, current state clean+peered, last acting [7,0,11]
    pg 26.24 is stuck inactive for 3h, current state clean+peered, last acting [13,19,17]
    pg 26.25 is stuck inactive for 3h, current state clean+peered, last acting [13,20,18]
    pg 26.28 is stuck inactive for 3h, current state clean+peered, last acting [20,13,18]
    pg 26.29 is stuck inactive for 3h, current state clean+peered, last acting [22,15,13]
    pg 26.2a is stuck inactive for 3h, current state clean+peered, last acting [19,12,15]
    pg 26.2b is stuck inactive for 3h, current state clean+peered, last acting [16,23,20]
    pg 26.2c is stuck inactive for 3h, current state clean+peered, last acting [14,19,17]
    pg 26.2d is stuck inactive for 3h, current state clean+peered, last acting [19,15,14]
    pg 26.2e is stuck inactive for 3h, current state clean+peered, last acting [21,18,14]
    pg 26.2f is stuck inactive for 3h, current state clean+peered, last acting [12,15,21]
    pg 26.30 is stuck inactive for 3h, current state clean+peered, last acting [17,21,14]
    pg 26.31 is stuck inactive for 3h, current state clean+peered, last acting [15,14,19]
    pg 26.32 is stuck inactive for 3h, current state clean+peered, last acting [12,17,21]
    pg 26.33 is stuck inactive for 3h, current state clean+peered, last acting [17,20,13]
    pg 26.34 is stuck inactive for 3h, current state clean+peered, last acting [12,19,15]
    pg 26.35 is stuck inactive for 3h, current state clean+peered, last acting [21,13,16]
    pg 26.36 is stuck inactive for 3h, current state clean+peered, last acting [13,19,18]
    pg 26.37 is stuck inactive for 3h, current state clean+peered, last acting [19,17,13]
    pg 26.38 is stuck inactive for 3h, current state clean+peered, last acting [13,15,19]
    pg 26.39 is stuck inactive for 3h, current state clean+peered, last acting [16,13,21]
    pg 26.3a is stuck inactive for 3h, current state clean+peered, last acting [14,20,17]
    pg 26.3b is stuck inactive for 3h, current state clean+peered, last acting [20,15,12]
    pg 26.3c is stuck inactive for 3h, current state clean+peered, last acting [16,23,22]
    pg 26.3d is stuck inactive for 3h, current state clean+peered, last acting [23,21,18]
    pg 26.3e is stuck inactive for 3h, current state clean+peered, last acting [20,17,14]
    pg 26.3f is stuck inactive for 3h, current state clean+peered, last acting [12,22,15]

Here's the OSD tree :

Code:

root@dc1pve1:~# ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME                 STATUS  REWEIGHT  PRI-AFF
 -1         2.34445  root default                                       
-15         1.17223      datacenter dc1                               
 -3         0.39075          host dc1pve1                           
  0    nvme  0.09769              osd.0             up   1.00000  1.00000
  1    nvme  0.09769              osd.1             up   1.00000  1.00000
  2    nvme  0.09769              osd.2             up   1.00000  1.00000
  3    nvme  0.09769              osd.3             up   1.00000  1.00000
 -5         0.39075          host dc1pve2                           
  4    nvme  0.09769              osd.4             up   1.00000  1.00000
  5    nvme  0.09769              osd.5             up   1.00000  1.00000
  6    nvme  0.09769              osd.6             up   1.00000  1.00000
  7    nvme  0.09769              osd.7             up   1.00000  1.00000
 -7         0.39075          host dc1pve3                           
  8    nvme  0.09769              osd.8             up   1.00000  1.00000
  9    nvme  0.09769              osd.9             up   1.00000  1.00000
 10    nvme  0.09769              osd.10            up   1.00000  1.00000
 11    nvme  0.09769              osd.11            up   1.00000  1.00000
-16         1.17223      datacenter dc2                               
 -9         0.39075          host dc2pve1                           
 12    nvme  0.09769              osd.12            up   1.00000  1.00000
 13    nvme  0.09769              osd.13            up   1.00000  1.00000
 14    nvme  0.09769              osd.14            up   1.00000  1.00000
 23    nvme  0.09769              osd.23            up   1.00000  1.00000
-11         0.39075          host dc2pve2                           
 15    nvme  0.09769              osd.15            up   1.00000  1.00000
 16    nvme  0.09769              osd.16            up   1.00000  1.00000
 17    nvme  0.09769              osd.17            up   1.00000  1.00000
 18    nvme  0.09769              osd.18            up   1.00000  1.00000
-13         0.39075          host dc2pve3                           
 19    nvme  0.09769              osd.19            up   1.00000  1.00000
 20    nvme  0.09769              osd.20            up   1.00000  1.00000
 21    nvme  0.09769              osd.21            up   1.00000  1.00000
 22    nvme  0.09769              osd.22            up   1.00000  1.00000

Maybe a clue, but I cannot figure out if it's relevant, I can see blacklisted connection ? The IP corresponds to the witness pve node :

Code:

root@dc1pve1:~# ceph osd blacklist ls
192.168.114.2:0/3316608452 2024-11-27T18:19:13.728470+0100
192.168.114.2:0/5968008 2024-11-27T18:19:13.728470+0100
192.168.114.2:0/1051452434 2024-11-27T18:19:13.728470+0100
192.168.114.2:6817/18177 2024-11-27T18:19:13.728470+0100
192.168.114.2:0/2390757461 2024-11-27T18:19:13.728470+0100
192.168.114.2:6816/18177 2024-11-27T18:19:13.728470+0100
192.168.114.2:0/2298991455 2024-11-27T18:18:16.144679+0100
192.168.114.2:0/3816873361 2024-11-27T18:18:16.144679+0100
192.168.114.2:0/3896422733 2024-11-27T18:18:16.144679+0100
192.168.114.2:0/1685705789 2024-11-27T18:18:16.144679+0100
192.168.114.2:6817/1068 2024-11-27T18:18:16.144679+0100
192.168.114.2:6816/1068 2024-11-27T18:18:16.144679+0100

I tried to play with PGS (512 for stretched 256 for dc), but no changes.

Does anyone see what I'm missing ?

Thanks !

gurubert · Nov 27, 2024

You have enabled stretch mode for the whole cluster. This means that all PGs from all pools need to be distributed across both DCs before they become active.

Quoting the docs https://docs.ceph.com/en/squid/rados/operations/stretch-mode/ :

When stretch mode is enabled, PGs will become active only when they peer across data centers (or across whichever CRUSH bucket type was specified), assuming both are alive. Pools will increase in size from the default 3 to 4, and two copies will be expected in each site.

What your want can be achieved with enabling stretch mode only for specific pools; https://docs.ceph.com/en/squid/rado...?highlight=stretched#individual-stretch-pools

KKWait · Nov 27, 2024

Thanks for your lights ! So I assume I have to disable stretch mode globally before applying per pool stretch. But I cannot find any command to disable this mode (ie ceph mon disable_stretch_mode). Do you know if it's possible ?
In addition do you confirm the crush rules showed above are correct to perform what I want to achieve ?

gurubert · Nov 27, 2024

I would replace "firstn 3" with "firstn 0" to make the rules more generic.

I do not if it's possible to remove the global stretch mode from a cluster. Maybe you should ask on the Ceph mailing list.

KKWait · Nov 27, 2024

Thanks for your feedback. I found a more consistent documentation here https://www.ibm.com/docs/en/storage-ceph/8.0?topic=storage-stretch-mode saying :

Stretch mode limitations

It is not possible to exit from stretch mode once it is entered.

So I've reinstalled the proxmox test cluster, but the command to set a pool as stretched is not found :

Code:

root@dc1pve1:~# ceph osd pool stretch set stretch_pool 1 2 datacenter stretch_rule 4 2
no valid command found; 10 closest matches:
osd pool stats [<pool_name>]
osd pool scrub <who>...
osd pool deep-scrub <who>...
osd pool repair <who>...
osd pool force-recovery <who>...
osd pool force-backfill <who>...
osd pool cancel-force-recovery <who>...
osd pool cancel-force-backfill <who>...
osd pool autoscale-status [--format <value>]
osd pool set threshold <num:float>
Error EINVAL: invalid command

I've also upgraded from reef to squid but I still don't have that sub command. Could you check on your side if it's available ?

gurubert · Nov 27, 2024

Sorry, sometimes the Ceph documentation describes features that have not been released yet.

KKWait · Nov 27, 2024

Ok, so you confirm the

Code:

ceph osd pool stretch

is not (yet ?) implemented on Proxmox.
Thus, to conclude the thread, stretched cluster in a Proxmox environment is a all-or-nothing matter, no way to be selective.
I'll keep an eye on next releases, I think it would be nice to see it soon.
Thanks for your answers anyway !

KKWait · Nov 27, 2024

For people who want to achieve this, it's possible though without defining any stretch configuration (global nor per pool).
Just apply all configurations detailed above but "ceph mon enable_stretch_mode dc3pve1 stretch_rule datacenter".
Manual map datastores :
- "stretched_pool" with dc1 and dc2 nodes
- "dc1_pool" with dc1 nodes
- "dc2_pool" with dc2 nodes
Create 3 ha groups :
- "stretched_group" with dc1 and dc2 nodes
- "dc1_group" with dc1 nodes checking "restricted"
- "dc2_group" with dc2 nodes checking "restricted"
Checking restricted prevents HA from restarting VMs on alive nodes which are not mapped with the original datastore.
Obviously, witness node must not be part of any datastore nor ha group mapping.
Hence, losing dc1 will restart all vms within "stretched_group" on dc2 and will leave vms within "dc1_group" as failed. When dc1 come back alive, you'll need to switch from started to disabled then switch back to started on resource ha for every previously failed vm since ha may have exhausted "max restart" counter.
Please note that for a long (few hours) datacenter outage, automatic ceph rebuild didn't work, complaining dc1 osds are down. I had to manually restart each failed osd through CLI (systemctl restart ceph-osd@*.service) . Hitting restart osd button on GUI didn't have any effect.

Gilou · Dec 28, 2024

Hmm.. what would your pools settings be then? 4/2 for the stretched one?
Problem would be that you might end up with only one replica at the other DC there (so IO block until recovery?), unless you actually set it down to 4/1...

Also, I may be wrong, but in that scenario, you could have monitors at dc1 connecting to OSDs at dc2, over your datacenter link, wouldn't you?

ghusson · Aug 27, 2025

Gilou said:
Hmm.. what would your pools settings be then? 4/2 for the stretched one?
Problem would be that you might end up with only one replica at the other DC there (so IO block until recovery?), unless you actually set it down to 4/1...

Hello, this question is answered in the documentation.
https://docs.ceph.com/en/squid/rados/operations/stretch-mode/ :

When stretch mode is enabled, PGs will become active only when they peer across CRUSH datacenter``s (or across whichever CRUSH bucket type was specified), assuming both are available. Pools will increase in size from the default ``3 to 4, and two replicas will be placed at each site. OSDs will be allowed to connect to Monitors only if they are in the same data center as the Monitors. New Monitors will not be allowed to join the cluster if they do not specify a CRUSH location.

If all OSDs and Monitors in one of the datacenter``s become inaccessible at once, the cluster in the surviving ``datacenter enters degraded stretch mode. A health state warning will be raised, pools’ min_size will be reduced to 1, and the cluster will be allowed to go active with the components and data at the single remaining site. Pool size does not change, so warnings will be raised that the PGs are undersized, but a special stretch mode flag will prevent the OSDs from creating extra copies in the remaining data center. This means that the data center will keep only two copies, just as before.

When the inaccessible datacenter comes back, the cluster will enter recovery stretch mode. This changes the warning and allows peering, but requires OSDs only from the datacenter that was up throughout the duration of the downtime. When all PGs are in a known state, and are neither degraded nor undersized / incomplete, the cluster transitions back to regular stretch mode, ends the warning, restores pools’ min_size to its original value of 2, requires PGs at both sites to peer, and no longer requires the site that was up throughout the duration of the downtime when peering. This makes failover to the other site possible, if needed.

Gilou · Sep 16, 2025

I missed the reply, but.. I was indeed pinpointing the issue that could happen without the stretch mode enabled, as it was suggested as a workaround.. But yeah.

Search

Search

Ceph Stretch cluster - Crush rules assistance

KKWait

New Member

gurubert

Distinguished Member

KKWait

New Member

gurubert

Distinguished Member

KKWait

New Member

Stretch mode limitations

gurubert

Distinguished Member

KKWait

New Member

KKWait

New Member

Gilou

Renowned Member

ghusson

Renowned Member

Gilou

Renowned Member

We value your privacy

Ceph Stretch cluster - Crush rules assistance

New Member

Distinguished Member

New Member

Distinguished Member

New Member

Stretch mode limitations​

Distinguished Member

New Member

New Member

Renowned Member

Renowned Member

Renowned Member

We value your privacy

Stretch mode limitations