Hi!
We're currently validating a stretched cluster design as follows:
- Datacenter 1
- 3 PVE (Dell R650) with 5 NVME (1 OSD per disk)
- Datacenter 2
- 3 PVE (Dell R650) with 5 NVME (1 OSD per disk)
- Datacenter 3
- 1 Virtual PVE as witness
So far so good: stretch mode works well with the following (stretch-cluster-specific) configuration:
Code:
# create the datacenter buckets and hang them under the default root
ceph osd crush add-bucket dc1 datacenter
ceph osd crush add-bucket dc2 datacenter
ceph osd crush move dc1 root=default
ceph osd crush move dc2 root=default
# move each host under its datacenter bucket
ceph osd crush move dc1pve1 datacenter=dc1
ceph osd crush move dc1pve2 datacenter=dc1
ceph osd crush move dc1pve3 datacenter=dc1
ceph osd crush move dc2pve1 datacenter=dc2
ceph osd crush move dc2pve2 datacenter=dc2
ceph osd crush move dc2pve3 datacenter=dc2
# tag each monitor with its location
ceph mon set_location dc1pve1 datacenter=dc1
ceph mon set_location dc1pve2 datacenter=dc1
ceph mon set_location dc1pve3 datacenter=dc1
ceph mon set_location dc2pve1 datacenter=dc2
ceph mon set_location dc2pve2 datacenter=dc2
ceph mon set_location dc2pve3 datacenter=dc2
ceph mon set_location dc3pve1 datacenter=dc3
# switch to the connectivity election strategy and enable stretch mode
# with dc3pve1 as the tiebreaker
ceph mon set election_strategy connectivity
ceph mon set_location dc3pve1 datacenter=dc3
ceph mon enable_stretch_mode dc3pve1 stretch_rule datacenter
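If it helps anyone reproducing this, the result can be sanity-checked before creating any pool (the commands below are just a verification sketch, not part of the original setup):
Code:
# election strategy and per-monitor locations
ceph mon dump
# datacenter buckets and the hosts placed under them
ceph osd crush tree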
The following CRUSH rule:
Code:
rule stretch_rule {
        id 2
        type replicated
        step take default
        # pick every datacenter under the default root...
        step choose firstn 0 type datacenter
        # ...then 2 OSDs on distinct hosts within each of them
        step chooseleaf firstn 2 type host
        step emit
}
A pool with 4/2 replication, 128 PGs and stretch_rule as its CRUSH rule. Coupled with a proper HA group, losing a whole datacenter restarts all VMs on the surviving datacenter, which is exactly what we needed.
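For completeness, a minimal sketch of how such a pool can be created from the CLI; vm-stretch is a placeholder name, not the real pool name:
Code:
# vm-stretch is a placeholder pool name
ceph osd pool create vm-stretch 128 128 replicated stretch_rule
ceph osd pool set vm-stretch size 4
ceph osd pool set vm-stretch min_size 2
ceph osd pool application enable vm-stretch rbd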
Now I'd like to add two more pools with datacenter affinity, each with a 3/2 CRUSH rule, so that a VM sticks to its own datacenter's OSDs; this is meant for "natively HA" applications such as web servers, Active Directory, and so on. I tried to add the following CRUSH rules:
Code:
rule dc1_rule {
        id 3
        type replicated
        # keep every replica on hosts inside dc1
        step take dc1
        step chooseleaf firstn 3 type host
        step emit
}
rule dc2_rule {
        id 4
        type replicated
        # keep every replica on hosts inside dc2
        step take dc2
        step chooseleaf firstn 3 type host
        step emit
}
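For anyone who prefers not to edit the CRUSH map by hand, a roughly equivalent pair of rules (chooseleaf firstn 0 instead of firstn 3, same result for size-3 pools) can also be created straight from the CLI:
Code:
# replicated rule rooted at the given datacenter bucket, failure domain = host
ceph osd crush rule create-replicated dc1_rule dc1 host
ceph osd crush rule create-replicated dc2_rule dc2 host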
Then I created 2 new pools (3/2, 64 PGs), one per rule. Unfortunately, Ceph health reports those 128 PGs stuck in clean+peered; they never become active.
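With placeholder names vm-dc1 and vm-dc2, the equivalent CLI for those two pools would be roughly:
Code:
# placeholder pool names, 3/2 replication, 64 PGs, pinned to one datacenter each
ceph osd pool create vm-dc1 64 64 replicated dc1_rule
ceph osd pool set vm-dc1 size 3
ceph osd pool set vm-dc1 min_size 2
ceph osd pool create vm-dc2 64 64 replicated dc2_rule
ceph osd pool set vm-dc2 size 3
ceph osd pool set vm-dc2 min_size 2
Ceph health detail shows: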
Code:
root@dc1pve1:~# ceph health detail
HEALTH_WARN Reduced data availability: 128 pgs inactive
[WRN] PG_AVAILABILITY: Reduced data availability: 128 pgs inactive
pg 25.26 is stuck inactive for 3h, current state clean+peered, last acting [4,9,2]
pg 25.28 is stuck inactive for 3h, current state clean+peered, last acting [3,5,9]
pg 25.29 is stuck inactive for 3h, current state clean+peered, last acting [1,9,7]
pg 25.2a is stuck inactive for 3h, current state clean+peered, last acting [0,11,5]
pg 25.2b is stuck inactive for 3h, current state clean+peered, last acting [10,5,2]
pg 25.2c is stuck inactive for 3h, current state clean+peered, last acting [2,6,9]
pg 25.2d is stuck inactive for 3h, current state clean+peered, last acting [10,5,3]
pg 25.2e is stuck inactive for 3h, current state clean+peered, last acting [7,11,2]
pg 25.2f is stuck inactive for 3h, current state clean+peered, last acting [2,10,5]
pg 25.30 is stuck inactive for 3h, current state clean+peered, last acting [10,0,5]
pg 25.31 is stuck inactive for 3h, current state clean+peered, last acting [6,11,0]
pg 25.32 is stuck inactive for 3h, current state clean+peered, last acting [5,0,10]
pg 25.33 is stuck inactive for 3h, current state clean+peered, last acting [9,4,0]
pg 25.34 is stuck inactive for 3h, current state clean+peered, last acting [9,7,1]
pg 25.35 is stuck inactive for 3h, current state clean+peered, last acting [4,9,3]
pg 25.36 is stuck inactive for 3h, current state clean+peered, last acting [0,11,6]
pg 25.37 is stuck inactive for 3h, current state clean+peered, last acting [5,11,2]
pg 25.38 is stuck inactive for 3h, current state clean+peered, last acting [8,2,7]
pg 25.39 is stuck inactive for 3h, current state clean+peered, last acting [4,0,9]
pg 25.3a is stuck inactive for 3h, current state clean+peered, last acting [1,8,4]
pg 25.3b is stuck inactive for 3h, current state clean+peered, last acting [9,5,1]
pg 25.3c is stuck inactive for 3h, current state clean+peered, last acting [6,3,11]
pg 25.3d is stuck inactive for 3h, current state clean+peered, last acting [11,5,3]
pg 25.3e is stuck inactive for 3h, current state clean+peered, last acting [6,3,8]
pg 25.3f is stuck inactive for 3h, current state clean+peered, last acting [7,0,11]
pg 26.24 is stuck inactive for 3h, current state clean+peered, last acting [13,19,17]
pg 26.25 is stuck inactive for 3h, current state clean+peered, last acting [13,20,18]
pg 26.28 is stuck inactive for 3h, current state clean+peered, last acting [20,13,18]
pg 26.29 is stuck inactive for 3h, current state clean+peered, last acting [22,15,13]
pg 26.2a is stuck inactive for 3h, current state clean+peered, last acting [19,12,15]
pg 26.2b is stuck inactive for 3h, current state clean+peered, last acting [16,23,20]
pg 26.2c is stuck inactive for 3h, current state clean+peered, last acting [14,19,17]
pg 26.2d is stuck inactive for 3h, current state clean+peered, last acting [19,15,14]
pg 26.2e is stuck inactive for 3h, current state clean+peered, last acting [21,18,14]
pg 26.2f is stuck inactive for 3h, current state clean+peered, last acting [12,15,21]
pg 26.30 is stuck inactive for 3h, current state clean+peered, last acting [17,21,14]
pg 26.31 is stuck inactive for 3h, current state clean+peered, last acting [15,14,19]
pg 26.32 is stuck inactive for 3h, current state clean+peered, last acting [12,17,21]
pg 26.33 is stuck inactive for 3h, current state clean+peered, last acting [17,20,13]
pg 26.34 is stuck inactive for 3h, current state clean+peered, last acting [12,19,15]
pg 26.35 is stuck inactive for 3h, current state clean+peered, last acting [21,13,16]
pg 26.36 is stuck inactive for 3h, current state clean+peered, last acting [13,19,18]
pg 26.37 is stuck inactive for 3h, current state clean+peered, last acting [19,17,13]
pg 26.38 is stuck inactive for 3h, current state clean+peered, last acting [13,15,19]
pg 26.39 is stuck inactive for 3h, current state clean+peered, last acting [16,13,21]
pg 26.3a is stuck inactive for 3h, current state clean+peered, last acting [14,20,17]
pg 26.3b is stuck inactive for 3h, current state clean+peered, last acting [20,15,12]
pg 26.3c is stuck inactive for 3h, current state clean+peered, last acting [16,23,22]
pg 26.3d is stuck inactive for 3h, current state clean+peered, last acting [23,21,18]
pg 26.3e is stuck inactive for 3h, current state clean+peered, last acting [20,17,14]
pg 26.3f is stuck inactive for 3h, current state clean+peered, last acting [12,22,15]
Here's the OSD tree:
Code:
root@dc1pve1:~# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 2.34445 root default
-15 1.17223 datacenter dc1
-3 0.39075 host dc1pve1
0 nvme 0.09769 osd.0 up 1.00000 1.00000
1 nvme 0.09769 osd.1 up 1.00000 1.00000
2 nvme 0.09769 osd.2 up 1.00000 1.00000
3 nvme 0.09769 osd.3 up 1.00000 1.00000
-5 0.39075 host dc1pve2
4 nvme 0.09769 osd.4 up 1.00000 1.00000
5 nvme 0.09769 osd.5 up 1.00000 1.00000
6 nvme 0.09769 osd.6 up 1.00000 1.00000
7 nvme 0.09769 osd.7 up 1.00000 1.00000
-7 0.39075 host dc1pve3
8 nvme 0.09769 osd.8 up 1.00000 1.00000
9 nvme 0.09769 osd.9 up 1.00000 1.00000
10 nvme 0.09769 osd.10 up 1.00000 1.00000
11 nvme 0.09769 osd.11 up 1.00000 1.00000
-16 1.17223 datacenter dc2
-9 0.39075 host dc2pve1
12 nvme 0.09769 osd.12 up 1.00000 1.00000
13 nvme 0.09769 osd.13 up 1.00000 1.00000
14 nvme 0.09769 osd.14 up 1.00000 1.00000
23 nvme 0.09769 osd.23 up 1.00000 1.00000
-11 0.39075 host dc2pve2
15 nvme 0.09769 osd.15 up 1.00000 1.00000
16 nvme 0.09769 osd.16 up 1.00000 1.00000
17 nvme 0.09769 osd.17 up 1.00000 1.00000
18 nvme 0.09769 osd.18 up 1.00000 1.00000
-13 0.39075 host dc2pve3
19 nvme 0.09769 osd.19 up 1.00000 1.00000
20 nvme 0.09769 osd.20 up 1.00000 1.00000
21 nvme 0.09769 osd.21 up 1.00000 1.00000
22 nvme 0.09769 osd.22 up 1.00000 1.00000
Maybe a clue, though I can't tell whether it's relevant: I can see blacklisted connections. The IP corresponds to the witness PVE node:
Code:
root@dc1pve1:~# ceph osd blacklist ls
192.168.114.2:0/3316608452 2024-11-27T18:19:13.728470+0100
192.168.114.2:0/5968008 2024-11-27T18:19:13.728470+0100
192.168.114.2:0/1051452434 2024-11-27T18:19:13.728470+0100
192.168.114.2:6817/18177 2024-11-27T18:19:13.728470+0100
192.168.114.2:0/2390757461 2024-11-27T18:19:13.728470+0100
192.168.114.2:6816/18177 2024-11-27T18:19:13.728470+0100
192.168.114.2:0/2298991455 2024-11-27T18:18:16.144679+0100
192.168.114.2:0/3816873361 2024-11-27T18:18:16.144679+0100
192.168.114.2:0/3896422733 2024-11-27T18:18:16.144679+0100
192.168.114.2:0/1685705789 2024-11-27T18:18:16.144679+0100
192.168.114.2:6817/1068 2024-11-27T18:18:16.144679+0100
192.168.114.2:6816/1068 2024-11-27T18:18:16.144679+0100
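For what it's worth, these entries expire on their own and can also be removed manually; a sketch, using one of the addresses above:
Code:
# remove a single blacklist entry by address
ceph osd blacklist rm 192.168.114.2:0/3316608452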
I tried playing with the PG counts (512 for the stretched pool, 256 for the DC pools), but nothing changed.
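That was just the usual pg_num bump, sketched here with the same placeholder pool names as above (recent Ceph releases adjust pgp_num automatically):
Code:
# placeholder pool names; pgp_num follows pg_num on recent releases
ceph osd pool set vm-stretch pg_num 512
ceph osd pool set vm-dc1 pg_num 256
ceph osd pool set vm-dc2 pg_num 256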
Does anyone see what I'm missing?
Thanks!