Hi Proxmox fans.
Excuse my ignorance in this area; I've been trying to wrap my head around the Ceph clustering model. I've watched some videos and read quite a bit. I have a 4-node cluster with equivalent hardware, and based on some tutorials I have a functioning environment set up. I would like to be able to lose 1 node without any kind of fault.
Here is my problem.
When 1 host goes down, everything goes down, so clearly I have something set up wrong.
I also notice that when 1 node goes down it takes the other hosts' OSDs down with it.
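In case it helps with diagnosis, these are the commands I can run right after a node drops to capture what state the cluster is in (standard ceph CLI, nothing exotic; I may well be checking the wrong things):
root@PM01:~# ceph -s
root@PM01:~# ceph osd tree
root@PM01:~# ceph health detail
My understanding is that ceph -s shows overall health and monitor quorum, ceph osd tree shows which OSDs are marked down and on which host, and ceph health detail lists the degraded or stuck PGs.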
Here are my Ceph config and CRUSH map:
[global]
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
cluster network = 10.10.10.0/24
filestore xattr use omap = true
fsid = c4a24163-5d13-4d82-8877-8b6ddc050f29
keyring = /etc/pve/priv/$cluster.$name.keyring
osd journal size = 5120
osd pool default min size = 1
public network = 10.10.10.0/24
[osd]
keyring = /var/lib/ceph/osd/ceph-$id/keyring
[mon.2]
host = pm02
mon addr = 10.10.10.2:6789
[mon.0]
host = PM01
mon addr = 10.10.10.1:6789
[mon.1]
host = pm03
mon addr = 10.10.10.3:6789
[mon.3]
host = pm04
mon addr = 10.10.10.4:6789
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable straw_calc_version 1
# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root
# buckets
host pm02 {
	id -2		# do not change unnecessarily
	# weight 2.720
	alg straw
	hash 0	# rjenkins1
	item osd.0 weight 2.720
}
host pm03 {
	id -3		# do not change unnecessarily
	# weight 1.050
	alg straw
	hash 0	# rjenkins1
	item osd.1 weight 1.050
}
host pm04 {
	id -4		# do not change unnecessarily
	# weight 1.810
	alg straw
	hash 0	# rjenkins1
	item osd.5 weight 1.810
}
host PM01 {
	id -5		# do not change unnecessarily
	# weight 2.700
	alg straw
	hash 0	# rjenkins1
	item osd.2 weight 0.900
	item osd.3 weight 0.900
	item osd.4 weight 0.900
}
root default {
	id -1		# do not change unnecessarily
	# weight 8.280
	alg straw
	hash 0	# rjenkins1
	item pm02 weight 2.720
	item pm03 weight 1.050
	item pm04 weight 1.810
	item PM01 weight 2.700
}
# rules
rule replicated_ruleset {
	ruleset 0
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type host
	step emit
}
# end crush map
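In case it's relevant, I believe the placement rule can also be sanity-checked offline with crushtool (crushmap.bin is just a local working filename for the dumped map, not something the cluster itself needs):
root@PM01:~# ceph osd getcrushmap -o crushmap.bin
root@PM01:~# crushtool -i crushmap.bin --test --rule 0 --num-rep 3 --show-mappings
As I understand it, this simulates where 3 replicas would land under rule 0, so I can check that each PG maps to three different hosts.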
Current pool settings:
root@PM01:~# ceph osd pool get cephStor pg_num
pg_num: 150
root@PM01:~# ceph osd pool get cephStor size
size: 3
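Since my ceph.conf sets osd pool default min size = 1, I assume the pool's actual min_size matters here as well; checking it should just be:
root@PM01:~# ceph osd pool get cephStor min_size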
I found the Ceph PG calculator and plugged in my pool:
Pool: cephStor
Size = 3
OSD # = 5
% of Data = 100%
Target PGs per OSD = 200
Suggested PG count = 256
If I'm reading this correctly, the calculator is suggesting a total PG count of 256 (targeting about 200 PGs per OSD), whereas my pool currently has only 150.
How do I update this, and is it the right setup for what I am trying to accomplish?
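My best guess at how to update the PG count, based on what I have read, is something like the following; please correct me if this is the wrong approach:
root@PM01:~# ceph osd pool set cephStor pg_num 256
root@PM01:~# ceph osd pool set cephStor pgp_num 256
My understanding is that pgp_num has to be raised along with pg_num, otherwise the new PGs are created but data is not rebalanced onto them.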