Hello all,
Background:
We have two sites/DCs, each with a cluster of 9 nodes.
The way Ceph is built on each node:
4 HDDs of 2.4 TB, plus 2 SSDs of 200 GB for WAL/journal.
Each server is an HPE DL380 Gen9.
Controllers:
P840 for the OS
P440ar for Ceph (HBA mode)
Recently we purchased 10 SSDs of 3.2 TB each and put one in each of 5 of the 9 servers in the cluster.
We created separate pools with separate replication rules, one for SSD and one for HDD: the SSD pool uses a CRUSH rule for the ssd device class and the HDD pool uses a CRUSH rule for the hdd class.
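For completeness, this is roughly how the class-based rules and pools were set up (the rule names match our CRUSH map below; the pool names and PG counts here are illustrative placeholders, not necessarily our exact values):
Code:
# one replicated rule per device class, root "default", failure domain "host"
ceph osd crush rule create-replicated replicated-hdd default host hdd
ceph osd crush rule create-replicated replicated_ssd default host ssd
# create the pools and bind each one to its rule
ceph osd pool create pool-hdd 512 512 replicated replicated-hdd
ceph osd pool create pool-ssd 128 128 replicated replicated_ssd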
As a result, 5 nodes in the cluster each have 4 HDD OSDs, 2 journal SSDs (WAL/DB for the HDD OSDs), and 1 SSD OSD.
Both the SSD and HDD disks sit on the same HBA controller.
For reference, this is our CRUSH map:
Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54
# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd
device 16 osd.16 class hdd
device 17 osd.17 class hdd
device 18 osd.18 class hdd
device 19 osd.19 class hdd
device 20 osd.20 class hdd
device 21 osd.21 class hdd
device 22 osd.22 class hdd
device 23 osd.23 class hdd
device 24 osd.24 class hdd
device 25 osd.25 class hdd
device 26 osd.26 class hdd
device 27 osd.27 class hdd
device 28 osd.28 class hdd
device 29 osd.29 class hdd
device 30 osd.30 class hdd
device 31 osd.31 class hdd
device 32 osd.32 class hdd
device 33 osd.33 class hdd
device 34 osd.34 class hdd
device 35 osd.35 class hdd
device 36 osd.36 class ssd
device 37 osd.37 class ssd
device 38 osd.38 class ssd
device 39 osd.39 class ssd
device 40 osd.40 class ssd
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root
# buckets
host proxmox-1-afq {
    id -3 # do not change unnecessarily
    id -2 class hdd # do not change unnecessarily
    id -21 class ssd # do not change unnecessarily
    # weight 8.730
    alg straw2
    hash 0 # rjenkins1
    item osd.3 weight 2.183
    item osd.2 weight 2.183
    item osd.9 weight 2.183
    item osd.5 weight 2.183
}
host proxmox-2-afq {
    id -5 # do not change unnecessarily
    id -6 class hdd # do not change unnecessarily
    id -22 class ssd # do not change unnecessarily
    # weight 8.730
    alg straw2
    hash 0 # rjenkins1
    item osd.4 weight 2.183
    item osd.11 weight 2.183
    item osd.6 weight 2.183
    item osd.0 weight 2.183
}
host proxmox-3-afq {
    id -7 # do not change unnecessarily
    id -8 class hdd # do not change unnecessarily
    id -23 class ssd # do not change unnecessarily
    # weight 11.641
    alg straw2
    hash 0 # rjenkins1
    item osd.8 weight 2.183
    item osd.7 weight 2.183
    item osd.10 weight 2.183
    item osd.1 weight 2.183
    item osd.40 weight 2.911
}
host proxmox-4-afq {
    id -9 # do not change unnecessarily
    id -10 class hdd # do not change unnecessarily
    id -24 class ssd # do not change unnecessarily
    # weight 11.644
    alg straw2
    hash 0 # rjenkins1
    item osd.12 weight 2.183
    item osd.13 weight 2.183
    item osd.14 weight 2.183
    item osd.15 weight 2.183
    item osd.37 weight 2.911
}
host proxmox-5-afq {
    id -11 # do not change unnecessarily
    id -12 class hdd # do not change unnecessarily
    id -25 class ssd # do not change unnecessarily
    # weight 11.644
    alg straw2
    hash 0 # rjenkins1
    item osd.16 weight 2.183
    item osd.17 weight 2.183
    item osd.18 weight 2.183
    item osd.19 weight 2.183
    item osd.38 weight 2.911
}
host proxmox-6-afq {
    id -13 # do not change unnecessarily
    id -14 class hdd # do not change unnecessarily
    id -26 class ssd # do not change unnecessarily
    # weight 11.644
    alg straw2
    hash 0 # rjenkins1
    item osd.20 weight 2.183
    item osd.21 weight 2.183
    item osd.22 weight 2.183
    item osd.23 weight 2.183
    item osd.39 weight 2.911
}
host proxmox-7-afq {
    id -15 # do not change unnecessarily
    id -16 class hdd # do not change unnecessarily
    id -27 class ssd # do not change unnecessarily
    # weight 11.644
    alg straw2
    hash 0 # rjenkins1
    item osd.24 weight 2.183
    item osd.25 weight 2.183
    item osd.26 weight 2.183
    item osd.27 weight 2.183
    item osd.36 weight 2.911
}
host proxmox-8-afq {
    id -17 # do not change unnecessarily
    id -18 class hdd # do not change unnecessarily
    id -28 class ssd # do not change unnecessarily
    # weight 8.733
    alg straw2
    hash 0 # rjenkins1
    item osd.28 weight 2.183
    item osd.29 weight 2.183
    item osd.30 weight 2.183
    item osd.31 weight 2.183
}
host proxmox-9-afq {
    id -19 # do not change unnecessarily
    id -20 class hdd # do not change unnecessarily
    id -29 class ssd # do not change unnecessarily
    # weight 8.733
    alg straw2
    hash 0 # rjenkins1
    item osd.32 weight 2.183
    item osd.33 weight 2.183
    item osd.34 weight 2.183
    item osd.35 weight 2.183
}
root default {
    id -1 # do not change unnecessarily
    id -4 class hdd # do not change unnecessarily
    id -30 class ssd # do not change unnecessarily
    # weight 93.143
    alg straw2
    hash 0 # rjenkins1
    item proxmox-1-afq weight 8.730
    item proxmox-2-afq weight 8.730
    item proxmox-3-afq weight 11.641
    item proxmox-4-afq weight 11.644
    item proxmox-5-afq weight 11.644
    item proxmox-6-afq weight 11.644
    item proxmox-7-afq weight 11.644
    item proxmox-8-afq weight 8.733
    item proxmox-9-afq weight 8.733
}
# rules
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
rule replicated-hdd {
    id 1
    type replicated
    min_size 1
    max_size 10
    step take default class hdd
    step chooseleaf firstn 0 type host
    step emit
}
rule replicated_ssd {
    id 2
    type replicated
    min_size 1
    max_size 10
    step take default class ssd
    step chooseleaf firstn 0 type host
    step emit
}
# end crush map
The problem: when we move a VM from the HDD pool to the SSD pool, Ceph crashes on the node whose SSD OSD receives the VM's data, because at that moment the OS locks up the controller and then disconnects it.
To recover we have to reboot the node to get Ceph running again.
Sometimes Ceph does not come back up while the SSD OSD is still in the cluster; in that case I have to destroy the SSD OSD and reboot the node again, and only then does Ceph come back up after boot.
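To see whether it really is the controller dropping out, I assume checking the kernel log on the affected node would show resets from the Smart Array driver (hpsa, which drives the P440ar) around the time the OSDs drop; a sketch of what we would run:
Code:
# look for controller resets/aborts in the kernel log
journalctl -k | grep -iE 'hpsa|abort|reset|offline'
# check which OSDs are down and the overall cluster state
ceph -s
ceph osd tree down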
I would like to know:
1. Is our CRUSH map wrong, or does it need fine-tuning to prevent this situation?
2. Can SSD and HDD OSDs sit on the same controller even though they belong to different dedicated pools and rules?
3. Should we configure our pools differently?
Since this is an already-running production environment, we thought about taking all the SSDs and moving them to nodes 8 and 9, making those two nodes SSD-only (replacing their HDDs), to test the behavior of having separate OSD-type hosts. That way we would no longer have mixed disk hardware on the same controller, which we think is our issue.
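If we go that route, my understanding is that moving each SSD OSD would look roughly like this (osd.36 and /dev/sdX are placeholders for the real OSD ID and device; on older Proxmox versions the last command is pveceph createosd):
Code:
# on the old node: drain the OSD, wait for HEALTH_OK, then remove it
ceph osd out 36
systemctl stop ceph-osd@36
ceph osd purge 36 --yes-i-really-mean-it
# on the new node: create a fresh OSD on the relocated SSD
pveceph osd create /dev/sdX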
Regards,
Oren