Proxmox VE 6: Mixed Ceph SSD/HDD on the same HBA controller

orenr

New Member
Oct 26, 2021
Hello all,

Background:
We have two sites/DCs, each with a cluster of 9 nodes.
The way Ceph is built per node:
4 HDDs of 2.4 TB and 2 SSDs of 200 GB for WAL/journal.

Each server is an HPE DL380 Gen9.
Controllers:
P840 for the OS
P440ar for Ceph (HBA mode)

Recently we purchased 10 SSDs of 3.2 TB each.
We put 1 disk in each of 5 of the 9 servers in the cluster.

We created separate pools with separate replication rules, one for SSD and one for HDD:
the SSD pool uses a CRUSH rule for the ssd device class and the HDD pool a CRUSH rule for the hdd device class (roughly as sketched below).
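For reference, the class-based rules and pools were created roughly like this (a sketch; the rule names match our CRUSH map, but the pool names and PG counts here are placeholders rather than our exact values):

Code:
# one replicated rule per device class, failure domain = host
ceph osd crush rule create-replicated replicated-hdd default host hdd
ceph osd crush rule create-replicated replicated_ssd default host ssd

# one pool per rule (pool names and PG counts are placeholders)
ceph osd pool create ceph-hdd 512 512 replicated replicated-hdd
ceph osd pool create ceph-ssd 128 128 replicated replicated_ssd
ceph osd pool application enable ceph-hdd rbd
ceph osd pool application enable ceph-ssd rbd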

So on each of those 5 nodes in the cluster we end up with
4 HDD OSDs, 2 journal/WAL SSDs for the HDD OSDs, and 1 SSD OSD.

Both the SSD and HDD disks are on the same HBA controller.
For reference, this is our CRUSH map:

Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd
device 16 osd.16 class hdd
device 17 osd.17 class hdd
device 18 osd.18 class hdd
device 19 osd.19 class hdd
device 20 osd.20 class hdd
device 21 osd.21 class hdd
device 22 osd.22 class hdd
device 23 osd.23 class hdd
device 24 osd.24 class hdd
device 25 osd.25 class hdd
device 26 osd.26 class hdd
device 27 osd.27 class hdd
device 28 osd.28 class hdd
device 29 osd.29 class hdd
device 30 osd.30 class hdd
device 31 osd.31 class hdd
device 32 osd.32 class hdd
device 33 osd.33 class hdd
device 34 osd.34 class hdd
device 35 osd.35 class hdd
device 36 osd.36 class ssd
device 37 osd.37 class ssd
device 38 osd.38 class ssd
device 39 osd.39 class ssd
device 40 osd.40 class ssd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host proxmox-1-afq {
    id -3        # do not change unnecessarily
    id -2 class hdd        # do not change unnecessarily
    id -21 class ssd        # do not change unnecessarily
    # weight 8.730
    alg straw2
    hash 0    # rjenkins1
    item osd.3 weight 2.183
    item osd.2 weight 2.183
    item osd.9 weight 2.183
    item osd.5 weight 2.183
}
host proxmox-2-afq {
    id -5        # do not change unnecessarily
    id -6 class hdd        # do not change unnecessarily
    id -22 class ssd        # do not change unnecessarily
    # weight 8.730
    alg straw2
    hash 0    # rjenkins1
    item osd.4 weight 2.183
    item osd.11 weight 2.183
    item osd.6 weight 2.183
    item osd.0 weight 2.183
}
host proxmox-3-afq {
    id -7        # do not change unnecessarily
    id -8 class hdd        # do not change unnecessarily
    id -23 class ssd        # do not change unnecessarily
    # weight 11.641
    alg straw2
    hash 0    # rjenkins1
    item osd.8 weight 2.183
    item osd.7 weight 2.183
    item osd.10 weight 2.183
    item osd.1 weight 2.183
    item osd.40 weight 2.911
}
host proxmox-4-afq {
    id -9        # do not change unnecessarily
    id -10 class hdd        # do not change unnecessarily
    id -24 class ssd        # do not change unnecessarily
    # weight 11.644
    alg straw2
    hash 0    # rjenkins1
    item osd.12 weight 2.183
    item osd.13 weight 2.183
    item osd.14 weight 2.183
    item osd.15 weight 2.183
    item osd.37 weight 2.911
}
host proxmox-5-afq {
    id -11        # do not change unnecessarily
    id -12 class hdd        # do not change unnecessarily
    id -25 class ssd        # do not change unnecessarily
    # weight 11.644
    alg straw2
    hash 0    # rjenkins1
    item osd.16 weight 2.183
    item osd.17 weight 2.183
    item osd.18 weight 2.183
    item osd.19 weight 2.183
    item osd.38 weight 2.911
}
host proxmox-6-afq {
    id -13        # do not change unnecessarily
    id -14 class hdd        # do not change unnecessarily
    id -26 class ssd        # do not change unnecessarily
    # weight 11.644
    alg straw2
    hash 0    # rjenkins1
    item osd.20 weight 2.183
    item osd.21 weight 2.183
    item osd.22 weight 2.183
    item osd.23 weight 2.183
    item osd.39 weight 2.911
}
host proxmox-7-afq {
    id -15        # do not change unnecessarily
    id -16 class hdd        # do not change unnecessarily
    id -27 class ssd        # do not change unnecessarily
    # weight 11.644
    alg straw2
    hash 0    # rjenkins1
    item osd.24 weight 2.183
    item osd.25 weight 2.183
    item osd.26 weight 2.183
    item osd.27 weight 2.183
    item osd.36 weight 2.911
}
host proxmox-8-afq {
    id -17        # do not change unnecessarily
    id -18 class hdd        # do not change unnecessarily
    id -28 class ssd        # do not change unnecessarily
    # weight 8.733
    alg straw2
    hash 0    # rjenkins1
    item osd.28 weight 2.183
    item osd.29 weight 2.183
    item osd.30 weight 2.183
    item osd.31 weight 2.183
}
host proxmox-9-afq {
    id -19        # do not change unnecessarily
    id -20 class hdd        # do not change unnecessarily
    id -29 class ssd        # do not change unnecessarily
    # weight 8.733
    alg straw2
    hash 0    # rjenkins1
    item osd.32 weight 2.183
    item osd.33 weight 2.183
    item osd.34 weight 2.183
    item osd.35 weight 2.183
}
root default {
    id -1        # do not change unnecessarily
    id -4 class hdd        # do not change unnecessarily
    id -30 class ssd        # do not change unnecessarily
    # weight 93.143
    alg straw2
    hash 0    # rjenkins1
    item proxmox-1-afq weight 8.730
    item proxmox-2-afq weight 8.730
    item proxmox-3-afq weight 11.641
    item proxmox-4-afq weight 11.644
    item proxmox-5-afq weight 11.644
    item proxmox-6-afq weight 11.644
    item proxmox-7-afq weight 11.644
    item proxmox-8-afq weight 8.733
    item proxmox-9-afq weight 8.733
}

# rules
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
rule replicated-hdd {
    id 1
    type replicated
    min_size 1
    max_size 10
    step take default class hdd
    step chooseleaf firstn 0 type host
    step emit
}
rule replicated_ssd {
    id 2
    type replicated
    min_size 1
    max_size 10
    step take default class ssd
    step chooseleaf firstn 0 type host
    step emit
}

# end crush map
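
To double-check that the class-based rules only see the OSDs of their own class, the shadow hierarchy and the pool-to-rule mapping can be inspected roughly like this (a sketch; replace the pool name with the real one):

Code:
# show the per-class shadow trees (default~hdd / default~ssd)
ceph osd crush tree --show-shadow
# dump the rules and check which rule each pool actually uses
ceph osd crush rule dump
ceph osd pool get <poolname> crush_rule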


The problem: at the moment, when we move a VM from the HDD pool to the SSD pool,
Ceph on the node hosting that VM crashes, because at that moment the OS locks up the controller and then disconnects it.

To fix this we have to reboot the node to get Ceph running again.

Sometimes Ceph does not come back up while the SSD is still in the cluster; in that case I have to destroy the SSD OSD and reboot the node again, and only then does Ceph come back up after boot (roughly as sketched below).
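The cleanup on the affected node then looks roughly like this (a sketch; <id> is a placeholder for the crashed SSD OSD, and the exact pveceph flags may differ by version):

Code:
# take the crashed SSD OSD out of the cluster and destroy it
ceph osd out <id>
systemctl stop ceph-osd@<id>
pveceph osd destroy <id> --cleanup 1
# only after a reboot does Ceph on the node recover
reboot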


I would like to know:
1. Is our CRUSH map wrong, or does it need fine-tuning to prevent this situation?
2. Can SSD and HDD disks be on the same controller even though they belong to different dedicated pools and rules?
3. Should we configure our pools differently?

Since this is an already running production environment,
we thought about taking all the SSD disks, moving them to nodes 8 and 9 of the cluster, and making those nodes SSD-only (instead of their HDDs) to test the behavior of separate OSD-type hosts. That way we would no longer have mixed disk hardware on the same controller, which we think is our issue.


Regards,
Oren
 
Do you see anything in the system logs (/var/log/syslog or /var/log/kern.log) regarding IO errors or errors with the HBA?
The behavior of the HBA crashing when there is load on the SSD does not sound right. Does it also happen if you don't use the SSD for Ceph and just write a lot of data to it directly?

I suspect that the HBA is the problem and not Ceph itself. Check if there are firmware updates available for the servers.
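For example, something along these lines around the crash time (a sketch; it assumes the P440ar in HBA mode uses the hpsa driver):

Code:
# look for controller resets and IO errors in the logs
grep -iE 'hpsa|i/o error|reset|abort' /var/log/syslog /var/log/kern.log
dmesg -T | grep -iE 'hpsa|i/o error'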
 
Hi aaron,

Thank you very much for replying.
We saw IO errors, but only once the controller had already disconnected, so we couldn't pin it on them directly.

The situation is:
We finished adding all the SSD OSDs.
Then, say on node pve03, I moved a VM's storage from the HDD Ceph pool to the SSD Ceph pool (roughly as below).
After 5 minutes, Ceph on that node crashed, and a reboot was required to reconnect the controller and bring Ceph on the node back up.
I have reproduced it 4 times, and we are talking about a VM with a 50 GB disk that is 20% used.
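The move itself is done through the GUI, which should correspond roughly to this (a sketch; the VMID, disk and storage names are placeholders):

Code:
# move the VM disk from the HDD-backed storage to the SSD-backed one and drop the old copy
qm move_disk 103 scsi0 ceph-ssd --delete 1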

On the other hand, when writing only to the HDDs, Ceph does not crash even with data volumes on the order of terabytes.

The controller is an HPE Smart Storage P440ar, running the latest firmware.

Soon we are going to take 2 nodes of the cluster, make them SSD-only Ceph nodes, and check whether the controller is the issue, i.e. whether it simply cannot handle both HDD and SSD Ceph read/write traffic in HBA mode.

I'll update soon with the results to get more input and consider further steps.
 
Be aware that having only 2 nodes with SSDs will not give you redundancy for the SSD pool. At least 3 nodes should be used for that.

To rule out Ceph, I would write a lot of data directly to the bare SSD and see if the controller behaves the same, because I suspect this is rather a case for HPE support to explain why their controller behaves this way.
 
OK, we did the 2-node SSD-only setup, and still no good results.
I downgraded the pool config to 2/2 replicas, since I only have 2 nodes for it (roughly as below), and all PGs were reallocated correctly.
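The downgrade was done roughly like this (a sketch; the pool name is a placeholder):

Code:
# reduce the SSD pool to 2 replicas with min_size 2
ceph osd pool set ceph-ssd size 2
ceph osd pool set ceph-ssd min_size 2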

The results are:
I moved test-vm (a very light VM) to the SSD pool.
After 10 seconds - BOOM! The 3 SSD OSDs on that node crashed, and of course I had to reboot the failed node again.

@aaron, as per your comment above... should this only work with 3 nodes? I mean, should I really test again after setting up 3 SSD-only nodes?

@Neobin, thank you for the reply, I'll certainly look into the posts from your inline comments.

Regards
Oren
 
@aaron, as per your comment above... should this only work with 3 nodes? I mean, should I really test again after setting up 3 SSD-only nodes?
Nope, I meant that for production use you want 3/2; 2/2 isn't really great.

Right now, we don't know if there is a Ceph problem or a problem in the layers below, such as the SSDs themselves or the controller.

Have you tried writing data directly to the SSDs yet? Something like dd if=/dev/urandom of=/dev/sdX bs=1M? Or better, first lay out a file of random data and then copy that to the SSD directly, because /dev/urandom might not produce data fast enough to really put load on the SSD. A sketch of that two-step test is below.
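Something along these lines, as a sketch (/dev/sdX is a placeholder and will be overwritten, so only use an SSD that holds no data):

Code:
# 1) lay out ~20 GB of random data on a fast local filesystem first
dd if=/dev/urandom of=/root/random.img bs=1M count=20480
# 2) then push it to the bare SSD with direct IO to bypass the page cache
dd if=/root/random.img of=/dev/sdX bs=1M oflag=direct status=progress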
 
Hi @aaron

I already tried writing directly to the disk - I also found a qcow2 image of a recently backed-up VM of around 60 GB, plus other harmless files and small filesystems,
and wrote them directly to one of the SSD disks; it didn't crash or misbehave.

I'm guessing that maybe my controller is not best suited for situations like holding SSDs on their own in Ceph.

But one more question:
Let's say I add a smaller SSD as a journal/WAL device for the SSD OSD disks (roughly as below) - would that work, or at least ease the load on the system so it stops throwing off my SSD OSDs?
With the HDD OSDs I have no issues, and the only difference besides their type is that they use separate WAL/journal SSDs.
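What I have in mind would be recreating the SSD OSD with its DB/WAL on a separate, smaller SSD, roughly like this (a sketch; the device paths are placeholders and the exact pveceph options may differ by version):

Code:
# create the SSD OSD with its DB/WAL on a separate smaller SSD
pveceph osd create /dev/sdX --db_dev /dev/sdY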

Kind Regards,
Oren
 
