Ceph performance - RAID configuration

Leo David

Well-Known Member
Hi guys,
Firstly, I would like to thank all the people working on a product that you can just deploy and almost forget about, without needing to continuously fix and repair it, while focusing on the upper services (VMs) that the platform helps you roll out.
Secondly, I am sorry if this subject has been discussed before; I might be too lazy to search for the answer.
My question is related to the deployment I am currently involved in, which starts with a set of 4 x Dell PE R640 nodes. They are equipped with a PERC H730P Mini controller with 2 GB of cache, while the data disks are all 1.9 TB 12 Gb/s SAS SSDs (4 per node). It seems that with this card I can configure a mixed mode of RAID1 for the OS disks and passthrough for the rest of the data disks, if needed.
The plan is to use the platform as a very nice hyperconverged setup, with Ceph as the underlying storage. I have some past Ceph experience starting with Jewel, and I have been through a lot of issues regarding disk and controller types when it comes to performance.
Now, given that I am starting with only 4 nodes and all disks are enterprise-grade SSDs:

1. Would it be better to configure each of the OSD disks as:
- RAID0 with cache enabled
- passthrough with cache enabled
- passthrough without cache

2. I have also quoted 2 x write-intensive 400 GB disks per node to be used as journal disks:
- should I take them out and not consider separate journal disks, given that the OSDs are all SAS SSDs?
- if separate journal disks are still recommended, should I put them in a RAID1 array for fault tolerance?

3. Any other recommendations regarding the controller / disks / RAID / passthrough setup?
I first considered the PERC H330, but gave up on it because of its lack of cache and other users' poor performance reports.

Thank you so much, and have a very nice weekend!

Cheers,

Leo
 
1. Would it be better to configure each of the OSD disks as:
- RAID0 with cache enabled
- passthrough with cache enabled
- passthrough without cache
Ceph (or ZFS for that matter) and RAID is a no go.
https://pve.proxmox.com/pve-docs/chapter-pveceph.html#_precondition
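For reference, a minimal sketch of what that looks like on the CLI, assuming the H730P presents the four data SSDs as plain /dev/sdb .. /dev/sde in passthrough/non-RAID mode (device names are placeholders; recent Proxmox VE uses pveceph osd create, older releases had pveceph createosd):

# check that the disks appear as plain block devices, not RAID virtual drives
lsblk -o NAME,MODEL,SIZE,ROTA

# one OSD per raw SSD, repeated on every node
pveceph osd create /dev/sdb
pveceph osd create /dev/sdc
pveceph osd create /dev/sdd
pveceph osd create /dev/sde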

2. I have also quoted 2 x write-intensive 400 GB disks per node to be used as journal disks:
- should I take them out and not consider separate journal disks, given that the OSDs are all SAS SSDs?
- if separate journal disks are still recommended, should I put them in a RAID1 array for fault tolerance?
With Bluestore there is no journal anymore; Bluestore's DB can be located on a separate device. Why RAID something that is already fault-tolerant at that level?
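If the 400 GB write-intensive drives stay in the design, a hedged sketch of placing the Bluestore DB/WAL on one of them (assuming it shows up as /dev/sdf; device names are placeholders, and the exact option names should be checked against man pveceph for your release):

# data on the big SAS SSDs, RocksDB/WAL carved out of the faster drive
pveceph osd create /dev/sdb -db_dev /dev/sdf
pveceph osd create /dev/sdc -db_dev /dev/sdf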

3. Any other recommendations regarding the controller / disks / RAID / passthrough setup?
I first considered the PERC H330, but gave up on it because of its lack of cache and other users' poor performance reports.
Use an HBA and save some hassle.
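A quick sanity check (from smartmontools) that the controller really passes the SSDs through rather than wrapping them in a virtual drive; /dev/sdb is a placeholder:

# a passthrough/HBA-attached disk answers SMART queries directly;
# behind a MegaRAID virtual drive you would typically need -d megaraid,N instead
smartctl -i /dev/sdb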
 
Thank you Alwin,

Ceph (or ZFS for that matter) and RAID is a no go.
- in this case, I assume "passthrough without cache" would be the most appropriate option?

With Bluestore there is no journal anymore; Bluestore's DB can be located on a separate device. Why RAID something that is already fault-tolerant at that level?
- I meant keeping the WAL/DB on the write-intensive disks (indeed, no journal anymore in this version).
- so I guess it would make sense to keep the WAL/DB on a RAID1 array, to avoid losing all the OSDs on a node if the WAL/DB drive fails?



Use an HBA and save some hassle.
- you are perfectly right!
- the only reason for not installing a pure HBA and relying instead on the PERC H730P passthrough capability is to be sure we can give up on Ceph (if needed) and create local RAID storage without a datacenter visit for card replacement / disk rewiring.

Have a nice day,

Leo
 
- in this case, I assume "passthrough without cache" would be the most appropriate option?
HBA, HBA, HBA. ;)

- I meant keeping the WAL/DB on the write-intensive disks (indeed, no journal anymore in this version).
- so I guess it would make sense to keep the WAL/DB on a RAID1 array, to avoid losing all the OSDs on a node if the WAL/DB drive fails?
You cut the performance of the DB disk in half, for data that is already redundant at the cluster level (e.g. 3x replicas). If the DB disk fails, then it failed: replace the disk, re-create the affected OSDs, and Ceph takes care of the recovery.
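As an illustration of that replacement flow, a rough sketch assuming the failed DB drive backed OSDs 0-3 on that node (OSD IDs are placeholders; check the --cleanup option against your pveceph version):

# mark the affected OSDs out and let Ceph re-replicate their data
ceph osd out osd.0 osd.1 osd.2 osd.3

# once recovery is done, stop and destroy each one
systemctl stop ceph-osd@0
pveceph osd destroy 0 --cleanup
# ... repeat for OSDs 1-3, swap the failed drive, then re-create the OSDs with pveceph osd create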

- the only reason for not installing a pure HBA and relying instead on the PERC H730P passthrough capability is to be sure we can give up on Ceph (if needed) and create local RAID storage without a datacenter visit for card replacement / disk rewiring.
There is ZFS as well. ;)
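As a hedged sketch of that fallback, the same four SSDs per node could become a local striped-mirror (RAID10-like) ZFS pool and be added as Proxmox storage; pool name, storage ID and device names are placeholders:

# two mirrored pairs striped together, 4K-sector friendly
zpool create -o ashift=12 tank mirror /dev/sdb /dev/sdc mirror /dev/sdd /dev/sde

# register it as VM disk storage in Proxmox VE
pvesm add zfspool local-zfs -pool tank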
 
