Proxmox - CEPH and RAID, LVM or Qcow2?

TJ101

New Member
Mar 24, 2014
Dear Proxmox Community,


I have been looking at the advantages of using CEPH running on three Proxmox Nodes.


In the example configuration found here:-


http://pve.proxmox.com/wiki/Ceph_Server


The OSDs are four individual SATA disks on each Node, with three Proxmox Nodes in total.


I have read in numerous places that CEPH spells the end for RAID.


I, however, think that combining the two technologies will give the best performance and security.


For example, the seek/read/write speed of each SATA disk used for an OSD can be a snagging point, but the overall storage capacity is a plus when compared to SSDs.


In the above example, two SSDs per Proxmox Node are used: one for the Proxmox VE installation and the other for the CEPH journals.
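If I have read that wiki page correctly, the basic bring-up boils down to a few pveceph commands per node, roughly as follows (the network below is just a placeholder for the dedicated CEPH network):

    pveceph install                        # on every node: install the Ceph packages
    pveceph init --network 10.10.10.0/24   # on the first node only: write the cluster config
    pveceph createmon                      # on each node that should run a monitor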


I am wondering if there is a have-your-cake-and-eat-it scenario, and have been thinking about combining hardware RAID, such as the LSI 9271-8CC (CacheCade), with CEPH.


This LSI RAID controller comes with 8 ports, two of which can be used for a RAID 1 (mirrored) SSD CacheCade volume, which acts as a front-side flash cache for larger SATA arrays or individual disks.


The advantage of this approach is that you can use two high-end CacheCade-approved SSDs and build a disk cache of up to 512 GB with a read/write speed of around 1,000 MB/s. The card also comes with 1 GB of on-board cache and a must-have battery backup option.


In all, it greatly improves the performance of standalone SATA disks, particularly 4K random read/write speeds.



A comparison can be found here:- http://www.boston.co.uk/technical/2011/11/lsi-cachecade-2-evaluation-the-benefits-of-caching.aspx


My first question is this :)



Do I take advantage of the LSI RAID configuration above and build two arrays per Proxmox Node?



One array would be used for the Proxmox VE and the other for a CEPH OSD.


If I were to use hardware RAID 6 or 50, this would further increase read/write performance for data not in the CacheCade and would provide hardware-level redundancy on top of the CEPH redundancy.


I am thinking about the following RAID configurations, chosen on the basis of performance and security, with these characteristics:


RAID 6 with 6 SATA disks – tolerates two disk failures anywhere in the array, with roughly a 4x read speed gain.


RAID 50 with 6 SATA disks – tolerates two disk failures, at most one per sub-array, with roughly 4x read and 2x write speed gains.


Or do I configure each of the 6 SATA disks on the LSI controller as an individual OSD, still taking advantage of the CacheCade and on-board RAM cache?
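For what it's worth on capacity: with six disks both RAID layouts leave four disks' worth usable (RAID 6 loses two disks to parity, and RAID 50 built as two three-disk RAID 5 spans loses one disk per span), whereas the per-disk OSD route gives CEPH all six disks raw and leaves redundancy to the replication level across nodes. If I have understood the wiki correctly, the per-disk option would look roughly like this (device names are placeholders):

    pveceph createosd /dev/sdc                         # journal co-located on the OSD disk
    pveceph createosd /dev/sdd -journal_dev /dev/sdb   # journal on a shared SSD partition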




My Second question :)




Does CEPH support the storage of file-based disk images such as qcow2 and vmdk?



One of the problems with other distributed replication technologies such as DRBD is the need for LVM on top of it in primary/primary mode, and LVM storage only seems to support raw logical volumes.



Although performance is better, a lot of space is wasted due to disk space that is allocated but unused within the volume.



Virtual disk images such as qcow2 and vmdk support sparse allocation, which means that they only consume the physical disk space that they actually need.
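As a quick illustration of the sparse behaviour (the file name and size are just examples), a freshly created qcow2 reports the full virtual size to the guest while occupying almost nothing on disk:

    qemu-img create -f qcow2 vm-101-disk-1.qcow2 100G
    qemu-img info vm-101-disk-1.qcow2
    # reports a virtual size of 100G but a disk size of only a few hundred KB,
    # which then grows as the guest actually writes data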




This has two advantages as far as I can see:-



1) Better use of the available storage capacity, which means more virtual machines.



2) Backup of file-based virtual disk images appears to be a lot faster than LV-based storage. I suspect this is because vzdump has to process the entire logical volume, including unused space.



The disadvantages appear to be as follows:-



1) An interface is needed to read file-based virtual disk images, which cannot be mounted as raw volumes.



2) Performance is degraded compared to raw volumes.



3) Corruption is hard to fix, but the risk is mitigated to some extent by carrying out regular backups (see the qemu-img note below).
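On point 3, qemu-img does at least ship a consistency checker that can detect and, within limits, repair qcow2 metadata damage (the file name is an example; run it against an image that is not in use):

    qemu-img check vm-101-disk-1.qcow2          # report inconsistencies and leaked clusters
    qemu-img check -r all vm-101-disk-1.qcow2   # attempt to repair what it can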



I would be most interested to hear anyone's input on my questions, and my apologies if I have got anything wrong. Please correct me :)
 
"If I were to use Hardware RAID 6 or 50 this would further increase performance for read/write applications not in the CacheCade and would provide hardware level redundancy on top of the CEPH redundancy."

For maximum performance and IO you should consider RAID 10. In my experience, RAID 10 is a requirement for virtualised RDBMS workloads.
 
Following is "My Opinion" based on "My Experience" only.

CEPH RBD storage only supports raw images. CephFS storage supports all formats, including qcow2 and vmdk, and the same CEPH hardware platform/cluster can serve both RBD and CephFS. Proxmox 3.2 does not have any option to configure CephFS through the GUI at this moment, so all setup and monitoring has to be done through the CLI; it is very simple to set up CephFS that way. It is said that CephFS is not yet industry standard and thus not stable enough, but I personally have not had any issue "yet" with it.

I use RBD for all VM storage and CephFS for all ISO/template and first-stage backup storage. One of my CEPH clusters has 48 TB of storage space and is in production with both RBD and CephFS set up. On the same hardware I have two CEPH clusters for SSD-based and HDD-based OSDs: SSD OSDs for the primary VM OS virtual disks and HDD OSDs for the other VM virtual disks.
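Roughly, the two entries in /etc/pve/storage.cfg can look like this, with the CephFS mount added as a plain directory storage since there is no CephFS type in the GUI yet (IDs, IPs and paths are placeholders, not my actual setup):

    rbd: ceph-vm
            monhost 10.10.10.1;10.10.10.2;10.10.10.3
            pool rbd
            username admin
            content images

    dir: cephfs-store
            path /mnt/cephfs
            content iso,vztmpl,backup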

This leads to the first question, about RAID. In all of my CEPH/Proxmox clusters I do not have a single hardware or software RAID. CEPH performance increases as the number of OSDs goes up. When dealing with more than 8 OSDs per node, it is best not to use an SSD for journaling: if for whatever reason the SSD goes sideways, it will take down all the OSDs that use it as their journal. Co-locating the journal on the same disk as the OSD is the best choice for a high-capacity CEPH cluster in my opinion; that way, if an OSD goes down, only the journal for that OSD is lost.
For sure, physical RAID adds a layer of redundancy, but it adds that layer without adding much value: the benefit you receive does not justify the cost of setting up RAID to handle OSDs. It does make sense, though, to use physical RAID for the node's operating system SSDs or HDDs; a simple RAID 1 mirror is more than good enough for OS disks. CEPH allows us to think of RAID at the cluster level rather than at the SSD/HDD level. In a CEPH environment, when an entire node goes down along with all the OSDs in it, the storage keeps running while we replace the node or do whatever is necessary, just as RAID within a node allows the node to keep going when an HDD or two fails.
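When a whole node does drop out, the view you watch is at the cluster level rather than in a controller utility; the usual checks are:

    ceph -s          # overall health and degraded object counts
    ceph osd tree    # shows which OSDs on which host are down
    ceph -w          # watch recovery as data is re-replicated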

Again, this is based on my experience of running multiple busy production clusters, including multiple database servers, where users are quite happy with the speed and we have not noticed any major performance issues.
 
From what I have been able to read, the same rule of thumb applies to CEPH as to ZFS: never use RAID of any kind, neither software nor hardware, since ZFS and CEPH are much better at doing this themselves. Both ZFS and CEPH have sophisticated built-in storage algorithms that only work optimally when given direct access to the disks, and they require that direct access to maximize performance.

In my opinion, CEPH and ZFS are better compared to SAN storage than to simple disk-based storage. You would never add RAID between the SAN OS and its disks, would you?