No such block device - Proxmox 3.2 and Ceph configuration

Dietmar, these definitely are NOT decisions belonging to Proxmox development but to the users. There are many use cases which require something other than raw, direct SATA/SAS devices! Ours included.

We are planning to build a sizeable cluster, and if we cannot use partitions it would mean wasting *half* of all drive bays just on the Proxmox OS! Not very sensible.
We use Dell cloud nodes in a 3-node/2U configuration. That is 4 disks per node with 2 CPUs, 12 disks total in the chassis. There is no way to mount 2x SSD inside the chassis, and USB sticks are not reliable.
This means we make a RAID10 array from 4x 80G partitions on the drives, and all of the space we want to use for Ceph is partitions.
We will add discrete storage nodes too, but at least at the start we need to utilize what we already have.
We were planning to do 30 nodes to start with ... That would be 120x 3TB SATA drives.
Wasting half of the drive bays on RAID1 just for the OS and perhaps journaling makes absolutely no financial sense; every single drive bay & SATA/SAS port is very, very precious.

Now I'm left wondering how much of a PITA it will be to manage without using the pveceph tools for OSD creation (what else does pveceph do, what does it save, etc.), and what issues it will create. I managed to activate 4 OSDs from partitions using the usual Ceph tools, but now I'm left on the Proxmox GUI with all kinds of timeouts etc. when trying to manage things ...

Hi, with 30 nodes you can't use Proxmox anyway, because of the corosync limitation.
pveceph is great for easily managing small clusters.

For big clusters, I recommend installing Ceph directly on a distro (CentOS/Ubuntu/Debian, whatever you want),
and using ceph-deploy (see the ceph.com docs) to create OSDs and mons.
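For reference, the usual ceph-deploy flow looks roughly like this (hostnames and device names here are just examples):
# ceph-deploy new mon1 mon2 mon3
# ceph-deploy install mon1 mon2 mon3 osd1 osd2
# ceph-deploy mon create-initial
# ceph-deploy osd create osd1:sdb osd2:sdb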

You can still use Proxmox as a client to create VM disks, manage snapshots, ...
 
I don't know which kind of RAID controller you use,

but if you want to use partitions for Ceph (it's not recommended by the Ceph team, and I'm not sure it works),

with an LSI/Dell PERC RAID controller it's possible to create multiple virtual disks on top of the full RAID.

That way you'll have virtual /dev/sdX disks, and it'll work 100% with Ceph.
 

Thanks! The corosync limitation was something I did not know about and did not see mentioned in the documentation.
 

Now that is not very sensible at all -- to use RAID before Ceph :(
Kinda ruins the idea of Ceph.

I don't use RAID cards; unless you go with the costliest options they tend to ruin performance and sometimes even reliability. I've long suspected that at least Adaptec makes its cards go slower intentionally over time! Benchmarks were pitiful for the supposedly highest-end card Adaptec once made.

We use A LOT of servers from different datacenters around the world, and almost every time there is HW RAID present the disk I/O subsystem performance is quite pitiful ... unless you can go for JBOD + soft RAID ^_^
Of course, not all cards are like this; if it's a high-end brand-new adapter they tend to work fast - so I don't understand why the older-gen adapters are slower than the consumer on-board chips of the same time period!

As for Ceph using partitions: no issues whatsoever - works just fine :)
You do realize that what Ceph essentially does on OSD creation is run mkfs and put metadata in there?
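For anyone curious, the usual manual sequence per partition is roughly this (device, OSD id and CRUSH host below are placeholders; 'ceph osd create' prints the new id, 4 is assumed here):
# ceph osd create
# mkfs -t xfs /dev/sda4
# mkdir -p /var/lib/ceph/osd/ceph-4
# mount /dev/sda4 /var/lib/ceph/osd/ceph-4
# ceph-osd -i 4 --mkfs --mkkey
# ceph auth add osd.4 osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-4/keyring
# ceph osd crush add osd.4 1.0 host=$(hostname -s)
# service ceph start osd.4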
 
I agree with nucode that RAID with Ceph is just unnecessary, unless you are using a RAID card with large cache memory, which will somewhat eliminate the need to have an SSD for journaling. Ceph does excellent work at what it is designed for: big data storage with a high level of redundancy. In my opinion it is a self-sustaining system and not to be mixed with anything such as ZFS, RAID, iSCSI, etc.
We have had several Ceph deployments running for a few years in a very demanding environment. Yes, there were some issues, but every one of them was due to "human" error and ignorance. And yes, some of the issues were caused by trying to mix. We use these deployments for just about ALL storage needs, so in a way we abuse our Ceph systems to the max. :D Knock on wood, we have not had any issues so far.

Spirit made a good point about a 30-node Proxmox+Ceph cluster. But if I recall from my research in the past, I think we can push it to 64? Not that it cannot be done, it just becomes somewhat unstable. I think somebody in the Proxmox community had a large single cluster like that without any issue. All of our Ceph deployments are on Proxmox nodes. I personally enjoy checking on all the nodes from one GUI. I totally understand the argument of why run a hypervisor on a Ceph node when all they are there for is to be storage nodes. Again, it is nice to see all active nodes from one cluster GUI. Recently I started using Zabbix to monitor networks, so maybe this will change. It will become mandatory to move to OS-only Ceph nodes when we go beyond 50 nodes. It's worth mentioning, though, that some of the Proxmox+Ceph nodes we have are not fully slimmed down. They have some above-average hardware, so in case we have a crisis we can move some VMs to them temporarily. Nice to have some buffer.

Somewhere in this thread pveceph was mentioned. As I understand it, it is a custom tool by the Proxmox team that provides some Ceph functionality which is very helpful for integrating Ceph into Proxmox initially. After Ceph is up and running through pveceph, pveceph is hardly needed. pveceph is not there to replace all Ceph commands.

If you know your Ceph storage is going to be massive with a very large number of nodes, it makes sense to put it on its own nodes with a slimmed-down distro such as Ubuntu.
 

Glad to hear from someone who has used Ceph for years! :)
Our target market and business model are such that 30 nodes is nothing; even 64 nodes is nothing. We do pure mass-market applications at a limited margin. Sometimes we only have 2-3 users on a node!

Have you had your OSDs 60%+ full? Someone somewhere noted performance issues when OSDs get more than 60% full.

So the way Proxmox does clustering is going to limit us to a max of 64 nodes per cluster?
Probably a couple of Proxmox clusters can share the same Ceph cluster - have you tried this?
 
Have you had your OSDs 60%+ full? Someone somewhere noted performance issues when OSDs get more than 60% full.
So the way proxmox does clustering is going to limit us to max 64 nodes per cluster?
Probably a couple of proxmox clusters can share the same ceph cluster, have you tried this?

Our expansion of Ceph storage takes place when it is between 60-70% full. So yes, I have seen over 60% usage, but never 70%. No noticeable issues, if there were any, above 60%. Usually we bring it down to 50% or less within 48 hours.
I have not heard of any issues unless it is 85%+ full. At 90% I believe Ceph starts giving a health warning. But in a busy environment I don't think we should wait till it hits 85% usage anyway, issues or not.

It is not Proxmox that limits you to 32 or 64 nodes. Corosync itself has a limit of 64, which is hardcoded. The Proxmox developer team tests Proxmox on a maximum of 16 nodes in their lab, but that does not mean we cannot use more than 16. I believe other hypervisors such as Hyper-V also have a maximum limit of 64 nodes per Windows Server 2012 cluster, with a maximum of 8,000 VMs per cluster.
We implement cluster hardware on a per-42U-rack basis. By using 4 nodes in 2U, we can cram a 60-node Proxmox cluster into a rack, which leaves room for switches, PSU, and UPS in the same rack.
 

Nice setup! Sounds to me like you are also using Dell cloud nodes, or the Supermicro version?
Dell cloud nodes use Tyan & Supermicro motherboards, at least.

We are still planning which sizes we will be deploying; the first cluster of course will be built out slowly, but it's quite possible we will initially just build 12-node mini clusters to minimize risk, and that just so happens to be the port count of the 10GbE switch model we will likely be using ;)

Have you had any data corruption issues with Ceph? What we are afraid of is that something could potentially destroy the whole cluster into an unrecoverable state, hence losing all (or most) customer data residing in that cluster.
Failures do happen, and in our target market even more so; our target market is very cost-conscious and ultimately we need to get the VM price down to the level of a cheap dedicated server(!!!). To reach this we use a lot of recycled hardware - but HDDs we always buy new, though mostly consumer versions (according to Backblaze, no significant reliability difference). Toshiba 3TB (aka HGST) and Seagate Archive drives are currently our go-to models :)

What I'm thinking of is running either 3 or 5 mons, and adding OSDs as we go, but certain server models we use require a hard reboot to add drives, since for power savings those ports are completely disabled if no disk is present during boot oO;
We will add 4 disks per compute node: RAID10 for the OS, and partitions for Ceph. Most likely these will be consumer SSDs (1TB size). Then we'll use 12-bay 2U dual-Xeon storage nodes for the magnetic drives, which will also house 2x SSD in the chassis. Magnetic drives go on jerasure, probably an 18+2 or 16+4 setup, with SSDs as a cache tier (replica 2, so half of the space usable). Also, OSes will remain 100% on SSDs as linked clones, and data will utilize the SSD cache pool. Magnetic drives will be a mix of everything that comes out of normal production, so 2TB, 3TB and 8TB, and initially while we have slots vacant we were thinking of just throwing in every single working drive we've got, so potentially also 250GB, 500GB and 1TB models until we swap those for new 3TB or 8TB drives.

Our thinking was to have drives at around 85-90% full, so we don't need to have so many drives floating around. There will probably be a 1:10-14 ratio of SSD to HDD in terms of size (raw).
 
Nice setup! Sounds to me like you are also using Dell cloud nodes, or the Supermicro version?
Dell cloud nodes use Tyan & Supermicro motherboards, at least.
We use 4 different models, but for Proxmox computing nodes this one is my favorite:
http://www.supermicro.com/products/system/2U/2028/SYS-2028TP-HTR.cfm

For Proxmox+Ceph nodes we use this:
http://www.in-win.com.tw/Server/zh/goods.php?act=view&id=IW-RS212-02


We are still planning which sizes we will be deploying; the first cluster of course will be built out slowly, but it's quite possible we will initially just build 12-node mini clusters to minimize risk, and that just so happens to be the port count of the 10GbE switch model we will likely be using ;)
For the Ceph network backbone we use 40Gbps Infiniband. If you are building small with a 12-port 10GbE switch, keep in mind that you will need one dedicated port for the Ceph cluster (sync) network. To be really efficient you will need 3 switches: 1 for the Proxmox cluster, 1 for the Ceph public network and 1 for the Ceph cluster sync network.
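The split itself is just two settings in ceph.conf (the subnets here are made up):
[global]
    public network = 10.10.10.0/24
    cluster network = 10.10.20.0/24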

Have you had any data corruption issues with Ceph? What we are afraid of is that something could potentially destroy the whole cluster into an unrecoverable state, hence losing all (or most) customer data residing in that cluster.
Yes, we did have some data corruption, with one big one being last year. But every time it was not Proxmox+Ceph's fault - mostly user error. The last big one was caused by none other than me. :p I cannot remember the details, but I tried something I should not have tried on a production cluster. We had good backups to fall back on, so the actual data loss was minimal. But the hassle and some downtime were painful.
I think this goes for any system: don't test & trial on a live system, and know what you are doing before trying something. In my case I did not read up fully, so I missed a step or two, which caused the disaster.

Failures do happen, and in our target market even more so; our target market is very cost-conscious and ultimately we need to get the VM price down to the level of a cheap dedicated server(!!!). To reach this we use a lot of recycled hardware - but HDDs we always buy new, though mostly consumer versions (according to Backblaze, no significant reliability difference). Toshiba 3TB (aka HGST) and Seagate Archive drives are currently our go-to models :)
I fully understand the need to save on initial cost by using recycled hardware. We followed a similar path years ago. But I will caution you, and so will anybody else with some level of Ceph experience: try not to go too cheap with hard drives. Especially the way you are thinking, I see a huge disaster coming. Here is what I mean:
Although you can mix and match different sizes of HDD in a Ceph cluster, you have to maintain some form of balance. For example, let's say you have 512GB, 1TB and 2TB HDDs. Don't end up with a few nodes holding the majority of one kind. The following scenario is a bad idea:
Node 1: 4 x 512GB, 2 x 1TB
Node 2: 1 x 512GB, 1 x 2TB
Node 3: 3 x 2TB
..................................
The following scenario is a good idea:
Node 1: 2 x 512GB, 2 x 1TB, 1 x 2TB
Node 2: 2 x 512GB, 2 x 1TB, 1 x 2TB
Node 3: 2 x 512GB, 2 x 1TB, 1 x 2TB
...................................

The size of the drives does not matter; what matters is equal distribution of the drives. This gives Ceph a good chance to write evenly. It is also very important when you have drive or node failures: the way your data is distributed directly affects recovery. If some nodes end up with more of the bigger drives, and thus more data, and such a node fails, you are looking at a major issue. Even if you have to go out and buy certain drives to ensure that all sizes are evenly distributed, it is well worth it.

What I'm thinking of is running either 3 or 5 mons, and adding OSDs as we go, but certain server models we use require a hard reboot to add drives, since for power savings those ports are completely disabled if no disk is present during boot oO;
Use caution when rebooting a node to add drives. When your Ceph cluster is up and running, every time there is a node/HDD failure the cluster will go into rebalancing mode. Even when there are no failures and you have simply rebooted, the cluster will still think something has failed. Tell the cluster not to rebalance before every reboot; the following one-line command will do the trick:
# ceph osd set noout
After you have rebooted, simply unset the noout option like this:
# ceph osd unset noout

This way you can reboot without triggering rebalancing. And by all means do not reboot multiple nodes and HDDs at the same time. Do one node at a time.
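So the full flow around a reboot ends up roughly like this (wait for the OSDs to rejoin before unsetting):
# ceph osd set noout
(reboot the node, then check that its OSDs show as up/in again with "ceph osd tree")
# ceph -s
# ceph osd unset noout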

We will add 4 disks per compute node: RAID10 for the OS, and partitions for Ceph. Most likely these will be consumer SSDs (1TB size).
Why RAID10 for the OS drives? Are you putting the Ceph journals on the same OS SSDs?

Our thinking was to have drives at around 85-90% full, so we don't need to have so many drives floating around. There will probably be a 1:10-14 ratio of SSD to HDD in terms of size (raw).
I do not know the depth of your Ceph knowledge, so if I am saying what you already know, I apologize. I would not recommend running a Ceph cluster above 85% usage continuously. Also, did you take into consideration how Ceph uses space with replicas and all? For example, let's say your total raw cluster size is 144TB. This does not mean that you can store that much user data. With replica 3, any data you store gets written a total of 3 times. So let's say you have stored 15TB of customer data; Ceph will actually consume about 45TB of space due to replica 3. With replica 2 it will use 30TB of space. You get the idea. Some see this as a drawback, but I see it as a very small price to pay considering what Ceph does.
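To put rough numbers on it, ignoring the near-full headroom:
usable ≈ raw / replica count
144TB / 3 = 48TB usable at replica 3
144TB / 2 = 72TB usable at replica 2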
 
Thanks for the input! :)

For the Ceph network backbone we use 40Gbps Infiniband. If you are building small with a 12-port 10GbE switch, keep in mind that you will need one dedicated port for the Ceph cluster (sync) network. To be really efficient you will need 3 switches: 1 for the Proxmox cluster, 1 for the Ceph public network and 1 for the Ceph cluster sync network.

I did not know about the cluster sync network! In none of the examples I've seen has there been any mention of it. So does the sync network carry rebalancing and other OSD-to-OSD traffic, or what does it do? Does it need as much bandwidth?

Of course I haven't been considering 40Gbps Infiniband - that would probably be cheaper than 10GbE - but I've had a terrible experience with Infiniband adapters & Linux drivers living up to advertised specs :(
Do you run IPoIB or plain native Infiniband?

Our nodes have only one PCIe slot, mini size, to use, and the 10GbE adapters only provide 2x 10GbE + 2x 1GbE.

Yes, we did have some data corruption, with one big one being last year. But every time it was not Proxmox+Ceph's fault - mostly user error. The last big one was caused by none other than me. :D I cannot remember the details, but I tried something I should not have tried on a production cluster. We had good backups to fall back on, so the actual data loss was minimal. But the hassle and some downtime were painful.
I think this goes for any system: don't test & trial on a live system, and know what you are doing before trying something. In my case I did not read up fully, so I missed a step or two, which caused the disaster.
Ouch! Good thing to have backups :)
I don't know if I can by any means fit backups for bulk data into our budget. Backups for the OS & pure SSD data are planned, but for bulk storage there is no budget. We need to get below 3€/TiB/month in operational expenses (incl. hardware!) for our target segment. It's that tough!

That's why I'm going back and forth between using Ceph or just doing plain ol' RAID5 per machine. But I'd REALLY REALLY like to have live migration as an option for server maintenance etc.
We have tons of disk space vacant; even after overprovisioning only about 60-70% gets used, and the business sense in me says we need to ramp it up to 90%!
Then there are the scalability and other possibilities; all of this I'm weighing against the risk of losing a significant portion of our customers' data, since once again for cost reasons we want to use the largest possible drive models.
Then again, we also need to weigh in that disks under lower usage don't fail as often ;)
We have all disks currently active all the time, averaging 40-50% utilization as per iostat. Of course, there is a significant number below 30% and a significant number above 60% too.


I fully understand the need to save on initial cost by using recycled hardware. We followed a similar path years ago. But I will caution you, and so will anybody else with some level of Ceph experience: try not to go too cheap with hard drives. Especially the way you are thinking, I see a huge disaster coming. Here is what I mean:
Although you can mix and match different sizes of HDD in a Ceph cluster, you have to maintain some form of balance. For example, let's say you have 512GB, 1TB and 2TB HDDs. Don't end up with a few nodes holding the majority of one kind. The following scenario is a bad idea:
Node 1: 4 x 512GB, 2 x 1TB
Node 2: 1 x 512GB, 1 x 2TB
Node 3: 3 x 2TB
..................................
The following scenario is a good idea:
Node 1: 2 x 512GB, 2 x 1TB, 1 x 2TB
Node 2: 2 x 512GB, 2 x 1TB, 1 x 2TB
Node 3: 2 x 512GB, 2 x 1TB, 1 x 2TB
...................................

Thank you for reminding me! Sometimes I'm just going too fast when building things to think about all the details; I might have forgotten to balance them out! :D

Use caution when rebooting a node to add drives. When your Ceph cluster is up and running, every time there is a node/HDD failure the cluster will go into rebalancing mode. Even when there are no failures and you have simply rebooted, the cluster will still think something has failed. Tell the cluster not to rebalance before every reboot; the following one-line command will do the trick:
# ceph osd set noout
After you have rebooted, simply unset the noout option like this:
# ceph osd unset noout

This way you can reboot without triggering rebalancing. And by all means do not reboot multiple nodes and HDDs at the same time. Do one node at a time.

Which brings me to the question: what if we have a total lights-out event?
It's rare, but it most definitely will happen at some point in time. It's just plain guaranteed to happen no matter what - it even happens to the big players sometimes, so I'm sure we cannot avoid it!

So if all nodes go down at roughly the same time, or say only 1-2 mons and 10-20% of OSDs stay online, what happens?
Is all data lost at that point?
It can't be - that would be a kind of disastrous drawback for Ceph, as this kind of situation is pretty much guaranteed to happen eventually.

Why RAID10 for the OS drives? Are you putting the Ceph journals on the same OS SSDs?
I guess in this case RAID6 might actually be better; besides, I do use software RAID (I don't trust any HW RAID adapters, every single one I've ever used has been total and utter garbage!).
Anyway, the idea is redundancy.

I do not know the depth of your Ceph knowledge, so if I am saying what you already know, I apologize. I would not recommend running a Ceph cluster above 85% usage continuously. Also, did you take into consideration how Ceph uses space with replicas and all? For example, let's say your total raw cluster size is 144TB. This does not mean that you can store that much user data. With replica 3, any data you store gets written a total of 3 times. So let's say you have stored 15TB of customer data; Ceph will actually consume about 45TB of space due to replica 3. With replica 2 it will use 30TB of space. You get the idea. Some see this as a drawback, but I see it as a very small price to pay considering what Ceph does.
And this right here is why I have not considered Ceph before; I needed to wait for erasure coding to mature a bit! :D
So erasure coding at 18+2 means that for 18TB of customer data only 20TB is consumed! :D
Then add the SSD cache (cache tier support was recently included in Ceph) with replica 2. I cannot justify to myself going for replica 3 :(
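If I read the Firefly-era docs right, the pieces for that would be created along these lines (profile/pool names, PG counts and the failure domain are just placeholders):
# ceph osd erasure-code-profile set ec18p2 k=18 m=2 ruleset-failure-domain=host
# ceph osd pool create ecpool 2048 2048 erasure ec18p2
# ceph osd pool create cachepool 2048 2048
# ceph osd pool set cachepool size 2
# ceph osd tier add ecpool cachepool
# ceph osd tier cache-mode cachepool writeback
# ceph osd tier set-overlay ecpool cachepool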

We obviously have different perspectives and goals here - my most important goal is to get sensible costs with maximum capacity and decent performance.
Though with Ceph, reliability & redundancy need to be a major objective - but not to the point where our business plan becomes unviable.
For example, we need to be able to provision 16GiB RAM + 4TB storage with at least 200 IOPS at any given time for less than 20€ a month - far less, since that is our benchmark; the design goal needs to be somewhere around 13€, preferably 10€.

Our initial common compute node will be: 48GiB of RAM, 12 cores and 4x drives (whether HDD or SSD; OS + Ceph OSD). Our operational cost for such a system is roughly 13-14€ a month.
Though the next models will probably be 96GiB of RAM, 8 cores, 3x drives.

It may sound impossible, but with smart purchasing and a very frugal approach I believe this can be done by leveraging economies of scale at each step of the way.
You wouldn't believe how little I pay for one such node as described above; they are almost free in practice compared to the drive, ops & infrastructure expenditure ;)

We use systems very similar to the ones you linked - thanks for linking, that confirms the chassis specs I've been considering are the correct ones! :D Though ultimately I'd love to use the Backblaze pod as a storage chassis, but 48 drives per chassis ... that's a bit too much (rich) for me!
I think our chassis could actually take the Supermicro mobo used in the model you linked: http://www.supermicro.com/products/motherboard/Xeon/C600/X10DRT-P.cfm
It looks like exactly the same form factor, and some of our nodes actually use an older-gen similar Supermicro mobo :D
Though we use the 3.5" drive model, since initially we did not do virtualization and needed the lower price & higher capacity of 3.5" drives.
 
I did not know about the cluster sync network! In none of the examples I've seen has there been any mention of it. So does the sync network carry rebalancing and other OSD-to-OSD traffic, or what does it do? Does it need as much bandwidth?
This is not an absolute must; your Proxmox+Ceph nodes will still work. But when you have everything running on a single network interface, you will notice significant performance degradation when the cluster is rebalancing, due to bandwidth consumption. Yes, the Ceph cluster sync network is for rebalancing and OSD-to-OSD (replication) traffic. This leaves the public-side network free of congestion when a large amount of replication is taking place in Ceph.

From your words thus far, it is very evident that you are running on an extremely low budget while planning to provide the best possible service. There is absolutely nothing wrong with that. But in my opinion there are a few things you need to consider, and base all your options/choices on them, so you are not in very big trouble later on. From your business plan it seems like you will be holding your clients' data. As a service provider, ensuring data safety comes above anything. If data is lost it is hard to replace, compared to downtime due to hardware failure. A second storage system to hold backups is absolutely necessary regardless of what storage system you use. Even if you salvage a 10-year-old PC and turn it into a backup node, I think you should still do that.

Ceph is designed for big data. Unlike other storage systems, Ceph gets faster as it grows in number of OSDs. It is easily scalable by simply adding a new node or HDD without changing a thing in the entire system. It can take multiple drive failures, including whole nodes. For an initial small environment it does add some level of complexity to properly understand and maintain Ceph. Only you can weigh the benefits of Ceph against your business model.

One other option you can look at is ZFS+Gluster. You can start off with just 2 nodes replicated to each other. By using ZFS underneath you eliminate the need for any form of hardware RAID. By putting Gluster on top of ZFS you gain node-level redundancy. Our backup storage cluster is made of such a setup. This gives you the option to use .raw, .vmdk and .qcow2 for your VMs, and even OpenVZ/LXC containers. Ceph RBD can only store .raw, unless you use CephFS, which can store anything.
The days of plain old hardware RAID are going away, I think. I personally Do Not use hardware RAID with anything.
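As a very rough sketch of such a 2-node setup (pool, brick and volume names are made up; the disk list is just an example):
# zpool create tank raidz2 sdb sdc sdd sde sdf sdg
# zfs create tank/brick1
# mkdir /tank/brick1/data          (on both nodes)
# gluster peer probe node2
# gluster volume create gvol0 replica 2 node1:/tank/brick1/data node2:/tank/brick1/data
# gluster volume start gvol0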

We have tons of disk space vacant; even after overprovisioning only about 60-70% gets used, and the business sense in me says we need to ramp it up to 90%!
Then there are the scalability and other possibilities; all of this I'm weighing against the risk of losing a significant portion of our customers' data, since once again for cost reasons we want to use the largest possible drive models.
Smaller-size drive models will give you a better option to grow slowly while saving money, since the replacement cost will be lower and you can buy in higher quantity, so your number of OSDs goes up. What is your initial data storage requirement - I mean, how much data are you dealing with when you go live?

Which brings me to the question: what if we have a total lights-out event?
It's rare, but it most definitely will happen at some point in time. It's just plain guaranteed to happen no matter what - it even happens to the big players sometimes, so I'm sure we cannot avoid it!
By total lights-out event I think you mean if quorum is lost. Simply put, bad things happen when quorum is lost. This applies to anything that runs on quorum, not just Ceph. In the case of Ceph, your storage will become inaccessible. Depending on what kind of failure it is, your data may be lost or not. For example, if over half of all nodes die completely but the HDDs are fine, then simply replace the nodes, put the OSDs in them, and the cluster will try to rebalance itself. But if over half of your drives died at the same time, then yes, you will have massive data loss, since there will be nothing left to rebalance from. Let's look at this example: you have a file named myfile.txt. After you copied it to Ceph storage it got chunked into 3 pieces and stored on OSD.1, OSD.9 and OSD.17. When you had your mass drive failure, these 3 drives happened to be part of it. So in this case you lost that file completely. This is of course an overly simplified example, but you get the idea.
Think about hardware RAID: if you have more HDD failures than what the RAID was configured for, your entire RAID array is lost. Similar logic applies to Ceph, just with a higher redundancy level.
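For the monitor side specifically, quorum is plain majority arithmetic:
quorum = floor(mons / 2) + 1
3 mons -> quorum of 2, so you can lose 1 mon
5 mons -> quorum of 3, so you can lose 2 mons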

It may sound impossible, but with smart purchasing and a very frugal approach I believe this can be done by leveraging economies of scale at each step of the way. You wouldn't believe how little I pay for one such node as described above; they are almost free in practice compared to the drive, ops & infrastructure expenditure ;)
Anything is possible with proper planning, research and the right mindset. We started our production-level Ceph adventure with just 4 nodes, and now it is much, much larger than that. Just make sure to cover the basics.
 
From your words thus far, it is very evident that you are running on an extremely low budget while planning to provide the best possible service. There is absolutely nothing wrong with that. But in my opinion there are a few things you need to consider, and base all your options/choices on them, so you are not in very big trouble later on. From your business plan it seems like you will be holding your clients' data. As a service provider, ensuring data safety comes above anything. If data is lost it is hard to replace, compared to downtime due to hardware failure. A second storage system to hold backups is absolutely necessary regardless of what storage system you use. Even if you salvage a 10-year-old PC and turn it into a backup node, I think you should still do that.

You are very right - a very limited budget. But most people get this wrong - I have no problem spending money, as long as the per-capacity/performance/node or whatever ratio is RIGHT.
It's what our target segment is - 99.9% of our customers are private persons, storing whatever not-so-essential data. For 95% of our customers, data loss is no big deal.
We've had many data failures, mainly because we used to run solely on RAID0 on leased hardware - hundreds of users have lost data, yet we've had so few complaints about it that I could probably count them on my fingers.
Not a single time was there data the customer could not replicate, however - as far as we know at least.
Our customers are willing to pay about NADA for data redundancy.

THAT being said, we are going for a slightly different target market with the new services, still mainly private persons, and we will move the obligation for primary backups to the users; we will outright tell them: the OS is backed up, bulk data is not.
But you can also buy backup space from us if you want to :)

This is a funny market in the sense that data safety is not priority #1, cost is! 40-50% of our customers are pretty happy getting 1/4 of the service if they save 30%.
That being said, almost all of our own hardware runs on RAID5/RAID10, and all leased nodes with at least 4 disks or running premium services also run on RAID5.

For backup, the biggest problem probably is bandwidth & electricity, not the initial buy-in; since backups don't require performance we can just use 8TB drives in RAID5.

One other option you can look at is ZFS+Gluster. You can start off with just 2 nodes replicated to each other. By using ZFS underneath you eliminate the need for any form of hardware RAID. By putting Gluster on top of ZFS you gain node-level redundancy. Our backup storage cluster is made of such a setup. This gives you the option to use .raw, .vmdk and .qcow2 for your VMs, and even OpenVZ/LXC containers. Ceph RBD can only store .raw, unless you use CephFS, which can store anything.
The days of plain old hardware RAID are going away, I think. I personally Do Not use hardware RAID with anything.
This is going to sidetrack a bit, but anyway: that's probably the only case where ZFS works - one or a few users max, tons of sequential data. ZFS nuked our customer data due to corruption, 10-13 disks worth of it, not once, not twice but THREE times in the span of just a couple of months.

Further, the primary goal of the project was reliability. They fail it miserably by not dropping into read-only mode when too many disks fail; it will keep happily "writing to all disks" *facepalm*, and this is another way it nuked data. The disk failures were diagnosed as failing SATA cables -> we learned a hard lesson: go for the CHEAPEST-looking ones, those are going to work in whatever case. Go for the PREMIUM SATA cables? They absolutely never ever work, just cut them up and throw them away!

THE PROBLEM was that due to these cables the connection to the HDD was intermittently lost. We sent probably 30 drives for warranty replacement due to the FPDMA errors this causes in the SMART data (yea, Seagate still replaced them!) before realizing the SATA cables were the issue.
This occasionally resulted in the loss of 4 drives, because when a drive comes back it's not automatically taken back into use, so sometimes before I got around to replacing it the data was already nuked.

Then we get to performance ... Unless it's 1 to a few users, sure, it's amazing at SEQUENTIAL. Random IOPS: don't dream of it! This is a total design failure of ZFS, and they will admit this is the case. It happens because ZFS has been designed to activate ALL drives (within a vdev) for ALL I/Os, which limits you on a 12-disk array to something like 2 disks' worth of IOPS.

THE GOOD: ZFS L2ARC is truly amazing! The only downside is the extremely long warmup period (weeks and still not warmed up!). It is the only SSD caching I've actually seen work well; the others were not worth the money (in our case).

</sidetrack rant>

Do you really use replica 3 on your main Ceph cluster, with backups at replica 2 on top of that in practice? Wow, talk about waste! (well, sort of)
It shows very clearly how different our target markets are :) I could never justify a 3-fold, let alone 6-fold, price hike to our customers.

My plan is to back up the OS & other SSD-only data, but bulk data we just can't do unless the customer pays for it directly :( There's just no room in the budget to duplicate data requirements.

Smaller-size drive models will give you a better option to grow slowly while saving money, since the replacement cost will be lower and you can buy in higher quantity, so your number of OSDs goes up. What is your initial data storage requirement - I mean, how much data are you dealing with when you go live?

Right now my testing Proxmox cluster is a total of 44TB raw, on 4-disk RAID5 arrays: ~30TiB usable space and 144GB of RAM, with 36 cores :)
Optimally I'd like to have our first cluster at around 70-80TB usable, but we can do with far less. Probably a total of 24 disks on 6 nodes.
This would probably grow out to be around a 78-disk array, with 12x SSDs of 1TB or 2TB size for the replica 2 cache tier + OS images. The remaining 66 disks would be split probably 60x 3TB and 6x 8TB on jerasure 18+2, so about 205TB usable.
I might even be able to justify replica 3 for the OS images, since these are going to be a standard 20G :)
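Sanity-checking that estimate (raw HDD capacity times the 18+2 erasure-code efficiency):
raw = 60 x 3TB + 6 x 8TB = 180TB + 48TB = 228TB
usable ≈ 228TB x 18/20 ≈ 205TB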

By total lights-out event I think you mean if quorum is lost. Simply put, bad things happen when quorum is lost. This applies to anything that runs on quorum, not just Ceph. In the case of Ceph, your storage will become inaccessible. Depending on what kind of failure it is, your data may be lost or not. For example, if over half of all nodes die completely but the HDDs are fine, then simply replace the nodes, put the OSDs in them, and the cluster will try to rebalance itself. But if over half of your drives died at the same time, then yes, you will have massive data loss, since there will be nothing left to rebalance from. Let's look at this example: you have a file named myfile.txt. After you copied it to Ceph storage it got chunked into 3 pieces and stored on OSD.1, OSD.9 and OSD.17. When you had your mass drive failure, these 3 drives happened to be part of it. So in this case you lost that file completely. This is of course an overly simplified example, but you get the idea.
Think about hardware RAID: if you have more HDD failures than what the RAID was configured for, your entire RAID array is lost. Similar logic applies to Ceph, just with a higher redundancy level.


Sorry, probably bad phrasing of my question.
So if the cluster all goes down at once, or within a few minutes of each other - when the nodes are rebooted, is Ceph able to recover from the situation, i.e. the same as a RAID check - just checking data sanity?
I'm wondering if it's better to have a big UPS for a small portion of the cluster, or none at all really. <-- Yes, we can't really budget much for a UPS either: our customers are not ready to pay for that either.

Anything is possible with proper planning, research and the right mindset. We started our production-level Ceph adventure with just 4 nodes, and now it is much, much larger than that. Just make sure to cover the basics.

Now you made me want to go back and redo the storage we have live as a Ceph cluster :/
I was seriously considering that, but since we only have 1Gbps internal switches right now, I thought it was not practical. Though those switches also have CX4 ports for IPoIB connectivity ... but none of the Infiniband IPoIB-capable cards I have lying around will fit into the chassis :(
So my plan was to run some Ceph tests from within VMs - but I really also need to take this mini cluster into production swiftly. We have an insatiable hunger for moarrr capacity and I could turn these nodes into profit makers in a week ;)

May I ask what kind of Infiniband switches + adapters you are using? I will plot those as one of our choices as well! :)
IPoIB would give about 34.4Gbps of usable bandwidth and that would be really nice :)

 
I am curious: is your ZFS experience with ZFS on Linux only, and on consumer-grade hardware?

I did use it for a short while under *BSD too. The aforementioned configs were a consumer-grade motherboard + PSU, Kingston ECC RAM (and lots of it), an Opteron CPU, a rackmount chassis, and I think I used Adaptec adapters for the SATA connections.
It was stable after swapping in the "cheapo" SATA cables.
 
Well, then I am not surprised, given the fact that ZFS on Linux was first considered stable and ready for production with the release of 0.6.3 earlier this year. On FreeBSD it has been considered stable for a number of years, which is proven by the number of installations running FreeNAS for several years without a hitch. You say you used Adaptec adapters - were they RAID controllers or HBAs? If they were RAID controllers, your setup was a disaster waiting to happen and a bad design judgment on your part. ZFS means HBA, no more, no less. I personally have been running ZFS on OmniOS (a Solaris derivative) for years and have never seen any problems whatsoever, since I chose hardware according to recommendations - Solaris is known to be picky about hardware.

So before ruling out ZFS entirely, try a setup based on OmniOS.

I am planning on a new setup based on this hardware for home use:
MB: https://www.sona.de/.1872164413-ASRock-Mainboard-E3C224-Sockel-1150
HBA: My existing proven LSI SAS1068
CPU: Intel I3-4170T
RAM: 16GB ECC
Proxmox connection: Infiniband SDR

For your use case I would suggest this (add 2 in a cluster using GlusterFS over ZFS, as suggested by Wasim):
MB: http://www.supermicro.nl/products/motherboard/Xeon/C220/X10SLH-F.cfm
HBA: http://www.newegg.com/Product/Product.aspx?Item=N82E16816118142
CPU: Intel Xeon E3-1200 v3
RAM: 32GB ECC
Proxmox connection: Infiniband QDR (Mellanox QDR ConnectX2)

HBA for the storage pool (16 x 4TB HGST or WD Red), on-board SATA for the OS, and 2 x SSD (RAID1) for log plus 2 or more SSDs for cache. The SSDs could be Intel DC S3500 (cheaper) or Intel DC S3700 (more expensive).
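For what it's worth, a pool along those lines would be created roughly like this (device names are placeholders; splitting the 16 data disks into two raidz2 vdevs is just one possible layout):
# zpool create tank raidz2 sda sdb sdc sdd sde sdf sdg sdh raidz2 sdi sdj sdk sdl sdm sdn sdo sdp log mirror sdq sdr cache sds sdt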
 
Sorry to hear about your experience with ZFS, but I think you are underestimating ZFS a bit. It sounds like you lost HDDs due to bad cables, which may have caused more than the permissible number of HDD failures, causing the array to be lost. The same thing would have happened with physical RAID. ZFS is resilient enough to be considered an enterprise-grade, mission-critical storage system. Before we moved to Ceph we used ZFS for a long time. Never had any issue.

Yes, our target market is completely different than yours. For us, data safety and redundancy come above anything. Given the size of our cluster, the nature of our customer data and the need to keep historical data, replica 3 is very much acceptable. We also have a third ZFS+Gluster setup for cold data storage which is completely offsite. As you can tell from my signature we have a cloud business; anything we use goes through months of tests before we put it in production.
There are several ZFS experts in this forum who can give you even greater detail on ZFS mechanics. Mir is one of them that I know of.

If data safety doesn't matter at all, then I think you should go with Gluster or ZFS+Gluster. Very low initial cost and it just works. You already have experience with ZFS, so you already know it.


Yes, if the Ceph cluster goes down all at once, or within a few minutes for each node, and the nodes are rebooted, Ceph is able to do its own checks and bring the cluster back to a healthy status.

About the UPS: the way you say your customers are, you can get away without any protection at all, including a UPS. If uptime is not important, just let all nodes shut down. Of course you will not be able to gracefully shut down your servers, which could be bad. You can also modify a cheap UPS and add some batteries to it to give you just enough time to shut down everything properly.

With IPoIB you will never get full bandwidth. With enough tweaking you can push close to 20Gbps. It is mainly because of IPoIB overhead. But that's 20Gbps at much less cost than 10Gbps Ethernet.
We use 36-port Mellanox IB switches and dual-port Mellanox ConnectX-3 cards.
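The usual tweaks people mean here are enabling connected mode and a large MTU on the IPoIB interface (paths may vary by distro/kernel):
# echo connected > /sys/class/net/ib0/mode
# ifconfig ib0 mtu 65520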
 
I think this is the reason for your bad performance. X3 cards are not very well supported in OmniOS/Solaris. X2 cards should work out of the box, giving almost theoretical bandwidth (http://comments.gmane.org/gmane.os.omnios.general/1730). The X2 driver 'Tavor' works flawlessly ;-)
I am not using OmniOS/Solaris. My configuration is all Proxmox+Ceph, Proxmox+ZFS+Gluster and Ubuntu+Ceph.
Even during our initial testing we never achieved more than 19Gbps on these 40Gbps QDR cards. But we went with it anyway, since it was still a good speed compared to 10Gb Ethernet, with the future hope that we would come across an updated driver to get closer to that speed. Some say that with IPoIB we will never get that kind of speed unless we use something like RDMA? I think there is something new in prototype, Ethernet over IB (EoIB), which is supposed to eliminate the drawbacks of IPoIB?
 
Well, then I am not surprised, given the fact that ZFS on Linux was first considered stable and ready for production with the release of 0.6.3 earlier this year. On FreeBSD it has been considered stable for a number of years, which is proven by the number of installations running FreeNAS for several years without a hitch. You say you used Adaptec adapters - were they RAID controllers or HBAs? If they were RAID controllers, your setup was a disaster waiting to happen and a bad design judgment on your part. ZFS means HBA, no more, no less. I personally have been running ZFS on OmniOS (a Solaris derivative) for years and have never seen any problems whatsoever, since I chose hardware according to recommendations - Solaris is known to be picky about hardware.

So before ruling out ZFS entirely, try a setup based on OmniOS.

I am planning on a new setup based on this hardware for home use:
MB: https://www.sona.de/.1872164413-ASRock-Mainboard-E3C224-Sockel-1150
HBA: My existing proven LSI SAS1068
CPU: Intel I3-4170T
RAM: 16GB ECC
Proxmox connection: Infiniband SDR

For your use case I would suggest this (add 2 in a cluster using GlusterFS over ZFS, as suggested by Wasim):
MB: http://www.supermicro.nl/products/motherboard/Xeon/C220/X10SLH-F.cfm
HBA: http://www.newegg.com/Product/Product.aspx?Item=N82E16816118142
CPU: Intel Xeon E3-1200 v3
RAM: 32GB ECC
Proxmox connection: Infiniband QDR (Mellanox QDR ConnectX2)

HBA for the storage pool (16 x 4TB HGST or WD Red), on-board SATA for the OS, and 2 x SSD (RAID1) for log plus 2 or more SSDs for cache. The SSDs could be Intel DC S3500 (cheaper) or Intel DC S3700 (more expensive).


Those Adaptec adapters were capable of running HW RAID or JBOD; naturally I used JBOD.
Actually, what I read up on back then claimed it was stable and ready when I tried it. I ran FreeNAS first on the same hardware, btw.

YES - the hardware was faulty (SATA cables), but common sense says that it should drop to read-only until an admin can intervene, instead of corrupting the entire FS. You know, sometimes HDDs are actually recoverable after a failure ;)
Hell, I've even seen brand-name servers occasionally lose connection to drives; all you need to do is pull the caddy out and push it back in and it's fixed.

Then we get to the performance aspect of things: ZFS is just pure & plain defective by design when it comes to random I/O. Random I/O at the end of the day is what matters, unless the workload is sequential-only with very few concurrent I/Os.
I put the *SAME* hardware back as 2x *SOFTWARE* RAID5 arrays, and performance increased at least 6-fold. Certainly, there were no more of those insane 2GB/s peaks as there was no more SSD caching, but I had a stable 1440 IOPS all day & night vs. the ~200-240 peaks with the ZFS setup - and usually it was more like the 140-160 range.

Oh, your suggested HBA's price alone exceeds what I pay for a barebones 12-bay hot-swap server chassis with mobo + dual Xeons + caddies + rail kit + HBA, but excluding RAM.

The 4TB HGST and WD Red have bad price-to-capacity ratios vs. the 3TB Toshiba or re-branded Verbatim. They are essentially identical to HGST drives, since Toshiba took over part of HGST's production; just different labels & a different reported model :)
The WD Red especially also has worse performance vs. the HGST/Toshiba/Verbatim.
HGST is the most reliable, so the Toshiba-badged ones are what I've been using.

Yes, I admit I didn't research properly upfront. I went and believed the damn hype of it being the Holy Grail of local storage, insanely reliable and faster than anything - each of which turned out to be a false claim unless you have a very specific and narrow use case :)
 
Your performance figures indicate that something else must have been wrong too. My 4-disk RAID10 with SSD log and cache easily gives 2,800 random read IOPS and 900 random write IOPS.
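For comparing setups apples to apples, something like fio gives more comparable numbers than guessing from production load (path, size and job count here are arbitrary):
# fio --name=randread --directory=/tank/fiotest --size=4G --bs=4k --rw=randread --ioengine=psync --numjobs=4 --runtime=60 --time_based --group_reporting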
 
