Help me choose storage for cluster

Hi!

I'm currently moving from XenCenter to Proxmox, and I have set up Proxmox in my lab to test it before running it in production.
There are some questions I need help with.

What kind of storage solution would generally work best with proxmox?
My experience is that NFS works quite well, as it enables snapshots and live backups.

However, I have been using Openfiler as the storage OS, and I don't know whether that is the best choice.
Should I try other storage OSes?

My setup will be:
5 - 6 hosts with a 10TB+ storage server (redundant if I can find a decent solution).

Any help would be appreciated! :)
 
Are you aware of the fact that development of Openfiler has been discontinued, and has been for almost 2 years?

Before we can advise you on your choice of storage, you might want to disclose some more about your experience: Unix/Linux, distributions, whether you're allergic to the command line, etc.
 
As a matter of fact, I was not aware that development of Openfiler has been discontinued. Guess I better try something else.
I work with Unix/Linux systems all day, and usually just on the command line, so I am pretty comfortable using cli based stuff.
I have experience with Debian based distros, RHEL based, and some BSD.

I want a storage that is Unix/Linux based, and I am open to any suggestions.
 
Ok, then I would recommend the following:

If you only want to use NFS:
FreeNAS using a ZFS pool. If FreeNAS does not support your hardware, try NAS4Free (FreeNAS is based on FreeBSD 8.3 while NAS4Free is based on FreeBSD 9.1).

If you want to use iSCSI as well:
OmniOS with napp-it as web GUI. Again using a ZFS pool.

Tips and tricks for ZFS:
If speed and IOPS are important, use pools with mirrored vdevs.
If size of storage is important, use RAID-Z.
Hardware RAID is not recommended, but if your storage contains a hardware RAID controller, put it in JBOD mode - ZFS wants full control over your disks.
Use as much RAM as your budget can provide - ZFS _LOVES_ RAM.
A ZFS pool consisting of many small vdevs is better than a pool consisting of one big vdev.
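To make that concrete, a minimal sketch of the two layouts; the pool name "tank" and the disk names are just examples, adjust for your controller:
Code:
# Mirrored vdevs - best for speed and IOPS:
zpool create tank mirror da0 da1 mirror da2 da3 mirror da4 da5

# RAID-Z2 - best for usable capacity, survives two failed disks:
zpool create tank raidz2 da0 da1 da2 da3 da4 da5

# Verify the layout:
zpool status tank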
 
Unless you are being forced by company policy to buy overpriced, licensing-plagued "certified" stuff (i.e. a SAN), there's really only 2 options to consider afaik:

  1. NAS: buy regular storage servers and export their storage via NFS and/or iSCSI. Depending on your preferences you can go for one of the following OS solutions:
    • Oracle Solaris (most recent ZFS versions)
    • OpenSolaris based NAS solutions (FreeNAS, Nexenta, whatHaveYou), older ZFS version
    • *BSD based ZFS, older ZFS version
    • ZFSonLinux (exactly what the name suggests), also older ZFS version (don't know if this is considered production stable!)
    • Any Linux with XFS (replace XFS with BTRFS once that's stable)
    All non-Oracle-Solaris ZFS solutions only have access to an older ZFS code dump and therefore cannot receive zpool updates unless Oracle decides to release newer zpool version code.

    Also for safety reasons you would probably want to mirror your NAS to a second failover NAS via DRBD or such
  2. ceph rbd: works with pretty much every kind of hardware you can possibly imagine. Support for ceph rbd is currently being implemented and tested in pvetest.
    If there were a single website that told you in a pinch what ceph is, I would link you to it, but alas, Ceph was made by an academic and as such it lacks good presentation.

    It's basically a fairly new and unique way of handling storage where you cluster as many hard disks in as many servers as you want into one huge storage cluster (I'm deliberately trying not to use the word "cloud" here to avoid triggering any bullshit bingo alarms) and thereby eliminate ALL SPOFs (single points of failure) that every other storage solution has. This cluster will then ensure your data is stored securely by distributing it across multiple racks, multiple sites, multiple server rooms (to even be able to mitigate a burning server room). For this you would typically buy regular NAS-type storage servers too.
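Once a ceph cluster is up, VM disks live as RBD images inside a pool; roughly like this (pool and image names are made-up examples, not anything from this thread):
Code:
# Create a pool with 128 placement groups and a 32 GB image in it:
ceph osd pool create vmpool 128
rbd create --size 32768 vmpool/vm-101-disk-1

# List images and watch overall cluster health:
rbd ls vmpool
ceph -w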

Depending on the amount of time you got to present your new platform, you may wish to evaluate proxmox first and use a storage system you can easily wrap your head around (NAS-type) and maybe after that evaluate ceph individually... or vice versa, depending on preferences
 
there is a lengthy post of mine in moderation queue that basically lists ZFS and ceph as options (I think you reliably trigger the moderation queue by using swear words :S)

What I wanted to add to my post, and basically also to mir's: ZFS has one big problem: fragmentation. ZFS does not offer anything to deal with fragmentation (even in the most recent Oracle Solaris versions), which means that once your ZFS NAS deteriorates in performance due to fragmentation, you are forced to copy the whole NAS over to a different / maybe bigger NAS box and resolve fragmentation that way.
 
as long as the used space of the storage array is below 80%.
That's exactly it, though. It's something to be wary of - you may run into heavy CoW-related fragmentation later on (btw: CoW is sadly not optional in ZFS), forcing you to upgrade.
 
Ok, then I would recommend the following:

If you only want to use NFS:
FreeNAS using a ZFS pool. If FreeNAS does not support your hardware, try NAS4Free (FreeNAS is based on FreeBSD 8.3 while NAS4Free is based on FreeBSD 9.1).

I thought of using FreeNAS, but isn't that supposed to run off a USB stick, or is this not correct?

If you want to use iSCSI as well:
OmniOS with napp-it as web GUI. Again using a ZFS pool.

Would I need iSCSI? What are the pros?

Tips and tricks for ZFS:
If speed and IOPS are important, use pools with mirrored vdevs.
If size of storage is important, use RAID-Z.
Hardware RAID is not recommended, but if your storage contains a hardware RAID controller, put it in JBOD mode - ZFS wants full control over your disks.
Use as much RAM as your budget can provide - ZFS _LOVES_ RAM.
A ZFS pool consisting of many small vdevs is better than a pool consisting of one big vdev.

What kind of setup would you recommend for 10 2TB disks? Speed is a factor.
Would two RAID-Z pools be better than one RAID-Z2?

And, what would you recommend if I wanted to have a second hot-spare mirrored storage?
 
I thought of using FreeNAS, but isn't that supposed to run off a USB stick, or is this not correct?
It is designed for this but it does not need to be that way.

Would I need iSCSI, what are the pros?
iSCSI typically provides 10-20% increased speed at the cost of complexity.

What kind of setup would you recommend for 10 2TB disks? Speed is a factor.
Would two RAID-Z pools be better than one RAID-Z2?
I would create a pool of 5 mirrored vdevs, where each vdev consists of 2 disks. Read speed will be close to the combined speed of 5 disks. For writes, each mirror vdev is limited to the speed of its slowest disk. Read more here: http://constantin.glez.de/blog/2010/06/closer-look-zfs-vdevs-and-performance

And, what would you recommend if I wanted to have a second hot-spare mirrored storage?
If you build a pool consisting of 5 mirrored vdevs, I think you have enough redundancy for your storage. In the best case your storage can survive losing 5 disks, provided the failed disks are evenly distributed over your vdevs.
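A minimal sketch of that layout for your 10 disks; the device names (and the optional 11th disk used as a hot spare) are assumptions, not part of your setup:
Code:
# 10 x 2TB as 5 mirrored vdevs striped together (ZFS "RAID 10"):
zpool create tank \
    mirror da0 da1 \
    mirror da2 da3 \
    mirror da4 da5 \
    mirror da6 da7 \
    mirror da8 da9

# If you keep an extra disk around, it can be added as a hot spare:
zpool add tank spare da10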
 
Unless you are being forced by company policy to buy overpriced, licensing-plagued "certified" stuff (i.e. a SAN)


Luckily, no :)


Unless you are being forced by company policy to buy overpriced, licensing-plagued "certified" stuff (i.e. a SAN), there's really only 2 options to consider afaik:
NAS: buy regular storage servers and export their storage via NFS and/or iSCSI. Depending on your preferences you can go for one of the following OS solutions:
Oracle Solaris (most recent ZFS versions)
OpenSolaris based NAS solutions (FreeNAS, Nexenta, whatHaveYou), older ZFS version
*BSD based ZFS, older ZFS version
ZFSonLinux (exactly what the name suggests), also older ZFS version (don't know if this is considered production stable!)
Any Linux with XFS (replace XFS with BTRFS once that's stable)
All non-Oracle-Solaris ZFS solutions only have access to an older ZFS code dump and therefore cannot receive zpool updates unless Oracle decides to release newer zpool version code.


I am leaning against either FreeNAS or NAS4Free.


Also for safety reasons you would probably want to mirror your NAS to a second failover NAS via DRBD or such
Yes! This is exactly what I want, but do FreeNAS or NAS4Free support that? And if one NAS dies, would the other be able to kick in instantly?


Depending on the amount of time you got to present your new platform, you may wish to evaluate proxmox first and use a storage system you can easily wrap your head around (NAS-type) and maybe after that evaluate ceph individually... or vice versa, depending on preferences


I think Proxmox is quite awesome, and I'm planning to switch from XenCenter. It's the storage part that I'm unsure about.


FreeNAS on USB:
It is designed for this but it does not need to be that way.


This scares me. Is it common practice to run FreeNAS off a USB stick?


If you build a pool consisting of 5 mirrored vdevs, I think you have enough redundancy for your storage. In the best case your storage can survive losing 5 disks, provided the failed disks are evenly distributed over your vdevs.


So this would translate to 5 "RAID 1" sets, combined into a "RAID 5"?
Would that mean that I could lose one disk per set (5 disks), then another one from one of the sets, and the pool would still be running? (Degraded RAID 5?)


How well does ZFS tackle IO?
 
This scares me. Is it common practice to run FreeNAS off a USB stick?
Yes. What should be wrong with that? Imagine your storage goes down - how quickly would you be able to get it running again if you had a backup USB stick on hand?
So this would translate to 5 "RAID 1" sets, combined into a "RAID 5"?
Would that mean that I could lose one disk per set (5 disks), then another one from one of the sets, and the pool would still be running? (Degraded RAID 5?)
No, it translates to a stripe across 5 RAID 1 sets (i.e. RAID 10) - awesome for speed ;)
How well does ZFS tackle IO?
Very well. Below are some test results (fio) from a ZFS mirror consisting of two SATA 3 disks, in a storage box with 8 GB RAM.

Sync disabled (battery backed-up controller)
read : io=3527.2MB, bw=19995KB/s, iops=1249 , runt=600177msec
write: io=3888.1MB, bw=6635.2KB/s, iops=414 , runt=600177msec

Sync standard (client controls sync)
read : io=2843.5MB, bw=18833KB/s, iops=1177 , runt=600016msec
write: io=3666.3MB, bw=6256.1KB/s, iops=391 , runt=600016msec
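The job file isn't included above; results in that shape (16k blocks, roughly 75/25 mixed random read/write, 600 s runtime) would come from a job along these lines - this is an illustration, not the original benchmark:
Code:
; hypothetical fio job file
[global]
ioengine=posixaio
bs=16k
size=8g
runtime=600
time_based

[mixed-rw]
rw=randrw
rwmixread=75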
 
I am leaning against either FreeNAS or NAS4Free.
you probably meant "towards" ;)
Also for safety reasons you would probably want to mirror your NAS to a second failover NAS via DRBD or such
Yes! This is exactly what I want, but do FreeNAS or NAS4Free support that? And if one NAS dies, would the other be able to kick in instantly?
Well, apparently this wouldn't be as easy to do as I figured it would be, since:

DRBD would be perfect to replicate your NAS contents onto a backup NAS, and then you only have to set up IP failover (googling that will result in multiple guides, depending on which Linux distribution you want). However, DRBD is not available on Solaris, and ZFSonLinux is not stable, so uhm... yeah, we're out of luck there.

I suppose you could set up an rsync "cronjob" (however they'd be called on Solaris) to sync your storage to a backup server, but that's really only a makeshift solution.

One thing I found though is this: https://code.google.com/p/zxfer/ and since it refers to zfs-replicate, maybe that one too. Not sure how you'd do IP failover in a Solaris environment though.
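Tools like zxfer are essentially wrappers around ZFS snapshot replication; a hand-rolled version looks roughly like this (the dataset "tank/vmstore" and the host "backup-nas" are placeholders):
Code:
# Initial full transfer to the standby box:
zfs snapshot tank/vmstore@2013-03-13
zfs send tank/vmstore@2013-03-13 | ssh backup-nas zfs receive -F tank/vmstore

# Afterwards, periodic incremental transfers between two snapshots:
zfs snapshot tank/vmstore@2013-03-14
zfs send -i tank/vmstore@2013-03-13 tank/vmstore@2013-03-14 | ssh backup-nas zfs receive tank/vmstore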

I'd certainly be interested if there's a good, clean way to have a 2-node active/passive NAS cluster with Solaris+ZFS.

If not... there's always ceph ;)
 
Yes, towards :)

NAS4Free has something called HAST, which seems like just the thing for me.
I think FreeNAS supports this as well.
 
In ZFS fragmentation is not a problem, it's a feature!

ZFS is copy-on-write, so fragmentation is part of the game, and advanced algorithms ensure there is no performance hit as long as the used space of the storage array is below 80%.
http://www.datacenteracceleration.com/author.asp?section_id=2433&doc_id=253175

Using ZFS with a Nexenta SAN, I can tell you 70% is the limit for me; after that, performance degrades to around 10x slower (ZFS takes too much time finding free blocks to write to).
Also, use separate disks for the ZIL, to avoid too much fragmentation.
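For reference, adding a separate log device (and optionally a read cache) to an existing pool is a one-liner; the pool name and device names below are examples only:
Code:
# Mirrored SLOG (separate ZIL) on two SSDs:
zpool add tank log mirror ada6 ada7

# Optional L2ARC read cache on another SSD:
zpool add tank cache ada8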
 
I have now had some experience with NAS4Free and storage clustering.
It is possible to use something called CARP for IP failover, HAST for storage replication, and some scripting for automatic iSCSI target failover.
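The HAST side is configured in /etc/hast.conf; a minimal sketch with placeholder host names, addresses, and a single disk per resource (not my real config):
Code:
# /etc/hast.conf - hypothetical two-node setup
resource disk0 {
        on nas1 {
                local /dev/ada1
                remote 10.0.0.2
        }
        on nas2 {
                local /dev/ada1
                remote 10.0.0.1
        }
}

CARP then floats a shared IP between the two boxes, and a small script triggered on the CARP state change switches the surviving node's HAST resources from secondary to primary.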

However, I am now looking at DRBD.
I could then use the hardware RAID controller on each server, and make DRBD take care of data replication.
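Roughly what I have in mind for the DRBD resource; the host names, backing devices, and addresses below are placeholders, not an actual config:
Code:
# /etc/drbd.d/r0.res - hypothetical two-node resource on top of the RAID volumes
resource r0 {
        protocol C;
        on stor1 {
                device    /dev/drbd0;
                disk      /dev/sdb1;
                address   10.0.0.1:7788;
                meta-disk internal;
        }
        on stor2 {
                device    /dev/drbd0;
                disk      /dev/sdb1;
                address   10.0.0.2:7788;
                meta-disk internal;
        }
}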

Still, there are a lot of questions going around in my head. If I set up IP failover, would I be able to have a seamless failover with NFS or iSCSI?
I've heard of Pacemaker and VRRP, but have not tested them myself.

EDIT: I wanted to add that I also have tried ceph.
I set up ceph according to the Proxmox wiki, with two servers (Ubuntu 12.04) and one disk each.
They are using one gigabit NIC each, connected to the same switch.

Sadly, ceph is giving me horrible performance:

Code:
2013-03-13 19:32:29.531292 mon.0 [INF] pgmap v1863: 576 pgs: 576 active+clean; 3855 MB data, 17976 MB used, 1844 GB / 1862 GB avail; 5836KB/s wr, 11op/s
2013-03-13 19:32:34.527722 mon.0 [INF] pgmap v1864: 576 pgs: 576 active+clean; 3882 MB data, 18028 MB used, 1844 GB / 1862 GB avail; 5715KB/s wr, 11op/s
2013-03-13 19:32:39.463129 mon.0 [INF] pgmap v1865: 576 pgs: 576 active+clean; 3912 MB data, 18075 MB used, 1844 GB / 1862 GB avail; 5958KB/s wr, 11op/s
2013-03-13 19:32:44.528313 mon.0 [INF] pgmap v1866: 576 pgs: 576 active+clean; 3940 MB data, 18155 MB used, 1844 GB / 1862 GB avail; 5734KB/s wr, 11op/s
2013-03-13 19:32:49.546282 mon.0 [INF] pgmap v1867: 576 pgs: 576 active+clean; 3969 MB data, 18230 MB used, 1844 GB / 1862 GB avail; 6041KB/s wr, 11op/s
2013-03-13 19:32:54.542307 mon.0 [INF] pgmap v1868: 576 pgs: 576 active+clean; 3999 MB data, 18289 MB used, 1844 GB / 1862 GB avail; 6041KB/s wr, 11op/s
2013-03-13 19:32:59.563027 mon.0 [INF] pgmap v1869: 576 pgs: 576 active+clean; 4029 MB data, 18319 MB used, 1844 GB / 1862 GB avail; 6160KB/s wr, 12op/s

On the guest (running Ubuntu 12.04):

Code:
536870912 bytes (537 MB) copied, 338.291 s, 1.6 MB/s
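(That line is dd output; the exact command isn't quoted, but it would have been something along these lines inside the guest - the flags are an assumption:)
Code:
dd if=/dev/zero of=/root/testfile bs=1M count=512 conv=fdatasync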

Am I missing something?
 
Ok, so I ditched ceph because of bad performance and set up DRBD instead.
I have set up DRBD + heartbeat + iSCSI, and it works well on the master node.

When I try to failover to the second node, logs on the proxmox host shows:
Code:
Mar 15 17:42:19 px1 pvestatd[1906]: WARNING: storage 'HA-iSCSI' is not online
Mar 15 17:42:23 px1 iscsid: connection1:0 is operational after recovery (5 attempts)

So, at least, something is working. I use a virtual IP for both nodes that iSCSI connects to.

However, I am getting "Communication failure" when running the iSCSI target off the second node.
Running "pvesm lvmscan" while iSCSI is connected to the second node gives:
Code:
root@px1:~# pvesm lvmscan
pve
vg_storage

"pvesm lvmscan" is alot faster when iSCSI is running of the master node.


Any pointers? What can I do to debug?
 
We have used iSCSI with DRBD before, but we have moved to ceph. Just curious about your ceph performance. The wiki says to build with three nodes, but you used two. Also, it states to use an SSD for the journal, one per OSD. We have built several ceph clusters on Ubuntu and CentOS and have had good performance. Now with 2.3 the backups and snapshots work perfectly via the GUI.
 
We have used iSCSI with DRBD before, but we have moved to ceph. Just curious about your ceph performance. The wiki says to build with three nodes, but you used two. Also, it states to use an SSD for the journal, one per OSD. We have built several ceph clusters on Ubuntu and CentOS and have had good performance. Now with 2.3 the backups and snapshots work perfectly via the GUI.

How was ceph performance with your setup?

I have only two storage servers available, so I guess DRBD is my only option.
However, I have configured them with Pacemaker, and it works really well. Failover is almost instant.
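For anyone curious, the Pacemaker side of a DRBD + floating-IP setup boils down to a handful of crm resources; the resource names, DRBD resource, and IP below are placeholders, not my exact config:
Code:
# crm configure - hypothetical names/addresses
primitive p_drbd ocf:linbit:drbd params drbd_resource=r0 op monitor interval=30s
ms ms_drbd p_drbd meta master-max=1 clone-max=2 notify=true
primitive p_vip ocf:heartbeat:IPaddr2 params ip=10.0.0.50 cidr_netmask=24
colocation vip_with_drbd_master inf: p_vip ms_drbd:Master
order drbd_before_vip inf: ms_drbd:promote p_vip:start

The iSCSI target resource is then colocated with the virtual IP in the same way.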

I'm also tinkering with flashcache + DRBD. Anybody tried that?
 
actually the recommendation is to have 4 OSDs (= hard disks) per SSD.

The reason for ceph needing 3 or 5 or more servers is that it's a cluster. Having 2 storage servers should work just fine, but you still need to run at least 3 mons for quorum; a mon can run on any server (it doesn't have to be a storage box) since they "only" do management.
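In the ceph.conf of that era, the three monitors are declared per host, roughly like this; the host names and addresses are made up, and the third mon sits on a non-storage box:
Code:
[mon.a]
        host = stor1
        mon addr = 10.0.0.1:6789
[mon.b]
        host = stor2
        mon addr = 10.0.0.2:6789
[mon.c]
        host = pve1
        mon addr = 10.0.0.3:6789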
 
