Suggestions for SAN Config

Go for Intel DC S35xx (cheap) or Intel DC S37xx (more expensive but better performance and durability)
We use the DC S3510s in some of our servers. We did have one go bad after two weeks, but besides that they've been great.
 
I'm personally partial to FreeNAS, mainly because there is a commercial company behind it that pays a larger number of developers, the community is larger (although it has its share of anti-social members), and the project seems to have a healthier commit rate.

What's the reasoning behind going all-SSD? Is it just reliability, or does speed factor into this as well?
We've used both FreeNAS and NAS4Free; both have been solid. I also helped a buddy deploy an iXsystems solution in a DC, and those guys were great to work with.

SSDs are purely a reliability play. We're based in California, but our data centers are spread throughout the country. While we can always use the DCs' remote hands service, we prefer to deploy systems that will be as reliable as possible. And since we don't need a lot of storage, SSDs aren't cost prohibitive.
 
Before settling on FreeNAS/NAS4Free you should consider OmniOS + napp-it too. You get native in-kernel ZFS and an iSCSI implementation that is more stable and scalable, and outperforms both istgt and ctld. The integration with Proxmox is also better than the other two.
 
Thanks Mir. napp-it looks like another good option. We've had it running in our lab, but never put it into production. Do you run it in production with OpenVZ or LXC?
 
I have a Proxmox 3.4 in production and a 4.1 in the lab.

For OpenVZ and LXC you simply export a dataset through NFS and you are good to go. For KVM, use zvols via ZFS_over_iSCSI: it supports snapshots and (linked) clones, and also supports thin-provisioned zvols.
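
For illustration, the two storages could look something like this in /etc/pve/storage.cfg (server IP, pool, and target names are just placeholders):

nfs: omnios-ct
        server 10.10.10.10
        export /tank/ct
        path /mnt/pve/omnios-ct
        content rootdir
        options vers=3

zfs: omnios-vm
        portal 10.10.10.10
        target iqn.2010-09.org.napp-it:tank
        pool tank
        iscsiprovider comstar
        blocksize 8k
        sparse 1
        content images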
 
Forgot to mention: if you use thin-provisioned zvols you should choose SCSI disks and the virtio-scsi controller, because COMSTAR (the iSCSI daemon in OmniOS) supports the SCSI UNMAP command. This means the TRIM command is honored by COMSTAR, so trimmed blocks are released from the zvol back to the pool.
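
On the Proxmox side that looks roughly like this (the VM ID and storage name are just examples); the trim itself then has to be triggered inside the guest:

qm set 101 -scsihw virtio-scsi-pci
qm set 101 -scsi0 omnios-vm:vm-101-disk-1,discard=on
# inside the guest, e.g. from a weekly cron job:
fstrim -av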
 
That seems to be the consensus. Obviously we'll build the NAS to be as reliable as possible, but the component that would be a single point of failure for the entire cluster should itself be redundant.
 
No, a UPS is enough for me.

Sorry, but how does your UPS help if you have a fire or something else happens?

Remember Murphy's law, so always have the data mirrored to another datacenter.

How will you do a firmware upgrade, or anything else on the storage system that requires rebooting the controller? I don't think the solution is to stop all VMs on all hosts. Simply switch to the other datacenter and do maintenance on the local storage controller.
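
If the storage box runs ZFS (FreeNAS/NAS4Free/OmniOS alike), that kind of mirroring can be as simple as scheduled snapshot replication; a rough sketch with made-up pool, dataset, and host names:

# initial full copy to the remote datacenter
zfs snapshot -r tank/vmdata@repl1
zfs send -R tank/vmdata@repl1 | ssh nas-dc2 zfs receive -F tank/vmdata
# afterwards only send the delta
zfs snapshot -r tank/vmdata@repl2
zfs send -R -i @repl1 tank/vmdata@repl2 | ssh nas-dc2 zfs receive -F tank/vmdata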
 
Sorry, I do not. As I said, we do not use LXC at work, and we use Gluster only for experimental lab stuff with KVM guests (different from your use case).


Q: What connectivity do your Proxmox nodes have? 1G, 10G, InfiniBand?

The reason I keep asking is as follows:
Whenever you use a SAN, Ceph, or Gluster, you want to go with a separate storage network, or a single network that is properly sized and properly managed with QoS.



For Gluster specifically, this is because of the following:
http://blog.gluster.org/2010/06/video-how-gluster-automatic-file-replication-works/
Basically, whenever you write a file to Gluster, your bandwidth gets divided by the number of SANs with Gluster on top (i.e., the replica count).

So let's say you have a 1G pipe and 2 SANs: you're left with 0.5G, or 62.5 MB/s, of write bandwidth going out from your Proxmox node, and that is shared across 50 CTs. That's why I asked earlier whether you have any metrics to share on the current usage of your storage subsystem.

It is also important the other way around. Let's say you have 2 Gluster nodes, each with a single dedicated 1G pipe, and you have 3 Proxmox nodes attached to them. When only one Proxmox node is reading a large number of files, you statistically end up with 2G worth of bandwidth (or 250 MB/s for 50 CTs), but if all 3 Proxmox servers are using the Gluster storage, you are looking at 2G/3 = 0.66G, or about 83 MB/s for 50 CTs, which is by the way about 1.6 MB/s per CT.
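
To generalize the two examples (assuming every Gluster node holds a full replica, as in the video above):

write bandwidth from a Proxmox node ≈ node uplink / number of replicas
read bandwidth per Proxmox node ≈ (sum of the Gluster nodes' uplinks) / number of Proxmox nodes reading
per-CT share ≈ the respective figure / number of CTs on that node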

Not sure that will work, but that's why knowing your current metrics is important.


I'd self-build a node with Gluster before I'd go and buy a ready-made SAN (and maybe set up Gluster on top of it), or go with something like NetApp.

It's a lot cheaper. The reason is not just the base cost, but also the running cost.
This is because you can leave out all the redundancy features and spec it to exactly your needs: all you need is case + mainboard + CPU + RAM + PSU + disks/flash + NIC(s), sized exactly as you need them for your use case.
Then just set up your favourite Linux + ZFS + Gluster and you're done (see the sketch below).
Need more redundancy? Just add another Gluster node.
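
A minimal sketch of that last step, assuming two boxes (hostnames, pool, and volume names are made up) that each use a ZFS dataset as the brick:

# on each node
zfs create tank/bricks
mkdir -p /tank/bricks/vmstore
# on gluster1 only
gluster peer probe gluster2
gluster volume create vmstore replica 2 gluster1:/tank/bricks/vmstore gluster2:/tank/bricks/vmstore
gluster volume start vmstore

Proxmox can then consume the volume through its built-in GlusterFS storage type.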


Here's some I/O info from our busiest production server:

root@proxmox:~# iostat -d -x 5 3
Linux 2.6.32-39-pve (proxmox) 02/26/2016 _x86_64_ (24 CPU)

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.03 670.24 1.16 501.29 15.81 8258.38 32.94 0.81 1.62 6.30 1.61 0.10 5.27
dm-0 0.00 0.00 0.01 11.72 0.16 46.88 8.02 0.03 2.14 10.56 2.13 0.07 0.08
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 8.00 0.00 1.13 1.11 1.13 0.07 0.00
dm-2 0.00 0.00 1.18 1159.83 15.64 8211.50 14.17 0.46 0.39 6.57 0.39 0.04 5.21

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 474.20 0.00 390.60 0.00 7222.40 36.98 0.05 0.12 0.00 0.12 0.06 2.34
dm-0 0.00 0.00 0.00 1.80 0.00 7.20 8.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 863.00 0.00 7215.20 16.72 0.13 0.15 0.00 0.15 0.03 2.40

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 492.00 0.00 452.00 0.00 7472.00 33.06 0.26 0.57 0.00 0.57 0.05 2.42
dm-0 0.00 0.00 0.00 122.60 0.00 490.40 8.00 0.38 3.13 0.00 3.13 0.02 0.20
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 821.40 0.00 6981.60 17.00 0.15 0.19 0.00 0.19 0.03 2.34
 
Q: Have you checked how much of your 8-12 TB of data is cold data and how much is hot data? It might make a big difference in terms of cache sizing or the choice of RAID level on the NAS (RAIDZ2 / RAID10).




Am I reading this correctly? You are doing <=600 writes/s, and no reads?
That would mean your 3-node cluster does fewer than 2k writes/reads per second, right?



The following assumes you are doing the typical 5% ultra-hot, 5% hot, and 90% cold data.

If this is true, I'd do a self-built ZFS-based NAS or Gluster setup with enough HDDs in RAIDZ2 (min. 5x 4 TB for your 12 TB goal) and dedicated SSDs for L2ARC and ZIL. I'd add at least 2 spare HDDs and a minimum of 2 SSDs for ZIL and L2ARC (256+ GB, no need to go crazy here) and make it all consumer-grade. This should be MORE than enough performance for what your iostat indicates and allow for growth in terms of additional I/O.
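
As a rough sketch of that pool layout (device names are placeholders; use /dev/disk/by-id paths in practice):

# 5x 4 TB in RAIDZ2 = ~12 TB usable, plus two hot spares
zpool create tank raidz2 sda sdb sdc sdd sde
zpool add tank spare sdf sdg
# two SSDs, partitioned: a small mirrored SLOG ("ZIL") plus the rest as L2ARC
zpool add tank log mirror sdh1 sdi1
zpool add tank cache sdh2 sdi2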

Not having 24/7 easy access to the server and having to rely on DC support staff might make me a bit wary, though, which might convince me to shell out the $$$ for enterprise-grade SSDs; but even then I'd rather add additional spare SSDs and maybe a couple more spare HDDs. Consider that ZIL and L2ARC are adaptive, meaning that if they are not present your setup will still work, but it will only be able to rely on your RAM and won't be able to offload "overflow" to the ZIL / L2ARC.



In short you want at least:
2x OS drive
7x 4 TB HDD (5x 4 TB for the 12 TB goal with RAIDZ2 + 2 spares)
2x 512 GB SSD (so one can fail; used for ZIL and L2ARC)
appropriately sized RAM
This should all fit into a 12-bay 2U unit.


Or (if you want more redundancy in lieu of enterprise-grade drives):
2x OS drive
10x 4 TB HDD (12 TB RAIDZ2 + 5 spares, or any other config)
4x 512 GB SSD (2x for ZIL, 2x for L2ARC)
appropriately sized RAM
This fits a 16-bay 3U case.


Either should still be cheaper than going all enterprise gear, or even all enterprise flash.



Personal note: we have all our servers in local datacenters (3), operated by our own staff (24/7/365), so we can swap failed drives rather quickly ourselves. We also rely heavily on redundancy in terms of additional servers and think in terms of "pod redundancy" instead of single-component redundancy, thereby accepting multiple servers being down at the same time due to failed disks, PSUs, or network cards. That's why we always go consumer-grade rather than enterprise gear, dual PSUs, etc. It's a lot cheaper at scale. So keep that in mind when reading my suggestions :)
 
While we're throwing out suggestions ... Have you had a look at Open-E's cluster setup?
It's basically DRBD, but with commercial support behind it.

I've been running it as redundant backing storage for VMware, because it's supported by them too. Rock solid!
 
Hi Nils. Open-E looks interesting. Do you connect to it over iSCSI?
Yes, we use it as an iSCSI solution. I think it also supports FC.

Their new version of the OS uses ZFS as the backend, but we haven't deployed that yet, since we'd like to see site resilience working with that setup first.
 
