Suggestions for SAN Config

axion.joey · Feb 24, 2016

Hi,

We've been a long time proxmox user. We currently have 7 proxmox hosts each running approximately 150 CT's using local storage only. We're considering moving to a shared storage model and would like some ideas on the best way to configure it for maximum reliability.

Our initial thoughts are to use 3 hosts and 2 mirrored SAN's per data center. This is a very low IO environment. Our primary concern is reliability.

What are the best practices for setting up an environment like this? The primary goals are:

Ensuring data is sync'd real time between SAN's.
Ensuring that the Proxmox hosts failover in real time in the event of a SAN failure.
Configuring host failover so that if a host fails one of the other two hosts will start the LXC Containers that were running on the failed host.

Thanks in advance for your feedback.

LnxBil · Feb 24, 2016

I'm also very interested in this, because I cannot think of a supported version which is able to run LXC. I tried gfs2, but it crashed the kernel repeatedly, and ocfs2 has been dead for years.

But here to your questions:

1) Buy a SAN that's capable of this :-D What is your budget?
2) Again, buy a SAN that's capable of this too. There is hardware that can fail over in these circumstances.
3) That's built into Proxmox and should work out of the box.

This also comes with a big price tag, but only one filesystem can do all you want: Oracle ACFS from Oracle's Grid Infrastructure. It has built-in cluster-aware multi-SAN replication and failover. I have never seen it used in Proxmox, but it should work out of the box. It's not supported for this by either Proxmox or Oracle, but technically it should work with an ACFS-supported kernel (not PVE, you will have to build this yourself).

I used this cluster technology for years in Oracle clusters and it is quite impressive.

axion.joey · Feb 24, 2016

Thanks so much for the quick response. I just downloaded proxmox version 4 and I'm going to start testing it. I know that versions 1-3 only allowed OpenVZ storage on local storage and NFS. I can set up redundant SAN's and configure replication on the SAN level. I was just wondering if there were any architectures that proxmox has tighter/more reliable connectivity with that also work with LXC.

For example I could set up redundant SAN's and use Proxmox's glusterfs client to connect to the SAN's. Theoretically the glusterfs client would take care of replication and real time fail over, but I've never done it in Proxmox, so I don't know if it will actually work.

Q-wulf · Feb 25, 2016

axion.joey said:
[...]
We currently have 7 proxmox hosts each running approximately 150 CT's using local storage only.
[...]
Our initial thoughts are to use 3 hosts and 2 mirrored SAN's per data center.
[...]
very low IO environment. Our primary concern is reliability.
[...]

So i am assuming you will have multiple "pods" of 3x Proxmox-nodes in multiple Datacenters.

Q1: Are these all in the same Proxmox-Cluster ? As in 3 Nodes on Datacenter A and 3 Nodes in Datacenter B
Q2: How much IO do you actually need ? Do you have a ballpark area ?
Q3: What type of local storage are you using now ?
Q4: when you say CT, you mean openvz or LXC ?
Q5: whats your connectivity between nodes and between Datacenters like ?

Without knowing the answers to the questions above (which could be throwing a curveball for this proposal), and making the assumption of a multi-Datacenter setup;
You could potentially achieve this with Ceph

I'd use a "custom Crush location hook" to split my nodes(*1) this way:

Datacenter-A
- Datacenter-A-Node-A1
  - OSD.1
  - OSD.<InsertNumber>
- Datacenter-A-Node-A2
  - OSD.2
  - OSD.<InsertNumber>
- Datacenter-A-Node-A3
  - OSD.3
  - OSD.<InsertNumber>
Datacenter-B
- Datacenter-B-Node-B1
  - OSD.4
  - OSD.<InsertNumber>
- Datacenter-B-Node-B2
  - OSD.5
  - OSD.<InsertNumber>
- Datacenter-B-Node-B3
  - OSD.6
  - OSD.<InsertNumber>
Datacenter-C
- Datacenter-C-Node-C1
  - OSD.7
  - OSD.<InsertNumber>
- Datacenter-C-Node-C2
  - OSD.8
  - OSD.<InsertNumber>
- Datacenter-C-Node-C3
  - OSD.9
  - OSD.<InsertNumber>

(*1) These could be either Proxmox + Ceph Nodes that you use, or Standalone Ceph-Nodes. Lets call em "Storage-Nodes" to make it easier to understand.
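
As a rough sketch of what such a location hook might look like in Python: Ceph runs the hook and reads a single line of key=value CRUSH location pairs from stdout. The hostname scheme (`dc-a-node-a1`) and the `Datacenter-X` bucket names are assumptions made up for illustration, so adapt the pattern to your own naming.

```python
import re
import socket

# Hypothetical hostname scheme: "dc-a-node-a1" lives in datacenter "A".
# The thread does not name real hosts; adjust the pattern to your own.
HOST_PATTERN = re.compile(r"^dc-(?P<dc>[a-z])-node-[a-z]\d+$")


def crush_location(hostname):
    """Return a one-line CRUSH location string for a storage node."""
    match = HOST_PATTERN.match(hostname.lower())
    if match is None:
        # Unknown hosts are placed directly under the default root.
        return "root=default host={}".format(hostname)
    dc = match.group("dc").upper()
    return "root=default datacenter=Datacenter-{} host={}".format(dc, hostname)


if __name__ == "__main__":
    # Use the short hostname, without any domain suffix.
    print(crush_location(socket.gethostname().split(".")[0]))
```

For a host called `dc-a-node-a1` this prints `root=default datacenter=Datacenter-A host=dc-a-node-a1`, which gives the tree shown above a datacenter level between the root and the hosts.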

I'd then create Custom Crush Rules for my Pools:

1 Rule per Datacenter for Backups, whereby it first puts the Data on Local OSD's and then replicates them over to the other datacenters (make it size =3 so you get a copy of each file on each datacenter)
- Create 1 Pool per datacenter based on these rules And assign em as Pools for only the Proxmox-Nodes in said Datacenter.
Create a Rule for every Datacenter for Data-Pools (for CT's you only want to migrate inside the same Datacenter)
- create (a) pool(s) on those nodes. And assign em only to proxmox nodes sitting in the same datacenter where you have your Crush-Rule pointing to.
Create a rule for data pools that are not Datacenter-specific. This is for containers you will be able to migrate between Datacenters. Make sure you keep a copy of every file in each datacenter (replication size = number of datacenters). Also make sure that you put your first copy into storage in the local Datacenter, then replicate to the other datacenters. That makes sure that you will most likely be reading/writing from "Storage-Nodes" inside the Datacenter the CT is supposed to be running in (unless there is a problem, in which case it reads from the remote datacenters' Storage-Nodes).
- Create pools based on these rules and assign em in all your Proxmox-Nodes.
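
To make the difference between the datacenter-specific and the global rules concrete, here is a toy model in Python. It is not CRUSH and does not talk to Ceph; real placement is decided by the CRUSH rules themselves. The hierarchy mirrors the example tree above with one OSD per node, and the function names are invented for this sketch.

```python
# Toy model only: real placement is computed by CRUSH inside Ceph.
# One illustrative OSD per node, numbered as in the example tree.
HIERARCHY = {
    "Datacenter-A": {"Node-A1": [1], "Node-A2": [2], "Node-A3": [3]},
    "Datacenter-B": {"Node-B1": [4], "Node-B2": [5], "Node-B3": [6]},
    "Datacenter-C": {"Node-C1": [7], "Node-C2": [8], "Node-C3": [9]},
}


def place_global(local_dc, obj_id):
    """One replica per datacenter, with the first copy in the local one."""
    order = [local_dc] + sorted(dc for dc in HIERARCHY if dc != local_dc)
    replicas = []
    for dc in order:
        osds = sorted(o for node in HIERARCHY[dc].values() for o in node)
        replicas.append(osds[obj_id % len(osds)])
    return replicas


def place_local(local_dc, obj_id, size=3):
    """All replicas inside one datacenter, each on a different node."""
    nodes = sorted(HIERARCHY[local_dc])
    if size > len(nodes):
        raise ValueError("not enough nodes for the requested size")
    start = obj_id % len(nodes)
    picked = [nodes[(start + i) % len(nodes)] for i in range(size)]
    return [HIERARCHY[local_dc][name][0] for name in picked]


print(place_global("Datacenter-B", 0))  # first OSD listed is in Datacenter-B
print(place_local("Datacenter-A", 0))   # every OSD listed is in Datacenter-A
```

In the global case the first entry is the local copy that reads and writes would normally hit; in the local case losing the whole datacenter loses every copy, which is why backups go into the global pools.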

You'd then end up with the following Storage-Pool scheme on every Proxmox-Node, you also can have multiple pools to "partition" your available Storage based on the same Crush-Rule:

amount of "X" global Backup Pools
1. Every backup you have is available in every Datacenter, so if you lose a whole datacenter's worth of "Storage-Node" hardware due to e.g. fire or lightning, you will not lose data.
2. Use-Case:
  1. Backups
  2. ISOs
  3. Templates
amount of "Y" Datacenter specific CT-Data-pools
1. You can lose all but one "Storage-Node" in this Datacenter and the data is still available. The more "Storage-Nodes" you lose, the bigger the hit on your performance. Ideally you will be able to migrate/restart your VM's with only one local Storage-Node online.
2. Use-Case:
  1. Containers i have sold to customers in a specific Datacenter, marketed with "local Failover"
amount of "Z" global CT-Data-pools
1. If you lose all but one Datacenter's worth of "Storage-Nodes" you will still be able to operate.
2. If you create multiple Datacenter-specific pools, you can still migrate the Container to another datacenter if you need to, but they will normally be running in their "assigned Datacenter" with the best performance (since that's where your primary OSD's will be sitting).
3. If that local Datacenter's "Storage-Nodes" go down, you can still use the Containers in the local Datacenter, but the Data will come from "remote Storage-Nodes", thereby be subject to the latency that this entails.
4. use-Case:
  1. Containers i have sold in a specific Datacenter, but market with "regional Failover"
  2. Core Business infrastructure you'd be self-hosting, like e.g. Mail, DNS, Website, Business-Software

Please be advised that this is not a run-of-the-mill install of Ceph, but an advanced and custom setup that most (superficial) guides do not cover. You'll either need to get some consulting help, or you'll need to study the Ceph reference documentation and Sébastien Han's blog in great detail and do some extensive testing.

Links:
http://www.sebastien-han.fr/blog/ (wealth of Information)
http://docs.ceph.com/docs/master/rados/operations/crush-map/ (ceph Crush location reference)
https://gist.github.com/wido/5d26d88366e28e25e23d (example of a Custom Crush hook)
http://docs.ceph.com/docs/master/architecture/ (probably 2-3 hours worth of reading material)

axion.joey · Feb 25, 2016

Thank you so much for your detailed response. Here are the answers to your questions.

Q1: Are these all in the same Proxmox-Cluster ? As in 3 Nodes on Datacenter A and 3 Nodes in Datacenter B.
There are 3 nodes in each data center. We don't need any synchronizing/failover between data centers. We use other tools to handle data center failures and data synchronization between data centers.

Q2: How much IO do you actually need ? Do you have a ballpark area ?
I'll pull some stats and get back to you.

Q3: What type of local storage are you using now ?
Right now we use 6 7200 RPM drives on the local hosts in a RAID 10

Q4: when you say CT, you mean openvz or LXC ?
That's correct. We're currently using OpenVZ containers on Proxmox version 3. We're planning on migrating to Proxmox version 4 and LXC.

Q5: whats your connectivity between nodes and between Datacenters like ?
Currently the data centers are connected via VPN. We don't need realtime data sync or failover on the proxmox level between data centers.

Q-wulf · Feb 25, 2016

axion.joey said:
[...]
Q5: whats your connectivity between nodes and between Datacenters like ?
Currently the data centers are connected via VPN. We don't need realtime data sync or failover on the proxmox level between data centers.

Whats your node to node connectivity like ? 1G ? 10G ? multiple links ?

A multi-Datacenter Ceph setup is probably too complex, as you already ruled that out:

axion.joey said:
We don't need realtime data sync or failover on the proxmox level between data centers.

So only Datacenter Internal real-time sync and failover abilities ?
You could still use Ceph for this, but honestly, it is too much overhead for 3 Proxmox-Nodes to set up 3 additional Ceph-Nodes just for redundant SAN. If you have a lot of spare CPU/Ram on the hosts, you could look into a ceph setup with ceph sitting on the Proxmox-Nodes.
But with 50 CT's per Node (not sure what type / size /resource-usage) i'd probably not go down that route, unless i had all Flash-Disks and multiple 10G-links and boatloads of spare Memory and Cpu resources, and there was some sort of premium on the Rack-Units.

Your initial idea of using Gluster probably hit the nail on the head. It has the least overhead in terms of additional devices and should work nicely performance-wise. I've not done this in conjunction with LXC yet. At work we have had a lab setup running for ages. Gluster works fine with KVM (which is what we use exclusively).

Basically you set up 2 servers per Data-Center that you stick all your disks on, then use Gluster to take care of replication. If you need more Performance, you just add more Gluster-Nodes.
When you use Gluster with replication, write throughput typically drops to about 1 / (number of Gluster nodes) of the read throughput, because the client sends every write to each node.

For example, I have read about setups where Companies use ZFS with Spinner-Based RaidZ-2 (2 Disks can Fail), M.2 Based L2Arc and SSD-based ZIL. They enable Compression and Deduplication, then just run Gluster on top of that with great success in terms of Failure Domains, Fault tolerance and performance.
But i have never used ZFS to that extent, nor for your use case. Maybe worth looking into.

axion.joey · Feb 25, 2016

Thanks again. do you know if LXC supports Gluster or ZFS storage? I haven't read of anyone doing it with Proxmox yet.

LnxBil · Feb 25, 2016

I'm a little surprised - I misread your question entirely. 6 NL-SAS or even SATA disk for 3 nodes with >50 CT each? Wow.

The licenses for the live mirroring/failover techniques I had in mind normally cost as much as a full 2U of flash storage and are clearly out of scope here.

axion.joey · Feb 26, 2016

Ha ha. Yes each CT is very basic. 512MB of Ram and 8 gigs of hard drive. Nearly 0 CPU usage. We need a lot of containers, and they need to be very reliable, but the performance requirements are minimal.

The number of CT's though will continue to grow, so we want to use separate resources for storage vs Ram and compute, hence the desire to move to a SAN.

We're looking at using all SSD storage in the SAN, again only because of reliability.
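
A quick back-of-the-envelope check of what these per-container figures add up to, as a minimal Python sketch. The host and container counts are the approximate numbers from the first post, the totals are provisioned rather than actually used, and the figures cover all data centers together.

```python
HOSTS = 7             # current Proxmox hosts (from the first post)
CTS_PER_HOST = 150    # approximate containers per host
DISK_GB_PER_CT = 8    # provisioned disk per container
RAM_MB_PER_CT = 512   # RAM per container


def fleet_totals(hosts=HOSTS, cts_per_host=CTS_PER_HOST):
    """Return (container count, provisioned disk in TB, RAM in GB)."""
    cts = hosts * cts_per_host
    disk_tb = cts * DISK_GB_PER_CT / 1000   # decimal terabytes
    ram_gb = cts * RAM_MB_PER_CT / 1024     # 1024 MB per GB of RAM
    return cts, disk_tb, ram_gb


cts, disk_tb, ram_gb = fleet_totals()
print(f"{cts} CTs, {disk_tb:.1f} TB provisioned disk, {ram_gb:.0f} GB RAM")
# prints "1050 CTs, 8.4 TB provisioned disk, 525 GB RAM"
```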

LnxBil · Feb 26, 2016

Obviously also not a lot of I/O. My mobile phone's NAND could probably get better random IOPS than 6 NL-SAS disks.

axion.joey · Feb 26, 2016

I'm with you 100%. That's one of the problems we're facing. We really need minimal performance, but we want to deploy redundant SAN's because we'll get over 80 support calls because of a 10 minute outage.

hec · Feb 26, 2016

I would use a NetApp, and then you can decide whether to use NAS or SAN. I would use NFS because of the snapshots and the space requirements compared to iSCSI or FC. What is the distance between the DCs? Maybe you can use a MetroCluster; then you have SyncMirrored aggregates. This means all data is written synchronously on both sides. If not MetroCluster, then you could use vserver-dr.

For now we can get 100TB in a 2U shelf with SSDs (24x3.8TB). Then a few more rack units for the controller, and that's it. Or you use 20x1.2TB SAS or 20x900GB SAS and 4x 400GB SSDs for Flashpool (caching metadata) and 2TB Flashcache for serving data from flash.

There are a lot of things you can do. If you want a stable system which simply works, use NetApp.

axion.joey · Feb 26, 2016

Thanks, hec. Another wrinkle: we don't need large IO, and we don't need a lot of space. We only need 6 - 10TB tops. We talked to NetApp and Dell and their solutions are awesome. They have the features and reliability that we need, but they aren't really priced well for small storage/redundant solutions.

Q-wulf · Feb 26, 2016

axion.joey said:
Thanks again. do you know if LXC supports Gluster or ZFS storage? I haven't read of anyone doing it with Proxmox yet.

Sry, i do not. As i said, we do not use LXC at work, and Gluster only for experimental lab stuff with KVM guests (different from your use case).

Q: What connectivity do your proxmox-Nodes have ? 1G, 10G, infiniband ?

The reason i keep asking is as follows:
Whenever you use a SAN, Ceph or Gluster, you want to go with a separate storage network, or a single network that is properly sized and properly managed by QoS.

For Gluster this specifically is because of the following:
http://blog.gluster.org/2010/06/video-how-gluster-automatic-file-replication-works/
Basically, whenever you write a file to Gluster, your bandwidth gets divided by the number of "SAN's with Gluster" on top.

So let's say you have a 1G pipe and 2 SANs: you're left with 0.5G or 62.5 MB/s of outgoing bandwidth from your Proxmox node when you write from it. That is shared between 50 CT's. That's why i asked earlier if you have any metrics to share on the current usage of your storage subsystem.
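
The write-side arithmetic above as a minimal Python sketch. It assumes 1 Gbit/s = 125 MB/s, ignores protocol overhead, and assumes the client sends each write to every replica; the function names are made up for this example.

```python
def gbit_to_mb_per_s(gbit):
    """Convert a link speed in Gbit/s to MB/s (1 Gbit/s = 125 MB/s)."""
    return gbit * 1000 / 8


def write_bandwidth_mb_per_s(link_gbit, replicas):
    """Usable write bandwidth when every write goes to each replica."""
    return gbit_to_mb_per_s(link_gbit) / replicas


per_node = write_bandwidth_mb_per_s(link_gbit=1, replicas=2)
print(per_node)        # 62.5 MB/s for the whole Proxmox node
print(per_node / 50)   # 1.25 MB/s per container with 50 CT's
print(write_bandwidth_mb_per_s(link_gbit=10, replicas=2))  # 625.0 on 10G
```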

It is also important the other way around. Let's say you have 2 Gluster nodes, each with a single dedicated 1G pipe, and you have 3 Proxmox nodes attached to them. When only one Proxmox node is reading a large amount of files, you statistically end up with 2G worth of bandwidth (or 250 MB/s for 50 CT's), but if all 3 Proxmox servers are using the Gluster storage, you are looking at 2G/3 = 0.67G or 83 MB/s for 50 CT's, which is about 1.7 MB/s per CT.
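
The read-side estimate can be sketched the same way. This follows the simplified model in the paragraph above: the storage nodes' links are pooled and shared evenly between the active Proxmox nodes, and the client's own NIC speed and any protocol overhead are ignored. The function name and parameters are assumptions for illustration.

```python
def read_share_mb_per_s(storage_nodes, link_gbit, active_clients, cts_per_client):
    """Return (MB/s per Proxmox node, MB/s per container) for reads."""
    total_gbit = storage_nodes * link_gbit          # pooled storage links
    per_client_gbit = total_gbit / active_clients   # even split per client
    per_client_mb = per_client_gbit * 1000 / 8      # Gbit/s to MB/s
    return per_client_mb, per_client_mb / cts_per_client


# One Proxmox node reading alone from 2 Gluster nodes with 1G each.
print(read_share_mb_per_s(2, 1, 1, 50))

# All 3 Proxmox nodes reading at the same time.
node_mb, ct_mb = read_share_mb_per_s(2, 1, 3, 50)
print(round(node_mb, 1), round(ct_mb, 2))
```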

Not sure that will work, but thats why knowing your current metrics is important.

I'd self-build a node with Gluster before i'd go and buy a ready-made SAN (and maybe set up Gluster on top of it), or go with something like NetApp.

A lot cheaper. The reason is not just base cost, but also running cost.
This is because you can leave out all the redundancy features and spec it to exactly your needs. All you need is case + mainboard + CPU + RAM + PSU + disks/flash + NIC(s), sized exactly for your use case.
Then just set up your favourite Linux + ZFS + Gluster and you're done.
Need more redundancy? Just add another Gluster node.

axion.joey · Feb 26, 2016

Thank you. We are currently on a gig network, but are planning on deploying a 10Gig network specifically for the SAN.

I'll pull some io stats from our production environment tomorrow and will post them here ASAP.

axion.joey · Feb 26, 2016

Also, regarding the self-made node: that's the direction that I believe we'll go in. We're planning on using Dell R720xd's with Nas4free. We're going to use all SSD's. We're still researching SSD's.

mir · Feb 26, 2016

axion.joey said:
We're still researching SSD's

Go for Intel DC S35xx (cheap) or Intel DC S37xx (more expensive but better performance and durability)

hec · Feb 26, 2016

If you need 10G switches then you should use the Arista 7150s; they are ultra low latency switches. We use them as storage switches with NetApp in datacenters.

What is your price limit for this solution? Which products did you discuss with NetApp? Maybe we can find a solution which is not too expensive.

Q-wulf · Feb 26, 2016

axion.joey said:
Also, regarding the self-made node: that's the direction that I believe we'll go in. We're planning on using Dell R720xd's with Nas4free. We're going to use all SSD's. We're still researching SSD's.

I'm personally partial to FreeNAS, mainly because there is a commercial company behind it that pays a larger number of developers, because of the larger community (although it has a fair number of anti-social members), and because it seems to have a healthier commit rate.

Whats the reasoning behind all SSD ? is it just reliability ? or does Speed factor into this as well ??

axion.joey · Feb 26, 2016

hec said:
If you need 10G switches then you should use the Arista 7150s; they are ultra low latency switches. We use them as storage switches with NetApp in datacenters.

What is your price limit for this solution? Which products did you discuss with NetApp? Maybe we can find a solution which is not too expensive.

There are so many things to consider... At this point we're looking at the Arista 7148S and a few Cisco models. We need to do some testing to determine whether we're going to use iSCSI. If we go iSCSI then we don't need multi-chassis LAGs.

Wolfgang (awesome guy by the way) shared with me that LXC is supported on iSCSI. If so, then we could go with Nas4Free or FreeNAS with HAST and CARP.

The other option he shared is Ceph in KRBD mode. We have 0 Ceph experience, so that will take a lot of research and testing.
