Hardware suggestion

Alessandro 123

This was partially discussed in the networking forum, but this week I have to buy at least one server to start a new PVE cluster, so I need some final advice.

In the end, we'll go with at least 3 (maybe 5) servers in a cluster with GlusterFS as shared storage.

I'm evaluating these:
https://www.supermicro.nl/products/system/1U/1028/SYS-1028U-E1CRTP_.cfm
https://www.supermicro.nl/products/system/1U/1028/SYS-1028U-TR4T_.cfm

Almost the same, but the former has 2x 10GbE SFP+ plus 2x 1GBase-T, while the latter has 4x 10GBase-T.

I need the following networks (all redundant):
- public (gigabit)
- private (gigabit)
- cluster (I think gigabit, but in the networking forum, 10GbE seems to be preferred)
- storage (10GbE, preferably SFP+)

With the 1028U-E1CRTP+, I'll get 2 integrated SFP+ ports plus 2 gigabit ports that I could use for the cluster network (allowing live migration without shared storage, initially). When I add shared storage, using SFP+ should drop latency a lot compared to 10GBase-T copper.

On the other hand, the 1028U-TR4T+ gives me 10GBase-T for both the public and the cluster network.

As I need 8 network ports, I have to add an additional quad-port gigabit card, and that is the only add-on card I'm willing to add (I need some free slots for other components). So I can't add both a dual-port SFP+ and a dual-port gigabit card, because that would mean adding 2 PCIe cards.

Any suggestions?
And what about the storage (ZFS)? I'll use Gluster in the coming months/years, but right now I have to start without shared storage. The storage *must* be SSD.

As I wrote in another thread, I need maximum reliability in case of failure. Our DC doesn't have 24/7 personnel, and replacing a disk is probably the only operation that needs physical access to the server. In case of holidays or vacations, a disk replacement could take a couple of days, which makes a simple 2-way mirror too risky. Adding a hot spare (even with autoreplace set to on) is almost useless: you still pay the "cost" of that disk and still have to wait for resilvering before you are protected again. Much better to create a 3-way mirror than a hot spare. Same number of disks, same cost, but all disks are always in sync.

What if I create a RAID-6 with 4 disks? One disk more than a 3-way mirror, but the same reliability and more usable space. With one of the above servers, I can create 2 RAID-6 vdevs of 4 disks each (like a RAID-60) or 2 3-way mirrors using 6 disks (a RAID-10 but with 3 disks in each mirror).
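To put rough numbers on the two layouts, here is a minimal sketch; the SSD size is just a placeholder assumption, not the real drives.

Code:
# Minimal capacity/redundancy sketch of the two candidate pools.
# DISK_TB is a placeholder SSD size, not the actual drives.

DISK_TB = 1.6

def three_way_mirror_stripe(vdevs):
    """Stripe of 3-way mirrors: 3 disks per vdev, 1 disk of usable space each."""
    disks = vdevs * 3
    usable_tb = vdevs * DISK_TB
    return disks, usable_tb          # survives any 2 failures per mirror

def raidz2_stripe(vdevs, disks_per_vdev=4):
    """Stripe of RAIDZ2 vdevs: 2 disks of parity per vdev."""
    disks = vdevs * disks_per_vdev
    usable_tb = vdevs * (disks_per_vdev - 2) * DISK_TB
    return disks, usable_tb          # survives any 2 failures per vdev

print(three_way_mirror_stripe(2))    # 6 disks, ~3.2 TB usable
print(raidz2_stripe(2))              # 8 disks, ~6.4 TB usable

Both layouts tolerate two failed disks per vdev; the RAIDZ2 variant buys more usable space at the cost of two extra disks.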

Just a question: during a RAIDZ2 resilvering, is the new disk the only one receiving writes, or are all disks re-written with the new parity? Rewriting everything would mean extra stress for the surviving disks.
 
Just a question: during a RAIDZ2 resilvering, is the new disk the only one receiving writes, or are all disks re-written with the new parity? Rewriting everything would mean extra stress for the surviving disks.
Writes go to the disk being resilvered, but of course you will have reads on all the other disks.
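To put that resilver stress in rough numbers, here is a toy model; it assumes allocation is spread evenly, and ZFS only touches allocated blocks, so real figures will differ.

Code:
# Toy resilver-workload model. "used_tb" is the allocated data on the pool;
# the even per-disk split below is an approximation.

def mirror_resilver(used_tb):
    # one surviving member is read in full, the new member is written in full
    return {"read_per_survivor_tb": used_tb, "write_new_disk_tb": used_tb}

def raidz2_resilver(used_tb, disks=4):
    per_column_tb = used_tb / (disks - 2)   # each disk's slice, parity included
    return {"read_per_survivor_tb": per_column_tb,
            "write_new_disk_tb": per_column_tb}

print(mirror_resilver(1.0))        # the survivor reads ~1.0 TB
print(raidz2_resilver(1.0, 4))     # each of the 3 survivors reads ~0.5 TB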
 
You are aware that a 3-way mirror requires 3 full writes for each write?

Given your concern for reliability and the reasonably powerful CPU in your server, I would recommend a RAID-60 setup. This will be a nice balance between reliability, maximized usable space, and relatively good performance.
 
Yes, a 3-way mirror requires 3 full writes, but this would be the same even with RAID-6, where you have to write 3 times (data + parity + parity).
 
Yes, a 3-way mirror requires 3 full writes, but this would be the same even with RAID-6, where you have to write 3 times (data + parity + parity).
No, RAID-6 does not require 3 full writes. RAID-6 requires 1 full write (striped across the disks in the array), a parity computation, and 2 parity writes. Parity != data.
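To put numbers on it, a simplified sketch that ignores metadata, padding and ZFS's variable stripe width for small blocks:

Code:
# Bytes that actually hit the disks for one logical write.

def mirror3_bytes(data_mb):
    # a 3-way mirror writes every byte to all three members
    return 3 * data_mb

def raidz2_bytes(data_mb, disks=4):
    # data is spread over (disks - 2) columns, plus two parity columns
    parity_mb = 2 * data_mb / (disks - 2)
    return data_mb + parity_mb

print(mirror3_bytes(10))       # 30 MB written for a 10 MB chunk
print(raidz2_bytes(10, 4))     # 20.0 MB written: 10 MB data + 10 MB parity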
 
So a RAID-6 should have faster writes than a mirror, since it writes less data and spreads it across multiple disks? Usually, from what I have read on the net, mirrors are always faster than any parity RAID.

In other words: a 10MB chunk (just to keep it simple) on a 4-disk RAIDZ2 means writing 5MB to diskA, 5MB to diskB, parity to diskC and parity to diskD?

4 small writes vs 3 full writes. But a 3-way mirror is more flexible: I can add or remove devices, growing or shrinking between a 2-way and a 3-way mirror. This is my biggest concern, as a RAID configuration created with ZFS can't be changed afterwards (on the other hand, mdadm or any hardware RAID can migrate from mirror to RAID-5/6 or grow a RAID-5/6 by adding single disks).

Probably not being able to grow a RAIDZ2 by single disks won't be an issue; even now I tend to avoid that operation with hardware RAID (it's not clear to me whether I totally lose redundancy during the expansion or not), but I don't like having a system with limited features.

Since with ZFS all writes are full-stripe writes, will a RAIDZ2 perform equally well with 4 disks or 8 disks? Hardware RAID usually tends to prefer more disks in a parity RAID (I don't know why).
 
I know that page, but it's not clear to me (I've read it many times) how the speeds are calculated.
It seems that 4 SSDs in RAIDZ2 are faster (more than twice as fast) than 2 SSDs in RAID-1:

Code:
1x 4TB, single drive,          3.7 TB,  w=108MB/s , rw=50MB/s  , r=204MB/s
2x 256GB  raid1 mirror    232 gigabytes ( w= 430MB/s , rw=300MB/s , r= 990MB/s )
4x 256GB  raid6, raidz2   462 gigabytes ( w= 565MB/s , rw=442MB/s , r=1925MB/s )
5x 256GB  raid7, raidz3   464 gigabytes ( w= 424MB/s , rw=316MB/s , r=1209MB/s )

That would be nonsense, so I think I'm simply unable to understand that doc.

Even RAIDZ-3 is faster than the mirror.
 
I know that page, but it's not clear to me (I've read it many times) how the speeds are calculated.
It seems that 4 SSDs in RAIDZ2 are faster (more than twice as fast) than 2 SSDs in RAID-1:

Code:
1x 4TB, single drive,          3.7 TB,  w=108MB/s , rw=50MB/s  , r=204MB/s
2x 256GB  raid1 mirror    232 gigabytes ( w= 430MB/s , rw=300MB/s , r= 990MB/s )
4x 256GB  raid6, raidz2   462 gigabytes ( w= 565MB/s , rw=442MB/s , r=1925MB/s )
5x 256GB  raid7, raidz3   464 gigabytes ( w= 424MB/s , rw=316MB/s , r=1209MB/s )

That would be nonsense, so I think I'm simply unable to understand that doc.

Even RAIDZ-3 is faster than the mirror.
You are forgetting one important thing: raidz(n) distributes writes evenly across all the disks in parallel, which is why it performs better than a mirror or a single disk. E.g. the raidz2 above will split every write into 4 parts and write one part to each disk simultaneously.
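A sketch of that reasoning with a made-up per-disk speed (not the figures from the benchmark above):

Code:
# Why striping can beat a mirror on big sequential writes: each disk only
# absorbs its own column. PER_DISK_MB_S is an assumed figure.

PER_DISK_MB_S = 300

def mirror_write_mb_s():
    # every member writes the whole stream, so the vdev is as fast as one disk
    return PER_DISK_MB_S

def raidz2_write_mb_s(disks=4):
    # user data lands on (disks - 2) columns in parallel; the two parity
    # columns are written at the same time on the remaining disks
    return (disks - 2) * PER_DISK_MB_S

print(mirror_write_mb_s())      # ~300 MB/s of user data
print(raidz2_write_mb_s(4))     # ~600 MB/s of user data, CPU for parity permitting

For small random IO the usual rule of thumb still favours mirrors, since each RAIDZ vdev delivers roughly the IOPS of a single disk.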
 
You are forgetting one important thing: raidz(n) distributes writes evenly across all the disks in parallel, which is why it performs better than a mirror or a single disk. E.g. the raidz2 above will split every write into 4 parts and write one part to each disk simultaneously.
That's what I thought, but it's very different from *ANY* other paper you can find online, where a mirror (or, even better, a single disk) always outperforms a parity RAID.

That is the one and only paper saying that a RAID-6 is faster than a single disk.

The same goes for striped mirrors (RAID-10), universally known as the fastest way to get redundancy with RAID:
4x 4TB, 2 striped mirrors, 7.5 TB, w=226MB/s , rw=53MB/s , r=644MB/s

That paper says it is slower (much slower) than a RAID-6. If that were true, why use RAID-10 and not RAID-6? RAID-6 should be much safer but much slower, yet that paper says it is faster.
 
Just a note, for what it is worth: you may wish to do your $Spend in stages, or be careful with the plan, or accept that the path forward may be complex. I did some test deploy work last year on a Ceph Proxmox distributed 'converged' config, and the end-of-the-day lesson for me was not to try a 'small' (ie, 5 nodes or fewer) cluster with this kind of storage. I then tested the latest DRBD, and the lesson from that was "not yet production ready". I ended up going with a nice, conservative, simple model: Proxmox nodes with modest local disk, serviced by multiple parallel HA_NFS_Filers to satisfy the shared storage requirement. Not remotely as sexy and shiny a solution as Ceph or Gluster, but exceptionally solid and reliable, and the IO bandwidth still exceeded project requirements, so it was a no-brainer path forward.

ie, don't just assume that you will have 100% instant and easy success deploying a production-grade 3-5 node GlusterFS-backed stack with Proxmox, just because "it should be possible" according to the docs. Or maybe you are planning to buy support subscriptions from ProxVE to help smooth things out, I am not sure.

While it can be entertaining (?) to debate ZFS vs RAID-6 vs RAID-whatever, at the end of the day you most likely need to meet certain functional requirements (ie, this much RAM and CPU capacity for VMs, this much disk capacity, and this much IO throughput) as the core of the path forward, and it matters less whether the solution is implemented with Ceph vs Gluster vs ZFS vs iSCSI vs NFS vs ... whatever.

So at the very least, for a serious project, you will have a 'pre-deploy config build test' cycle that is totally separate from the 'production build deploy' phase; and this may well include a great deal of cursing at the complexity, stability, and various other bits of joy. So maybe, for example, you buy just 3 boxes initially, and then proceed to buy 2 more once you are past the first hurdle.

For my 'testing project' last year, the client just decided to burn money on renting 3x OVH servers for a few months. We were able to get machines with 10gig copper private connectivity, which was sufficient for the tests; it was a lot cheaper than buying the full hardware and let us test various configs with relatively little pain. However, I realize that burning $ on rentals is a really non-appealing scenario for some people (ie, I am not keen on it myself most of the time, but sometimes it can make sense).


Just my 2 cents though.

Good luck with the work!


Tim
 
I know, that's why I'm asking the community.
You can't run every kind of test, it's simply impossible, so asking here could help avoid some common pitfalls.

I also know that using shared storage (Gluster is way easier to understand, configure and maintain than Ceph) is not as easy as using direct-attached disks; that's why I'll start with a RAID on each server and then move to Gluster when I'm ready. (I also have Gluster in production for another project and it is working fine.)
 
Sounds very good! Thanks for the reply and clarification. I haven't played with Gluster for a few years, so maybe I need to loop back and look at it again. Certainly you are right, it is good to avoid problems with help from others in the forum where possible.

In case it is of interest, if you are using local RAID at all: I've found "Bcache" with SSD drives makes otherwise slow RAID feel much (!) nicer to use in Proxmox. Not an expensive config, but a very good performance boost. Of course it does cost some storage ports in your server (ie, 2 bays for a pair of SSD cache drives means 2 fewer drives available for bulk storage, which might be a problem).

I'm very interested to hear how the Gluster config works out, so I hope you update the thread later as things move along.

Thanks,

Tim
 
Ah, yes, my oversight. SSD-only drives, so no benefit from an SSD cache, I think. Unless for some reason you have 'asymmetric' disk sizes/costs or something like that. :-)
 
And what about the other questions? One of the biggest issues is whether to use 10GbE for the cluster network. Without it, everything would be easier.
 
Hi, regarding 10gig for the cluster: I believe the only benefit is that your share-nothing migrations between nodes will go faster than with 1gig cluster interfaces. As long as your cluster network is very solid and supports multicast (local private networking typically fulfils both requirements easily), then from my experience I really can't say that 10gig for cluster interfaces is that important (ie, compared to a 1gig cluster interface network).

Ultimately, my experience with share-nothing migrations is that they work fine if your expectations are 'reasonable' (ie, highly active VMs are not migrated, because the rate of change within the VM exceeds the cluster network bandwidth, so the migration process never completes). Since the share-nothing migration effectively runs as a 'background task' and the VM keeps operating while the process is underway, it is "OK" as long as you are not in a hurry. Clearly, if you want "5 minutes to migrate a 500gig VM" you will not be satisfied. But if you are just trying to re-balance VMs across a cluster in a non-time-urgent manner, then share-nothing migrations, coupled with sensible use (not-massively-active VMs) and patience, are sufficient. Obviously, if you can stop / migrate / start the VM, then share-nothing migrations have zero extra penalty due to VM activity (since there is none). But some folks don't like downtime on VMs, so a compromise must sometimes be found between the desire for uptime, the desire to migrate VMs, and the desire to avoid the infrastructure complexity of shared storage.
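For a sense of scale, simple line-rate arithmetic (my own assumption that the NIC is the only bottleneck, ignoring dirtied-block re-sends):

Code:
# Rough share-nothing migration time at line rate.

def migration_minutes(vm_gb, link_gbit):
    link_mb_s = link_gbit * 1000 / 8        # usable MB/s at line rate
    return vm_gb * 1024 / link_mb_s / 60

print(round(migration_minutes(500, 1), 1))    # ~68 min for 500 GB on 1 GbE
print(round(migration_minutes(500, 10), 1))   # ~7 min for the same VM on 10 GbE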

Ultimately, I would expect 10gig interfaces to be most critical when you are carrying cluster storage traffic (ie, Gluster, Ceph or even NFS) and you are trying to ensure good bandwidth for VM IO.

Not sure if I've missed the general thrust of your question though or not.

Tim
 
One more footnote, for what it is worth (maybe?). My experience with redundant network config requirements (ie, the strict need for multi-path networking in your servers and switches in order to be fault-tolerant) is that, generally speaking, your risk of server failure due to power supply faults or an atypical RAID controller death is higher than the risk of a NIC death or a switch death.

ie, generally, in most common use cases I've deployed, such as (for example) a 3-node Proxmox cluster:

The need for a redundant NIC/switch config is essentially zero, because the reliability of those parts of the stack is so much higher than that of other components that it is irrelevant, and the added build complexity is such that it isn't worth the effort. You are more likely to suffer an outage due to human error and a configuration 'oops' during admin work (ie, maybe an even worse risk because of the added complexity and extra moving parts in your stack).

So for the most part, I'm not a big fan of adding extra complexity, unless the cost of downtime is truly magnificent, the build budget is sufficient, the redundancy is built in sensibly, and change management is done properly (ie, the environment is well documented, managed appropriately, and the risk of human error is minimized, not enhanced).

Obviously, if you need the extra bandwidth (ie, LACP trunking to get both redundancy/fault tolerance and enhanced throughput), then maybe you can kill a few birds with one stone, as it were. But anyhow, this isn't always the case either.

ie, at the end of the day, network diagrams with redundancy at all layers of the stack can 'look appealing' (since they are of course theoretically possible), but it is important to do a proper risk assessment of the entire stack to be sure what the best/most appropriate build configuration is.

Tim
 
The storage network (the one used for Ceph/Gluster/NFS/whatever) would be 10GbE with SFP+.
The cluster network is the one used for Pacemaker and for live migration without shared storage.

Obviously, I know that a live migration might not complete for a heavily loaded VM, but not all of my VMs are heavily loaded. I can migrate 99% of the running VMs with no issue and then try to migrate the heavily loaded ones. If the live migration doesn't complete, I can stop some services (or, better, make them read-only) and migrate again.
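The "doesn't complete" case can be sketched as a toy pre-copy model; the dirty rate and sizes below are my own assumptions, just to show the shape of the problem:

Code:
# Toy pre-copy model: each pass sends the current dirty set while the guest
# keeps dirtying data; migration converges only if the link outruns the guest.

def passes_to_converge(vm_mb, link_mb_s, dirty_mb_s, final_pause_mb=512):
    remaining = vm_mb
    for passes in range(1, 100):
        seconds = remaining / link_mb_s       # time to ship the current dirty set
        remaining = dirty_mb_s * seconds      # what was dirtied meanwhile
        if remaining <= final_pause_mb:
            return passes                     # small enough for a short final pause
    return None                               # never converges at this dirty rate

print(passes_to_converge(500 * 1024, 1250, 200))   # 10 GbE: converges in a few passes
print(passes_to_converge(500 * 1024, 125, 200))    # 1 GbE, 200 MB/s dirty rate: None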

The other question is: is SFP+ really needed for the *storage* network? I can choose between:

Dual SFP+ for the storage network plus dual 1GbE (RJ45) for the cluster network

or

Quad 10GBase-T (RJ45) for both cluster and storage.

The quad 10GBase-T solution seems more flexible, as it would also let shared-nothing live migrations of heavily loaded machines complete (10GbE is 10 times faster than 1GbE), but the latency is worse.
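On the SFP+ vs 10GBase-T latency point: the ballpark figures usually quoted are a few hundred nanoseconds per hop for SFP+ with DAC cables versus roughly 2-3 microseconds per hop for 10GBase-T (the PHY block encoding adds delay). A quick sketch with those assumed numbers, compared against a typical flash read:

Code:
# Latency-budget sketch. All figures are assumptions (commonly quoted
# ballparks), not measurements of these particular NICs or switches.

SFP_DAC_US   = 0.3      # assumed one-way latency per hop
TENGBASET_US = 2.5      # assumed one-way latency per hop
SSD_READ_US  = 100.0    # assumed flash read service time

def round_trip_overhead_us(per_hop_us, hops=2):
    # e.g. server -> switch -> server and back again
    return per_hop_us * hops * 2

for name, per_hop in (("SFP+ / DAC", SFP_DAC_US), ("10GBase-T", TENGBASET_US)):
    extra = round_trip_overhead_us(per_hop)
    print(f"{name}: +{extra:.1f} us on top of a ~{SSD_READ_US:.0f} us flash read")

So with SSD-backed storage the extra copper latency is single-digit microseconds per round trip; whether that matters depends on how latency-sensitive the workload really is.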
 
