SoHo-Cluster - Tips and Knowledge appreciated

Macavity

Hi,

I have been using Proxmox v3 for some testing purposes on an old Fujitsu server and really love it.
Now I would like to set up a production "cluster" for our small office.

The most important piece of software to be run is a Microsoft SQL Server. It is recommended to run SQL Server across multiple disks: either two separate disks (OS, SQL Server and the .mdf files on the first, the .log and tempDB files on the second), or, even better, each component on its own drive (only the OS and SQL Server share a disk; everything else gets its own disk). For maximum performance, SSDs are recommended.

As hardware for the nodes I already have two Dell PowerEdge T20 servers (Xeon E3-1225 v3) and have added 32 GB of ECC RAM (the maximum) to each. They came with a single 1 TB 7.2k HDD.

At first I wanted to add two SSDs to each node and attach them as local storage to the VMs in Proxmox, until I realized that I would then not be able to migrate a VM while it is running.

After reading a bit of "Mastering Proxmox", I think it would be better to choose the "basic setup" consisting of two Proxmox nodes and a FreeNAS storage server.

So I would like your opinions on the following idea:
- buy a third Dell T20 and bring it up to 12 GB of ECC RAM (it ships with 4 GB, and I have two 4 GB modules left over from the previously bought servers)
- add a RAID adapter to it (I have always used LSI, so I would pick something that runs well with the T20, e.g. a 9260)
- buy two small SSDs for the Proxmox nodes and one for the FreeNAS server
- buy two large SSDs and connect them to the RAID controller as RAID 0 (what would be better: letting FreeNAS do the work [software RAID] or hardware RAID?)
- add the three 1 TB HDDs to the RAID controller as storage for the "not so speed-sensitive" VMs

In "Mastering Proxmox" the Proxmox nodes and the FreeNAS are all attached to the same Gigabit Switch.
Would it be wise to invest in three seperate Gb-NICs to seperate the filesystem traffic?

I hope this did not get too long. I just want to make sure that I am not dissatisfied after spending more money on this, so any hints are welcome. ;)

Best regards

Macavity
 
- buy two large SSDs and connect them to the RAID controller as RAID 0 (what would be better: letting FreeNAS do the work [software RAID] or hardware RAID?)
Hi,
RAID 0 is a dangerous option for a production setup!

Some remarks:
With your FreeNAS solution you have a single point of failure, and with only two PVE nodes you will also have trouble if one node dies (the remaining node will not have quorum).
Why not use the third server as a third node?

But the storage... one big SSD and one 1 TB SATA disk in every node for Ceph?! (You should test beforehand whether the performance is good enough.)

A year ago I would have said DRBD is the right choice for this scenario. Have a look at the "DRBD-8 back" thread; perhaps that will be a solution for you (it depends on the PVE team).

Udo
 
Hello udo,
thank you for your reply.

RAID 0 was a typo, I meant RAID 1.

Ceph was not the plan (I already ruled that out; it is too "big" for my usage scenario). The idea was more like iSCSI for the VMs and NFS for ISOs.

My main goal is also not an HA cluster. If one node fails, I do not need to be back up in the blink of an eye; I would be fine if the downtime until I am back in business is no more than 10-15 minutes. I also do not need much storage.

I have to say that DRBD seems to be the way to go for me.
After reading the wiki article about DRBD, it seems best to add the following to each of the two nodes:
- a separate GbE NIC for DRBD
- a BBU-backed RAID controller (e.g. a Dell PERC)
- two SSDs (as RAID 1) for the MS SQL Server
- two HDDs (as RAID 1) for storage
- one small SSD for the Proxmox OS

Or is the setup with a separate SAN more reliable / future-proof / wise?

Best regards

Macavity
 
If you want SSDs and DRBD, you need 10 GbE. Everything else is technically possible, but you will be limited to about 125 MB/s while writing, and then the benefit of your SSDs is gone. If you use DRBD, you can technically run with one SSD per node, because you can read from the other node if your local disk fails. DRBD is RAID 1 for networked devices.
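To put rough numbers on that, here is a back-of-the-envelope sketch in Python; the figures ignore TCP/DRBD protocol overhead, and the SSD write speed is just an illustrative vendor-style number:

```python
# Rough write ceiling of a single replication link vs. a typical SATA SSD.

def link_write_ceiling_mb_s(link_gbit: float) -> float:
    """Raw link rate converted to MB/s (protocol overhead ignored)."""
    return link_gbit * 1000 / 8   # 1 Gbit/s ~ 125 MB/s

for link_gbit in (1, 10):
    print(f"{link_gbit:>2} GbE link: ~{link_write_ceiling_mb_s(link_gbit):.0f} MB/s write ceiling")

ssd_write_mb_s = 450  # illustrative sequential write speed of a SATA SSD
print(f"SATA SSD:   ~{ssd_write_mb_s} MB/s -> on 1 GbE the network is the bottleneck")
```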

I would never put an OS on only one disk. Just place it on your two SSDs for MS SQL. Do you plan to store much data? If not, just buy four SSDs instead of two SSDs and two HDDs and use RAID 5 or RAID 10.

The DRBD way is the cheapest way to get "real" shared storage with HA capabilities (at least for the data). A SAN is of course better (with two controllers), but also more expensive.
 
Hello LnxBil,

thank you for your reply.
Unfortunately, 10 GbE NICs seem to be over my budget (or have I been doing something wrong when searching? I only found cards for more than 200 €).

I am probably overcomplicating this. My goal is to run the following VMs:

- first VM: Windows with MS SQL Server (this is the most important VM; the SQL database is no larger than 20 GB; speed matters most here)
- second VM: Windows with Remote Desktop, for data entry
- third VM: Windows with Remote Desktop, periodically running some calculation tasks
- fourth VM: Linux mail server
- fifth VM: Linux with MySQL, for access to an old database with historic data; powered up only when needed

The main goal is to minimize downtime in case of hardware failure (HA is not required; in that case the downtime should not exceed 30 minutes).
It is also important that the MS SQL Server performs at its best (I have planned 12 GB of RAM for this VM).
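For reference, here is a quick tally of how this might fit into the 32 GB per node; only the 12 GB for the SQL VM is fixed, the other allocations and the host reserve are rough placeholder guesses:

```python
# Hypothetical RAM plan for one 32 GB node; only the 12 GB for the MS SQL VM
# is taken from the post above, the rest are placeholder values.
host_ram_gb = 32

vms_gb = {
    "win-mssql":       12,  # figure from the post
    "win-rdp-entry":    4,  # placeholder
    "win-rdp-calc":     4,  # placeholder
    "linux-mail":       2,  # placeholder
    "linux-mysql-old":  2,  # placeholder, powered up only when needed
}

reserved_for_host_gb = 4    # Proxmox/KVM overhead, caches, etc. (guess)

used_gb = sum(vms_gb.values()) + reserved_for_host_gb
print(f"planned: {used_gb} GB of {host_ram_gb} GB -> {host_ram_gb - used_gb} GB headroom")
```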


Do you (or anybody else) have any advice on what would be the best solution for me?
 
Hi,

I've been using DRBD on a 2-node cluster, but given the recent changes (and having had some issues with the Linbit repos not being really up to date with the latest PVE kernel) I'm now evaluating sheepdog. I've slightly changed my setup to a 3-node cluster, but with only 2 nodes running sheepdog on top of ZFS (the 3rd node is only there to make sure I have quorum under corosync). And I must say that since v1 of sheepdog came out, this test cluster has been running really well and has performance comparable to DRBD, far better than Ceph, which to me simply does not seem to be made for small setups like this.

The only issue I still have is that a "dog cluster check" might give some false alarms under heavy I/O load.

I would suggest you give this a try, knowing that I run sheepdog over a dedicated 1 Gbit link for the same reasons as yours ;-) But I would strongly suggest throwing in a good pair of enterprise-class SSDs. ZFS is quite good if you give it some SSD cache (and a ZIL/SLOG) and it can be made less memory hungry.
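On the "less memory hungry" part: on ZFS on Linux the usual knob is the zfs_arc_max module parameter. Here is a minimal sketch that prints the corresponding modprobe line; the 4 GiB cap is only an example value, not a recommendation:

```python
# Print a modprobe.d line that caps the ZFS ARC size on ZFS on Linux.
# The 4 GiB value below is an example; pick what fits your box.

arc_cap_gib = 4
arc_cap_bytes = arc_cap_gib * 1024**3

print("# e.g. put this into /etc/modprobe.d/zfs.conf (and rebuild the initramfs if ZFS is loaded from it)")
print(f"options zfs zfs_arc_max={arc_cap_bytes}")
```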
 
Hi k0n24d,

Thank you for the information. I thought that sheepdog was not meant for production use; that's why I did not look into it any further.

Which GbE NICs do you use?
Do all three nodes have to be identical, or can I use an "old" server as the third node? Did you follow any howto (and if so, could you post the link)? ;)
 
Hi,

Well, as I said, it's a test cluster for now, running off-site on some rented servers: two identical ones for the sheepdog part and a third low-end one.

As always, "production ready" is a little bit subjective, but I can only say that I had many more issues with DRBD9 than with sheepdog once it reached 1.0.0. As that release is not very old, I won't tell you it's rock solid, but in my opinion it deserves a try. You can couple this with some ZFS snapshots so that you can roll back your whole cluster if needed, or just get back parts of your data if you know which parts you have to restore.

Regarding MS SQL performance on sheepdog, you'll have to try it yourself, as I have no experience with that.

I didn't follow any real howto, but based my install on https://pve.proxmox.com/wiki/Storage:_Sheepdog (the collie command was renamed to dog) and the documentation on the project site (https://github.com/sheepdog/sheepdog/wiki). I used ZFS as the base filesystem (just make sure xattr is enabled on the ZFS dataset you're using). Compared to DRBD or Ceph, the setup is really, really easy.
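In case it helps, here is a minimal sketch (Python wrapping the zfs CLI) of how one might create a dedicated dataset with xattrs enabled as the backing store; the pool/dataset name and mountpoint are made up, and the actual sheep/dog setup steps are in the wiki pages linked above:

```python
# Create a ZFS dataset with extended attributes enabled, intended as the
# backing store for sheepdog. Pool, dataset and mountpoint are hypothetical.
import subprocess

pool_dataset = "tank/sheepdog"      # adjust to your pool layout
mountpoint = "/var/lib/sheepdog"    # example mountpoint

subprocess.run(
    ["zfs", "create",
     "-o", "xattr=sa",              # system-attribute-based xattrs (ZoL); xattr=on also works
     "-o", f"mountpoint={mountpoint}",
     pool_dataset],
    check=True,
)

# Verify the property took effect.
subprocess.run(["zfs", "get", "xattr", pool_dataset], check=True)
```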
 
ZFS is quite good if you give it some SSD cache (and a ZIL/SLOG) and it can be made less memory hungry.

That is simply wrong. Adding L2ARC will increase your memory needs, because for every entry in your L2ARC you also need an ARC entry. You are therefore actually lowering the usefulness of your ARC, and you need to increase your overall ARC size to end up with the same amount of usable ARC as before.
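To put rough numbers on this, a back-of-the-envelope sketch; both the per-entry header size and the average block size are assumptions, and the exact header size is debated further down in this thread:

```python
# Rough estimate of how much ARC memory the L2ARC index consumes.
# Header size and average block size are assumptions, not authoritative values.

def l2arc_index_overhead_gb(l2arc_gb, avg_block_kb, header_bytes):
    """ARC memory (GB) needed to index an L2ARC of the given size."""
    n_entries = (l2arc_gb * 1024**3) / (avg_block_kb * 1024)
    return n_entries * header_bytes / 1024**3

l2arc_gb = 200  # example: a 200 GB SSD used as cache device
for avg_block_kb in (8, 128):
    for header_bytes in (70, 128, 380):
        overhead_gb = l2arc_index_overhead_gb(l2arc_gb, avg_block_kb, header_bytes)
        print(f"{l2arc_gb} GB L2ARC, {avg_block_kb:>3} KB blocks, "
              f"{header_bytes:>3} B/entry -> {overhead_gb:5.2f} GB of ARC")
```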

But yes, a good SSD will increase write performance a lot; a bad one (consumer crap) will not increase it, or may even lower the throughput.
 
Another option would be pve-zsync replication from one server to the other, but that is not synchronous replication.

If speed does not matter, just stick with DRBD (8, and maybe PVE 3.4).
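A minimal sketch of how such a pve-zsync job could be set up, driven from Python for illustration; the VM ID, target address, dataset and job name are made up, and the options should be double-checked against the pve-zsync help on your version:

```python
# Hypothetical pve-zsync job: replicate VM 100 to a second node.
# All values are examples; verify the options against your installed version.
import subprocess

vmid = "100"                        # VM to replicate (example)
dest = "192.168.15.2:rpool/backup"  # target host and dataset (example)

subprocess.run(
    ["pve-zsync", "create",
     "--source", vmid,
     "--dest", dest,
     "--name", "mssqlsync",         # job name (example)
     "--maxsnap", "7",              # keep the last 7 snapshots (example)
     "--verbose"],
    check=True,
)
```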
 
That is simply wrong. Adding L2ARC will increase your memory needs, because for every entry in your L2ARC you also need an ARC entry. You are therefore actually lowering the usefulness of your ARC, and you need to increase your overall ARC size to end up with the same amount of usable ARC as before.

But yes, a good SSD will increase write performance a lot; a bad one (consumer crap) will not increase it, or may even lower the throughput.

Let's reformulate. You can:
- improve ZFS read performance by throwing in an SSD for L2ARC cache
- improve ZFS write performance by throwing in an SSD for the SLOG/ZIL (OK, before you correct me: it should be mirrored across at least two SSDs)
- reduce the memory footprint by tuning the module parameters (and by not using deduplication, as that gets really memory hungry)

Also, reading you, it sounds like adding one block (4 KB for example) to the L2ARC removes one block from the ARC; that is simply not true. Every L2ARC block needs about 70 bytes of ARC memory (if I remember correctly). So yes, adding L2ARC reduces the available ARC space (if you're looking at constant memory usage), but in the end you still get more cache, even if some of it is slower than ARC alone. And yes, it all DEPENDS ON YOUR OWN USAGE.

Well, depending on how I read you, it looks like you mixed up L2ARC and SLOG/ZIL. L2ARC has nothing to do with write performance, except that with a bad SSD you might even get lower write performance ;-)
 
Hi,

Thanks for all the input.
pve-zsync could probably be a solution for me.
Is it possible to use a separate NIC with it, so the data transfer does not "block" the main NIC?
 
Also, reading you, it sounds like adding one block (4 KB for example) to the L2ARC removes one block from the ARC; that is simply not true. Every L2ARC block needs about 70 bytes of ARC memory (if I remember correctly). So yes, adding L2ARC reduces the available ARC space (if you're looking at constant memory usage), but in the end you still get more cache, even if some of it is slower than ARC alone. And yes, it all DEPENDS ON YOUR OWN USAGE.

This is a good primer on L2ARC, IMHO:
https://forums.freenas.org/index.php?threads/at-what-point-does-l2arc-make-sense.17373/#post-92302



The L2ARC index must be stored in the ARC. It uses 380 bytes per entry. So take the quick rule of thumb that your L2ARC shouldn't exceed 5x your ARC as the most liberal.
It is based on FreeBSD's version of ZFS, but it explains the mechanism very well.


Regarding 70 bytes vs. 380 bytes:
ZFS is not ZFS is not ZFS, from distro to distro to distro.
 
https://www.illumos.org/issues/5408
https://github.com/zfsonlinux/zfs/commit/b9541d6b7d765883f8a5fe7c1bde74df5c256ff6

Looks like I was on the low side; the real value should be close to the 128 bytes you get on illumos, as ZoL is more or less based on that codebase (I'll let you double-check the ZoL implementation).

But for sure a post from January 2014 (talking about FreeBSD, as you pointed out) can't be accurate anymore, as this was optimized in December 2014 in illumos / ZoL.

PS: don't get me wrong on this, I don't want to start a flame war, but there have been a lot of changes and optimizations in ZoL over the last two years, so it's quite hard not to get misleading info if you don't look at the dates.
 
[...]
But for sure a post from January 2014 (talking about FreeBSD, as you pointed out) can't be accurate anymore, as this was optimized in December 2014 in illumos / ZoL.

PS: don't get me wrong on this, I don't want to start a flame war, but there have been a lot of changes and optimizations in ZoL over the last two years, so it's quite hard not to get misleading info if you don't look at the dates.

My post was not meant as a "look, my number is right" post (as hopefully evidenced by the second quote), but more about how the concept of the L2ARC works (which hasn't changed, and is explained pretty well in the thread I quoted).

It basically boils down to this:
Your ARC is your performance booster; high bandwidth and low latency are its main characteristics. When you add L2ARC you do not simply "generate" more of that performance booster. L2ARC is slower and has higher latency (in most cases due to the nature of the storage media used). What's worse, you actually use ARC as index storage for every entry you put in your L2ARC. Whatever the exact value (70 vs. 128 vs. 380 bytes per entry), there eventually comes a point at which you hit one of two situations:
a) you don't get to use your ARC, because it is all used up for the L2ARC index (so in essence, instead of having a RAM-based performance booster, you are now totally reliant on an SSD/NVMe performance booster)
b) you get even worse than SSD/NVMe performance, because all these L2ARC entries no longer fit into your ARC.

And that is not even accounting for the fact that the ARC gets bypassed.

Let's not forget that I posted all this in a thread where the OP will be using a Proxmox storage server that has 12 GB of RAM. It's quite likely that he'll hit that barrier sooner rather than later.
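As a rough illustration of where that barrier sits on such a box, here is a sketch using the "5x your ARC" rule of thumb quoted above; the fraction of RAM available as ARC is an assumption:

```python
# Hypothetical small-box scenario: how much L2ARC does the "5x your ARC"
# rule of thumb (quoted above) actually allow? The ARC fraction is a guess.

ram_gb = 12                 # the storage box discussed in this thread
arc_fraction = 0.5          # assumption: roughly half the RAM ends up as ARC
arc_gb = ram_gb * arc_fraction

max_l2arc_gb = 5 * arc_gb   # "most liberal" rule of thumb from the quote above
print(f"~{arc_gb:.0f} GB ARC -> at most ~{max_l2arc_gb:.0f} GB L2ARC "
      f"before the index starts crowding out useful ARC")
```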


PS:
I personally do not use ZFS on Linux beyond RAID 1 Proxmox setups. I use it heavily on FreeNAS (based on FreeBSD). Most of my business use cases do not require L2ARC (because I can stick hundreds of gigs of RAM into those machines). For those that do, I tend to go with Ceph on NVMe-based pools.

my conclusion:
use the hammer that is best suited for the job.
 
Well, depending on how I read you, it looks like you mixed up L2ARC and SLOG/ZIL. L2ARC has nothing to do with write performance, except that with a bad SSD you might even get lower write performance ;-)

No, I haven't mixed them up. Your L2ARC is written to heavily, or how else would your data magically appear on that disk? The SLOG/ZIL is another thing - also important - but keep in mind that it is only used for synchronous writes; every asynchronous write bypasses the ZIL and will first hit your ARC, then your L2ARC and then your disks (unless it bypasses the ARC).

@Q-wulf said exactly what I tried to explain, so thanks for that.
 
I probably misinterpreted the replies, and I'm going to stop here, as our discussion is really drifting away from the initial question.

every asynchronous write bypasses the ZIL and will first hit your ARC, then your L2ARC and then your disks (unless it bypasses the ARC).

If you feel that the L2ARC is populated by write requests, I suggest you read https://blogs.oracle.com/brendan/entry/test (especially the end).

For sure, data is written to the L2ARC device, but not based on async write requests (of course, an exception would be if you need to invalidate a cached block). The basic idea is that data moves from the ARC to the L2ARC when it would otherwise be evicted because your ARC is full.

@Q-wulf, I do understand what you are saying, and I mostly agree with your comments, but from my understanding the OP is talking about a Xeon E3-1225 v3 with 32 GB (which is the maximum). So depending on his workload it might be interesting for him to add SOME L2ARC, as he can't throw in more memory. And you are right, it is always better to go for more memory IF you can.
 
