GlusterFS on new cluster?

cathode

Renowned Member
Jun 9, 2015
Hi All,

We have a new deployment that I think would benefit from some sort of hyperconverged design. The initial plan was to build a replicated FreeNAS array using a system from Supermicro and add it as iSCSI storage to an existing 6-node cluster. However, in testing I'm seeing a TON of ctld errors, which doesn't fill me with confidence, though that could just be my testbench being underpowered for ZFS, of course:

System:
SYS-6029P-TRT

Specs for those interested:
Dual Xeon 4110 @ 2.1 GHz (8 cores / 16 threads)
Supermicro AOC-STGN-i2S
96GB RAM
2x 32GB SATA DOM
8x 4TB Seagate Constellation 7200 RPM

So I've got 2 of those on the way, I also have a third supermicro system in production:

SYS-5029P-WTR

Single Xeon 4114 @ 2.2 GHz (10 cores / 20 threads)
64GB RAM
Dual 1TB (RAID1) boot disks

Current design is on a gig network and I'm looking to move to 10gig

What I was thinking of doing is creating a new cluster with the 10gig-capable hardware, splitting the drives from the 2 systems between the 3 servers: 4x 4TB in each server in RAID10 (via the onboard RAID controller), and redistributing the SATA DOMs so each system has one as a boot drive. I'm not sure if the Supermicro onboard RAID supports hot spares (new to the platform), so that would leave 4 drives for hot/cold spares and 1 spare SATA DOM, which is fine.

Networking would be fed into an Allied Telesis 10Gig switch via the onboard NICs.

PVE would be loaded on the SATA DOMs and I was thinking about using GlusterFS on the RAID10 in a new cluster.

I see that the recommendation is to use the upstream GlusterFS, which is currently 5.1-1 in the Gluster repo.
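For reference, the rough shape of what I'd be setting up is below, assuming a replica-3 volume across the three nodes; the hostnames and brick paths are just placeholders:

Code:
# on the first node, after installing glusterfs-server on all three
gluster peer probe pve2
gluster peer probe pve3
gluster volume create vmstore replica 3 \
    pve1:/data/brick1/vmstore pve2:/data/brick1/vmstore pve3:/data/brick1/vmstore
gluster volume start vmstore

# then add it to PVE as shared storage
pvesm add glusterfs gluster-vmstore --server pve1 --server2 pve2 --volume vmstore --content images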

I know that SSDs are preferred over spinning rust, but rust is what I have to work with. Am I trying a bit too hard to get bang for buck, and should I just stick with the FreeNAS idea?

This will run a read-heavy MSSQL database and IIS combo for a school management system, and also an intranet/LMS system based on Ubuntu. Ideally the existing VMs (20 Linux and 10 Windows) would move over in due course, as the current cluster is a hodgepodge of hardware spanning almost 10 years.

Appreciate thoughts on GlusterFS 5.1 suitability for this design over a replicated FreeNAS.

Thanks in advance.
 
In spite of spending 25 years in IT in the infrastructure / applications space, somehow I never needed to manage large arrays, so I've never bothered with IOPS calculations. However, over the past few weeks, due to performance issues, I've been reading up a lot about it. My 3-drive Seagate Barracuda setup doesn't run very well. It has been running ZFS, which I think is partly to blame, but it's clear to me that you would need a LOT of HDDs to come anywhere near the IOPS of an SSD. Think about 80-150 IOPS per HDD vs 20,000+ IOPS for even the cheapest SSD. So in a RAID10 setup like yours you could get up to circa 600 IOPS. When you add a 10Gb network to that, you're going to be pushing it even more, particularly in a ZFS environment as I've read it. I've just gone down the whole path of looking into SLOG / ZIL devices (just disabled sync writes to test out the SLOG theory) and so on, and ultimately decided I'm better off with a single enterprise SSD and backups. Of course you'll have different needs, but I think it's worth investigating how far you can go. Also, if someone else has a different opinion, I'd love to know, in case it helps my understanding.
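As a rough sanity check, the arithmetic behind that ~600 IOPS figure looks something like this (150 IOPS per spindle is an optimistic assumed figure for 7200 RPM drives):

Code:
# back-of-envelope IOPS for a 4-disk RAID10
DISKS=4; PER_DISK=150
echo "reads:  ~$(( DISKS * PER_DISK )) IOPS"      # all spindles can serve reads
echo "writes: ~$(( DISKS / 2 * PER_DISK )) IOPS"  # each write lands on both halves of a mirror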

I'm going to retry the whole thing with EXT4 again; so far that performs much, much better and is less complicated. Of course, it doesn't do self-healing. I've read up on GlusterFS before and it looks interesting. I agree it needs 10Gb+ on the network side to make it worthwhile, but that may put pressure on other parts of the system.
 
Totally agree with @Marshalleq, SSD at least for SLOG if you plan to use ZFS or CEPH.

Appreciate thoughts on GlusterFS 5.1 suitability for this design over a replicated FreeNAS.

The GlusterFS solution should be simpler to use than the replicated FreeNAS, shouldn't it?

But only having 4 disks is very, very slow, especially for a database. It all comes down to how much data is in there and whether it can be cached on the RDBMS host, or whether you need to reread it over and over. If it fits in cache, the system can run just fine.

Do I understand you correctly, that ZFS is NOT used in your storage stack with GlusterFS?
 
I use GlusterFS in my home lab.

-: you can't natively put containers on this storage.
-: poor performance (-50%?)

+: very simple to create and manage volumes.
 
Current design is on a gig network and I'm looking to move to 10gig

What I was thinking of doing is creating a new cluster with the 10gig-capable hardware, splitting the drives from the 2 systems between the 3 servers: 4x 4TB in each server in RAID10 (via the onboard RAID controller), and redistributing the SATA DOMs so each system has one as a boot drive. I'm not sure if the Supermicro onboard RAID supports hot spares (new to the platform), so that would leave 4 drives for hot/cold spares and 1 spare SATA DOM, which is fine.

Networking would be fed into an Allied Telesis 10Gig switch via the onboard NICs.

PVE would be loaded on the SATA DOMs and I was thinking about using GlusterFS on the RAID10 in a new cluster.

Hi @cathode

Maybe you are willing to test something like this (LizardFS instead of GlusterFS):
- you will need 2 decent SSDs (128-250 GB) to hold the metadata (master and shadow)
- in total you will use 4 HDDs x 3 servers = 12 chunk servers
- for your DB VMs you could create a shared folder with a 10+2 goal (any vblock write will be like raidz2/RAID6, using 10 disks for data and 2 disks for parity; this is only an example, see the sketch after this list)
- this will outperform GlusterFS (for any IO: read/write/IOPS)
- this shared folder can be used in PMX just like Gluster
- maybe you can create different shared folders for different kinds of usage (with different parity/distribution settings), like: ISO install images, backups, and so on
- you can also have something like a recycle bin if that is useful for you!
- and you can do live migration just like in the GlusterFS case
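A rough sketch of how that 10+2 goal could be defined (the goal ID, name and paths are only examples; check the LizardFS docs for your version):

Code:
# /etc/mfs/mfsgoals.cfg on the master: erasure-coded goal, 10 data parts + 2 parity parts
11 ec_10_2 : $ec(10,2)

# apply it recursively to the directory that will hold the DB images
lizardfs setgoal -r ec_10_2 /mnt/lizardfs/db-images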

In my own case with the same/equivalent config (mirror with 3 members), LizardFS was very fast for any storage operation compared with GlusterFS on a dedicated 1 Gbit network! I use an SNMP graphing system (LibreNMS, with MariaDB) for many devices (many 24-port switches), and with GlusterFS it took many tens of seconds until I was able to see the graphs for a 24-port switch .... with LizardFS it takes less than a second!
 
Totally agree with @Marshalleq, SSD at least for SLOG if you plan to use ZFS or CEPH.



The GlusterFS solution should be simpler to use than the replicated FreeNAS, shouldn't it?

But only having 4 disks is very, very slow, especially for a database. It all comes down to how much data is in there and whether it can be cached on the RDBMS host, or whether you need to reread it over and over. If it fits in cache, the system can run just fine.

Do I understand you correctly, that ZFS is NOT used in your storage stack with GlusterFS?

Thanks for the reply; maybe I wasn't clear. I'm currently using FreeNAS (4x 2TB disks in a striped mirror) for about 30 VMs across 6 nodes; the VMs are about 90% Linux, with Windows making up the remaining 10%. The issue with this setup is that the single point of failure is the FreeNAS box, and if that dies it means a restore from backup, which is on another box. I work for an EDU, so some downtime can be dealt with, i.e. a max of 1 hour.

The idea I had was simply to implement a new FreeNAS array but have it mirrored, using replication to a second identically configured box. Any downtime because of a failure of the main storage array would simply be resolved by connecting to the second storage server while the first one was rebuilt. That one would then become the primary; once the issue with the first storage server was resolved, it would be re-added and reconfigured as the copy, with data being replicated over a dedicated 10Gbps link.
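Under the hood that replication would just be incremental ZFS send/receive; something like the below run on a schedule is roughly what a FreeNAS replication task does (the pool/dataset names and the standby host are made up):

Code:
# snapshot the VM dataset and send the increment since the last replicated snapshot
SNAP="repl-$(date +%Y%m%d%H%M)"
zfs snapshot tank/vms@$SNAP
# "repl-previous" stands for the last snapshot already present on the standby box
zfs send -i tank/vms@repl-previous tank/vms@$SNAP | ssh standby-nas zfs receive -F tank/vms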

Using Gluster would alleviate a lot of manual intervention here, but it may come with performance issues that are too steep to realistically accept. But given that about 3 months ago we added the 6th compute node, which has 10Gbps onboard, the question I had was basically: do I set up a new cluster with the 10Gbps-capable nodes and run hyperconverged, or would I be better off running a more traditional setup and using the existing cluster, on the understanding that the 1Gbps network is a bottleneck?
 
Hi @cathode

Maybe you are willing to test something like this (LizardFS instead of GlusterFS):
- you will need 2 decent SSDs (128-250 GB) to hold the metadata (master and shadow)
- in total you will use 4 HDDs x 3 servers = 12 chunk servers
- for your DB VMs you could create a shared folder with a 10+2 goal (any vblock write will be like raidz2/RAID6, using 10 disks for data and 2 disks for parity; this is only an example)
- this will outperform GlusterFS (for any IO: read/write/IOPS)
- this shared folder can be used in PMX just like Gluster
- maybe you can create different shared folders for different kinds of usage (with different parity/distribution settings), like: ISO install images, backups, and so on
- you can also have something like a recycle bin if that is useful for you!
- and you can do live migration just like in the GlusterFS case

In my own case with the same/equivalent config (mirror with 3 members), LizardFS was very fast for any storage operation compared with GlusterFS on a dedicated 1 Gbit network! I use an SNMP graphing system (LibreNMS, with MariaDB) for many devices (many 24-port switches), and with GlusterFS it took many tens of seconds until I was able to see the graphs for a 24-port switch .... with LizardFS it takes less than a second!

Thanks for the reply. I have briefly looked at LizardFS; the issue I had with it was that all the data I could find on it mentioned testing or development use, with little or nothing concrete on how it performs in production. At least with Gluster it's widely used and documentation is plentiful. But I take your point and will look into it further.
 
I'm currently using FreeNAS


Even on FreeNAS, as far as I can see on Google, you can use HAST + ucarp to achieve what you need (fail-over storage). Basically, if I understood correctly, you would use only the primary HAST node during normal operation (via the HAST VIP / cluster IP). When the primary is down, the secondary becomes master. HAST is like RAID1 over Ethernet in real time. But like any HA system with only 2 nodes, you can end up in a split-brain scenario .... !!!
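For reference, a minimal HAST resource definition looks roughly like this (hostnames, addresses and the disk device are placeholders); ucarp/CARP then moves the shared IP between the two boxes:

Code:
# /etc/hast.conf on both storage boxes
resource vmdata {
        on storage1 {
                local /dev/da1
                remote 10.0.0.2
        }
        on storage2 {
                local /dev/da1
                remote 10.0.0.1
        }
}

# initialise and start the resource on both nodes, then promote one side
hastctl create vmdata
service hastd onestart
hastctl role primary vmdata    # use "secondary" on the other box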
 
