3 node cluster setup recommendations

Magneto

Well-Known Member
Jul 30, 2017
Hi,

I have read just about every post on this forum on this topic. Some contain useful information, but I couldn't find answers to all of my questions, hence this post.

I need to set up a 3-node highly available cluster to host 14 virtual machines, Linux and Windows based. This is to replace the old physical servers that each of these VMs currently runs on.

One of the virtual machines would be a very busy Linux mail server, another a Windows MS SQL Server, and a third a VoIP server.

The target setup should deliver excellent performance and high availability.

The planned config is as follows:
3x Supermicro 2U, 8-bay servers, each with:
  • Dual 12-core Intel Xeon E5-2650 v4 CPUs
  • 128GB ECC RAM
  • 2x 1.2TB Intel S3520 SSDs
  • 4x 8TB SATA drives
  • 4-port 10GBase-T NIC (Intel XL710 + X557)
  • Supermicro 128GB MLC SATA3 DOM (SATA DOM with hook)
A 4th server with the same config will be placed in another building for backups; a 600Mbps wireless link will connect the two buildings.

  1. I plan on installing Proxmox on the Supermicro MLC SATA DOM (Disk On Module) [1], but I don't know whether this MLC device will last very long. The advantage of using it is that I don't waste a drive bay on the OS.
  2. The two SSDs will be used for the L2ARC and ZIL, though I think an Intel 7310 would be better?
  3. Then I want to set up ZFS on the drives, either RAIDZ2 or two mirrors across the SATA drives.
  4. On top of ZFS, I want to run GlusterFS or Ceph so the virtual machines get high availability.
  5. Lastly, I have been thinking about running CAT 6a cables directly between the servers' NICs for the storage network. The 10GbE NICs have 4 ports, so I could run a cable from Server1/Port1 to Server2/Port2, another from Server1/Port3 to Server3/Port1, and another from Server2/Port1 to Server3/Port2. I'm not sure if this is possible, but if it is, I could eliminate two very expensive 10GbE switches.

Any comments or recommendations would be appreciated.
[1] https://www.supermicro.com/products/nfo/SATADOM.cfm
 
1. The SATA DOM MLC device should be OK, so long as you don't use it for VM storage.
2. Seems like overkill in terms of capacity; remember that L2ARC has overhead (its index eats into RAM). Maybe partition the drives as follows (see the sketch at the end of this post):
- 2x 10GB mirror - SLOG
- 2x 100GB stripe - L2ARC
- 2x remaining capacity, mirrored - pure SSD pool
3. Two mirrors will outperform RAIDZ2.
4. I would not layer Ceph on top of ZFS; it complicates caching, among other things. Doing so is recommended with Gluster, but Ceph replaces a lot of ZFS's functionality and works on completely different principles. I would suggest building your infrastructure in a nested Proxmox lab server first, so you can observe how it behaves.
5. Have a look at the following wiki article on full mesh networks:
https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server
Method 1 is very simple: each node has a cable running directly to every other node (see the sketch below).
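
To make points 2, 3 and 5 a bit more concrete, two rough sketches follow. They are illustrations only: the pool names, device names, partition layout and IP addresses are assumptions, so adapt them to the real hardware (and prefer /dev/disk/by-id paths in production).

A possible ZFS layout, with the four 8TB SATA drives as two mirrored vdevs (point 3) and the two SSDs partitioned as suggested in point 2:

Code:
# assumed devices: sdc..sdf = 8TB SATA drives, sda/sdb = the two SSDs,
# already partitioned as 10GB (1), 100GB (2) and the remainder (3)
zpool create -o ashift=12 tank \
    mirror /dev/sdc /dev/sdd \
    mirror /dev/sde /dev/sdf

zpool add tank log mirror /dev/sda1 /dev/sdb1   # mirrored SLOG
zpool add tank cache /dev/sda2 /dev/sdb2        # L2ARC (cache devices are always striped)

# remaining SSD space as a separate all-flash pool for latency-sensitive guests
zpool create -o ashift=12 flash mirror /dev/sda3 /dev/sdb3

And the idea behind the switchless full mesh from the wiki article, shown for node 1 of three; nodes 2 and 3 get the mirror-image config, so each pair of nodes talks over its own dedicated cable and no 10GbE switch is needed:

Code:
# /etc/network/interfaces fragment on node 1 (storage address 10.15.15.1)
auto eno1
iface eno1 inet static
    address 10.15.15.1/24
    # direct cable to node 2
    up   ip route add 10.15.15.2/32 dev eno1
    down ip route del 10.15.15.2/32

auto eno2
iface eno2 inet static
    address 10.15.15.1/24
    # direct cable to node 3
    up   ip route add 10.15.15.3/32 dev eno2
    down ip route del 10.15.15.3/32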
 
Thanx for the response.

Why would you not run CEPH on top of ZFS?

Another option I have considered, due to the limited hard drive bays, is to use LVM instead, and then use CEPH or GlusterFS on top of it.
 
Hello,

Some thoughts :
- with a busy mail server, GlusterFS is not so good with LXC containers (with IMAP)
- it would be better to split your mail server into 2 different KVM guests (NFS for inbox storage)
- for a mail server and an SQL server, L2ARC is not that useful (I have run mail servers on top of ZFS for many years)
 
Thanx for the tips.

The current mail server runs postfix + clamav + spamassassin + custom scripts, and it will be kept this way since the sysadmins already know how to manage that setup.
The current MS SQL servers run on Windows 2010 and will probably be upgraded to Windows 2012 or 2016.
 
The setup for the 2 KVM guests will be the same as what you have now.
OK, maybe I'm missing something, but why split the mail server into 2 KVM guests? It sounds like unnecessary overhead.
 
No... you will split the actual load over 2 different guests (on 2 different Proxmox nodes). Think: 2 postfix instances = 2 separate queues. Which will be faster, one SpamAssassin scanner or two? And if one postfix instance gets overloaded, the second is still usable in the meantime.
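
To picture the split: inbound SMTP can be spread over the two guests with equal-priority MX records, for example (the hostnames and addresses below are placeholders, not your real ones):

Code:
; DNS zone fragment - two equal-priority MX hosts share the inbound load
example.com.       IN  MX  10  mx1.example.com.
example.com.       IN  MX  10  mx2.example.com.
mx1.example.com.   IN  A       192.0.2.11   ; KVM guest on node 1
mx2.example.com.   IN  A       192.0.2.12   ; KVM guest on node 2

Sending servers pick between equal-priority MX hosts more or less at random, so both postfix queues and both SpamAssassin instances share the work, and if one guest goes down the other keeps accepting mail.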
 
I'm currently building a 3-node cluster myself, except I purchased all used equipment for peanuts (Supermicro, dual 8-core E5-2670, 64GB DDR3).

For the PVE host drive itself, you can use an NVMe PCIe card, or a PCIe-to-M.2 adapter; the PCIe slot powers the drive. You can find them on Amazon (I purchased the Ableconn brand). The only SSDs I would consider are Samsung/Intel/Crucial.

I'm using LSI 9211-8i cards as my storage controllers with Samsung SSDs (no RAID).

I purchased cheap Intel dual-port 10Gb fiber adapters on eBay, and a Nexus 3048 switch to connect them. The Nexus 3048 has 4 SFP+ 10GbE slots; alternatively, I could have purchased a Nexus 5010 for more SFP+ ports. This all works great and was dirt cheap. You can't go wrong with Cisco and Intel gear.

Currently testing out Ceph BlueStore, and so far it performs very well. It is going to be overkill for my needs, even though most of the gear is 5+ years old.
 
Which filesystem are you using on the drives before you set up Ceph?
 
OK, so I just watched the video on installing Ceph on a 3-node cluster with 10GbE NICs, which is similar to what I want to do. What is unclear from the video is that the two storage (OSD?) drives seem to be unpartitioned. Does that mean Ceph creates its own partitions?

My question is this: if I wanted to use 4x 8TB or even 8x 4TB SATA HDDs for storage, would I gain anything by setting up a ZFS RAIDZ1 pool with SSD L2ARC and ZIL?
Or by using LVM to combine the drives into a single large volume and dm-cache to speed it up?
 
Yes, Ceph will create its own partitions, and no, those setups do not gain you anything performance-wise. Ceph supports putting the journal for each OSD on a fast (SSD) device anyway; use that if you want big spinners and fast journals.
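
For reference, a rough sketch of what that looks like with the Proxmox tooling. The device names are assumptions, and the exact syntax depends on the PVE/Ceph release (older versions use pveceph createosd with --journal_dev, newer ones pveceph osd create with --db_dev/--wal_dev), so check man pveceph on your system:

Code:
# assumed: /dev/sdb..sde are the 8TB spinners, /dev/sdf is an SSD set aside
# for the OSD journals/DBs; Ceph partitions the disks itself
pveceph createosd /dev/sdb --journal_dev /dev/sdf
pveceph createosd /dev/sdc --journal_dev /dev/sdf
pveceph createosd /dev/sdd --journal_dev /dev/sdf
pveceph createosd /dev/sde --journal_dev /dev/sdf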
 
Thanx. Have you compared this to a hardware RAID card + SSD cache, and to ZFS + L2ARC cache? I'd like to see what the performance differences are.

I don't have the hardware yet, so can't do this myself at this stage. Just getting as much info as I can before I buy.
 
Ceph performs better the more OSDs you give it - having one huge OSD with extra Caching and Journalling outside of Ceph is a lot worse than having lots of smaller OSDs and using Ceph's Caching and Journalling mechanisms. I would even say that 8TB disks are too big for a small Ceph cluster, because if they are even remotely full and you need to rebalance after a failed disk, you have to move too much data.
 
Hi,
I have set up Ceph as per the wiki, but it's still a bit unclear to me.

1. How do I specify how large an OSD is?
2. How do I combine the 4 single drives into a single large drive / partition?
3. So I understand that Ceph does a form of RAID over the network, is that correct? If so, is it mirrored RAID or striped RAID?
 
1. How do I specify how large an OSD is?

By default (and recommended), an OSD is a whole disk (most people use 2-4TB disks for non-SSD/NVMe OSDs).

2. How do I combine the 4 single drives into a single large drive / partition?

you don't. you create one OSD per disk ;)

3. So I understand that Ceph does a form of RAID over the network, is that correct? If so, is it mirrored RAID or striped RAID?

If you want to greatly oversimplify, you could see Ceph as "RAID over the network". It uses an algorithm called CRUSH to distribute data according to a rule set (which offers a lot of flexibility, but also a lot of potential to shoot yourself in the foot if you don't know what you are doing ;)). Data is stored as objects (RADOS), but you can export a collection of such objects as a regular block device (RBD).

I suggest checking out the Ceph documentation, which gives a good high-level overview: http://docs.ceph.com/docs/master/start/intro/
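
If it helps, here is a rough sketch of how the "replication instead of RAID" part looks in practice once one OSD exists per disk. The pool name, placement group count and image size are placeholder values:

Code:
ceph osd tree                         # shows the CRUSH hierarchy: root -> host -> osd
ceph osd pool create vmpool 128       # 128 placement groups (placeholder value)
ceph osd pool set vmpool size 3       # every object stored on 3 OSDs, on different hosts by default
ceph osd pool set vmpool min_size 2   # keep serving I/O while at least 2 copies remain
rbd create vmpool/disk-test --size 10240   # a ~10 GiB block device (RBD) carved from the pool
rbd ls vmpool

So the redundancy comes from whole-object replication governed by CRUSH, not from mirroring or striping at the block level the way RAID1/RAID0 do.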
 

Ok, so with 3x 8TB HDDs, how would you set up the OSDs for optimum performance?

I have 4x identical servers as follows: dual 12-core Xeon, 128GB RAM, 2x 300GB SSD for cache, 4x 8TB SATA for storage, 4-port 10GbE NIC.
 
