Small Business Cluster with 2 New Servers and QDevice

Clint84

New Member
Nov 12, 2024
BACKGROUND
I work for a small business that provides a 24-hour service on our servers and requires as close to 100% uptime as possible. Several years ago our old IT company sold us 2 identical Dell R420 servers (single 6-core processor, 4x 3.5" 600GB 10K SAS HDDs in RAID10, 16GB RAM, Server 2012 (not R2)) that were ostensibly meant to host 3 VMs. Virtualization was never implemented, though; everything was installed bare metal, with the DC and AD sharing the box with our core business services. That makes it difficult to move things into VMs now, since we can't take the downtime needed to completely reconfigure the server.

Our core service includes a program on the alternate server called Cluster, which monitors the service on Server 1 for database changes and duplicates them. When the service on Server 1 becomes unreachable, it stops Cluster and launches the database and services on Server 2. This process can take up to 10 minutes, so any information coming into our core services during that window is dropped and lost forever. Another problem/inconvenience is that we then have to manually start Cluster on Server 1 once it is back online before we can fail back over to Server 1 if Server 2 has a fault.

GOAL
To have a High Availability Cluster where the VMs running our core services can automatically migrate between hosts with as little downtime as possible. We will still have a second VM in High Availability running the Cluster service so that we can migrate the services to an alternate VM while doing updates to the VM OS.

We are finally looking to upgrade our servers to something high-quality and future-proof. We do not have the budget for more than a couple of servers, so we are looking at getting two identical R660 or R760 servers running DDR5 RAM and 10x 3.2TB NVMe drives in RAIDZ2 (Z3?) with 10Gb/25Gb networking. I read up on Ceph and ZFS with replication and came to the conclusion that we won't have enough hardware for Ceph, even if we splurged on another server (the consensus seems to be a minimum of 4-5 Ceph nodes). Therefore, the goal at this time is to use ZFS replication at 1-minute intervals for our core services VMs and ~10 or 15 minutes for the other VMs. For a 3rd node to achieve quorum, we will run a QDevice on another piece of hardware. Will this scenario work for high availability with ZFS replication between only two nodes?
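
For reference, the replication schedule I have in mind would look something like this from the CLI, if I understand pvesr correctly (VM IDs and the target node name are just placeholders):

# core services VM, replicate to the second node every minute
pvesr create-local-job 100-0 pve2 --schedule '*/1'
# less critical VM, every 15 minutes
pvesr create-local-job 101-0 pve2 --schedule '*/15'
# check job status
pvesr list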

We also have an old R720 that we came into possession of, with 8x 10TB SATA drives, that we were planning to run TrueNAS Scale on as a storage server for our office workstations and also as a backup target for Proxmox snapshots. We would then back up the snapshots and workstation shares to a cloud backup service. We would probably run the QDevice as a VM in TrueNAS Scale. I don't believe that the 8 SATA disks in RAIDZ2 would provide the bandwidth needed to serve as shared storage for the VMs.

Is there a better way to configure this setup without having to buy more servers? Could I technically use my old servers as part of the quorum and then use Ceph, even though they don't have an identical storage config? I could change the RAID card to IT mode and possibly get 4 SAS or SATA SSDs to speed up the storage.
 
First, please take my words with a grain of salt: I'm a system administrator, but not involved with virtualization professionally at the moment; I only use Proxmox in my homelab.
Clint84 said:
We are finally looking to upgrade our servers to something high-quality and future-proof. We do not have the budget for more than a couple of servers, so we are looking at getting two identical R660 or R760 servers running DDR5 RAM and 10x 3.2TB NVMe drives in RAIDZ2 (Z3?) with 10Gb/25Gb networking. I read up on Ceph and ZFS with replication and came to the conclusion that we won't have enough hardware for Ceph, even if we splurged on another server (the consensus seems to be a minimum of 4-5 Ceph nodes). Therefore, the goal at this time is to use ZFS replication at 1-minute intervals for our core services VMs and ~10 or 15 minutes for the other VMs. For a 3rd node to achieve quorum, we will run a QDevice on another piece of hardware. Will this scenario work for high availability with ZFS replication between only two nodes?

If you can live with the implied data loss (of 1-15 minutes), yes, this should work. The manual recommends setting up a dedicated cluster network for corosync, though, so a hiccup in your replication or main network doesn't impact your cluster. It might also be a good idea to add a redundant link in case the first one breaks: https://pve.proxmox.com/wiki/Cluster_Manager#_cluster_network
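
As a rough sketch of that part (addresses are made up, please double-check against the manual), the cluster with two corosync links plus an external QDevice would be created roughly like this:

# on the first node: link0 = dedicated corosync network, link1 = redundant link on the main network
pvecm create prodcluster --link0 10.10.10.1 --link1 192.168.1.11
# on the second node
pvecm add 10.10.10.1 --link0 10.10.10.2 --link1 192.168.1.12
# on the external QDevice host (Debian-based)
apt install corosync-qnetd
# on both cluster nodes, then run the setup once from one of them
apt install corosync-qdevice
pvecm qdevice setup 192.168.1.20
pvecm status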

I also remember consultants reporting that they set up their customers' clusters with a dedicated node-to-node network connection for the ZFS replication, for two reasons:
  • A problem with the regular network doesn't impact replication
  • Since only your PVE nodes are part of this network, you can disable encryption for the transport, resulting in higher transfer performance
However, this means you would need another network adapter for the corosync network (1 Gb should be enough, if I recall the discussions in this forum on such setups correctly), so you would have at least three networks:
  • Regular network so clients can connect to the servers and the servers to the NAS
  • Corosync network for cluster communication
  • Storage network for replication
I'm not sure which one should get the 10 Gb or 25 Gb network card; I hope somebody with actual professional experience can add this information.
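
Just to illustrate the three-network idea, /etc/network/interfaces on one node could look roughly like this (interface names and addresses are invented); as far as I know the replication traffic can then be pinned to the storage network via the migration setting in /etc/pve/datacenter.cfg:

# 1 Gb onboard port: dedicated corosync network
auto eno1
iface eno1 inet static
    address 10.10.10.1/24

# fast NIC port 1: storage/replication network
auto enp65s0f0
iface enp65s0f0 inet static
    address 10.10.20.1/24

# fast NIC port 2: bridge for VMs, clients and the NAS
auto vmbr0
iface vmbr0 inet static
    address 192.168.1.11/24
    gateway 192.168.1.1
    bridge-ports enp65s0f1
    bridge-stp off
    bridge-fd 0

# /etc/pve/datacenter.cfg (migration/replication traffic on the storage network)
migration: secure,network=10.10.20.0/24
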
Concerning RAIDZ: I remember discussions where participants (drawing on their experience in enterprise setups) recommended against using the ZFS RAIDZ levels for performance reasons; ZFS mirrors should be used instead.
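
For ten NVMe drives that would mean a pool of five striped mirrors (what people usually call ZFS "RAID10"), created roughly like this (device names are examples; in practice use the /dev/disk/by-id paths):

zpool create -o ashift=12 vmpool \
  mirror nvme0n1 nvme1n1 \
  mirror nvme2n1 nvme3n1 \
  mirror nvme4n1 nvme5n1 \
  mirror nvme6n1 nvme7n1 \
  mirror nvme8n1 nvme9n1
# register it as VM storage in PVE
pvesm add zfspool vm-zfs --pool vmpool --content images,rootdir

You trade capacity (half the raw space instead of ~80% with RAIDZ2) for much better random I/O, since every mirror vdev adds IOPS while a RAIDZ vdev behaves roughly like a single disk for random writes.
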
Clint84 said:
We also have an old R720 that we came into possession of, with 8x 10TB SATA drives, that we were planning to run TrueNAS Scale on as a storage server for our office workstations and also as a backup target for Proxmox snapshots. We would then back up the snapshots and workstation shares to a cloud backup service. We would probably run the QDevice as a VM in TrueNAS Scale. I don't believe that the 8 SATA disks in RAIDZ2 would provide the bandwidth needed to serve as shared storage for the VMs.

You could also set up a Proxmox Backup Server VM on the NAS and use that VM for the QDevice too. Unlike the cluster nodes, the QDevice doesn't need to be part of the dedicated corosync network.
For the storage configuration I would set up a ZFS mirror of two SATA disks for the PBS VM, together with two SSDs or NVMes as a special device mirror. The reason is that PBS splits the data into a lot of small files (chunks) to work its deduplication magic. Garbage collection and backup verify jobs tend to be slow on spinning disks, which is why the manual recommends enterprise SSDs as PBS datastores. But if you don't have the budget for two large enough SSDs, a special device mirror together with an HDD mirror as the PBS datastore will perform better than spinning disks alone. The remaining SATA drives could then be used as storage for the office workstations. Concerning bandwidth, this could work with NFS and a fast enough network link, but I fear the HDDs will be too slow for this. And as said, ZFS RAIDZ gets bad comments in this forum concerning its performance (it might of course still be good enough for your use case).
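
A minimal sketch of that layout (device names, pool and datastore names are just examples):

# two HDDs as the data mirror, two SSDs as the special device mirror
zpool create -o ashift=12 backup \
  mirror /dev/sda /dev/sdb \
  special mirror /dev/sde /dev/sdf
zfs create backup/pbs
# register the dataset as a PBS datastore
proxmox-backup-manager datastore create main /backup/pbs
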
One caveat though: at the moment PBS doesn't officially support cloud storage, so you would have to consider whether to rent a vserver (an additional expense) for this purpose or use a dedicated PBS hosting service like Tuxis:
https://www.tuxis.nl/en/proxmox-backup-server/
I don't know whether they offer their service outside the EU (maybe through resellers?) though.

The other alternative, of course, would be to use the normal backup function of Proxmox VE to save vzdump files to the NAS and copy those to the cloud.
I would still set up a PBS though, since its deduplication allows you to keep more local backups and it supports file-level restore.
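
The vzdump route is as simple as something like this (VM ID and storage name are placeholders); normally you would just configure the equivalent as a scheduled backup job under Datacenter -> Backup:

vzdump 100 --storage nas-backup --mode snapshot --compress zstd
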
Clint84 said:
Is there a better way to configure this setup without having to buy more servers? Could I technically use my old servers as part of the quorum and then use Ceph, even though they don't have an identical storage config? I could change the RAID card to IT mode and possibly get 4 SAS or SATA SSDs to speed up the storage.

This would probably work, but since I have no real idea how to configure and size Ceph I hope somebody more competent will answer this ;)
Another possibility might be to use one or both of the old servers as a Proxmox Backup Server (see above), maybe at an offsite location? You would then use your NAS with a PBS VM as your primary backup target and one of the old servers at an offsite location as the offsite backup. Or use one old server at your company as the primary backup (so you can separate the NAS from your backup server, which might be a good idea in case the office workstations and NAS get compromised) and the second one as an offsite target.

PBS allows syncing from one PBS instance to another, which is quite useful for ransomware protection:
https://pbs.proxmox.com/docs/storage.html#ransomware-protection-recovery
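
On the second (pulling) PBS this boils down to defining the first PBS as a remote and adding a pull sync job, roughly like this (host, names and credentials are placeholders):

proxmox-backup-manager remote create office-pbs \
  --host 192.168.1.50 --auth-id 'sync@pbs' \
  --password 'xxxxx' --fingerprint '<fingerprint of the office PBS>'
proxmox-backup-manager sync-job create pull-office \
  --remote office-pbs --remote-store main --store offsite \
  --schedule daily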

In my homelab (yes, I know) this is how I have it set up:
- A two-node cluster on two mini PCs with ZFS storage, syncing my VMs and containers (without a dedicated network; it's still a homelab and I wouldn't do this in a professional environment, even a small business one)
- TrueNAS Scale on another mini PC with a combined PBS/QDevice VM
- A vserver with PBS installed, which pulls the new backups from the TrueNAS PBS every day.

Permissions look like this:
  • The Proxmox nodes are allowed to create backups on the TrueNAS PBS. They are not allowed to remove backups, only to add new ones.
  • The Proxmox nodes and the NAS PBS are allowed to restore or pull backups from the vserver, but not to modify them.
  • The vserver PBS is allowed to pull backups from the TrueNAS PBS and nothing else.
  • I have an external USB disk to which the NAS PBS regularly syncs its data as cold storage.
Thus, even if an attacker managed to compromise part of my setup, in theory there should still be at least one host that wasn't compromised and therefore still has backups from before the attack to restore everything from.
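
The matching PBS side of those permissions is just a couple of users and ACLs, roughly like this (user and datastore names are placeholders, please check the exact role descriptions in the PBS docs):

# backup-only user for the PVE nodes: may add new backups, but not remove existing ones
proxmox-backup-manager user create nodes@pbs
proxmox-backup-manager acl update /datastore/main DatastoreBackup --auth-id nodes@pbs
# read-only user that the pulling PBS uses for its sync job
proxmox-backup-manager user create sync@pbs
proxmox-backup-manager acl update /datastore/main DatastoreReader --auth-id sync@pbs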

If I had to set up such a thing in a professional environment, I would do these things differently:
  • Enterprise SSDs/NVMes as storage in a ZFS mirror (at the moment my VMs are on single NVMes)
  • Dedicated networks for corosync and ZFS replication
  • A dedicated server with an HDD mirror plus SSD special device ZFS setup as offsite backup of the NAS and PBS
  • The office PBS on its own server, separate from the NAS
  • Of course professional hardware instead of mini PCs ;)


You might notice that I'm a big PBS fan, so please take this with a grain of salt (homelab versus company, etc.). If a different approach fits your use case/budget better, you should follow your own thinking instead of my ramblings.

Hope this helps and best regards, Johannes.
 
Johannes S said:
I also remember consultants reporting that they set up their customers' clusters with a dedicated node-to-node network connection for the ZFS replication, for two reasons:
  • A problem with the regular network doesn't impact replication
  • Since only your PVE nodes are part of this network, you can disable encryption for the transport, resulting in higher transfer performance
However, this means you would need another network adapter for the corosync network (1 Gb should be enough, if I recall the discussions in this forum on such setups correctly), so you would have at least three networks:
  • Regular network so clients can connect to the servers and the servers to the NAS
  • Corosync network for cluster communication
  • Storage network for replication
I'm not sure which one should get the 10 Gb or 25 Gb network card; I hope somebody with actual professional experience can add this information.
The servers we are looking at purchasing have 2x 1Gb Ethernet LOM (LAN on Motherboard) and we were spec'ing an Intel E810 dual-port NIC that is 10/25Gb capable (my understanding is that 10Gb SFP+ is interchangeable with 25Gb SFP28). We were planning on using 10Gb aggregation switches and aggregating the links. What I can do instead is create 2 new VLANs on our network switches for Storage and Corosync (we have a full Unifi setup in our rack with 10Gb SFP interconnecting our switches) and use one 10Gb SFP+ link for the Storage network, one 10Gb SFP+ link for the main client network, and one 1Gb Ethernet link for the Corosync network. If needed I can get a small managed 8-port switch for the Corosync network; otherwise it would just sit on one of our main switches in its own VLAN. We haven't bought the aggregation switches yet since our old servers weren't 10Gb capable, so if it would be better we can go to 25Gb SFP28 aggregation switches, but they are $900 vs $300 for the 10Gb ones, so I would only want to do that if absolutely necessary.
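
So per node the VLAN variant of the three-network layout would end up looking roughly like this in /etc/network/interfaces (interface names, VLAN IDs and addresses are placeholders for illustration):

# SFP+ port 1: bridge for the main/client network
auto vmbr0
iface vmbr0 inet static
    address 192.168.1.21/24
    gateway 192.168.1.1
    bridge-ports enp2s0f0
    bridge-stp off
    bridge-fd 0

# SFP+ port 2, VLAN 20: storage/replication network
auto enp2s0f1.20
iface enp2s0f1.20 inet static
    address 10.20.0.21/24

# 1 Gb LOM for corosync (untagged in its own VLAN, or on a small dedicated switch)
auto eno1
iface eno1 inet static
    address 10.30.0.21/24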

Johannes S said:
Concerning RAIDZ: I remember discussions where participants (drawing on their experience in enterprise setups) recommended against using the ZFS RAIDZ levels for performance reasons; ZFS mirrors should be used instead.
So if I have 10x 3.2TB NVMe disks for running the VMs locally on each machine, you do not recommend ZFS RAIDZ3? Doesn't ZFS RAIDZ3 have better performance along with the ability to lose 3 drives? I use TrueNAS in my homelab (hyperconverged under Proxmox - I haven't played with clusters yet, but plan to add some mini PCs in the future for this), and while I am not super familiar with the ins and outs of ZFS, it was my understanding that RAIDZ2 or RAIDZ3 was the more performant method. If I did mirrors as you suggest, wouldn't that make 5 ZFS mirrored pools, and I would have to allocate VMs between them? In case there is confusion about this array also being my boot media: I forgot to mention that the Proxmox hypervisor will be installed on a Dell BOSS-N1 480GB mirrored RAID using the ext4 format. The ZFS storage would be solely for VM storage.

Edit - My mind completely spaced on the option of RAID-10; the term "ZFS mirrors" only brought RAID-1 to mind. Doing further research on ZFS for Proxmox, I then read that RAID-10 is the more favorable layout for VMs.

Johannes S said:
You might notice that I'm a big PBS fan, so please take this with a grain of salt (homelab versus company, etc.). If a different approach fits your use case/budget better, you should follow your own thinking instead of my ramblings.
I have looked at PBS but wasn't sure if it required the subscription or if it is free to use like Proxmox, and I haven't had a chance to dive in. I was considering doing a PBS VM on the TrueNAS box like you mentioned, although I was thinking of copying the backups to a couple of mirrored NVMe drives and then rsyncing them to the spinning-rust pool for long-term storage; I hadn't considered the special device route. I was also debating just using the built-in backup function of Proxmox to save VM copies to a shared storage drive. We use Acronis Cyber Protect cloud backup for our current servers, and I was going to look at the best way to sync our storage server to that once the backups were made, but I'm also looking into the possibility of switching to a better-known solution that connects directly to Proxmox, such as Veeam. Then we could do live restores or file-level restores. I read that most people who have tried both PBS and Veeam prefer the added features of Veeam. We don't have an offsite location to set up a physical server for offsite backups, otherwise that option sounds much better. I'm not going to pay the premium of running enterprise-grade equipment at my home (been there and done that with my entry into homelabbing), and nobody else in the company can be relied on for it. Although, if I could get symmetrical fiber run to my house on the company dime (current service maxes out at 10Mbps), plus a stipend for the increased electricity and internet costs, I might rethink it :p.

This has all been very informative, and I hope someone can chime in on the possibility of using my old servers to create a 4-node Ceph cluster (2 nodes with 10x 3.2TB NVMe, 2 nodes with 4x 3.7TB SAS SSDs) plus an extra QDevice for quorum. We would probably never actually fail the VMs over to the old servers except in a dire situation, and only for the core services VM (although my understanding is that the hardware is unsupported by Windows Server 2022+). We would just be using them to add Ceph storage redundancy. However this gets set up now is how it will run until our next major overhaul, probably in about 7-8 years, so I want to pick the option that will carry us for a while. If we did go Ceph with our old servers, I obviously don't expect them to survive a full 7-8 years, but hopefully we can get a couple of near-matching servers in 3-4 years at a steep discount and cycle them in to replace the aged R420s.
 
