Small Business Cluster with 2 New Servers and QDevice

Clint84

New Member
Nov 12, 2024
BACKGROUND
I work for a small business that provides a 24-hour service on our servers and requires as close to 100% uptime as possible. Several years ago our old IT company sold us 2 identical Dell R420 servers, each with a single 6-core processor, 4x 3.5" 600GB 10K SAS HDDs in RAID 10, 16GB RAM, and Server 2012 (not R2). This was ostensibly to allow virtualization for 3 VMs, but they never implemented any of it and installed everything bare metal, with all services (including the DC and AD) sharing the box with our core business services. That makes it difficult to switch things to VMs now, since we can't take the downtime needed to completely reconfigure the servers. Our core service has a program that runs on the alternate server called Cluster, which monitors the service on Server 1 for any database changes and duplicates them; when the service on Server 1 becomes unreachable, it stops Cluster and launches the database and services on Server 2. This process can take up to 10 minutes, which means any information trying to come into our core services during that period is dropped and lost forever. Another problem/inconvenience is that we then need to manually start Cluster on Server 1 once it is back online before we can fail back over to Server 1 if Server 2 has a fault.

GOAL
To have a High Availability Cluster where the VMs running our core services can automatically migrate between hosts with as little downtime as possible. We will still have a second VM in High Availability running the Cluster service so that we can migrate the services to an alternate VM while doing updates to the VM OS.

We are finally looking to upgrade our servers to something high quality and future proof. We do not have the budget for more than a couple servers so we are looking at getting a couple identical R660 or R760 servers running DDR5 RAM and 10x 3.2TB NVME drives in RAID Z2 (Z3?) with 10Gb/25Gb network. I read up on CEPH and ZFS with replication and came to the conclusion that we won't have enough hardware for CEPH, even if we splurged on another server (consensus seems to be minimum 4-5 CEPH nodes). Therefore, the goal at this time is to use ZFS replication at 1 min. intervals for our core services VMs and ~10 or 15 minutes for other VMs. For a 3rd node to achieve quorum, we will run a QDevice on another piece of hardware. Will this scenario work for high availability with ZFS replication between only two nodes?
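From what I have read so far, the moving parts would boil down to roughly the following (just a sketch of my understanding; the node names, QDevice IP and VM IDs are placeholders, and the replication jobs can also be set up through the GUI):

```
# On the external QDevice host: apt install corosync-qnetd
# On both PVE nodes:
apt install corosync-qdevice
# From one cluster node, register the external vote:
pvecm qdevice setup 192.0.2.10

# Replicate the core-services VM (e.g. ID 100) to the other node every minute:
pvesr create-local-job 100-0 pve2 --schedule "*/1"
# Less critical VMs on a 15-minute schedule:
pvesr create-local-job 110-0 pve2 --schedule "*/15"
```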

We also have an old R720 that we came into possession of, with 8x 10TB SATA drives, that we were planning on running TrueNAS Scale on as a storage server for our office workstations and also as a backup target for Proxmox snapshots. We would then back up the snapshots and workstation shares to a cloud backup service. We would probably run the QDevice as a VM in TrueNAS Scale. I don't believe that the 8 SATA disks in RAIDZ2 would provide the bandwidth needed to be shared storage for the VMs.

Is there a better way to configure this setup without having to buy more servers? Could I technically use my old servers as part of the quorum and then use CEPH, even though they don't have an identical storage config? I could change the RAID card to IT mode and possibly get 4 SAS SSD or SATA SSD to speed up the storage.
 
First, please take my words with a grain of salt: I'm a system administrator, but I'm not involved with virtualization professionally at the moment and only use Proxmox in my homelab.
We are finally looking to upgrade our servers to something high quality and future proof. We do not have the budget for more than a couple servers so we are looking at getting a couple identical R660 or R760 servers running DDR5 RAM and 10x 3.2TB NVME drives in RAID Z2 (Z3?) with 10Gb/25Gb network. I read up on CEPH and ZFS with replication and came to the conclusion that we won't have enough hardware for CEPH, even if we splurged on another server (consensus seems to be minimum 4-5 CEPH nodes). Therefore, the goal at this time is to use ZFS replication at 1 min. intervals for our core services VMs and ~10 or 15 minutes for other VMs. For a 3rd node to achieve quorum, we will run a QDevice on another piece of hardware. Will this scenario work for high availability with ZFS replication between only two nodes?

If you can live with the implied data loss (of 1-15 minutes), yes, this should work. The manual recommends setting up a dedicated cluster network for corosync though, so a hiccup in your replication or main network doesn't impact your cluster. It might also be a good idea to add a redundant link in case the first one breaks: https://pve.proxmox.com/wiki/Cluster_Manager#_cluster_network
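For a brand-new cluster the second link can be given directly at creation time; roughly like this (addresses invented, and for an existing cluster you would instead add the link by editing /etc/pve/corosync.conf as described in the wiki):

```
# On the first node: create the cluster with two corosync links
pvecm create prod-cluster --link0 10.10.10.1 --link1 10.20.20.1

# On the second node: join and give its own addresses for both links
pvecm add 10.10.10.1 --link0 10.10.10.2 --link1 10.20.20.2
```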

I also remember that consultants reported that they set up their customers' clusters with a dedicated node-to-node network connection for the ZFS replication, for two reasons:
  • A problem with the regular network doesn't impact the replications
  • Since only your PVE nodes are part of this network, you can disable encryption for the transport resulting in a higher transfer performance
However, this would mean you need another network adapter for the corosync network (1Gb should be enough, if I recall the discussions in this forum on such setups correctly), so you would have at least three networks:
  • Regular network so clients can connect to the servers and the servers to the NAS
  • corosync network for cluster communication
  • storage network for replication
I'm not sure which one should get the 10Gb or 25Gb network card; I hope somebody with actual professional experience can add this information.
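Just to illustrate the idea, /etc/network/interfaces on each node could end up looking something like this (interface names and addresses are made up; only the client-facing port needs a bridge for the VMs):

```
auto lo
iface lo inet loopback

# 1Gb LOM port: dedicated corosync network
auto eno1
iface eno1 inet static
        address 10.10.10.1/24

# First SFP port: node-to-node storage/replication network
auto enp65s0f0
iface enp65s0f0 inet static
        address 10.20.20.1/24

# Second SFP port: regular network, bridged for the VMs and NAS access
auto vmbr0
iface vmbr0 inet static
        address 192.168.1.21/24
        gateway 192.168.1.1
        bridge-ports enp65s0f1
        bridge-stp off
        bridge-fd 0
```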
Concerning RAIDZ, I remember discussions where participants (from their experience in enterprise setups) recommended against using ZFS RAIDZ levels for performance reasons. Instead, ZFS mirrors should be used.
We also have an old R720 that we came into possession of, with 8x 10TB SATA drives, that we were planning on running TrueNAS Scale on as a storage server for our office workstations and also as a backup target for Proxmox snapshots. We would then back up the snapshots and workstation shares to a cloud backup service. We would probably run the QDevice as a VM in TrueNAS Scale. I don't believe that the 8 SATA disks in RAIDZ2 would provide the bandwidth needed to be shared storage for the VMs.

You could also set up a Proxmox Backup Server on the NAS and use this VM for the QDevice too. Unlike the cluster nodes, the QDevice doesn't need to be part of the dedicated corosync network.
For the storage configuration I would set up a ZFS mirror of two SATA drives for the PBS VM, together with two SSDs or NVMEs as a special device. The reason is that PBS splits the data into a lot of small files (chunks) to work its deduplication magic. Garbage collection and backup verify jobs tend to be slow on spinning disks; for this reason the manual recommends using enterprise SSDs as datastores for PBS. But if one doesn't have the budget for two large enough SSDs, using a special device mirror together with an HDD mirror as the PBS datastore will have better performance than just spinning disks. The remaining SATA devices could then be used as storage for the office workstations. Concerning the bandwidth, this could work with NFS and a fast enough network link, but I fear that the HDDs will be too slow for this. And as said, ZFS RAIDZ gets bad comments in this forum concerning its performance (of course it might still be good enough for your use case).
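As a rough sketch of that layout (device names invented; the special vdev has to be mirrored, because losing it means losing the whole pool):

```
# Two HDDs as a mirror plus two SSDs as a mirrored "special" vdev for metadata:
zpool create pbspool \
    mirror /dev/sda /dev/sdb \
    special mirror /dev/sdc /dev/sdd

# Create a dataset and point a PBS datastore at it:
zfs create pbspool/datastore
proxmox-backup-manager datastore create nas-backups /pbspool/datastore
```

Where exactly you run this depends on whether the disks are passed through to the PBS VM or managed by TrueNAS itself.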
One caveat though: at the moment PBS doesn't officially support cloud storage, so you would have to decide whether to rent a vserver (additional expense) for this purpose or use a dedicated PBS hosting service like Tuxis:
https://www.tuxis.nl/en/proxmox-backup-server/
I don't know whether they offer their service outside the EU (maybe with resellers?) though.

The other alternative of course would be to use the normal backup function of Proxmox VE to save vzdump files to the NAS and copy them to the cloud.
I would still set up a PBS though, since its deduplication allows you to keep more local backups and it supports file-level restore.
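For reference, the built-in way is basically a vzdump call against a storage you have added to PVE (storage name and VM IDs are placeholders; normally you would configure this as a scheduled backup job in the GUI instead):

```
# Snapshot-mode backup of VMs 101 and 102 to a storage called "nas-dump",
# so the guests keep running while the backup is taken:
vzdump 101 102 --storage nas-dump --mode snapshot --compress zstd
```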
Is there a better way to configure this setup without having to buy more servers? Could I technically use my old servers as part of the quorum and then use CEPH, even though they don't have an identical storage config? I could change the RAID card to IT mode and possibly get 4 SAS SSD or SATA SSD to speed up the storage.

This would probably work, but since I have no real idea how to configure and size Ceph I hope somebody more competent will answer this ;)
Another possibility might be to use one or both of the old servers as Proxmox Backup Servers (see above), maybe at an offsite location? So you would use your NAS with a PBS VM as your primary backup target and one of the old servers at an offsite location as offsite backup. Or use one old server at your company as the primary backup (so you can separate the NAS from your backup server, which might be a good idea in case the office workstations and NAS get compromised) and the second one as an offsite target.

PBS allows syncing from one PBS instance to another, which is quite useful for ransomware protection:
https://pbs.proxmox.com/docs/storage.html#ransomware-protection-recovery
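The sync itself is configured on the pulling PBS instance and is quite small; something along these lines (remote name, host, auth-id and fingerprint are placeholders):

```
# Define the primary (NAS) PBS as a remote on the offsite instance:
proxmox-backup-manager remote create office-pbs \
    --host 203.0.113.20 --auth-id sync@pbs \
    --password 'xxxxx' --fingerprint '64:d3:...'

# Pull its datastore into the local datastore "offsite" once a day:
proxmox-backup-manager sync-job create office-daily \
    --store offsite --remote office-pbs --remote-store nas-backups \
    --schedule daily
```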

In my homelab (yes I know) this is the way I have it set up:
- Two-node cluster on two mini PCs with ZFS storage replication for my VMs and containers (without a dedicated network; it's still a homelab and I wouldn't do this in a professional network, even a small business one)
- TrueNAS Scale on another mini PC with a combined PBS/QDevice VM
- A vserver with PBS installed, which pulls the new backups from the TrueNAS PBS every day.

Permissions are set up roughly like this:
  • The Proxmox nodes are allowed to do backups on the TrueNAS PBS. They are not allowed to remove backups, only to add new ones.
  • The Proxmox nodes and the NAS PBS are allowed to restore or pull backups from the vserver, but not to modify them.
  • The vserver PBS is allowed to pull backups from the TrueNAS PBS and nothing else.
  • I have an external USB disk to which the NAS PBS regularly syncs its data as cold storage.
Thus, even if an attacker could compromise part of my setup, in theory there should still be at least one host which wasn't compromised and which still has backups from before the attack to restore everything from.
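On the PBS side this separation is mostly a matter of users/API tokens and datastore ACLs; a minimal sketch (user and datastore names invented) could look like this:

```
# Account the PVE nodes use for backups; DatastoreBackup lets it create and
# restore its own backups but not administer the datastore:
proxmox-backup-manager user create pve-backup@pbs
proxmox-backup-manager acl update /datastore/nas-backups DatastoreBackup \
    --auth-id pve-backup@pbs

# Read-only account that the offsite PBS uses to pull backups:
proxmox-backup-manager user create puller@pbs
proxmox-backup-manager acl update /datastore/nas-backups DatastoreReader \
    --auth-id puller@pbs
```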

If I had to set up such a thing in a professional environment, I would do these things differently:
  • Enterprise SSDs/NVMEs as storage in a ZFS mirror (at the moment my VMs are on single NVMEs)
  • Dedicated networks for corosync and ZFS replication
  • A dedicated server with a HDD/SSD mirror/special device ZFS setup as offsite backup of the NAS and PBS
  • Office PBS on its own server, different from the NAS
  • Of course professional hardware instead of Mini-PCs ;)


You might notice that I'm a big PBS fan, so please take it with a grain of salt (homelab versus company etc.). If a different approach fits your use case/budget, you should follow your thinking instead of my ramblings.

Hope this helps and best regards, Johannes.
 
I also remember that consultants reported that they set up their customers' clusters with a dedicated node-to-node network connection for the ZFS replication, for two reasons:
  • A problem with the regular network doesn't impact the replications
  • Since only your PVE nodes are part of this network, you can disable encryption for the transport resulting in a higher transfer performance
However, this would mean you need another network adapter for the corosync network (1Gb should be enough, if I recall the discussions in this forum on such setups correctly), so you would have at least three networks:
  • Regular network so clients can connect to the servers and the servers to the NAS
  • corosync network for cluster communication
  • storage network for replication
I'm not sure which one should get the 10Gb or 25Gb network card; I hope somebody with actual professional experience can add this information.
The servers we are looking at purchasing have 2x 1Gb ethernet LOM (LAN on Mainboard) and we were spec'ing an Intel E810 Dual port NIC that is 10/25Gb capable (my understanding is 10Gb SFP+ is interchangeable with 25Gb SFP28). We were planning on using 10Gb aggregation switches and aggregating the links. What I can do instead is create 2 new VLANs on our network switches for Storage and Corosync (we have a full Unifi setup in our rack with 10Gb SFP interconnecting our switches) and use one 10Gb SFP+ link for the Storage network, one 10Gb SFP+ link for the Main network for clients, and one 1Gb ethernet link for the Corosync network. If needed I can get a small managed 8 port switch for the Corosync network, otherwise it would just be on one of our main switches and in its own VLAN. We haven't got the aggregation switches yet since our old servers weren't 10Gb capable, so if it would be better we can go to 25Gb SFP28 aggregation switches, but they are $900 vs the 10Gb at $300, so I would only want to do that if absolutely necessary.

Concerning RAIDZ, I remember discussions where participants (from their experience in enterprise setups) recommended against using ZFS RAIDZ levels for performance reasons. Instead, ZFS mirrors should be used.
So if I have 10x 3.2TB of NVME disks for locally running the VMs on each machine, you do not recommend a ZFS RAID Z3? Doesn't a ZFS RAID Z3 have better performance with the ability to lose 3 drives? I use TrueNAS in my Homelab (hyperconverged under Proxmox - I haven't played with clusters yet, but plan to add some mini pcs in the future for this) and while I am not super familiar with the ins and outs of ZFS, it was my understanding that doing RAIDZ2 or RAID Z3 was the more performant method. If I did mirrors as you suggest, wouldn't that make 5 ZFS mirrored pools and I would have to allocate VMs between them? In case there is confusion that this array is also my boot media, I forgot to mention that the Proxmox Hypervisor will be installed on a Dell BOSS-N1 480GB mirrored RAID using the ext4 format. The ZFS storage would solely be for VM storage.

Edit - My mind completely spaced the option of RAID-10, the term "ZFS Mirrors" only brought to mind RAID-1. I did read after this that RAID-10 is more favorable for VMs when I was doing further research on ZFS for Proxmox.

You might notice that I'm a big PBS fan, so please take it with a grain of salt (homelab versus company etc.). If a different approach fits your use case/budget, you should follow your thinking instead of my ramblings.
I have looked at PBS, wasn't sure if it required the subscription or if it was also free use like Proxmox, but haven't had a chance to dive in. I was considering doing a PBS VM on the TrueNAS box like you mentioned, although I was thinking copying it to a couple mirrored NVME, then rsync it to the spinning rust pool for long term storage, hadn't considered the special device route. I was also just debating using the built-in backup function of Proxmox to save VM copies to a shared storage drive. We use Acronis Cyber Protect cloud backup for our current servers and was going to look at the best way to sync our storage server to that once the backups were made, but also looking into the possibility of switching to a more well known solution that directly connects to Proxmox such as Veeam. Then we can do live restores or file level restores. I read that most people who have tried PBS and Veeam prefer the added features of Veeam. We don't have an offsite location to setup a physical server for offsite backups, otherwise that option sounds much better. I'm not going to pay the premium of running enterprise grade equipment at my home (been there and done that with my entry into homelabbing) and nobody else in the company can be reliable enough. Although, if I could get symmetrical fiber run to my house on the company dime (current service only provides max 10Mbps), and a stipend for the increased electrical cost and internet cost I may rethink it :p.

This has all been very informative and I hope someone can chime in on the possibility of using my old servers to create a 4 node cluster of CEPH (2 nodes with 10x 3.2TB NVME, 2 nodes with 4x 3.7TB SAS SSD) and an extra QDevice for Quorum. We would probably never actually fail the VMs to the old servers except in a dire situation, and only for the core services VM (although my understanding is the hardware is unsupported in Windows Server 2022+). We would just be using it to facilitate CEPH storage redundancy. However it gets setup now will be how it runs until our next major overhaul, probably in about 7-8 years, so I want to pick the option that will carry us for a while. If we did go CEPH with our old servers, I obviously don't expect them to survive a full 7-8 years, but hopefully we can get a couple more near matching servers in 3-4 years at a steep discount and cycle them in to replace the aged R420 servers.
 
The servers we are looking at purchasing have 2x 1Gb ethernet LOM (LAN on Mainboard) and we were spec'ing an Intel E810 Dual port NIC that is 10/25Gb capable (my understanding is 10Gb SFP+ is interchangeable with 25Gb SFP28). We were planning on using 10Gb aggregation switches and aggregating the links. What I can do instead is create 2 new VLANs on our network switches for Storage and Corosync (we have a full Unifi setup in our rack with 10Gb SFP interconnecting our switches) and use one 10Gb SFP+ link for the Storage network, one 10Gb SFP+ link for the Main network for clients, and one 1Gb ethernet link for the Corosync network. If needed I can get a small managed 8 port switch for the Corosync network, otherwise it would just be on one of our main switches and in its own VLAN. We haven't got the aggregation switches yet since our old servers weren't 10Gb capable, so if it would be better we can go to 25Gb SFP28 aggregation switches, but they are $900 vs the 10Gb at $300, so I would only want to do that if absolutely necessary.

Since I lack real-world experience I will leave this to the experts here. From my limited understanding you should be fine, but I remember some of the consultants here remarked that one needs to be careful with the design of the network setup, because a mistake can have severe implications for production performance and reliability.

So if I have 10x 3.2TB of NVME disks for locally running the VMs on each machine, you do not recommend a ZFS RAID Z3? Doesn't a ZFS RAID Z3 have better performance with the ability to lose 3 drives? I use TrueNAS in my Homelab (hyperconverged under Proxmox - I haven't played with clusters yet, but plan to add some mini pcs in the future for this) and while I am not super familiar with the ins and outs of ZFS, it was my understanding that doing RAIDZ2 or RAID Z3 was the more performant method. If I did mirrors as you suggest, wouldn't that make 5 ZFS mirrored pools and I would have to allocate VMs between them? In case there is confusion that this array is also my boot media, I forgot to mention that the Proxmox Hypervisor will be installed on a Dell BOSS-N1 480GB mirrored RAID using the ext4 format. The ZFS storage would solely be for VM storage.

As said before, I lack experience in a business environment, but I have read previous discussions in this forum. The consensus is that RAIDZ is bad for VM performance and failure safety compared to a ZFS mirror, and the added capacity isn't worth it. Somebody went to the trouble of doing a writeup on it: https://forum.proxmox.com/threads/t...age-efficiency-you-think-you-will-get.141128/
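To be concrete about the terminology: "ZFS mirrors" with ten disks usually means one pool striped over five two-way mirror vdevs (RAID10-style), not five separate pools, so you still allocate all VMs from a single storage. A rough sketch with made-up device names (ideally you would use /dev/disk/by-id paths):

```
# One pool, five mirror vdevs striped together:
zpool create -o ashift=12 vmpool \
    mirror nvme0n1 nvme1n1 \
    mirror nvme2n1 nvme3n1 \
    mirror nvme4n1 nvme5n1 \
    mirror nvme6n1 nvme7n1 \
    mirror nvme8n1 nvme9n1

# Register it as VM storage in Proxmox:
pvesm add zfspool local-vms --pool vmpool --content images,rootdir --sparse 1
```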
Edit - My mind completely spaced the option of RAID-10, the term "ZFS Mirrors" only brought to mind RAID-1. I did read after this that RAID-10 is more favorable for VMs when I was doing further research on ZFS for Proxmox.


I have looked at PBS, wasn't sure if it required the subscription or if it was also free use like Proxmox, but haven't had a chance to dive in.
Like Proxmox VE, it's free-to-use open source software with a nag screen. They have subscription plans for access to their enterprise repo and support: https://www.proxmox.com/en/proxmox-backup-server/pricing

I was considering doing a PBS VM on the TrueNAS box like you mentioned, although I was thinking copying it to a couple mirrored NVME, then rsync it to the spinning rust pool for long term storage, hadn't considered the special device route. I was also just debating using the built-in backup function of Proxmox to save VM copies to a shared storage drive. We use Acronis Cyber Protect cloud backup for our current servers and was going to look at the best way to sync our storage server to that once the backups were made, but also looking into the possibility of switching to a more well known solution that directly connects to Proxmox such as Veeam. Then we can do live restores or file level restores. I read that most people who have tried PBS and Veeam prefer the added features of Veeam.

It depends ;) At the moment Veeam's Proxmox support covers VMs, not LXC containers, and as far as I know the application-aware backup support isn't at the level of integration Veeam has for VMware. I know that some professionals here do a kind of mixed approach: PBS for VM and LXC backups, the Veeam agent inside the VMs for backups of SQL databases etc. Since they need fewer licences for Veeam, this can be quite cost-effective.
In your case you could do something similar: PBS for the VMs and Acronis for the application stuff.
The built-in vzdump-based backup works, but doesn't do things like deduplication or file-level restore the way PBS does. Again a mixed approach might be useful: PBS for easier restore of VMs and of single files on them in regular operation, and vzdump + Acronis/Veeam/whatever (which can also save the vzdump files of the built-in function) for emergencies (if the PBS gets broken), offsite backup and application-aware backups.
We don't have an offsite location to setup a physical server for offsite backups, otherwise that option sounds much better. I'm not going to pay the premium of running enterprise grade equipment at my home (been there and done that with my entry into homelabbing) and nobody else in the company can be reliable enough. Although, if I could get symmetrical fiber run to my house on the company dime (current service only provides max 10Mbps), and a stipend for the increased electrical cost and internet cost I may rethink it :p.

How do you do your offsite backups at the moment then? In case of fire or another emergency I would want to have a backup which can still be used. This is especially true if your company falls victim to an attack and the rest of the infrastructure (including the storage server) gets hacked.
This has all been very informative and I hope someone can chime in on the possibility of using my old servers to create a 4 node cluster of CEPH (2 nodes with 10x 3.2TB NVME, 2 nodes with 4x 3.7TB SAS SSD) and an extra QDevice for Quorum. We would probably never actually fail the VMs to the old servers except in a dire situation, and only for the core services VM (although my understanding is the hardware is unsupported in Windows Server 2022+). We would just be using it to facilitate CEPH storage redundancy. However it gets setup now will be how it runs until our next major overhaul, probably in about 7-8 years, so I want to pick the option that will carry us for a while. If we did go CEPH with our old servers, I obviously don't expect them to survive a full 7-8 years, but hopefully we can get a couple more near matching servers in 3-4 years at a steep discount and cycle them in to replace the aged R420 servers.
Sadly I don't know anything about Ceph, since it would be complete overkill for my homelab. Hopefully some of the Ceph experts and professionals might have an answer. Personally I would use at least one of your old servers to set up a backup server with its own dedicated access controls, tightly secured/hardened, so it stays available in case your production machines get compromised. PBS doesn't need a lot of resources, so the old hardware should be more than enough. You could still use the remaining three servers for a three-node PVE/Ceph cluster then.
Another option might be to create another two-node + QDevice cluster and use the remote-migrate feature of Proxmox for a migration in case of an emergency: https://forum.proxmox.com/threads/vm-migration-between-two-proxmox-clusters.139467/
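For completeness, cross-cluster migration currently looks roughly like this (still marked experimental in the docs; the VM ID, API token, host and target bridge/storage names are all placeholders, and a certificate fingerprint may also be required in the endpoint string):

```
# Run on the source cluster; pushes VM 100 to a node of the other cluster:
qm remote-migrate 100 100 \
    'apitoken=PVEAPIToken=root@pam!migrate=xxxxx,host=192.0.2.50' \
    --target-bridge vmbr0 --target-storage local-zfs --online
```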
Again: please take my at-best theoretical knowledge with a grain of salt; you are probably better off with advice from some of the professional users and consultants here ;)
 
I think I am on the path to re-thinking my whole process. The servers we are looking at are about $12-15K each depending on where we go; we are looking into current/last gen Dell refurbs with a 3 year warranty. Using some of the info from above, maybe I can do the following and still keep it within the same budget:

2x R6615 AMD EPYC 1U servers using Dell BOSS-N1 480GB mirrored (RAID1) boot drive for Proxmox, 128GB of DDR5 ECC RAM, Intel E810 Dual Port 10/25Gb SFP28 NIC, and dual port 1Gb LOM. - Clustered, running VMs from Shared Storage on the R7615 below.

1x R7615 AMD EPYC 2U server using Dell BOSS-N1 240GB mirrored (RAID1) boot drive for TrueNAS Scale, 12x 6.4TB NVME direct drives in ZFS RAID-10, 128GB of DDR5 ECC RAM, Intel E810 Dual Port 10/25Gb SFP28 NIC, and dual port 1Gb LOM. Running a VM on TrueNAS Scale for the QDevice. Running a ZFS dataset as an iSCSI target for VM Shared Storage using ZFS over iSCSI.

1x (old) R720 2U server using a PCI-E card with mirrored (RAID1) m.2 NVME boot drive for TrueNAS Scale, 8x 10TB SATA 7.2K Enterprise drives (Seagate EXOS), 128GB DDR ECC RAM, Intel X710DA2 dual 10Gb SFP+ NIC, and quad 1Gb LOM. Running one dataset as general storage for office workstations and a second dataset as an rsync target for backup of the VM dataset on the R7615 server above (maybe even a failover iSCSI target for redundancy if possible?) Maybe even create a VM for PBS to pull backups into TrueNAS.
How do you do your offsite backups at the moment than? In case of fire or another emergency I would want to have a backup which can still be used. This is especially true if your company fells victim if the rest of the infratrcuture (including the storage server) gets hacked.
Acronis is our offsite backup at this time. Any offsite backup would be a small Synology-style unit syncing the backups from the office, but if all the servers at the office were somehow destroyed, I don't know what would take over for them to load the backup onto, short of overnighting a new server from Dell, although that is at least an option to get back online. Restores from an offsite server would be slower than any restores from Acronis, and Acronis should be able to restore us to a point pre-hack.
 
I think I am on the path to re-thinking my whole process. The servers we are looking at are about $12-15K each depending on where we go; we are looking into current/last gen Dell refurbs with a 3 year warranty. Using some of the info from above, maybe I can do the following and still keep it within the same budget:

2x R6615 AMD EPYC 1U servers using Dell BOSS-N1 480GB mirrored (RAID1) boot drive for Proxmox, 128GB of DDR5 ECC RAM, Intel E810 Dual Port 10/25Gb SFP28 NIC, and dual port 1Gb LOM. - Clustered, running VMs from Shared Storage on the R7615 below.

1x R7615 AMD EPYC 2U server using Dell BOSS-N1 240GB mirrored (RAID1) boot drive for TrueNAS Scale, 12x 6.4TB NVME direct drives in ZFS RAID-10, 128GB of DDR5 ECC RAM, Intel E810 Dual Port 10/25Gb SFP28 NIC, and dual port 1Gb LOM. Running a VM on TrueNAS Scale for the QDevice. Running a ZFS dataset as an iSCSI target for VM Shared Storage using ZFS over iSCSI.
The question is: do you really need shared storage? iSCSI adds additional complexity. I personally think that using ZFS + storage replication is easier while still maintaining high availability. In most cases I think that the potential loss of one minute of data is something I can live with (one minute is the minimum sync schedule one can configure for a VM). I also remember that ZFS over iSCSI needs support from the NAS; I don't know whether TrueNAS allows it: https://pve.proxmox.com/wiki/Storage:_ZFS_over_ISCSI
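For what it's worth, on the PVE side ZFS over iSCSI is just an entry like the following in /etc/pve/storage.cfg (names and addresses invented); the catch is that the NAS has to expose one of the supported target providers (LIO, IET, istgt or Comstar) and give PVE SSH root access so it can create the zvols, which is where TrueNAS setups tend to get awkward:

```
zfs: truenas-zvols
        iscsiprovider LIO
        portal 192.168.1.50
        target iqn.2005-10.org.freenas.ctl:pve
        pool tank
        lio_tpg tpg1
        sparse 1
        content images
```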
I remember that for smaller setups some folks recommend using NFS on a NAS as shared storage. However, then you will have to live with the security implications of using NFS (short story: it's not a secure protocol and should only be used in a tightly secured network nobody from the outside can access) and the potential performance impact (due to the overhead of another filesystem sitting between the VM, the VM's qcow2 disk image files (needed for snapshots) and the disk) compared to block storage.
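If you did go the NFS route, attaching the export is at least simple (server and export path are placeholders), and qcow2 images on it give you snapshot support:

```
pvesm add nfs truenas-nfs \
    --server 192.168.1.50 --export /mnt/tank/vmstore \
    --content images --options vers=4.2
```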

1x (old) R720 2U server using a PCI-E card with mirrored (RAID1) m.2 NVME boot drive for TrueNAS Scale, 8x 10TB SATA 7.2K Enterprise drives (Seagate EXOS), 128GB DDR ECC RAM, Intel X710DA2 dual 10Gb SFP+ NIC, and quad 1Gb LOM. Running one dataset as general storage for office workstations and a second dataset as an rsync target for backup of the VM dataset on the R7615 server above (maybe even a failover iSCSI target for redundancy if possible?) Maybe even create a VM for PBS to pull backups into TrueNAS.

If you use zfs send/receive you don't need rsync for the file transfer. I would use PBS instead since it's better integrated. Your mileage may vary.
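In case it helps, the zfs send/receive variant of that transfer is only a couple of commands per dataset (pool, dataset and host names invented); incremental sends keep the follow-up runs small:

```
# Initial copy to the R720:
zfs snapshot tank/vmbackups@2024-11-12
zfs send tank/vmbackups@2024-11-12 | ssh r720-nas zfs receive backuppool/vmbackups

# Later runs only send the changes since the previous snapshot:
zfs snapshot tank/vmbackups@2024-11-13
zfs send -i @2024-11-12 tank/vmbackups@2024-11-13 | ssh r720-nas zfs receive backuppool/vmbackups
```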
Acronis is our offsite backup at this time. Any offsite backup would be a small Synology-style unit syncing the backups from the office, but if all the servers at the office were somehow destroyed, I don't know what would take over for them to load the backup onto, short of overnighting a new server from Dell, although that is at least an option to get back online. Restores from an offsite server would be slower than any restores from Acronis, and Acronis should be able to restore us to a point pre-hack.
I think using Acronis as the main offsite backup for your data should be fine.
 
The question is: do you really need shared storage? iSCSI adds additional complexity. I personally think that using ZFS + storage replication is easier while still maintaining high availability. In most cases I think that the potential loss of one minute of data is something I can live with (one minute is the minimum sync schedule one can configure for a VM). I also remember that ZFS over iSCSI needs support from the NAS; I don't know whether TrueNAS allows it: https://pve.proxmox.com/wiki/Storage:_ZFS_over_ISCSI
With a 24 hour service running, I would really prefer to have as close to zero downtime as possible. If it can fit in our budget, High Availability with no loss of data is definitely the way we want to go. I am working with a recertified Dell server vendor - xByte - to determine the best cost/performance strategy between the options above and possibly a 4 server scenario with CEPH, if they can get close to our budget. We can add additional nodes in a year or two when same-model servers drop a few thousand dollars more. I don't know how hard it would be to do a ZFS replication setup to start and then migrate to CEPH later when we can add more nodes. Luckily one of the system design engineers is close with a colleague who is well versed in Proxmox, CEPH, and ZFS replication. They are discussing the options, best practices, and what to avoid so that they can provide me the best quote. They have extremely friendly support and I can spec out a current gen server fully loaded with storage and 3 year Dell support for less than an exact matching new server without any storage at all.

I remember that for smaller setups some folks recommend using NFS on a NAS as shared storage. However, then you will have to live with the security implications of using NFS (short story: it's not a secure protocol and should only be used in a tightly secured network nobody from the outside can access) and the potential performance impact (due to the overhead of another filesystem sitting between the VM, the VM's qcow2 disk image files (needed for snapshots) and the disk) compared to block storage.
I did read up a little on the possibility of using an NFS share, but I don't think the overhead is worth it.

If you use zfs send/receive you don't need rsync for the file transfer. I would use PBS instead since it's better integrated. Your mileage may vary.
I was definitely thinking of installing PBS as a VM on TrueNAS to do the backups. At the time of writing I wasn't sure if using TrueNAS's inherent capabilities would be better than other software, but the ability to manage the backups at a finer level is drawing me in.
 