Is compute node + backup node + Q node setup ok as somewhat HA cluster?

I have been using Synology & VMM for years, recently outgrew their prosumer hardware, got myself an R730 and have been learning Proxmox for the last few weeks. From experience, decisions around file system, RAID and overall architecture are super hard to change down the track, and some are practically prohibitive to change.

I'm hoping the gurus here can help validate my high-level planning. My goals are:
  1. One VM with 99% uptime; the other VMs are not critical. The 99% VM is about 200GB currently, grows 50~100GB per year, and is WordPress in Docker on Ubuntu.
  2. 99.9% network uptime with a primary WAN plus a backup WAN via a 4G mini PCIe card.
  3. A relatively safe tinker space where stupid mistakes don't affect the above, well, as much as possible.

The hardware I have:
  • Main node: R730 with 4x Intel SATA SSDs, more compute power than I'll ever need
  • 2nd node: not here yet, thinking of getting an R730XD LFF, or building my own i3~i5 level box with 6~8x SATA
  • 3rd node: Intel N100 mini PC, 8G/128G, 4x 2.5G ports
  • a couple of Synology NAS units

Here's my setup plan. Please correct me where I'm wrong.

Main node will have RAID-Z1 on the SSDs and run most of the VMs most of the time, with HA via ZFS replication for certain VMs, and an HA group to restrict those VMs to the main node & 2nd node.
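As a rough sketch of what I understand that to look like from the CLI (node names node1/node2 and VMID 100 are placeholders; the same can be done in the GUI under the VM's Replication tab and Datacenter -> HA):

    # replicate VM 100's disks to the 2nd node every 15 minutes (job id is <vmid>-<n>)
    pvesr create-local-job 100-0 node2 --schedule "*/15"
    # HA group restricted to the two capable nodes, main node preferred via higher priority
    ha-manager groupadd main-pair --nodes "node1:2,node2:1" --restricted 1
    ha-manager add vm:100 --group main-pair --state started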


2nd node will pretty much be a standby failover for the main node, so I'm thinking a few 3.5" NAS drives in RAID-Z2. In addition, I'm thinking of running:
  • Blue Iris on a separate ZFS pool, not critical at all
  • Proxmox Backup Server in a VM to back up the main node's VMs, also not critical

Originally I was planning to run pfSense bare metal on the N100, but thought I might be able to virtualize pfSense and have the N100 provide 3rd-node Proxmox quorum: 2 birds, 1 stone, at the cost of extra setup complexity.
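For reference, the two common ways to get that third vote seem to be joining the N100 as a full third node, or keeping it outside the cluster as a pure vote provider (QDevice). A sketch of the QDevice route, with a placeholder IP:

    # on the N100 (or any always-on Debian-based box outside the cluster)
    apt install corosync-qnetd
    # on each cluster node
    apt install corosync-qdevice
    # from one cluster node, point the cluster at the external vote
    pvecm qdevice setup 192.168.1.50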


The old Synology can be repurposed with Active Backup for Business to further back up some of the more critical VMs.


Is my understanding below correct?

Ceph would not be possible with my totally asymmetric setup, as it would require at least similar pool capacity on each node, and the SSD vs HDD speed difference can cause issues with Ceph sync. NFS would create a single point of failure, leaving ZFS replication as my best option.



I can host PBS either on the Synology or on Proxmox; practically there's not much difference either way, so I'd better put it where the most drive space is, which will be the 2nd Proxmox node.
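If PBS does end up on the 2nd node, creating the datastore on the big HDD pool looks like a one-liner once PBS is installed (name and path below are made up); the PVE side is then attached under Datacenter -> Storage -> Add -> Proxmox Backup Server:

    # on the 2nd node, put the datastore on the large HDD pool (example name/path)
    proxmox-backup-manager datastore create hdd-backups /mnt/datastore/hdd-backups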



From what I've learned, an SSD node + a spinning-rust node should work reliably for VM migration with ZFS replication, but I'm unsure whether it's going to be reliable to virtualize pfSense and have the N100 double as the quorum device for my micro cluster. Gut feeling says this would be too good to be true. Perhaps I should just leave pfSense out of the cluster.

Expanding on that, there are many challenges like those below, but it'd be super cool if this could be pulled off. Would it be reliable (not just possible) to spin up another pfSense VM on Node 1 as the passive unit via pfSense's built-in HA, not Proxmox HA?
  • the setup complexity (chance for error) is on another level
  • the 4G mini PCIe WAN can't be shared across active & passive, unless it's in its own modem with VLAN switching
  • even if all of the above is resolved, my WAN switches would become the single point of failure, even if switches are way more stable in general
 
This could work reliably if set up properly. Some recommendations:

- Avoid ZFS raidz for VM workloads if possible: it wastes a lot of space due to padding and performs much worse than mirrors. It's just a handful of VMs, but still.

- Use two corosync links with different NICs (see the sketch after this list). This is vital when using HA, as the nodes will fence themselves (reboot) if they lose quorum.

- I would install PBS alongside PVE on server 2. No need to use a VM/container. Try to use local disks as the datastore, if possible. Back up that "Blue Iris" VM to different disks than the ones it runs from, so in the event of a disk failure you don't lose both the VM and its backups. Sync some backups to your Synology from time to time to get your backups into another location. Given the low volume of data, you probably have enough PBS performance with spinners and don't need a ZFS special device.

- Place your pfSense cluster on two VMs running on different hosts. Look for a 4G router that you can connect to your network so both firewalls can use it if needed.

- I would place the Intel N100 node in the cluster as a full node, not just as a QDevice, allowing you to run VMs on it, migrate, etc.
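Sketch of the redundant corosync links mentioned above, with placeholder addresses; link0 and link1 should sit on different NICs/subnets:

    # create the cluster with two corosync links
    pvecm create mycluster --link0 address=10.10.10.1 --link1 address=10.10.20.1
    # join the 2nd node, again declaring both links
    pvecm add 10.10.10.1 --link0 address=10.10.10.2 --link1 address=10.10.20.2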
 
Thanks a lot for validating my idea, and for the recommendations.

Avoid ZFS raidz for VM workloads if possible

I simply assumed ZFS raidz was a no-brainer for Proxmox HA. In fact, from my research, ZFS & Ceph seem like the only 2 options with HA. Thank you for pointing it out.

What would be your recommended option? I'm assuming a file system with a replication feature and 1-drive failure tolerance.

I've read a lot about Dell PERC complaining about SSDs without Dell firmware, and with hardware RAID the single point of failure would be the RAID card, though the failure would be limited to a single node. Still, all this makes hardware RAID not that interesting for my micro cluster.

Or should I just run the VMs directly on a single non-RAID SSD (perhaps even NVMe via PCIe) with replication, since it's HA anyway?


two corosync links with different NICs

Corosync is a new term for me; I did some digging, and it's a background service for HA. So in noob talk: use 2 network cables per node exclusively for HA, set up redundant to each other, to prevent a fat-fingered loose connection from causing an unwanted reboot?
 
One question I have: you state you're aiming for 99% uptime, which is around 3.65 days of downtime a year.

If that's the case, it sounds like an active node with backups/data sync to a secondary source is enough, and the downtime of bringing up that secondary source is fine.

If you were aiming for more 9's then you'd need to start thinking bigger. If you only have one VM that needs to stay online, something like Ceph or underlying storage HA is a huge over-complication for the needs of a single VM.

You're probably better off looking at two VMs and running something within the VM to keep the secondary constantly synced from a website/SQL perspective, and then having pfSense set up to fail over if the primary becomes unavailable.
 
99% uptime, which is around 3.65 days of downtime a year.

if you only have one VM that needs to stay online, something like Ceph or underlying storage HA is a huge over-complication for the needs of a single VM


That is correct, only 1 VM is somewhat critical; it can afford (though it's not ideal) to go down a couple of hours during work days, even a whole day during a weekend. We are planning to run other VMs, but they're not as critical.

From my experience on Synology, the downtime mainly comes from DSM firmware upgrades; there have been a couple of times a new version caused a bit of a panic. Manual live migration would solve this; HA is the sweet spot, Ceph would be overkill.
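As a point of reference, that manual live migration is a single command once the nodes are clustered (VMID and node name below are just examples):

    # live-migrate VM 100 to the standby node before patching/rebooting the main node
    qm migrate 100 node2 --online --with-local-disks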

Another bottleneck is the national broadband network; we have a business-grade service that goes down whenever it rains too much. Hence the 4G backup and pfSense.

But I am aiming for higher uptime with the new setup, so any suggestion is appreciated.


running something within the VM

That's unfortunately impractical for us, which led to looking at a hypervisor with live migration.
 
What would be your recommended option? I'm assuming a file system with a replication feature and 1-drive failure tolerance.
ZFS, but RAID1 or RAID10. You can't do Ceph with that hardware, as you need at least 3 nodes with a similar storage configuration. And yes, you can't do storage replication with just hardware RAID. Well, you could create ZFS on top of a hardware RAID, but that is completely unsupported (and a very bad idea if you value your data).
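A sketch of what those layouts look like at the zpool level (pool and device names are placeholders; /dev/disk/by-id paths are preferable in practice), or simply pick the mirror/RAID10 option in the Proxmox installer:

    # 2-way mirror
    zpool create -o ashift=12 tank mirror /dev/sda /dev/sdb
    # or a RAID10 equivalent: two mirrored pairs striped together
    zpool create -o ashift=12 tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd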

Corosync is a new term for me; I did some digging, and it's a background service for HA. So in noob talk: use 2 network cables per node exclusively for HA, set up redundant to each other, to prevent a fat-fingered loose connection from causing an unwanted reboot?
Corosync is the core component of a Proxmox cluster [1]. It is not related to HA as such, but HA needs a cluster, hence HA depends on corosync and not the other way around. You can have a cluster without Proxmox HA enabled for any VM. Use at least one dedicated network for corosync link0; the other link(s) may be shared with other traffic if needed (hardware or network restrictions). If you absolutely have to share corosync links with other traffic, make sure that you prioritize corosync over all other traffic and place the links on NICs that predictably won't get saturated/disconnected at the same time (e.g. don't place one link on the storage NIC for Ceph/iSCSI/NFS and the other on the backup NIC, as both will be used heavily during backups).

[1] https://pve.proxmox.com/wiki/Cluster_Manager#_cluster_network
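If a link does have to share a NIC, the per-link priority can be set in /etc/pve/corosync.conf (or via priority= when creating the links). A minimal, illustrative totem snippet, assuming link0 is the dedicated NIC and that a higher value means the link is preferred:

    totem {
      # existing settings (cluster_name, config_version, etc.) stay as they are
      interface {
        linknumber: 0
        knet_link_priority: 20  # dedicated corosync NIC, preferred
      }
      interface {
        linknumber: 1
        knet_link_priority: 10  # shared NIC, used as fallback
      }
    }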
 
You could even have single-drive ZFS and rely on Storage Replication for data redundancy. Or a 3- or 4-way mirror! ZFS isn't perfect, but it has many use cases.

I'm actually considering that; on the surface it sounds stupid and crazy. But factor in that same-brand & same-batch drives have a much higher chance of failing together, especially during a heavy-load resilver. I learned that the hard way. Also factor in that ZFS has no vdev-level redundancy. It might make sense.

My developing theory for ZFS stripe + mirror:
  • if 2 of the same SSD are in a vdev, when 1 fails, whether the other survives is as good as a coin toss.
  • if only 2 batches/brands of SSD are spread across all vdevs, when 1 drive fails the rest of that brand aren't far behind, another coin toss
  • the only safe way seems to be mixing several brands/batches of similar spec
In practice, I have had this happen twice with NAS HDDs over many years; the first time I was caught blind, the 2nd time I was prepared but it was still quite tedious. Not too sure how likely this is to happen to SSDs in a server. Found a video talking about this issue.


It's not an issue with ZFS; if I had 1-2 dozen drives to mix around, this wouldn't be a concern. In my case of a handful of drives, I'm actually on the fence about going stupid with RAID0 and having redundancy via replication and backup.
 
The only way to protect against all those events is backups. In your terms, I believe that any kind of drive redundancy just makes you toss the coin less often, as drives will fail. Whatever you do, don't use RAID0, as it multiplies the risk of losing data with each drive added to the RAID0.
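And for the backups themselves, an example of the kind of job that covers all of these failure modes (VMID and storage name are placeholders; scheduled jobs do the same thing from Datacenter -> Backup):

    # snapshot-mode backup of VM 100 to the PBS storage on the 2nd node
    vzdump 100 --storage pbs-node2 --mode snapshot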
 
