Suggestions for low cost HA production setup in small company

Skimming through that page, I am surprised there is no example using Linux kernel bridges for STP meshing,
There is RSTP [1]

Running a meshing or routing daemon just adds another point of failure.
Maybe, but it does allow you to use both links simultaneously, whereas with RSTP only one is in use and the other is fallback only.

Either way, that page requires an additional high speed NIC on each node to do the connections to the other neighbor node.
Which you should have anyway, connected to two switches with MLAG/stacking, to avoid the network being a SPOF. But yes, you would need four NICs per host: two for the mesh and two for the "LAN".

[1] https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server#RSTP_Loop_Setup
 
In your case, the best way to cut costs while keeping the system easy to manage is solution B:

2 PVE nodes with ZFS replication + QDevice

Notes:
  • There are many tutorials for this setup on the web.
  • You can run the corosync-qdevice on your PBS
  • Stay away from Ceph if you cannot afford to run 4 nodes.
 
On a limited budget, Option B (2 nodes + QDevice with ZFS replication) is probably your lowest-cost choice. But you need to design the QDevice's connectivity to the two nodes carefully, so that the failure of any single network device does not stop the two nodes and the QDevice from communicating with each other!
If you want higher availability, you may consider Option C: 3 nodes with Ceph. That will provide better VM data consistency than Option B, because you don't need to tolerate the data gap that ZFS replication leaves between replication runs.
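
To put a rough number on that data gap, here is a minimal Python sketch. It assumes replication runs on a fixed interval and that each run takes some time to complete; the job names and durations are made up for illustration, and `worst_case_loss_minutes` is a hypothetical helper, not part of any Proxmox tooling.

```python
from dataclasses import dataclass


@dataclass
class ReplicationJob:
    """One asynchronous replication job (illustrative values only)."""
    name: str
    interval_minutes: float             # time between replication runs
    sync_duration_minutes: float = 0.0  # assumed time one run needs to finish


def worst_case_loss_minutes(job: ReplicationJob) -> float:
    """Upper bound on lost writes if the source node dies.

    Worst case: the node fails just before a run completes, so the target
    only holds the snapshot from the previous run. Everything written since
    that snapshot is lost: one full interval plus the unfinished run.
    """
    return job.interval_minutes + job.sync_duration_minutes


jobs = [
    ReplicationJob("mailserver", interval_minutes=1, sync_duration_minutes=0.5),
    ReplicationJob("dns", interval_minutes=15, sync_duration_minutes=2),
]

for job in jobs:
    print(f"{job.name}: up to {worst_case_loss_minutes(job)} min of writes at risk")
```

With shared storage such as Ceph, writes are replicated synchronously, so this window does not apply; that is the trade-off between Option B and Option C.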
 
  • You can run the corosync-qdevice on your PBS

But then the backup server can be reached via SSH (without entering a password) from the cluster nodes. It's never a good thing if your backups can be accessed without authentication from the hosts you want to back up.

There is however a good way to work around this, described by @aaron here:

2x PVE nodes with local ZFS storage (same name)
1x PBS + PVE side by side bare metal.

The 2x PVE nodes are clustered. To be able to use HA, I make sure that all VMs have replication enabled. For mail servers and other VMs where any data loss is painful, I replicate with the shortest possible interval of 1 minute. Other VMs, like a DNS server, are replicated with longer intervals.

On the PBS server I have one LXC container running which is providing the external part of the QDevice, so that the 2x PVE nodes get their 3rd vote and can handle HA and downtimes of one node.
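
The vote arithmetic behind this can be sketched in a few lines of Python. This assumes the usual majority rule (more than half of all expected votes) and one vote each for every node and for the QDevice; the function names are illustrative only.

```python
def quorum_threshold(total_votes: int) -> int:
    """Smallest number of votes that is a strict majority."""
    return total_votes // 2 + 1


def has_quorum(votes_present: int, total_votes: int) -> bool:
    """True if the surviving partition still holds a majority."""
    return votes_present >= quorum_threshold(total_votes)


# Two nodes, no QDevice: 2 votes in total, both are needed.
# Losing one node leaves 1 vote, which is not a majority.
print(has_quorum(votes_present=1, total_votes=2))

# Two nodes + QDevice: 3 votes in total, 2 are needed.
# Losing one node leaves the other node plus the QDevice.
print(has_quorum(votes_present=2, total_votes=3))
```

So the third vote is what lets the remaining node stay quorate and take over the HA guests while the other node is down.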



Imho the Proxmox VE documentation should be updated to state that such a setup (maybe with a VM for the QDevice, for even stricter isolation) is considered best practice for small clusters.
 