Design options for a 2-node cluster in production

Happy to agree when you can point to a staff member post or official policy documentation stating it?

I was reacting to the "reduced capacity" node here.


This is concerning the QDevice. I do not disagree with the linked post (except for the part about an odd number of nodes AND a QDevice being discouraged), but you probably were getting at something else I missed.
 
@alyarb thanks for your response. What would you recommend for storage for 4 to 6 R730xd nodes? I will run InfluxDB with some low-latency market data applications. I'm not sure if Ceph and NFS can handle them, so it comes down to SAN and local ZFS solutions; I think ZFS will be much cheaper and easier to maintain. But I'm open to any suggestions.
 
I would post the complete specs of your current host hardware, the current resource allocations and future requirements of all of your VMs, and your budget and expectations for this expansion and any other constraints you can think of that you haven't shared. Also what country are you in?

Local+replicated ZFS will be the cheapest and give reasonable performance, but does not provide realtime hands-off HA. The storage does not scale beyond 1 box. Replication is done on an ad-hoc basis and not quite in realtime.

SAN will not be cheaper than local ZFS, and it will not be faster (all else being equal). Depending on the storage vendor's feature set you may also lose thin provisioning and snapshot capability unless explicitly provided. Additionally, it would be a mistake to assume that you get any redundancy from a single SAN enclosure. Yes, central shared storage can survive a host failure, but what about a storage or switch failure? Putting all your storage in one expensive basket does not change your situation much from where it is right now.

A good SAN solution would involve 2 or more replicated enclosures and LACP networking with a redundant switch stack. Somebody stop me if I'm off base, but the cost and overhead of SAN relative to the performance and reliability you get in return is extremely high and as a result I do not believe SAN deployments are particularly widely used in small-medium enterprise Proxmox environments. There are some very badass SAN products out there, with higher specs than my VM hosts, so I have to wonder about their price and the bottom line relative value to the small-medium end customer.

The third option, Ceph, is at least a 100% free system, allowing more of your budget to be allocated towards nodes, drives, and network, and in a sufficient cluster it will provide high performance, high integrity, high reliability, and dynamically configurable redundancy, but the overhead required to achieve this will seem high. Efficiency increases with cluster size. It's the most expensive free software ever made, but it will be cheaper than dedicated SANs, and it will be more reliable and come with a higher degree of automation and scalability than replicated ZFS.

A 4th option would be to proceed without any clustered physical infrastructure and only pursue your availability/reliability requirements at the application layer. InfluxDB has its own clustering capability, so you could have totally separate InfluxDB server VMs participating in an InfluxDB cluster, and you can rig up multiples of your front-end services behind a haproxy+keepalived cluster of front-facing VMs.

App-layer redundancy can be combined with any physical layer redundancy as well. Just because your hosts and physical storage are redundant and online does not strictly mean that your prod services are always up, not hung, etc.
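To give a rough idea of the keepalived half of such a front-end pair, here is a minimal sketch of a VRRP instance; the interface name, router ID and virtual IP are placeholders you would swap for your own, and the peer VM would run the same config with state BACKUP and a lower priority:

Code:
# /etc/keepalived/keepalived.conf on the primary haproxy VM (sketch only)
vrrp_instance VI_1 {
    state MASTER
    interface eth0            # assumed interface name inside the VM
    virtual_router_id 51      # must match on both peers
    priority 150              # higher value holds the VIP
    advert_int 1
    virtual_ipaddress {
        192.0.2.10/24         # placeholder floating IP that haproxy binds to
    }
}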

Just as an example, in one particular environment, I have a 5-node PVE+Ceph cluster on a 3/2 replicated storage pool, hosting 5 Percona XtraDB Cluster VMs, 5 apache2 VMs, and 5 haproxy+keepalived VMs, behind a pair of pfSense VMs configured in CARP+HA. So I have a distributed hosting environment on distributed storage, with a distributed MySQL cluster on top of that being served from multiple load-balanced web servers behind a dual firewall. In this scheme you could spontaneously yank drives, a host, a switch, or a VM without noticeably impacting the front-end service, but my ratio of physical-to-prod resources is quite high (5x host, 15x data).
 
SAN will not be cheaper than local ZFS, and it will not be faster.
I would add the same disclaimer here as you did elsewhere - "depending on the SAN"
Yes, central shared storage can survive a host failure, but what about a storage or switch failure?
By host failure - do you mean controller? If so, I agree with that part. "Storage" failure is addressed by "shared nothing" architecture, which Blockbridge provides, for example. "Switch" failure is addressed by redundant switches and LACP on each side. I am "lumping" iSCSI and NVMe/TCP SANs here, as it does not make financial sense to buy FC SAN any more.
You would have to get at least 2 SANs and 2 switches in a stack and the SANs would need a reliable internal replication mechanism provided by the storage vendor, multipath i/o, somebody stop me if I'm off base, but the cost and overhead of SAN relative to the performance and reliability you get in return is extremely high and as a result I do not believe SAN deployments are particularly widely used in small-medium enterprise Proxmox environments.
You need two switches with or without a SAN for proper network redundancy. You only need 2 SANs for DR purposes. A single highly available SAN setup is commonly used by all types of businesses. It all depends on the criticality of the data, the budget and, in large part, performance needs.
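For illustration, a minimal sketch of what the LACP side could look like in /etc/network/interfaces on a PVE host; the NIC names and addresses are placeholders, and a LAG spanning two switches assumes the switches are stacked or running MLAG:

Code:
# sketch: 802.3ad bond across two stacked/MLAG switches, bridged for VMs
auto bond0
iface bond0 inet manual
    bond-slaves enp1s0f0 enp1s0f1     # placeholder NIC names
    bond-miimon 100
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4

auto vmbr0
iface vmbr0 inet static
    address 192.0.2.21/24             # placeholder address
    gateway 192.0.2.1
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0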

We are mostly on the same page regarding Ceph, once an adjustment is made for the number of required SAN setups.


Blockbridge: Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
I know BB is a very high end SAN but we don't have a real budget or spec yet. Even in a share-nothing enclosure, the metal on the outside is shared. Got to admit, I got you there.
 
Even in a share-nothing enclosure, the metal on the outside is shared. Got to admit, I got you there.
You may have, not sure. We use off-the-shelf servers to create an HA SAN; a standard cluster is two independent servers (with a vote). They may share the metal of the rack. Are you thinking of proprietary controllers sharing a backplane, by chance?


Blockbridge: Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Yes, I was speaking more of the traditional SAN black box / turnkey storage solution, and you are speaking strictly about Blockbridge, which is very specific and probably the best-case scenario.

I swore off SAN hardware 10+ years ago; it's just been historically poor value compared to HCI, in my opinion. But I admit I'm not current on it, and I like that there are innovators pushing the concept forward with new tech. I'm sure BB is simple, fast, feature-rich, and turnkey for the guy who likes to buy something ready to go. I wish there was a free software version for people to play with and experiment with building their own.
 
I would post the complete specs of your current host hardware, the current resource allocations and future requirements of all of your VMs, and your budget and expectations for this expansion and any other constraints you can think of that you haven't shared. Also what country are you in?
@alyarb I work for a small fintech company in Australia. The current setup is 2 x standalone Proxmox hosts in 2 sites, each with 2 x Intel Xeon 4210R (20 cores / 40 threads), 376 GB RAM, 20 TB of disks (RAID0), Supermicro servers. We try to put all back-office apps and databases in the virtualization environment: databases, DNS, monitoring, ... and some important apps for our company's business. I want to set up the new Proxmox system in only one site at the moment; in future, I will set up the same in the other site. As for budget: about $30k USD.
 
OK, 30k isn't that much. What is the logic behind having 2 non-redundant servers in 2 sites? Do you need 2 sites?

Wouldn't you rather have a redundant setup in 1 site?

What is your current actual storage usage?
 
As I said before, I joined this company recently and found out these 2 standalone Proxmox hosts in 2 sites are on RAID0 storage, so there is no redundancy whatsoever (except the backups, which I move to a NAS). For the new setup, I'm planning to just bring up Proxmox in one site with enough redundancy on the storage and compute side.
 
If this were me, I would want a 5 node 100 GbE PVE + Ceph cluster similar to what I described previously.

This is based on my suppliers and pricing in the US, which I don't want to share on the forum, so I don't know how it would work for you in your situation, or if you are a builder or more of a turnkey managed-services guy, but this is what I would want to build if it were me:

3x Dell R740XD NVMe + 5 year warranty.................$18k
(dual Xeon Platinum 8165, 384 GB DDR4, 120 GB m.2 BOSS mirror, dual 1 GbE/10 GbE rNDC, dual 100 GbE Mellanox MCX556A-ECAT, dual 1100W Platinum PSU, 5 year warranty, 12x 2.5" drive caddies)

30x intel DC P4510 8 TB 2.5" NVMe SSD..................$13k

2x 100 GbE Mellanox SN2700 switches............$3k - $5k

2 more Mellanox NICs..................$800-1000 or so

I would build a minimal 3-node cluster, set up Ceph, move the prod VMs into this environment, and then take the 2 original hosts, assuming they have basic NVMe capability (if not I would just sell them, or consider a Ceph solution based on SAS SSDs like the PM1643a), install the 100 GbE NICs and join them to the cluster, so now you are at 5 nodes.
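Very roughly, the Ceph bootstrap is only a handful of commands once the PVE cluster exists; the cluster network and device name below are placeholders, and the same steps are available in the GUI:

Code:
# sketch: Ceph setup after the PVE cluster is formed
pveceph install                       # installs the Ceph packages on each node
pveceph init --network 10.10.10.0/24  # run once, placeholder cluster network
pveceph mon create                    # on the first 3 nodes
pveceph osd create /dev/nvme2n1       # repeat per data SSD, placeholder device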

Put 5 of the SSDs in each node, leaving 5 as spares. The 8 TB P4510 has about 7.2 TB usable, so 25 of them gives roughly 180 TB of raw physical capacity for Ceph, which works out to about 60 TB of prod storage in a 3/2 replicated pool.
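Creating the 3/2 pool and sanity-checking the capacity is then one more step; the pool name and PG count here are just example values:

Code:
# sketch: 3 replicas, min_size 2, registered as PVE storage
pveceph pool create vm-pool --size 3 --min_size 2 --pg_num 128 --add_storages
ceph df     # compare raw vs. usable: ~25 x 7.2 TB raw / 3 replicas ~= 60 TB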

I completely get that this is a stretch if you are new to PVE/Ceph and/or can't source hardware as easily as in the US.


If Ceph is not an option, then just to start, I think I would take Proxmox host B in site B, bring it into site A, and rebuild it so that it is not RAID0. At least do a RAID10, then migrate your production VMs to this host with redundant local storage.

Then tear down host A and set it up the same way, with the same redundant storage scheme, and just set up basic replication and forget about HA.

With what you have I think you can make it work with a basic backup + restore model. It won't be realtime HA, you will have to manage the failover and failback manually, but at least get out of this RAID0 situation.

It just occurred to me that these hosts might have a hardware RAID and PVE was simply installed on top of it as ext4?
 
@alyarb This is really great information, thanks for that. As you said, the first option is quite a stretch for a PVE beginner like me. The current setup has 2 x VGs: an OS VG (500G) and a pve VG of about 19T, both ext4 on RAID0. At the moment both servers are prod.

Based on the information provided in this thread, I will buy a new server (R740XD) and will set up PVE like below:

- OS: XFS filesystem on RAID1, ~500G NVMe drives. I think because of that I will need a PERC RAID controller in the server
- VMs, snapshots, ISOs: 10 x 8TB SSD in ZFS RAID10 (~40TB usable, which should be enough for a couple of years). A RAID controller is not needed here; ZFS manages the redundancy (rough pool sketch below).
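I assume the pool creation would look roughly like this from the CLI (device names below are placeholders; as I understand it, the PVE GUI under the node's Disks > ZFS page does the same thing):

Code:
# sketch: 10 drives as 5 mirrored pairs (ZFS "RAID10")
zpool create -o ashift=12 tank \
  mirror /dev/nvme2n1 /dev/nvme3n1 \
  mirror /dev/nvme4n1 /dev/nvme5n1 \
  mirror /dev/nvme6n1 /dev/nvme7n1 \
  mirror /dev/nvme8n1 /dev/nvme9n1 \
  mirror /dev/nvme10n1 /dev/nvme11n1
pvesm add zfspool tank-vm --pool tank --content images,rootdir   # register with PVE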

Please correct me if I'm wrong: replication only works if the nodes are added to a cluster, which means HA needs to be set up properly with a 3rd node, which can be anything (the Qdisk stuff discussed in the same thread - a separate VM or server).

About backup/restore: I will take weekly backups to disk and later move them to the NAS. I will need to do more research on the best backup + restore policy.
 
Please correct me if I'm wrong: replication only works if the nodes are added to a cluster, which means HA needs to be set up properly with a 3rd node, which can be anything (the Qdisk stuff discussed in the same thread - a separate VM or server).

Replication works (with ZFS only) and only within a cluster; it is not, however, related to HA. It is possible to have a cluster (and thus replication going) with no resources set up as HA within that cluster.
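(For reference, forming the cluster itself is just a couple of commands; the cluster name and the IP of the first node below are placeholders:)

Code:
# sketch: run on the first node
pvecm create prod-cluster
# run on each additional node, pointing at the first node
pvecm add 192.0.2.11
pvecm status     # check quorum and membership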
 
I just noticed @alyarb wrote about the same above:
With what you have I think you can make it work with a basic backup + restore model. It won't be realtime HA, you will have to manage the failover and failback manually, but at least get out of this RAID0 situation.

When he said "realtime HA", he meant you can be the "HA" ... if you see something going down and get notified, then thanks to the replicas you can spin it up on the other node manually. Higher up, he mentioned replicas are done on an "ad-hoc" basis without clarification, but replicas are actually automated/scheduled; just the granularity is limited. E.g. you can have replication running within that ZFS setup every few minutes; the absolute minimum is once a minute. If you have a good network, the replicas will be quick, since they just exchange snapshot deltas, but if you have e.g. a database running there (and not on shared storage), you will have lost the past few minutes of transactions should you be restarting from a replica.
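As an example of the granularity, a replication job is just a schedule attached to a guest; the VMID, target node and interval here are placeholders:

Code:
# sketch: replicate VM 100 to node "pve2" every 5 minutes (job id is <vmid>-<number>)
pvesr create-local-job 100-0 pve2 --schedule "*/5"
pvesr status     # shows last sync time, duration and any errors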

I thought you might want to consider this before taking the final decision.

It just occurred to me that these hosts might have a hardware RAID and PVE was simply installed on top of it as ext4?

Apologies if I am second-guessing him here, but I think he did not necessarily mean to encourage you to get a PERC if it's not already there. You could have a reliable mdadm RAID as well (but it's not "supported" by the ISO installer; you would need to install on top of Debian [1]) - so I do not think you will be interested.

Final note on ZFS: a HW RAID controller is actually undesirable for ZFS:
https://openzfs.github.io/openzfs-docs/Performance and Tuning/Hardware.html#hardware-raid-controllers

EDIT:
[1] https://pve.proxmox.com/wiki/Install_Proxmox_VE_on_Debian_Buster
 
Happy to agree when you can point to a staff member post or official policy documentation stating it?

So I think I now understood what you meant. Would this suffice?
https://forum.proxmox.com/threads/maintaining-2-node-quorum.125021/#post-545036

Note he is not suggesting any hocus pocus with a "lower capacity" node; the QDevice is - in that scenario - more reliable (for the reasons mentioned above by yourself), and it does require another piece of hardware.

Note I am not saying he should not get a 3rd node, or that he should absolutely match the CPU, etc. of the other nodes if he were to get one. But there's zero benefit in getting a node that will not be running VMs; it just adds risk compared to a QDevice.
 
@tempacc346235 So we will need a cluster and 3 quorum devices to set up replication properly, but we won't need to add any resources (like VMs) to HA, so failover will be done manually, got it! But at the same time, when we set up the cluster and quorum stuff properly, I think HA resources can be added so we can have automated failover; for sure I will need to test it to see if this is the behavior I will need in this setup.

About the PERC and RAID controller, you are right: we can't use a HW RAID controller (e.g. a Dell PERC) in both RAID and HBA mode and we should choose one of them; in our case it has to be HBA mode to make sure ZFS performance won't be impacted. For the OS redundancy, mdadm can be set up as you said.
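Something like this is what I have in mind for the OS mirror, if I understand mdadm correctly (partition names are placeholders, and the EFI partition itself would stay outside the array):

Code:
# sketch: mirror the OS partitions with mdadm on top of a Debian install
apt install mdadm
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/nvme0n1p3 /dev/nvme1n1p3
cat /proc/mdstat                                 # watch the mirror sync
mdadm --detail --scan >> /etc/mdadm/mdadm.conf   # persist the array definition
update-initramfs -u                              # so it assembles at boot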
 
@tempacc346235 So we will need a cluster and 3 quorum devices to set up replication properly

No, I hope I did not add confusion with the QDevice. Since a 3-node setup has been suggested to you and you are buying a 3rd server and putting them all into one location (if I understood you well), you will have a 3-node cluster. End of story; you are alright with that. If I misunderstood the final decision, let me know.

(My reaction to @bbgeek17 was just about the case of having a "fake" 3rd node, which is senseless within the given options: either have a proper one or a QDevice.)

but we won't need to add any resources (like VMs) to HA, so failover will be done manually, got it!

This was suggested by @alyarb; I do think it's the safer setup to have (you do not risk fencing, etc.), but ...

But at the same time, when we set up the cluster and quorum stuff properly, I think HA resources can be added so we can have automated failover; for sure I will need to test it to see if this is the behavior I will need in this setup.

Yes, but definitely a 3-node cluster at the least, or 2 nodes + a QDevice (sorry, I refuse to call a QDevice a node; it does not work like a node, it is not called one, and it causes confusion if one imagines it as one).
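For completeness, wiring a QDevice into a 2-node cluster is roughly this (the external machine can be any small always-on Linux box; its IP here is a placeholder):

Code:
# on the external machine:
apt install corosync-qnetd
# on both cluster nodes:
apt install corosync-qdevice
# on one cluster node, pointing at the external machine:
pvecm qdevice setup 192.0.2.53
pvecm status     # should now show the extra QDevice vote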

About the PERC and RAID controller, you are right: we can't use a HW RAID controller (e.g. a Dell PERC) in both RAID and HBA mode and we should choose one of them; in our case it has to be HBA mode to make sure ZFS performance won't be impacted. For the OS redundancy, mdadm can be set up as you said.

I do not want to start a flame war on this one here either, but mdadm works well with Debian and PVE can be reliably installed on top of it. Similarly, I gave no opinion on e.g. XFS - I do not really know what they consider "supported"; I know it's e.g. how RHEL installs by default and it's a good filesystem. I think the standard ISO installer lets you choose a filesystem on top of LVM, and some people would e.g. swear by ZFS instead anyhow (for the OS), which the default ISO installer supports. I do not like it (for the OS); I do not like how it shoves boot files into the EFI partition, etc., and then troubleshooting it.

One thing that crosses my mind with the mdadm and EFI ... it's a pretty tricky setup with Debian (without hardware RAID) to have that EFI duplicated, as you cannot really have mdadm over it reliably. I do not know if you care about that one. I will not open that can of worms here and will let others comment; I am sure I gave them enough to dispute. :)
 
A brief reference on what the issue is with mdadm and EFI:
https://forum.proxmox.com/threads/proxmox-8-luks-encryption-question.137150/page-2#post-611562

Would I want to run it myself in production? Probably ... NOT. But maybe you do not worry about a non-booting situation and can manually fix that, or maybe you just care about a redundant OS SSD while it is already running. Or have the EFI partition on something else entirely. Or go full ZFS (including for the OS)... I am sure that will be "supported". ;)
 
A brief reference on what the issue is with mdadm and EFI:
https://forum.proxmox.com/threads/proxmox-8-luks-encryption-question.137150/page-2#post-611562

Would I want to run it myself in production? Probably ... NOT. But maybe you do not worry about a non-booting situation and can manually fix that, or maybe you just care about a redundant OS SSD while it is already running. Or have the EFI partition on something else entirely. Or go full ZFS (including for the OS)... I am sure that will be "supported". ;)
Correct, I guess letting ZFS manage everything is the easiest way. If the OS goes under, it will pull everything else down with it ;)
 