Design options for a 2-node cluster in production

pxmx_sys

New Member
Oct 20, 2023
Hi,
I have joined a new company that already has two Proxmox servers (each in a different site; single-node setups, not in a cluster). Their storage is ext4 on RAID0 (pretty sure the previous admin just clicked Next during installation without reading the options!!), so there is no redundancy at the moment; I'm just taking backups and shipping them offsite to a NAS.
I'm planning to set up a cluster and consolidate all the VMs of the current nodes into a 2-node cluster. So I'm going to buy 2 new servers (maybe Dell R750s because of their support; Supermicro support takes too much time).
1) One option is to load 50+ TB of SSDs into each server, create a ZFS filesystem and set up replication every 15 minutes between them. When an issue happens, hopefully the VMs will migrate to the other node. For this setup, I think I need a separate computer as the 3rd quorum vote to make sure failover can happen automatically; otherwise it will need some manual intervention.
2) The second option is to use Ceph as shared storage built from the local storage assigned to the nodes (not sure about it, as I haven't used Ceph).
3) The third option is to buy a SAN (or even a fast 100G NAS), share it between the 2 servers and set up a proper cluster.

Please can you advise, based on your experience, which of these options is better (cost effective, easier to support, more robust)? If your preferred option is not among them, please let me know. I've done some googling and searched the forum but couldn't find answers to my questions, so I'm starting this thread. Apologies if this has already been discussed here.
 
You are correct that you need a 3rd member/quorum to achieve proper cluster status.

You should also consider that creating a cluster across "sites" requires those sites to have very low latency between them. And to do everything by the book, you need to put the quorum vote in a 3rd site.

ZFS replication driven by PVE is limited to intra-cluster, so if you want to use that, the source and target nodes need to be in the same cluster. It's a fine option if you are OK with 15 minutes of data loss. It's a hybrid of DR and HA, in my opinion, not really achieving either 100%.
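For reference, such a replication job can be created in the GUI or with pvesr; a minimal sketch, assuming a hypothetical VM 100 and a target node named pve2 (job IDs follow the <vmid>-<number> format):

Code:
# replicate VM 100 to cluster member pve2 every 15 minutes
pvesr create-local-job 100-0 pve2 --schedule "*/15"
# check the state of all replication jobs on this node
pvesr status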

Ceph is a popular option. It has native PVE support, meaning that PVE packages Ceph and provides GUI/CLI to manage it across nodes in the cluster. Similar latency caution applies.
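To give a feel for that CLI, a rough sketch of the pveceph flow (network and device names are placeholders; keep in mind Ceph generally wants 3+ nodes, which comes up later in this thread):

Code:
pveceph install                       # install the Ceph packages on this node
pveceph init --network 10.10.10.0/24  # a dedicated Ceph network is recommended
pveceph mon create                    # create a monitor on this node
pveceph osd create /dev/nvme0n1       # turn a local disk into an OSD
pveceph pool create vmpool --add_storages  # create a pool plus PVE storage entry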

SAN is also an option for shared storage within the cluster at a site. Depending on the type of SAN, you may need to sacrifice Snapshot functionality and thin provisioning.

NAS is a popular option, assuming you mean NFS. Generally, it means you will store VM disks/images in QCOW2 format. There will be some performance penalty to take/delete snapshots with that. How that plays with your site layout is impossible to say due to the lack of data.
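As an illustration, an NFS storage definition in /etc/pve/storage.cfg looks roughly like this (name, server and export path are made up); the qcow2 image format is what gives you snapshots on NFS:

Code:
nfs: nas01
        server 192.168.1.50
        export /export/vmstore
        path /mnt/pve/nas01
        content images
        options vers=4.2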

In short, you've done more research than many people, and you've found all the options available to you. Each comes with its pros and cons. Only you can decide what works best for you. The factors are: criticality of data, skill level, budget, support requirements, region, and many more, not least the "hit by a bus" factor.

Good luck in your choice.


 
Thanks so much @bbgeek17, well explained. Just a clarification: I plan to install this 2-node cluster in one site and in one rack, to make sure the performance will be good. I've seen some 100G NAS devices (which, as you said, means an NFS share). If NFS locking doesn't hurt the running VMs, I think it can be a good option. Also, I'm not sure Ceph is a good option here, given it's only 2 nodes and it's not going to be expanded in the future.
 
Even for a local cluster you need 3 nodes, or a QDevice running outside the nodes. 2-node clusters are for home-labs and experiments.

If one node in this cluster goes down, the other will be in R/O mode or will fence itself. A cluster must have quorum; in a two-node cluster, quorum is 2. In a 3-node cluster it's also 2, so either one node + the vote survives, or both prod nodes (if the vote is down).
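For completeness, the documented way to add such an external vote is a QDevice; a minimal sketch, assuming a placeholder IP for the external machine:

Code:
# on the external machine (any Debian host, no PVE needed):
apt install corosync-qnetd
# on every cluster node:
apt install corosync-qdevice
# then, from one cluster node:
pvecm qdevice setup 192.0.2.10
# verify the vote count:
pvecm status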


 
You are correct that you need a 3rd member/quorum to achieve proper cluster status.
I suppose both of you mean a QDevice, in which case it can be off-site and definitely does not require low latency; in fact it runs on TCP 5403, which last time I checked was not well documented.
 
Even for a local cluster you need 3 nodes, or a QDevice running outside the nodes. 2-node clusters are for home-labs and experiments.

A QDevice is definitely something I would prefer, but just to be clear, there's a way to run a 2-node cluster with the specific corosync setting two_node: 1 (see also man 5 votequorum [1]).
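For illustration, the relevant quorum stanza in /etc/pve/corosync.conf would look roughly like this (per votequorum(5), two_node: 1 also implies wait_for_all):

Code:
quorum {
  provider: corosync_votequorum
  two_node: 1
}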

It's something I would not consider "home only", but if you can at least run a QDevice, you absolutely should prefer THAT, which it sounds like you can.

[1] https://manpages.debian.org/unstable/corosync/votequorum.5.en.html
 
Thanks @tempacc346235. At the moment I am in the phase of high-level design, so I will check the cluster specifics later, during implementation. My main question was about a good design architecture for a production environment, in one site and one rack (so low latency), with about a 50TB storage requirement. In the same design I need to include the backup solution as well, and I will need to move it to another site. In the future (maybe next year) I'm going to create a similar cluster at our DR site as well. So I will need your design experience and expertise here.
 
Thanks @tempacc346235. At the moment I am in the phase of high-level design, so I will check the cluster specifics later, during implementation. My main question was about a good design architecture for a production environment, in one site and one rack (so low latency), with about a 50TB storage requirement. In the same design I need to include the backup solution as well, and I will need to move it to another site. In the future (maybe next year) I'm going to create a similar cluster at our DR site as well.

I think it's good to know that the QDevice (unlike a third node) can totally be off-site if that works better for you. Or you can have a third node (on-site).

So I will need your design experience and expertise here

You'd better ask @bbgeek17, as I am the one who often suggests options "unsupported" by PVE. ;)
 
Even for a local cluster you need 3 nodes, or a QDevice running outside the nodes.

It's only now that I realized you meant "outside" as in "not a VM of the cluster itself". That's some experience speaking, apparently. So yeah, the QDevice can be on-site or off-site, but it makes no sense to run it as a VM of the cluster it votes for. A third full node, by contrast, has to be on-site.

EDIT: BTW, the QDevice could be a VM on *another* cluster, so if you have e.g. 2 sites, you can indeed have your QDevice as a VM on the *other* cluster. I would not do it vice versa at the same time, though. Ideally it's a separate machine, possibly shared with your backup server even.
 
I wanted to clarify a few things:

- The topic started with, as I understood it, installing a new PVE cluster across two sites. When doing so, and using best practices, the vote (regardless of packaging) must be placed in a 3rd site equally reachable from the first two. If the vote is located in either site, a break in communication between the sites will lead to loss of quorum, with all the expected consequences.
- The conversation moved on to having a cluster per site. In this scenario the vote should be in the same site as the first two nodes of the cluster, for the same reason as above (i.e. reducing the number of potential fault points that can lead to loss of quorum).
- The mentioned "two_node" setting, as is evident from the referenced manual page, is specific to the "votequorum" daemon/technology, which is not the same as "qdevice" in the Corosync realm. It deviates from the supported, and hence tested, standard configuration.
- The "qdevice" daemon is part of the Corosync packaging that was split off to allow a smaller-footprint installation when only this portion is needed. It participates in the Corosync voting protocol. A delay in sending a vote will lead to a cluster event; hence low latency/fast response is required, the same as for a full Corosync package install.

@pxmx_sys your requirements are very basic as far as deployment goes; my advice is to stick to the PVE documented and supported configuration. Otherwise, a year from now there will be a new forum member as puzzled by the solution in place as you were by the RAID0.



 
- The mentioned "two_node" setting, as is evident from the referenced manual page, is specific to the "votequorum" daemon/technology, which is not the same as "qdevice" in the Corosync realm. It deviates from the supported, and hence tested, standard configuration.

You lost me here: when you have a QDevice in use, your quorum provider is still votequorum. With a QDevice, the external host runs qnetd and there's a qdevice service running on the nodes. But yes, the two_node setting is for setups without a QDevice. In terms of "supported" by PVE, the QDevice is also only tested in a limited way by PVE, despite being very beneficial. In particular, they do not even endorse last_man_standing and other valid setups.

NB I do not see how a QDevice can make a cluster of two nodes less stable, even if it is down or its network is cut off.

- "qdevice" daemon is part of Corosync packaging that was split off to allow a smaller footprint installation when only this portion is needed. It participates in Corosync voting protocol. A delay in sending a vote will lead to cluster event. Hence low latency/fast response is required, the same as full Corosync package install.

A QDevice absolutely does NOT need low latency (towards the rest of the nodes); it can be on the other side of the planet from the cluster, and it can even vote in multiple clusters at the same time, each somewhere else.
 
NB I do not see how a QDevice can make a cluster of two nodes less stable, even if it is down or its network is cut off.
In a 2-node cluster the purpose of the QDevice is to protect the cluster when one of the full members is down. If both "full" members are up, the qdevice can be on the other side of the planet, or off. The point is to protect against the unexpected. Just take a two-node cluster with a qdevice, shut down one of the nodes, and then either slow down the communication or introduce packet loss between the surviving member and the qdevice.


 
In a 2-node cluster the purpose of the QDevice is to protect the cluster when one of the full members is down. If both "full" members are up, the qdevice can be on the other side of the planet, or off. The point is to protect against the unexpected.

Yes. And I do not see how it could do any additional harm, which I do not think you are arguing against.

Just take a two-node cluster with a qdevice, shut down one of the nodes, and then either slow down the communication or introduce packet loss between the surviving member and the qdevice.

It appears your argument is that a co-located QDevice is better than a QDevice far away, for latency reasons. This is not correct, for the very reason you presented yourself above about how votequorum works with the qdevice service.

The qdevice service (running on each node, not on the device itself, where qnetd is running) does provide the vote to the quorum. It does so based on its own settings; see e.g. man 8 corosync-qdevice. Those are the settings under quorum.device, and the defaults are a timeout of 10 seconds and a sync_timeout of 30 seconds.
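To make that concrete, these knobs live in the quorum section of corosync.conf; a sketch with a placeholder qnetd host, using the documented defaults:

Code:
quorum {
  provider: corosync_votequorum
  device {
    model: net
    votes: 1
    timeout: 10000        # ms; how long to wait for the qdevice vote
    sync_timeout: 30000   # ms; same, during a membership change
    net {
      host: 192.0.2.10    # the machine running corosync-qnetd
      algorithm: ffsplit  # the 50:50-split algorithm for even-node clusters
      tls: on
    }
  }
}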

If you have had bad experiences with such a setup (QDevice on the other side of the planet) before, you may want to tinker with quorum.device.heuristics, but the blanket statement above, which to me implies a QDevice requires co-location for some sense of stability, is incorrect.
 
@pxmx_sys your requirements are very basic as far as deployment goes; my advice is to stick to the PVE documented and supported configuration. Otherwise, a year from now there will be a new forum member as puzzled by the solution in place as you were by the RAID0.



@bbgeek17 please can you elaborate on "supported configuration"? I need a solution with enough redundancy. Should I just go with one node, a ZFS RAID1 and a backup solution? Then what about when the node fails or maintenance is needed on the node? We are hosting some critical production systems on PVE, like databases, which can't afford downtime during the day.
 
@bbgeek17 please can you elaborate on "supported configuration"? I need a solution with enough redundancy. Should I just go with one node, a ZFS RAID1 and a backup solution? Then what about when the node fails or maintenance is needed on the node? We are hosting some critical production systems on PVE, like databases, which can't afford downtime during the day.

I think this was all part of the reaction to the cluster setup. I know nobody asked, but in my book two nodes are better than one with RAID or a ZFS mirror, simply because you have more redundant parts. Even if you have PSUs etc. properly redundant, with two nodes there's more of everything redundant. A QDevice is highly recommended with a 2-node cluster.

Note that if you do not plan to use High Availability, having two nodes lose quorum will NOT disrupt e.g. running VMs in any way.
 
@bbgeek17 please can you elaborate on "supported configuration"? I need a solution with enough redundancy. Should I just go with one node, a ZFS RAID1 and a backup solution? Then what about when the node fails or maintenance is needed on the node? We are hosting some critical production systems on PVE, like databases, which can't afford downtime during the day.
Assuming a single site/rack: a 3-node cluster on a fast, stable network, where the 3rd node is one of the following:
- A full PVE node capable of hosting VMs
- A reduced-capacity PVE node (lower CPU, memory, no additional storage) that is part of the cluster but excluded from HA via policy (see the sketch after this list)
- A qdevice running on a separate server, possibly your backup server
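For the second option, that exclusion can be expressed with a restricted HA group limited to the two full nodes; a minimal sketch, assuming hypothetical node names pve1/pve2 and a VM 100:

Code:
# HA resources in this group may only run on pve1/pve2,
# never on the small 3rd node
ha-manager groupadd prod --nodes "pve1,pve2" --restricted 1
ha-manager add vm:100 --group prod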

Regarding storage, you have to decide whether 15, or even 5, minutes of data loss due to the delay in ZFS replication is acceptable for business recovery. If it's not, then you need a different storage solution (Ceph, SAN, NAS). A non-ZFS storage will also influence your investment in hardware and potentially alter the options for the 3rd node.


 
Assuming a single site/rack: a 3-node cluster on a fast, stable network, where the 3rd node is one of the following:
- A full PVE node capable of hosting VMs

- A reduced-capacity PVE node (lower CPU, memory, no additional storage) that is part of the cluster but excluded from HA via policy

This one is not "officially supported" either, as far as I know.

- A qdevice running on a separate server, possibly your backup server

This is not a node. I am not nitpicking here, but if the OP were to open a support ticket, it is important that he has a 2-node cluster with a QDevice, not a 3-node cluster. But I agree it is not any worse than the first option.
 
Setups this small are not going to allow you to transparently continue operation during a failure or maintenance. You will have to supervise the failover and failback of your production services between the replicated VMs.

I don't know how you can get "full" DR + HA capability of a large hyperconverged environment cooked down to 2 nodes with any sanity.

To me, the expectation versus the provided spec suggests an imbalance in the cost/performance/reliability/redundancy analysis, and as is often the case, the customer is advised to consider a greater number of more modest hosts over a smaller number of top-heavy hosts.

2 brand-new Dell R750s sure sound rather expensive, and I bet one could get 4 or even 6 R730XDs for about the same.
 
