Dual-Server Remote Field Installation

jmbldwn

New Member
Mar 11, 2025
I am considering using Proxmox VE to provide redundancy and failover for a set of services I am running in a remote field installation.

The idea is to use two identical servers in a cluster, each with enough capacity to run all of my services, in a configuration where if one server fails the other takes on the full load until the dead server can be replaced. I am OK with the temporary loss of redundancy during that time.

Ideally, the two servers will be all of the hardware I need at each location. I'll make sure I have enough RAM/CPU/SSD for everything I'm running.

My research into Proxmox so far has led me to a few questions that I could use some help with:

Storage: By default the storage is per-server. I want a simple way for a small file system that holds images, scripts, etc. to be replicated across both servers in the cluster. My read of Ceph is that it might be overkill for this. There's also a built-in cluster file system Proxmox uses, but it doesn't look like I can use it for my data. What's the simplest way to achieve this?

High Availability: The docs imply that I need 3 voting servers to be quorate. I don't want a 3rd server of any kind if I can avoid it. Is there any reason why a dual-server cluster wouldn't fail over correctly if one server dies or is taken offline?

Any other considerations?
 
Hi,

Storage: By default the storage is per-server. I want a simple way for a small file system that holds images, scripts, etc. to be replicated across both servers in the cluster. My read of Ceph is that it might be overkill for this. There's also a built-in cluster file system Proxmox uses, but it doesn't look like I can use it for my data. What's the simplest way to achieve this?
Probably by using ZFS with storage replication. The interval can be turned down to 1 minute, which - while some data loss might still occur - minimizes the window. Ceph also needs at least 3 nodes, and even then it isn't an optimal setup for HA.
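For reference, such a replication job can also be created from the CLI with `pvesr`; a minimal sketch, where the target node name `pve2` and the guest VMID `100` are hypothetical placeholders for your own setup:

```shell
# Create replication job "100-0": replicate guest 100 to node pve2,
# every minute (the schedule uses the calendar-event syntax).
pvesr create-local-job 100-0 pve2 --schedule "*/1"

# List configured jobs, last sync time and any errors.
pvesr status
```

With a 1-minute schedule, the worst case on failover is roughly one minute of lost writes on that guest's disks.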

High Availability: The docs imply that I need 3 voting servers to be quorate. I don't want a 3rd server of any kind if I can avoid it. Is there any reason why a dual-server cluster wouldn't fail over correctly if one server dies or is taken offline?
Clustering needs at least 3 nodes; "split brain" is the keyword here. See also Cluster requirements - although you can use a QDevice to get around that.
 
Thanks, @cheiss, very helpful.

Can I run the QDevice on the same hardware? I would put a QDevice on each server, so I'd have two servers and two QDevices in the cluster. This would mean if I have a server failure, I drop to 1 server and 1 QDevice until the server is fixed/replaced.

Any issues with that?
 
I would put a QDevice on each server, so I'd have two servers and two QDevices in the cluster.
That wouldn't change the situation, and wouldn't make much sense. You'd effectively still have 2 nodes with 2 votes each => same problem. Also, one normally wants an uneven number of nodes/votes in a cluster - see Supported Setups, which explains all of that in detail.

The QDevice must be external to be effective. It's a very simple, cheap service and can run on any kind of hardware - it just needs a connection to the cluster nodes. E.g. many people run it on some cheap SBC or even their router.
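If you do go that route, the setup is only a couple of commands; a sketch, assuming the external device runs a Debian-based OS and is reachable at `10.0.0.5` (a hypothetical address):

```shell
# On the external device (e.g. a Raspberry Pi): install the quorum daemon.
apt install corosync-qnetd

# On one Proxmox VE node: install the client side and register the QDevice.
apt install corosync-qdevice
pvecm qdevice setup 10.0.0.5

# Verify - the cluster should now report 3 expected votes.
pvecm status
```

The `pvecm qdevice setup` step configures all cluster nodes, so it only needs to be run once.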
 
Yeah, I realize it's not ideal for majority voting, but all I'm trying to accomplish with Proxmox is to have a hot standby. Having a 3rd device may help with the majority voting but it also adds another component I have to deploy and maintain at the remote location.

So that I understand the cost/benefit here: what's the worst that could happen if I deploy 2 servers in a cluster without a 3rd voter?
 
So that I understand the cost/benefit here: what's the worst that could happen if I deploy 2 servers in a cluster without a 3rd voter?
TLDR: Useless.

In layman's terms: when one of the two servers goes down, the remaining one has no idea why it can't reach the other. Is the other server actually down and at fault, or is the remaining server itself having a network issue reaching it? This means it will "fence" itself as a form of protection, so it won't be exactly usable either. When you add a third device, the voting makes sure who is the "odd man out".

Hope that helps - somewhat.

It shouldn't be too much trouble to add a QDevice - even in the field.
 
Having a 3rd device may help with the majority voting but it also adds another component I have to deploy and maintain at the remote location.
It actually does not have to be in the same physical location, e.g. running it somewhere else over the internet via VPN would be fine too - the QDevice is not that latency-sensitive as the actual Proxmox VE nodes.

So that I understand the cost/benefit here: what's the worst that could happen if I deploy 2 servers in a cluster without a 3rd voter?
As soon as one node goes down, there is no quorum anymore in the cluster, since the second node (i.e. the remaining cluster) now only has 50% of the votes - which is not a majority. At this point the cluster basically goes read-only, at least on the management plane, to avoid any data corruption.
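The arithmetic behind this is just a strict-majority rule; a tiny illustration of the vote count in both configurations:

```shell
# Quorum requires a strict majority of the total votes: floor(total/2) + 1.
quorate() {
  total=$1
  online=$2
  majority=$(( total / 2 + 1 ))
  if [ "$online" -ge "$majority" ]; then echo quorate; else echo inquorate; fi
}

quorate 2 1   # two-node cluster, one node down: 1 of 2 votes -> inquorate
quorate 3 2   # two nodes + QDevice, one node down: 2 of 3 votes -> quorate
```

With 2 total votes the majority threshold is already 2, so losing either node drops the cluster below quorum; with a third vote, the threshold stays at 2 and one node may fail.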

Having high availability - which you want - is simply not possible with two nodes.
 
Interesting. I didn't consider that it could be a cloud server somewhere. However, I'd be concerned that we could get stuck if the Internet connection is down while a failure is being resolved.

I understand the technical concern over a network glitch where each server decides it's alone, but that's unlikely in this configuration. I was hoping there would be a way of telling each server it's ok to take over if it can't contact the other.

It's not all bad. I was sizing my servers for 100% load so we could run on just one during an outage. But if we go with three servers, we can assume a minimum of two during an outage, so I can size them for 50% load instead.
 
I understand the technical concern over a network glitch where each server decides it's alone, but that's unlikely in this configuration. I was hoping there would be a way of telling each server it's ok to take over if it can't contact the other.
These two sentences are basically contradictory. If you are not worried about network communication between the two servers failing, you will never need the solution raised in your second sentence.

Another thing you have to realize is that with your suggested solution of "telling each server it's ok to take over if it can't contact the other", you could - and eventually will - end up with both servers running the same VMs/LXCs simultaneously. If that is not a problem for your workload (in most cases it is), then just set up 2 standalone servers running everything all the time. Then you will have the redundancy you seek.

I may not be fully aware of your situation / needs, but I'm just commenting based on what I understand.
 
These two sentences are basically contradictory. If you are not worried about network communication between the two servers failing, you will never need the solution raised in your second sentence.

Another thing you have to realize is that with your suggested solution of "telling each server it's ok to take over if it can't contact the other", you could - and eventually will - end up with both servers running the same VMs/LXCs simultaneously. If that is not a problem for your workload (in most cases it is), then just set up 2 standalone servers running everything all the time. Then you will have the redundancy you seek.

I may not be fully aware of your situation / needs, but I'm just commenting based on what I understand.
The key difference is only one instance of each VM can be active, as they are managing specific resources in the field, so I can't have two instances running at the same time. I just need a way for the other server to take over if one fails and keep the VM instances running.

The failure mode I'm concerned about is not the communication between the servers, just a single server dying, where I want the other to take over. But it looks like Proxmox wasn't designed to handle this 2-server case.

I think I'm good with either adding a QDevice or a third server, but I wanted to make sure I fully understand the options.
 
The key difference is only one instance of each VM can be active
And exactly this is what requires a quorum process, to decide where that instance is allowed to run safely ;) As written above, lots of people use something like a Raspberry Pi or similar hardware to run the QDevice. The QDevice going down is also not an issue, unless one of the servers goes down at the same time.
 
The failure mode I'm concerned about is not the communication between the servers, just a single server dying, where I want the other to take over. But it looks like Proxmox wasn't designed to handle this 2-server case.
To expand on this: how would a server be able to distinguish whether the other server died or it merely lost its connection to it? Both cases look identical from the perspective of each server.
E.g. imagine the network connection between those two failed for whatever reason. In that case, neither server has any way of knowing what it should do.
"Split-brain" is an often-mentioned keyword here.

Thus, safe clustering/HA is generally not possible with two nodes.
 
1. Cluster Setup
Since the setup is for two nodes, a third device should be added as a QDevice to achieve quorum. HA can only be configured once this setup is complete.

If a QDevice is not used, then when one server fails the votes are split evenly and both nodes might claim to be the master, which increases the risk of the cluster breaking. (The condition is to maintain an odd number of votes, with at least three nodes.) However, even if the cluster breaks, there is no impact on the data.


2. Replication
Replication works when configured with ZFS storage, using the replication commands. However, in case of issues, you will need to manually power on the VMs on the replica server.
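The manual recovery after a node failure is short; a sketch, where the failed node `pve1`, the surviving node `pve2`, and VMID `100` are hypothetical placeholders:

```shell
# Check when the last replication snapshot was synced.
pvesr status

# With the source node dead and no HA configured, move the guest's config
# to the surviving node (the config lives on the shared cluster file system),
# then start it from the replicated disks:
mv /etc/pve/nodes/pve1/qemu-server/100.conf /etc/pve/nodes/pve2/qemu-server/
qm start 100
```

Note this should only be done while the failed node is really down; otherwise you risk exactly the double-start scenario discussed above.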

It seems that going with the first option is the right choice.