[Solved] Why do I need 3 Nodes for an HA Cluster?

fireon

Distinguished Member
Oct 25, 2010
Austria/Graz
deepdoc.at
In the wiki I found the information that an HA cluster needs at least 3 nodes. Can anyone tell me why I can't do this with 2?

Regards
fireon
 
Re: Why do I need 3 Nodes for an HA Cluster?

A cluster needs to know which nodes are accessible/available (turned on and networking is fine) and which ones aren't (power failure, network failure... something). If you have 3 nodes and one node fails, both remaining nodes can still communicate with each other and "reach the conclusion" that node #3 is unavailable, because both #1 and #2 can't talk to it. If the cluster had only 2 nodes and the nodes couldn't see each other, each node would consider the other node offline and take measures accordingly. Now if both are actually still in working condition and "only" the network between them broke down, they will both try to start all VMs that need to be kept running, which is very likely to result in complete corruption of the VMs' disk images on the storage (assuming the storage is still reachable).

This is not a very technical explanation, but you get the gist.
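To make that majority rule concrete, here is a minimal sketch of the arithmetic behind it. This is not Corosync's actual code, only an illustration: a partition may only keep acting if it sees strictly more than half of all votes.

```python
# Minimal sketch of the majority rule described above; this is not Corosync's
# implementation, only the arithmetic it relies on.

def has_quorum(votes_seen: int, total_votes: int) -> bool:
    """A partition may keep acting only if it holds a strict majority of all votes."""
    return votes_seen > total_votes / 2

# 3-node cluster, node #3 fails: nodes #1 and #2 together still see 2 of 3 votes.
print(has_quorum(2, 3))  # True  -> the surviving pair may recover node #3's VMs

# 2-node cluster, the link between the nodes breaks: each side only sees itself.
print(has_quorum(1, 2))  # False -> neither side may act; otherwise both would
                         #          start the same VMs (split brain)
```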
 
  • Like
Reactions: Lukas Moravek
Re: Why do I need 3 Nodes for an HA Cluster?

Thank you very much, now I understand it. It was absolutely not clear to me before. This also explains the configuration of a DRBD setup.

Regards
fireon
 
But what is the exact technical explanation for the requirement of a minimum 3-node HA system with shared storage? Can anyone explain specifically?
 
But this was already explained quite well in mo_'s reply here; what are you missing?

Note that nowadays one can use the "external vote support" to run just two PVE nodes with one simple external Linux host providing the third cluster vote, see:
https://pve.proxmox.com/pve-docs/chapter-pvecm.html#_corosync_external_vote_support

In any case, if you run HA with just two PVE nodes (+ external vote of course), both nodes need to keep 50% of their resources free as spare capacity, i.e. memory, and also CPU and possibly network and storage throughput, so that they can safely take over the other node's virtual guests in case of a failure.
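As a rough illustration of that 50% rule, here is a small sketch with made-up numbers; the node names and sizes are assumptions for the example, not anything PVE reports. The point is simply that after a failover, one node must be able to host its own guests plus the peer's.

```python
# Rough capacity check for a two-node HA setup (+ external vote). All numbers
# and node names are made-up examples for illustration, not values read from PVE.

node_ram_gib = 128                        # assumed physical RAM per node
guest_ram_gib = {"pve1": 60, "pve2": 55}  # assumed RAM assigned to guests per node

for node, own in guest_ram_gib.items():
    peers = sum(v for k, v in guest_ram_gib.items() if k != node)
    needed = own + peers               # after a failover this node runs everything
    state = "OK" if needed <= node_ram_gib else "overcommitted"
    print(f"{node}: needs {needed} GiB after taking over the peer -> {state}")
```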
 
  • Like
Reactions: Johannes S
But what is the exact technical explanation for the requirement of a minimum 3-node HA system with shared storage? Can anyone explain specifically?
If you have only two nodes which for whatever reason do not see each other, you have a potential split-brain situation: the same VMs are running on both sides (think of 2 shops selling the same article to different customers).
To avoid that, the surviving side in an HA environment must have a majority (2 out of 3, or 3 out of 5, ...);
1 out of 2 is not a majority. (You can have 2 Proxmox nodes and an additional quorum device.)
 
  • Like
Reactions: Johannes S
But what is the exact technical explanation for the requirement of a minimum 3-node HA system with shared storage
You'll want exactly one instance to work with the data. If PVE didn't get this right, one of two things would happen, depending on how many copies of the data are stored and where. Let's say we have a database.
- If you have multiple copies (Ceph, ZFS replicas), the same database could be started on the other machine and operate on the other copy, thus creating a split-brain situation.
- If you have only one copy (central NAS), multiple instances of the program could write to the same data, thus corrupting the data.

To prevent this, in HA you always want to kill the smaller cluster partition(s); only the single biggest one survives and is allowed to work with the data.
But how do you define "the biggest" partition? If we say the whole cluster is 100% and we want to be sure that only one partition remains working, that partition needs to hold more than 50% of the cluster. Why not exactly 50%? Well, you can have 2*50%, and then there is not one(!) biggest partition.

2*50% is a stalemate. You cannot decide in an automatic way which side may work with the data. If a cluster is in this state and it has HA enabled, both servers will restart if they don't see each other anymore. Without HA they would continue to operate, but you cannot start VMs anymore.
This is where the external vote (as @t.lamprecht wrote) comes in handy, so you won't get a 50/50 split anymore.
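Continuing the earlier sketch, this is roughly why the external vote removes the stalemate. It is an illustration only, not the actual corosync-qdevice logic: the third vote means one partition can still reach a strict majority.

```python
# Illustration of how a QDevice vote breaks the 50/50 tie; not the real
# corosync-qdevice protocol, just the vote counting it enables.

TOTAL_VOTES = 2 + 1   # two PVE nodes + one external QDevice vote

def may_keep_running(node_votes: int, reaches_qdevice: bool) -> bool:
    votes = node_votes + (1 if reaches_qdevice else 0)
    return votes > TOTAL_VOTES / 2

# The link between the two nodes breaks, but node 1 can still reach the QDevice:
print(may_keep_running(1, True))   # True  -> this partition keeps quorum and the VMs
print(may_keep_running(1, False))  # False -> the other partition loses quorum and
                                   #          will be fenced/rebooted by HA
```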
 
  • Like
Reactions: Johannes S
But my confusion is: why is external shared storage required? Is it just so the VMs and data stay safe on additional hardware in case of a server/node failure, or is there a specific reason related to the HA setup?
 
But my confusion is: why is external shared storage required? Is it just so the VMs and data stay safe on additional hardware in case of a server/node failure, or is there a specific reason related to the HA setup?
Some shared storage is required to allow recovery of a VM/CT on another node if the previous node fails; the storage does not have to be external, though.

One option is to use Ceph for storage and set that up directly on the Proxmox VE nodes. That way the redundancy of the storage and of the PVE system is coupled, and there are fewer external HW components involved. A downside is that you need to account for the resources required by Ceph and reserve some extra memory and CPU headroom for it; but you save the extra HW for the external storage, so it should still be cheaper overall.

If you want to read more about Ceph, see https://pve.proxmox.com/pve-docs/chapter-pveceph.html
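If it helps with sizing, here is a back-of-the-envelope sketch of the extra per-node resources a hyper-converged Ceph setup consumes. The 4 GiB per OSD roughly matches Ceph's default osd_memory_target, but treat every number here as an assumption to replace with your own figures.

```python
# Back-of-the-envelope RAM reservation for running Ceph on the PVE nodes
# themselves. All values are assumptions for illustration only.

osds_per_node = 4          # assumed: 4 OSD disks per PVE node
ram_per_osd_gib = 4        # assumed: roughly Ceph's default osd_memory_target
daemon_overhead_gib = 2    # assumed: MON/MGR daemons running on this node

ceph_ram_gib = osds_per_node * ram_per_osd_gib + daemon_overhead_gib
print(f"Reserve roughly {ceph_ram_gib} GiB of RAM per node for Ceph, on top of the guests.")
```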
 
Another option might be to use the storage replication feature:
https://pve.proxmox.com/wiki/Storage_Replication

It works like this:
- The disk images of the VM are replicated on a schedule (default is 15 minutes, can be reduced to one minute) to the other cluster node(s)
- Thus, in case of a missing node, the VM can be launched on one of the remaining nodes
- This also reduces transfer times when migrating VMs or LXC containers to another node


Some caveats though:
- This only works with ZFS as storage on every node you want to replicate from/to, and it will obviously need enough space on both hosts
- The VM/LXC will NOT be in exactly the same state, but in the state of the last successful sync (OK if you can live with losing the work done since that point in time, otherwise not; see the small sketch below)
- Together with a QDevice it's a nice solution for two-node setups (homelab, small businesses, etc.), but as soon as you have three real nodes it's time to think about different options like a NAS/Ceph or whatever fits better
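The sketch mentioned in the second caveat: with periodic replication, the worst case is losing everything written since the last successful sync, so the schedule directly bounds the possible data loss. Numbers are purely illustrative.

```python
# Illustration of the data-loss window with scheduled storage replication:
# a node can fail right before the next sync, so up to one full interval of
# changes (plus the time the sync itself takes) can be lost.

def worst_case_loss_minutes(interval_min: float, sync_duration_min: float = 0) -> float:
    return interval_min + sync_duration_min

for interval in (15, 5, 1):  # default schedule is 15 min, can go down to 1 min
    print(f"replication every {interval:2d} min -> up to ~{worst_case_loss_minutes(interval)} min of work lost")
```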

Hope this helps, best regards, Johannes
 
Can it logically be like this:

1. All running VMs will be on external shared storage.

2. The 3 nodes will only carry the Proxmox OS, and the external storage will be mounted via NFS on all three nodes.

That's why I was planning to keep my actual VMs/data on separate hardware, with connectivity between nodes and storage perhaps through some 10G switches.
 
Another option might be to use the storage replication feature:
https://pve.proxmox.com/wiki/Storage_Replication

Yeah, but when it comes to issues on re-launch, you then hit the fact that it's not like normal shared storage:
https://forum.proxmox.com/threads/what-is-wrong-with-high-availability.139056/#post-620923

And even if it were, PVE does not cater for even intermittent unavailability of storage during auto-recovery. There's then an elaborate process for you as the user to ensure that e.g. GlusterFS is available, because... well, that's your problem, really.
 
Can it logically be like this:

1. All running VMs will be on external shared storage.

2. The 3 nodes will only carry the Proxmox OS, and the external storage will be mounted via NFS on all three nodes.

That's why I was planning to keep my actual VMs/data on separate hardware, with connectivity between nodes and storage perhaps through some 10G switches.

It can, but then that single storage is a single point of failure, so I wonder what the benefit of the cluster is.
 
That storage would have RAID with dual controllers. Feel free to suggest any alternative option, with respect to the shared storage only.
 
I would say, the way you started your question, you made it sound like you already have resilient shared storage (however you define it) and you would be fine running that workload on fewer than 3 machines.

Then there was that discussion about quorum. I am not sure 3 nodes really make sense if the 4th machine, the storage one, is the one everything depends on. If you are familiar with other cluster architectures where a STONITH block device is used: PVE cannot do that.

The way Proxmox puts forward its offering, you either have:
- Ceph - dubious at 3 nodes in my opinion; or
- ZFS - for so few nodes it would be okay with replication, but:
  - see my post and link above;
  - it is not realtime => terrible for e.g. DB workloads;
  - it is not meant to be used on RAID controllers;
  - it is not my favourite filesystem for performance; or
- something else shared, external to PVE.

I wonder two things:
- Why don't you run that workload on a single machine with the dual-controller storage? You can still use Proxmox VE (no cluster); or
- why not use something else (something that is fine with a master-slave architecture, can do a STONITH block device, etc.) with that shared storage you are happy with?
 
  • Like
Reactions: Johannes S
I want to achieve a proper HA setup in a production environment with Proxmox. I have lots of VMs with DBs and critical applications, currently on standalone servers. As I already stated, I am planning for the live data/VMs to be on the external storage, and the nodes will take over as required in case of any failure. This is the hardware I have:

1. 3 identical servers

2. One enterprise external storage.
 
I want to achieve a proper HA setup in a production environment with Proxmox. I have lots of VMs with DBs and critical applications, currently on standalone servers. As I already stated, I am planning for the live data/VMs to be on the external storage, and the nodes will take over as required in case of any failure. This is the hardware I have:

1. 3 identical servers

2. One enterprise external storage.

You literally press me for this, but I would not use Proxmox VE if I were constrained by all that you stated. If I were paid by Proxmox and wanted to have a happy customer, I would sell you on having 5 nodes minimum and Ceph (which is considered shared storage).
 
But if you ask whether PVE will work in the setup with NFS, the answer is: it will.

I would just not keep an architect who proposed that on my payroll.

It can, but then that single storage is a single point of failure, so I wonder what the benefit of the cluster is.
 
If you can live with the NAS as a single point of failure, it should work.
But like esi_y said, this is not the most reliable solution. Nonetheless, a lot of people and businesses do this and seem to be happy with it, so who am I to judge?

If you want more reliability, you would need to build a kind of NAS cluster where the NAS nodes replicate their data between them and PVE talks to it through some kind of load balancer.

TrueNAS might offer such setups, but I'm not sure about this.

However: this would need an additional investment (for the second NAS), so the question is whether this money isn't better invested in getting the hardware needed for Ceph.

Or look for another hypervisor/cloud solution.
 
But if I use an enterprise-ready NAS with redundancy as the external storage, then what is the technical issue? I really do not understand.
 
