3-node HA cluster - what happens if 2 nodes fail?

victorhooi

Active Member
Apr 3, 2018
Hi,

I'm planning out a 3-node Proxmox HA cluster, using the same 3 nodes for a Ceph cluster as well.

My question is - how many nodes can fail?

For example, assuming that you have over 3x required storage on the Ceph cluster (i.e. less than 33% disk space used) - what happens if 2 nodes fail? Will it still be able to migrate the VMs to the one remaining node?

I can add additional witness boxes using Raspberry Pis if required.

Regards,
Victor
 
1 node can fail

An additional witness box does not help, as you need n/2+1 nodes for quorum to avoid split-brain situations (which would be really bad).
So a 4-node cluster still needs 3 nodes alive, and the witness box does not help in this case.
A witness box is a good idea for running a 2-node cluster (2 nodes + witness).

You need at least 5 nodes to survive 2 failed nodes.
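To make that arithmetic concrete, here is a minimal sketch of the floor(n/2)+1 rule in plain Python (nothing Proxmox-specific, just the vote counting):

```python
# Minimal sketch: corosync-style quorum arithmetic for a cluster of n voting nodes.
# The cluster only keeps operating while at least floor(n/2) + 1 votes are present.

def quorum(total_nodes: int) -> int:
    """Votes required to keep quorum in a cluster of total_nodes."""
    return total_nodes // 2 + 1

def tolerated_failures(total_nodes: int) -> int:
    """How many nodes may fail while the remaining ones still hold quorum."""
    return total_nodes - quorum(total_nodes)

for n in (2, 3, 4, 5):
    print(f"{n} nodes: quorum={quorum(n)}, tolerated failures={tolerated_failures(n)}")

# 2 nodes: quorum=2, tolerated failures=0
# 3 nodes: quorum=2, tolerated failures=1
# 4 nodes: quorum=3, tolerated failures=1
# 5 nodes: quorum=3, tolerated failures=2
```

As the output shows, going from 3 to 4 nodes buys you nothing in terms of tolerated failures; only at 5 nodes can two fail at once.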

But I can reassure you: a 3-node cluster is a very stable thing, and two nodes failing at the same time is extremely rare. I have been running many cluster systems, mostly in 3-node configurations, for more than 15 years (Scientific Linux since SL4, and nowadays Proxmox), and never had a situation in which 2 nodes failed, except for one case where somebody did something very unusual.
 
Aha, thank you for clarifying.

Would it help to have additional witness boxes? (They're just Raspberry Pis, so reasonably cheap).

So we could do 3 VM servers and 2 Raspberry Pis?

Not sure if Ceph has witness boxes as well.
 
This is completely unnecessary, and will not help with Ceph anyway.

If you have 3 nodes running Ceph you will need 2 of them up anyway, as the usual rule for pools is min_size 2 / size 3.
That means at least two copies of an object have to be available, and they need to be on different hosts.
And be warned: don't fiddle with that, and especially don't lower the minimum!
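To illustrate why two of the three hosts must stay up, here is a minimal sketch of that size/min_size logic in Python (simplified: in reality Ceph makes this decision per placement group, but the idea is the same):

```python
# Minimal sketch of the size=3 / min_size=2 replication rule.
# Each object is kept as SIZE replicas on different hosts; client I/O is
# blocked as soon as fewer than MIN_SIZE replicas remain reachable.

SIZE = 3       # replicas per object (one per host in a 3-node cluster)
MIN_SIZE = 2   # replicas that must stay available for I/O to continue

def io_allowed(hosts_total: int, hosts_failed: int) -> bool:
    replicas_available = min(SIZE, hosts_total) - hosts_failed
    return replicas_available >= MIN_SIZE

for failed in range(3):
    print(f"3 hosts, {failed} failed: I/O allowed = {io_allowed(3, failed)}")

# 3 hosts, 0 failed: I/O allowed = True
# 3 hosts, 1 failed: I/O allowed = True
# 3 hosts, 2 failed: I/O allowed = False
```

So even if you somehow kept Proxmox quorum with two hosts down, the Ceph pool would stop serving I/O anyway.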
 
Right - so assuming that we have 3 "full" VM nodes (each running Proxmox and Ceph), that means there's no advantage to adding any witness nodes (Raspberry Pis), right?

So we can tolerate a single node failure - but any more than that, and we'll need to restore the VM from backups, or something similar, correct?

Is it possible to convert a member of the Proxmox HA cluster to run VMs standalone if that happens? (i.e. for DR)
 
There is no advantage to Raspberry Pis in your configuration.

No, you will not need to restore - just get one of the failed nodes up again. If your hardware is so flaky that you fear such failures, then the hardware belongs in the trash bin before you ever put a system on it.

With usual enterprise server systems with redundant power supplies and a RAID1 for the system disk, we have never had hardware failures of more than one node at the same time in over 15 years of operation (on multiple HA cluster systems, mostly 3-node clusters).
 
For what it is worth, you may wish to reconsider your design philosophy slightly. I looked at a similar project build ~1.5 years ago and my general feeling was that
-- Ceph was possible with a 'modest size' cluster (i.e. 3-5 Proxmox nodes with a hyperconverged Ceph storage pool)
-- but Ceph performance was not 'great' until you got into a bigger deployment, period. Ceph really likes lots (!) of disks and lots of nodes for spreading out the load.

So instead of using Ceph as the baseline build plan, we ended up with
-- 3 nodes per cluster [building multiple parallel 3-node clusters as desired for more capacity if needed; or maybe a slight increase in cluster size, i.e. 3 > 4 > 5 nodes]
-- local shared-nothing hardware RAID storage
-- regular VM backups to a separate shared NFS storage tank (nightly)
-- local HW RAID-backed storage gave nice, robust IO performance; this architecture was 'very simple' and 'just worked'
-- outages were really a non-issue, because the servers had redundant PSUs and redundant HDDs / HW RAID. The real-world risk of failure/downtime due to a non-redundant component fault (CPU, RAM, etc.) was significantly lower than the added risk of outages from increased complexity, higher confusion, greater maintenance impact, and human error that comes with a 'more complex and elegant' design, i.e. using Ceph (or ZFS or DRBD, which were also options I pondered).

The fact that we can do zero-downtime VM migrations in shared-nothing clusters makes "VM re-balancing" possible.
The fact that an NFS shared-storage "VM backup tank" is easy and works really well makes disaster recovery possible with modest downtime (a minimal sketch of that setup follows below).
Clearly we don't have HA fault tolerance here, but again it is a matter of assessing the actual risks of your environment, the impact of different build designs on fault tolerance and failure chance, and what the bad outcome of a brief outage to a VM really is.
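To make the backup-tank piece concrete, here is a minimal sketch of how it can be wired up from the CLI, wrapped in Python; the storage ID, server address, export path, and VMID are hypothetical placeholders:

```python
# Minimal sketch (placeholders throughout, adjust for your site): register a
# shared NFS export as a backup target and run a one-off VM backup to it,
# using the standard Proxmox CLI tools pvesm and vzdump.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Add the NFS "backup tank" as storage with content type 'backup'.
run(["pvesm", "add", "nfs", "backup-tank",     # hypothetical storage ID
     "--server", "192.0.2.10",                 # hypothetical NFS server
     "--export", "/tank/pve-backups",          # hypothetical export path
     "--content", "backup"])

# Back up one VM (ID 100 here) to that storage; in practice this is usually
# driven by the built-in Datacenter -> Backup schedule rather than by hand.
run(["vzdump", "100", "--storage", "backup-tank", "--mode", "snapshot"])
```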

Just my 2 cents!

Maybe Ceph or other things have changed tremendously in the last ~1.5 years - I am not certain - and I would be happy if others feel the desire to comment! :)


Tim

------------------------------------------------------------
Tim Chipman
FortechITSolutions
http://FortechITSolutions.ca

"Happily using Proxmox to support client projects for nearly a decade" :)
 
