Node hardware redundancy for cluster

Gizmot

Dec 20, 2020
In the case of a simple standalone node, conventional wisdom says the minimum configuration is 2 boot drives (RAID1), 2 datastore drives (RAID1), 2 PSUs and 2 NICs.
The point of this redundancy is to maximize uptime.

Now, if you switch to a 3+ node HA setup, you end up with at least 3x of everything.

While it does not hurt anything except your wallet, is there a good argument for dual PSUs, dual NICs and dual boot drives in an HA setup? From a reliability and financial point of view, it seems to make more sense to spend the money on more nodes instead. Am I missing something?
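To put rough numbers on the question, here is a minimal back-of-the-envelope sketch. Everything in it is an assumption for illustration: the per-component availability figures are made up, failures are treated as independent, and "cluster up" is taken to mean that a strict majority of nodes is up (quorum). It says nothing about cost or about the impact of losing a single node.

```python
from math import comb

def redundant(a: float, copies: int = 2) -> float:
    """Availability of `copies` independent parallel parts; one survivor is enough."""
    return 1 - (1 - a) ** copies

def node_availability(parts: dict[str, float], copies: int) -> float:
    """A node is up only if every subsystem (PSU, NIC, boot drive) is up."""
    result = 1.0
    for a in parts.values():
        result *= redundant(a, copies)
    return result

def cluster_availability(node_a: float, nodes: int) -> float:
    """Probability that a strict majority of identical, independent nodes is up."""
    quorum = nodes // 2 + 1
    return sum(
        comb(nodes, k) * node_a**k * (1 - node_a) ** (nodes - k)
        for k in range(quorum, nodes + 1)
    )

# Hypothetical per-component availabilities, NOT measured values.
PARTS = {"psu": 0.999, "nic": 0.999, "boot_drive": 0.998}

single = node_availability(PARTS, copies=1)  # no redundancy inside the node
dual = node_availability(PARTS, copies=2)    # dual PSU / NIC / boot drive

print(f"3 nodes, dual parts:   {cluster_availability(dual, 3):.9f}")
print(f"5 nodes, single parts: {cluster_availability(single, 5):.9f}")
```

Under these assumptions the comparison depends heavily on the numbers plugged in, so treat the output as a way to frame the trade-off, not as an answer.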
 
There is a simple question you need to ask yourself: what impact are you willing to accept (e.g. any hardware failure being a node failure)?
And maybe: what is your SLA?

If you run Kubernetes clusters / containerized apps, that might be a valid approach. Apps running in VMs are not always stateless, so either you spend on hardware or you spend on logic at the app layer (e.g. a guest cluster), which often adds a lot of administrative overhead.
 
There is a certain appeal to hardware failure == node failure, in that you simply take that node out to fix it and replace it with another one.
I guess the deciding factor is the node price. When your nodes are expensive dual-CPU Xeons with 200+TB RAM each, I guess the cost of the redundant hardware is irrelevant.
What I had in mind is inexpensive 1U servers for remote locations.
 
Again: what is your expectation / SLA and your requirement?
There is no definitive answer to that question.
I have customers with awesome stability on single nodes, because it is just a stupidly simple setup.
On the other hand, patching is a pain because you need to shut down services. That brings me back to the SLA question.

And especially for ROBO sites (which you can't reach easily or without travel hassle), relying on node redundancy alone might not be favourable. Having a system that can withstand certain failures relaxes the need to jump right into the car to fix the gear before a second node fails and potentially takes the cluster down.
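As a rough illustration of why the repair window matters for remote sites, here is a small sketch. It assumes node failures are independent with a constant failure rate (an exponential model), and the MTBF and repair times are hypothetical numbers picked for the example, not figures from any real hardware.

```python
from math import exp

def p_second_failure(remaining_nodes: int, repair_hours: float,
                     node_mtbf_hours: float) -> float:
    """Chance that at least one remaining node fails before the first is repaired.

    Assumes independent nodes with a constant failure rate of 1 / MTBF.
    """
    return 1 - exp(-remaining_nodes * repair_hours / node_mtbf_hours)

NODE_MTBF_HOURS = 20_000.0  # hypothetical MTBF for a node without internal redundancy

# 3-node cluster with one node already down: 2 nodes remain.
for label, hours in [("on-site fix, 4 h", 4.0), ("remote site, 72 h", 72.0)]:
    risk = p_second_failure(2, hours, NODE_MTBF_HOURS)
    print(f"{label}: {risk:.4%} chance of losing a second node")
```

In this model the risk grows roughly in proportion to the time the failed node stays down, which is the argument for hardware that survives a single component failure at sites where repairs involve travel.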