Node hardware redundancy for cluster

Gizmot · Dec 20, 2020

For a simple standalone node, conventional wisdom says the minimum configuration is 2 boot drives (RAID1), 2 datastore drives (RAID1), 2 PSUs and 2 NICs.
The point of this redundancy is to maximize uptime.

Now, if you switch to a 3+ node HA setup, you end up with at least 3x of everything.

While it does not hurt, except for your wallet, is there a good argument for dual PSUs, dual NICs and dual boot drives in an HA setup? From a reliability and financial point of view, it seems to make more sense to put your money into more nodes instead. Am I missing something?
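To put rough numbers on this trade-off, here is a minimal sketch in Python. Everything in it is an assumption for illustration only: the per-component availabilities are made up, failures are treated as independent, and "cluster up" is taken to mean a majority of nodes is up. It compares a node built from single parts with one built from redundant pairs, then looks at majority availability for 3 and 5 nodes.

```python
from math import comb

def series(*avail):
    """Availability of components that must ALL work (any failure = node down)."""
    result = 1.0
    for a in avail:
        result *= a
    return result

def redundant_pair(a):
    """Availability of two independent components where one is enough."""
    return 1.0 - (1.0 - a) ** 2

def k_of_n(k, n, a):
    """Probability that at least k of n independent nodes are up."""
    return sum(comb(n, i) * a**i * (1 - a) ** (n - i) for i in range(k, n + 1))

# Hypothetical per-component availabilities (assumptions, not measured data)
PSU, NIC, BOOT = 0.999, 0.9995, 0.999

single_node = series(PSU, NIC, BOOT)
dual_node = series(redundant_pair(PSU), redundant_pair(NIC), redundant_pair(BOOT))

print(f"node with single parts:        {single_node:.6f}")
print(f"node with redundant parts:     {dual_node:.6f}")
print(f"3 nodes, single parts, 2 up:   {k_of_n(2, 3, single_node):.8f}")
print(f"3 nodes, redundant parts, 2 up:{k_of_n(2, 3, dual_node):.8f}")
print(f"5 nodes, single parts, 3 up:   {k_of_n(3, 5, single_node):.8f}")
```

This only models whether the cluster as a whole stays up. It says nothing about the impact of losing a single node (workloads restarting elsewhere, reduced capacity), and real failures are often correlated (shared power, shared switch), which this sketch ignores.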

apoc · Dec 20, 2020

There is a simple question you need to ask yourself: what impact are you willing to suffer (e.g. any hardware failure is a node failure)?
And maybe: what is your SLA?

If you run Kubernetes clusters / containerized apps, that might be a valid approach. Apps running in VMs aren't always stateless, so either you spend on hardware or you spend on logic at the app layer (e.g. a guest cluster), which often adds a lot of administrative overhead.

Gizmot · Dec 20, 2020

There is a certain appeal to hardware failure == node failure, in that you simply take that node out to fix it and replace it with another one.
I guess the deciding factor is the node price. When your nodes are expensive dual-CPU Xeons with 200+TB RAM each, I guess the cost of redundant hardware is irrelevant.
What I had in mind is inexpensive 1U servers for remote locations.

apoc · Dec 21, 2020

Again: what is your expectation / SLA and your requirement?
There is no definitive answer to that question.
I have customers with awesome stability on single nodes, because it is just a stupidly simple setup.
On the other hand, patching is a pain, because you need to shut down services. That brings me back to the SLA question.

And especially for ROBO (remote office / branch office) sites, which you can't reach easily or without travel hassle, relying on node redundancy alone might not be favourable. Having a system which can withstand certain failures relaxes the need to jump right into the car and fix the gear before a second node fails and potentially brings the cluster down.
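The "second node fails before you get there" risk can be sketched roughly like this. It is a simplified model with assumed numbers: node failures are treated as independent with a constant failure rate (exponential model), and the MTBF and repair-time values are hypothetical, not taken from any real hardware.

```python
import math

def p_second_failure(remaining_nodes, mtbf_hours, repair_hours):
    """Probability that at least one of the remaining nodes fails
    before the first failed node is repaired (exponential model)."""
    rate = remaining_nodes / mtbf_hours  # combined failure rate per hour
    return 1.0 - math.exp(-rate * repair_hours)

# Hypothetical node-level MTBF values (assumptions):
# a node without redundant parts goes down more often than one with them.
MTBF_PLAIN = 20_000      # hours
MTBF_REDUNDANT = 80_000  # hours

# 3-node cluster, one node already down, so 2 nodes remain.
for label, repair_hours in (("same-day fix", 8), ("one-week trip", 7 * 24)):
    plain = p_second_failure(2, MTBF_PLAIN, repair_hours)
    redundant = p_second_failure(2, MTBF_REDUNDANT, repair_hours)
    print(f"{label}: plain nodes {plain:.4%}, redundant nodes {redundant:.4%}")
```

The longer the repair window at a remote site, the more the node-level failure rate matters, which is the argument for keeping some hardware redundancy per node there.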
