With the new HA-Disarm feature, is there documentation for NUT setup on clusters?

Feb 1, 2024
Hi everyone,

With the new HA-Disarm feature introduced in version 9.2, is there any official documentation or best-practice guide for setting up NUT (Network UPS Tools) in clustered environments?

I opened a support ticket a while ago and was pointed to that feature, but so far I couldn't find any documentation on how to use it together with NUT.

Here is the original support request:
I'm planning to implement NUT on my cluster. The setup consists of 5 servers, which also run Ceph. At first I planned to run the NUT primary on the first node and the rest as secondaries. But then I realized: when 2 nodes shut down, everything works fine. When the third one goes down, however, the cluster loses quorum, so the remaining 2 nodes would fence themselves and simply restart instead of shutting down.
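
To make the quorum arithmetic behind this concrete, here is a minimal sketch. It assumes one vote per node and a plain majority rule; the function names are hypothetical and not part of Proxmox or NUT.

```python
def votes_needed(total_nodes: int) -> int:
    """Votes required for a strict majority, assuming one vote per node."""
    return total_nodes // 2 + 1


def is_quorate(total_nodes: int, nodes_up: int) -> bool:
    """True if the nodes that are still up hold a majority of all votes."""
    return nodes_up >= votes_needed(total_nodes)


TOTAL_NODES = 5  # cluster size from the support request

# Walk through a staggered shutdown, one node at a time.
for nodes_down in range(TOTAL_NODES + 1):
    nodes_up = TOTAL_NODES - nodes_down
    state = "quorate" if is_quorate(TOTAL_NODES, nodes_up) else "not quorate"
    print(f"{nodes_down} down, {nodes_up} up: {state}")
```

With 5 nodes the majority is 3 votes, so the cluster stays quorate until the third node shuts down. That is the point where the remaining nodes would fence themselves if HA is still armed.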
 
Hi, thanks for the link.
I know about it. It describes the basic functionality of the disarm function.

I was hoping for documentation of a safe shutdown procedure for when a power outage happens. I would like to follow a guide (ideally from the developers) for setting up NUT in such a cluster. Best practices and known limitations would also be nice, something like this article:
https://pve.proxmox.com/wiki/Performance_Tweaks#Disk_Cache
 
Hi,

I think it mostly depends on your exact setup: whether you use Ceph, a SAN, local storage, etc.

For example, in my case I let the power cut take the node down. I don't attempt a safe shutdown.

Best regards,
 
I totally understand that individual setups have their own dependencies.
But I also think a guide for officially supported technologies and their pitfalls would be nice.

IMHO a general guideline for cluster maintenance and safety in such cases (and I count a power loss with UPS backup as part of that) would help many people secure their setups and avoid preventable mistakes.
 
Hi,

If I'm not mistaken, there's already information in the official documentation regarding maintenance (node and Ceph).

As for the UPS (with NUT, for example), since it's not part of the Proxmox core but rather a customization made by the server administrator, I don't see what more Proxmox could recommend; it's primarily the administrator who has to decide on the desired behavior.

This looks more like a risk assessment (by the administrator) than a "general guideline" that Proxmox should provide.

For example, we host nodes/clusters in data centers where we cannot read the state of the UPS, so we prefer to let the nodes/cluster simply lose power rather than lose power while they are in the middle of shutting down (we feel it's less dangerous that way).
We accept the risk of a power cut because we use Ceph (so we lose compute and storage at the same time) and have 2 PBS instances (1 local, 1 remote). We assume that the power cut won't harm the data (and if it does, we restore from backup), and we can just launch the whole cluster again afterwards. We have already been through this once, so we know the cluster can survive a power cut.

Note: This is just my opinion

Best regards,
 