Install NUT directly on Proxmox VE and control guests from here

Hi @philkunz, I've got a few questions about your shutdown tool.
We have 2 nodes with 2 UPSes (SNMP enabled), and I can see you can add multiple UPSes to a single node.
1) How does it handle multiple UPSes? Each UPS device appears to have its own thresholds, but there is also a group you can add UPSes to, with its own thresholds. If one UPS goes on battery and the other is on mains power, what happens? Similarly, if both UPSes go into battery mode, are the thresholds based on whichever UPS is lowest, on an aggregate, or on the highest? Does this depend on whether they are in a group or added as standalone?
2) How does it handle Proxmox clusters? In our ESXi environment, when the shutdown event occurred it would sometimes try to migrate the VMs to the other host, even though both hosts were being asked to shut down. If I set up HA on my VMs, does your tool prevent HA migration and cleanly shut down the VMs as quickly as possible?
Regards
Damien
 
Hi Damien,
As of v5.7.0, this is how it works:

1) Multiple UPSes / groups

You can attach actions either directly to individual UPS devices, or to a group.
  • UPS-level actions are evaluated per UPS, independently.
  • Group-level actions are evaluated across all members after a full poll cycle.
The important bit is: group thresholds are not aggregated. There is no lowest/highest/average battery calculation across the group.
Instead, each group action evaluates each member UPS against that action’s own thresholds:
  • redundant
    • power-change logic treats the group as “on battery” only when all member UPSes are on battery
    • threshold-based actions fire only when all member UPSes are on battery and below that action’s thresholds
  • nonRedundant
    • power-change logic treats the group as “on battery” when any member UPS is on battery
    • threshold-based actions fire when any member UPS is on battery and below that action’s thresholds
So for your examples:
  • If one UPS is on battery and the other is still on mains:
    • direct UPS actions on that affected UPS can still trigger based on that UPS alone
    • a redundant group action will not trigger yet
    • a nonRedundant group action can trigger once that affected UPS crosses the configured threshold
  • If both UPSes are on battery:
    • redundant waits until both are below the group action threshold
    • nonRedundant fires as soon as either one is below the group action threshold
So the behavior depends on where you attach the action:
  • on the UPS = standalone/per-UPS logic
  • on the group = group mode logic
For a dual-UPS redundant host, I would usually recommend:
  • put alerting actions on the individual UPSes if you want
  • put the actual Proxmox shutdown + host shutdown on the group in redundant mode
One extra safety behavior: destructive group actions (proxmox, shutdown) are suppressed if a required group member is unreachable, so it does not make a shutdown decision on partial data.
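
The group rules above can be sketched roughly as follows. This is only an illustration, not nupst's actual code: the class and function names are hypothetical, every group member is treated as "required", and "below thresholds" is assumed to mean battery percent or runtime under the configured limit.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UpsStatus:
    reachable: bool
    on_battery: bool
    battery_percent: float
    runtime_minutes: float


@dataclass
class Thresholds:
    battery_percent: float
    runtime_minutes: float


def below_thresholds(ups: UpsStatus, t: Thresholds) -> bool:
    # Assumption: a UPS counts only if it is reachable, on battery, and
    # under at least one of the action's limits.
    return ups.reachable and ups.on_battery and (
        ups.battery_percent < t.battery_percent
        or ups.runtime_minutes < t.runtime_minutes
    )


def group_action_fires(members: List[UpsStatus], t: Thresholds,
                       mode: str, destructive: bool = False) -> bool:
    # Destructive actions are suppressed when any member is unreachable,
    # so no shutdown decision is made on partial data.
    if destructive and any(not m.reachable for m in members):
        return False
    # Each member is checked on its own; nothing is averaged or aggregated.
    hits = [below_thresholds(m, t) for m in members]
    if mode == "redundant":
        return bool(hits) and all(hits)
    if mode == "nonRedundant":
        return any(hits)
    raise ValueError(f"unknown group mode: {mode}")
```

With one UPS on battery at 20% and the other on mains, `group_action_fires` returns False for "redundant" and True for "nonRedundant" against a 30% threshold.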

2) Proxmox clusters / HA

NUPST is still node-local, not a cluster-wide coordinator.
What it does:
  • shuts down the VMs and LXCs on the local node
  • waits for graceful shutdown
  • force-stops remaining guests if configured
  • then the host shutdown action can run after that
What changed for HA-managed guests:
  • there is now an HA-aware mode: proxmoxHaPolicy: "haStop"
  • in that mode, HA-managed guests are told to go to stopped through the Proxmox HA layer instead of only sending plain qm/pct shutdown
That is the correct path if you want HA-managed guests to stop cleanly without HA treating them as failed and trying to move/restart them elsewhere.
What it still does not do:
  • no cluster-wide shutdown orchestration
  • no global HA disable
  • no coordination between hosts beyond “each host handles itself”
So for a 2-node cluster, the intended setup is:
  • run NUPST on both nodes
  • use proxmoxHaPolicy: "haStop" on the Proxmox action
  • place the Proxmox action before the host shutdown action
That way each node stops its own HA/non-HA guests properly, then shuts itself down, rather than trying to evacuate workloads during a power event.
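
As a rough sketch of that node-local ordering (guests first, host last), assuming hypothetical callables that stand in for whatever actually talks to Proxmox. None of these names come from nupst itself; for HA-managed guests, `request_stop` would be the place where the stop goes through the HA layer.

```python
import time
from typing import Callable, List


def shutdown_local_node(
    guests: List[str],
    request_stop: Callable[[str], None],   # graceful stop (hypothetical)
    is_running: Callable[[str], bool],     # guest state check (hypothetical)
    force_stop: Callable[[str], None],     # hard stop (hypothetical)
    shutdown_host: Callable[[], None],     # host shutdown (hypothetical)
    timeout_s: float,
    force: bool,
    poll_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Stop local guests, wait, optionally force-stop, then stop the host."""
    for guest in guests:
        request_stop(guest)

    # Wait for graceful shutdown until the timeout expires.
    deadline = clock() + timeout_s
    remaining = [g for g in guests if is_running(g)]
    while remaining and clock() < deadline:
        sleep(poll_s)
        remaining = [g for g in remaining if is_running(g)]

    # Force-stop whatever is still running, if configured.
    if force:
        for guest in remaining:
            force_stop(guest)

    # The host shutdown always comes after the guest handling.
    shutdown_host()
    return remaining
```

The function only touches guests it is handed, which mirrors the "each host handles itself" model: nothing here looks at other nodes.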

If you think parts of this should be solved differently, feel free to start a discussion about how things should be and why. We want to make nupst better for everyone.

Regards,
Phil
 
Hi @philkunz
Thanks for the very detailed answer, this helps a lot with how I set the thresholds up. I was on 5.6, so I upgraded to get the HaPolicy stuff. It doesn't look like there is an edit action command on the CLI, so I edited the config file and restarted the service. So a suggestion would be an action edit command (editing the groups and UPSes worked great). Apart from that, the changes you have quickly added (and the tool in general) mean I can use this instead of figuring out NUT.

As for how you are handling the HA/cluster environment, I think it works fine for a small environment like ours (I can have two terminals open and repeat commands as necessary). I just need it to shut down as cleanly and quickly as possible. For large clusters I cannot provide any guidance, but I'm sure they would like something sitting in the Datacenter view to control and automate it all. Is it feasible to copy the config file to multiple nodes, i.e. do duplicate IDs cause any issues? It doesn't seem so to me, as nupst doesn't do host-to-host comms afaict.

BTW, our APC Smart-UPS 3000 using the Network Management Card via SNMPv1 reports runtime in ticks, not the more common minutes as suggested by your wizard. I was surprised to see I had 1000s of minutes left on my UPS, did I just download more battery :)? This may of course be a one-off, I don't know, but it was easy to change the setup of the UPS via the CLI.
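
For reference, a small conversion sketch, assuming the value is a standard SNMP TimeTicks counter (hundredths of a second). The function name is made up for illustration and is not part of nupst.

```python
# SNMP TimeTicks count hundredths of a second, so one minute is 6000 ticks.
TICKS_PER_MINUTE = 100 * 60


def timeticks_to_minutes(ticks: int) -> float:
    """Convert an SNMP TimeTicks runtime value to minutes."""
    return ticks / TICKS_PER_MINUTE
```

A reading of 180000 ticks is therefore 30 minutes, which is why the raw number looks absurd when it is interpreted as minutes.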

Now I just have to find downtime to test it all. >_<

Regards
Damien
 
Hi @philkunz
I had a problem with your nupst tool the other day. I had two hosts plugged into a switch that one of the UPSes was also plugged into, and I had to power cycle the switch. nupst then reported the current status of that UPS as unknown and shut down all our VMs. It did not, however, start a shutdown of the nodes.
Secondly, your website on code.foss.global cannot be reached. Actually, I can't even connect to code.foss.global (CONNECTION_REFUSED).
Regards
Damien
 
@damiengm -> code.foss.global reachability: the team did some datacenter/routing/ingress work over the last few days since we got some upgrades in the uplinks. It might have been this, plus an outdated policy on what is publicly reachable that was live for some time. Should be fine now. We also have new monitoring coming up that will catch those issues faster.

Regarding the nupst issue, I found the cause: Proxmox guest shutdown actions could still run when the UPS became unreachable, even though host shutdown was already protected. That is fixed in nupst now.

The team also added a configurable fail-safe: you can now set an unknownMinutes threshold, so nupst will not act immediately on a switch/network outage, but can still shut down after the UPS has been unknown/unreachable for a configured number of minutes.
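
Conceptually, the fail-safe behaves like the check below. This is a sketch only: `unknownMinutes` is the setting named above, while the function and parameter names are hypothetical and the real implementation may differ.

```python
from datetime import datetime, timedelta
from typing import Optional


def unknown_failsafe_due(unknown_since: Optional[datetime],
                         now: datetime,
                         unknown_minutes: float) -> bool:
    """Return True once the UPS has been unknown for unknown_minutes."""
    # A reachable UPS has no "unknown since" timestamp, so nothing fires.
    if unknown_since is None:
        return False
    # A short switch reboot stays under the limit and is ignored.
    return now - unknown_since >= timedelta(minutes=unknown_minutes)
```

With a limit of 10 minutes, a two-minute switch power cycle would not trigger anything, while a UPS that stays unreachable for 10 minutes or more would.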

This is released as @serve.zone/nupst@5.12.0
 
Hi @philkunz
Thanks for the quick update and the fix release, but I still can't get to code.foss.global (Chrome error: ERR_CONNECTION_CLOSED); same for community.foss.global. Regards, Damien
 
Ah yes, true. We had to take it offline again. The team is currently implementing some DDoS mitigation. Somehow we got discovered by multiple scrapers at the same time over the weekend, which rendered everything unusable; our DDoS protection was not up to the job, and the database got hammered. I will report back once everything is back up with mitigations in place. Sorry for the inconvenience. We will also have a mirror up on gitlab.com that will be used as a fallback by any tooling like nupst.