UPS - Shutdown entire cluster

David Herselman · Jan 9, 2021

We have nut-server successfully monitoring a UPS with nut-client running on all nodes.
When power fails it correctly and simultaneously initiates 'init 0' on all nodes, but this then causes problems.

Nodes that only provide Ceph storage shut down before the VMs are given a chance to stop (yes, the QEMU guest agent is installed everywhere). HA then tries to migrate VMs around the cluster but puts them in an error state, because storage requests start timing out as Ceph placement groups become inactive.

We want HA, we want migrate on shutdown (rolling upgrades) and we want failback for when upgraded nodes reboot. Is there a technique for a once-off shutdown of all VMs followed by a shutdown of the nodes, so that we don't have to clear HA error states on guests when everything starts up again?

JoeDragon · Jan 10, 2021

Is your storage running in a VM?
Also, can you set a delay on the storage nodes for shutdown on UPS power? Maybe also set Ceph maintenance mode.
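
On the Ceph side there is, as far as I know, no single "maintenance mode" switch. The usual approach before a planned shutdown is to set cluster flags such as noout and clear them again afterwards. A minimal Python sketch of that idea, assuming the `ceph` CLI is available on the node that runs it. The flag list and the function names are illustrative choices, not something taken from this thread:

```python
import subprocess

# Flags often set before a planned full-cluster shutdown (an assumption;
# adjust to your own procedure). "noout" stops OSDs from being marked out.
MAINTENANCE_FLAGS = ["noout", "norebalance", "norecover"]


def ceph_flag_commands(action, flags=MAINTENANCE_FLAGS):
    """Build the `ceph osd set|unset <flag>` commands without running them."""
    if action not in ("set", "unset"):
        raise ValueError("action must be 'set' or 'unset'")
    return [["ceph", "osd", action, flag] for flag in flags]


def apply_ceph_flags(action, dry_run=True):
    """Run the commands, or only print them when dry_run is True."""
    for cmd in ceph_flag_commands(action):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

Calling `apply_ceph_flags("set")` before the nodes go down and `apply_ceph_flags("unset")` once all OSDs are back up would bracket the outage. The dry-run default is there so the commands can be reviewed first.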

David Herselman · Jan 10, 2021

The cluster uses Ceph to provide storage to the VMs. I'm essentially trying to find a way to stop all guests cluster-wide and only then shut down the nodes, with fencing disabled. I presume Proxmox needs a sort of maintenance mode in which the high availability service is temporarily disabled, as we want everything to start again at power-up.

This is for an on-premise Proxmox cluster at a client. Yes, they have generators and a large UPS, but the generators are set not to fire up automatically over weekends.

i.e. we could easily script most of this sequencing (stop guests, wait 5 minutes, stop the local HA resource manager processes and then initiate shutdown of all nodes), but then the VMs would all remain off when the cluster starts again.
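
One possible way around the "VMs remain off" problem would be to record which HA resources were requested as started before the scripted shutdown, and to request that state again after power-up. The Python sketch below only builds the commands. It assumes that `pvesh get /cluster/ha/resources --output-format json` returns a list of objects with a `sid` field and an optional `state` field. Treating a missing `state` as "started" is my assumption, and the function names are hypothetical:

```python
import json


def started_sids(resources_json):
    """Return the SIDs (e.g. 'vm:100') whose requested HA state is 'started'."""
    resources = json.loads(resources_json)
    # Assumption: an entry without "state" uses the default, "started".
    return [r["sid"] for r in resources if r.get("state", "started") == "started"]


def ha_state_commands(sids, state):
    """Build one `ha-manager set <sid> --state <state>` command per SID."""
    return [["ha-manager", "set", sid, "--state", state] for sid in sids]
```

The list from `started_sids` would need to be saved somewhere that survives the outage (a file on local disk, for example), so that `ha_state_commands(saved, "started")` can be replayed once the cluster has quorum and Ceph is healthy again.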

apoc · Jan 10, 2021

Orchestrated shutdowns are always a challenge. I have not come across any built-in solution for this in the last decades. The reason is probably that it is very complex to establish, especially foreseeing all the things that could go wrong during the process. I did a project at a customer which ended up taking 6 months of work to achieve what was necessary and cover all scenarios (even failure of the primary UPS server).

The best thing you could do IMHO is to run the sequence from a central UPS server that controls the process. That makes sure the timings work out.
E.g.
- shutdown all VMs (are there any dependencies within the VMs???)
- after time X kill remaining VMs
- shutdown compute nodes
- shutdown CEPH nodes

Everything else is a gamble...
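
The sequence above could be sketched in Python roughly as follows. This is only the ordering and timeout logic. The actual actions (how guests are stopped, how nodes are powered off) are passed in as callables, and all of the names here are hypothetical:

```python
import time


def wait_until(done, timeout, interval=5.0, sleep=time.sleep, clock=time.monotonic):
    """Poll done() until it returns True or `timeout` seconds have passed."""
    deadline = clock() + timeout
    while True:
        if done():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def shutdown_cluster(ops, vm_timeout=300.0, **wait_kwargs):
    """Run the phases in order; `ops` maps phase names to site-specific callables."""
    log = []
    ops["shutdown_vms"]()
    log.append("shutdown_vms")
    # "After time X kill remaining VMs": only escalate if the wait times out.
    if not wait_until(ops["all_vms_stopped"], vm_timeout, **wait_kwargs):
        ops["kill_vms"]()
        log.append("kill_vms")
    ops["shutdown_compute_nodes"]()
    log.append("shutdown_compute_nodes")
    ops["shutdown_ceph_nodes"]()
    log.append("shutdown_ceph_nodes")
    return log
```

Note that this sketch does not wait for the compute nodes to be fully down before it moves on to the Ceph nodes, and it does nothing about guest dependencies or a failing orchestrator. Those are exactly the cases that make a real implementation expensive.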

billybob321 · Jul 22, 2021

tburger said:
Orchestrated shutdowns are always a challenge. I have not come across any built-in solution for this in the last decades. The reason is probably that it is very complex to establish, especially foreseeing all the things that could go wrong during the process. I did a project at a customer which ended up taking 6 months of work to achieve what was necessary and cover all scenarios (even failure of the primary UPS server).

The best thing you could do IMHO is to run the sequence from a central UPS server that controls the process. That makes sure the timings work out.
E.g.
- shutdown all VMs (are there any dependencies within the VMs???)
- after time X kill remaining VMs
- shutdown compute nodes
- shutdown CEPH nodes

Everything else is a gamble...

This sounds like a good concept. I wonder if anyone has done this.

Perhaps it would be a good job for a Raspberry Pi.

alexskysilk · Jul 22, 2021

I don't really see any point in issuing an init 0 in the first place. What's the benefit of doing it versus letting the nodes die naturally IF battery life is exceeded?

What you CAN do instead is issue a stopall to the nodes bearing VMs (https://pve.proxmox.com/pve-docs/api-viewer/index.html#/nodes/{node}/stopall). IF the UPS runs out of power, all machines will stop, but no damage should occur since no VM disk is mounted. It also ensures that all nodes end up in a power-out state, which means that as long as they are set to power on after power loss, they will boot immediately upon power restoration. As long as your VMs are in HA groups, they will restart on power recovery.
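
A minimal Python sketch of issuing stopall from a script, assuming it runs as root on one cluster node and that `pvesh create /nodes/{node}/stopall` is an acceptable way to reach the endpoint linked above. The node names and function names are hypothetical:

```python
import subprocess


def stopall_command(node):
    """Build the pvesh call for POST /nodes/{node}/stopall."""
    if not node or "/" in node:
        raise ValueError(f"invalid node name: {node!r}")
    return ["pvesh", "create", f"/nodes/{node}/stopall"]


def stop_all_guests(nodes, run=subprocess.run):
    """Issue stopall for each node in turn; return the node names that failed."""
    failed = []
    for node in nodes:
        result = run(stopall_command(node))
        if result.returncode != 0:
            failed.append(node)
    return failed
```

For example, `stop_all_guests(["pve1", "pve2"])` would be called from the UPS "on battery" hook. How HA-managed guests behave after a stopall is something to test on your own cluster rather than take from this sketch.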
