UPS - Shutdown entire cluster

We have nut-server successfully monitoring a UPS with nut-client running on all nodes.
When power goes away it correctly and simultaneously initiates 'init 0' on all nodes but this then causes problems.

Nodes that only provide Ceph storage shut down before the VMs are given a chance to stop (yes, the QEMU guest agent is installed everywhere). HA then tries to migrate VMs around the cluster, but puts them in an error state because storage requests start timing out as Ceph placement groups become inactive.

We want HA, we want migrate-on-shutdown (for rolling upgrades) and we want failback for when upgraded nodes reboot. Is there a technique to shut down all VMs as a once-off and then shut down the nodes, so that we don't have to clear HA error states on guests when it all starts up again?
 
Your storage is running in the VM?
Also, can you set a delay on the storage nodes for shutdown on UPS power? And maybe also set Ceph maintenance mode.
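On the "maintenance mode" point: Ceph has no single maintenance switch as far as this sketch assumes; the usual equivalent is to set OSD flags such as `noout` before a planned shutdown and unset them once everything is back. A minimal Python sketch, assuming the `ceph` CLI is available on the node that runs it. The chosen flag set and the helper names (`ceph_flag_commands`, `apply`) are illustrative assumptions, not something from this thread:

```python
import subprocess
from typing import Callable, List, Sequence

# Flags often set before a planned full-cluster shutdown. The exact set is an
# assumption for this sketch; adjust it to your own Ceph runbook.
MAINTENANCE_FLAGS = ("noout", "norebalance", "norecover")


def ceph_flag_commands(action: str,
                       flags: Sequence[str] = MAINTENANCE_FLAGS) -> List[List[str]]:
    """Build `ceph osd set|unset <flag>` commands without running them."""
    if action not in ("set", "unset"):
        raise ValueError("action must be 'set' or 'unset'")
    return [["ceph", "osd", action, flag] for flag in flags]


def apply(commands: Sequence[List[str]], run: Callable = subprocess.run) -> None:
    """Run each command in order; check=True aborts on the first failure."""
    for cmd in commands:
        run(cmd, check=True)


if __name__ == "__main__":
    # Dry run: only print what would be executed before the shutdown.
    for cmd in ceph_flag_commands("set"):
        print(" ".join(cmd))
```

Before the shutdown you would call `apply(ceph_flag_commands("set"))`, and after power-up, once all OSDs have rejoined, `apply(ceph_flag_commands("unset"))`.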
 
The cluster uses Ceph to provide storage to the VMs. I'm essentially trying to find a way to stop all guests cluster-wide and only then shut down the nodes, with fencing disabled. I presume Proxmox would need a sort of maintenance mode in which the high availability service is temporarily disabled, as we want everything to start again at power-up.

This is for an on-premises Proxmox cluster at a client. Yes, they have generators and a large UPS, but the generators are set not to start automatically over weekends.

I.e. we could easily script most of this sequencing (stop guests, wait 5 minutes, stop the local HA resource manager processes and then initiate shutdown of all nodes), but then the VMs would all remain off when everything starts again.
 
Orchestrated shutdowns are always a challenge. I have not come across any built-in solution for this in the last decades. The reason is probably that it is very complex to build, especially foreseeing all the things that could go wrong during the process. I did a project at a customer that ended up taking 6 months of work to achieve what was necessary and cover all scenarios (even failure of the primary UPS server).

The best thing you could do, IMHO, is to run the sequence from a central UPS server that controls the whole process. That makes sure the timings work out.
E.g.
- shut down all VMs (are there any dependencies between the VMs?)
- after time X, kill the remaining VMs
- shut down the compute nodes
- shut down the Ceph nodes

Everything else is a gamble...
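
The sequence above could be sketched roughly like this in Python. It is only a skeleton under stated assumptions: `Cluster` and its callables are hypothetical adapters that you would wire up to whatever remote access you use (SSH, the Proxmox API), and the grace period and poll interval are placeholder values.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Cluster:
    """Hypothetical adapter; every callable wraps your own remote access."""
    request_guest_shutdown: Callable[[], None]   # ask all guests to shut down cleanly
    running_guests: Callable[[], List[str]]      # IDs of guests still running
    kill_guest: Callable[[str], None]            # hard-stop a single guest
    shutdown_node: Callable[[str], None]         # power off one node
    compute_nodes: List[str]
    ceph_nodes: List[str]


def orchestrate_shutdown(cluster: Cluster,
                         grace_seconds: float = 300.0,
                         poll_seconds: float = 5.0,
                         sleep: Callable[[float], None] = time.sleep,
                         clock: Callable[[], float] = time.monotonic) -> List[str]:
    """Run the four phases in order and return a log of what was done."""
    log = ["requested guest shutdown"]
    cluster.request_guest_shutdown()

    # Phase 1: wait up to grace_seconds for guests to stop on their own.
    deadline = clock() + grace_seconds
    while cluster.running_guests() and clock() < deadline:
        sleep(poll_seconds)

    # Phase 2: after time X, kill whatever is still running.
    for guest in cluster.running_guests():
        cluster.kill_guest(guest)
        log.append(f"killed {guest}")

    # Phase 3: compute nodes first, while storage is still available.
    for node in cluster.compute_nodes:
        cluster.shutdown_node(node)
        log.append(f"shutdown compute {node}")

    # Phase 4: storage nodes last.
    for node in cluster.ceph_nodes:
        cluster.shutdown_node(node)
        log.append(f"shutdown ceph {node}")
    return log
```

Nodes that run both guests and Ceph OSDs would belong in the last group, so that storage stays up until every guest is down. Injecting `sleep` and `clock` is only there so the timing logic can be tested without waiting.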
 
This sounds like a good concept. I wonder if anyone has done this.

Perhaps it would be a good job for a Raspberry Pi.
 
I don't really see any point in issuing an init 0 in the first place. What's the benefit of doing it versus letting the nodes die naturally if battery life is exceeded?

What you can do instead is issue a stopall to the nodes bearing VMs (https://pve.proxmox.com/pve-docs/api-viewer/index.html#/nodes/{node}/stopall). If the UPS runs out of power, all machines will stop, but no damage should occur since no VM disk is mounted. It also ensures that all nodes end up in a power-loss state, which means that, as long as they are set to power on after power loss, they will boot immediately upon power restoration. As long as your VMs are in HA groups, they will restart on power recovery.
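
A minimal Python sketch of building that stopall call against the HTTP API, assuming an API token is used for authentication and the API listens on the default port 8006. The host name, node name and token values are placeholders, and `stopall_request` is an illustrative helper, not part of any Proxmox library:

```python
import urllib.request
from urllib.parse import quote


def stopall_request(host: str, node: str,
                    token_id: str, token_secret: str) -> urllib.request.Request:
    """Build (but do not send) a POST to /nodes/{node}/stopall.

    token_id has the form USER@REALM!TOKENNAME; the secret is the token value.
    """
    url = f"https://{host}:8006/api2/json/nodes/{quote(node, safe='')}/stopall"
    return urllib.request.Request(
        url,
        data=b"",          # stopall is a POST; no parameters are passed here
        method="POST",
        headers={"Authorization": f"PVEAPIToken={token_id}={token_secret}"},
    )


if __name__ == "__main__":
    # Placeholder values for illustration only.
    req = stopall_request("pve1.example.com", "pve1", "root@pam!ups", "secret")
    print(req.get_method(), req.full_url)
```

Sending it would be `urllib.request.urlopen(req, context=...)`; with the default self-signed certificate you would have to supply an SSL context that trusts the cluster CA. On a node itself, `pvesh create /nodes/<node>/stopall` reaches the same endpoint without a token.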
 