Clean shutdown of whole cluster

gurubert
Hi,

what is the recommended procedure to shut down a complete PVE cluster, including HA resources?

The manual only covers maintenance of single nodes, but sometimes it is necessary to shut down everything.

We have observed that simply shutting down all nodes is not sufficient: HA fencing reboots some nodes first, and then they are back online again.
 
Hello,

We have an enhancement request open in our Bugzilla [0]. There is also a thread in our forum about the same request [1]; maybe you will get more info from the people who shared their experiences there.


However, we recommend stopping all HA resources and the pve-ha-crm and pve-ha-lrm services running on the cluster, then waiting until all HA resources have stopped running. After that, you can issue a shutdown command on the CLI or from the PVE GUI.

[0] https://bugzilla.proxmox.com/show_bug.cgi?id=3839
[1] https://forum.proxmox.com/threads/shutdown-of-the-hyper-converged-cluster-ceph.68085/
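The stop sequence recommended above can be sketched as a small dry-run script. The node names are placeholders for your own hosts, and the commands are only printed here, not executed; the LRM-before-CRM ordering is an assumption about a sensible stop order:

```shell
# Dry-run sketch: print the commands that would stop the HA services on
# every node before a full cluster shutdown. Replace the placeholder node
# names and run the printed commands once the plan looks right.
plan_ha_stop() {
    # Stop the LRM on every node first, then the CRM, so no manager
    # reacts to the disappearing local resource managers.
    for node in "$@"; do
        echo "ssh root@$node systemctl stop pve-ha-lrm"
    done
    for node in "$@"; do
        echo "ssh root@$node systemctl stop pve-ha-crm"
    done
}

plan_ha_stop pve1 pve2 pve3
```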
 

Please expand on this. The Bugzilla enhancement has basically zero activity. The forum post linked is all over the place as far as recommendations go. The answer given here is pretty vague and doesn't cover bringing the cluster back online.

Please provide clear, concise, step-by-step instructions on how to both properly shut down and properly start up a hyper-converged Proxmox cluster running Ceph.
 
Thank you!

For others looking for this, here is what's currently waiting to get into the published docs:

Shutdown {pve} + Ceph HCI cluster
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


To shut down the whole {pve} + Ceph cluster, first stop all Ceph clients. These
will mainly be VMs and containers. If you have additional clients that might
access a Ceph FS or an installed RADOS GW, stop these as well.
Highly available guests will switch their state to 'stopped' when powered down
via the {pve} tooling.

Once all clients, VMs and containers are off or not accessing the Ceph cluster
anymore, verify that the Ceph cluster is in a healthy state. Either via the Web UI
or the CLI:

Code:
ceph -s
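If this check is scripted rather than done by eye, the health verification can be turned into a small polling gate. This is only a sketch: the query command defaults to `ceph health` but can be overridden (which also makes it testable), and the retry count and sleep interval are arbitrary choices:

```shell
# Sketch: poll until the cluster reports HEALTH_OK before proceeding.
# The health query command is overridable; retry/sleep values are arbitrary.
wait_for_health_ok() {
    health_cmd="${1:-ceph health}"
    tries=0
    while [ "$tries" -lt 30 ]; do
        # Succeed as soon as the health output contains HEALTH_OK.
        if $health_cmd 2>/dev/null | grep -q HEALTH_OK; then
            return 0
        fi
        tries=$((tries + 1))
        sleep 2
    done
    return 1   # gave up waiting
}
```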

To disable all self-healing actions, and to pause any client IO in the Ceph
cluster, enable the following OSD flags in the **Ceph -> OSD** panel or via the
CLI:

Code:
ceph osd set noout
ceph osd set norecover
ceph osd set norebalance
ceph osd set nobackfill
ceph osd set nodown
ceph osd set pause

Start powering down your nodes without a monitor (MON). After these nodes are
down, continue by shutting down nodes with monitors on them.

When powering on the cluster, start the nodes with monitors (MONs) first. Once
all nodes are up and running, confirm that all Ceph services are up and running
before you unset the OSD flags again:

Code:
ceph osd unset pause
ceph osd unset nodown
ceph osd unset nobackfill
ceph osd unset norebalance
ceph osd unset norecover
ceph osd unset noout

You can now start up the guests. Highly available guests will change their state
to 'started' when they power on.
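The set/unset commands above can also be driven from a single list so the two halves never drift apart; unsetting walks the list in reverse so that `pause` is lifted first, matching the order in the draft docs. This is a dry-run sketch: the commands are printed, not executed (and `tac` assumes GNU coreutils):

```shell
# One flag list drives both the set and unset phases, keeping them symmetric.
FLAGS="noout norecover norebalance nobackfill nodown pause"

set_flags_plan() {
    for f in $FLAGS; do echo "ceph osd set $f"; done
}

unset_flags_plan() {
    # Reverse the list so 'pause' is lifted first on startup.
    for f in $(printf '%s\n' $FLAGS | tac); do echo "ceph osd unset $f"; done
}

set_flags_plan
unset_flags_plan
```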
 
TYVM for this! I'm going on vacation and wanted to shut down my home cluster for the duration, as I won't be needing it.

I have always found only fragmented information about powering down a Ceph cluster for a prolonged period, and this finally seems to address it.

This leaves me with a question, though:
Is there a way to automate this? It's not really my problem yet, as I don't run Ceph at my customer's site "yet", but I do have a shutdown procedure tied to a NUT server for when the power goes down for more than 20 minutes.

To safely shut it down, I imagine the only way is to write a script?
This would mean, however, that a simple unattended power-up from remote would not work without then unsetting those flags, correct?
That would mean a customer without IT knowledge would be unable to simply "turn the server back on".
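A hypothetical dry-run of such an unattended shutdown sequence (e.g. triggered by a NUT event) could look like the following. Node names are placeholders, every action is only echoed, and the per-guest shutdown step is left as a comment because VMIDs are site-specific:

```shell
NODES="pve1 pve2 pve3"   # placeholder node names

plan_cluster_shutdown() {
    # 1. Stop the HA services so fencing cannot reboot nodes mid-sequence.
    for n in $NODES; do
        echo "ssh root@$n systemctl stop pve-ha-lrm pve-ha-crm"
    done
    # 2. Shut down all guests (the Ceph clients): one 'qm shutdown <vmid>'
    #    per VM and 'pct shutdown <ctid>' per container, then wait until
    #    they are all stopped. Omitted here: VMIDs are site-specific.
    # 3. Freeze Ceph with the OSD flags.
    for f in noout norecover norebalance nobackfill nodown pause; do
        echo "ceph osd set $f"
    done
    # 4. Power off the nodes. In a real script, list non-MON nodes before
    #    MON nodes here, matching the recommended ordering.
    for n in $NODES; do
        echo "ssh root@$n shutdown -h now"
    done
}

plan_cluster_shutdown
```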
 
This would mean, however, that a simple unattended power-up from remote would not work without then unsetting those flags, correct?
That would mean a customer without IT knowledge would be unable to simply "turn the server back on".
Unless you write a startup script/job that waits for all nodes in the cluster to be up for, say, 2 minutes (to prevent issues on a repeated power failure), unsets the flags, and then starts booting the VMs/containers.
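A rough sketch of such a startup job: consider the cluster stable once every node has answered pings for a full grace window, then unset the flags and start the guests. Node names, the ping-based reachability check, and the timings are all assumptions, and the actions are only printed:

```shell
NODES="pve1 pve2 pve3"   # placeholder node names
GRACE=120                # seconds all nodes must stay reachable

all_nodes_up() {
    # Crude reachability check; a real job might query the PVE API instead.
    for n in $NODES; do
        ping -c1 -W2 "$n" >/dev/null 2>&1 || return 1
    done
    return 0
}

startup_plan() {
    # A real job would loop until all_nodes_up has held for $GRACE seconds.
    # Then lift the OSD flags in reverse order and start the guests.
    for f in pause nodown nobackfill norebalance norecover noout; do
        echo "ceph osd unset $f"
    done
    echo "start guests (HA guests resume once pve-ha-lrm is active again)"
}

startup_plan
```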
 
Amateur question: for my home setup (3 nodes, ZFS, replication, HA) I need something quick for when the power goes out, as I only have a few minutes of UPS runtime. Could I not just run `pvecm expected x` with `x` being an absurdly high number, and then shut down the nodes from the Proxmox UI? That should prevent any failover attempts during shutdown and later when I bring them back up. Once all nodes are back up, I would restore the normal quorum value.

Does that track?
 
Could I not just `pvecm expected x` with `x` being an absurdly high number,

Does that track?
No. You can't set it higher than actually possible:
Code:
~# pvecm expected 99
Unable to set expected votes: CS_ERR_INVALID_PARAM

Not tested: I would just shut down all nodes at the same time. That should not trigger HA-migrations...
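The "all at once" idea could look like the following dry-run: background the SSH calls so every node receives the shutdown near-simultaneously. As stated above this is untested; node names are placeholders and the commands are only echoed:

```shell
# Dry-run: fan out the shutdown command to all nodes in parallel.
parallel_shutdown_plan() {
    for n in "$@"; do
        echo "ssh root@$n shutdown -h now" &   # backgrounded per node
    done
    wait   # collect all background jobs
}

parallel_shutdown_plan pve1 pve2 pve3
```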
 
Not tested: I would just shut down all nodes at the same time. That should not trigger HA-migrations...
Hmm... That makes all kinds of assumptions, and just sets one up for issues on startup in case the systems do not all take roughly the same time to boot (a stray fsck, some hardware whose driver takes extra time on boot, ...)

There really should be something along the lines of `ha pause/disable` and `ha resume/enable`. Or better yet, a "cluster disable & shutdown" mode where, on restart, the UI requires a click to re-enable.

Obviously that doesn't make much sense for large installs. But I bet home/shop/lab clusters with a small number of nodes number in the many thousands.
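Until such a switch exists, something close to an `ha pause` could arguably be approximated per resource: flip every HA resource to the 'ignored' request state before the shutdown and back to 'started' afterwards. This is a dry-run sketch; the resource IDs are placeholders, and the behaviour of `ha-manager set ... --state ignored` should be verified against your PVE version:

```shell
RESOURCES="vm:100 vm:101 ct:200"   # placeholder HA resource IDs

ha_pause_plan() {
    # 'ignored' asks the HA stack to leave the resource alone.
    for sid in $RESOURCES; do echo "ha-manager set $sid --state ignored"; done
}

ha_resume_plan() {
    # Hand the resources back to the HA stack after startup.
    for sid in $RESOURCES; do echo "ha-manager set $sid --state started"; done
}

ha_pause_plan
ha_resume_plan
```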
 
Hmm... That makes all kinds of assumptions.
Well, the documentation tells me in https://pve.proxmox.com/pve-docs/chapter-ha-manager.html#ha_manager_shutdown_policy :

Conditional

The Conditional shutdown policy automatically detects if a shutdown or a reboot is requested, and changes behaviour accordingly.

For Policy=Migrate it says
Once the Local Resource Manager (LRM) gets a shutdown request and this policy is enabled, it will mark itself as unavailable...

I believe the PVE devs have implemented it in a sane way - they are professionals :-)

The actual behavior for a given cluster may be hard to test because timing is relevant - it needs a minute (or two) to actually trigger HA activity.
 
The actual behavior for a given cluster may be hard to test because timing is relevant - it needs a minute (or two) to actually trigger HA activity.

Just for curiosity I tested some things to disturb a cluster - or at least an HA resource. This is a long post; read it for entertainment or skip it. At the end I can reproduce a problematic situation in my test setup...

I have:
  • a purely virtual test cluster of six nodes (pna...pnf), backed by a really slow proof-of-concept hyper-converged Ceph
  • two nodes (pnd/pne) form an HA group "HATEST" with one VM (antix) up (on pnd) and running "ping" in a terminal - for visual monitoring :-)
(At first I had three nodes in that HA group. But the ultimate test is to shut down all relevant nodes, and with three of them I would lose PVE quorum, and additionally Ceph (configured for size=4/min_size=2) would be really unhappy. So I stepped back to only two nodes for HA.)

Now I can test:

1) manual migration between those HA-nodes
  • works as advertised, without any hiccup
2) shutdown node (pnd) with that test-VM running
  • Guest VM got shut down = NOT migrated (because Policy=Conditional) and then restarted on another node (pne)
That's of course not what I wanted. Only now did I change Policy=Conditional --> Policy=Migrate :-)

3) for curiosity: manually migrate from pnd to pnf = a Node outside that defined HA-group
  • migration works!
  • but then the VM migrates back from that non-HA-node to pnd after some seconds
4) shutdown node (pnd) with the test-VM running
  • this triggers Migration (pnd-->pne)
  • guest stays "up"; no hiccup
Powering up again that test-node migrates the VM back to its original node (pnd)

All above tests were in normal working parameters and worked as documented. I did them to confirm this for my test-setup.



Now for the more interesting tests; the baseline is:
  • all six nodes are up and running
  • the test-VM is on pnd, up and running of course, with visible "ping"

5) shutdown all nodes in that HA-group all at once - triggered via GUI with three seconds delay for clicking around; the node with the test-VM (pnd) first
  • surprise: the VM migrated out of the defined group onto node pna. No hiccup, but unexpected :-)
That's documented behavior..., now I set the flag "restricted" in my HATEST-group. :-)

The VM migrated back to the origin automatically, as expected.

Back to normal, all Nodes up; VM on pnd


6) shutdown all Nodes in that HA-group all at once - triggered via GUI with three seconds delay; the node with the test-VM (pnd) first
  • migration did NOT start
  • shutdown of the secondary node (pne) finished
  • the guest keeps running on the original node!
  • shutdown of pnd obviously got cancelled; I waited five minutes
This situation is NOT clean:
  • the VM still is running on pnd
  • the node pnd is "grayed out" (pvestatd?)
  • but I can click "Shutdown" from the WebGUI of another node --> no reaction
  • ssh works
  • "qm list" confirms VM is here
Troubleshooting:
Code:
root@pnd:~# shutdown -h now 
Failed to set wall message, ignoring: Transport endpoint is not connected
Call to PowerOff failed: Transport endpoint is not connected

root@pnd:~# systemctl start  pvestatd
Failed to start pvestatd.service: Transaction for pvestatd.service/start is destructive (systemd-binfmt.service has 'stop' job queued, but 'start' is included in transaction).
See system logs and 'systemctl status pvestatd.service' for details.

root@pnd:~# systemctl start pvescheduler.service 
Failed to start pvescheduler.service: Transaction for pvescheduler.service/start is destructive (dev-dm\x2d2.swap has 'stop' job queued, but 'start' is included in transaction).
See system logs and 'systemctl status pvescheduler.service' for details.

While I know how to force shutdown of that system... what would be a good way to continue???
Code:
root@pnd:~# qm shutdown 29101
Requesting HA stop for VM 29101  # DID WORK

And now the node shutdown too!

Shutting down both nodes of this two-node-HA-group did not behave well.

Power-on both nodes. The HA Request State is now "stopped", which feels just wrong. Probably because of my manual "qm shutdown".

Back to normal: all six nodes up; test-VM visible with "ping" on pnd.


7) same as 5+6 but in reverse order: first shutdown the unused HA-node pne, then pnd with the test-VM; again with ~three seconds time in between
  • pne shuts down quickly
  • same as in 6) : "qm shutdown" works and the node shut down also.
Now I possibly found the culprit: no agent in the test-VM! ACPI alone seems not to be sufficient for my goal? NOW installing qemu-guest-agent!

Back to normal: all six nodes up; test-VM visible with "ping" on pnd. Now WITH guest-agent...


8) same as 7 = shutdown pne + pnd --> seems like the guest-agent does not change anything
  • the secondary HA-node pne shut down
  • the node with the VM stays up and the VM keeps running
  • in the WebGui both nodes are shown down after some minutes; both not manageable, same as before; VM still running!
  • again: ssh --> "qm shutdown" required to shutdown VM and then the node follows automatically
Back to normal: all six nodes up; test-VM visible with "ping" on pnd.


9) slow down: just shutdown the currently-not-used pne. Then shutdown the only left node in the HATEST-group, running the test-VM
  • shutdown pne works - of course!
  • shutdown pnd does NOT work, it is plainly ignored!?
  • after (exactly) five minutes it reached again that strange "grayed out" (not: red cross), "not manageable, but VM keeps running" state.
  • without ssh I could do nothing
  • the VM is still running... for 30(!) minutes. Then it shuts down!
  • after this long timeout: missing services are restarted; the WebGUI is responsive again; the VM got restarted
I have no good explanation for the 30-minute knock-out. It feels surprising, and the behaviour is not what I want. For an emergency shutdown (UPS) this is probably bad.



Reproducible result of this test sequence as of now:

When I trigger "shutdown" on the last standing node of an HA group with one VM running, I encounter a problematic state: after five minutes that node is no longer manageable in the WebGUI, but the VM keeps running on it. After 30 minutes the system is manageable again.

Is this the expected behaviour? I am not sure...

PS: all nodes have two virtual OSDs for Ceph. I ignored that on purpose as "size=4/min_size=2" should keep critical trouble away.
 
