I've been struggling with this for a week or more. I have a 2-node Proxmox cluster, plus a qdevice. It works great, with no quorum issues or anything; however, I can't get the cluster to shut down correctly when the power goes out.
One cluster node is the NUT "master" and has the UPS attached via USB. The other node is a NUT slave. The qdevice is a Raspberry Pi on a different UPS (for the purposes of this question, you can assume the power on the Raspberry Pi never goes out).
Side note: under "HA Settings," I have "shutdown_policy" set to "freeze". This is the setting that suits my situation best. When I shut down a node, no guests should automatically migrate. When the node is started up again, the same HA guests that were running before should start again on that node.
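For reference, I believe this GUI setting corresponds to the following line in /etc/pve/datacenter.cfg. I set it through the GUI, so treat the exact file contents as my assumption rather than a copy of my config:

```
# /etc/pve/datacenter.cfg (cluster-wide options)
ha: shutdown_policy=freeze
```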
When the power goes out (or I pull the plug from the wall), the UPS tells the master node and the slave node that it's on battery. A bit later, if the UPS hasn't gone back to mains power, the slave and master decide to shut down. "shutdown now -h" on the slave shuts down all the "regular" (non-HA) guests, then freezes the HA guests, and finally shuts down the node. This NUT slave node goes down perfectly.
Now, at this point, the remaining node still has quorum (the qdevice is there and working). However, when it goes to shut down ("shutdown now -h"), it loses quorum before it's able to freeze the HA guests, because corosync-qdevice.service is no longer running. Eventually (I'd say about a minute later), the node restarts instead of shutting down, without having shut down the HA guests. This is terrible for two reasons: it isn't cleanly shutting down the HA guests, and it isn't turning off the UPS (that is the final step of the NUT master shutdown sequence). The node ends up in a weird boot loop while the UPS is still on battery.
It feels like this could work if corosync-qdevice.service were somehow kept alive during node shutdown, but I'm just speculating at this point and I really need expert input on how this is supposed to work. Surely, Proxmox cluster shutdown on power failure is a solved problem. Is there a way to make the corosync-qdevice service stick around longer?
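To make the idea concrete, this is the sort of systemd drop-in I have in mind. It's an untested sketch, not something I know to work: I'm assuming the HA freeze is handled by pve-ha-lrm.service, and the drop-in file name is just one I made up. As far as I understand, systemd stops units in the reverse of their start order, so ordering the LRM after corosync-qdevice should mean the LRM is stopped first (while the qdevice is still running) at shutdown.

```ini
# /etc/systemd/system/pve-ha-lrm.service.d/qdevice-order.conf
# (hypothetical file name; assumes pve-ha-lrm.service is the unit that freezes HA guests)
[Unit]
# Ordering only, no hard dependency: start after corosync-qdevice,
# which means systemd stops this unit before corosync-qdevice at shutdown.
After=corosync-qdevice.service
```

It would need a "systemctl daemon-reload" afterwards to take effect. I don't know whether this is the right place to intervene, or whether other units would need the same ordering.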
Just for more information: if I manually stop all the guests on both nodes (as in: use the GUI "bulk stop" feature) and then pull the plug, everything works as expected, except that the HA guests aren't frozen; they're stopped (bulk stop), so they need to be started manually after power is restored. In that case the master node never needs quorum, because no HA guests need to be shut down.
Thanks for any and all help/input!