automated checks for things that make fencing and HA migration fail

Aug 6, 2014
Got fencing and HA migration working, but there are some tricky parts. I'd like to write an Ansible playbook to check for them. I've noticed a couple of things to check for so far.

In one case rgmanager was not running on a node for some reason; I restarted it from the GUI. While it was down, HA migration to that node failed with an error saying the node does not exist. What's cool is that when I tested this with clusvcadm -r, the guest ended up on another node instead of the migration failing. I don't know if that was a bug or a feature, but I like that behavior. The more I think about it, the more it seems like a feature.

In another case, one node showed red when I logged into the web front end on either of the other two, and the other two showed red when I logged in on that one. service pve-cluster restart fixed it. How do you check for that condition in the shell or from a script?
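For the service side of this, a minimal sketch of what I have in mind is below. It assumes the init scripts for rgmanager and pve-cluster exit non-zero on "status" when the daemon is not running, which I have not verified on every version. check_services and STATUS_CMD are names made up for illustration; STATUS_CMD only exists so the function can be tried without a real cluster. It would not necessarily catch the red-node case, since the daemon can be running and still be out of sync.

```shell
#!/bin/sh
# Sketch only: report whether each named service's "status" call exits 0.
# STATUS_CMD is a hypothetical override (defaults to the "service" wrapper).
STATUS_CMD="${STATUS_CMD:-service}"

check_services() {
    failed=0
    for svc in "$@"; do
        # Assumption: exit code 0 means the daemon is running.
        if $STATUS_CMD "$svc" status >/dev/null 2>&1; then
            echo "OK $svc"
        else
            echo "FAIL $svc"
            failed=1
        fi
    done
    # Non-zero return if any service failed its status check.
    return "$failed"
}
```

On a node this would be called as check_services rgmanager pve-cluster, and the same status calls could be wrapped in Ansible command tasks. The output of pvecm status could be looked at alongside it for the cluster view.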

Is there anything else to check for, and how would those checks be done?

Do failover domains make it more robust? When I finally got it working, it was without one. The docs I tracked down said there is already a default failover domain, which is why I haven't tried defining one yet.
 
