Expected behavior from watchdog-mux during a network outage? (HA, Corosync, and Softdog fencing)

hackinthebox

New Member
Sep 13, 2020
What’s the expected behavior here?

I have a 3-node cluster with a dedicated physical corosync network, plus a second, faster network for storage and guest traffic. Corosync is configured to fail over to the fast network if its own link is interrupted.
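
For context, the failover is just a second link per node in /etc/pve/corosync.conf — roughly this shape (addresses and priorities here are examples, not my literal config):

```
nodelist {
  node {
    name: node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1   # dedicated corosync network
    ring1_addr: 10.20.20.1   # fast network, used as fallback link
  }
  # node2 / node3 follow the same pattern
}

totem {
  interface {
    linknumber: 0
    knet_link_priority: 2    # prefer the dedicated link
  }
  interface {
    linknumber: 1
    knet_link_priority: 1    # fall back to the fast network
  }
}
```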

High availability is configured on the guests, which live on shared storage. The fast network is set as the migration network, and the HA shutdown policy is set to ‘migrate’.
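
The relevant datacenter-wide settings look roughly like this (the subnet is an example):

```
# /etc/pve/datacenter.cfg
ha: shutdown_policy=migrate
migration: secure,network=10.20.20.0/24   # fast network as migration network
```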

The softdog kernel module works out of the box with the default settings. I tested this by killing the watchdog-mux process: the kernel immediately started logging complaints about /dev/watchdog0 no longer being fed, followed by the expected migration and node reboot.
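
For anyone wanting to reproduce that test (careful: this hard-reboots the node, so only do it on a box you can afford to lose), it was roughly:

```
# watch the kernel log from another terminal or the console
dmesg -w

# stop feeding /dev/watchdog0; softdog fires once its timeout expires
pkill -9 watchdog-mux
```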

However... if, for example, I disconnect the fast network interface (by pulling the cable or disabling the switch port)... nothing happens. Zero. Except “no route to host” errors and the inability to interact with the node or its guests. Because the corosync network is still intact, all the nodes and guests look green and happy in the web GUI, even though they’re gone.

Is this expected behavior??

Even though the node itself isn’t necessarily the cause of the network issue, I would still expect services to migrate to an available node. The corosync network is still intact, and with shared storage at least a cold “relocation” seems reasonable.

It seems illogical for the guests to just sit there in a stale state; recovering them shouldn’t require manual intervention. That makes me think I’m missing something.

If this IS expected behavior, that’s OK. I’ll create a systemd service to monitor IPs on the interface, then “automatically manually” migrate guests to an available node through any means necessary with a robust script (sketched below). I’m aware of the “watchdog” package in the repository that allows more granular configuration... but a) I suspect that would conflict with the existing watchdog-mux, and b) nothing gets more granular for exactly what I want than scripts written exclusively for it.
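
Something like this is what I have in mind — an untested sketch with hypothetical names; the gateway IP, target node, and the parsing of the ha-manager status output would all need adjusting:

```
#!/bin/bash
# /usr/local/bin/fastnet-monitor.sh (hypothetical path) -- run from a
# systemd service/timer pair every ~30 s; evacuates this node's HA guests
# if the fast network stops responding.
GATEWAY=10.20.20.254   # reference IP on the fast network (example)
TARGET=node2           # evacuation target (example; could be picked dynamically)

if ! ping -c 3 -W 2 "$GATEWAY" >/dev/null 2>&1; then
    logger -t fastnet-monitor "fast network unreachable, evacuating HA services"
    # List HA services currently on this node and ask HA to move them.
    # The awk is a guess at the 'ha-manager status' line format
    # ("service vm:100 (node1, started)") -- verify before trusting it.
    ha-manager status \
      | awk -v n="$(hostname)" '$1 == "service" && $3 == "(" n "," {print $2}' \
      | while read -r sid; do
            ha-manager migrate "$sid" "$TARGET"
        done
fi
```

One wrinkle I’d have to handle: the fast network is also my migration network, so a live migration over it will fail while it’s down. I’d either have to repoint the migration network at the corosync link first or fall back to a stop/start relocation.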

Thanks for any feedback about this issue before I start reinventing things that may already exist!
 
Yes, this is expected behaviour. Fencing will only kick in if a node loses quorum.
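
You can check the quorum state on each node with, e.g.:

```
pvecm status              # cluster/quorum overview
corosync-quorumtool -s    # raw corosync view of the same
```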
 
