detect failover of VMs and containers

hr556

Member
Jan 29, 2021
What is the best way to detect a cluster failover, meaning that my replicated VMs get started on another node? In /var/log/syslog I found the following messages that may be relevant, but I don't know which one to look for:

Bash:
May 27 17:10:22 bohr corosync[2084]:   [MAIN  ] Completed service synchronization, ready to provide service

May 27 17:13:22 bohr pve-ha-lrm[2416]: successfully acquired lock 'ha_agent_bohr_lock'
May 27 17:13:22 bohr pve-ha-lrm[2416]: watchdog active
May 27 17:13:22 bohr pve-ha-lrm[2416]: status change wait_for_agent_lock => active
 
Hi,
the failover is handled by pve-ha-crm. The log on the node that was HA master at the time (I can't tell you which one it is) will contain log messages about the failover/recovery operation. Why do you want to detect a single failover, i.e. what is the actual issue you're trying to solve?

Note also that there is a mail notification when a node gets fenced (but not for every single guest failover).
 
Thanks for the info! The aim is to get notified when VMs/containers get started on other nodes due to failover, for monitoring reasons. No issues so far, just gaining insight into the PVE cluster.

What I see on the former active node "einstein":
Bash:
May 27 17:59:25 einstein pve-ha-crm[3437]: starting server
May 27 17:59:25 einstein pve-ha-crm[3437]: status change startup => wait_for_quorum
May 27 18:00:00 einstein pve-ha-crm[3437]: status change wait_for_quorum => slave
May 27 18:04:15 einstein pve-ha-crm[3437]: received signal TERM
May 27 18:04:15 einstein pve-ha-crm[3437]: server received shutdown request
May 27 18:04:16 einstein pve-ha-crm[3437]: server stopped
May 27 18:04:17 einstein systemd[1]: pve-ha-crm.service: Succeeded.
May 27 18:04:51 einstein pve-ha-crm[2907]: starting server
May 27 18:04:51 einstein pve-ha-crm[2907]: status change startup => wait_for_quorum
May 27 18:05:36 einstein pve-ha-crm[2907]: status change wait_for_quorum => slave

At the same time, on the node "bohr", which became active:
Bash:
May 27 18:36:57 bohr pve-ha-crm[3694]: starting server
May 27 18:36:57 bohr pve-ha-crm[3694]: status change startup => wait_for_quorum
May 27 18:38:37 bohr pve-ha-crm[3694]: status change wait_for_quorum => slave

So it seems that "pve-ha-crm" only tells me when cluster nodes start or stop, not when VMs become active due to replication or node failure. Otherwise there would be some event around "May 27 17:10..."
 
Maybe you can just monitor the HA status API endpoint /cluster/ha/status/current? There you see the LRM status and service status and node, so you can e.g. correlate if a service changes nodes and the LRM of the previous node is not active anymore.
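
As a rough illustration of that idea, here is a minimal Python sketch that compares two snapshots of the HA status list and reports services whose node changed. It assumes each entry is a dict with the keys "type", "sid" and "node", which is how I understand the /cluster/ha/status/current response to look; please verify against your own cluster. The function names and the sample data are made up for illustration, and fetching the snapshots (e.g. with pvesh or an HTTP client) as well as the LRM correlation are left out.

```python
from typing import Dict, Iterable, List, Tuple


def service_nodes(entries: Iterable[dict]) -> Dict[str, str]:
    """Map service ID -> node for all entries of type 'service'.

    Assumption: entries carry the keys 'type', 'sid' and 'node'.
    Entries missing any of these are ignored.
    """
    return {
        e["sid"]: e["node"]
        for e in entries
        if e.get("type") == "service" and "sid" in e and "node" in e
    }


def detect_moves(previous: Iterable[dict],
                 current: Iterable[dict]) -> List[Tuple[str, str, str]]:
    """Return (sid, old_node, new_node) for services present in both
    snapshots whose node differs between them."""
    before = service_nodes(previous)
    after = service_nodes(current)
    return [
        (sid, before[sid], node)
        for sid, node in sorted(after.items())
        if sid in before and before[sid] != node
    ]


if __name__ == "__main__":
    # Illustrative sample data only, not real API output.
    previous = [{"type": "service", "sid": "ct:100002", "node": "einstein"}]
    current = [{"type": "service", "sid": "ct:100002", "node": "bohr"}]
    for sid, old, new in detect_moves(previous, current):
        print(f"{sid} moved from {old} to {new}")
```

A node change alone can also be a manual migration, so a monitoring check would probably also want to look at the LRM entry of the previous node, as suggested above.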

Neither of these nodes was the master at the time; both were slaves. Please also check your other nodes.
 
Got it, many thanks! The third cluster node "planck" became master:

Bash:
May 27 17:12:07 planck pve-ha-crm[1424]: successfully acquired lock 'ha_manager_lock'
May 27 17:12:07 planck pve-ha-crm[1424]: watchdog active
May 27 17:12:07 planck pve-ha-crm[1424]: status change slave => master
May 27 17:12:07 planck pve-ha-crm[1424]: node 'einstein': state changed from 'online' => 'unknown'
May 27 17:13:07 planck pve-ha-crm[1424]: service 'ct:100002': state changed from 'started' to 'fence'
May 27 17:13:07 planck pve-ha-crm[1424]: node 'einstein': state changed from 'unknown' => 'fence'
May 27 17:13:17 planck pve-ha-crm[1424]: successfully acquired lock 'ha_agent_einstein_lock'
May 27 17:13:17 planck pve-ha-crm[1424]: fencing: acknowledged - got agent lock for node 'einstein'
May 27 17:13:17 planck pve-ha-crm[1424]: node 'einstein': state changed from 'fence' => 'unknown'
May 27 17:13:17 planck pve-ha-crm[1424]: service 'ct:100002': state changed from 'fence' to 'recovery'
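
For anyone who wants to alert on these messages, here is a minimal Python sketch that picks out the service state changes to 'fence' or 'recovery' from syslog lines. The regular expression is derived only from the log lines shown above, so it is an assumption that other PVE versions use the same wording; the function name failover_events is made up for illustration.

```python
import re
from typing import Iterable, List, Tuple

# Matches pve-ha-crm service lines as seen in the excerpt above, e.g.
# "... pve-ha-crm[1424]: service 'ct:100002': state changed from 'started' to 'fence'"
STATE_CHANGE = re.compile(
    r"pve-ha-crm\[\d+\]: service '(?P<sid>[^']+)': "
    r"state changed from '(?P<old>[^']+)' to '(?P<new>[^']+)'"
)


def failover_events(lines: Iterable[str],
                    states: Tuple[str, ...] = ("fence", "recovery"),
                    ) -> List[Tuple[str, str, str]]:
    """Return (sid, old_state, new_state) for every service state change
    whose new state is one of `states`."""
    events = []
    for line in lines:
        match = STATE_CHANGE.search(line)
        if match and match.group("new") in states:
            events.append(
                (match.group("sid"), match.group("old"), match.group("new"))
            )
    return events


if __name__ == "__main__":
    sample = [
        "May 27 17:13:07 planck pve-ha-crm[1424]: service 'ct:100002': "
        "state changed from 'started' to 'fence'",
        "May 27 17:13:07 planck pve-ha-crm[1424]: node 'einstein': "
        "state changed from 'unknown' => 'fence'",
    ]
    for sid, old, new in failover_events(sample):
        print(f"{sid}: {old} -> {new}")
```

Since only the current HA master logs these lines, the check has to run against the logs of every node (or a central log collector), not just one.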
 
